Downloading and renaming files with a Scrapy crawler

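Downloading and renaming files with Scrapy (file download and renaming in Python)
Target: download the subtitle files listed at http://www.zimuku.cn/search?q=&t=onlyst&p=1
Approach: Scrapy's built-in file download component (FilesPipeline)
Extension: downloading images works on the same principle
Code: see below
(1) Spider module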

# coding:utf-8

import scrapy
from w3lib.html import remove_tags
from subtitle_crawler.items import SubCrawlerItem

class SubSpider(scrapy.Spider):
    name = "sub"
    allowed_domains = []
    start_urls = [
            "http://www.zimuku.cn/search?q=&t=onlyst&p=%s" %i for i in range(1,21)
    ]

    def parse(self, response):
        hrefs = response.selector.xpath('//div[contains(@class, "persub")]/h1/a/@href').extract()
        for href in hrefs:
            url = response.urljoin(href)
            # print "processing1: ", url
            yield scrapy.Request(url, callback=self.parse_detail)

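    def parse_detail(self, response):
        # Take the first download link on the detail page; skip pages without one.
        hrefs = response.selector.xpath('//li[contains(@class, "dlsub")]/div/a/@href').extract()
        if not hrefs:
            return
        url = response.urljoin(hrefs[0])
        self.logger.info("processing2: %s", url)
        # The class name must match the one imported from items.py.
        item = SubCrawlerItem()
        item['file_url'] = [url]
        yield item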
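
The start_urls list comprehension in the spider expands to one search-result URL per page. A minimal standalone sketch of that expansion follows; the helper name build_start_urls and its parameters are hypothetical and only illustrate what the spider's list contains.

```python
# Hypothetical helper mirroring the list comprehension used for start_urls.
BASE = "http://www.zimuku.cn/search?q=&t=onlyst&p=%s"


def build_start_urls(first_page=1, last_page=20):
    """Return one search URL per page, from first_page to last_page inclusive."""
    return [BASE % page for page in range(first_page, last_page + 1)]


start_urls = build_start_urls()
print(len(start_urls))   # number of result pages the spider will request
print(start_urls[0])     # first page
print(start_urls[-1])    # last page
```

Because range(1, 21) stops before 21, the spider requests pages 1 through 20.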

(2) Item module

import scrapy
class SubCrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    file_url = scrapy.Field()

(3) Pipeline module: subclass FilesPipeline to rename the downloaded files

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
import os
# from scrapy.pipelines.images import ImagesPipeline
from scrapy.pipelines.files import FilesPipeline
from scrapy.exceptions import DropItem

class MyfilesPipeline(FilesPipeline):

    def get_media_requests(self, item,info):
        for url in item["file_url"]:
            yield scrapy.Request(url)

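    def file_path(self, request, response=None, info=None, item=None):
        """
        Build the stored file name from the request URL.

        The returned path is relative to FILES_STORE, so the download
        directory is configured only in settings.py.
        """
        name = request.url.replace('//', '_').replace('/', '_').replace(':', '_').replace('.', '_').replace('__', '_')
        return name + '.zip'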
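
To see what name the replace chain in file_path produces, the same transformation can be tried outside Scrapy. This is a minimal sketch: the function name url_to_filename and the sample download URL are hypothetical, chosen only for illustration.

```python
def url_to_filename(url, extension=".zip"):
    """Flatten a URL into a single file name, using the same replacements
    (in the same order) as the pipeline's file_path method."""
    name = url
    for old in ("//", "/", ":", "."):
        name = name.replace(old, "_")
    # "http://" becomes "http__" after the steps above; collapse it to "http_".
    name = name.replace("__", "_")
    return name + extension


# Hypothetical download URL, used only to show the transformation.
print(url_to_filename("http://www.zimuku.cn/download/12345"))
```

Note that this chain leaves characters such as ? and = untouched, so URLs with a query string may need additional replacements before the name is valid on Windows.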

(4) Settings module: add FILES_STORE and modify ITEM_PIPELINES
FILES_STORE is the directory where downloaded files are saved.

FILES_STORE = 'D:\\result'
ITEM_PIPELINES = {
    # 'subtitle_crawler.pipelines.SubtitleCrawlerPipeline': 300,
    'subtitle_crawler.pipelines.MyfilesPipeline': 400,
    # 'scrapy.pipelines.files.FilesPipeline': 1
}
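
FILES_STORE acts as the base directory, and the path returned by the pipeline's file_path method is resolved against it. The sketch below assumes the file store joins the two roughly the way os.path.join does on Windows (ntpath is used so the result is the same on any platform). The function final_location and the sample file name are hypothetical.

```python
import ntpath  # Windows path semantics, available on every platform

FILES_STORE = "D:\\result"


def final_location(store, returned_path):
    """Approximate where a file ends up: the store directory joined with
    whatever file_path returned (an assumption about the join behavior)."""
    return ntpath.join(store, returned_path)


# A relative name is placed under FILES_STORE.
relative = final_location(FILES_STORE, "http_www_zimuku_cn_download_12345.zip")
# An absolute path that already starts with the store resolves to itself.
absolute = final_location(FILES_STORE, "D:\\result\\http_www_zimuku_cn_download_12345.zip")

print(relative)
print(relative == absolute)
```

Under that assumption both forms land in the same place, but returning a relative name keeps the directory defined only in the settings.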

If anything is unclear, please leave a comment.
Official documentation: http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/images.html

This text may be freely shared or copied, provided the URL of this article is kept as a reference.

Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
