Scrapy crawler: storing data with an item pipeline

Introduction
In the previous posts we always passed the "-o **.json" option to write the extracted items to a JSON file. Without that option the extracted data is not written anywhere. In fact, once an Item has been collected by a Spider it is passed to the Item Pipeline, whose components process the Item in a defined order. When you create a project, Scrapy generates a default pipelines.py, for example:

vim pipelines.py
class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item

However, because we have not defined anything specific in it, running the crawler produces no output.
Below we define pipelines so that the extracted items are written to a JSON file and to a MongoDB database. This post reuses the crawler from the earlier post on using a Scrapy CrawlSpider to crawl Douban same-city events for the coming week, so only the item and the item pipeline need to be updated.
## Output to a JSON file

1. Define the item

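vim items.py
import scrapy
# Join is provided by itemloaders.processors in current Scrapy releases
# (older releases exposed it as scrapy.loader.processors).
from itemloaders.processors import Join

def filter_string(x):
    # keep the text after the first colon
    parts = x.split(':')
    return parts[1].strip()

class tongcheng(scrapy.Item):
    # event title
    title = scrapy.Field()
    # event time
    time = scrapy.Field()
    # address
    address = scrapy.Field(output_processor=Join())
    # fee
    money = scrapy.Field()
    # number of people interested
    intrest = scrapy.Field()
    # number of people attending
    join = scrapy.Field()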
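
The `filter_string` helper above is not attached to any field in the code shown, so how it is meant to be used is an assumption. As a rough sketch, it keeps the text after the first colon of a "label: value" string. The sample strings below are invented for illustration. Splitting on every colon truncates values that themselves contain a colon, such as a clock time; a variant that splits only once (called `filter_string_safe` here, a hypothetical name) avoids that.

```python
def filter_string(x):
    # Same logic as the helper in items.py: keep the part after the first colon.
    parts = x.split(':')
    return parts[1].strip()


def filter_string_safe(x):
    # Hypothetical variant: split only once so colons inside the value survive.
    return x.split(':', 1)[1].strip()


# Invented sample strings, not taken from the crawled pages.
fee = filter_string('Fee: 263')
time_truncated = filter_string('Time: 19:30')
time_full = filter_string_safe('Time: 19:30')

print(fee)             # 263
print(time_truncated)  # 19
print(time_full)       # 19:30
```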

2. Define the item pipeline

vim pipelines.py
# export as JSON
from scrapy.exporters import JsonItemExporter
# export as JSON Lines
#from scrapy.exporters import JsonLinesItemExporter
# export as CSV
#from scrapy.exporters import CsvItemExporter
class tongcheng_pipeline_json(object):
    def open_spider(self, spider):
        # Optional method, called when the spider is opened.
        # Write the output to tongcheng_pipeline.json.
        self.file = open('tongcheng_pipeline.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()
    def close_spider(self, spider):
        # Optional method, called when the spider is closed.
        self.exporter.finish_exporting()
        self.file.close()
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
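
A pipeline is an ordinary Python class, so the same open/process/close lifecycle can be sketched without Scrapy's exporters. The following is an illustrative variant, not the method used in this post: the class name `JsonLinesFilePipeline` and the default output path are hypothetical, and it writes one JSON object per line using only the standard library. The methods are called by hand here, whereas in a real project Scrapy would call them.

```python
import json
import os
import tempfile


class JsonLinesFilePipeline:
    """Hypothetical pipeline that writes one JSON object per line."""

    def __init__(self, path='items.jl'):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) also works for scrapy.Item objects, which are dict-like.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item


# Drive the lifecycle manually with invented items; spider is unused here.
out_path = os.path.join(tempfile.mkdtemp(), 'items.jl')
pipeline = JsonLinesFilePipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({'title': ['Example event'], 'join': ['69']}, spider=None)
pipeline.process_item({'title': ['Another event'], 'join': ['4']}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding='utf-8') as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # 2
```

Writing one object per line means a partially written file is still readable line by line, which is the main reason to prefer this layout for long crawls.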

3. Activate the item pipeline. A pipeline only takes effect after it has been enabled in the configuration file, so we need to edit settings.py.

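vim settings.py
ITEM_PIPELINES = {
    # default pipeline created with the project; commented out because it is not used
    #'douban.pipelines.DoubanPipeline': 300,
    # enable the JSON pipeline defined above
    'douban.pipelines.tongcheng_pipeline_json': 300,
}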

4. Start the crawler

scrapy crawl tongcheng
# the startup log lists the enabled pipeline
2018-01-20 10:48:10 [scrapy.middleware] INFO: Enabled item pipelines:
['douban.pipelines.tongcheng_pipeline_json']
....

# view the tongcheng_pipeline.json file
cat tongcheng_pipeline.json
[{"money": ["263 "], "address": "                           ", "join": ["69 "], "intrest": ["174 "], "title": ["       《         》   "]},{"money": ["93 - 281 "], "address": "                                 ", "join": ["4 "], "intrest": ["11 "],"title": ["2018          · ·      《  ·  》-  "]}.....]

As shown above, the crawler invoked the pipeline listed in the configuration file and wrote the extracted items to the tongcheng_pipeline.json file.
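
In this output each scraped value is still a raw list of strings, such as `["69 "]`. A small cleaning step in `process_item` could normalize such values before they are exported. The sketch below is an assumption about what one might want, not part of the original project: `first_int` and `CleanCountsPipeline` are hypothetical names, and the rule (take the first number found in the first string) is only one possible choice.

```python
import re


def first_int(values):
    # Return the first integer found in the first string of the list, or None.
    if not values:
        return None
    match = re.search(r'\d+', values[0])
    return int(match.group()) if match else None


class CleanCountsPipeline:
    """Hypothetical pipeline that turns count fields into integers."""

    count_fields = ('intrest', 'join')  # field names as spelled in items.py

    def process_item(self, item, spider):
        for field in self.count_fields:
            if field in item:
                item[field] = first_int(item[field])
        return item


# Values shaped like the JSON output of the crawl.
item = {'join': ['69 '], 'intrest': ['174 '], 'money': ['263 ']}
cleaned = CleanCountsPipeline().process_item(item, spider=None)
print(cleaned)  # {'join': 69, 'intrest': 174, 'money': ['263 ']}
```

Registered in ITEM_PIPELINES with a lower number than the exporting pipeline, a step like this would run before the item is written.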
#### Notes

1. Pipelines configured in settings.py are invoked by default, in priority order, by every crawler in the project. For example:

ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
    'douban.pipelines.movieTop250_crawlspider_json': 200,
    'douban.pipelines.tongcheng_pipeline_json': 100,
}

When we run "scrapy crawl tongcheng", the pipelines are invoked in ascending order of their values, that is 100, 200, 300 (a lower number runs earlier). The log output shows this:

2018-01-20 10:48:10 [scrapy.middleware] INFO: Enabled item pipelines:
['douban.pipelines.tongcheng_pipeline_json',
 'douban.pipelines.movieTop250_crawlspider_json',
 'douban.pipelines.DoubanPipeline']
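
The ordering can be reproduced by sorting the setting by its integer values. This is only a sketch of the rule, not Scrapy's internal code:

```python
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
    'douban.pipelines.movieTop250_crawlspider_json': 200,
    'douban.pipelines.tongcheng_pipeline_json': 100,
}

# Lower numbers run first, so sort the class paths by their value.
call_order = [path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]

for path in call_order:
    print(path)
```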

2. Bind different spiders to different pipelines
A project may contain several crawlers with different purposes, so we need to bind each crawler to its own pipeline and store the extracted content in different places. How can this be done? Scrapy reads its settings from several sources when it runs. In order of precedence, from highest to lowest, they are:

1. Command line options (most precedence)
2. Settings per-spider
3. Project settings module
4. Default settings per-command
5. Default global settings (less precedence)

The settings.py we have been using is the "Project settings module", so a source with higher precedence can be used to bind a pipeline to a spider, for example "Settings per-spider":

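vim tongcheng.py
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
# module path inferred from the 'douban.pipelines...' entries used in this post
from douban.items import tongcheng

# add the custom_settings attribute
class TongchengSpider(CrawlSpider):
    name = 'tongcheng'
    allowed_domains = ['douban.com']
    start_urls = ['https://www.douban.com/location/shenzhen/events/week-all']
    custom_settings = {
        'ITEM_PIPELINES': {
            'douban.pipelines.tongcheng_pipeline_json': 300,
        }
    }
    rules = (
        Rule(LinkExtractor(allow=r'start=10')),
        Rule(LinkExtractor(allow=r'https://www.douban.com/event/\d+/'), callback='parse_item'),
    )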
    
    def parse_item(self, response):
        loader = ItemLoader(item=tongcheng(),selector=response)
        info = loader.nested_xpath('//div[@class="event-info"]')
        info.add_xpath('title','h1[@itemprop="summary"]/text()')
        info.add_xpath('time','div[@class="event-detail"]/ul[@class="calendar-strs"]/li/text()')
        info.add_xpath('address','div[@itemprop="location"]/span[@class="micro-address"]/span[@class="micro-address"]/text()')
        info.add_xpath('money','div[@class="event-detail"]/span[@itemprop="ticketAggregate"]/text()')
        info.add_xpath('intrest','div[@class="interest-attend pl"]/span[1]/text()')
        info.add_xpath('join','div[@class="interest-attend pl"]/span[3]/text()')
        
        yield loader.load_item()

With custom_settings we can bind the spider to tongcheng_pipeline_json, so that it does not invoke every pipeline listed in settings.py.
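
One way to picture this override is a layered lookup in which the first layer that defines a key wins. The sketch below uses `collections.ChainMap` purely as an analogy under that assumption. Scrapy's own settings object works with numeric priorities rather than a ChainMap, and the layer contents (including the `BOT_NAME` value) are illustrative.

```python
from collections import ChainMap

# Layers listed from highest to lowest precedence (only two are populated here).
command_line = {}
per_spider = {
    'ITEM_PIPELINES': {'douban.pipelines.tongcheng_pipeline_json': 300},
}
project = {
    'ITEM_PIPELINES': {
        'douban.pipelines.DoubanPipeline': 300,
        'douban.pipelines.movieTop250_crawlspider_json': 200,
        'douban.pipelines.tongcheng_pipeline_json': 100,
    },
    'BOT_NAME': 'douban',  # assumed value, for illustration only
}

settings = ChainMap(command_line, per_spider, project)

# The per-spider value replaces the project value as a whole.
print(settings['ITEM_PIPELINES'])
# Keys the spider does not override still come from the project layer.
print(settings['BOT_NAME'])
```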
## Output to MongoDB
Because this is only a test, we use Docker to install and run MongoDB.
1. Install MongoDB with Docker

# search for the image
sudo docker search mongo
# pull the image
sudo docker pull mongo
# start mongodb, mapping container port 27017 to host port 27017 and mounting a host directory on /data/db
sudo docker run --name scrapy-mongodb -p 27017:27017 -v /home/yanggd/docker/mongodb:/data/db -d mongo
# connect to mongo
sudo docker run -it mongo mongo --host 10.11.2.102

2. Add the database connection parameters to the configuration file

vim ../settings.py
# MongoDB connection settings
MONGO_HOST = '10.11.2.102'
MONGO_PORT = 27017
MONGO_DB = 'douban'

3. Define the pipeline

vim pipelines.py
import pymongo
class tongcheng_pipeline_mongodb(object):
    # collection that receives the items
    mongo_collection = "tongcheng"
    def __init__(self, mongo_host, mongo_port, mongo_db):
        self.mongo_host = mongo_host
        self.mongo_port = mongo_port
        self.mongo_db = mongo_db
    @classmethod
    def from_crawler(cls, crawl):
        # read the connection parameters from settings.py
        return cls(
            mongo_host = crawl.settings.get("MONGO_HOST"),
            mongo_port = crawl.settings.get("MONGO_PORT"),
            mongo_db = crawl.settings.get("MONGO_DB")
        )
    def open_spider(self, spider):
        # connect when the spider is opened
        self.client = pymongo.MongoClient(self.mongo_host, self.mongo_port)
        self.db = self.client[self.mongo_db]
    def close_spider(self, spider):
        # disconnect when the spider is closed
        self.client.close()
    def process_item(self, item, spider):
        tongchenginfo = dict(item)
        self.db[self.mongo_collection].insert_one(tongchenginfo)
        return item
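
The `from_crawler` hook is how the pipeline receives the values from settings.py. To show that flow without a running crawl or a MongoDB server, the sketch below passes a stand-in object whose `settings` attribute is a plain dict. `SettingsOnlyPipeline` and `fake_crawler` are hypothetical names, and a real crawl passes Scrapy's own crawler and settings objects instead.

```python
from types import SimpleNamespace


class SettingsOnlyPipeline:
    """Hypothetical pipeline that only stores its connection parameters."""

    def __init__(self, mongo_host, mongo_port, mongo_db):
        self.mongo_host = mongo_host
        self.mongo_port = mongo_port
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the three values defined in settings.py.
        return cls(
            mongo_host=crawler.settings.get('MONGO_HOST'),
            mongo_port=crawler.settings.get('MONGO_PORT'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )


# Stand-in for the crawler: only the .settings attribute is needed here.
fake_crawler = SimpleNamespace(settings={
    'MONGO_HOST': '10.11.2.102',
    'MONGO_PORT': 27017,
    'MONGO_DB': 'douban',
})

pipeline = SettingsOnlyPipeline.from_crawler(fake_crawler)
print(pipeline.mongo_host, pipeline.mongo_port, pipeline.mongo_db)
```

Keeping the connection parameters in settings rather than in the class makes it possible to point the same pipeline at a different server without editing pipelines.py.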

4. Bind the pipeline
Because the project contains several crawlers, bind the pipeline to this spider through custom_settings.

vim tongcheng.py
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
# module path inferred from the 'douban.pipelines...' entries used in this post
from douban.items import tongcheng

# add the custom_settings attribute
class TongchengSpider(CrawlSpider):
    name = 'tongcheng'
    allowed_domains = ['douban.com']
    start_urls = ['https://www.douban.com/location/shenzhen/events/week-all']
    custom_settings = {
        'ITEM_PIPELINES': {
            'douban.pipelines.tongcheng_pipeline_mongodb': 300,
        }
    }
    rules = (
        Rule(LinkExtractor(allow=r'start=10')),
        Rule(LinkExtractor(allow=r'https://www.douban.com/event/\d+/'), callback='parse_item'),
    )

    def parse_item(self, response):
        loader = ItemLoader(item=tongcheng(),selector=response)
        info = loader.nested_xpath('//div[@class="event-info"]')
        info.add_xpath('title','h1[@itemprop="summary"]/text()')
        info.add_xpath('time','div[@class="event-detail"]/ul[@class="calendar-strs"]/li/text()')
        info.add_xpath('address','div[@itemprop="location"]/span[@class="micro-address"]/span[@class="micro-address"]/text()')
        info.add_xpath('money','div[@class="event-detail"]/span[@itemprop="ticketAggregate"]/text()')
        info.add_xpath('intrest','div[@class="interest-attend pl"]/span[1]/text()')
        info.add_xpath('join','div[@class="interest-attend pl"]/span[3]/text()')

        yield loader.load_item()

5. View the database

# connect to mongo
sudo docker run -it mongo mongo --host 10.11.2.102
> show dbs
admin   0.000GB
config  0.000GB
douban  0.000GB
local   0.000GB
> use douban
switched to db douban
> show collections
movietop250
tongcheng
> db.tongcheng.find()
{ "_id" : ObjectId("5a6319a76e85dc5a777131d2"), "join" : [ "69 " ], "intrest" : [ "175 " ], "title" : [ "       《         》   " ], "money" : [ "263 " ], "address" : "                           " }
{ "_id" : ObjectId("5a6319a96e85dc5a777131d3"), "join" : [ "4 " ], "intrest" : [ "11 " ], "title" : [ "2018          · ·      《  ·  》-  " ], "money" : [ "93 - 281 " ], "address" : "                                 " }
{ "_id" : ObjectId("5a6319ab6e85dc5a777131d4"), "join" : [ "7 " ], "intrest" : [ "16 " ], "title" : [ "2018        ·  X   X     《   》-  " ], "money" : [ "93 - 469 " ], "address" : "                                 " }
......

