Scrapy-Redis: RedisSpider and RedisCrawlSpider Explained in Detail

In the previous chapter we used scrapy-redis to deploy a distributed crawler for JD (Jingdong) Books and collected its data. One problem remains, however.
Every crawler instance starts crawling from start_urls when it launches. In other words, every instance requests the same addresses in start_urls, and those duplicate requests waste system resources.
To solve this, Scrapy-Redis provides two spider classes, RedisSpider and RedisCrawlSpider. A spider that inherits from either class fetches its start_urls from a specified Redis list when it starts. When any crawler instance takes a URL from that Redis list, the URL is popped from the list, so no other instance can read the same URL again. An instance that did not obtain an initial URL from the Redis list stays in a waiting state until a new start address is pushed onto the start_urls list or the Requests queue in Redis contains requests to process.
Here we use the collection of book information from Dangdang (dangdang.com) as a simple example of how to use these two spiders.
The settings.py configuration is as follows.
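The pop behaviour described above can be illustrated without a Redis server. The following is only a minimal in-memory sketch: a `collections.deque` stands in for the Redis list, and the function name `distribute_start_urls`, the worker names, and the example.com URLs are all hypothetical. It shows that once a URL is popped by one instance, no other instance can receive it.

```python
from collections import deque


def distribute_start_urls(start_urls, worker_names):
    """Simulate several spider instances popping start URLs from one shared list."""
    shared_list = deque(start_urls)  # in-memory stand-in for the Redis list
    claimed = {name: [] for name in worker_names}
    # Workers take turns; each pop removes the URL for every other worker.
    while shared_list:
        for name in worker_names:
            if not shared_list:
                break  # nothing left: a real instance would keep waiting here
            claimed[name].append(shared_list.popleft())
    return claimed


urls = [f"http://example.com/page/{i}" for i in range(5)]  # placeholder URLs
result = distribute_start_urls(urls, ["spider-a", "spider-b"])
for name, taken in result.items():
    print(name, taken)
```

Each URL ends up with exactly one worker, which is the property that removes the duplicate start requests.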


# -*- coding: utf-8 -*-

BOT_NAME = 'dang_dang'

SPIDER_MODULES = ['dang_dang.spiders']
NEWSPIDER_MODULE = 'dang_dang.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False


######################################################
##############   Scrapy-Redis    ################
######################################################

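# Redis host and port
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Use the Scrapy-Redis scheduler, which keeps the Requests queue in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Make all spiders share one duplicate filter stored in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the Requests queue in Redis after the spider closes, so a crawl can be paused and resumed
SCHEDULER_PERSIST = True

# Requests scheduling strategy; the priority queue is the default
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Store scraped items in Redis for later processing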
ITEM_PIPELINES = {
  'scrapy_redis.pipelines.RedisPipeline': 300
}

RedisSpider code example


# -*- coding: utf-8 -*-
import scrapy
import re
import urllib.parse
from copy import deepcopy
from scrapy_redis.spiders import RedisSpider


class DangdangSpider(RedisSpider):
  name = 'dangdang'
  allowed_domains = ['dangdang.com']
  redis_key = 'dangdang:book'
  pattern = re.compile(r"https?://category\.dangdang\.com/cp(.*?)\.html", re.I)

  # def __init__(self, *args, **kwargs):
  #   # Dynamically define the allowed domains from the `domain` argument
  #   domain = kwargs.pop('domain', '')
  #   self.allowed_domains = list(filter(None, domain.split(',')))
  #   super(DangdangSpider, self).__init__(*args, **kwargs)

  def parse(self, response):  # parse the category index page
    # Top-level category groups
    div_list = response.xpath("//div[@class='con flq_body']/div")
    for div in div_list:
      item = {}
      item["b_cate"] = div.xpath("./dl/dt//text()").extract()
      item["b_cate"] = [i.strip() for i in item["b_cate"] if len(i.strip()) > 0]
      # Mid-level categories
      dl_list = div.xpath("./div//dl[@class='inner_dl']")
      for dl in dl_list:
        item["m_cate"] = dl.xpath(".//dt/a/@title").extract_first()
        # Sub-categories
        a_list = dl.xpath("./dd/a")
        for a in a_list:
          item["s_cate"] = a.xpath("./text()").extract_first()
          item["s_href"] = a.xpath("./@href").extract_first()
          if item["s_href"] is not None and self.pattern.match(item["s_href"]) is not None:
            yield scrapy.Request(item["s_href"], callback=self.parse_book_list,
                       meta={"item": deepcopy(item)})

  def parse_book_list(self, response):  # parse one page of a book list
    item = response.meta['item']
    li_list = response.xpath("//ul[@class='bigimg']/li")
    for li in li_list:
      item["book_img"] = li.xpath("./a[@class='pic']/img/@src").extract_first()
      if item["book_img"] == "images/model/guan/url_none.png":
        item["book_img"] = li.xpath("./a[@class='pic']/img/@data-original").extract_first()
      item["book_name"] = li.xpath("./p[@class='name']/a/@title").extract_first()
      item["book_desc"] = li.xpath("./p[@class='detail']/text()").extract_first()
      item["book_price"] = li.xpath(".//span[@class='search_now_price']/text()").extract_first()
      item["book_author"] = li.xpath("./p[@class='search_book_author']/span[1]/a/text()").extract_first()
      item["book_publish_date"] = li.xpath("./p[@class='search_book_author']/span[2]/text()").extract_first()
      if item["book_publish_date"] is not None:
        item["book_publish_date"] = item["book_publish_date"].replace('/', '')
      item["book_press"] = li.xpath("./p[@class='search_book_author']/span[3]/a/text()").extract_first()
      yield deepcopy(item)

    # Follow the next page of the list, if any
    next_url = response.xpath("//li[@class='next']/a/@href").extract_first()
    if next_url is not None:
      next_url = urllib.parse.urljoin(response.url, next_url)
      yield scrapy.Request(next_url, callback=self.parse_book_list, meta={"item": item})
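The spider passes `deepcopy(item)` in `meta` because a single `item` dict is overwritten on every loop pass, while the scheduled requests are handled later. The sketch below reproduces that effect in plain Python, with no Scrapy involved; `collect_meta` and the category names are hypothetical and exist only for this illustration.

```python
from copy import deepcopy


def collect_meta(categories, copy_item):
    """Build one meta dict per category while reusing a single mutable item."""
    item = {}
    queued = []
    for name in categories:
        item["s_cate"] = name  # the same dict is overwritten on every pass
        queued.append({"item": deepcopy(item) if copy_item else item})
    # Read the values afterwards, as a callback would when the response arrives.
    return [meta["item"]["s_cate"] for meta in queued]


shared = collect_meta(["novels", "history", "science"], copy_item=False)
copied = collect_meta(["novels", "history", "science"], copy_item=True)
print(shared)  # ['science', 'science', 'science']
print(copied)  # ['novels', 'history', 'science']
```

Without the copy, every queued request sees only the last category written to the shared dict.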

If the start URL list stored under the Redis key dangdang:book is empty when DangdangSpider starts, the spider waits until data is pushed onto the list. The console shows the following:
2019-05-08 14:02:53 [scrapy.core.engine] INFO: Spider opened
2019-05-08 14:02:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-08 14:02:53 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
At this point, push the crawler's initial address onto the start_urls list. You can insert it into the Redis list with the following command, for example from redis-cli:


lpush dangdang:book http://book.dangdang.com/
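The same push can be done from Python. This is a hedged sketch rather than part of the original project: `seed_start_urls` is a hypothetical helper, and `FakeRedis` is a tiny in-memory stand-in so the example runs without a server. With the redis-py package installed, a `redis.Redis(host="localhost", port=6379)` client could be passed in instead, since it offers an `lpush` method as well.

```python
def seed_start_urls(client, key, urls):
    """Push each start URL onto the list `key`; return the resulting list length."""
    length = 0
    for url in urls:
        length = client.lpush(key, url)  # same operation as `lpush` in redis-cli
    return length


class FakeRedis:
    """In-memory stand-in that mimics only the lpush call used above."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        self.lists.setdefault(key, []).insert(0, value)
        return len(self.lists[key])


client = FakeRedis()  # assumption: swap in a real redis-py client in practice
print(seed_start_urls(client, "dangdang:book", ["http://book.dangdang.com/"]))
```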

After the command runs, wait a moment and DangdangSpider starts collecting data. [Figure: structure of the collected data]
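To inspect the collected data, the stored items can be read back from Redis. The sketch below assumes that RedisPipeline keeps its default behaviour, namely that items are serialized as JSON and appended to a list named after the spider (here `dangdang:items`). The helper `read_items`, the `FakeRedis` stand-in, and the sample record are all hypothetical.

```python
import json


def read_items(client, key, limit=10):
    """Return up to `limit` items stored as JSON strings in the list `key`."""
    raw_items = client.lrange(key, 0, limit - 1)  # LRANGE end index is inclusive
    return [json.loads(raw) for raw in raw_items]


class FakeRedis:
    """In-memory stand-in that mimics only the lrange call used above."""

    def __init__(self, lists):
        self.lists = lists

    def lrange(self, key, start, end):
        return self.lists.get(key, [])[start:end + 1]


sample = {"book_name": "Example Title", "book_price": "10.00"}  # made-up record
client = FakeRedis({"dangdang:items": [json.dumps(sample).encode("utf-8")]})
for item in read_items(client, "dangdang:items"):
    print(item["book_name"], item["book_price"])
```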

RedisCrawlSpider code example


# -*- coding: utf-8 -*-
import scrapy
import re
import urllib.parse
from copy import deepcopy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisCrawlSpider


class DangdangCrawler(RedisCrawlSpider):
  name = 'dangdang2'
  allowed_domains = ['dangdang.com']
  redis_key = 'dangdang:book'
  pattern = re.compile(r"https?://category\.dangdang\.com/cp(.*?)\.html", re.I)

  rules = (
    Rule(LinkExtractor(allow=r'https?://category\.dangdang\.com/cp(.*?)\.html'), callback='parse_book_list',
       follow=False),
  )

  def parse_book_list(self, response):  # parse one page of a book list
    item = {}
    item['book_list_page'] = response.url
    li_list = response.xpath("//ul[@class='bigimg']/li")
    for li in li_list:
      item["book_img"] = li.xpath("./a[@class='pic']/img/@src").extract_first()
      if item["book_img"] == "images/model/guan/url_none.png":
        item["book_img"] = li.xpath("./a[@class='pic']/img/@data-original").extract_first()
      item["book_name"] = li.xpath("./p[@class='name']/a/@title").extract_first()
      item["book_desc"] = li.xpath("./p[@class='detail']/text()").extract_first()
      item["book_price"] = li.xpath(".//span[@class='search_now_price']/text()").extract_first()
      item["book_author"] = li.xpath("./p[@class='search_book_author']/span[1]/a/text()").extract_first()
      item["book_publish_date"] = li.xpath("./p[@class='search_book_author']/span[2]/text()").extract_first()
      if item["book_publish_date"] is not None:
        item["book_publish_date"] = item["book_publish_date"].replace('/', '')
      item["book_press"] = li.xpath("./p[@class='search_book_author']/span[3]/a/text()").extract_first()
      yield deepcopy(item)

    # Follow the next page of the list, if any
    next_url = response.xpath("//li[@class='next']/a/@href").extract_first()
    if next_url is not None:
      next_url = urllib.parse.urljoin(response.url, next_url)
      yield scrapy.Request(next_url, callback=self.parse_book_list)

Like DangdangSpider, DangdangCrawler also stays in a waiting state while it cannot obtain an initial crawl address, and it starts crawling as soon as the start_urls list contains an address. [Figure: structure of the crawled data]
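The `allow` pattern of the rule decides which links reach `parse_book_list`, so it is worth checking the regular expression on its own. The sketch below compares a pattern with unescaped dots against one with escaped dots using only the `re` module. It assumes the link extractor applies `allow` as a regular-expression search against absolute URLs; the candidate URLs are illustrative and were not taken from a real crawl.

```python
import re

# Unescaped dots match any character, so this pattern is looser than intended.
LOOSE = re.compile(r"(http|https)://category.dangdang.com/cp(.*?).html", re.I)
# Escaped dots match only a literal ".".
STRICT = re.compile(r"https?://category\.dangdang\.com/cp(.*?)\.html", re.I)

candidates = [
    "http://category.dangdang.com/cp01.03.00.00.00.00.html",  # category list page
    "http://categoryXdangdang.com/cp01.html",                 # different host
    "http://book.dangdang.com/",                              # not a list page
]

for url in candidates:
    print(url, bool(LOOSE.search(url)), bool(STRICT.search(url)))
```

The second URL is accepted by the loose pattern and rejected by the strict one, which is why escaping the dots is the safer choice.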
