Python Web Crawlers: The Scrapy Framework (CrawlSpider)

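Introduction
Question: if we want a crawler to fetch the data of the entire Qiushibaike ("Qiubai") site, what approaches are available?
Method 1: Recursive crawling with the Spider class in Scrapy (manually issuing Request objects whose callback is the parse method).
Method 2: Automatic crawling with CrawlSpider (more concise and efficient).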
Outline

CrawlSpider overview

Using CrawlSpider

Generating a spider file based on CrawlSpider

Link extractor

Rule parser

Details
1. Overview
CrawlSpider is a subclass of Spider. Besides inheriting Spider's features, it adds more powerful ones of its own, the most notable being the link extractor (LinkExtractor). Spider is the base class of all spiders and is designed only to crawl the pages listed in start_urls. When the job is to keep following URLs extracted from pages that have already been crawled, CrawlSpider is the better fit.
2. Usage
1. Create a Scrapy project: scrapy startproject projectName
2. Create a spider file: scrapy genspider -t crawl spiderName www.xx.com
- Compared with the usual command, this one adds "-t crawl", which means the generated spider file is based on the CrawlSpider class rather than the Spider class.
3. Inspect the generated spider file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutidemoSpider(CrawlSpider):
    name = 'choutiDemo'
    #allowed_domains = ['www.chouti.com']
    start_urls = ['http://www.chouti.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

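- Lines 3, 4: import the CrawlSpider-related modules
- Line 7: indicates that this spider is based on the CrawlSpider class
- Lines 12, 13, 14: the link-extraction rules
- Line 16: the parsing method
The biggest difference between CrawlSpider and Spider is that CrawlSpider has an additional rules attribute, whose role is to define the "extraction actions". rules can contain one or more Rule objects, and each Rule object contains a LinkExtractor object.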
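3.1 LinkExtractor: as the name suggests, a link extractor.
LinkExtractor(
    allow=r'Items/',  # links matching this regular expression are extracted; if empty, every link matches.
    deny=xxx,  # links matching this regular expression are not extracted.
    restrict_xpaths=xxx,  # only links inside regions matching this XPath expression are extracted.
    restrict_css=xxx,  # only links inside regions matching this CSS selector are extracted.
    deny_domains=xxx,  # domains whose links are not extracted.
)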

- Role: extracts the links that match the specified rules from a response.
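3.2 Rule: the rule parser. For each link produced by the link extractor, it parses the content of the linked page according to the specified rule.
Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)
- Parameters:
Parameter 1: the link extractor to use
Parameter 2: the rule for parsing the data (the callback function)
Parameter 3: whether to keep applying the link extractor to the pages reached through the extracted links. If callback is None, parameter 3 defaults to True; otherwise it defaults to False.
3.3 rules=(): specifies the different rule parsers. Each Rule object represents one extraction rule.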
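3.4 The overall CrawlSpider crawl flow:
a) The spider first fetches the page content of the start URL.
b) The link extractor extracts links from the page content of step a according to the specified extraction rule.
c) The rule parser parses the page content behind each extracted link according to the specified parsing rule.
d) The parsed data is packed into an item and submitted to the pipeline for persistent storage.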
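The four steps above can be mimicked with a few lines of plain Python. This is only a simulation under stated assumptions, not Scrapy's implementation: PAGES is a made-up in-memory site, and the helper names extract_links, parse_item, and crawl are hypothetical.

```python
import re
from collections import deque

# Hypothetical in-memory "site" (URL -> HTML); stands in for real HTTP fetches.
PAGES = {
    "/pic/": '<a href="/pic/page/2?s=1">next</a> <p>joke A</p>',
    "/pic/page/2?s=1": '<a href="/pic/page/3?s=1">next</a> <a href="/pic/">first</a> <p>joke B</p>',
    "/pic/page/3?s=1": "<p>joke C</p>",
}

# Plays the role of LinkExtractor(allow=...).
ALLOW = re.compile(r"/pic/page/\d+\?")


def extract_links(html):
    """Step b: keep only the hrefs that satisfy the allow pattern."""
    return [h for h in re.findall(r'href="([^"]+)"', html) if ALLOW.search(h)]


def parse_item(html):
    """Step c: the callback; here it simply collects the <p> text."""
    return [{"content": text} for text in re.findall(r"<p>(.*?)</p>", html)]


def crawl(start_url):
    seen = {start_url}
    queue = deque([start_url])
    items = []
    while queue:
        url = queue.popleft()
        html = PAGES[url]  # step a: "fetch" the page
        # As in CrawlSpider, the callback runs on extracted links only,
        # not on the start URL itself.
        if url != start_url:
            items.extend(parse_item(html))  # steps c and d
        # Step b is repeated on every page, which is what follow=True means.
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return items


items = crawl("/pic/")
print(items)
```

The seen set stands in for Scrapy's duplicate filter, which is why the loop terminates even when pages link back to each other.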
4. Simple hands-on example
4.1 Crawl the data of every page in the Qiushibaike picture (pic) section

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrawldemoSpider(CrawlSpider):
    name = 'qiubai'
    #allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/pic/']

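    # Link extractors: extract the specified URLs from the page returned for the start URL
    link = LinkExtractor(allow=r'/pic/page/\d+\?')  # paginated pages (the URL continues with "?s=")
    link1 = LinkExtractor(allow=r'/pic/$')  # the first page of the section
    # The rules tuple holds the rule parsers, each wrapping one parsing rule
    rules = (
        # Rule parser: parses the page behind every extracted link with the given callback
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )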

    def parse_item(self, response):
        print(response)
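A quick way to check which URLs the two allow patterns in this spider would pick up is to try them with re.search, which, as far as I know, is how LinkExtractor applies allow patterns to each absolute URL. The candidate URLs and the s= values below are made up for illustration.

```python
import re

# The same two patterns used by link and link1 in the spider.
page_pattern = re.compile(r"/pic/page/\d+\?")
first_pattern = re.compile(r"/pic/$")

# Hypothetical URLs a page of the picture section might contain.
candidates = [
    "https://www.qiushibaike.com/pic/page/2?s=5184961",
    "https://www.qiushibaike.com/pic/",
    "https://www.qiushibaike.com/pic/page/3",
    "https://www.qiushibaike.com/text/",
]


def matched_by(url):
    """Return the names of the extractors whose pattern matches the URL."""
    names = []
    if page_pattern.search(url):
        names.append("link")
    if first_pattern.search(url):
        names.append("link1")
    return names


for url in candidates:
    print(url, matched_by(url))
```

Note that /pic/page/3 without a query string is not matched, because the first pattern requires a literal "?" after the page number.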

4.2 Spider file:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from qiubaiBycrawl.items import QiubaibycrawlItem


class QiubaitestSpider(CrawlSpider):
    name = 'qiubaiTest'
    # Start URL
    start_urls = ['http://www.qiushibaike.com/']

    # Link extractor: selects the pagination links
    page_link = LinkExtractor(allow=r'/8hr/page/\d+/')

    rules = (
        # Rule parser: the callback parses every page the extractor finds
        Rule(page_link, callback='parse_item', follow=True),
    )

    # Parsing method (callback)
    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')

        for div in div_list:
            # Instantiate an item object
            item = QiubaibycrawlItem()
            # Extract the author with XPath; default='' avoids calling strip() on None
            item['author'] = div.xpath('./div/a[2]/h2/text()').extract_first(default='').strip('\n')
            # Extract the content with XPath
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first(default='').strip('\n')

            yield item  # Submit the item to the pipeline

4.3 Item file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QiubaibycrawlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()  # author of the post
    content = scrapy.Field()  # text content of the post

4.4 Pipeline file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

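class QiubaibycrawlPipeline(object):

    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        print('Spider started')
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write each item submitted by the spider to the file, one per line
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        print('Spider finished')
        self.fp.close()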
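
For the pipeline to run, it has to be enabled in the project's settings.py, as the comment at the top of the pipeline file reminds us. A minimal fragment might look like this, assuming the project package is named qiubaiBycrawl (as the spider's import suggests) and the class lives in the default pipelines.py; 300 is just a conventional priority value.

```python
# settings.py (fragment)
# Assumption: default project layout, package name taken from the spider's import.
ITEM_PIPELINES = {
    "qiubaiBycrawl.pipelines.QiubaibycrawlPipeline": 300,
}
```

Lower numbers run earlier when several pipelines are enabled; with a single pipeline the exact value does not matter.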
