Scraping Suzhou second-hand housing transaction data with Python and Scrapy


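Project requirements
Use Scrapy to extract second-hand housing transaction data for Suzhou from the Lianjia website and save it to a CSV file.
Requirements:
For floor area, total price, and unit price, keep only the numeric value and drop the unit.
Delete records whose fields are incomplete. If a listing's orientation is shown as "no data", delete that record as well.
The fields in the CSV file must appear in this order: house name, layout, floor area, orientation, decoration, elevator, total price, unit price, property rights.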
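The cleaning rules above can be sketched in plain Python before any Scrapy code is written. This is only an illustration under stated assumptions: the helper names `first_number` and `clean_listing`, the sample values, and the placeholder text `暂无数据` ("no data") are not taken from the original post.

```python
import re

NO_DATA = "暂无数据"  # assumed "no data" placeholder text shown by the site

# Field names follow the item defined later in this post.
REQUIRED_FIELDS = ["name", "type", "area", "direction", "fitment",
                   "elevator", "total_price", "unit_price"]


def first_number(text):
    """Return the first integer or decimal found in text, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return match.group(0) if match else None


def clean_listing(listing):
    """Return a cleaned copy of listing, or None if it should be dropped."""
    # Drop listings with a missing or empty field.
    if any(not listing.get(field) for field in REQUIRED_FIELDS):
        return None
    # Drop listings whose orientation is the "no data" placeholder.
    if listing["direction"] == NO_DATA:
        return None
    cleaned = dict(listing)
    # Keep only the number for area and prices, dropping the unit.
    for field in ("area", "total_price", "unit_price"):
        number = first_number(listing[field])
        if number is None:
            return None
        cleaned[field] = number
    return cleaned
```

Returning `None` for a rejected record keeps the sketch independent of Scrapy; inside a pipeline the same decision would be expressed by raising `DropItem`.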
Project analysis
Flowchart
[Figure: flowchart of the crawl]

Inspecting the page in the browser console shows that all listings sit inside a single ul, and each li inside it holds the information for one house.
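The "one ul, one li per house" structure suggests looping over the listing nodes and querying each one with a relative path. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors, and the markup and URLs are hypothetical, heavily simplified examples. The same idea applies in Scrapy: an XPath that starts with `//` on a sub-selector searches the whole document, so per-listing queries should start with `.`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the listing page markup.
SAMPLE = """
<ul>
  <li><div class="info clear"><div class="title"><a href="https://example.com/1.html">Listing one</a></div></div></li>
  <li><div class="info clear"><div class="title"><a href="https://example.com/2.html">Listing two</a></div></div></li>
</ul>
"""

root = ET.fromstring(SAMPLE)
titles = []
for info in root.findall("./li/div[@class='info clear']"):
    # The leading "." keeps the search inside this one listing.
    link = info.find("./div[@class='title']/a")
    titles.append((link.text, link.get("href")))
```

After the loop, `titles` holds one `(text, href)` pair per listing rather than the first listing repeated.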

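The required fields can then be located. Taking the house name as an example (the screenshot was taken on Linux, so it could not be annotated): it is the element in the very middle, "Jingshan Rose Garden".
The other fields are located the same way and are not listed one by one.
The listing page does not say whether a house has an elevator, so after collecting the data above we also need the detail page, that is, the page opened by clicking the listing title.
Clicking the title opens the detail page.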

The detail page contains the information we need.

Extract the detail page URL

Analyze the detail page

Find the corresponding element and extract the data.
3. Writing the program
Creating the project itself is routine and is not covered here.
1. Write the item (data storage)


import scrapy


class LianjiaHomeItem(scrapy.Item):
    name = scrapy.Field()  # house name
    type = scrapy.Field()  # layout
    area = scrapy.Field()  # floor area
    direction = scrapy.Field()  # orientation
    fitment = scrapy.Field()  # decoration
    elevator = scrapy.Field()  # elevator
    total_price = scrapy.Field()  # total price
    unit_price = scrapy.Field()  # unit price

2. Write the spider (data scraping)


from scrapy import Request
from scrapy.spiders import Spider
from lianjia_home.items import LianjiaHomeItem

class HomeSpider(Spider):
    name = "home"
    current_page = 1  # current listing page number

    def start_requests(self):  # entry point: request the first listing page
        url = "https://su.lianjia.com/ershoufang/"
        yield Request(url=url)

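    def parse(self, response):  # parse one listing page
        list_selector = response.xpath("//li/div[@class='info clear']")
        for one_selector in list_selector:
            try:
                # House name; the leading "." keeps the XPath relative to this listing
                # (a path starting with "//" would search the whole page)
                name = one_selector.xpath(".//div[@class='flood']/div[@class='positionInfo']/a/text()").extract_first()
                # Layout, area, orientation and decoration share one "|"-separated string
                other = one_selector.xpath(".//div[@class='address']/div[@class='houseInfo']/text()").extract_first()
                other_list = other.split("|")
                house_type = other_list[0].strip()  # layout
                area = other_list[1].strip()  # floor area
                direction = other_list[2].strip()  # orientation
                fitment = other_list[3].strip()  # decoration
                price_list = one_selector.xpath("div[@class='priceInfo']//span/text()")
                # Total price
                total_price = price_list[0].extract()
                # Unit price
                unit_price = price_list[1].extract()

                item = LianjiaHomeItem()
                item["name"] = name.strip()
                item["type"] = house_type
                item["area"] = area
                item["direction"] = direction
                item["fitment"] = fitment
                item["total_price"] = total_price
                item["unit_price"] = unit_price

                # Detail page URL, taken from the title link
                url = one_selector.xpath("div[@class='title']/a/@href").extract_first()
                yield Request(url=url,
                              meta={"item": item},  # pass the partly filled item to the callback
                              callback=self.property_parse)  # parse the detail page
            except Exception as exc:
                # A listing with missing or malformed fields is skipped
                self.logger.warning("skipping listing: %r", exc)

        # Next page: requested once per listing page, after the loop has finished
        self.current_page += 1
        if self.current_page <= 100:
            next_url = "https://su.lianjia.com/ershoufang/pg%d" % self.current_page
            yield Request(url=next_url)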


    def property_parse(self, response):  # parse the detail page
        # Elevator information: the last entry of the basic attributes list
        elevator = response.xpath("//div[@class='base']/div[@class='content']/ul/li[last()]/text()").extract_first()
        item = response.meta["item"]
        item["elevator"] = elevator
        yield item
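The spider relies on the house information arriving as one "|"-separated string with at least four parts. A minimal sketch of that splitting step as a standalone function is shown below; the function name `parse_house_info` and the sample strings are hypothetical, and the field order (layout, area, orientation, decoration) is the one the spider above assumes.

```python
def parse_house_info(text):
    """Split a 'layout | area | orientation | decoration | ...' string.

    Returns a dict with the first four parts, or None when the string
    has fewer than four parts.
    """
    parts = [part.strip() for part in text.split("|")]
    if len(parts) < 4:
        return None
    keys = ("type", "area", "direction", "fitment")
    # zip stops after the four named keys, so extra parts are ignored.
    return dict(zip(keys, parts))
```

Checking the number of parts explicitly makes a short or malformed string a visible case instead of an `IndexError` caught somewhere further up.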

3. Write the pipelines (data processing)


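import re
from scrapy.exceptions import DropItem


class LianjiaHomePipeline:  # data cleaning
    def process_item(self, item, spider):
        # Floor area: keep only the number
        item["area"] = re.findall(r"\d+\.?\d*", item["area"])[0]
        # Unit price: keep only the number
        item["unit_price"] = re.findall(r"\d+\.?\d*", item["unit_price"])[0]

        # Drop the item when the orientation is the site's "no data" placeholder
        # ("暂无数据" is assumed to be the exact text the site shows)
        if item["direction"] == "暂无数据":
            raise DropItem("No orientation data, dropping: %s" % item)

        return item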

class CSVPipeline(object):
    file = None
    index = 0  # becomes 1 once the header row has been written

    def open_spider(self, spider):  # when the spider starts, open the CSV file
        self.file = open("home.csv", "a", encoding="utf-8")

    def process_item(self, item, spider):  # write one item as a CSV row
        if self.index == 0:
            column_name = "name,type,area,direction,fitment,elevator,total_price,unit_price\n"
            self.file.write(column_name)  # write the header row once
            self.index = 1

        home_str = ",".join([item["name"], item["type"], item["area"], item["direction"],
                             item["fitment"], item["elevator"], item["total_price"],
                             item["unit_price"]]) + "\n"
        self.file.write(home_str)  # write the data row

        return item

    def close_spider(self, spider):  # when the spider closes, close the CSV file
        self.file.close()
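Joining values with commas by hand breaks as soon as a value itself contains a comma. As a hedged alternative, the row-writing step could use the standard library's `csv.DictWriter`, which quotes such values and fixes the column order. The sketch below writes to an in-memory buffer so it can run on its own; the function name `rows_to_csv` and the sample row are hypothetical.

```python
import csv
import io

# Column order used by the pipeline above.
FIELDS = ["name", "type", "area", "direction", "fitment",
          "elevator", "total_price", "unit_price"]


def rows_to_csv(rows):
    """Serialize dict rows to CSV text with a header and fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # values containing commas are quoted automatically
    return buffer.getvalue()
```

In a pipeline, the same writer would wrap the file opened in `open_spider` (opened with `newline=""`, as the `csv` module documentation recommends) instead of a `StringIO` buffer.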

4. Write the settings (crawler configuration)
Only the parts that need to be changed are shown here.


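USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36'
# Browser User-Agent sent with every request
ROBOTSTXT_OBEY = False  # do not obey robots.txt
ITEM_PIPELINES = {
    'lianjia_home.pipelines.LianjiaHomePipeline': 300,
    # Cleans the data first
    'lianjia_home.pipelines.CSVPipeline': 400
    # Then writes the cleaned items to the CSV file
    # Pipelines with a lower number run earlier
}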

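USER_AGENT and ITEM_PIPELINES are commented out by default in the generated settings.py; remove the leading # to enable them. ROBOTSTXT_OBEY is set to True by default and has to be changed to False.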
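As a possible alternative to a hand-written CSV pipeline, Scrapy's built-in feed exports can write the CSV file with a fixed column order. The fragment below is a sketch for settings.py, assuming a Scrapy version that supports the `FEEDS` setting (2.1 or later); it is not part of the original project.

```python
# settings.py fragment (sketch, not from the original post):
# let Scrapy's feed export write home.csv instead of a custom CSV pipeline.
FEEDS = {
    "home.csv": {
        "format": "csv",
        "encoding": "utf8",
        # Column order of the exported file
        "fields": ["name", "type", "area", "direction", "fitment",
                   "elevator", "total_price", "unit_price"],
    },
}
```

With this in place, only the cleaning pipeline would need to stay in `ITEM_PIPELINES`, since items that survive it are exported automatically.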
5. Write start (a script to use instead of the command line)


from scrapy import cmdline

cmdline.execute("scrapy crawl home".split())

The original post closes with two screenshots of the results.

Summary
This project adds a simple data-cleaning step; otherwise it does not make the scraping itself any harder.