Scheduled serverless scraping with AWS Lambda + Scrapy, part 1


First post!
I originally wanted to cover the serverless part in the same article, but I ran out of time.
So this installment covers the scraping part only.

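What I want to do

Automatically scrape a web page whose information is updated on a regular basis!

Goal

Fetch the Yahoo! Weather (Tokyo) data every 6 hours.

Approach

Python + Scrapy + AWS Lambda + CloudWatch Events looks like it should do the job.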

Let's give it a try

Scraping first

Build the crawling and scraping part with the following steps.

Install Scrapy

Create a Scrapy project

Create the spider

Run it

1. Install Scrapy

$ python3 -V
Python 3.7.4

$ pip3 install scrapy
...
Successfully installed

$ scrapy version
Scrapy 1.8.0

2. Create a Scrapy project

The project folder is created in the directory where you run the command.

$ scrapy startproject yahoo_weather_crawl
New Scrapy project 'yahoo_weather_crawl'

$ ls
yahoo_weather_crawl

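This time I want to grab the weekly forecast section of the Yahoo! Weather page.

I'll pick up the announcement date and time, the date, the weather, the temperature, and the chance of rain.

Scrapy comes with a command-line shell where you can type commands and check that the target is being selected correctly, so I'll use it to verify each step as I go.

The target is specified with XPath.
An XPath is easy to get from Google Chrome's developer tools (the panel that opens when you press F12).

The XPath for the announcement time is: //*[@id="week"]/p
Let's pull it out of the response.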


# Start the scrapy shell
$ scrapy shell https://weather.yahoo.co.jp/weather/jp/13/4410.html

>>> announcement_date = response.xpath('//*[@id="week"]/p/text()').extract_first()
>>> announcement_date
'2019年11月29日  18時00分発表'

Appending text() selects only the text content of the element.
See the references for details.
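The value returned by the shell is still a plain string such as '2019年11月29日  18時00分発表'. If a real timestamp is needed later, a minimal sketch like the following could parse it. It assumes the page keeps the "year/month/day hour/minute" layout seen in that sample; `parse_announcement` is a hypothetical helper name, not part of Scrapy.

```python
import re
from datetime import datetime

# Assumed layout, based on the sample string seen in the shell session:
# <year>年<month>月<day>日 <hour>時<minute>分発表
_ANNOUNCEMENT = re.compile(
    r"(\d{4})年(\d{1,2})月(\d{1,2})日\s+(\d{1,2})時(\d{1,2})分発表"
)


def parse_announcement(text):
    """Return a datetime for the announcement string, or None if it does not match."""
    match = _ANNOUNCEMENT.search(text)
    if match is None:
        return None
    year, month, day, hour, minute = (int(part) for part in match.groups())
    return datetime(year, month, day, hour, minute)


print(parse_announcement('2019年11月29日  18時00分発表'))
```

Returning None instead of raising keeps a layout change on the page from crashing the whole crawl; whether that is the right trade-off depends on how the data is used.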

The announcement time came through, so let's fetch the rest the same way.

The other values live inside a table tag, so first grab the whole table.


>>> table = response.xpath('//*[@id="yjw_week"]/table')

Now the elements inside the table tag under id="yjw_week" are available.
Each value is pulled out from here.


# Note: './/' keeps each query relative to `table`;
# a bare '//' would search the whole page again.

# Date
>>> date = table.xpath('.//tr[1]/td[2]/small/text()').extract_first()
>>> date
'12月1日'

# Weather
>>> weather = table.xpath('.//tr[2]/td[2]/small/text()').extract_first()
>>> weather
'曇時々晴'

# Temperature
>>> temperature = table.xpath('.//tr[3]/td[2]/small/font/text()').extract()
>>> temperature
['14', '5']

# Chance of rain
>>> rainy_percent = table.xpath('.//tr[4]/td[2]/small/text()').extract_first()
>>> rainy_percent
'20'

Now that we know how to get each value,
let's write the Spider (the main processing part).

3. Create the spider

The project folder created earlier is laid out as follows.


.
├── scrapy.cfg
└── yahoo_weather_crawl
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__

First, define the items to collect.

items.py


import scrapy

class YahooWeatherCrawlItem(scrapy.Item):
    announcement_date = scrapy.Field()  # announcement date and time
    date = scrapy.Field()               # date
    weather = scrapy.Field()            # weather
    temperature = scrapy.Field()        # temperature
    rainy_percent = scrapy.Field()      # chance of rain
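Every field scraped in the shell session came back as a string ('20', ['14', '5']). If numbers are preferred, one option is an item pipeline in pipelines.py. The following is only a sketch: `NormalizeWeatherPipeline` and `_to_int` are hypothetical names, and a plain dict stands in for the item so the snippet runs without Scrapy.

```python
def _to_int(value):
    """Return int(value), or None when the text is not a plain number."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


class NormalizeWeatherPipeline:
    """Hypothetical pipeline that converts scraped strings to integers."""

    def process_item(self, item, spider):
        # An item without these fields (e.g. the announcement-only item)
        # passes through unchanged.
        if item.get("temperature") is not None:
            item["temperature"] = [_to_int(v) for v in item["temperature"]]
        if item.get("rainy_percent") is not None:
            item["rainy_percent"] = _to_int(item["rainy_percent"])
        return item


# Quick check with a plain dict standing in for a YahooWeatherCrawlItem
sample = {"date": "12月1日", "weather": "曇時々晴",
          "temperature": ["14", "5"], "rainy_percent": "20"}
print(NormalizeWeatherPipeline().process_item(sample, spider=None))
```

To take effect in a real project, the class would have to be registered under the ITEM_PIPELINES setting in settings.py, for example with the path yahoo_weather_crawl.pipelines.NormalizeWeatherPipeline if it is placed in pipelines.py.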

Next, create the spider itself in the spiders folder.

spiders/weather_spider.py

# -*- coding: utf-8 -*-
import scrapy
from yahoo_weather_crawl.items import YahooWeatherCrawlItem


class YahooWeatherSpider(scrapy.Spider):
    """Spider that scrapes the Yahoo! Weather forecast for Tokyo."""

    name = "yahoo_weather_crawler"
    allowed_domains = ['weather.yahoo.co.jp']
    start_urls = ["https://weather.yahoo.co.jp/weather/jp/13/4410.html"]

    # Extract data from the response
    def parse(self, response):
        # Announcement date and time
        yield YahooWeatherCrawlItem(
            announcement_date=response.xpath('//*[@id="week"]/p/text()').extract_first()
        )
        table = response.xpath('//*[@id="yjw_week"]/table')

        # Loop over the date columns (td[2] to td[6])
        for day in range(2, 7):

            yield YahooWeatherCrawlItem(
                # './/' keeps each XPath relative to the table selector
                date=table.xpath('.//tr[1]/td[%d]/small/text()' % day).extract_first(),
                weather=table.xpath('.//tr[2]/td[%d]/small/text()' % day).extract_first(),
                temperature=table.xpath('.//tr[3]/td[%d]/small/font/text()' % day).extract(),
                rainy_percent=table.xpath('.//tr[4]/td[%d]/small/text()' % day).extract_first(),
                )
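The loop in the spider builds one XPath per column with plain `%d` string formatting: `range(2, 7)` yields 2 through 6, so five columns are visited per row. The sketch below prints the generated expressions for one row, written in the relative `.//` form. `column_xpaths` is a hypothetical helper, and the reading that td[1] holds the row label is an assumption drawn from the loop starting at 2.

```python
def column_xpaths(row, tail, first_col=2, last_col=6):
    """Build one relative XPath per forecast column for the given table row.

    Assumption: td[1] is the row label, so data columns start at td[2].
    """
    return ['.//tr[%d]/td[%d]/%s' % (row, col, tail)
            for col in range(first_col, last_col + 1)]


# Row 1 is the date row in the spider
for xpath in column_xpaths(1, 'small/text()'):
    print(xpath)
```

Printing the expressions this way is a cheap check before pasting them into the scrapy shell one by one.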

4. Run it!

$ scrapy crawl yahoo_weather_crawler

2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'announcement_date': '2019年12月1日  17時00分発表'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': '12月3日',
 'rainy_percent': '10',
 'temperature': ['17', '10'],
 'weather': '晴れ'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': '12月4日',
 'rainy_percent': '0',
 'temperature': ['15', '4'],
 'weather': '晴れ'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': '12月5日',
 'rainy_percent': '0',
 'temperature': ['14', '4'],
 'weather': '晴時々曇'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': '12月6日',
 'rainy_percent': '10',
 'temperature': ['11', '4'],
 'weather': '曇り'}
2019-12-01 20:17:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://weather.yahoo.co.jp/weather/jp/13/4410.html>
{'date': '12月7日',
 'rainy_percent': '30',
 'temperature': ['9', '3'],
 'weather': '曇り'}

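Looks like the data is coming through fine!
While we're at it, let's write it to a file.

By default, Scrapy's JSON feed export writes non-ASCII characters as \uXXXX escape sequences, so the Japanese text is not readable in the file.
Adding an encoding setting to settings.py fixes this.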

settings.py

FEED_EXPORT_ENCODING='utf-8'
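The effect of this setting on JSON output is similar to the `ensure_ascii` switch of Python's standard `json` module. The sketch below is only an analogy built on the standard library, not Scrapy's exporter code, and the one-field item is a made-up sample.

```python
import json

item = {"weather": "晴れ"}

# Default behaviour: non-ASCII characters become \uXXXX escapes
escaped = json.dumps(item)
# With ensure_ascii=False the characters are written as-is
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to the same data
assert json.loads(escaped) == json.loads(readable) == item
```

Either form is valid JSON; the setting only matters for people (or tools) that read the file without decoding it first.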

$ scrapy crawl yahoo_weather_crawler -o weather_data.json
...

weather_data.json

[
{"announcement_date": "2019年12月1日  17時00分発表"},
{"date": "12月3日", "weather": "晴れ", "temperature": ["17", "10"], "rainy_percent": "10"},
{"date": "12月4日", "weather": "晴れ", "temperature": ["15", "4"], "rainy_percent": "0"},
{"date": "12月5日", "weather": "晴時々曇", "temperature": ["14", "4"], "rainy_percent": "0"},
{"date": "12月6日", "weather": "曇り", "temperature": ["11", "4"], "rainy_percent": "10"},
{"date": "12月7日", "weather": "曇り", "temperature": ["9", "3"], "rainy_percent": "30"}
]

The file was written!
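In the output file the announcement time sits in its own record, separate from the daily forecasts. If each forecast should carry it, a small post-processing step could merge them. This is a sketch under that assumption: `attach_announcement` is a hypothetical helper, and the inline sample is copied from the weather_data.json output shown above (shortened to two days) so the snippet runs without the file.

```python
import json

# Sample copied from the weather_data.json output (two days only)
RAW = '''[
{"announcement_date": "2019年12月1日  17時00分発表"},
{"date": "12月3日", "weather": "晴れ", "temperature": ["17", "10"], "rainy_percent": "10"},
{"date": "12月4日", "weather": "晴れ", "temperature": ["15", "4"], "rainy_percent": "0"}
]'''


def attach_announcement(records):
    """Copy the announcement time onto every daily forecast record."""
    announcement = None
    forecasts = []
    for record in records:
        if "announcement_date" in record:
            announcement = record["announcement_date"]
        else:
            forecasts.append(dict(record))  # copy, leave the input untouched
    for forecast in forecasts:
        forecast["announcement_date"] = announcement
    return forecasts


for row in attach_announcement(json.loads(RAW)):
    print(row["announcement_date"], row["date"], row["weather"])
```

With the real file, the same function could be fed from `json.load(open('weather_data.json', encoding='utf-8'))`.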

Next time I'll combine this processing with AWS and run it serverless.

References

Scrapy 1.8 documentation
https://docs.scrapy.org/en/latest/index.html
Understanding Scrapy in 10 minutes
https://qiita.com/Chanmoro/items/f4df85eb73b18d902739
Web scraping with Scrapy (Qiita)

Source

This article originally appeared at https://qiita.com/rkhcx/items/a0102bebcfc687f30eb2
