Building a Concurrent Web Scraper with Python and Selenium


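Web scraping and crawling is the process of automatically extracting data from websites. Websites are built with HTML, so their data may be unstructured, but we can find structure with the help of HTML tags, IDs, and classes. In this article, we will see how to speed this process up using concurrent methods such as multithreading. Concurrency means that two or more tasks overlap in time: while one task waits on the network or the browser, another can make progress. This shortens the total run time and makes the web scraper faster.

In this article, we will create a bot that scrapes all the quotes, authors, and tags from a sample website (quotes.toscrape.com), and then run it concurrently using multithreading and multiprocessing.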

Required modules 🔌:

Python: setting up Python

Selenium: setting up Selenium
Install Selenium using pip

pip install selenium

Webdriver-manager: downloads and manages the driver for the headless browser.
Install Webdriver-manager using pip

pip install webdriver-manager

Basic setup 🗂:

Let's create a directory named web-scraper with a subfolder named scraper. The scraper folder will hold our helper functions.

We can use the following command to create the directories.

mkdir -p web-scraper/scraper

After creating the folders, we need to create a few files using the commands below. scrapeQuotes.py will contain the main scraper code.

cd web-scraper && touch scrapeQuotes.py

scraper.py, on the other hand, will contain the helper functions.

cd scraper && touch scraper.py __init__.py

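After this, web-scraper contains scrapeQuotes.py and the scraper folder, which in turn contains scraper.py and __init__.py. Open the project in a text editor of your choice.

Editing scraper.py in the scraper folder:

Now, let's write some helper functions for our bot.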

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    return webdriver.Chrome(
        service=ChromeService(ChromeDriverManager().install()),
        options=options,
    )

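def persist_data(data, filename):
    # Append each item of data to the file as its own line.
    # Returns True on success, or the exception if the write fails.
    try:
        # "with" closes the file even if writing raises, and avoids
        # touching an unbound variable when open() itself fails.
        with open(filename, "a", encoding="utf-8") as file:
            file.writelines([f"{line}\n" for line in data])
    except Exception as e:
        return e
    return True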
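
Since persist_data appends to a shared file, calls made from several threads at once could interleave their writes. A minimal sketch of one way to serialize the appends, assuming a module-level threading.Lock; the names persist_data_locked and _write_lock are hypothetical and not part of the original helper:

```python
import threading

# Hypothetical module-level lock shared by every thread that writes.
_write_lock = threading.Lock()


def persist_data_locked(data, filename):
    """Append each item of data to filename, one writer at a time."""
    lines = [f"{line}\n" for line in data]
    # Only one thread at a time may hold the lock and touch the file.
    with _write_lock:
        with open(filename, "a", encoding="utf-8") as file:
            file.writelines(lines)
    return True
```

A threading.Lock only coordinates threads inside one process, so this sketch does not protect writes made from separate processes.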

Editing scrapeQuotes.py 📝:

To scrape the quotes from this website, we need an identifier. We can see that every quote is wrapped in an element with the class quote.

To get all of these elements, we will use the By.CLASS_NAME locator.

    def scrape_quotes(self):
        quotes_list = []
        quotes = WebDriverWait(self.driver, 20).until(
            EC.visibility_of_all_elements_located((By.CLASS_NAME, "quote"))
        )
        for quote in quotes:
            quotes_list.append(self.clean_output(quote))
        self.close_driver()
        return quotes_list

The complete script, with a few more functions:

load_page: loads the web page at the URL given to the constructor

scrape_quotes: fetches all the quotes using the class name

close_driver: closes the Selenium session once we have the quotes

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from scraper.scraper import get_driver, persist_data

class ScrapeQuotes:
    def __init__(self, url):
        self.url = url
        self.driver = get_driver()

    def load_page(self):
        self.driver.get(self.url)

    def scrape_quotes(self):
        quotes_list = []
        quotes = WebDriverWait(self.driver, 20).until(
            EC.visibility_of_all_elements_located((By.CLASS_NAME, "quote"))
        )
        for quote in quotes:
            quotes_list.append(self.clean_output(quote))
        self.close_driver()
        return quotes_list

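    def clean_output(self, quote):
        # quote.text is expected to hold three lines: the quote itself,
        # "by <author> (about)", and "Tags: <tags>".
        raw_quote = quote.text.strip().split("\n")
        # Replace the curly quotation marks with plain double quotes.
        raw_quote[0] = raw_quote[0].replace("“", '"')
        raw_quote[0] = raw_quote[0].replace("”", '"')
        raw_quote[1] = raw_quote[1].replace("by", "")
        raw_quote[1] = raw_quote[1].replace("(about)", "")
        raw_quote[2] = raw_quote[2].replace("Tags: ", "")
        return ",".join(raw_quote)

    def close_driver(self):
        # quit() ends the whole session and stops the driver process;
        # close() would only close the current window.
        self.driver.quit()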

def main(tag):
    scrape = ScrapeQuotes("https://quotes.toscrape.com/tag/" + tag)
    scrape.load_page()
    quotes = scrape.scrape_quotes()
    persist_data(quotes, "quotes.csv")

if __name__ == "__main__":
    tags = ["love", "truth", "books", "life", "inspirational"]
    for tag in tags:
        main(tag)
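
The cleaning step can be tried without launching a browser. The sketch below mirrors the replacements in clean_output on a plain string. The function name clean_quote_text and the sample text are made up for illustration, and the strip() on the author line is an addition to the original logic.

```python
def clean_quote_text(text):
    """Turn the three-line text of one quote element into a comma-joined row."""
    raw_quote = text.strip().split("\n")
    # Line 1: swap curly quotation marks for plain double quotes.
    raw_quote[0] = raw_quote[0].replace("“", '"').replace("”", '"')
    # Line 2: drop the "by" prefix and the "(about)" link text,
    # then trim the whitespace left behind (not in the original).
    raw_quote[1] = raw_quote[1].replace("by", "").replace("(about)", "").strip()
    # Line 3: drop the "Tags: " label.
    raw_quote[2] = raw_quote[2].replace("Tags: ", "")
    return ",".join(raw_quote)


# Made-up sample in the three-line shape that clean_output expects.
sample = "“Hello, world.”\nby Jane Doe (about)\nTags: greeting test"
print(clean_quote_text(sample))  # "Hello, world.",Jane Doe,greeting test
```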

Execution time:

To see the difference between a normal web scraper and a concurrent one, we will run the scraper above with the time command.

time -p /usr/bin/python3 scrapeQuotes.py

Here, we can see that our script took 22.84 seconds.
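
The time command measures the whole process, including interpreter start-up. To time only the scraping work from inside Python, a sketch like the following could be used. Here timed is a hypothetical helper and fake_scrape is a stand-in for main(tag) that sleeps instead of driving a browser.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def fake_scrape(tag):
    # Stand-in for main(tag): sleep briefly instead of loading a page.
    time.sleep(0.05)
    return tag


if __name__ == "__main__":
    tags = ["love", "truth"]
    _, elapsed = timed(lambda: [fake_scrape(tag) for tag in tags])
    print(f"sequential run took {elapsed:.2f} seconds")
```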

Configuring multithreading:

We will rewrite our script to use the concurrent.futures library. From this library we will import a class called ThreadPoolExecutor, which creates a pool of threads for running the main function concurrently.

    # list to store the threads
    threadList = []

    # initialize the thread pool
    with ThreadPoolExecutor() as executor:
        for tag in tags:
            threadList.append(executor.submit(main, tag))

    # wait for all the threads to complete
    wait(threadList)

executor.submit takes the function to run, which is main in our case, followed by the arguments to pass to it. The wait function blocks the flow of execution until all the tasks are complete.
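
submit returns a Future for each task, and an exception raised inside the task stays hidden until that future is inspected. A minimal sketch of collecting the outcomes after wait; fake_main and run_all are hypothetical names, and fake_main stands in for main, failing on one tag to show the error path:

```python
from concurrent.futures import ThreadPoolExecutor, wait


def fake_main(tag):
    # Stand-in for main(tag); fails on one tag to show error handling.
    if tag == "broken":
        raise ValueError(f"could not scrape {tag}")
    return f"scraped {tag}"


def run_all(tags):
    """Run fake_main for every tag and split outcomes into results and errors."""
    with ThreadPoolExecutor() as executor:
        futures = {executor.submit(fake_main, tag): tag for tag in tags}
        wait(futures)  # blocks until every future has finished
    results, errors = {}, {}
    for future, tag in futures.items():
        if future.exception() is not None:
            errors[tag] = future.exception()
        else:
            results[tag] = future.result()
    return results, errors


if __name__ == "__main__":
    results, errors = run_all(["love", "truth", "broken"])
    print(results)
    print(errors)
```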

Editing scrapeQuotes.py to use multithreading 📝:

from concurrent.futures import ThreadPoolExecutor, wait

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from scraper.scraper import get_driver, persist_data

class ScrapeQuotes:
    def __init__(self, url):
        self.url = url
        self.driver = get_driver()

    def load_page(self):
        self.driver.get(self.url)

    def scrape_quotes(self):
        quotes_list = []
        quotes = WebDriverWait(self.driver, 20).until(
            EC.visibility_of_all_elements_located((By.CLASS_NAME, "quote"))
        )
        for quote in quotes:
            quotes_list.append(self.clean_output(quote))
        self.close_driver()
        return quotes_list

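    def clean_output(self, quote):
        # quote.text is expected to hold three lines: the quote itself,
        # "by <author> (about)", and "Tags: <tags>".
        raw_quote = quote.text.strip().split("\n")
        # Replace the curly quotation marks with plain double quotes.
        raw_quote[0] = raw_quote[0].replace("“", '"')
        raw_quote[0] = raw_quote[0].replace("”", '"')
        raw_quote[1] = raw_quote[1].replace("by", "")
        raw_quote[1] = raw_quote[1].replace("(about)", "")
        raw_quote[2] = raw_quote[2].replace("Tags: ", "")
        return ",".join(raw_quote)

    def close_driver(self):
        # quit() ends the whole session and stops the driver process;
        # close() would only close the current window.
        self.driver.quit()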

def main(tag):
    scrape = ScrapeQuotes("https://quotes.toscrape.com/tag/" + tag)
    scrape.load_page()
    quotes = scrape.scrape_quotes()
    persist_data(quotes, "quotes.csv")

if __name__ == "__main__":
    tags = ["love", "truth", "books", "life", "inspirational"]

    # list to store the threads
    threadList = []

    # initialize the thread pool
    with ThreadPoolExecutor() as executor:
        for tag in tags:
            threadList.append(executor.submit(main, tag))

    # wait for all the threads to complete
    wait(threadList)
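
Each task in this script starts its own headless Chrome, so the number of worker threads is also the number of browsers open at once. A sketch of capping that with the max_workers argument of ThreadPoolExecutor follows. The default of 2, the name run_bounded, and the sleeping stand-in for main are assumptions for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tags, max_workers=2):
    """Run a stand-in task per tag; return (results, peak concurrency)."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_main(tag):
        # Stand-in for main(tag) that records how many tasks overlap.
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.05)  # pretend to load and scrape a page
        with lock:
            state["active"] -= 1
        return tag

    # At most max_workers tasks (and so browsers) run at any moment.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(fake_main, tags))
    return results, state["peak"]


if __name__ == "__main__":
    tags = ["love", "truth", "books", "life", "inspirational"]
    results, peak = run_bounded(tags, max_workers=2)
    print(results, "peak concurrency:", peak)
```

executor.map returns results in the order of the input tags, which keeps the output predictable even though the tasks finish at different times.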

Execution time:

Let's measure the time this script takes to run with multithreading.

time -p /usr/bin/python3 scrapeQuotes.py

Here, we can see that our script took 9.27 seconds.

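Configuring multiprocessing:

Multiprocessing in Python is available through ProcessPoolExecutor, which has the same interface as ThreadPoolExecutor, so the multithreading script is easy to convert into a multiprocessing one. Multiprocessing is usually described as running tasks in parallel, each in its own process, whereas multithreading is described as running tasks concurrently within a single process.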
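
Because the two executors share an interface, the pool class itself can be passed in as a parameter. A sketch under that assumption; run_with and count_words are hypothetical names, and count_words stands in for main:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, wait


def count_words(text):
    # Stand-in for main(tag). It is a top-level function so that
    # ProcessPoolExecutor can pickle it and send it to a worker process.
    return len(text.split())


def run_with(executor_cls, items):
    """Submit count_words for every item using the given executor class."""
    with executor_cls() as executor:
        futures = [executor.submit(count_words, item) for item in items]
        wait(futures)
    return [future.result() for future in futures]


# Usage (keep the process-pool call under `if __name__ == "__main__":`):
#   run_with(ThreadPoolExecutor, ["to be or not to be", "hello world"])
#   run_with(ProcessPoolExecutor, ["to be or not to be", "hello world"])
```

The guard matters on platforms that start worker processes by importing the main module again; without it, each worker would try to create its own pool.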

    # list to store the processes
    processList = []

    # initialize the multiprocessing pool
    with ProcessPoolExecutor() as executor:
        for tag in tags:
            processList.append(executor.submit(main, tag))

    # wait for all the processes to complete
    wait(processList)

Editing scrapeQuotes.py to use multiprocessing 📝:

from concurrent.futures import ProcessPoolExecutor, wait

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from scraper.scraper import get_driver, persist_data

class ScrapeQuotes:
    def __init__(self, url):
        self.url = url
        self.driver = get_driver()

    def load_page(self):
        self.driver.get(self.url)

    def scrape_quotes(self):
        quotes_list = []
        quotes = WebDriverWait(self.driver, 20).until(
            EC.visibility_of_all_elements_located((By.CLASS_NAME, "quote"))
        )
        for quote in quotes:
            quotes_list.append(self.clean_output(quote))
        self.close_driver()
        return quotes_list

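    def clean_output(self, quote):
        # quote.text is expected to hold three lines: the quote itself,
        # "by <author> (about)", and "Tags: <tags>".
        raw_quote = quote.text.strip().split("\n")
        # Replace the curly quotation marks with plain double quotes.
        raw_quote[0] = raw_quote[0].replace("“", '"')
        raw_quote[0] = raw_quote[0].replace("”", '"')
        raw_quote[1] = raw_quote[1].replace("by", "")
        raw_quote[1] = raw_quote[1].replace("(about)", "")
        raw_quote[2] = raw_quote[2].replace("Tags: ", "")
        return ",".join(raw_quote)

    def close_driver(self):
        # quit() ends the whole session and stops the driver process;
        # close() would only close the current window.
        self.driver.quit()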

def main(tag):
    scrape = ScrapeQuotes("https://quotes.toscrape.com/tag/" + tag)
    scrape.load_page()
    quotes = scrape.scrape_quotes()
    persist_data(quotes, "quotes.csv")

if __name__ == "__main__":
    tags = ["love", "truth", "books", "life", "inspirational"]

    # list to store the processes
    processList = []

    # initialize the multiprocessing pool
    with ProcessPoolExecutor() as executor:
        for tag in tags:
            processList.append(executor.submit(main, tag))

    # wait for all the processes to complete
    wait(processList)

Execution time:

Let's measure the time this script takes to run with multiprocessing.

time -p /usr/bin/python3 scrapeQuotes.py

Here, we can see that our multiprocessing script took 8.23 seconds.
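
Putting the three measurements side by side, the speedup over the sequential run is the ratio of the timings. A small sketch using the numbers reported in this article; each comes from a single run, so the ratios are only indicative, and the helper name speedups is hypothetical:

```python
# Timings in seconds as reported in this article.
timings = {"sequential": 22.84, "multithreading": 9.27, "multiprocessing": 8.23}


def speedups(timings, baseline="sequential"):
    """Return how many times faster each run is than the baseline."""
    base = timings[baseline]
    return {name: round(base / secs, 2) for name, secs in timings.items()}


print(speedups(timings))
# {'sequential': 1.0, 'multithreading': 2.46, 'multiprocessing': 2.78}
```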

Reference

More on this topic (Building a Concurrent Web Scraper with Python and Selenium) can be found at https://dev.to/itsvinayak/building-a-concurrent-web-scraper-with-python-and-selenium-3e7b
