Scrape Google Scholar Publications from a Particular Website in Python

What will be scraped

How filtering works

Prerequisites

Full Code

Links

Outro

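What will be scraped

How filtering works

To filter results by a particular website, use the site:<website_name> operator, which restricts search results to papers published by websites containing <website_name> in their name.

This operator can be combined with the OR operator, i.e. site:cabdirect.org OR site:<other_website>. The search query would then look like this: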

search terms site:cabdirect.org OR site:<other_website>
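As a rough illustration of how such a query ends up in a URL, here is a minimal sketch that uses only the standard library. The second site, `example.org`, is a placeholder and not taken from the original post.

```python
from urllib.parse import urlencode

# Query in the format described above; example.org is a placeholder site
query = "biology site:cabdirect.org OR site:example.org"

# urlencode() percent-encodes the colon and turns spaces into "+"
url = "https://scholar.google.com/scholar?" + urlencode({"q": query, "hl": "en"})
print(url)
```

The requests library performs the same kind of encoding when a dict is passed through its `params` argument, so the query never has to be encoded by hand.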

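Prerequisites

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from matching tags and attributes.

If you haven't scraped with CSS selectors before, I have a dedicated blog post, how to use CSS selectors when web-scraping, that covers what they are, their pros and cons, why they matter from a web-scraping perspective, and the most common approaches to using them.

Separate virtual environment

In short, a virtual environment creates an independent set of installed libraries, optionally with a different Python version, that can coexist with others on the same system, which prevents library or Python version conflicts.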
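A quick way to see whether a script is running inside such an environment is to compare two interpreter paths. This is only a sketch, and `in_virtualenv` is a hypothetical helper name, not something from the original post.

```python
import sys


def in_virtualenv() -> bool:
    # Inside a venv/virtualenv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtualenv())
```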

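If you haven't worked with a virtual environment before, have a look at my dedicated blog post, Python virtual environments tutorial using Virtualenv and Poetry, to get familiar with it.

📌Note: this is not a strict requirement for this blog post.

Install libraries: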

pip install requests parsel

Reduce the chance of being blocked

There's a chance that a request will be blocked. Have a look at how to reduce the chance of being blocked while web-scraping, which covers eleven methods to bypass blocks from most websites.

Full Code

from parsel import Selector
import requests, json


def check_websites(website: list or str):
    # always return the value with the site: operator applied
    if isinstance(website, str):
        return f'site:{website}'                                 # site:cabdirect.org
    elif isinstance(website, list):
        return " OR ".join([f'site:{site}' for site in website]) # site:cabdirect.org OR site:cab.net


def scrape_website_publications(query: str, website: list or str):

    """
    Add a search query and site or multiple websites.

    Following will work:
    ["cabdirect.org", "lololo.com", "brabus.org"] -> list[str]
    ["cabdirect.org"]                             -> list[str]
    "cabdirect.org"                               -> str
    """

    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "q": f'{query.lower()} {check_websites(website=website)}',  # search query
        "hl": "en",                                                 # language of the search
        "gl": "us"                                                  # country of the search
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
    }

    html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
    selector = Selector(html.text)

    publications = []

    # iterate over every element from organic results from the first page and extract the data
    for result in selector.css(".gs_r.gs_scl"):
        title = result.css(".gs_rt").xpath("normalize-space()").get()
        link = result.css(".gs_rt a::attr(href)").get()
        result_id = result.attrib["data-cid"]
        snippet = result.css(".gs_rs::text").get()
        publication_info = result.css(".gs_a").xpath("normalize-space()").get()
        cite_by_link = f'https://scholar.google.com/scholar{result.css(".gs_or_btn.gs_nph+ a::attr(href)").get()}'
        all_versions_link = f'https://scholar.google.com/scholar{result.css("a~ a+ .gs_nph::attr(href)").get()}'
        related_articles_link = f'https://scholar.google.com/scholar{result.css("a:nth-child(4)::attr(href)").get()}'

        publications.append({
            "result_id": result_id,
            "title": title,
            "link": link,
            "snippet": snippet,
            "publication_info": publication_info,
            "cite_by_link": cite_by_link,
            "all_versions_link": all_versions_link,
            "related_articles_link": related_articles_link,
        })

    # print or return the results
    # return publications

    print(json.dumps(publications, indent=2, ensure_ascii=False))


scrape_website_publications(query="biology", website="cabdirect.org")

Import libraries and define the functions:

from parsel import Selector
import requests, json

Create a function that checks whether the website argument is a string or a list of strings and builds the site: filter from it:

# check if the passed website argument is a string or a list

def check_websites(website: list or str):
    if isinstance(website, str):
        return f'site:{website}'                                 # site:cabdirect.org
    elif isinstance(website, list):
        return " OR ".join([f'site:{site}' for site in website]) # site:cabdirect.org OR site:cab.com

Define a parsing function:

def scrape_website_publications(query: str, website: list or str):
    # further code

Code | Explanation
query: str / website: list or str | Type hints noting that query should be a string and website should be either a string or a list of strings. Python does not enforce them at runtime.
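One detail worth knowing: `list or str` is an ordinary boolean expression, so the annotation actually evaluates to `list`. A possible alternative, sketched below, spells the hint out with `typing.Union`. `build_site_filter` is a hypothetical name used for illustration only, and `example.org` is a placeholder site.

```python
from typing import List, Union

# `list or str` returns its first truthy operand, which is the `list` class
print((list or str) is list)  # True


def build_site_filter(website: Union[str, List[str]]) -> str:
    """Hypothetical helper: accept one site or several and return the site: filter."""
    if isinstance(website, str):
        website = [website]  # treat a single site as a one-element list
    return " OR ".join(f"site:{site}" for site in website)


print(build_site_filter("cabdirect.org"))
print(build_site_filter(["cabdirect.org", "example.org"]))
```

Wrapping a single string in a list keeps one code path for both input shapes, which is a design choice of this sketch rather than of the original code.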

Create the search query parameters and request headers, and pass them to the request:

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": f'{query.lower()} {check_websites(website=website)}',  # search query
    "hl": "en",                                                 # language of the search
    "gl": "us"                                                  # country of the search
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

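Code | Explanation
params | Query parameters passed to requests.get() as a dict.
headers | Request headers. The user-agent makes the request look like a "real" user visit so that websites are less likely to block it (this does not work in every case). It has to be passed explicitly because the default requests user-agent is python-requests, which tells the website that the request comes from a script.
timeout | Tells requests to stop waiting for a response after 30 seconds.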
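The default user-agent can be inspected without sending anything over the network. The sketch below prepares a request through a session and stops before sending it. It assumes requests is installed; the shortened user-agent string is a placeholder rather than a real browser value.

```python
import requests

# What requests would send if no User-Agent header were passed
print(requests.utils.default_user_agent())

headers = {"User-Agent": "Mozilla/5.0 (placeholder browser string)"}
params = {"q": "biology site:cabdirect.org", "hl": "en", "gl": "us"}

# Build the request without sending it
session = requests.Session()
request = requests.Request("GET", "https://scholar.google.com/scholar", params=params, headers=headers)
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])  # the custom header overrides the session default
print(prepared.url)                    # params encoded into the query string
```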

Create a temporary list, iterate over all organic results, and extract the data:

publications = []

# iterate over every element from organic results from the first page and extract the data
for result in selector.css(".gs_r.gs_scl"):
    title = result.css(".gs_rt").xpath("normalize-space()").get()
    link = result.css(".gs_rt a::attr(href)").get()
    result_id = result.attrib["data-cid"]
    snippet = result.css(".gs_rs::text").get()
    publication_info = result.css(".gs_a").xpath("normalize-space()").get()
    cite_by_link = f'https://scholar.google.com/scholar{result.css(".gs_or_btn.gs_nph+ a::attr(href)").get()}'
    all_versions_link = f'https://scholar.google.com/scholar{result.css("a~ a+ .gs_nph::attr(href)").get()}'
    related_articles_link = f'https://scholar.google.com/scholar{result.css("a:nth-child(4)::attr(href)").get()}'

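Code | Explanation
css(<selector>) | Extracts data from a given CSS selector. In the background, parsel translates every CSS query into an XPath query using cssselect.
xpath("normalize-space()") | Returns the full text of the element, including text inside nested tags, with whitespace collapsed. Selecting ::text alone returns only the element's own text nodes, which can leave the output incomplete.
::text / ::attr() | parsel pseudo-elements that extract the text or an attribute of an HTML node.
get() | Returns the actual data.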

Append the extracted data to the list as a dict, then return or print the results:

publications.append({
    "result_id": result_id,
    "title": title,
    "link": link,
    "snippet": snippet,
    "publication_info": publication_info,
    "cite_by_link": cite_by_link,
    "all_versions_link": all_versions_link,
    "related_articles_link": related_articles_link,
})

# print or return the results
# return publications

print(json.dumps(publications, indent=2, ensure_ascii=False))


# call the function
scrape_website_publications(query="biology", website="cabdirect.org")

Output:

[
  {
    "result_id": "6zRLFbcxtREJ",
    "title": "The biology of mycorrhiza.",
    "link": "https://www.cabdirect.org/cabdirect/abstract/19690600367",
    "snippet": "In the second, revised and extended, edition of this work [cf. FA 20 No. 4264], two new ",
    "publication_info": "JL Harley - The biology of mycorrhiza., 1969 - cabdirect.org",
    "cite_by_link": "https://scholar.google.com/scholar/scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en",
    "all_versions_link": "https://scholar.google.com/scholar/scholar?cluster=1275980731835430123&hl=en&as_sdt=0,5",
    "related_articles_link": "https://scholar.google.com/scholar/scholar?q=related:6zRLFbcxtREJ:scholar.google.com/&scioq=biology+site:cabdirect.org&hl=en&as_sdt=0,5"
  }, ... other results
]
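Note that the link fields in this output contain /scholar/scholar, because the extracted href already starts with /scholar and is appended to a base URL that ends with it. One possible way to build these links is sketched below with `urllib.parse.urljoin`. `absolute_link` is a hypothetical helper, not part of the original code, and it also returns None instead of producing a string ending in "None" when a result has no such link.

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://scholar.google.com"


def absolute_link(href: Optional[str]) -> Optional[str]:
    """Hypothetical helper: join a relative href with the base URL, or pass None through."""
    return urljoin(BASE_URL, href) if href else None


print(absolute_link("/scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en"))
print(absolute_link(None))
```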

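Alternatively, you can achieve the same thing with the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't need to create the parser from scratch, maintain it, figure out how to scale it, work out how to bypass blocks from Google, or find out which proxy and CAPTCHA providers are good.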

# pip install google-search-results

import os, json
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl


def serpapi_scrape(query: str, website: str):
    params = {
        # https://docs.python.org/3/library/os.html#os.getenv
        "api_key": os.getenv("API_KEY"), # your serpapi API key
        "engine": "google_scholar",      # search engine
        "q": f"{query} site:{website}",  # search query
        "hl": "en",                      # language
        # "as_ylo": "2017",              # from 2017
        # "as_yhi": "2021",              # to 2021
        "start": "0"                     # first page
    }

    search = GoogleSearch(params)

    publications = []

    publications_is_present = True
    while publications_is_present:
        results = search.get_dict()

        print(f"Currently extracting page #{results.get('serpapi_pagination', {}).get('current')}..")

        for result in results["organic_results"]:
            position = result["position"]
            title = result["title"]
            publication_info_summary = result["publication_info"]["summary"]
            result_id = result["result_id"]
            link = result.get("link")
            result_type = result.get("type")
            snippet = result.get("snippet")

            publications.append({
                "page_number": results.get("serpapi_pagination", {}).get("current"),
                "position": position + 1,
                "result_type": result_type,
                "title": title,
                "link": link,
                "result_id": result_id,
                "publication_info_summary": publication_info_summary,
                "snippet": snippet,
                })


        # check for the next page once per page, after all results on it are processed
        if "next" in results.get("serpapi_pagination", {}):
            # split the next-page URL into query parameters and pass them to GoogleSearch()
            search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
        else:
            publications_is_present = False

    print(json.dumps(publications, indent=2, ensure_ascii=False))


# call the function
serpapi_scrape(query="biology", website="cabdirect.org")
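The pagination step relies on turning the next-page URL into a dict of query parameters. The sketch below shows that conversion on its own. The `next_url` value is a made-up example of what such a link might look like, not a real API response.

```python
from urllib.parse import parse_qsl, urlsplit

# Made-up next-page URL, for illustration only
next_url = "https://serpapi.com/search.json?engine=google_scholar&hl=en&q=biology+site%3Acabdirect.org&start=10"

# urlsplit() isolates the query string; parse_qsl() decodes it into (key, value) pairs
next_params = dict(parse_qsl(urlsplit(next_url).query))
print(next_params)
```

Updating the existing search parameters with this dict is what moves the `start` offset forward on each pass through the loop.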

Links

Code in the online IDE

Google Scholar Organic Results API

Add a Feature Request 💫 or a Bug 🐞

Reference

For more on this topic (Scrape Google Scholar publications from a particular website in Python), see the original article: https://dev.to/dmitryzub/scrape-google-scholar-publications-from-a-particular-website-1lh9
