Scrape ResearchGate 모든 저자, Python 연구원

30665 단어 tutorial programming webscraping python

스크랩 할 것

전제 조건

CSS 선택자를 사용한 기본 지식 스크래핑

CSS 선택기로 스크랩하지 않은 경우 그것이 무엇인지, 장단점, 웹 스크래핑 관점에서 왜 중요한지, 그리고 가장 일반적인 접근 방식을 보여주는 전용 블로그 게시물how to use CSS selectors when web-scraping이 있습니다. 웹 스크래핑 시 CSS 선택기를 사용합니다.

별도의 가상 환경

이전에 가상 환경으로 작업한 적이 없다면 내 전용 블로그 게시물Python virtual environments tutorial using Virtualenv and Poetry을 살펴보고 익숙해지십시오.

차단될 확률 감소

요청이 차단될 가능성이 있습니다. how to reduce the chance of being blocked while web-scraping을 살펴보십시오. 대부분의 웹사이트에서 차단을 우회하는 11가지 방법이 있습니다.

📌참고: 프록시 없이 이 코드 스니펫을 사용하는 경우 짧은 시간 내에 요청이 차단됩니다. 프록시로 많은 양의 데이터를 스크랩하는 것이 좋습니다.

라이브러리 설치:

pip install parsel playwright

전체 코드

from parsel import Selector
from playwright.sync_api import sync_playwright
import json


def scrape_researchgate_profile(query: str):
    with sync_playwright() as p:

        browser = p.chromium.launch(headless=True, slow_mo=50)
        page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")

        authors = []
        page_num = 1

        while True:
            page.goto(f"https://www.researchgate.net/search/researcher?q={query}&page={page_num}")
            selector = Selector(text=page.content())

            for author in selector.css(".nova-legacy-c-card__body--spacing-inherit"):
                name = author.css(".nova-legacy-v-person-item__title a::text").get()
                thumbnail = author.css(".nova-legacy-v-person-item__image img::attr(src)").get()
                profile_page = f'https://www.researchgate.net/{author.css("a.nova-legacy-c-button::attr(href)").get()}'
                institution = author.css(".nova-legacy-v-person-item__stack-item:nth-child(3) span::text").get()
                department = author.css(".nova-legacy-v-person-item__stack-item:nth-child(4) span").xpath("normalize-space()").get()
                skills = author.css(".nova-legacy-v-person-item__stack-item:nth-child(5) span").xpath("normalize-space()").getall()
                last_publication = author.css(".nova-legacy-v-person-item__info-section-list-item .nova-legacy-e-link--theme-bare::text").get()
                last_publication_link = f'https://www.researchgate.net{author.css(".nova-legacy-v-person-item__info-section-list-item .nova-legacy-e-link--theme-bare::attr(href)").get()}'

                authors.append({
                    "name": name,
                    "profile_page": profile_page,
                    "institution": institution,
                    "department": department,
                    "thumbnail": thumbnail,
                    "last_publication": {
                        "title": last_publication,
                        "link": last_publication_link
                    },
                    "skills": skills,
                })

            print(f"Extracting Page: {page_num}")

            # checks if next page arrow key is greyed out `attr(rel)` (inactive) -> breaks out of the loop
            if selector.css(".nova-legacy-c-button-group__item:nth-child(9) a::attr(rel)").get():
                break
            else:
                # paginate to the next page
                page_num += 1


        print(json.dumps(authors, indent=2, ensure_ascii=False))

        browser.close()


scrape_researchgate_profile(query="coffee")

코드 설명

라이브러리 가져오기:

from parsel import Selector
from playwright.sync_api import sync_playwright
import json

암호
설명

parsel
HTML/XML 문서를 구문 분석합니다. XPath를 지원합니다.

playwright
브라우저 인스턴스로 페이지를 렌더링합니다.
jsonPython 사전을 JSON 문자열로 변환합니다.

함수를 정의하고 context manager::으로 playwright를 엽니다.

def scrape_researchgate_profile(query: str):
    with sync_playwright() as p:
        # ...

암호
설명

query: str파이썬에게 query 가 str 가 되어야 한다고 알려줍니다.

브라우저 인스턴스를 실행하고 new_page를 통과하여 엽니다user-agent.

browser = p.chromium.launch(headless=True, slow_mo=50)
page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")

암호
설명

p.chromium.launch()
Chromium 브라우저 인스턴스를 시작합니다.

headless
기본 값이더라도 헤드리스 모드에서 실행하도록 명시적으로 지시합니다playwright.

slow_mo
실행 속도를 늦추라고 지시합니다playwright.

browser.new_page()
새 페이지를 엽니다. user_agent는 실제 사용자가 브라우저에서 요청을 수행하는 데 사용됩니다. 사용하지 않으면 기본적으로 playwright 값인 None 가 됩니다. Check what's your user-agent .

임시 목록을 추가하고, while 루프를 설정하고, 새 URL을 엽니다.

authors = []

while True:
    page.goto(f"https://www.researchgate.net/search/researcher?q={query}&page={page_num}")
    selector = Selector(text=page.content())
    # ...

암호
설명

goto()전달된 쿼리 및 페이지 매개변수를 사용하여 특정 URL을 요청합니다.
Selector()반환된 HTML 데이터를 page.content()로 전달하고 처리합니다.

각 페이지에서 작성자 결과를 반복하고 데이터를 추출하고append 임시list로:

for author in selector.css(".nova-legacy-c-card__body--spacing-inherit"):
    name = author.css(".nova-legacy-v-person-item__title a::text").get()
    thumbnail = author.css(".nova-legacy-v-person-item__image img::attr(src)").get()
    profile_page = f'https://www.researchgate.net/{author.css("a.nova-legacy-c-button::attr(href)").get()}'
    institution = author.css(".nova-legacy-v-person-item__stack-item:nth-child(3) span::text").get()
    department = author.css(".nova-legacy-v-person-item__stack-item:nth-child(4) span").xpath("normalize-space()").get()
    skills = author.css(".nova-legacy-v-person-item__stack-item:nth-child(5) span").xpath("normalize-space()").getall()
    last_publication = author.css(".nova-legacy-v-person-item__info-section-list-item .nova-legacy-e-link--theme-bare::text").get()
    last_publication_link = f'https://www.researchgate.net{author.css(".nova-legacy-v-person-item__info-section-list-item .nova-legacy-e-link--theme-bare::attr(href)").get()}'

    authors.append({
        "name": name,
        "profile_page": profile_page,
        "institution": institution,
        "department": department,
        "thumbnail": thumbnail,
        "last_publication": {
            "title": last_publication,
            "link": last_publication_link
        },
        "skills": skills,
    })

암호
설명

css()
to parse data from the passed CSS selector(s) . 후드 아래의 모든 CSS query traslates to XPath using csselect package.
::text/::attr(attribute)
to extract textual or attribute data 노드에서.
get()/getall()
to get actual data from a matched node , 또는 get a list of matched data from nodes .
xpath("normalize-space()")빈 텍스트 노드도 구문 분석합니다. 기본적으로 빈 텍스트 노드는 XPath에서 건너뜁니다.

다음 페이지가 있는지 확인하고 페이지를 매깁니다.

# checks if the next page arrow key is greyed out `attr(rel)` (inactive) -> breaks out of the loop
if selector.css(".nova-legacy-c-button-group__item:nth-child(9) a::attr(rel)").get():
    break
else:
    page_num += 1

추출된 데이터 및 close 브라우저 인스턴스 인쇄:

print(json.dumps(authors, indent=2, ensure_ascii=False))

browser.close()

# call the function
scrape_researchgate_profile(query="coffee")

JSON 출력의 일부:

[
  {
    "name": "Marina Ramón-Gonçalves", # first profile
    "profile_page": "https://www.researchgate.net/profile/Marina-Ramon-Goncalves?_sg=VbWMth8Ia1hDG-6tFnNUWm4c8t6xlBHy2Ac-2PdZeBK6CS3nym5PM5OeoSzha90f2B6hpuoyBMwm24U",
    "institution": "Centro Nacional de Investigaciones Metalúrgicas (CENIM)",
    "department": "Reciclado de materiales",
    "thumbnail": "https://i1.rgstatic.net/ii/profile.image/845010970898442-1578477723875_Q64/Marina-Ramon-Goncalves.jpg",
    "last_publication": {
      "title": "Extraction of polyphenols and synthesis of new activated carbon from spent coffe...",
      "link": "https://www.researchgate.netpublication/337577823_Extraction_of_polyphenols_and_synthesis_of_new_activated_carbon_from_spent_coffee_grounds?_sg=2y4OuZz32W46AWcUGmwYbW05QFj3zkS1QR_MVxvKwqJG-abFPLF6cIuaJAO_Mn5juJZWkfEgdBwnA5Q"
    },
    "skills": [
      "Polyphenols",
      "Coffee",
      "Extraction",
      "Antioxidant Activity",
      "Chromatography"
    ]
  }, ... other profiles
  {
    "name": "Kingsten Okka", # last profile
    "profile_page": "https://www.researchgate.net/profile/Kingsten-Okka?_sg=l1w_rzLrAUCRFtoo3Nh2-ZDAaG2t0NX5IHiSV5TF2eOsDdlP8oSuHnGglAm5tU6OFME9wgfyAd-Rnhs",
    "institution": "University of Southern Queensland ",
    "department": "School of Agricultural, Computational and Environmental Sciences",
    "thumbnail": "https://i1.rgstatic.net/ii/profile.image/584138105032704-1516280785714_Q64/Kingsten-Okka.jpg",
    "last_publication": {
      "title": null,
      "link": "https://www.researchgate.netNone"
    },
    "skills": [
      "Agricultural Entomology",
      "Coffee"
    ]
  }
]

연결

GitHub Gist

가입 |

Feature Request 💫 또는 Bug 🐞 추가

Reference

이 문제에 관하여(Scrape ResearchGate 모든 저자, Python 연구원), 우리는 이곳에서 더 많은 자료를 발견하고 링크를 클릭하여 보았다 https://dev.to/serpapi/scrape-researchgate-all-authors-researchers-in-python-jo0

텍스트를 자유롭게 공유하거나 복사할 수 있습니다.하지만 이 문서의 URL은 참조 URL로 남겨 두십시오.

우수한 개발자 콘텐츠 발견에 전념 (Collection and Share based on the CC Protocol.)

PostgreSQL: WITH 절을 사용하여 문제를 분해하는 예

Chakra-ui를 사용한 Next.js의 반응형 레이아웃

좋은 웹페이지 즐겨찾기

개발자 우수 사이트 수집

개발자가 알아야 할 필수 사이트 100선 추천 우리는 당신을 위해 100개의 자주 사용하는 개발자 학습 사이트를 정리했습니다