Scrape ResearchGate Profile Page in Python


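What will be scraped

Prerequisites

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from the matching tags and attributes.

If you haven't scraped with CSS selectors before, there is a dedicated blog post, how to use CSS selectors when web-scraping, that covers what they are, their pros and cons, why they matter from a web-scraping perspective, and the most common ways of using them.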

Reduce the chance of being blocked

There is a chance that a request will be blocked. Have a look at how to reduce the chance of being blocked while web-scraping, which describes eleven methods for bypassing blocks on most websites.

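Install libraries (Playwright also needs a browser binary, which the second command downloads):

pip install parsel playwright
playwright install chromium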

Full code

from parsel import Selector
from playwright.sync_api import sync_playwright
import json, re 


def scrape_researchgate_profile(profile: str):
    with sync_playwright() as p:

        profile_data = {
            "basic_info": {},
            "about": {},
            "co_authors": [],
            "publications": [],
        }

        browser = p.chromium.launch(headless=True, slow_mo=50)
        page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")
        page.goto(f"https://www.researchgate.net/profile/{profile}")
        selector = Selector(text=page.content())

        profile_data["basic_info"]["name"] = selector.css(".nova-legacy-e-text.nova-legacy-e-text--size-xxl::text").get()
        profile_data["basic_info"]["institution"] = selector.css(".nova-legacy-v-institution-item__stack-item a::text").get()
        profile_data["basic_info"]["department"] = selector.css(".nova-legacy-e-list__item.nova-legacy-v-institution-item__meta-data-item:nth-child(1)").xpath("normalize-space()").get()
        profile_data["basic_info"]["current_position"] = selector.css(".nova-legacy-e-list__item.nova-legacy-v-institution-item__info-section-list-item").xpath("normalize-space()").get()
        profile_data["basic_info"]["lab"] = selector.css(".nova-legacy-o-stack__item .nova-legacy-e-link--theme-bare b::text").get()

        profile_data["about"]["number_of_publications"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(1)").xpath("normalize-space()").get()).group()
        profile_data["about"]["reads"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(2)").xpath("normalize-space()").get()).group()
        profile_data["about"]["citations"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(3)").xpath("normalize-space()").get()).group()
        profile_data["about"]["introduction"] = selector.css(".nova-legacy-o-stack__item .Linkify").xpath("normalize-space()").get()
        profile_data["about"]["skills"] = selector.css(".nova-legacy-l-flex__item .nova-legacy-e-badge ::text").getall()

        for co_author in selector.css(".nova-legacy-c-card--spacing-xl .nova-legacy-c-card__body--spacing-inherit .nova-legacy-v-person-list-item"):
            profile_data["co_authors"].append({
                "name": co_author.css(".nova-legacy-v-person-list-item__align-content .nova-legacy-e-link::text").get(),
                "link": co_author.css(".nova-legacy-l-flex__item a::attr(href)").get(),
                "avatar": co_author.css(".nova-legacy-l-flex__item .lite-page-avatar img::attr(data-src)").get(),
                "current_institution": co_author.css(".nova-legacy-v-person-list-item__align-content li").xpath("normalize-space()").get()
            })

        for publication in selector.css("#publications+ .nova-legacy-c-card--elevation-1-above .nova-legacy-o-stack__item"):
            profile_data["publications"].append({
                "title": publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::text").get(),
                "date_published": publication.css(".nova-legacy-v-publication-item__meta-data-item span::text").get(),
                "authors": publication.css(".nova-legacy-v-person-inline-item__fullname::text").getall(),
                "publication_type": publication.css(".nova-legacy-e-badge--theme-solid::text").get(),
                "description": publication.css(".nova-legacy-v-publication-item__description::text").get(),
                "publication_link": publication.css(".nova-legacy-c-button-group__item .nova-legacy-c-button::attr(href)").get(),
            })


        print(json.dumps(profile_data, indent=2, ensure_ascii=False))

        browser.close()


scrape_researchgate_profile(profile="Agnis-Stibe")

Code explanation

Import libraries:

from parsel import Selector
from playwright.sync_api import sync_playwright
import re, json

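| Code | Explanation |
|------|-------------|
| parsel | Parses HTML/XML documents. Supports XPath. |
| playwright | Renders the page with a browser instance. |
| re | Matches parts of the data with a regular expression. |
| json | Converts a Python dictionary to a JSON string. |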

Define a function:

def scrape_researchgate_profile(profile: str):
    # ...

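| Code | Explanation |
|------|-------------|
| profile: str | A type hint saying that profile is expected to be a str. Python does not enforce it at runtime. |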
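The following small sketch shows that the annotation is documentation rather than a runtime check. The helper name build_profile_url is hypothetical and only mirrors the URL pattern used in the scraper.

```python
def build_profile_url(profile: str) -> str:
    # Hypothetical helper: builds the same URL the scraper navigates to.
    return f"https://www.researchgate.net/profile/{profile}"


# A str argument, as the annotation suggests.
print(build_profile_url("Agnis-Stibe"))

# An int also works: Python does not validate annotations at runtime.
print(build_profile_url(12345))

# The hints are only stored as metadata on the function.
print(build_profile_url.__annotations__)
```

A static checker such as mypy would flag the second call, but the interpreter itself will not.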

Open playwright with a context manager:

with sync_playwright() as p:
    # ...
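To show what the context manager buys you, here is a sketch that uses a stand-in resource instead of Playwright. The fake_browser_session function is hypothetical, but the with statement behaves the same way for any context manager: cleanup runs even when the body raises.

```python
from contextlib import contextmanager

events = []


@contextmanager
def fake_browser_session():
    # Hypothetical stand-in for a resource that must be released.
    events.append("start")
    try:
        yield "session"
    finally:
        # Runs whether or not the with-body raised an exception.
        events.append("stop")


try:
    with fake_browser_session() as session:
        events.append(f"using {session}")
        raise RuntimeError("scraping failed")
except RuntimeError:
    events.append("error handled")

print(events)
```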

Define the structure of the extracted data:

profile_data = {
    "basic_info": {},
    "about": {},
    "co_authors": [],
    "publications": [],
}

Launch a browser instance, open the page with goto(), and pass the response to the HTML/XML parser:

browser = p.chromium.launch(headless=True, slow_mo=50)
page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")
page.goto(f"https://www.researchgate.net/profile/{profile}")
selector = Selector(text=page.content())

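| Code | Explanation |
|------|-------------|
| p.chromium.launch() | Launches a Chromium browser instance. |
| headless | Explicitly tells playwright to run in headless mode, even though that is the default. |
| slow_mo | Tells playwright to slow down execution by the given number of milliseconds. |
| browser.new_page() | Opens a new page. user_agent makes the request look like it comes from a real user's browser. If it is not passed, it defaults to None and playwright falls back to the browser's own user-agent. Check what's your user-agent. |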

Update the basic_info dictionary key, create new keys, and assign the extracted data:

profile_data["basic_info"]["name"] = selector.css(".nova-legacy-e-text.nova-legacy-e-text--size-xxl::text").get()
profile_data["basic_info"]["institution"] = selector.css(".nova-legacy-v-institution-item__stack-item a::text").get()
profile_data["basic_info"]["department"] = selector.css(".nova-legacy-e-list__item.nova-legacy-v-institution-item__meta-data-item:nth-child(1)").xpath("normalize-space()").get()
profile_data["basic_info"]["current_position"] = selector.css(".nova-legacy-e-list__item.nova-legacy-v-institution-item__info-section-list-item").xpath("normalize-space()").get()
profile_data["basic_info"]["lab"] = selector.css(".nova-legacy-o-stack__item .nova-legacy-e-link--theme-bare b::text").get()

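| Code | Explanation |
|------|-------------|
| profile_data["basic_info"]["name"] | Accesses the previously created basic_info key, then creates a new ["name"] key and assigns the extracted data. |
| css() | Parses data from the passed CSS selector(s). Under the hood, every CSS query is translated to XPath using the cssselect package. |
| ::text | Extracts textual data from the node. |
| get() | Gets the actual data from a matched node. |
| xpath("normalize-space()") | Returns the full text of the node, including text inside child elements, with leading and trailing whitespace stripped and runs of whitespace collapsed into a single space. ::text alone returns only the node's direct text nodes. |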

Update the about dictionary key, create new keys, and assign the extracted data:

profile_data["about"]["number_of_publications"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(1)").xpath("normalize-space()").get()).group()
profile_data["about"]["reads"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(2)").xpath("normalize-space()").get()).group()
profile_data["about"]["citations"] = re.search(r"\d+", selector.css(".nova-legacy-c-card__body .nova-legacy-o-grid__column:nth-child(3)").xpath("normalize-space()").get()).group()
profile_data["about"]["introduction"] = selector.css(".nova-legacy-o-stack__item .Linkify").xpath("normalize-space()").get()
profile_data["about"]["skills"] = selector.css(".nova-legacy-l-flex__item .nova-legacy-e-badge ::text").getall()

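| Code | Explanation |
|------|-------------|
| profile_data["about"]["number_of_publications"] | Accesses the previously created about key, then creates a new ["number_of_publications"] key and assigns the extracted data. |
| re.search(r"\d+", <returned_data_from_parsel>).group() | Extracts numeric data from the returned string with re.search() and the regular expression \d+, which matches the first run of digits. group() returns the substring matched by the regular expression. |
| css() | Parses data from the passed CSS selector(s). Under the hood, every CSS query is translated to XPath using the cssselect package. |
| ::text | Extracts textual data from the node. |
| get()/getall() | Gets the actual data from a matched node, or a list of matched data from nodes. |
| xpath("normalize-space()") | Returns the full text of the node, including text inside child elements, with leading and trailing whitespace stripped and runs of whitespace collapsed into a single space. ::text alone returns only the node's direct text nodes. |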
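Two edge cases are worth keeping in mind with the re.search(r"\d+", ...).group() pattern: it stops at a thousands separator, and it raises AttributeError when the selector returned None or the text has no digits. The sketch below shows one possible way to handle both. The extract_number helper and the sample strings are hypothetical and not part of the original code.

```python
import re
from typing import Optional


def extract_number(text: Optional[str]) -> Optional[int]:
    """Return the first number in text as an int, or None if there is none."""
    if not text:
        return None
    # Match a digit followed by any mix of digits and commas, e.g. "40,321".
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


# The plain \d+ pattern stops at the comma.
truncated = re.search(r"\d+", "40,321 Reads").group()

print(truncated)
print(extract_number("40,321 Reads"))
print(extract_number("71 Publications"))
print(extract_number(None))
```

Returning None instead of raising lets the rest of the profile still be collected when a single counter is missing from the page.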

Iterate over the co-authors, extract each individual co-author, and append it to the co_authors list:

for co_author in selector.css(".nova-legacy-c-card--spacing-xl .nova-legacy-c-card__body--spacing-inherit .nova-legacy-v-person-list-item"):
    profile_data["co_authors"].append({
        "name": co_author.css(".nova-legacy-v-person-list-item__align-content .nova-legacy-e-link::text").get(),
        "link": co_author.css(".nova-legacy-l-flex__item a::attr(href)").get(),
        "avatar": co_author.css(".nova-legacy-l-flex__item .lite-page-avatar img::attr(data-src)").get(),
        "current_institution": co_author.css(".nova-legacy-v-person-list-item__align-content li").xpath("normalize-space()").get()
    })

| Code | Explanation |
|------|-------------|
| ::attr(attribute) | Extracts attribute data from the node. |
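Attribute values come back exactly as written in the HTML, so an href can be relative, as with the co-author link "profile/Mina-Khan-2" in the sample output. One way to turn such values into full URLs is urljoin from the standard library. The absolute_link helper and BASE_URL constant below are hypothetical names for this sketch.

```python
from typing import Optional
from urllib.parse import urljoin

# Assumed base for resolving relative links from a profile page.
BASE_URL = "https://www.researchgate.net/"


def absolute_link(href: Optional[str]) -> Optional[str]:
    # Keep None as None so missing attributes stay visible in the output.
    if href is None:
        return None
    # urljoin leaves already-absolute URLs unchanged.
    return urljoin(BASE_URL, href)


print(absolute_link("profile/Mina-Khan-2"))
print(absolute_link("https://i1.rgstatic.net/example.jpg"))
print(absolute_link(None))
```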

Next, iterate over all publications, extract each individual publication, and append it to the publications list:

for publication in selector.css("#publications+ .nova-legacy-c-card--elevation-1-above .nova-legacy-o-stack__item"):
    profile_data["publications"].append({
        "title": publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::text").get(),
        "date_published": publication.css(".nova-legacy-v-publication-item__meta-data-item span::text").get(),
        "authors": publication.css(".nova-legacy-v-person-inline-item__fullname::text").getall(),
        "publication_type": publication.css(".nova-legacy-e-badge--theme-solid::text").get(),
        "description": publication.css(".nova-legacy-v-publication-item__description::text").get(),
        "publication_link": publication.css(".nova-legacy-c-button-group__item .nova-legacy-c-button::attr(href)").get(),
    })

Print the extracted data and close the browser instance:

print(json.dumps(profile_data, indent=2, ensure_ascii=False))

browser.close()


# call function. "profiles" could be a list of authors.
# author name should be with a "-", otherwise ResearchGate doesn't recognize it.
scrape_researchgate_profile(profile="Agnis-Stibe")
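The ensure_ascii=False argument in the print call matters for names with non-ASCII characters. The sketch below compares both settings on a made-up dictionary; the sample name is an assumption for illustration.

```python
import json

# Hypothetical sample data containing a non-ASCII character.
sample = {"name": "Jānis"}

# Default behaviour: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(sample)

# ensure_ascii=False keeps the characters readable in the output.
readable = json.dumps(sample, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same dictionary with json.loads(), so the choice only affects how readable the printed output is.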

Part of the JSON output:

{
  "basic_info": {
    "name": "Agnis Stibe",
    "institution": "EM Normandie Business School",
    "department": "Supply Chain Management & Decision Sciences",
    "current_position": "Artificial Inteligence Program Director",
    "lab": "Riga Technical University"
  },
  "about": {
    "number_of_publications": "71",
    "reads": "40",
    "citations": "572",
    "introduction": "4x TEDx speaker, MIT alum, YouTube creator. Globally recognized corporate consultant and scientific advisor at AgnisStibe.com. Provides a science-driven STIBE method and practical tools for hyper-performance. Academic Director on Artificial Intelligence and Professor of Transformation at EM Normandie Business School. Paris Lead of Silicon Valley founded Transformative Technology community. At the renowned Massachusetts Institute of Technology, he established research on Persuasive Cities.",
    "skills": [
      "Social Influence",
      "Behavior Change",
      "Persuasive Design",
      "Motivational Psychology",
      "Artificial Intelligence",
      "Change Management",
      "Business Transformation"
    ]
  },
  "co_authors": [
    {
      "name": "Mina Khan",
      "link": "profile/Mina-Khan-2",
      "avatar": "https://i1.rgstatic.net/ii/profile.image/387771463159814-1469463329918_Q64/Mina-Khan-2.jpg",
      "current_institution": "Massachusetts Institute of Technology"
    }, ... other co-authors
  ],
  "publications": [
    {
      "title": "Change Masters: Using the Transformation Gene to Empower Hyper-Performance at Work",
      "date_published": "May 2020",
      "authors": [
        "Agnis Stibe"
      ],
      "publication_type": "Article",
      "description": "Achieving hyper-performance is an essential aim not only for organizations and societies but also for individuals. Digital transformation is reshaping the workplace so fast that people start falling behind, with their poor attitudes remaining the ultimate obstacle. The alignment of human-machine co-evolution is the only sustainable strategy for the...",
      "publication_link": "https://www.researchgate.net/publication/342716663_Change_Masters_Using_the_Transformation_Gene_to_Empower_Hyper-Performance_at_Work"
    }, ... other publications
  ]
}


Reference

The original post, Scrape ResearchGate Profile Page in Python, is available at https://dev.to/dmitryzub/scrape-researchgate-profile-page-in-python-50d2
