Scrape GitHub user details with Python.

One of the ideas that came to mind while I was learning web scraping was a GitHub scraper.
Here I will do my best to explain each step of the process.

Let's start.

First, we need to install a few packages.

beautifulsoup4

requests

html5lib

pip install requests
pip install html5lib
pip install beautifulsoup4

Then open https://github.com/yourusername

Open DevTools.

This is what shows up when I open my dashboard and DevTools.

When scraping the web, you need an element's id, class name, or XPath in order to locate it.
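To make these lookup styles concrete, here is a minimal sketch on a made-up HTML fragment (the markup, the id `profile`, and the class `bio` are invented for illustration). It uses the standard-library `html.parser` backend so it runs without extra parser packages. Beautiful Soup has no XPath support of its own, so a CSS selector is shown as the closest equivalent.

```python
from bs4 import BeautifulSoup

# Made-up fragment, used only to illustrate the lookups.
SAMPLE_HTML = """
<div id="profile">
  <p class="bio">Hello from a sample page</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

by_id = soup.find(id="profile")                # look up by id
by_class = soup.find(class_="bio")             # look up by class name
by_css = soup.select_one("#profile > p.bio")   # look up by CSS selector

print(by_id.name)
print(by_class.get_text())
print(by_css.get_text())
```

If XPath is required, a library such as lxml would be needed instead.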

We will scrape the name, username, number of repos, followers, following, and profile image.

import requests
from bs4 import BeautifulSoup
import html5lib

Import the modules.

r=requests.get("https://github.com/fredysomy")
soup=BeautifulSoup(r.content,'html5lib')

Send a request to the website.

Using beautifulsoup and html5lib, parse the HTML received in the response, available as r.content.
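A slightly more defensive version of the request step might look like the sketch below. The helper names `build_profile_url` and `fetch_profile_soup`, the 10-second timeout, and the `User-Agent` string are assumptions for illustration and are not part of the original script.

```python
import requests
from bs4 import BeautifulSoup


def build_profile_url(username):
    """Return the public profile URL for a GitHub username."""
    return f"https://github.com/{username}"


def fetch_profile_soup(username, timeout=10):
    """Download a profile page and parse it (hypothetical helper)."""
    response = requests.get(
        build_profile_url(username),
        headers={"User-Agent": "profile-scraper-example"},  # assumed value
        timeout=timeout,
    )
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html5lib")
```

Calling `fetch_profile_soup("fredysomy")` would then return the same kind of `soup` object used in the rest of this post.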

This is where the scraping starts.

namediv=soup.find("h1" ,class_="vcard-names pl-2 pl-md-0")
name=namediv.find_all('span')[0].getText()
u_name=namediv.find_all('span')[1].getText()

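Here we get the h1 element with the class names "vcard-names pl-2 pl-md-0".

The name and username are in span elements inside that h1.

We assigned it to the namediv variable.

We find all the span elements, pick them by index (0: name, 1: username), and get their text with the getText() function.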
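The lookup above assumes the h1 and both spans are always present. As a rough sketch of a more forgiving version, the hypothetical helper `extract_names` below returns `(None, None)` when the markup is not what we expect. The sample HTML and the names in it are invented, and `html.parser` is used only so the sketch runs without extra parser packages.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the structure described above.
SAMPLE_HTML = """
<h1 class="vcard-names pl-2 pl-md-0">
  <span>Example Name</span>
  <span>exampleuser</span>
</h1>
"""


def extract_names(soup):
    """Return (name, username), or (None, None) if the layout differs."""
    namediv = soup.find("h1", class_="vcard-names pl-2 pl-md-0")
    if namediv is None:
        return None, None
    spans = namediv.find_all("span")
    if len(spans) < 2:
        return None, None
    # get_text(strip=True) trims surrounding whitespace, newlines included.
    return spans[0].get_text(strip=True), spans[1].get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
name, u_name = extract_names(soup)
print(name, u_name)
```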

statstab=soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements=statstab.find(class_="mb-3")
followers=elements.find_all('a')[0].find('span').getText().strip(' ')
following=elements.find_all('a')[1].find('span').getText().strip(' ')
totstars=elements.find_all('a')[2].find('span').getText().strip(' ')

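The same thing happens here.

Followers, following, and stargazers are inside the element with the class names "flex-order-1 flex-md-order-none mt-2 mt-md-0", within its child element with the class mb-3.

We get that and store it in the elements variable.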

Getting the a tags inside elements returns a list.

Followers has index 0
Following has index 1
Stargazers has index 2
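The index mapping can be sketched on a small invented fragment. The HTML and the numbers below are made up to mirror the structure described here, and the dictionary keys are just labels chosen for this example.

```python
from bs4 import BeautifulSoup

# Invented markup: three links, each with the number inside a span.
SAMPLE_HTML = """
<div class="flex-order-1 flex-md-order-none mt-2 mt-md-0">
  <div class="mb-3">
    <a href="#"><span>12</span> followers</a>
    <a href="#"><span>34</span> following</a>
    <a href="#"><span>56</span></a>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
statstab = soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements = statstab.find(class_="mb-3")
links = elements.find_all("a")

# Pair each label with the link at the same position (0, 1, 2).
labels = ("followers", "following", "stars")
stats = {
    label: link.find("span").get_text(strip=True)
    for label, link in zip(labels, links)
}
print(stats)
```

Using `zip` means a page with fewer than three links yields a shorter dictionary instead of raising an IndexError.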

elements.find_all('a')[2].find('span').getText().strip(' ')

Here we take the item at index 2 from the list of a elements and call getText() on the span inside it. We use strip(' ') to remove unnecessary blank spaces from the result.
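One detail worth knowing: `strip(' ')` removes only space characters, so text that begins or ends with a newline is left untouched. Calling `strip()` with no argument removes all leading and trailing whitespace. The sample string below is invented to show the difference.

```python
# Invented example text, as it might come back from getText().
raw = "\n  42  \n"

only_spaces = raw.strip(' ')   # stops at the newlines, so nothing changes
all_whitespace = raw.strip()   # removes spaces, tabs, and newlines

print(repr(only_spaces))
print(repr(all_whitespace))
```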

u_img=soup.find(class_="avatar avatar-user width-full border bg-white")['src']

The code above finds the image tag and reads its src attribute.
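Indexing with `['src']` raises an error if the tag was not found or has no `src` attribute. A minimal sketch of a safer read, using an invented image tag and a placeholder URL, could look like this:

```python
from bs4 import BeautifulSoup

# Invented markup with a placeholder image URL.
SAMPLE_HTML = """
<img class="avatar avatar-user width-full border bg-white"
     src="https://example.com/avatar.png" alt="">
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
img = soup.find(class_="avatar avatar-user width-full border bg-white")

# Tag.get returns None for a missing attribute instead of raising KeyError.
u_img = img.get("src") if img is not None else None
print(u_img)
```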

repo_num=soup.find(class_="UnderlineNav-body").find('span',class_="Counter").getText()

Here we get the number of repos the user has.
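The value returned by getText() is a string. As a small sketch on invented markup, it can be converted to an integer when it consists only of digits. The count `27` is made up, and the fallback to `None` is an assumption about how one might handle unexpected text.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the navigation bar counter.
SAMPLE_HTML = """
<nav class="UnderlineNav-body">
  <a href="#">Repositories <span class="Counter">27</span></a>
</nav>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
counter = soup.find(class_="UnderlineNav-body").find("span", class_="Counter")

repo_text = counter.getText().strip()
# Convert only when the text is purely digits; otherwise keep None.
repo_num = int(repo_text) if repo_text.isdigit() else None
print(repo_num)
```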

That is all you need to scrape user details with Python.

Source code

import requests
from bs4 import BeautifulSoup
import html5lib  # parser backend, selected by name below

# Download and parse the profile page.
r = requests.get("https://github.com/fredysomy")
soup = BeautifulSoup(r.content, 'html5lib')

# Name and username: the first two spans inside the h1.
namediv = soup.find("h1", class_="vcard-names pl-2 pl-md-0")
name = namediv.find_all('span')[0].getText()
u_name = namediv.find_all('span')[1].getText()

# Followers, following, and stars: the three links inside mb-3.
statstab = soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements = statstab.find(class_="mb-3")
followers = elements.find_all('a')[0].find('span').getText().strip(' ')
following = elements.find_all('a')[1].find('span').getText().strip(' ')
totstars = elements.find_all('a')[2].find('span').getText().strip(' ')

# Profile image URL.
u_img = soup.find(class_="avatar avatar-user width-full border bg-white")['src']

# Number of repositories.
repo_num = soup.find(class_="UnderlineNav-body").find('span', class_="Counter").getText()

The idea is to write a program that navigates to the part of the page you want and selects the elements you need.
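The same "navigate, then select" idea can also be written with CSS selectors via `select_one`. This is an alternative sketch and not the approach used in the script above. The helper name `first_text`, the selectors, and the sample markup are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup combining two of the structures discussed in this post.
SAMPLE_HTML = """
<h1 class="vcard-names pl-2 pl-md-0">
  <span>Example Name</span>
  <span>exampleuser</span>
</h1>
<nav class="UnderlineNav-body"><span class="Counter">27</span></nav>
"""


def first_text(soup, selector):
    """Return the stripped text of the first match, or None (hypothetical helper)."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
profile = {
    "name": first_text(soup, "h1.vcard-names span:nth-of-type(1)"),
    "username": first_text(soup, "h1.vcard-names span:nth-of-type(2)"),
    "repos": first_text(soup, "nav.UnderlineNav-body span.Counter"),
    "missing": first_text(soup, "div.does-not-exist"),
}
print(profile)
```

A selector such as `h1.vcard-names` matches on a single class, so it keeps working if the other utility classes on the element change.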

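Refer to the Beautiful Soup documentation for the available methods.

I have also made a PyPI module for scraping GitHub. Check it out and give it a star if you like it.

If you have any doubts or need clarification, comment below.

Stay tuned for part 2, where we will scrape user repository details.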

Reference

Original article (Scrape GitHub user details with Python): https://dev.to/fredysomy/scrape-github-users-details-with-python-3ce5
