#Day24 - How to Scrape Tables and Other Use Cases of Beautiful Soup, Part 2

Yesterday's article covered getting started with Beautiful Soup. We discussed the following functions:

prettify()

find()

find_all()

select()

Today we will scrape data from a table on the worldometer website.

The table has the ID "main_table_countries_today". We will use this id to get the table element.
Let's look at the structure of the table.

<table>
     <thead>
     </thead>
     <tbody>
          <tr>
               <td> </td>
               <td> </td>
               <td> </td>
               .
               .
               .
               .
          </tr>
     </tbody>
</table>

The "thead" contains the header row ("Country,Other", "Total Cases", "New Cases", ...).
If this looks confusing, let's start scraping the elements and look at the output.
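
Before hitting the live site, the structure above can be tried offline. The following is a minimal sketch, not the article's own code: `sample_html` is a made-up two-row stand-in for the real table (the values are placeholders, not real data), and `select_one` is used as a shortcut for `select(...)[0]`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; the values are placeholders.
sample_html = """
<table id="main_table_countries_today">
  <thead>
    <tr><th>Country,Other</th><th>Total Cases</th></tr>
  </thead>
  <tbody>
    <tr><td>A</td><td>10</td></tr>
    <tr><td>B</td><td>20</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# select_one returns the first match for a CSS selector, or None if nothing matches.
sample_table = sample_soup.select_one("#main_table_countries_today")

# The header row lives in <thead>, the data rows in <tbody>.
header_row = sample_table.find("thead").find("tr")
body_rows = sample_table.find("tbody").find_all("tr")

print(header_row.name, len(body_rows))
```

Working on a small inline string like this makes it easier to see which element each call returns before running the same calls against the full page.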

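import requests
from bs4 import BeautifulSoup

# Download the page HTML.
html = requests.get("https://www.worldometers.info/coronavirus/").text

soup = BeautifulSoup(html, features="html.parser")

# select() returns a list of matches; take the first one for this id.
table = soup.select("#main_table_countries_today")[0]

# get_text() concatenates all the text inside <thead> into one string.
headers = table.find("thead").get_text()

print(headers)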

We can use the split() function to break the string into a list of elements.

headers = headers.split("\n")
headers = [header for header in headers if header]
print(headers)

'''
OUTPUT
['#', 'Country,Other', 'TotalCases', 'NewCases', 'TotalDeaths', 
'NewDeaths', 'TotalRecovered', 'NewRecovered', 'ActiveCases',
 'Serious,Critical', 'Tot\xa0Cases/1M pop', 'Deaths/1M pop', 'TotalTests', 'Tests/', 
'1M pop', 'Population', 'Continent', 
'1 Caseevery X ppl1 Deathevery X ppl1 Testevery X ppl']
'''
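
In the output above, the newline split does not line up one-to-one with the header cells: 'Tests/' and '1M pop' come out as two entries, and the last entry fuses several labels. One possible alternative, sketched here on a made-up header row, is to read each header cell separately. That the real page uses `th` cells with line breaks inside them is an assumption; `sample_thead` and `cell_headers` are illustrative names.

```python
from bs4 import BeautifulSoup

# Made-up header row; the real page's markup may differ.
sample_thead = """
<thead>
  <tr>
    <th>Country,<br/>Other</th>
    <th>Tests/<br/>1M pop</th>
    <th>Population</th>
  </tr>
</thead>
"""

thead = BeautifulSoup(sample_thead, "html.parser").find("thead")

# One entry per header cell: get_text(" ", strip=True) joins the text
# fragments inside a cell with a space and trims surrounding whitespace.
cell_headers = [th.get_text(" ", strip=True) for th in thead.find_all("th")]

print(cell_headers)
```

With this approach the number of headers always equals the number of header cells, regardless of how the text inside a cell is broken across lines.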

We split on "\n" and then clean up the data by removing the empty elements. Now let's scrape the "tr" elements.

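num_headers = len(headers)
table_body = table.find("tbody")
rows = table_body.find_all("tr")
# Start at index 8, where the "USA" row is.
for row_element in rows[8:]:
    # The first element after splitting is empty, so drop it.
    row = row_element.get_text().split("\n")[1:]
    if len(row) != num_headers:
        print("Error!")
        break
else:
    # Runs only if the loop finished without hitting break.
    print(" No Errors")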
'''
OUTPUT
 No Errors
'''

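We get all the row elements

We start from index 8, since the "USA" row is element 8 of the list

We ignore the first element of each row, since it is empty

We check that the row and the headers have the same length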
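
As a variation on the row loop above, each row can also be read cell by cell with `find_all("td")` instead of splitting the row text on newlines. This is only a sketch on a made-up `tbody` (the values are placeholders); it assumes each data value sits in its own `td`, as in the table structure shown earlier.

```python
from bs4 import BeautifulSoup

# Made-up rows; the second row has an empty cell on purpose.
sample_tbody = """
<table><tbody>
<tr><td>1</td><td>A</td><td>10</td></tr>
<tr><td>2</td><td>B</td><td></td></tr>
</tbody></table>
"""

tbody = BeautifulSoup(sample_tbody, "html.parser").find("tbody")

parsed_rows = []
for tr in tbody.find_all("tr"):
    # One string per cell; an empty cell becomes an empty string.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    parsed_rows.append(cells)

print(parsed_rows)
```

Because empty cells are kept as empty strings, every row keeps the same number of columns, which makes the length check against the headers more meaningful.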

Now we have all the data. We can transform it and save it as a list of dictionaries or as a CSV.
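
The conversion could look roughly like the sketch below, which uses only the standard library. `headers_demo` and `rows_demo` are small placeholder lists standing in for the scraped headers and rows, and the CSV is written to an in-memory buffer rather than a file.

```python
import csv
import io

# Placeholder data standing in for the scraped headers and rows.
headers_demo = ["Country,Other", "TotalCases", "TotalDeaths"]
rows_demo = [["A", "10", "1"], ["B", "20", "2"]]

# Pair each header with the value in the same position.
records = [dict(zip(headers_demo, row)) for row in rows_demo]

# Write the records as CSV; the csv module quotes fields that contain commas.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=headers_demo)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(records[0])
print(csv_text)
```

To save to disk instead, the same writer can be given a file opened with `open(path, "w", newline="")`.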

How to get the attributes of a tag

Let's get the value of href from inside an "a" tag.

a_tag = soup.find('a')
print(a_tag)
print(f"Attributes :  {a_tag.attrs}")

'''
OUTPUT
<a class="navbar-brand" href="/"><img border="0" 
src="/img/worldometers-logo.gif" title="Worldometer"/></a>

Attributes :  {'href': '/', 'class': ['navbar-brand']}
'''

To get the href, we can simply do the following:

href = a_tag['href']
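
Indexing with `['href']` raises a `KeyError` when the attribute is missing. A small sketch of the safer `get()` lookup follows; the two-link `snippet` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second link has no href attribute.
snippet = '<a class="navbar-brand" href="/">Home</a><a name="top">Top</a>'
first, second = BeautifulSoup(snippet, "html.parser").find_all("a")

print(first["href"])            # the attribute exists, so indexing works
print(second.get("href"))       # get() returns None instead of raising
print(second.get("href", "#"))  # or a default value of your choice
```

This is handy when looping over every link on a page, where some anchors may have no href at all.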

Now let's get the URL of the image inside the "a" tag, that is, the value of "src".

img = soup.select("a img")[0]
print(img)
img_src = img['src']
print(f'Src is {img_src}')

'''
OUTPUT
<img border="0" src="/img/worldometers-logo.gif" title="Worldometer"/>
Src is /img/worldometers-logo.gif
'''
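
The src above is a relative path. If an absolute URL is needed, for example to download the image, one option is `urljoin` from the standard library. In this sketch `page_url` and `img_src_demo` simply repeat the page URL and the src value shown earlier.

```python
from urllib.parse import urljoin

# The page that was scraped and the relative src found in it.
page_url = "https://www.worldometers.info/coronavirus/"
img_src_demo = "/img/worldometers-logo.gif"

# A src starting with "/" is resolved against the site root.
absolute_src = urljoin(page_url, img_src_demo)

print(absolute_src)
```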

Reference

Original article: #Day24 - How to Scrape Tables and Other Use Cases of Beautiful Soup Part 2, https://dev.to/rahulbanerjee99/day24-how-to-scrape-tables-and-other-use-cases-of-beautiful-soup-part2-499n
