A Summary of Parsing XML with BeautifulSoup 4

Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with the parser of your choice, it provides idiomatic ways to navigate, search, and modify the document tree.
Documentation (English): https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Getting started
The HTML below is a passage from "Alice in Wonderland".

We use it as a simple introductory example of how to parse the contents of an HTML page with BeautifulSoup.


from bs4 import BeautifulSoup

# A passage from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Create the document object
soup = BeautifulSoup(html_doc, "html.parser")

# Pretty-print the whole document
# print(soup.prettify())

# Get the title tag
soup.title
# <title>The Dormouse's story</title>

# Get the name of the title tag
soup.title.name
# 'title'

# Get the text inside the title tag
soup.title.string
# "The Dormouse's story"

# Get the name of the title tag's parent
soup.title.parent.name
# 'head'

# Get the first p tag
soup.p
# <p class="title"><b>The Dormouse's story</b></p>

# Get the class attribute of the first p tag
soup.p['class']
# ['title']

# Get the first a tag
soup.a
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

# Get all a tags
soup.find_all('a')
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>,
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]

# Get the href attribute of every a tag
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# Get the tag whose id is link3
soup.find(id="link3")
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

# Get all the text in the document (blank lines omitted from the output below)
print(soup.get_text())
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
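
The calls above combine naturally into small helpers. The following is only a sketch: the function name extract_links and the sample markup are made up for illustration, and it assumes that links without an href should be skipped.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True matches only tags that actually carry an href attribute.
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


sample = """
<p>Sisters:
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a id="no-href">nobody</a></p>
"""

links = extract_links(sample)
for text, href in links:
    print(text, href)
```

Using a["href"] is safe here only because the href=True filter guarantees the attribute exists; otherwise a.get("href") would be the more defensive choice.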

Parsers
Besides the HTML parser in the Python standard library, Beautiful Soup supports several third-party parsers, one of which is lxml.
The following overview lists the main parsers, how to select each one, and their advantages and disadvantages.
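Parser: Python standard library (html.parser)
Usage: BeautifulSoup(markup, "html.parser")
Advantages: built into Python; decent speed; tolerant of malformed markup
Disadvantages: poor tolerance of malformed markup in versions before Python 2.7.3 and 3.2.2

Parser: lxml HTML parser
Usage: BeautifulSoup(markup, "lxml")
Advantages: very fast; tolerant of malformed markup
Disadvantages: requires a C library to be installed

Parser: lxml XML parser
Usage: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
Advantages: very fast; the only XML parser supported
Disadvantages: requires a C library to be installed

Parser: html5lib
Usage: BeautifulSoup(markup, "html5lib")
Advantages: the best tolerance of malformed markup; parses documents the way a browser does; produces valid HTML5
Disadvantages: slow; an external Python package that must be installed separately (it needs no C extension)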
lxml is the recommended parser because it is the fastest. With Python 2 versions before 2.7.3, or Python 3 versions before 3.2.2, installing lxml or html5lib is essential, because the HTML parser built into the standard library of those versions is not reliable enough.
Note: if an HTML or XML document is not well formed, different parsers may return different results.
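
One way to act on this recommendation is to prefer lxml but fall back to whatever is installed. The helper below is a sketch rather than an official recipe: make_soup and its preference order are assumptions for illustration. It relies on bs4 raising FeatureNotFound when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first available parser; return (soup, parser name)."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p>hi</p>")
print(used, soup.p.get_text())
```

Returning the parser name alongside the tree makes it easy to log which parser produced a result, which matters because malformed input can parse differently from one parser to the next.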
Differences between parsers
Beautiful Soup presents the same interface for every parser, but the parsers themselves differ. Parsing the same document with different parsers can produce trees with different structures. The biggest difference is between the HTML parsers and the XML parser. Here is a short document parsed as HTML:


html_soup = BeautifulSoup("<a><b/></a>", "lxml")
print(html_soup)
# <html><body><a><b></b></a></body></html>

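Because a self-closing <b/> tag is not valid HTML, the parser turns it into a <b></b> pair.
The same document parsed as XML looks like this (the "xml" feature requires lxml). The empty <b/> tag is kept as it is, and an XML declaration is placed at the front of the document instead of wrapping it in an <html> tag.

xml_soup = BeautifulSoup("<a><b/></a>", "xml")
print(xml_soup)
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
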
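
For a slightly larger XML document, the same "xml" feature can be combined with the usual search calls. This is a sketch under assumptions: the library/book document is invented for illustration, and the code falls back to None when lxml is not installed, since the XML parser depends on it.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample document, not taken from the article.
xml_doc = """<?xml version="1.0" encoding="utf-8"?>
<library>
  <book id="b1"><title>Alice</title></book>
  <book id="b2"><title>Looking-Glass</title></book>
</library>"""

try:
    xml_soup = BeautifulSoup(xml_doc, "xml")
except FeatureNotFound:
    xml_soup = None  # lxml is not installed, so no XML parser is available

books = None
if xml_soup is not None:
    # Map each book's id attribute to the text of its <title> child.
    books = {b["id"]: b.find("title").get_text() for b in xml_soup.find_all("book")}

print(books)
```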
HTML parsers also differ from one another. If the HTML document is well formed, the choice of parser makes no difference: only the parsing speed varies, and every parser returns the correct document tree.
If the document is not well formed, however, different parsers may give different results. In the following example, lxml parses a malformed document and simply ignores the stray </p> tag:

soup = BeautifulSoup("<a></p>", "lxml")
print(soup)
# <html><body><a></a></body></html>

Parsing the same document with html5lib gives a different result:

soup = BeautifulSoup("<a></p>", "html5lib")
print(soup)
# <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the </p> tag, html5lib pairs it with an opening <p> tag, and it also adds a <head> tag to the document tree.
Here is the result with Python's built-in parser:

soup = BeautifulSoup("<a></p>", "html.parser")
print(soup)
# <a></a>

Like lxml, the built-in parser ignores the </p> tag. Unlike html5lib, it makes no attempt to produce a well-formed document or to wrap the fragment in a <body> tag. Unlike lxml, it does not even add an <html> tag.
Because the fragment "<a></p>" is malformed, each of these results can be called "correct". html5lib follows part of the HTML5 standard, so it has the strongest claim to being "correct", but all three results can be regarded as legitimate.
Because the choice of parser can affect what your code does, if you distribute code that uses BeautifulSoup it is best to state which parser you used, to spare others unnecessary trouble.
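
To see these differences on your own machine, the same malformed fragment can be run through every installed parser. A minimal sketch, assuming a helper name (parse_with_each) chosen only for illustration; parsers that are not installed are reported as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized tree, or None if the parser is missing}."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            results[name] = None  # parser not installed
    return results


results = parse_with_each("<a></p>")
for name, tree in results.items():
    print(f"{name:12} -> {tree}")
```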
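Creating the document object
Passing a document to the BeautifulSoup constructor returns a document object. You can pass either a string or an open file handle.

from bs4 import BeautifulSoup

# Pass an open file handle ...
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# ... or pass a string
soup = BeautifulSoup("<html>data</html>", "html.parser")

First the document is converted to Unicode, and HTML entities are converted to Unicode characters.

print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>", "html.parser"))
# <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document with the most suitable parser available. If you name a parser explicitly, as above, Beautiful Soup uses that one.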
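
The three kinds of input (file handle, string, bytes) can be compared side by side. This sketch writes a throwaway file first so that it runs anywhere; the file name and markup are placeholders rather than anything from the article.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = "<html><head><title>Caf&eacute; menu</title></head><body><p>data</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(html)
    # The with-block closes the handle as soon as parsing is done.
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

soup_from_string = BeautifulSoup(html, "html.parser")
soup_from_bytes = BeautifulSoup(html.encode("utf-8"), "html.parser")

# The &eacute; entity has already been converted to a Unicode character.
print(soup_from_file.title.string)
```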
Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
Tag
A Tag object corresponds to a tag in the original XML or HTML document.

from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

# Get the b tag
tag = soup.b

# Type of the object
type(tag)
# <class 'bs4.element.Tag'>

# Name of the tag
tag.name
# 'b'

# Rename the tag
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

# Read the class attribute (multi-valued, so it comes back as a list)
tag['class']
# ['boldest']

# Change the class attribute
tag['class'] = 'verybold'
tag.get('class')
# 'verybold'

# Add an id attribute
tag['id'] = 'title'
tag
# <blockquote class="verybold" id="title">Extremely bold</blockquote>

# All attributes as a dictionary
tag.attrs
# {'class': 'verybold', 'id': 'title'}

# Delete the id attribute
del tag['id']
tag
# <blockquote class="verybold">Extremely bold</blockquote>
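
Attribute access has a few edge cases worth seeing together: multi-valued attributes such as class come back as lists, single-valued ones as strings, and .get() avoids a KeyError for missing attributes. The markup below is an illustrative sample, not taken from the article.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="intro">text</p>', "html.parser")
p = soup.p

classes = p["class"]      # multi-valued attribute: a list of class names
ident = p["id"]           # single-valued attribute: a plain string
missing = p.get("href")   # absent attribute: None instead of a KeyError
has_id = p.has_attr("id")

# Assigning a list writes the values back as a space-separated attribute.
p["class"] = classes + ["highlight"]
print(p)
```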
NavigableString
Tags often contain strings. Beautiful Soup wraps the string inside a tag in the NavigableString class.

from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

# Get the b tag
tag = soup.b

# Get the string inside the tag
tag.string
# 'Extremely bold'

# Type of the string
type(tag.string)
# <class 'bs4.element.NavigableString'>
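
A NavigableString behaves like an ordinary string but stays attached to the tree. The sketch below converts it to a plain str (useful when the text should outlive the parse tree) and replaces it in place with replace_with(); the replacement text is arbitrary.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

text = tag.string   # NavigableString, still linked to the tree
plain = str(text)   # plain Python str with no link back to the tree

# A string cannot be edited in place, but it can be swapped for another one.
tag.string.replace_with("No longer bold")
print(tag)
```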
BeautifulSoup
The BeautifulSoup object represents the whole document. For most purposes it can be treated as a Tag object: it supports most of the methods described for navigating and searching the document tree.
Because the BeautifulSoup object is not a real HTML or XML tag, it has no name and no attributes. It is sometimes convenient to look at its .name, though, so the object is given the special .name value "[document]".

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
soup.name
# '[document]'
Comments and other special strings
Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to care about is the comment.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

A Comment object is a special type of NavigableString.

comment
# 'Hey, buddy. Want to buy a used parser?'

When it appears as part of an HTML document, however, a Comment is output with special formatting.

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>

Beautiful Soup defines a few other types that may appear in XML documents: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment, these classes are subclasses of NavigableString that add something extra to the string. The following example replaces the comment with a CDATA block.

from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
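
Since Comment is a subclass of NavigableString, comments can be located with an isinstance check and removed with extract(). A short sketch, with made-up markup; it assumes the string= filter of find_all, which accepts a function.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>Hello<!-- hidden note --> world</p>"
soup = BeautifulSoup(markup, "html.parser")

# Collect every string node that is a Comment.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
found = [str(c) for c in comments]

# Detach the comments from the tree.
for c in comments:
    c.extract()

print(found)
print(soup)
```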
Child nodes
A Tag may contain strings and other Tags. These are all children of that Tag. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.
Note: strings in Beautiful Soup have no children, so they do not support these attributes.
We continue with the "Alice in Wonderland" document from above.

from bs4 import BeautifulSoup

# A passage from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Create the document object
soup = BeautifulSoup(html_doc, "html.parser")

# Navigate by tag name: the first b tag inside the first p tag of body
soup.body.p.b
# <b>The Dormouse's story</b>

# Get all a tags
soup.find_all('a')
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>,
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]

# .contents returns a tag's direct children as a list
soup.head.contents
# [<title>The Dormouse's story</title>]

# .children iterates over a tag's direct children
for child in soup.head.children:
    print(child)
# <title>The Dormouse's story</title>

# .descendants iterates recursively over all of a tag's descendants
for descendant in soup.head.descendants:
    print(descendant)
# <title>The Dormouse's story</title>
# The Dormouse's story

# .string returns the NavigableString when a tag has exactly one string child
soup.head.title.string
# "The Dormouse's story"

# If a tag's only child is another tag, .string is that child's .string
soup.head.string
# "The Dormouse's story"

# .strings iterates over every string in the document
for string in soup.strings:
    print(repr(string))

# .stripped_strings does the same, stripping whitespace and skipping blank strings
for string in soup.stripped_strings:
    print(repr(string))

Note: if a tag contains more than one child, it is not clear which child .string should refer to, so .string is None.
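
The None case is easy to reproduce. In this sketch (sample markup invented for illustration), the p tag has three children, so .string gives None, while .stripped_strings and get_text() still reach all of the text.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")
p = soup.p

child_count = len(p.contents)        # 'Hello ', <b>big</b>, ' world'
whole = p.string                     # ambiguous, so None
inner = p.b.string                   # the b tag has a single string child
pieces = list(p.stripped_strings)    # every string, with whitespace stripped
joined = p.get_text()                # all text concatenated

print(child_count, whole, inner, pieces, joined)
```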
Parent nodes
Every tag and every string has a parent node. Using the same document again:

from bs4 import BeautifulSoup

# A passage from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Create the document object
soup = BeautifulSoup(html_doc, "html.parser")

# .parent of the title tag
soup.title.parent
# <head><title>The Dormouse's story</title></head>

# .parent of the string inside the title tag
soup.title.string.parent
# <title>The Dormouse's story</title>

# The parent of the top-level <html> tag is the BeautifulSoup object itself
type(soup.html.parent)
# <class 'bs4.BeautifulSoup'>

# The .parent of the BeautifulSoup object is None
print(soup.parent)
# None

# .parents iterates over all of an element's ancestors
for parent in soup.a.parents:
    print(parent.name)
# p
# body
# html
# [document]
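
Walking .parents is enough to build a readable path to any tag. The helper name tag_path and the sample markup are assumptions for illustration; the special "[document]" entry for the BeautifulSoup object is filtered out.

```python
from bs4 import BeautifulSoup


def tag_path(tag):
    """Return a 'root > ... > tag' path built from the tag's ancestors."""
    # .parents yields the nearest ancestor first and ends with the
    # BeautifulSoup object, whose name is "[document]".
    names = [tag.name] + [p.name for p in tag.parents if p.name != "[document]"]
    return " > ".join(reversed(names))


soup = BeautifulSoup('<html><body><p>See <a id="x">link</a></p></body></html>', "html.parser")
path = tag_path(soup.find("a", id="x"))
print(path)
```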
Sibling nodes
To demonstrate how to find sibling nodes with BeautifulSoup, the example document is modified: the line breaks, strings, and tags between the links are removed. The example code is as follows.

from bs4 import BeautifulSoup

# The wrapper tags around the title and the links were lost from the source;
# <p> and <b> are assumed here, by analogy with the earlier document.
html_doc = """
<html>
<body>
<p><b>Schindler's List</b></p>
<p><a href="http://example.com/Oskar" id="name1">Oskar Schindler</a><a href="http://example.com/Itzhak" id="name2">Itzhak Stern</a><a href="http://example.com/Helen" id="name3">Helen Hirsch</a></p>
</body>
</html>
"""

# Create the document object
soup = BeautifulSoup(html_doc, "html.parser")

# Get the a tag whose id is name2
name2 = soup.find("a", id="name2")
# <a href="http://example.com/Itzhak" id="name2">Itzhak Stern</a>

# Get the previous sibling
name1 = name2.previous_sibling
# <a href="http://example.com/Oskar" id="name1">Oskar Schindler</a>

# Get the next sibling
name3 = name2.next_sibling
# <a href="http://example.com/Helen" id="name3">Helen Hirsch</a>

print(name1.previous_sibling)
# None

print(name3.next_sibling)
# None

# .next_siblings iterates over all following siblings
for sibling in soup.find("a", id="name1").next_siblings:
    print(repr(sibling))
# <a href="http://example.com/Itzhak" id="name2">Itzhak Stern</a>
# <a href="http://example.com/Helen" id="name3">Helen Hirsch</a>

# .previous_siblings iterates over all preceding siblings, nearest first
for sibling in soup.find("a", id="name3").previous_siblings:
    print(repr(sibling))
# <a href="http://example.com/Itzhak" id="name2">Itzhak Stern</a>
# <a href="http://example.com/Oskar" id="name1">Oskar Schindler</a>

Note: strings, characters, and line breaks inside a tag all count as nodes, so in a real document .next_sibling is often a string of whitespace or punctuation rather than the next tag.
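
The effect of text between tags can be seen in a two-link fragment (made up for this sketch): .next_sibling returns the string between the links, while find_next_sibling("a") skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

markup = '<p><a id="a1">one</a>,\n<a id="a2">two</a></p>'
soup = BeautifulSoup(markup, "html.parser")
a1 = soup.find("a", id="a1")

raw_next = a1.next_sibling             # the string ",\n" between the two links
tag_next = a1.find_next_sibling("a")   # the next sibling that is an <a> tag

print(repr(raw_next))
print(tag_next)
```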
Going back and forth
We keep using the HTML document from the previous section to demonstrate moving backward and forward through the parsed elements.

from bs4 import BeautifulSoup

# The wrapper tags around the title and the links were lost from the source;
# <p> and <b> are assumed here, by analogy with the earlier document.
html_doc = """
<html>
<body>
<p><b>Schindler's List</b></p>
<p><a href="http://example.com/Oskar" id="name1">Oskar Schindler</a><a href="http://example.com/Itzhak" id="name2">Itzhak Stern</a><a href="http://example.com/Helen" id="name3">Helen Hirsch</a></p>
</body>
</html>
"""

# Create the document object
soup = BeautifulSoup(html_doc, "html.parser")

# Get the a tag whose id is name2
name2 = soup.find("a", id="name2")
# <a href="http://example.com/Itzhak" id="name2">Itzhak Stern</a>

# The element parsed immediately before it
name2.previous_element
# 'Oskar Schindler'

# One step further back
name2.previous_element.previous_element
# <a href="http://example.com/Oskar" id="name1">Oskar Schindler</a>

# The element parsed immediately after it
name2.next_element
# 'Itzhak Stern'

# One step further forward
name2.next_element.next_element
# <a href="http://example.com/Helen" id="name3">Helen Hirsch</a>

# .next_elements iterates over everything parsed after an element
for element in soup.find("a", id="name1").next_elements:
    print(repr(element))
# 'Oskar Schindler'
# <a href="http://example.com/Itzhak" id="name2">Itzhak Stern</a>
# 'Itzhak Stern'
# <a href="http://example.com/Helen" id="name3">Helen Hirsch</a>
# 'Helen Hirsch'
# '\n'
# '\n'
# '\n'

# .previous_elements iterates over everything parsed before an element, nearest first
# (long tags are abbreviated with ... below)
for element in soup.find("a", id="name1").previous_elements:
    print(repr(element))
# <p><a ...>Oskar Schindler</a><a ...>Itzhak Stern</a><a ...>Helen Hirsch</a></p>
# '\n'
# "Schindler's List"
# <b>Schindler's List</b>
# <p><b>Schindler's List</b></p>
# '\n'
# <body> ... </body>
# '\n'
# <html> ... </html>
# '\n'
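
The difference between siblings and elements is easiest to see on a tiny fragment (invented for this sketch): .next_sibling stays on the same level of the tree, while .next_element follows parse order and therefore enters the tag's own content first.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="a1">one</a><a id="a2">two</a></p>', "html.parser")
a1 = soup.find("a", id="a1")

sibling = a1.next_sibling                # same level: the second <a> tag
element = a1.next_element                # parse order: the string inside a1
after_that = element.next_element        # then the second <a> tag

print(sibling)
print(repr(element))
print(after_that)
```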
Searching the document tree

find_all(name, attrs, recursive, string, limit, **kwargs)
find(name, attrs, recursive, string, **kwargs)
find_parents(name, attrs, string, limit, **kwargs)
find_parent(name, attrs, string, **kwargs)
find_next_siblings(name, attrs, string, limit, **kwargs)
find_next_sibling(name, attrs, string, **kwargs)
find_previous_siblings(name, attrs, string, limit, **kwargs)
find_previous_sibling(name, attrs, string, **kwargs)
find_all_next(name, attrs, string, limit, **kwargs)
find_next(name, attrs, string, **kwargs)
find_all_previous(name, attrs, string, limit, **kwargs)
find_previous(name, attrs, string, **kwargs)

Beautiful Soup defines many search methods. Only find_all() and find() take the recursive argument, and the string argument was called text before Beautiful Soup 4.4.0. The examples here focus on find_all().

from bs4 import BeautifulSoup
import re

# A passage from "Alice in Wonderland"
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Create the document object
soup = BeautifulSoup(html_doc, "html.parser")

# name argument: find all tags with a given name (b)
soup.find_all('b')
# [<b>The Dormouse's story</b>]

# name plus CSS class: p tags whose class is title
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

# Keyword argument: tags whose id is link2
soup.find_all(id='link2')
# [<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>]

# Several filters at once: href contains "elsie" and id is link1
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

# attrs dictionary: tags whose id is link1
print(soup.find_all(attrs={"id": "link1"}))
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

# class_ keyword: a tags whose class is sister
soup.find_all("a", class_="sister")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>,
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

# Function as filter: tags with a class name that is 6 characters long
soup.find_all(class_=has_six_characters)
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>,
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]

# string argument: search for strings instead of tags
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

# limit argument: stop after the first two a tags
soup.find_all("a", limit=2)

# recursive=False: only direct children are searched, so title is not found
soup.html.find_all("title", recursive=False)
# []

# CSS selector
soup.select("head > title")
# [<title>The Dormouse's story</title>]

# Regular expression: tags whose name starts with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# Regular expression: tags whose name contains t
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

# List: tags whose name is a or b
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
# <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>,
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]

# True matches every tag but no strings
for tag in soup.find_all(True):
    print(tag.name)

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

# Function as filter: tags that have a class attribute but no id attribute
soup.find_all(has_class_but_no_id)
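
Some of the other search methods listed above follow the same pattern. This sketch (with invented markup) shows the difference between find() and find_all() when nothing matches, and uses find_parent() and find_next_sibling() to move relative to a tag that has already been found.

```python
from bs4 import BeautifulSoup

markup = '<div><p id="p1"><a id="a1">one</a></p><p id="p2">two</p></div>'
soup = BeautifulSoup(markup, "html.parser")

missing_one = soup.find("table")       # no match: None
missing_all = soup.find_all("table")   # no match: empty list

a1 = soup.find("a", id="a1")
parent_p = a1.find_parent("p")              # nearest enclosing <p>
next_p = parent_p.find_next_sibling("p")    # the following <p> on the same level

print(missing_one, missing_all)
print(parent_p["id"], next_p["id"])
```

Because find() returns None when nothing matches, chaining an attribute access onto it without a check raises an AttributeError, whereas looping over the result of find_all() simply does nothing.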
