Beautiful Soup 4 Tutorial (3): Traversing and Searching the Document Tree


4. Traversing the document tree

4.1 Direct child nodes

.contents

The .contents attribute of a Tag object returns the tag's direct child nodes as a list, so an individual child can be fetched by index.

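# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the BeautifulSoup object from the HTML string.
# It can also be created from a local HTML file, for example:
# soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html, features="lxml")

print(soup.body.contents)     # all direct children of <body>, as a list
print(soup.body.contents[1])  # a single child, fetched by index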

result:
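[u'\n', <p class="title" name="dromouse"><b>The Dormouse's story</b></p>, u'\n', <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, u'\n', <p class="story">...</p>, u'\n']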
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

.children

The .children attribute of a Tag object is an iterator over the tag's direct child nodes.

print(soup.head.children)
for child in soup.body.children:
    print(child)

result:
<listiterator object at 0x00000000039C3080>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>
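The relationship between .contents and .children can be checked with a short sketch. It assumes Python 3 and the built-in "html.parser" (rather than lxml) so that it runs without extra dependencies; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Same sample document as above, with the closing tags written out.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Assumption: the built-in parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .children yields the same nodes as .contents, one at a time.
same_nodes = list(body.children) == body.contents

# Keep only Tag children, dropping the newline strings between them.
tag_children = [child.name for child in body.children if isinstance(child, Tag)]

print(same_nodes)
print(tag_children)
```

Filtering with isinstance(child, Tag) is one way to skip the whitespace strings that appear between tags in the output above.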

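4.2 All descendant nodes

.descendants

Unlike .contents and .children, which cover only a tag's direct children, the .descendants attribute walks recursively through every node beneath the tag and returns a generator.

print(soup.head.descendants)
# Iterate over the whole document, which is what the result below shows.
for child in soup.descendants:
    print(child)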

result:
<generator object descendants at 0x0000000003970E58>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...
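To make the difference between direct children and descendants concrete, the following sketch counts both for the body tag. It assumes Python 3 and the built-in "html.parser"; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Same sample document as above, with the closing tags written out.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Assumption: the built-in parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Direct children only (tags and the strings between them).
direct_count = len(body.contents)

# Every node beneath <body>, at any depth.
descendant_count = len(list(body.descendants))

# Names of all descendant tags, in document order.
descendant_tags = [node.name for node in body.descendants if isinstance(node, Tag)]

print(direct_count, descendant_count)
print(descendant_tags)
```

The b and a tags appear only in the descendants, because they are nested inside p tags rather than directly inside body.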

4.3 Node content

When the Tag contains no child tags, only text

print(soup.title)
print(soup.title.string)

result:
<title>The Dormouse's story</title>
The Dormouse's story

When the Tag contains exactly one child tag, .string returns that child's string

print(soup.head)
print(soup.head.string)

result:
<head><title>The Dormouse's story</title></head>
The Dormouse's story

When the Tag contains several child tags

.string can no longer be used; it returns None.

print(soup.body)
print(soup.body.string)

result:
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
None

In that case, use the .strings or .stripped_strings attribute instead. Both return generators.

print(soup.strings)
for string in soup.strings:
    print(string)

result:
<generator object _all_strings at 0x0000000003170E58>
The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...

Use the .stripped_strings attribute of a Tag object to get the same strings with blank lines and surrounding whitespace removed.

print(soup.stripped_strings)
for string in soup.stripped_strings:
    print(string)
    
result:
<generator object stripped_strings at 0x00000000030D0E58>
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
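Because .stripped_strings is a generator, a common pattern is to collect it into a list or join it into one line of text. The sketch below assumes Python 3 and the built-in "html.parser"; the names pieces and text are illustrative.

```python
from bs4 import BeautifulSoup

# Same sample document as above, with the closing tags written out.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Assumption: the built-in parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")

# Materialize the generator; whitespace-only strings are skipped
# and each remaining string has its surrounding whitespace stripped.
pieces = list(soup.stripped_strings)

# Join the pieces into a single text value.
text = " ".join(pieces)

print(pieces)
print(text)
```

The comment inside the first link does not show up here, which matches the result above.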

4.4 Direct parent node

Parent of a tag

p = soup.p
print(p.parent.name)

result:
body

Parent of a string: the innermost tag that wraps the text

content = soup.head.title.string
print(content)
print(content.parent.name)

result:
The Dormouse's story
title

4.5 All ancestor nodes

The .parents attribute iterates over all ancestors of a node; it also returns a generator.

content = soup.head.title.string
print(content)
for parent in content.parents:
    print(parent.name)
    
result:
The Dormouse's story
title
head
html
[document]

4.6 Sibling nodes

The .next_sibling and .previous_sibling attributes return the next and the previous sibling of a node, respectively.

These two attributes often return whitespace or a line break, because Beautiful Soup treats the whitespace and newlines between tags as nodes too.

print(soup.p.next_sibling)
print(soup.a.previous_sibling)
print(soup.p.next_sibling.next_sibling)

result:


Once upon a time there were three little sisters; and their names were

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

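4.7 All sibling nodes

The .next_siblings and .previous_siblings attributes iterate over all siblings that follow or precede the current node.

# "sibling" is used as the loop variable to avoid shadowing the built-in next().
for sibling in soup.a.next_siblings:
    print(sibling)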

result:
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
;
and they lived at the bottom of a well.

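4.8 Next and previous elements

The .next_element and .previous_element attributes return the node that was parsed immediately after or before the current one, regardless of nesting level (nodes on the same level are the siblings described above).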

<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
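A minimal sketch of how .next_element and .previous_element differ from the sibling attributes. It assumes Python 3 and the built-in "html.parser"; the variable names are illustrative and the calls shown are not necessarily the ones the original example used.

```python
from bs4 import BeautifulSoup

# Same sample document as above, with the closing tags written out.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Assumption: the built-in parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")
head = soup.head
title = soup.title
first_link = soup.a

# The element parsed right after <head> opens is its child <title>.
print(head.next_element)
# Going backwards from <title> leads to its parent <head>.
print(title.previous_element)

# The next sibling stays on the same level: the string after the link.
print(repr(first_link.next_sibling))
# The next element descends into the link: the comment it contains.
print(repr(first_link.next_element))
```

The sibling attributes never leave the current nesting level, while the element attributes follow the order in which the parser encountered the nodes.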

4.9 All following and preceding elements

The .next_elements and .previous_elements attributes iterate forward or backward over the document in parse order.

soup = BeautifulSoup(html,features="lxml")

for element in soup.a.next_elements:
    print(repr(element))
    
result:
u' Elsie '
u',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
u'Lacie'
u' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
u'Tillie'
u';\nand they lived at the bottom of a well.'
u'\n'
<p class="story">...</p>
u'...'
u'\n'
u'\n'
u'\n'

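5. Searching the document tree

5.1 find_all

Usage: find_all(name, attrs, recursive, text, **kwargs)

Search scope: all tag nodes beneath the current tag.

Purpose: checks every tag beneath the current tag against the filter conditions and returns the ones that match.

name parameter: finds tags whose name is name; string nodes are ignored automatically.

Passing a string

print(soup.find_all('a'))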

result:
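[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]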

Passing a regular expression

import re
# Match every tag whose name starts with "b" (here: body and b).
for tag in soup.find_all(re.compile("^b")):
    print(tag)
    
result:
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
<b>The Dormouse's story</b>

Passing a list

print(soup.find_all(["a", "b"]))

result:
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing True: matches every tag

for tag in soup.find_all(True):
    print(tag.name)
    
result:
html
head
title
body
p
b
p
a
a
a
p

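Passing a function: build your own filter. The function takes a tag object as its only argument and returns True or False.

def has_class_but_no_id(tag):
    # Keep tags that define a class attribute but no id attribute.
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))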

result:
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

Keyword arguments

The name parameter searches by tag name, for example a, head, or title.

To search by the value of an attribute, pass it as a key=value pair, for example soup.find_all(id='link2').

import re
print(soup.find_all(id='link2'))
print(soup.find_all(href=re.compile("elsie")))

result:
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

If the attribute name is a Python reserved word, append an underscore to it. For example, class="sister" is written as class_="sister".

print(soup.find_all(class_="sister"))

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

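HTML5 data-* attributes cannot be passed as keyword arguments, because a name such as data-foo is not a valid Python identifier. Pass a dictionary through the attrs parameter instead: soup.find_all(attrs={"data-foo": "value"})

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', features="lxml")
print(data_soup.find_all(attrs={"data-foo": "value"}))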

result:
[<div data-foo="value">foo!</div>]

text parameter

It works like the name parameter, but it searches the string content of the document (comments excluded). A plain string must match exactly; regular expressions, lists, and True are also accepted.

import re
print(soup.a)
print(soup.find_all(text="Elsie"))
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
print(soup.find_all(text="story"))
print(soup.find_all(text="The Dormouse's story"))
print(soup.find_all(text=re.compile("story")))

result:
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
[]
[u'Lacie', u'Tillie']
[]
[u"The Dormouse's story", u"The Dormouse's story"]
[u"The Dormouse's story", u"The Dormouse's story"]
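Since Beautiful Soup 4.4.0 the same filter is also available under the name string, and text is the older spelling. The sketch below uses string=; it assumes Python 3, a Beautiful Soup version of 4.4.0 or later, and the built-in "html.parser". The variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

# Same sample document as above, with the closing tags written out.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Assumption: the built-in parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")

# Exact match on a string node.
exact = soup.find_all(string="Lacie")

# Regular expression match: any string containing "Dormouse".
partial = soup.find_all(string=re.compile("Dormouse"))

# Combined with a tag name, the result is the tag whose .string matches.
tags = soup.find_all("a", string="Lacie")

print(exact)
print(partial)
print(tags)
```

Without a tag name the search returns string nodes; with a tag name it returns the matching tags themselves.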

limit parameter: limits the number of results returned when filtering by name, attrs, or any other filter.

print(soup.find_all("a"))
print("===============")
print(soup.find_all("a", limit=2))

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
===============
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

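recursive parameter: by default, find_all searches all descendants of the current tag. Pass recursive=False to search only its direct children.

print(soup.body)
print("===============================")
print(soup.body.find_all("a", recursive=False))
print(soup.body.find_all("a"))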

result:

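<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
===============================
[]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]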

In this example every a tag sits inside a p tag, so a search limited to the direct children of body finds no a tags.

5.2 find

Usage: find(name, attrs, recursive, text, **kwargs)

Difference from find_all: find_all collects every match into a list, while find returns only the first match.

Otherwise the usage is the same.

print(soup.body.find_all("a"))
print(soup.body.find("a"))

result:
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
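One further difference shows up when nothing matches: find returns None, while find_all returns an empty list. A minimal sketch, assuming Python 3 and the built-in "html.parser"; the table tag is simply a name that does not occur in the sample document.

```python
from bs4 import BeautifulSoup

# Same sample document as above, with the closing tags written out.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Assumption: the built-in parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")

# No <table> exists in the document.
missing_one = soup.find("table")
missing_all = soup.find_all("table")

# find() gives the same tag as the first item of find_all(..., limit=1).
first = soup.find("a")
first_of_all = soup.find_all("a", limit=1)[0]

print(missing_one)
print(missing_all)
print(first is first_of_all)
```

Code that chains attribute access after find (for example soup.find("table").name) should therefore check for None first.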

5.3 find_parents and find_parent

find_all() and find() search all descendants of the current node (recursive defaults to True).

find_parents() and find_parent() search the ancestors of the current node instead.

Their other behavior and usage are the same as described above.

print(soup.body.find_all("a"))
print(soup.body.find("a"))
print(soup.body.find_parents("a"))
print(soup.body.find_parents("html"))

result:
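[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
[]
[<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>]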
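Searching upward is usually more useful from a deeply nested node than from body. The following sketch starts from one of the links; it assumes Python 3 and the built-in "html.parser", and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Same sample document as above, with the closing tags written out.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Assumption: the built-in parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")
link = soup.find(id="link2")

# Nearest ancestor that is a <p> tag.
parent_p = link.find_parent("p")

# All ancestors with one of these names, nearest first.
named_ancestors = [tag.name for tag in link.find_parents(["p", "body", "html"])]

# The link itself is not among its own ancestors.
no_match = link.find_parents("a")

print(parent_p["class"])
print(named_ancestors)
print(no_match)
```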

5.4 find_next_siblings and find_next_sibling

These search the siblings that come after the current node.

Their other behavior and usage are exactly the same as above.
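A short sketch of both methods, using only the story paragraph of the sample document. It assumes Python 3 and the built-in "html.parser"; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The story paragraph from the sample document.
html = """<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>"""

# Assumption: the built-in parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")
first_link = soup.find(id="link1")

# The plain attribute returns the very next node, here a string.
raw_next = first_link.next_sibling

# The search methods skip strings and return only matching tags.
later_links = first_link.find_next_siblings("a")
next_link = first_link.find_next_sibling("a")

print(repr(raw_next))
print([a["id"] for a in later_links])
print(next_link["id"])
```

Compared with .next_sibling, these methods avoid having to step over the whitespace and punctuation strings by hand.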

5.5 find_previous_siblings and find_previous_sibling

These search the siblings that come before the current node.

Their other behavior and usage are exactly the same as above.
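A short sketch of searching backward among siblings, again using only the story paragraph. It assumes Python 3 and the built-in "html.parser"; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The story paragraph from the sample document.
html = """<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>"""

# Assumption: the built-in parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")
last_link = soup.find(id="link3")

# Matching siblings before the current node, nearest first.
earlier_links = last_link.find_previous_siblings("a")

# Only the nearest matching sibling.
previous_link = last_link.find_previous_sibling("a")

print([a["id"] for a in earlier_links])
print(previous_link["id"])
```

The results come back nearest first, so they are in reverse document order.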

5.6 find_all_next and find_next

These search the tags and strings that come after the current node.

Their other behavior and usage are exactly the same as above.
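Unlike the sibling searches, these methods are not limited to one nesting level. A minimal sketch, assuming Python 3 and the built-in "html.parser"; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Same sample document as above, with the closing tags written out.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Assumption: the built-in parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")
first_link = soup.find(id="link1")

# Every <a> or <p> tag that appears after the first link in the document.
following = [tag.name for tag in first_link.find_all_next(["a", "p"])]

# The first <p> after the link; it lies outside the link's own paragraph.
next_p = first_link.find_next("p")

print(following)
print(next_p)
```

The final p tag is found even though it is not a sibling of the link.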

5.7 find_all_previous and find_previous

These search the tags and strings that come before the current node.

Their other behavior and usage are exactly the same as above.
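A minimal sketch of the backward search, assuming Python 3 and the built-in "html.parser"; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Same sample document as above, with the closing tags written out.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# Assumption: the built-in parser is used instead of lxml.
soup = BeautifulSoup(html, "html.parser")
first_link = soup.find(id="link1")

# Every <p> tag that starts before the first link, nearest first.
previous_ps = first_link.find_all_previous("p")
previous_classes = [p["class"] for p in previous_ps]

# The nearest <title> tag before the link.
previous_title = first_link.find_previous("title")

print(previous_classes)
print(previous_title.string)
```

The paragraph that contains the link is part of the result, because its opening tag appears before the link in the document.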
