Python Beautiful Soup Basics for Beginners


Beautiful Soup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages. It has three main characteristics.

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not specify one and Beautiful Soup cannot detect it. In that case you only need to state the original encoding.

Beautiful Soup sits on top of well-known Python parsers such as lxml and html5lib, so you can flexibly choose between different parsing strategies or trade flexibility for speed.
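The encoding behaviour described in the second point can be sketched as follows. This is a minimal illustration rather than part of the original article: the byte string and the choice of Latin-1 are assumptions made for the demo, and it uses Python's built-in html.parser so that no extra parser has to be installed. from_encoding is the constructor argument for stating the original encoding.

```python
from bs4 import BeautifulSoup

# Assumed input: a Latin-1 encoded document with no charset declaration.
raw = '<p>café</p>'.encode('latin-1')

# State the original encoding explicitly; the parsed tree holds Unicode text.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='latin-1')
text = soup.p.string

# encode() serializes the tree back to bytes, using UTF-8 by default.
utf8_bytes = soup.encode()

print(text)
print(utf8_bytes)
```

When the document declares its own charset, or detection succeeds, the from_encoding argument can simply be left out.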

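First, install it with pip install beautifulsoup4 (pip install bs4 installs the same package). The examples below use the lxml parser, which is installed separately with pip install lxml.
Parsers supported by Beautiful Soup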
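The parser is chosen with the second argument to the BeautifulSoup constructor. As a rough sketch, the loop below feeds the same deliberately invalid markup to three parser names and records what each one produces. The markup string and the results dictionary are made up for illustration. Only html.parser ships with Python, so the other two are skipped when they are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = '<a></p>'  # deliberately invalid: a stray closing tag

results = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, '->', output)
```

Each parser repairs broken markup in its own way, so the printed trees can differ. For well-formed documents the choice mostly affects speed and dependencies.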

The following example uses the lxml parser:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)
Output:
Hello
Example of Beautiful Soup's prettify output:


html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# prettify() returns the parsed document as a string with standard indentation.
print(soup.prettify())
print(soup.title.string)

Output:


<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story
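
To see what prettify() changes, it can help to compare it with the compact form returned by str(). The following is a minimal sketch on a smaller snippet invented for this purpose, using the built-in html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')

compact = str(soup)       # the markup on a single line
pretty = soup.prettify()  # the same tree spread over several indented lines

print(compact)
print(pretty)
```

prettify() is meant for reading and debugging. It adds whitespace, so str() is the safer choice when the markup is going to be processed further.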

The following example shows how to select elements and read their names and attributes.


html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
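print('The title node, including its text:\n', soup.title)
print('Its type:\n', type(soup.title))
print('The text inside the node:\n', soup.title.string)
print('The head node with everything inside it:\n', soup.head)
print('The first p node:\n', soup.p)
print('The node name, via the name attribute:\n', soup.title.name)
# Note that some results are strings and others are lists of strings.
# For example, the name attribute holds a single value, so a single string is returned.
# A node can have several classes, so class is returned as a list.
print('All attributes of the node (id, class, and so on):\n', soup.p.attrs)
print('One attribute value, looked up through attrs:\n', soup.p.attrs['name'])
print('The name attribute of p:\n', soup.p['name'])
print('The class attribute of p:\n', soup.p['class'])
print('The text of the first p node:\n', soup.p.string)

Output:

The title node, including its text:
 <title>The Dormouse's story</title>
Its type:
 <class 'bs4.element.Tag'>
The text inside the node:
 The Dormouse's story
The head node with everything inside it:
 <head><title>The Dormouse's story</title></head>
The first p node:
 <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
The node name, via the name attribute:
 title
All attributes of the node (id, class, and so on):
 {'class': ['title'], 'name': 'dromouse'}
One attribute value, looked up through attrs:
 dromouse
The name attribute of p:
 dromouse
The class attribute of p:
 ['title']
The text of the first p node:
 The Dormouse's story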
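
The difference between single-valued and multi-valued attributes, and two edge cases the example above does not hit, can be sketched like this. The snippet is invented for illustration. get() and get_text() are standard Tag methods that the article itself does not use.

```python
from bs4 import BeautifulSoup

snippet = '<p class="story intro" id="first">Once <b>upon</b> a time</p>'
p = BeautifulSoup(snippet, 'html.parser').p

classes = p['class']       # multi-valued attribute: a list of strings
pid = p['id']              # single-valued attribute: a plain string
missing = p.get('title')   # get() returns None instead of raising KeyError

only_string = p.string     # None, because p has more than one child node
full_text = p.get_text()   # concatenates the text of all descendants

print(classes, pid, missing, only_string, full_text)
```

In the article's example soup.p.string works because that p contains a single b node, which in turn contains a single string. As soon as a node has several children, .string is None and get_text() is the usual fallback.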

In the example above, every node selected this way (soup.title, soup.head, soup.p) is a bs4.element.Tag, so you can keep selecting nodes on the result:


html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('The title node nested inside head:\n', soup.head.title)
print('Its type:\n', type(soup.head.title))
print('Its text:\n', soup.head.title.string)

Output:

The title node nested inside head:
 <title>The Dormouse's story</title>
Its type:
 <class 'bs4.element.Tag'>
Its text:
 The Dormouse's story
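
One caveat with chained selection is that a missing node evaluates to None, so the next step in the chain raises AttributeError. A small guard can avoid that. This is a sketch only: nested_text is a hypothetical helper, not a Beautiful Soup function, and it relies on find(), which performs the same lookup as attribute-style access.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, 'html.parser')

def nested_text(root, *names):
    """Follow a chain of tag names; return the last node's string, or None."""
    node = root
    for name in names:
        node = node.find(name)  # equivalent to attribute access such as root.head
        if node is None:
            return None         # stop early instead of raising AttributeError
    return node.string

found = nested_text(soup, 'head', 'title')
absent = nested_text(soup, 'body', 'title')

print(found)
print(absent)
```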

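(1) find_all()
As the name suggests, find_all() returns every element that matches the given conditions. You can filter by node name, attributes, or text, which makes it very powerful.
find_all(name, attrs, recursive, text, **kwargs)
In Beautiful Soup 4.4.0 and later, string is the preferred name for the text argument.
Usage: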


html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('All ul nodes; the result is a list of length 2:\n', soup.find_all(name='ul'))
print('Each element is still a bs4.element.Tag:\n', type(soup.find_all(name='ul')[0]))
# Because every result is a Tag, nested queries work on it as well
for ul in soup.find_all(name='ul'):
  print('li nodes in this ul:', ul.find_all(name='li'))
# Iterate over each li and print its text
for ul in soup.find_all(name='ul'):
  print('li nodes in this ul:', ul.find_all(name='li'))
  for li in ul.find_all(name='li'):
    print('li text:', li.string)

Output:

All ul nodes; the result is a list of length 2:
 [<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
Each element is still a bs4.element.Tag:
 <class 'bs4.element.Tag'>
li nodes in this ul: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
li nodes in this ul: [<li class="element">Foo</li>, <li class="element">Bar</li>]
li nodes in this ul: [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
li text: Foo
li text: Bar
li text: Jay
li nodes in this ul: [<li class="element">Foo</li>, <li class="element">Bar</li>]
li text: Foo
li text: Bar
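
The example above only filters by name. The other arguments in the signature can be sketched as follows, on a trimmed copy of the same markup parsed with the built-in html.parser. The variable names are illustrative. Two details are assumptions beyond the article: string is used instead of the older text name, and limit is a find_all() argument that the signature above does not list.

```python
import re
from bs4 import BeautifulSoup

html = '''
<div class="panel-body">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# attrs: match on an attribute dictionary.
by_id = soup.find_all(attrs={'id': 'list-1'})

# Keyword filter: class_ avoids the reserved word and matches any one class.
small = soup.find_all('ul', class_='list-small')

# string: match text nodes, here with a regular expression.
bar_strings = soup.find_all(string=re.compile('^Ba'))

# limit: stop after the first two matches.
first_two = soup.find_all('li', limit=2)

# recursive=False: only direct children of div, so no li is found.
direct_li = soup.div.find_all('li', recursive=False)

print(len(by_id), [ul['id'] for ul in small], bar_strings, first_two, direct_li)
```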

That covers the basic usage of Beautiful Soup. We hope it helps with your learning.


You may freely share or copy this text, but please keep this document's URL as the reference URL.

Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
