HTML Parser - Extract HTML Information with Ease


Hello Coders,

This article presents a few practical code snippets for extracting and processing HTML information using an HTML parser written in Python with the BS4 library. The following topics are covered:

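Load the HTML

Scan the file for assets: images, JavaScript files, CSS files

Change the path of existing assets

Update existing elements: change the src attribute of images

Locate an element based on its id

Remove an element from the DOM tree

Process existing components: remove hard-coded text

Save the processed HTML to a file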

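What is an HTML Parser?

According to Wikipedia, parsing (or syntactic analysis) is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. Applied here, HTML parsing means loading the HTML, extracting and processing relevant information such as the head title, page assets, and main sections, and then saving the processed file.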
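
To make the definition concrete, here is a minimal sketch of the "load and extract" idea. It deliberately uses Python's standard-library html.parser module instead of BeautifulSoup, so it is an illustration of parsing in general and not the approach used in the rest of this article. The class name TitleExtractor, the helper extract_title, and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text found inside the <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag the parser recognizes
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text nodes arrive here; keep only those inside <title>
        if self.in_title:
            self.title += data


def extract_title(html):
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


sample = "<html><head><title>My Page</title></head><body><p>Hi</p></body></html>"
print(extract_title(sample))  # prints: My Page
```

BeautifulSoup hides this event-driven plumbing and exposes the parsed document as a tree, which is why the remaining snippets are so short.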

The Parser Environment

The code uses BeautifulSoup, a well-known parsing library written in Python. To start coding, a few modules must be installed on the system.

$ pip install ipython # the console where we execute the code
$ pip install requests # a library to pull the entire HTML page
$ pip install beautifulsoup4 # the real magic is here

Load the HTML Content

The file is loaded like any other file, and its content is injected into a BeautifulSoup object.

from bs4 import BeautifulSoup as bs

# Load the HTML content; the with-block closes the file automatically
with open('index.html', 'r') as html_file:
    html_content = html_file.read()

# Initialize the BS object
soup = bs(html_content, 'html.parser')
# At this point, we can interact with the HTML
# elements stored in memory using all helpers offered by BS library
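
The loading step can be wrapped in a small helper. The sketch below is only an illustration: the function name load_soup, the UTF-8 default, and the temporary demo file are assumptions of mine, not part of the original snippet.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(path, encoding="utf-8"):
    """Read an HTML file and return it as a BeautifulSoup tree (hypothetical helper)."""
    html = Path(path).read_text(encoding=encoding)
    return BeautifulSoup(html, "html.parser")


# Demo: write a tiny page to a temporary directory, then load it back
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "index.html"
    page.write_text(
        "<html><head><title>Demo</title></head><body></body></html>",
        encoding="utf-8",
    )
    soup = load_soup(page)

print(soup.title.string)  # prints: Demo
```

Passing the encoding explicitly avoids depending on the platform default, which differs between operating systems.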

Parse the HTML for Assets

At this point, the DOM tree is loaded in the BeautifulSoup object. Let's scan the DOM tree for JavaScript files, which are script nodes.

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...

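The snippet that locates the JavaScript files takes only a few lines. The BS library returns a list of objects, and each script node can be changed with ease. Note that recursive=False restricts the search to the direct children of the body node.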

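for script in soup.body.find_all('script', recursive=False):

   # Print the src attribute (.get() returns None for inline scripts without src)
   print(' JS source = ' + str(script.get('src')))

   # Print the type attribute (optional in HTML5, so it may be missing)
   print(' JS type = ' + str(script.get('type')))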
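
Real pages often mix external and inline scripts, and inline scripts have no src attribute. The following sketch shows one way to handle that, using an invented sample page: the src=True filter keeps only tags that carry the attribute, and .get() avoids a KeyError when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Invented sample page: one nested script, one top-level script, one inline script
html = """
<html><body>
<div><script src="js/nested.js"></script></div>
<script type="text/javascript" src="js/bootstrap.js"></script>
<script>console.log("inline");</script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# src=True keeps only the <script> tags that have a src attribute
external = [s["src"] for s in soup.find_all("script", src=True)]

# recursive=False limits the search to direct children of <body>
top_level = [
    s["src"] for s in soup.body.find_all("script", src=True, recursive=False)
]

# .get() returns None instead of raising KeyError for a missing attribute
types = [s.get("type") for s in soup.find_all("script")]

print(external)
print(top_level)
print(types)
```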

In a similar way, we can select and process the CSS nodes.

...
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/app.css">
...

And the code:

for link in soup.find_all('link'):

   # Print the href attribute
   print(' CSS file = ' + link['href'])
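
Not every link node is a stylesheet (favicons use link as well), so it can help to filter on the rel attribute. This is a small sketch under that assumption; the helper name stylesheet_hrefs and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Invented sample head: one favicon and two stylesheets
html = """
<head>
<link rel="icon" href="favicon.ico">
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/app.css">
</head>
"""
soup = BeautifulSoup(html, "html.parser")


def stylesheet_hrefs(soup):
    """Return the href of every <link> whose rel contains 'stylesheet'."""
    hrefs = []
    for link in soup.find_all("link", href=True):
        # With html.parser, rel is exposed as a list of values
        rel = link.get("rel") or []
        if "stylesheet" in rel:
            hrefs.append(link["href"])
    return hrefs


css_files = stylesheet_hrefs(soup)
print(css_files)
```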

Parse the HTML for Images

In this snippet, we mutate the nodes and change the src attribute of each image node.

...
<img src="images/pic01.jpg" alt="Bred Pitt">
...

for img in soup.body.find_all('img'):

   # Print the path
   print(' IMG src = ' + img['src'])

   img_path = img['src']
   img_file = img_path.split('/')[-1] # extract the last segment, aka image file
   img['src'] = '/assets/img/' + img_file
   # the new path is set
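
The same idea can be packaged as a reusable function. The sketch below is one possible variant, not the article's exact code: the function name rebase_images, the decision to skip absolute and data URLs, and the sample markup are all my own assumptions.

```python
import posixpath

from bs4 import BeautifulSoup


def rebase_images(soup, new_dir="/assets/img/"):
    """Point every local <img> at new_dir, keeping only the file name.

    Returns the number of images that were changed.
    """
    changed = 0
    for img in soup.find_all("img", src=True):
        src = img["src"]
        # Assumption: remote and inline images should be left untouched
        if src.startswith(("http://", "https://", "//", "data:")):
            continue
        img["src"] = posixpath.join(new_dir, posixpath.basename(src))
        changed += 1
    return changed


soup = BeautifulSoup(
    '<body><img src="images/pic01.jpg" alt="a">'
    '<img src="https://cdn.example.com/x.png"></body>',
    "html.parser",
)
count = rebase_images(soup)
sources = [img["src"] for img in soup.find_all("img")]
print(count, sources)
```

Using posixpath keeps forward slashes on every platform, which is what URLs expect.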

Locate an Element Based on Its ID

This can be achieved with a single line of code. Let's assume we have an element (a div or a span) with the id 1234.

...
<div id="1234" class="handsome">
Some text
</div>

And the code:

mydiv = soup.find("div", {"id": "1234"})

print(mydiv) 

# delete the element
mydiv.decompose()
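
When the id is not present, find() returns None and calling decompose() on it raises an AttributeError. A guarded version might look like the sketch below; the helper name remove_by_id and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def remove_by_id(soup, element_id):
    """Delete the element with the given id; return True if one was found."""
    node = soup.find(id=element_id)
    if node is None:
        return False
    node.decompose()  # removes the node and its children from the tree
    return True


soup = BeautifulSoup(
    '<body><div id="1234" class="handsome">Some text</div>'
    '<p id="keep">Stay</p></body>',
    "html.parser",
)
removed = remove_by_id(soup, "1234")
missing = remove_by_id(soup, "nope")
print(removed, missing)  # prints: True False
```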

Remove the Hard-Coded Text

This snippet is useful for extracting components and translating them for different template engines. Let's say we have this simple component:

<div id="1234" class="cool">
   <span>Html Parsing</span>
   <span>the practical guide</span> 
</div>

To use this component in PHP, it becomes:

<div id="1234" class="cool">
   <span><?php echo $title ?></span>
   <span><?php echo $info ?></span> 
</div>

Or, for Jinja2 (a Python template engine):

<div id="1234" class="cool">
   <span>{{ title }}</span>
   <span>{{ info }}</span> 
</div>

To avoid the manual work, we can use a code snippet that replaces the hard-coded text automatically and prepares the component for a specific template engine.

from bs4 import NavigableString

# locate the div
mydiv = soup.find("div", {"id": "1234"})

print(mydiv) # print before processing

# iterate on a snapshot of the div's descendants,
# because the loop modifies the tree
for tag in list(mydiv.descendants):

   # NavigableString is the text inside the tag,
   # not the tag itself
   if not isinstance(tag, NavigableString):

      print( 'Found tag = ' + tag.name + ' -> ' + tag.text )
      # this will print:
      # Found tag = span -> Html Parsing
      # Found tag = span -> the practical guide

      # tag.text is read-only, so assign to tag.string instead.
      # Pick ONE of the two replacements below:

      # replace the text for Php
      tag.string = '<?php echo $title ?>'

      # replace the text for Jinja
      # tag.string = '{{ title }}'
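
The target components shown earlier use a different variable for each span (title and info). One way to get there is to pair each span with a placeholder name, as in this sketch. The names list and the extracted dictionary are assumptions added for illustration; the article itself does not define them.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="1234" class="cool">'
    "<span>Html Parsing</span>"
    "<span>the practical guide</span>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
mydiv = soup.find("div", {"id": "1234"})

# Hypothetical placeholder names, one per span, in document order
names = ["title", "info"]

extracted = {}
for name, span in zip(names, mydiv.find_all("span")):
    # Keep the original text so it can be fed to the template later
    extracted[name] = span.get_text(strip=True)
    # Assigning to .string replaces the tag's text content
    span.string = "{{ " + name + " }}"

jinja_component = str(mydiv)
print(jinja_component)
print(extracted)
```

Keeping the removed text in a dictionary means the same values can later be passed to the template engine as context.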

To use the component, we can save it to a file.


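# mydiv is the processed component
# php_component is the string representation
# formatter=None writes the text as-is; formatter="html" would
# escape '<' and '>' and break the <?php ... ?> tags
php_component = mydiv.prettify(formatter=None)

with open('component.php', 'w') as file:
    file.write(php_component)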
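
The choice of formatter matters when the replaced text contains angle brackets, as PHP tags do. The short sketch below, built on an invented one-span component, compares the output of formatter="html" with formatter=None.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><span>old</span></div>", "html.parser")
soup.span.string = "<?php echo $title ?>"

# formatter="html" converts '<' and '>' in text to entities
escaped = soup.div.prettify(formatter="html")

# formatter=None leaves the text exactly as it was assigned
raw = soup.div.prettify(formatter=None)

print(escaped)
print(raw)
```

For Jinja placeholders such as {{ title }} both formatters give the same result, since the text holds no characters that need escaping.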

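At this point, the original div has been extracted from the DOM, the hard-coded text has been removed, and the component is ready to be used in a PHP or Python project.

Save the New HTML

We now have the mutated DOM in a BeautifulSoup object, in memory. To save the content to a new file, we call prettify() and write the result to a new HTML file.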


new_dom_content = soup.prettify(formatter="html")

with open('index_parsed.html', 'w') as file:
    file.write(new_dom_content)
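
prettify() re-indents the document, which adds whitespace around inline elements. When the original layout should be kept, str(soup) is an alternative. The sketch below compares the two on an invented page and writes the compact form to a temporary file; the UTF-8 encoding and the use of a temporary directory are assumptions made for this example.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

html = "<html><body><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "index_parsed.html"
    # str(soup) serializes the tree without adding indentation
    out.write_text(str(soup), encoding="utf-8")
    saved = out.read_text(encoding="utf-8")

# prettify() puts each tag and text node on its own indented line
pretty = soup.prettify()

print(saved)
print(pretty)
```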

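HTML Parser - Use Cases

I use HTML parsing a lot, especially for tasks that would otherwise involve manual work:

Process HTML themes for use in new projects

Extract components and remove their hard-coded text

Convert flat HTML themes to Jinja, Mustache, or PUG templates

From time to time, I publish free samples in this public repository.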

Resources

HTML Parser - provided by AppSeed

HTML Parser - How to use Python BS4 to work less

Developer Tools - Open-Source HTML Parser - a related article

BeautifulSoup HTML Parser documentation

HTML Parser - Convert HTML to Jinja2 and Php components - a related blog article

Thank you! Btw, my (nick) name is Sm0ke.

Reference

Original article (HTML Parser - Extract HTML Information with Ease): https://dev.to/sm0ke/html-parser-extact-html-information-with-ease-308m
