How to Check Internal and External Links on a Web Page Using Python


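Python is one of the most popular programming languages for data extraction, processing, and analysis. With Python's built-in and third-party libraries, developers can easily pull specific data out of a web page and produce results from that data set.
This article walks through a simple Python script that extracts the links from the web page at a given URL and generates a CSV file listing every link found on that page, along with whether each link is internal or external.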

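Prerequisites

Because this is a Python article built around a program, it goes without saying that you need Python installed on your system, plus a basic knowledge of Python, to test the program yourself.
If you are on a new system, you can easily install the latest version of Python from the official Python downloads page.

To build the program, I will use four Python libraries: two third-party libraries and two built-in ones.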

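Libraries

1. requests
requests is a widely used Python HTTP library. We will use it to make an HTTP request to the URL of the page whose links we want to check.
Because requests is a third-party library, it has to be installed into your Python environment with the pip command.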

pip install requests

2. Beautiful Soup
Beautiful Soup is a third-party Python library for extracting data from HTML and XML files. Web pages are generally HTML documents, so we can use Beautiful Soup to extract the links from a page.

To install Beautiful Soup, use the following command.

pip install beautifulsoup4

3. csv
The csv module ships with Python, and we can use it to write, read, and append to .csv files.

4. datetime
datetime is also a built-in Python module, used for working with dates and times.
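
As a quick illustration of the two built-in modules, here is a minimal sketch that builds a timestamped file name with datetime and writes a couple of rows with csv.DictWriter. It writes to an in-memory io.StringIO buffer rather than a real file, and the helper name make_filename, the fixed date, and the sample rows are assumptions made only for this example.

```python
import csv
import datetime
import io


def make_filename(now):
    # Hypothetical helper: zero-padded fields keep the name unambiguous.
    return now.strftime("Links-%d-%m-%Y %H%M%S.csv")


# A fixed timestamp so the result is reproducible; a script would use datetime.datetime.now().
filename = make_filename(datetime.datetime(2022, 7, 16, 1, 16, 44))

# Write to an in-memory buffer; a real script would use open(filename, "w", newline="").
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["Tested Url", "Link", "Type"])
writer.writeheader()
writer.writerow({"Tested Url": "https://example.com", "Link": "/about", "Type": "Internal"})
writer.writerow({"Tested Url": "https://example.com", "Link": "https://python.org", "Type": "External"})

# Read the rows back to confirm the round trip.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue(), newline="")))
print(filename)
print(rows[0]["Type"], rows[1]["Type"])
```

Using strftime with padded fields is one way to avoid file names where the hour, minute, and second run together ambiguously.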

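Program

Now let's use all four Python modules to write a program that reports every internal and external link on a web page and exports that data to a .csv file.

To keep the program modular, I have split it into three functions.

Function 1: requestMaker(url)

The requestMaker(url) function accepts the URL as a string and sends a GET request to it using the requests.get() method.
After the request, requestMaker() collects the HTML content of the response page and its final URL using the .text and .url attributes.
It then calls the parseLinks(pageHtml, pageUrl) function.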

#make the HTTP request to the given url
def requestMaker(url):
    try:
        #make the get request to the url
        response = requests.get(url)

        #if the request is successful
        if response.status_code in range(200, 300):
            #extract the page html content for parsing the links
            pageHtml = response.text
            pageUrl = response.url

            #call the parseLink function
            parseLinks(pageHtml, pageUrl)

        else:
            print(f"Sorry Could not fetch the result status code {response.status_code}!")

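    #catch only request-related errors so other bugs are not reported as connection failures
    except requests.exceptions.RequestException: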
        print(f"Could Not Connect to url {url}")
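
As a variation on the function above, the sketch below returns the page content instead of calling the parser directly, and adds a timeout so a slow server cannot hang the script. The name fetch_page and the 10-second default are assumptions of mine, not part of the original program, and unlike the status-code range check above it relies on raise_for_status(), which only raises for 4xx and 5xx responses.

```python
import requests


def fetch_page(url, timeout=10):
    """Return (html, final_url) for url, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.exceptions.HTTPError for 4xx/5xx responses.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        # RequestException is the base class for connection, timeout, and HTTP errors.
        print(f"Could not fetch {url}: {error}")
        return None
    return response.text, response.url
```

Returning the data rather than chaining straight into the next function can make each step easier to test on its own.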

Function 2: parseLinks(pageHtml, pageUrl)

The parseLinks() function accepts pageHtml and pageUrl as strings and parses the pageHtml string into a soup object using BeautifulSoup with the HTML parser. Using the soup object's .find_all('a') method, it then collects a list of all the <a> tags present in the HTML page.
After that, parseLinks() calls the extIntLinks(allLinks, pageUrl) function.

#parse all the links from the web page
def parseLinks(pageHtml, pageUrl):
    soup = BeautifulSoup(pageHtml, 'html.parser')

    #get all the <a> elements from the HTML page
    allLinks = soup.find_all('a')

    extIntLinks(allLinks, pageUrl)
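
To see what find_all('a') returns without making a network request, here is a small sketch that parses a hand-written HTML string. The sample markup is made up for illustration, and passing href=True is an optional filter I am adding here so that <a> tags without an href attribute are skipped.

```python
from bs4 import BeautifulSoup

# A small hand-written HTML snippet used only for this illustration.
sample_html = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/docs">Docs</a>
  <a name="top">Anchor without href</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [anchor["href"] for anchor in soup.find_all("a", href=True)]
print(hrefs)
```

An <a> tag without href returns None from anchor.get("href"), so it needs to be filtered or checked before any string method is called on the result.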

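Function 3: extIntLinks(allLinks, pageUrl)

The extIntLinks(allLinks, pageUrl) function performs the following tasks:

Creates a unique .csv file name using the datetime module.

Creates that .csv file in write mode.

Loops over all the extracted <a> links.

Checks whether each link is internal or external.

Writes the data to the csv file.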

def extIntLinks(allLinks, pageUrl):
    #filename 
    currentTime = datetime.datetime.now()
    #create a unique .csv file name using the datetime module
    filename =  f"Links-{currentTime.day}-{currentTime.month}-{currentTime.year} {currentTime.hour}{currentTime.minute}{currentTime.second}.csv"

    with open(filename, 'w', newline='') as csvfile:
        fieldnames = ['Tested Url','Link', 'Type']

        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        internalLinks = 0
        externalLinks = 0 

        #go through all the <a> elements list 
        for anchor in allLinks:
            link = anchor.get("href")   #get the link from the <a> element
            if link is None:   #skip <a> elements that have no href attribute
                continue

            #check if the link is internal
            if link.startswith(pageUrl) or link.startswith("/") or link.startswith("#") :
                writer.writerow({'Tested Url':pageUrl,'Link': link, 'Type': 'Internal'})
                internalLinks+=1
            #if the link is external
            else:
                writer.writerow({'Tested Url':pageUrl,'Link': link, 'Type': 'External'})
                externalLinks+=1
        writer = csv.writer(csvfile)
        writer.writerow(["Total Internal Links", f"{internalLinks}", "Total External Links", f"{externalLinks}"])

        print(f"The page {pageUrl} has {internalLinks} Internal Link(s) and {externalLinks} External Link(s)")
        print(f"And data has been saved in the {filename}")
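
The startswith() checks above treat a relative link such as about.html as external. One possible refinement, sketched below with the standard library's urllib.parse, is to resolve every link against the page URL and compare host names. The function name classify_link is hypothetical, and counting mailto: and similar non-HTTP links as external is my own assumption rather than a rule from the original script.

```python
from urllib.parse import urljoin, urlparse


def classify_link(link, page_url):
    """Return 'Internal' or 'External' for link as found on page_url."""
    # Resolve relative links such as "about.html" or "#top" against the page URL.
    absolute = urljoin(page_url, link)
    parsed = urlparse(absolute)
    # Assumption: non-HTTP schemes (mailto:, tel:, javascript:) count as external.
    if parsed.scheme not in ("http", "https"):
        return "External"
    # Compare host names case-insensitively; subdomains count as different hosts.
    same_host = parsed.netloc.lower() == urlparse(page_url).netloc.lower()
    return "Internal" if same_host else "External"


page = "https://example.com/blog/"
for href in ["about.html", "#top", "/contact", "https://other.org/x", "mailto:me@example.com"]:
    print(href, classify_link(href, page))
```

Note that this sketch treats www.example.com and example.com as different hosts, so you may want to normalize the host name if your site uses both.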

Complete program:

Now we can put the whole program together and run it.

import requests
from bs4 import BeautifulSoup
import csv
import datetime 


def extIntLinks(allLinks, pageUrl):
    #filename 
    currentTime = datetime.datetime.now()
    #create a unique .csv file name using the datetime module
    filename =  f"Links-{currentTime.day}-{currentTime.month}-{currentTime.year} {currentTime.hour}{currentTime.minute}{currentTime.second}.csv"

    with open(filename, 'w', newline='') as csvfile:
        fieldnames = ['Tested Url','Link', 'Type']

        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        internalLinks = 0
        externalLinks = 0 

        #go through all the <a> elements list 
        for anchor in allLinks:
            link = anchor.get("href")   #get the link from the <a> element
            if link is None:   #skip <a> elements that have no href attribute
                continue

            #check if the link is internal
            if link.startswith(pageUrl) or link.startswith("/") or link.startswith("#") :
                writer.writerow({'Tested Url':pageUrl,'Link': link, 'Type': 'Internal'})
                internalLinks+=1
            #if the link is external
            else:
                writer.writerow({'Tested Url':pageUrl,'Link': link, 'Type': 'External'})
                externalLinks+=1
        writer = csv.writer(csvfile)
        writer.writerow(["Total Internal Links", f"{internalLinks}", "Total External Links", f"{externalLinks}"])

        print(f"The page {pageUrl} has {internalLinks} Internal Link(s) and {externalLinks} External Link(s)")
        print(f"And data has been saved in the {filename}")


#parse all the links from the web page
def parseLinks(pageHtml, pageUrl):
    soup = BeautifulSoup(pageHtml, 'html.parser')

    #get all the <a> elements from the HTML page
    allLinks = soup.find_all('a')

    extIntLinks(allLinks, pageUrl)

#make the HTTP request to the given url
def requestMaker(url):
    try:
        #make the get request to the url
        response = requests.get(url)

        #if the request is successful
        if response.status_code in range(200, 300):
            #extract the page html content for parsing the links
            pageHtml = response.text
            pageUrl = response.url

            #call the parseLink function
            parseLinks(pageHtml, pageUrl)

        else:
            print(f"Sorry Could not fetch the result status code {response.status_code}!")

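    #catch only request-related errors so other bugs are not reported as connection failures
    except requests.exceptions.RequestException: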
        print(f"Could Not Connect to url {url}")



if __name__ == "__main__":
    url = input("Enter the URL eg. https://example.com:  ")
    requestMaker(url)

Output

Enter the URL eg. https://example.com:  https://techgeekbuzz.com
The page https://techgeekbuzz.com has 126 Internal Link(s) and 7 External Link(s)
And data has been saved in the Links-16-7-2022 11644.csv

CSV File

You can also download this code from my GitHub.

Happy coding!!

Reference

Original article: How to Check Internal and External Links on a Webpage Using Python, https://dev.to/khatrivinay/how-to-check-internal-and-external-links-on-a-webpage-using-python-520k
