WebScraping [Part-1]

webscraping python webdev

aletisunil / Scraping_IMDB

Scrapes the movie title, year, ratings, genre, and votes

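Hi everyone,
You don't need to be a Python expert for this web scraping tutorial. The basics of HTML and Python are enough.

Let's dive in..

These are the tools we'll use:

Requests lets us send HTTP requests to fetch HTML files

BeautifulSoup helps us parse the HTML files

Pandas helps us assemble the data into a DataFrame so we can clean and analyze it

csv (optional) - if you want to share the data in csv file format

Let's begin..

In this tutorial we'll scrape IMDB, a website where we can get the title, year, rating, genre, and so on.

First, import the tools for building the scraper.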

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

Fetch the contents of the web page into the results variable

url = "https://www.imdb.com/search/title/?groups=top_1000"
results = requests.get(url)
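A plain requests.get call does not raise on a 4xx or 5xx response, so a slightly more defensive fetch can help. The sketch below is an assumption, not part of the original code: the helper name fetch_html, the header values, and the 10-second timeout are all illustrative.

```python
import requests

# Hypothetical header values; adjust them for your own use.
HEADERS = {
    "Accept-Language": "en-US,en;q=0.8",
    "User-Agent": "Mozilla/5.0 (tutorial scraper)",
}


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```

With this helper, fetch_html(url) would take the place of results.text in the parsing step.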

To make the content easier to work with, we use BeautifulSoup, and the parsed content is stored in the soup variable.

soup = BeautifulSoup(results.text, "html.parser")
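To see what parsing gives us without touching the network, here is a minimal sketch on a made-up HTML string. The names sample_html and sample_soup are placeholders and the markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for results.text
sample_html = (
    "<html><head><title>Top 1000</title></head>"
    "<body><p>Hello</p></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.title.text)  # text inside <title>
print(sample_soup.p.text)      # text inside the first <p>
```

Tags become attributes of the parsed object, which is what the later movieSection.h3.a.text lookups rely on.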

Now initialize the lists that will store the data.

titles = []        # Stores the title of the movie
years = []         # Stores the release year of the movie
time = []          # Stores the movie duration
imdb_ratings = []  # Stores the rating of the movie
genre = []         # Stores the genre details of the movie
votes = []         # Stores the number of votes for the movie

Now use the browser's Inspect tool and hover over a movie's div to find the right movie container.

You will see 50 divs with the class name lister-item mode-advanced, so find all the divs that have that class name.

movie_div = soup.find_all("div", class_="lister-item mode-advanced")

The find_all method extracts every div whose class name is "lister-item mode-advanced".
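One detail worth knowing: when class_ is given a string with a space in it, BeautifulSoup matches the exact value of the class attribute, so the order of the class names matters. The sketch below uses invented markup to show that, along with a CSS selector alternative that does not depend on order. The variable names are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up markup: two matching containers and one that should not match
sample_html = """
<div class="lister-item mode-advanced"><h3><a>Movie A</a></h3></div>
<div class="lister-item mode-advanced"><h3><a>Movie B</a></h3></div>
<div class="lister-item"><h3><a>Not a match</a></h3></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Matches the exact class attribute string
exact = sample_soup.find_all("div", class_="lister-item mode-advanced")

# CSS selector: requires both classes, in any order
by_css = sample_soup.select("div.lister-item.mode-advanced")

print(len(exact), len(by_css))
```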

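Now we go into each lister-item mode-advanced div and get the title, year, rating, genre, and movie duration.

So we loop over all the divs to get the title, year, rating, and so on. The snippets that follow are the body of this loop.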

for movieSection in movie_div:

Extracting the title

Inspecting the page shows that the movie name sits under div > h3 > a.

    name = movieSection.h3.a.text  # movieSection is the loop variable for the current div
    titles.append(name)            # append the movie name to the titles list

Extracting the year

Inspecting the page shows that the release year sits under div > h3 > span (class name "lister-item-year"), and we extract it with the text attribute.

    year = movieSection.h3.find("span", class_="lister-item-year").text
    years.append(year)  # append to the years list
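To try the title and year lookups in isolation, here is a small sketch on invented markup shaped like one movie container. The title, link, and year text are made up. Note that the year comes back with its parentheses, which is why cleaning is needed later.

```python
from bs4 import BeautifulSoup

# Made-up markup shaped like one movie container
sample_html = """
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">
    <span class="lister-item-index">1.</span>
    <a href="/title/tt0000000/">Example Movie</a>
    <span class="lister-item-year">(I) (2019)</span>
  </h3>
</div>
"""
movie = BeautifulSoup(sample_html, "html.parser").div

title = movie.h3.a.text  # first <a> inside the <h3>
year = movie.h3.find("span", class_="lister-item-year").text
print(title, year)
```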

Similarly, we can get the rating, genre, and movie duration by class name.

    ratings = movieSection.strong.text
    imdb_ratings.append(ratings)   # append the rating to the list
    category = movieSection.find("span", class_="genre").text.strip()
    genre.append(category)         # append the category to the genre list
    runTime = movieSection.find("span", class_="runtime").text
    time.append(runTime)           # append runTime to the time list
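find returns None when an element is missing, and calling .text on None raises AttributeError. If some entries lack a field, a small guard keeps the lists the same length. This is a sketch under that assumption: text_or_none is a hypothetical helper, not something from the original code, and the markup is invented.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag_name, **filters):
    """Return the stripped text of the first match, or None if absent."""
    found = parent.find(tag_name, **filters)
    return found.text.strip() if found is not None else None


# Made-up container that has a genre but no runtime
sample_html = """
<div class="lister-item mode-advanced">
  <span class="genre">
Drama, Crime            </span>
</div>
"""
movie = BeautifulSoup(sample_html, "html.parser").div

genre_text = text_or_none(movie, "span", class_="genre")
runtime_text = text_or_none(movie, "span", class_="runtime")
print(genre_text, runtime_text)
```

Appending None for a missing value keeps every list aligned row by row, which matters when the lists are later combined into a DataFrame.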

Extracting the votes

Inspecting the page shows two span tags with the attribute name="nv". So we use nv[0] for the votes and nv[1] for the gross collection.

    nv = movieSection.find_all("span", attrs={"name": "nv"})
    vote = nv[0].text
    votes.append(vote)
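The attrs dictionary is needed here because name is already the first parameter of find_all (the tag name), so the HTML attribute called name cannot be passed as a keyword. The sketch below uses invented markup and numbers, and also guards against entries that have no second nv span. The gross variable is an addition for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup with a votes span and a gross span
sample_html = """
<div class="lister-item mode-advanced">
  <p class="sort-num_votes-visible">
    <span name="nv" data-value="2500000">2,500,000</span>
    <span name="nv" data-value="28,341,469">$28.34M</span>
  </p>
</div>
"""
movie = BeautifulSoup(sample_html, "html.parser").div

nv = movie.find_all("span", attrs={"name": "nv"})
vote = nv[0].text
gross = nv[1].text if len(nv) > 1 else None  # not every entry lists a gross
print(vote, gross)
```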

Now we'll build a DataFrame with pandas.
To store the data, we need to arrange it neatly into a table.
And we can do that..

movies = pd.DataFrame(
    {
        "Movie": titles,
        "Year": years,
        "RunTime": time,
        "imdb": imdb_ratings,
        "Genre": genre,
        "votes": votes,
    }
)
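Building a DataFrame from a dictionary of lists requires every list to have the same length; otherwise pandas raises a ValueError. The sketch below shows both cases with made-up values. The names ok and mismatch_error are placeholders.

```python
import pandas as pd

# Equal-length lists build a table with one row per movie
ok = pd.DataFrame({"Movie": ["A", "B"], "Year": ["(1994)", "(I) (2019)"]})
print(ok.shape)

# Lists of different lengths are rejected
try:
    pd.DataFrame({"Movie": ["A", "B"], "Year": ["(1994)"]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
```

If this error appears with scraped data, it usually means one of the fields was missing for some movie and its list ended up shorter than the others.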

Now let's print the DataFrame.

As rows 16 and 25 show, some of the data is inconsistent. So we need to clean it

movies["Year"] = movies["Year"].str.extract("(\\d+)").astype(int)  # keep only the digits, so prefixes such as "(I)" are omitted
movies["RunTime"] = movies["RunTime"].str.replace("min", "minutes")  # replace "min" with "minutes"
movies["votes"] = movies["votes"].str.replace(",", "").astype(int)  # remove "," so the value can be converted to int
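To check the cleaning steps on their own, here is a sketch on a tiny made-up DataFrame. It is a variation rather than the code above: it passes expand=False so that str.extract returns a Series, uses regex=False for the literal comma, and converts RunTime to an integer number of minutes instead of rewriting the unit.

```python
import pandas as pd

# Made-up rows in the same raw shape as the scraped data
sample = pd.DataFrame(
    {
        "Year": ["(1994)", "(I) (2019)"],
        "RunTime": ["142 min", "132 min"],
        "votes": ["2,500,000", "750,123"],
    }
)

# First run of digits in each string, converted to int
sample["Year"] = sample["Year"].str.extract(r"(\d+)", expand=False).astype(int)
sample["RunTime"] = sample["RunTime"].str.extract(r"(\d+)", expand=False).astype(int)

# Drop the thousands separators, then convert to int
sample["votes"] = sample["votes"].str.replace(",", "", regex=False).astype(int)

print(sample.dtypes)
```

Numeric columns make later steps such as sorting by votes or averaging runtimes straightforward.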

Now let's see how it looks after cleaning

print(movies)

You can also export the data in .csv file format.
To export it,
create a file with the .csv file extension

movies.to_csv(r"C:\Users\Aleti Sunil\Downloads\movies.csv", index=False, header=True)
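The path above is specific to one Windows machine. As a portable sketch, the example below writes a made-up DataFrame to a temporary directory and reads it back to confirm the round trip. The temporary directory and the sample values are assumptions for illustration; in practice you would pick your own output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

sample = pd.DataFrame({"Movie": ["A", "B"], "votes": [2500000, 750123]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "movies.csv"
    sample.to_csv(csv_path, index=False)  # index=False skips the row numbers
    restored = pd.read_csv(csv_path)

print(restored)
```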

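You can get the final code from my GitHub repo.

In the next part, I'll explain how to loop through all the pages of this IMDb list to get all 1,000 movies. That will require a few changes to the final code here.

Hope it's useful
A ❤️ would be awesome 😊

Happy Coding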

Reference

Original article (WebScraping [Part-1]): https://dev.to/sunilaleti/demystify-the-webscraping-part-1-3d5c
