Scrape all Naver Video Results using Pagination in Python


What will be scraped

Prerequisites

Full Code

Links

Outro

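What will be scraped

Title, link, thumbnail, origin, views, date published, and channel from all results.

📌Note: Naver Search does not provide more than 600 video search results "for the best search result quality". This is the message you'll see when you reach the bottom of the search results.

However, 1008 results were scraped during multiple tests. Possibly it's because Naver is constantly changing things.

Testing CSS selectors with the SelectorGadget Chrome extension:

Testing CSS selectors in the console: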

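Prerequisites

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from matching tags and attributes.

If you haven't scraped with CSS selectors before, there's a dedicated blog post of mine, how to use CSS selectors when web-scraping, that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.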

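Separate virtual environment

If you haven't worked with a virtual environment before, have a look at the dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post of mine to get familiar.

In short, a virtual environment creates an independent set of installed libraries, optionally with a different Python version, that can coexist with others on the same system, thus preventing library or Python version conflicts.

📌Note: this is not a strict requirement for this blog post.

Install libraries:

```bash
pip install requests parsel playwright

# Playwright also needs a browser build to drive (used in the browser automation part)
playwright install chromium
```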

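Full Code

This section is split into two parts:

| Method | Used libraries |
|---|---|
| parse data without browser automation | `requests` and `parsel`, which is a `bs4` analog that supports XPath. |
| parse data with browser automation | `playwright`, which is a modern `selenium` analog. |

Scrape all Naver video results without browser automation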

```python
import json

import requests
from parsel import Selector

params = {
    "start": 0,            # result offset, increased by 48 for every next page
    "display": "48",       # videos to display. Hard limit.
    "query": "minecraft",  # search query
    "where": "video",      # Naver videos search engine
    "sort": "rel",         # sorted as you would see in the browser
    "video_more": "1"      # required to receive JSON data
}

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
}

video_results = []

html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)
# strip the wrapping "( ... )" so the response can be parsed as JSON
json_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))
html_data = json_data["aData"]

while params["start"] <= int(json_data["maxCount"]):
    for result in html_data:
        selector = Selector(result)

        for video in selector.css(".video_bx"):
            title = video.css(".text").xpath("normalize-space()").get().strip()
            link = video.css(".info_title::attr(href)").get()
            thumbnail = video.css(".thumb_area img::attr(src)").get()
            channel = video.css(".channel::text").get()
            origin = video.css(".origin::text").get()
            video_duration = video.css(".time::text").get()
            views = video.css(".desc_group .desc:nth-child(1)::text").get()
            date_published = video.css(".desc_group .desc:nth-child(2)::text").get()

            video_results.append({
                "title": title,
                "link": link,
                "thumbnail": thumbnail,
                "channel": channel,
                "origin": origin,
                "video_duration": video_duration,
                "views": views,
                "date_published": date_published
            })

    # move to the next page and fetch it
    params["start"] += 48
    html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)
    html_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))["aData"]

print(json.dumps(video_results, indent=2, ensure_ascii=False))
```

Create URL parameters and request headers:

```python
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "start": 0,           # result offset, increased by 48 for every next page
    "display": "48",      # videos to display. Hard limit.
    "query": "minecraft", # search query
    "where": "video",     # Naver videos search engine
    "sort": "rel",        # sorted as you would see in the browser
    "video_more": "1"     # purpose unknown, but required to receive JSON data
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
}
```
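To see roughly what `requests` does with the `params` dictionary, the following sketch builds the final URL by hand with the standard library's `urlencode`. It is only an illustration: `requests` performs a comparable encoding itself when `params=` is passed, so this step is not needed in the scraper.

```python
from urllib.parse import urlencode

base_url = "https://s.search.naver.com/p/video/search.naver"

params = {
    "start": 0,
    "display": "48",
    "query": "minecraft",
    "where": "video",
    "sort": "rel",
    "video_more": "1",
}

# key=value pairs joined with "&", in dictionary order
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
```

Opening such a URL in a browser is a convenient way to inspect the raw response before writing any parsing code.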

Create a temporary `list` to store parsed data:

```python
video_results = []
```

Pass `headers` and URL `params`, and make a request to get JSON data:

```python
html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)

# strip the wrapping "( ... )" so the response can be parsed as JSON
json_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))
html_data = json_data["aData"]
```

| Code | Explanation |
|---|---|
| `timeout=30` | to stop waiting for a response after 30 seconds. |
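The chained `.replace()` calls depend on the exact characters that wrap the response. A slightly more tolerant alternative, sketched below, keeps everything between the first `{` and the last `}` before handing it to `json.loads`. The helper name `unwrap_json` and the sample payload are hypothetical; only the `maxCount` and `aData` keys come from this post.

```python
import json


def unwrap_json(text: str) -> dict:
    """Parse a JSON object that is wrapped in extra characters such as "( ... )"."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response text")
    return json.loads(text[start:end + 1])


# Made-up payload shaped like the wrapped response described in this post
sample = '( {"maxCount": "1000", "aData": ["<div class=\\"video_bx\\"></div>"]})'

data = unwrap_json(sample)
print(data["maxCount"], len(data["aData"]))
```

In the scraper this would stand in for the `json.loads(html.text.replace(...))` expression, assuming the response is always a single JSON object.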

Returned JSON data from `json_data`:

Actual HTML returned from `html_data`, more precisely from `json_data["aData"]` (saved and opened in the browser):

Create a `while` loop to extract all available video results:

```python
while params["start"] <= int(json_data["maxCount"]):
    for result in html_data:
        selector = Selector(result)

        for video in selector.css(".video_bx"):
            title = video.css(".text").xpath("normalize-space()").get().strip()
            link = video.css(".info_title::attr(href)").get()
            thumbnail = video.css(".thumb_area img::attr(src)").get()
            channel = video.css(".channel::text").get()
            origin = video.css(".origin::text").get()
            video_duration = video.css(".time::text").get()
            views = video.css(".desc_group .desc:nth-child(1)::text").get()
            date_published = video.css(".desc_group .desc:nth-child(2)::text").get()

            video_results.append({
                "title": title,
                "link": link,
                "thumbnail": thumbnail,
                "channel": channel,
                "origin": origin,
                "video_duration": video_duration,
                "views": views,
                "date_published": date_published
            })

    # runs once per page, after every result on the current page is processed
    params["start"] += 48

    # update previous page to a new page
    html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)
    html_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))["aData"]
```

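| Code | Explanation |
|---|---|
| `while params["start"] <= int(json_data["maxCount"])` | to iterate until `params["start"]` passes `["maxCount"]`, which is a hard limit of 1000 results. |
| `xpath("normalize-space()")` | to get the whole text of the element, since `parsel` translates every CSS query to XPath, and XPath's `text()` ignores blank text nodes and returns only the first text element. |
| `::text` or `::attr(href)` | `parsel`'s own CSS pseudo-elements, which extract text or attributes accordingly. |
| `params["start"] += 48` | to increment to the next page of results: `48, 96, 144, 192 ...` |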
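The `start` values the loop walks through can be listed up front. The sketch below reproduces the loop condition (`start <= maxCount`) with `range`; `page_offsets` is a hypothetical helper, and the 1000 passed to it is the `maxCount` limit mentioned in this post.

```python
def page_offsets(max_count: int, page_size: int = 48) -> list:
    """Return every `start` value for which `start <= max_count` holds."""
    return list(range(0, max_count + 1, page_size))


offsets = page_offsets(1000)

print(offsets[:5])   # first five offsets
print(offsets[-1])   # last offset that still satisfies the loop condition
print(len(offsets))  # number of pages the loop would process
```

Under these assumptions the loop processes 21 pages of up to 48 videos each, i.e. at most 1008 results, which lines up with the 1008 figure reported at the top of this post.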

Output:

```python
print(json.dumps(video_results, indent=2, ensure_ascii=False))
```

```json
[
  {
    "title": "Minecraft : 🏰 How to build a Survival Castle Tower house",
    "link": "https://www.youtube.com/watch?v=iU-xjhgU2vQ",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2FiU-xjhgU2vQ%2Fmqdefault.jpg&type=ac612_350",
    "channel": "소피 Sopypie",
    "origin": "Youtube",
    "video_duration": "25:27",
    "views": "126",
    "date_published": "1일 전"
  },
  {
    "title": "조금 혼란스러울 수 있는 마인크래프트 [ Minecraft ASMR Tower ]",
    "link": "https://www.youtube.com/watch?v=y8x8oDAek_w",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fy8x8oDAek_w%2Fmqdefault.jpg&type=ac612_350",
    "channel": "세빈 XEBIN",
    "origin": "Youtube",
    "video_duration": "00:58",
    "views": "1,262",
    "date_published": "2021.11.13."
  }
]
```
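The `date_published` values come back in two shapes: absolute dates such as `2021.11.13.` and relative ones such as `1일 전` ("1 day ago"). If a uniform date is needed, something like the following could normalize them. `parse_published` is a hypothetical helper and covers only the formats that appear in the sample outputs of this post (days ago, hours ago, absolute dates); anything else returns `None`.

```python
import re
from datetime import date, datetime, timedelta
from typing import Optional


def parse_published(text: str, now: datetime) -> Optional[date]:
    """Convert a `date_published` string to a date, or None if unrecognized."""
    text = text.strip()

    # absolute form, e.g. "2021.11.13."
    absolute = re.fullmatch(r"(\d{4})\.(\d{1,2})\.(\d{1,2})\.?", text)
    if absolute:
        year, month, day = map(int, absolute.groups())
        return date(year, month, day)

    # relative form, e.g. "1일 전" (days ago) or "20시간 전" (hours ago)
    relative = re.fullmatch(r"(\d+)\s*(일|시간)\s*전", text)
    if relative:
        amount, unit = int(relative.group(1)), relative.group(2)
        delta = timedelta(days=amount) if unit == "일" else timedelta(hours=amount)
        return (now - delta).date()

    return None


# a fixed "now" keeps the example reproducible
now = datetime(2021, 11, 14, 12, 0)

print(parse_published("2021.11.13.", now))
print(parse_published("1일 전", now))
print(parse_published("20시간 전", now))
```

Passing `now` explicitly, instead of calling `datetime.now()` inside the function, makes the relative branch easy to test.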

  




Scrape all Naver video results with browser automation

```python
from playwright.sync_api import sync_playwright
import json

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://search.naver.com/search.naver?where=video&query=minecraft")

    video_results = []

    not_reached_end = True
    while not_reached_end:
        page.evaluate("""let scrollingElement = (document.scrollingElement || document.body);
                         scrollingElement.scrollTop = scrollingElement.scrollHeight;""")

        if page.locator("#video_max_display").is_visible():
            not_reached_end = False

    for index, video in enumerate(page.query_selector_all(".video_bx"), start=1):
        title = video.query_selector(".text").inner_text()
        link = video.query_selector(".info_title").get_attribute("href")
        thumbnail = video.query_selector(".thumb_area img").get_attribute("src")
        channel = None if video.query_selector(".channel") is None else video.query_selector(".channel").inner_text()
        origin = video.query_selector(".origin").inner_text()
        video_duration = video.query_selector(".time").inner_text()
        views = video.query_selector(".desc_group .desc:nth-child(1)").inner_text()
        date_published = None if video.query_selector(".desc_group .desc:nth-child(2)") is None else \
            video.query_selector(".desc_group .desc:nth-child(2)").inner_text()

        video_results.append({
            "position": index,
            "title": title,
            "link": link,
            "thumbnail": thumbnail,
            "channel": channel,
            "origin": origin,
            "video_duration": video_duration,
            "views": views,
            "date_published": date_published
        })

    print(json.dumps(video_results, indent=2, ensure_ascii=False))

    browser.close()
```

Launch a Chromium browser and make a request:

```python
# also supports async
with sync_playwright() as p:
    # launches Chromium, opens a new page and makes a request
    browser = p.chromium.launch(headless=False) # or firefox, webkit
    page = browser.new_page()
    page.goto("https://search.naver.com/search.naver?where=video&query=minecraft")
```

Create a temporary `list` to store extracted data:

```python
video_results = []
```

Create a `while` loop and check for a condition that stops the scrolling:

```python
not_reached_end = True
while not_reached_end:
    # scroll to the bottom of the page
    page.evaluate("""let scrollingElement = (document.scrollingElement || document.body);
                     scrollingElement.scrollTop = scrollingElement.scrollHeight;""")

    # break out of the while loop when the bottom of the video results is hit
    # looks for text at the bottom of the results:
    # "Naver Search does not provide more than 600 video search results..."
    if page.locator("#video_max_display").is_visible():
        not_reached_end = False
```




| Code | Explanation |
|---|---|
| `page.evaluate()` | to run JavaScript expressions. You can also use `playwright` keyboard keys and shortcuts to do the same thing. |
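One thing to keep in mind is that the scrolling loop never ends if `#video_max_display` does not show up, for example when a query has fewer results than the limit. A possible variation, sketched below, caps the number of scroll attempts. The function name, `max_scrolls`, and `pause_ms` are assumptions made for this example; `page` is expected to be a Playwright `Page`.

```python
def scroll_to_end(page, end_selector="#video_max_display", max_scrolls=300, pause_ms=500):
    """Scroll down until `end_selector` is visible; give up after `max_scrolls` tries.

    Returns True if the end marker was seen, False if the cap was reached first.
    """
    for _ in range(max_scrolls):
        # jump to the bottom of the page to trigger loading of more results
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")

        if page.locator(end_selector).is_visible():
            return True

        # give the page a moment to load the next batch of results
        page.wait_for_timeout(pause_ms)

    return False
```

Inside the `with sync_playwright() as p:` block it would be called as `scroll_to_end(page)` in place of the `while` loop. The default values are guesses and would need tuning against the real page.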


Iterate over the scrolled results and `append` them to the temporary `list`:

```python
for index, video in enumerate(page.query_selector_all(".video_bx"), start=1):
    title = video.query_selector(".text").inner_text()
    link = video.query_selector(".info_title").get_attribute("href")
    thumbnail = video.query_selector(".thumb_area img").get_attribute("src")

    # return None if no result is displayed from Naver.
    # "is None" used because query_selector() returns a NoneType (None) object:
    # https://playwright.dev/python/docs/api/class-page#page-query-selector
    channel = None if video.query_selector(".channel") is None else video.query_selector(".channel").inner_text()
    origin = video.query_selector(".origin").inner_text()
    video_duration = video.query_selector(".time").inner_text()
    views = video.query_selector(".desc_group .desc:nth-child(1)").inner_text()
    date_published = None if video.query_selector(".desc_group .desc:nth-child(2)") is None else \
        video.query_selector(".desc_group .desc:nth-child(2)").inner_text()

    video_results.append({
        "position": index,
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "channel": channel,
        "origin": origin,
        "video_duration": video_duration,
        "views": views,
        "date_published": date_published
    })
```




| Code | Explanation |
|---|---|
| `enumerate()` | to get the index position of each video. |
| `query_selector_all()` | to return a `list` of matches. Default: `[]` |
| `query_selector()` | to return a single match. Default: `None` |
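Because `query_selector()` returns `None` when nothing matches, the `None if ... else ...` pattern calls it twice per field. A small helper can do the lookup once. `inner_text_or_none` is a hypothetical name introduced for this sketch, not part of Playwright.

```python
def inner_text_or_none(parent, css_selector):
    """Return the inner text of the first match under `parent`, or None if absent.

    `parent` is anything with a Playwright-style `query_selector()` method,
    such as a Page or an ElementHandle.
    """
    element = parent.query_selector(css_selector)
    return element.inner_text() if element is not None else None
```

Inside the loop it could be used as `channel = inner_text_or_none(video, ".channel")`, and likewise for any other field that may be missing from a result.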


Close the browser instance after the data has been extracted:

```python
browser.close()
```

Output:

```json
[
  {
    "position": 1,
    "title": "Minecraft : 🏰 How to build a Survival Castle Tower house",
    "link": "https://www.youtube.com/watch?v=iU-xjhgU2vQ",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2FiU-xjhgU2vQ%2Fmqdefault.jpg&type=ac612_350",
    "channel": "소피 Sopypie",
    "origin": "Youtube",
    "video_duration": "25:27",
    "views": "재생수126",
    "date_published": "20시간 전"
  },
  {
    "position": 1008,
    "title": "Titanic [Minecraft] V3 | 타이타닉 [마인크래프트] V3",
    "link": "https://www.youtube.com/watch?v=K39joThAoC0",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2FK39joThAoC0%2Fmqdefault.jpg&type=ac612_350",
    "channel": "나이아Naia",
    "origin": "Youtube",
    "video_duration": "02:40",
    "views": "재생수22",
    "date_published": "2021.11.11."
  }
]
```
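Note that `views` differs between the two approaches: the `parsel` version returns values like `126`, while `inner_text()` also includes the `재생수` ("play count") label. If numbers are needed, a helper along these lines could normalize both. `parse_views` is a hypothetical name, and it only handles plain digits with optional thousands separators, which is all that appears in the sample outputs of this post.

```python
import re
from typing import Optional


def parse_views(text: str) -> Optional[int]:
    """Extract an integer view count from strings like "재생수126" or "1,262"."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # drop thousands separators before converting
    return int(match.group().replace(",", ""))


print(parse_views("재생수126"))
print(parse_views("1,262"))
```

Abbreviated counts, if Naver ever returns them, are not covered and would need their own handling.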

  




Links


Code in the online IDE





Outro

This blog post is for informational purposes only. Use the received information for useful purposes, for example, if you know how to help improve Naver's service.

If you have anything to share, any questions, suggestions, or something that isn't working correctly, reach out via Twitter.

Yours, 

Dmitriy, and the rest of SerpApi Team.




Join us on Reddit

Add a Feature Request💫 or a Bug🐞

Reference

Original article: https://dev.to/dmitryzub/scrape-all-naver-video-results-using-pagination-in-python-5eph
