Crawling Google Search Results - Part 2: Crawling Video


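In the previous part of this tutorial, we built a very simple spider that crawls Google's regular search results. In this part, we will go a step further.
Warning: Do not use this spider to scrape large amounts of data. Google provides a public API that you can call 100 times for free, and if Google detects unusual traffic from your computer, your IP will be banned. This spider is for learning purposes only and should not be used in real projects. With that in mind, let's begin.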

What are we crawling?

You may not be entirely sure what I mean by crawling video, so let me explain. When you search for Python on Google, for example, a card containing videos related to the search appears in the results.

The section titled Videos is the part we need to crawl. Looks simple, right? Well, it is not as simple as the first part.

Analysis time

Now that we know what we are building, let's look at the web page.
If you inspect the results, you will see that they are wrapped in a g-scrolling-carousel element.

Inside it are g-inner-card elements, each of which holds the details of one video.

Now that we have all the containers, let's look at the details. First, we need the video title. It is inside a div element with the attribute role="heading".

...and the link is in the href attribute of an a element.

Next, find the video author. It is in a div with the following style:

max-height:1.5800000429153442em;min-height:1.5800000429153442em;font-size:14px;padding:2px 0 0;line-height:1.5800000429153442em

We also need the video's source, or platform, such as YouTube. It is inside a span whose parent is a div with the style font-size:14px;padding:1px 0 0 0;line-height:1.5800000429153442em.

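We will also get the video's upload date. It sits right after the video source span, inside the same parent element. When we get to the code, we will simply strip the source's text out of the container's text to get the date.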

Finally, we will find the video length. It is in the second child of a div with the style height:118px;width:212px.

Where are the covers?

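You may be curious about the video cover images. So where are they? Well, they are in the JavaScript. If you look closely at the video details, you will see three <script> tags that contain Base64 images. Copy one of them and you have the video cover.

Now that we know where they are, let's see how to locate them. The simplest way would be to find their parent <div> and then find all of its script tags. But there are a lot of them! The approach I use is to locate their sibling element, <span id="fld"></span>. Once we have it, we can locate the script tags that are its siblings. The tags we are looking for are the last three script elements, not including the first one. In Python we can drop the first one with the slice [1:].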
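
To make the "Base64 image" idea concrete, here is a minimal sketch of turning such a data URI into raw image bytes using only the standard library. The helper name data_uri_to_bytes is hypothetical, and the sketch assumes the string is already a clean data URI of the form data:<mime>;base64,<payload>; anything left over from the surrounding JavaScript would have to be removed first.

```python
import base64


def data_uri_to_bytes(data_uri: str) -> bytes:
    """Decode a base64 data URI into raw bytes (hypothetical helper)."""
    # Split "data:image/jpeg;base64" from the payload at the first comma
    header, _, payload = data_uri.partition(',')
    if not header.startswith('data:') or not header.endswith(';base64'):
        raise ValueError('not a base64 data URI')
    # Restore any "=" padding that was lost, since b64decode requires it
    payload += '=' * (-len(payload) % 4)
    return base64.b64decode(payload)


# "aGVsbG8" is the base64 form of b"hello" with its padding removed
print(data_uri_to_bytes('data:text/plain;base64,aGVsbG8'))
```

The returned bytes could then be written to a file opened in binary mode if you want to look at the cover.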

OK, let's get started!

Create a method named __search_video, which is where all of the code will go.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        """Search for video results based on the given response

        Args:
            response (requests.Response): the response requested to Google search

        Returns:
            list: A list of found video results, usually three if found
        """
        pass

Create a BeautifulSoup object from the response.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        """Search for video results based on the given response

        Args:
            response (requests.Response): the response requested to Google search

        Returns:
            list: A list of found video results, usually three if found
        """
        soup = BeautifulSoup(response.text, 'html.parser')

Then let's find our g-inner-card elements.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        cards = soup.find('g-scrolling-carousel').findAll('g-inner-card')

Loop over the cards to generate the search results.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        results = []
        # Generate video information
        for card in cards:
            try:  # Just in case
                # Title
                title = card.find('div', role='heading').text
                # Video length
                length = card.findAll('div', style='height:118px;width:212px')[
                    1].findAll('div')[1].text
                # Video upload author
                author = card.find(
                    'div', style='max-height:1.5800000429153442em;min-height:1.5800000429153442em;font-size:14px;padding:2px 0 0;line-height:1.5800000429153442em').text
                # Video source (Youtube, for example)
                source = card.find(
                    'span', style='font-size:14px;padding:1px 0 0 0;line-height:1.5800000429153442em').text
                # Video publish date
                date = card.find(
                    'div', style='font-size:14px;padding:1px 0 0 0;line-height:1.5800000429153442em').text.lstrip(source).lstrip('- ')  # Strip the source out because they're in the same container
                # Video link
                url = card.find('a')['href']
            except IndexError:
                continue
            else:
                # Append result
                results.append({
                    'title': title,
                    'length': length,
                    'author': author,
                    'source': source,
                    'date': date,
                    'url': url,
                    'type': 'video'
                })
        return results
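
One detail in the date line is worth a closer look: str.lstrip treats its argument as a set of characters to remove, not as a prefix. For a source such as YouTube this happens to give the right answer, but it can eat into the date when the source name contains spaces or hyphens. The sketch below shows the difference with made-up strings; strip_source is a hypothetical helper, not part of the original spider.

```python
def strip_source(text: str, source: str) -> str:
    """Remove a leading source name and the ' - ' separator after it."""
    # Remove the source only when it is a real prefix of the text
    if text.startswith(source):
        text = text[len(source):]
    # Then drop the separator characters between source and date
    return text.lstrip('- ')


# Made-up container text and source name, for illustration only
text = 'Web-TV One - Oct 3, 2020'
source = 'Web-TV One'

# lstrip removes every leading character found in the source name,
# so the "O" of "Oct" disappears as well
print(repr(text.lstrip(source).lstrip('- ')))
# The prefix-based helper stops at the end of the source name
print(repr(strip_source(text, source)))
```

If you only target Python 3.9 or later, str.removeprefix does the same prefix check in one call.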

Finally, let's build the cover part together. First, create a variable named covers_ to store the script tags we find. I take the slice [1:] of the list to drop the first tag.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        # Pre-process the video covers
        covers_ = soup.find('span', id='fld').findNextSiblings('script')[1:]
        # ...

Then we loop through them.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        # Pre-process the video covers
        covers_ = soup.find('span', id='fld').findNextSiblings('script')[1:]
        # Get the cover images
        covers = []
        for c in covers_:
            pass  # TODO: extract the image from each script tag
        # ...

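Append each base64-encoded image to the covers list. Here we need to remove the unneeded JavaScript code and keep only the image. In case you don't know the rsplit I use in the code, it is a variant of split that scans from the end of the string instead of the beginning. For example, suppose we have a variable named text: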

>>> text = 'Hi everyone! Would you like to say Hi to me?'

If we split it the normal way:

>>> text.split('Hi', 1)
['', ' everyone! Would you like to say Hi to me?']

But with rsplit:

>>> text.rsplit('Hi', 1)
['Hi everyone! Would you like to say ', ' to me?']

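The reason I use it here is that there might be another ;var ii inside the base64 image; if I used split, the result would be a broken image. In the code, rsplit('\\', 1) cuts the string at its last backslash, so whatever follows that backslash is discarded.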

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        # Pre-process the video covers
        covers_ = soup.find('span', id='fld').findNextSiblings('script')[1:]
        # Get the cover images
        covers = []
        for c in covers_:
            # Fetch cover image
            try:
                covers.append(str(c).split('s=\'')[-1].split(
                    '\';var ii')[0].rsplit('\\', 1)[0])
            except IndexError:
                pass
        # ...
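
To see what that chain of split and rsplit calls does, here is a minimal sketch that runs the same expression on a made-up script string. The shape of sample is an assumption for illustration, not markup captured from Google, and extract_cover is a hypothetical helper name.

```python
def extract_cover(script_text: str) -> str:
    """Apply the same split/rsplit chain to one script's text."""
    # Keep everything after the last "s='"
    after_start = script_text.split("s='")[-1]
    # Keep everything before the first "';var ii"
    before_end = after_start.split("';var ii")[0]
    # Cut at the last backslash; unchanged if there is no backslash
    return before_end.rsplit('\\', 1)[0]


# Made-up example; a raw string keeps the backslash as a literal character
sample = r"(function(){var s='data:image/jpeg;base64,/9j/4AAQ\x3d';var ii=['dimg_1'];})();"
print(extract_cover(sample))
```

Note that only the part after the last backslash is dropped, so a string ending in two escape sequences would keep the first one.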

...and add the cover to each result.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        for card in cards:
            # ...
            try:  # Just in case
                # Video cover image
                try:  # Just in case that the cover wasn't found in page's JavaScript
                    cover = covers[cards.index(card)]
                except IndexError:
                    cover = None
            except IndexError:
                continue
            else:
                # Append result
                results.append({
                    # ...
                    'cover': cover,
                    # ...
                })
        # ...

OK, we are almost there. When you search for something that has no video results, the program will raise an AttributeError. To prevent this, we need to add a try-except.

class GoogleSpider(object):
    # ...

    def __search_video(self, response: requests.Response) -> list:
        # ...
        try:
            cards = soup.find('g-scrolling-carousel').findAll('g-inner-card')
        except AttributeError:
            return []
        # ...
        for card in cards:
            try:
                # Title
                # If the container is not about videos, there won't be a div with
                # attrs `role="heading"`. So to catch that, I've added a try-except
                # to catch the error and return.
                try:
                    title = card.find('div', role='heading').text
                except AttributeError:
                    return []
                # ...
            except IndexError:
                continue
            else:
                # ...
        return results

I restructured the GoogleSpider class, so you may want to do the same. Put all of the code from part one into a __search_result method and re-create the search function. All it does is call the private methods and put the results together.

class GoogleSpider(object):
    # ...

    def search(self, query: str, page: int = 1) -> dict:
        """Search Google

        Args:
            query (str): The query to search for
            page (int): The page number of search result

        Returns:
            dict: The search results and the total page number
        """
        # Get response
        response = self.__get_source(
            'https://www.google.com/search?q=%s&start=%d' % (quote(query), (page - 1) * 10))
        results = []
        video = self.__search_video(response)
        result = self.__search_result(response)
        pages = self.__get_total_page(response)
        results.extend(result)
        results.extend(video)
        return {
            'results': results,
            'pages': pages
        }

    # ...
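
The URL that search builds can be checked on its own. The sketch below isolates that one line, assuming ten results per page as in the code above; build_search_url is a hypothetical helper name, and the check on page is an addition for illustration.

```python
from urllib.parse import quote


def build_search_url(query: str, page: int = 1) -> str:
    """Build the Google search URL for a 1-based page number."""
    if page < 1:
        raise ValueError('page must be 1 or greater')
    # Page 1 starts at result 0, page 2 at result 10, and so on
    start = (page - 1) * 10
    # quote() percent-encodes spaces and other unsafe characters
    return 'https://www.google.com/search?q=%s&start=%d' % (quote(query), start)


print(build_search_url('python tutorial'))
print(build_search_url('C++', page=3))
```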

Full code

Here is the full code for this tutorial, up to and including part two.

# Import dependencies
from pprint import pprint
from urllib.parse import quote

import requests
from bs4 import BeautifulSoup


class GoogleSpider(object):
    def __init__(self):
        """Crawl Google search results

        This class is used to crawl Google's search results using requests and BeautifulSoup.
        """
        super().__init__()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:79.0) Gecko/20100101 Firefox/79.0',
            'Host': 'www.google.com',
            'Referer': 'https://www.google.com/'
        }

    def __get_source(self, url: str) -> requests.Response:
        """Get the web page's source code

        Args:
            url (str): The URL to crawl

        Returns:
            requests.Response: The response from URL
        """
        return requests.get(url, headers=self.headers)

    def __search_video(self, response: requests.Response) -> list:
        """Search for video results based on the given response

        Args:
            response (requests.Response): the response requested to Google search

        Returns:
            list: A list of found video results, usually three if found
        """
        soup = BeautifulSoup(response.text, 'html.parser')
        try:
            cards = soup.find('g-scrolling-carousel').findAll('g-inner-card')
        except AttributeError:
            return []
        # Pre-process the video covers
        covers_ = soup.find('span', id='fld').findNextSiblings('script')[1:]
        # Get the cover images
        covers = []
        for c in covers_:
            # Fetch cover image
            try:
                covers.append(str(c).split('s=\'')[-1].split(
                    '\';var ii')[0].rsplit('\\', 1)[0])
            except IndexError:
                pass
        results = []
        # Generate video information
        for card in cards:
            try:
                # Title
                # If the container is not about videos, there won't be a div with
                # attrs `role="heading"`. So to catch that, I've added a try-except
                # to catch the error and return.
                try:
                    title = card.find('div', role='heading').text
                except AttributeError:
                    return []
                # Video length
                length = card.findAll('div', style='height:118px;width:212px')[
                    1].findAll('div')[1].text
                # Video upload author
                author = card.find(
                    'div', style='max-height:1.5800000429153442em;min-height:1.5800000429153442em;font-size:14px;padding:2px 0 0;line-height:1.5800000429153442em').text
                # Video source (Youtube, for example)
                source = card.find(
                    'span', style='font-size:14px;padding:1px 0 0 0;line-height:1.5800000429153442em').text
                # Video publish date
                date = card.find(
                    'div', style='font-size:14px;padding:1px 0 0 0;line-height:1.5800000429153442em').text.lstrip(source).lstrip('- ')  # Strip the source out because they're in the same container
                # Video link
                url = card.find('a')['href']
                # Video cover image
                try:  # Just in case that the cover wasn't found in page's JavaScript
                    cover = covers[cards.index(card)]
                except IndexError:
                    cover = None
            except IndexError:
                continue
            else:
                # Append result
                results.append({
                    'title': title,
                    'length': length,
                    'author': author,
                    'source': source,
                    'date': date,
                    'cover': cover,
                    'url': url,
                    'type': 'video'
                })
        return results

    def __get_total_page(self, response: requests.Response) -> int:
        """Get the current total pages

        Args:
            response (requests.Response): the response requested to Google using requests

        Returns:
            int: the total page number (might be changing when increasing / decreasing the current page number)
        """
        soup = BeautifulSoup(response.text, 'html.parser')
        pages_ = soup.find('div', id='foot', role='navigation').findAll('td')
        maxn = 0
        for p in pages_:
            try:
                if int(p.text) > maxn:
                    maxn = int(p.text)
            except ValueError:  # Skip cells whose text is not a page number
                pass
        return maxn

    def __search_result(self, response: requests.Response) -> list:
        """Search for normal search results based on the given response

        Args:
            response (requests.Response): The response requested to Google

        Returns:
            list: A list of results
        """
        # Initialize BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Get the result containers
        result_containers = soup.findAll('div', class_='rc')
        # Final results list
        results = []
        # Loop through every container
        for container in result_containers:
            # Result title
            title = container.find('h3').text
            # Result URL
            url = container.find('a')['href']
            # Result description
            des = container.find('span', class_='st').text
            results.append({
                'title': title,
                'url': url,
                'des': des,
                'type': 'result'
            })
        return results

    def search(self, query: str, page: int = 1) -> dict:
        """Search Google

        Args:
            query (str): The query to search for
            page (int): The page number of search result

        Returns:
            dict: The search results and the total page number
        """
        # Get response
        response = self.__get_source(
            'https://www.google.com/search?q=%s&start=%d' % (quote(query), (page - 1) * 10))
        results = []
        video = self.__search_video(response)
        result = self.__search_result(response)
        pages = self.__get_total_page(response)
        results.extend(result)
        results.extend(video)
        return {
            'results': results,
            'pages': pages
        }


if __name__ == '__main__':
    pprint(GoogleSpider().search(input('Search for what? ')))

Summary

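So now we can crawl Google's related video results, but you may ask why we only get three of them. That is because there are only three in Google's source code. If you find a way to get more, please leave a comment and I will add it as soon as possible. And of course, if you have any questions or run into errors in your code, leave a message below and I will be glad to help.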

Reference

Original article: Crawling Google Search Results - Part 2: Crawling Video, https://dev.to/samzhangjy/crawling-google-search-results-part-2-crawling-video-4hi9
