How I applied NLP to various YouTube videos


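Applying NLP to YouTube videos

At first I planned to use assemblyai to extract transcripts from YouTube videos myself, but Google has come a long way and now provides transcripts automatically.

assemblyai is a great tool for extracting transcripts from video, and I have used it on investor presentations from other sources.

My earlier legacy project (ytube_nlp) used jinja2 to generate the transcript pages, and it had neither grouping of videos by category nor a table of contents. On top of that, ytube_nlp has no RSS feed, so there is no way to know when a new video comes out.

My latest iteration, media_nlp, is designed to take in all kinds of media (mainly YouTube videos) and extract transcripts and other metadata.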

The improvements include:

An RSS feed, built with astrojs

Incremental page generation with astrojs, with the option of serving the json files directly

Grouping of videos by category

Improved emails
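As a rough illustration of the category grouping mentioned in the list above, here is a minimal sketch. The `category` field and the `uncategorized` fallback are assumptions made for the example, not the actual schema used by media_nlp.

```python
from collections import defaultdict


def group_by_category(videos):
    """Group video records by their (hypothetical) "category" field."""
    grouped = defaultdict(list)
    for video in videos:
        # Records without a category land in an assumed fallback bucket
        grouped[video.get("category", "uncategorized")].append(video)
    return dict(grouped)


videos = [
    {"videoId": "a1", "category": "investing"},
    {"videoId": "b2", "category": "tech"},
    {"videoId": "c3", "category": "investing"},
    {"videoId": "d4"},
]
grouped = group_by_category(videos)
```

Each key of `grouped` could then map to one listing page on the static site.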

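Explanation of the technical architecture

media_nlp has two main components.

Transcript collection is handled by Python scripts, and static site generation is handled by astrojs.

The python scripts in the scripts folder are heavily inspired by ytube_nlp, but they produce json files for astrojs to process. They use the YouTube Data API to fetch videos uploaded in the last 24 hours for the channels I am interested in (hardcoded in the config.yml file).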

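import os
from datetime import datetime, timedelta, timezone

import requests

# Look-back window in days; assumed to be 1 here to match the daily schedule
REF_DAYS = 1


def search_videos_for_channel(channel_id, params=None):
    """Return the search response for videos a channel published in the last REF_DAYS days."""
    youtube_api = "https://www.googleapis.com/youtube/v3/search"
    youtube_api_key = os.getenv("YOUTUBE_API_KEY")
    if youtube_api_key is None:
        raise SystemExit("Need Youtube API KEY")
    # Copy the caller's params so a shared default dict is never mutated between calls
    params = dict(params) if params else dict(part="snippet")
    params["channelId"] = channel_id
    params["order"] = "date"
    current_date = datetime.now(timezone.utc)
    # Only query for videos published inside the look-back window
    publishedAfter = (current_date - timedelta(days=REF_DAYS)).isoformat()
    params["publishedAfter"] = publishedAfter
    # 50 is the documented per-request maximum for this endpoint
    params["maxResults"] = 50
    params["key"] = youtube_api_key

    r = requests.get(youtube_api, params=params, timeout=30).json()
    # Check if an error object is present
    if r.get("error") is not None:
        print(r)
        print("Add Authentication Key")
    return r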

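The function above fetches videos from the youtube api within a parameterized time frame (usually one day for my purposes). A single request only returns a limited number of results. Most creators will not exceed that in a day, but a large news channel such as cnbc could.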
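For channels that do overflow a single response, the search endpoint returns a `nextPageToken` that can be sent back as `pageToken` to get the next page. The sketch below is not part of the original scripts: `collect_all_items`, `fetch_page`, and the `max_pages` guard are hypothetical names, and a dictionary of fake pages stands in for the real API so the example runs offline.

```python
def collect_all_items(fetch_page, max_pages=10):
    """Follow nextPageToken until the API stops returning one.

    fetch_page is a hypothetical callable: it takes a page token (or None
    for the first page) and returns one parsed search response.
    """
    items = []
    token = None
    for _ in range(max_pages):
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if token is None:
            break
    return items


# Stand-in for the real API so the sketch runs offline: two fake pages
fake_pages = {
    None: {"items": [{"id": 1}, {"id": 2}], "nextPageToken": "page2"},
    "page2": {"items": [{"id": 3}]},
}
all_items = collect_all_items(lambda token: fake_pages[token])
```

With the real function, `fetch_page` could be something like `lambda token: search_videos_for_channel(channel_id, dict(part="snippet", pageToken=token))`. requests leaves out parameters whose value is None, so the first call would go out without a page token. Every extra page costs API quota, which is why the sketch caps the number of pages.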

def extract_key_video_data(video_data):
    """Extract videoId, title, description, channelId and publishedAt from a search response."""
    key_video_data = []
    # An empty or missing response yields an empty list, so callers can always iterate
    if not video_data:
        return key_video_data
    for video in video_data.get("items", []):
        snippet = video.get("snippet", {})
        vid_id = video.get("id", {})

        # video id is None do nothing, it happens during livestreams
        # before publishing
        videoId = vid_id.get("videoId")
        if videoId is None:
            continue
        key_video_data.append(
            dict(
                videoId=videoId,
                channelId=snippet.get("channelId"),
                description=snippet.get("description"),
                title=snippet.get("title"),
                publishedAt=snippet.get("publishedAt"),
            )
        )
    return key_video_data

This produces a list of the key data I am interested in. The published date is useful for judging how likely it is that a transcript already exists.
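One way to turn the published date into that kind of check is to measure how long ago the video went up. This is only a sketch: both function names and the `min_hours=3` threshold are assumptions for illustration, not values taken from the original scripts.

```python
from datetime import datetime, timezone


def hours_since_published(published_at, now=None):
    """Return how many hours ago an RFC 3339 timestamp such as publishedAt was."""
    # Python before 3.11 does not parse a trailing "Z", so normalise it first
    published = datetime.fromisoformat(published_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - published).total_seconds() / 3600


def transcript_probably_ready(published_at, min_hours=3, now=None):
    """Guess whether captions are likely available yet (min_hours is an assumed threshold)."""
    return hours_since_published(published_at, now) >= min_hours


# A fixed "now" keeps the example deterministic
fixed_now = datetime(2022, 8, 1, 12, 0, tzinfo=timezone.utc)
age = hours_since_published("2022-08-01T06:30:00Z", now=fixed_now)
```

Videos that fail the check could simply be retried on the next scheduled run.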

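I then use youtube_transcript_api to extract the transcript from each video. It takes a few hours for a video's transcript to become available.
Once I have the transcript, I run entity detection with spacy and write the results to a json file.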
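The transcript-to-json step described above could look roughly like the sketch below. It assumes transcript segments arrive as a list of dicts with a `text` key, and it takes the entity extractor as a parameter so the example runs without spacy installed. `build_record`, `spacy_entities`, and `fake_entities` are hypothetical helpers, not functions from the original scripts.

```python
import json


def build_record(video, segments, extract_entities):
    """Combine video metadata, transcript text and detected entities into one dict."""
    text = " ".join(segment["text"] for segment in segments)
    return {**video, "transcript": text, "entities": extract_entities(text)}


def spacy_entities(text, model="en_core_web_sm"):
    """Entity extractor backed by spacy; imported lazily so the sketch runs without it."""
    import spacy

    nlp = spacy.load(model)
    return [[ent.text, ent.label_] for ent in nlp(text).ents]


def fake_entities(text):
    """Stand-in extractor for the demo: treats capitalised words as entities."""
    return [[word, "UNKNOWN"] for word in text.split() if word.istitle()]


segments = [{"text": "Apple reported earnings"}, {"text": "after the bell"}]
record = build_record({"videoId": "abc123"}, segments, fake_entities)
serialized = json.dumps(record, indent=2)
```

In a real run, `spacy_entities` would be passed in place of `fake_entities`, and `serialized` would be written to a file for the astro site to read.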

The astro site can now parse all of this information and output it as static html.

So I created two github actions, with one depending on the other.

The first is a cron job that scans YouTube and updates the repository with new YouTube transcript data.

name: Scan YTube
# Don't want to burn my private minutes at this point
on:
  push:
    branches:
      - master
      - main
    paths-ignore:
      - "website/**"
  schedule:
    # * is a special character in YAML so you have to quote this string
    - cron:  '30 13 * * *'


env:
  YOUTUBE_API_KEY: ${{ secrets.YOUTUBE_API_KEY  }}
  MJ_APIKEY_PUBLIC: ${{ secrets.MJ_APIKEY_PUBLIC }}
  MJ_APIKEY_PRIVATE: ${{ secrets.MJ_APIKEY_PRIVATE }}
  DISCORD_CODE_STATUS_WEBHOOK: ${{ secrets.DISCORD_CODE_STATUS_WEBHOOK }}
  DISCORD_VIDEO_WEBHOOK: ${{ secrets.DISCORD_VIDEO_WEBHOOK }}

jobs:
  make_report:
    name: Generate Report
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - name: Set up Python 3.x
        uses: actions/setup-python@v1
        with:
          python-version: '3.8' # Semantic version range syntax or exact version of a Python version
          architecture: 'x64' # Optional - x64 or x86, defaults to x64
      - name: installation of dependencies
        run: |
          if [ -f scripts/requirements.txt ]; then pip install -r scripts/requirements.txt; fi
          python -m spacy download en_core_web_sm
          python -m textblob.download_corpora

      - name: Generate Report
        run:  python3 scripts/main.py

      - name: Commit files
        run: |
          git config --local user.email "[email protected]"
          git config --local user.name "GitHub Action"
          git add *.json
          git add data/ytube
          git commit -m "added json files"
      - name: Push changes
        uses: ad-m/github-push-action@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}

      - uses: actions/upload-artifact@v1
        name: Upload Report folder
        with:
          name: report
          path: data/ytube/investing

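Once the Python script has finished collecting data, the second workflow builds the astro site. Note that the `completed` trigger in the workflow below fires whether or not the scan succeeded.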

# Workflow to build and deploy to your GitHub Pages repo.

# Edit your project details here.
# Remember to add API_TOKEN_GITHUB in repo Settings > Secrets as well!
env:
  githubEmail: <YOUR GITHUB EMAIL ADDRESS>
  deployToRepo: <NAME OF REPO TO DEPLOY TO (E.G. <YOUR USERNAME>.github.io)>

name: Github Pages Astro CI

on:
  # Triggers the workflow on push and pull request events but only for the main branch
  push:
    branches: [main]
  pull_request:
    branches: [main]

  workflow_run:
    workflows: ["Scan YTube"]
    types:
      - completed

  # Allows you to run this workflow manually from the Actions tab.
  workflow_dispatch:

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2

      # Install dependencies with npm
      - name: Install dependencies
        run: cd website && npm install

      # Build the project and add .nojekyll file to suppress default behaviour
      - name: Build
        run: |
          cd website
          npm run build
          touch ./dist/.nojekyll
      # Push to your pages repo
      - name: Deploy 🚀
        uses: JamesIves/[email protected]
        with:
          branch: gh-pages # The branch the action should deploy to.
          folder: website/dist # The folder the action should deploy.

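Hopefully this is enough to understand how the overall workflow works. Some of the things I learned:

How to use one github actions workflow to trigger another

How to interact with the YouTube Data API

Basic nlp with spacy is not that useful for this project, but it is a good starting point.

Having phrase extraction could be quite interesting.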

Reference

Original article: How I applied NLP to various YouTube videos, https://dev.to/friendlyuser/how-i-applied-nlp-to-various-youtube-videos-4jp1
