Automated scraping with the CAPTCHA-solving service '2Captcha' and Ruby + Chrome_Remote


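Introduction

If you scrape websites, you have probably had a program stop because a CAPTCHA appeared.
(Otherwise you would not be reading this article.)
There are ways to avoid CAPTCHAs altogether, such as making the bot behave less like a bot or spreading requests across multiple IP addresses, but this time I will tackle the CAPTCHA head-on and solve it.
Of course, as an engineer I would rather have a program solve it automatically than do it by hand.
Machine learning has high learning and adoption costs, and I want something easier.
A service called 2Captcha makes that possible.
There are various other services as well, so look for one that suits you.
There was already an article for Python, but I could not find one for Ruby, so I wrote this.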

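What is 2Captcha?

A service that solves CAPTCHAs; its API lets you automate the verification step.
It is a paid service, but inexpensive: for reCAPTCHA v2, 1,000 requests cost $2.99.
For the record, no money has changed hands between me and 2Captcha for promotion or anything similar.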
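As a rough sketch of that API flow: a solve request is sent to in.php, and the answer is fetched later from res.php, with the same parameters the crawler in this article uses. The helper names build_submit_url and build_result_url are invented for this illustration; only the endpoints and parameter names come from the program itself.

```ruby
require 'uri'

TWO_CAPTCHA_HOST = 'https://2captcha.com'

# Hypothetical helper: URL that asks 2Captcha to solve a reCAPTCHA v2.
def build_submit_url(api_key:, site_key:, page_url:)
  query = URI.encode_www_form(
    key: api_key,
    method: 'userrecaptcha',
    googlekey: site_key,
    pageurl: page_url
  )
  "#{TWO_CAPTCHA_HOST}/in.php?#{query}"
end

# Hypothetical helper: URL that polls for the answer to an earlier request.
def build_result_url(api_key:, request_id:)
  query = URI.encode_www_form(key: api_key, action: 'get', id: request_id)
  "#{TWO_CAPTCHA_HOST}/res.php?#{query}"
end

puts build_submit_url(api_key: 'YOUR_API_KEY', site_key: 'SITE_KEY',
                      page_url: 'https://www.google.com/recaptcha/api2/demo')
```

Building the query with URI.encode_www_form keeps a page URL that contains its own ? or & from corrupting the request.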

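What is Chrome_Remote?

A library for controlling a Chrome instance from Ruby.
For detailed usage, see its documentation page and repository.
As a premise for scraping, you should first work in a way that makes CAPTCHAs unlikely to appear at all.
Unlike Selenium and similar tools, Chrome_Remote drives an ordinary Chrome as-is, so I think it is less likely to be flagged as a bot. (I would like to verify the difference at some point.)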
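For orientation, the Chrome DevTools Protocol that sits underneath exchanges JSON messages carrying an id, a method name, and params. The sketch below shows roughly what a call such as send_cmd "Page.navigate", url: url corresponds to on the wire. cdp_command is a hypothetical helper, and how Chrome_Remote numbers and frames its messages internally is an assumption on my part.

```ruby
require 'json'

# Hypothetical helper (not part of Chrome_Remote): builds the JSON text
# of a single DevTools Protocol command.
def cdp_command(id, method, **params)
  JSON.generate({ id: id, method: method, params: params })
end

puts cdp_command(1, 'Page.navigate', url: 'https://www.google.com/recaptcha/api2/demo')
```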

Goal

Solve the reCAPTCHA demo page.
For creating a 2Captcha account and obtaining an API key, see the earlier article on that topic.

Solving reCAPTCHA with '2Captcha' and Ruby + Chrome_Remote

Obtain your 2Captcha API key and save it to a file.

key.yaml

---
:2Capthca: YOUR_2CAPTCHA_API_KEY
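A minimal sketch of reading that file defensively, so that a missing key fails with a clear message instead of a malformed API request. load_api_key is a hypothetical helper, and the TWO_CAPTCHA_API_KEY environment variable is an assumed fallback that the original program does not use. The key name :2Capthca is kept exactly as spelled in key.yaml, because the crawler looks it up under that name.

```ruby
require 'yaml'

# Hypothetical helper: reads the API key from YAML text shaped like key.yaml.
# Falls back to an assumed environment variable when the file has no key.
def load_api_key(yaml_text, env: ENV)
  config = YAML.load(yaml_text) || {}
  key = config[:"2Capthca"] || env['TWO_CAPTCHA_API_KEY']
  raise ArgumentError, '2Captcha API key is missing' if key.nil? || key.to_s.strip.empty?
  key.to_s
end

puts load_api_key("---\n:2Capthca: YOUR_2CAPTCHA_API_KEY\n")
```

With a real file, the call would be load_api_key(File.read('key.yaml')).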

Start Chrome with a remote debugging port.

On a Mac:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 &
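Before running the crawler, it can help to confirm that Chrome is actually listening. With remote debugging enabled, Chrome serves a small JSON document at /json/version that includes a webSocketDebuggerUrl field. The sketch below is one way to check it; both helper names are hypothetical, and the sample JSON is an illustrative stand-in, not real output.

```ruby
require 'json'
require 'net/http'

# Hypothetical helper: extracts the WebSocket endpoint from the JSON body
# that Chrome serves at /json/version.
def websocket_url_from(version_json)
  JSON.parse(version_json)['webSocketDebuggerUrl']
end

# Hypothetical helper: returns the endpoint, or nil when Chrome is not reachable.
def debugger_endpoint(host: 'localhost', port: 9222)
  body = Net::HTTP.get(URI("http://#{host}:#{port}/json/version"))
  websocket_url_from(body)
rescue SystemCallError, SocketError, Timeout::Error, JSON::ParserError
  nil
end

# Illustrative sample body (made up for this example).
sample = '{"Browser":"Chrome/120.0.0.0","webSocketDebuggerUrl":"ws://localhost:9222/devtools/browser/abc"}'
puts websocket_url_from(sample)
```

Calling debugger_endpoint with no arguments checks port 9222 on localhost and returns nil if nothing answers there.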

Install the required gems.

Gemfile

source "https://rubygems.org"
gem 'nokogiri'
gem 'chrome_remote'

bundle install

Here is the main Ruby program.

crawler.rb


require 'nokogiri'
require 'chrome_remote'
require 'yaml'

class CaptchaDetectedException < StandardError; end

class ChromeController
  def initialize
    @chrome = ChromeRemote.client

    # Enable events
    @chrome.send_cmd "Network.enable"
    @chrome.send_cmd "Page.enable"
  end

  def open(url)
    # Navigate to the page, then check it for a CAPTCHA
    move_to url
    captcha_detect
  end

  def reload_page
    sleep 1
    @chrome.send_cmd "Page.reload", ignoreCache: false
    wait_event_fired
  end

  def execute_js(js)
    @chrome.send_cmd "Runtime.evaluate", expression: js
  end

  def wait_event_fired
    @chrome.wait_for "Page.loadEventFired"
  end

  # Navigate to the given URL and wait for the page to finish loading
  def move_to(url)
    sleep 1
    @chrome.send_cmd "Page.navigate", url: url
    wait_event_fired
  end

  # Fetch the HTML of the current page
  def get_html
    response = execute_js 'document.getElementsByTagName("html")[0].innerHTML'
    '<html>' + response['result']['value'] + '</html>'
  end

  def captcha_detect
    bot_detect_cnt = 0
    begin
      html = get_html
      raise CaptchaDetectedException, 'CAPTCHA detected' if html.include?("captcha")
    rescue CaptchaDetectedException => e
      p e
      bot_detect_cnt += 1
      puts "CAPTCHA solve attempt: #{bot_detect_cnt}"
      doc = Nokogiri::HTML.parse(html, nil, 'UTF-8')
      return if captcha_solve(doc) == 'solved'
      reload_page
      retry if bot_detect_cnt < 3
      puts 'Could not solve the CAPTCHA. Exiting.'
      exit
    end
    puts 'No CAPTCHA found'
  end

  def captcha_solve(doc)
    # A successful request is answered with "OK|<request id>"
    id = request_id(doc)[/OK\|(\d+)/, 1]
    return false unless id
    solution = request_solution(id)
    return false unless solution
    submit_solution(solution)
    p captcha_result
  end

  def request_id(doc)
    # Load the API key
    @key = YAML.load_file("key.yaml")[:"2Capthca"]
    # Read the value of the data-sitekey attribute
    googlekey = doc.at_css('#recaptcha-demo')["data-sitekey"]
    method = "userrecaptcha"
    # URL-encode the page URL in the browser so it is safe as a query parameter
    pageurl = execute_js("encodeURIComponent(location.href)")['result']['value']
    request_url = "https://2captcha.com/in.php?key=#{@key}&method=#{method}&googlekey=#{googlekey}&pageurl=#{pageurl}"
    # Ask 2Captcha to solve the CAPTCHA
    fetch_url(request_url)
  end

  def request_solution(id)
    action = "get"
    response_url = "https://2captcha.com/res.php?key=#{@key}&action=#{action}&id=#{id}"
    # Solving takes a while, so wait before the first poll
    sleep 15
    10.times do |attempt|
      sleep 5
      # Fetch the solution token; "CAPCHA_NOT_READY" is the API's own spelling
      response_str = fetch_url(response_url)
      return response_str.slice(/OK\|(.*)/, 1) unless response_str.include?('CAPCHA_NOT_READY')
      puts "Not solved yet, retry: #{attempt + 1}"
    end
    false
  end

  def submit_solution(solution)
    # Put the solution token into the designated textarea
    execute_js("document.getElementById('g-recaptcha-response').innerHTML=\"#{solution}\";")
    sleep 1
    # Click the submit button
    execute_js("document.getElementById('recaptcha-demo-submit').click();")
  end

  def captcha_result
    sleep 1
    html = get_html
    doc = Nokogiri::HTML.parse(html, nil, 'UTF-8')
    doc.at_css('.recaptcha-success') ? 'solved' : 'failed'
  end


  # Shell out to curl and return the response body (-s hides the progress meter)
  def fetch_url(url)
    sleep 1
    `curl -s "#{url}"`
  end

end

crawler = ChromeController.new
url = 'https://www.google.com/recaptcha/api2/demo'
crawler.open(url)
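One fragile spot in the program above is submit_solution, which interpolates the token straight into a JavaScript string. If the value ever contained a double quote or a backslash, the generated script would break. A minimal sketch of a sturdier variant, assuming the same element id as the demo page; fill_response_js is a hypothetical helper, not part of the original program.

```ruby
require 'json'

# Hypothetical helper: builds the script that fills in the reCAPTCHA response field.
# String#to_json yields a double-quoted literal with quotes and backslashes escaped,
# which JavaScript also accepts as a string literal.
def fill_response_js(token)
  "document.getElementById('g-recaptcha-response').innerHTML=#{token.to_json};"
end

puts fill_response_js('abc"def')
```

The returned string can be passed to execute_js in place of the hand-built one.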

Running the program navigates to the reCAPTCHA demo page and attempts to solve the CAPTCHA.

bundle exec ruby crawler.rb
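The polling that request_solution performs can be tried out without spending credits or waiting on the network. The sketch below separates reply parsing from the retry loop and injects the fetch and pause steps, so simulated replies can stand in for res.php. Both helper names are hypothetical, and the three reply shapes are my reading of the API: "OK|<value>", "CAPCHA_NOT_READY" (the API's own spelling), and an error code.

```ruby
# Hypothetical helper: classifies a single 2Captcha reply.
def parse_reply(body)
  text = body.to_s.strip
  return [:ok, text.split('|', 2)[1]] if text.start_with?('OK|')
  return [:pending, nil] if text == 'CAPCHA_NOT_READY'
  [:error, text]
end

# Hypothetical helper: polls until a token arrives, an error comes back,
# or the attempts run out (in which case it returns nil).
# fetch and pause are injected so the loop runs without network access or delays.
def poll_for_token(fetch:, attempts: 10, interval: 5, pause: ->(seconds) { sleep(seconds) })
  attempts.times do
    status, value = parse_reply(fetch.call)
    return value if status == :ok
    raise "2Captcha returned #{value}" if status == :error
    pause.call(interval)
  end
  nil
end

# Simulated replies standing in for res.php responses.
replies = ['CAPCHA_NOT_READY', 'CAPCHA_NOT_READY', 'OK|dummy-token']
token = poll_for_token(fetch: -> { replies.shift }, pause: ->(_seconds) {})
puts token
```

In the crawler, fetch would wrap fetch_url(response_url) and the default pause would sleep between polls. Raising on an error reply, instead of retrying, avoids polling ten times for a request that can no longer succeed.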

In closing

Reference

Original article: https://qiita.com/kobayashi-masayuki/items/22efe9cf924da89a0be4

