Using Nokogiri and RegEx to Scan a Webpage in Ruby


TLDR;

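We'll use the Nokogiri gem to search a web article for a topic and generate a list of snippets.

For an introduction to Nokogiri, read the article I wrote about it.

I'm a casual baseball fan. ESPN has plenty of MLB articles. What are they talking about? It's a mystery! To find out, we'll need to scan the page with Nokogiri. Ready?

Grab that article!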

➜  ~ irb --simple-prompt
>> require 'nokogiri'
=> true
>> require 'open-uri'
=> true

# I'm going to scan the article at this URL.  Feel free to 
# find one of your own, since this will probably be outdated 
# by the time you read this.

>> url = 'https://www.espn.com/mlb/story/_/id/33855507/mlb-power-rankings-week-4-new-team-surged-our-no-1-spot'
=> "https://www.espn.com/mlb/story/_/id/33855507/mlb-power-rankings-week-4-...
>> web_page = Nokogiri::HTML5(URI(url))
=>
#(Document:0x5b4 {
...

Now we need to find the body of the article so that we scan only the relevant part. We'll use the #css method with the div and its class name.

>> body = web_page.css('div.article-body')
=> [#<Nokogiri::XML::Element:0x5730 name="div" attributes=[#<Nokogiri::XML:...
>> text = body.text
=> "7:00 AM ETESPNFacebookTwitterFacebook MessengerPinterestEmailprintFour ...

Now for the fun part.

Scanning with regex

Ruby has a method, String#scan, that takes a regex as an argument and returns an array of all the matches.
>> "don't do that dog!".scan(/do/)
=> ["do", "do", "do"]
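
As a small sketch of how the pattern shapes what scan returns (the sample sentence is made up for illustration), a bare /pitch/ returns only the matched fragment, adding \w* returns the rest of each word, and capture groups turn each result into an array of groups:

```ruby
# Made-up sample sentence, not taken from the ESPN article.
sample = "The pitcher threw two pitches before the first-pitch swing."

# Without groups, scan returns the matched text of every occurrence.
plain = sample.scan(/pitch/i)
p plain   # ["pitch", "pitch", "pitch"]

# \w* extends each match to the end of the word.
words = sample.scan(/pitch\w*/i)
p words   # ["pitcher", "pitches", "pitch"]

# With capture groups, each result is an array of the groups instead.
parts = sample.scan(/(pitch)(\w*)/i)
p parts   # [["pitch", "er"], ["pitch", "es"], ["pitch", ""]]
```

In every case scan reports what matched, not where it matched.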
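But we don't just want a list of occurrences. We want the full context so we can see what the article is saying. To get it, we need the index of each occurrence, as if the string were an array of characters. Then we'll slice the string before and after that index to capture the context of each occurrence. This brings us to a (lesser-known) method called #to_enum (to enumerator).
The to_enum method lets you enumerate a string by passing it a method name and optional arguments. Here is an example that gets the byte code of each ASCII character in a string. We use to_s(2) to print each one in binary.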

>> "abcdefg".to_enum(:each_byte).each { |b| p b.to_s(2) }
"1100001"
"1100010"
"1100011"
"1100100"
"1100101"
"1100110"
"1100111"

For our purposes, we'll pass the :scan method with a regex as its argument. Then we map each occurrence to Regexp.last_match.begin(0) to get the index where that occurrence starts. Here's how it works.

# remember text holds the text of the article body
# each index will go into the indices variable
# we can search for whatever we want, let's search for pitch
# this will work for pitch, pitchers, pitches, etc.
>> indices = text.to_enum(:scan, /pitch/i).map do |pitch|
     Regexp.last_match.begin(0)
>> end
=>
[1825,
...
>> indices
=>
[1825,
 3699,
 4727,
 10007,
 10127,
 10846,
 11016,
 12617,
 13734,
 14060,
 14585,
 14927,
 16019,
 17835,
 18858]
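
Here is the same technique as a self-contained sketch you can run without fetching a page. The sample string stands in for the article body and is an assumption for illustration only:

```ruby
# Sample text standing in for the article body (an assumption for this sketch).
sample = "Pitching wins games. The pitcher threw 90 pitches."

# to_enum wraps String#scan in an Enumerator; the regex is forwarded to scan.
matches = sample.to_enum(:scan, /pitch/i)
puts matches.class   # Enumerator

# While the block runs, Regexp.last_match holds the MatchData for the
# current match, so begin(0) is the index where the whole match starts.
indices = matches.map { Regexp.last_match.begin(0) }
p indices            # [0, 25, 42]

# The same MatchData also gives the matched text, if you want both.
hits = matches.map { [Regexp.last_match.begin(0), Regexp.last_match[0]] }
p hits               # [[0, "Pitch"], [25, "pitch"], [42, "pitch"]]
```

Each call to map re-runs the scan from the start, so the enumerator can be reused.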

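Awesome! This list of indices tells us where to slice to get our data. We'll start each slice 30 characters before the index and make it 70 characters long. Then we'll push each text snippet into an array.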

>> snippets = []
=> []
?> indices.each do |i|
?>   snippets << text.slice(i - 30, 70)
>> end
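
One edge case with this slice is worth knowing: if a match sits within the first 30 characters, i - 30 goes negative and String#slice counts from the end of the string instead. It did not come up in this article, since the first index is 1825. The helper below is a hypothetical sketch (context_slice and its keyword arguments are not from the original article) that clamps the start at 0:

```ruby
# Hypothetical helper (not from the original article): take a window of
# `length` characters starting `before` characters ahead of `index`,
# clamping the start at 0 so it never wraps around to the end of the string.
def context_slice(text, index, before: 30, length: 70)
  start = [index - before, 0].max
  text.slice(start, length)
end

# Made-up sample text, 50 characters long, with a match at index 0.
sample = "Pitching wins games. The pitcher threw 90 pitches."

# Unclamped: a negative start counts back from the end of the string.
wrapped = sample.slice(0 - 30, 70)
p wrapped   # " The pitcher threw 90 pitches."

# Clamped: the window starts at the beginning of the string.
clamped = context_slice(sample, 0)
p clamped   # the whole sample, since it is shorter than 70 characters
```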

>> snippets
=>
["n-differential in the majors. Pitching has mostly carried them, but th",
 "st year, Milwaukee's starting pitching was basically three deep. That ",
 "rt envious: Too many starting pitchers. Clevinger is the sixth member ",
 " allowed. While he has a five-pitch repertoire, one thing he's done th",
 "eup combo. He threw those two pitches a combined 64.3% of the time las",
 "ause his swing rate and first-pitch swing rate in particular are up. H",
 "nd him he's going to get more pitches to hit. -- Schoenfield17. Chicag",
 "2 batting line this year. The pitching staff has been one of the brigh",
 "ice start. Good, right-handed pitching will stymie them this year, tho",
 "le against both hard and soft pitching, despite dominating the league ",
 " ranks among some of the best pitchers in all of baseball in WAR. -- L",
 " back to .500. Their starting pitchers have lifted them, with Zac Gall",
 ". The Rangers did have better pitching last week, moving them up the l",
 "r nine innings in 11⅓ innings pitched. -- Lee29. Washington NationalsR",
 " Colorado will do that -- but pitching was a big problem. The Reds com"]

We did it! Now let's clean the snippets up so they start and end on whole words. We'll take each snippet, split it on spaces, drop the partial first and last words, and join it back together with spaces.

>> snippets.map do |snippet|
?>   words = snippet.split(" ")
?>   words.pop    # drop the trailing partial word
?>   words.shift  # drop the leading partial word
?>   words.join(" ")
>> end
=>
["in the majors. Pitching has mostly carried them, but",
 "year, Milwaukee's starting pitching was basically three deep.",
 "envious: Too many starting pitchers. Clevinger is the sixth",
 "While he has a five-pitch repertoire, one thing he's done",
 "combo. He threw those two pitches a combined 64.3% of the time",
 "his swing rate and first-pitch swing rate in particular are up.",
 "him he's going to get more pitches to hit. -- Schoenfield17.",
 "batting line this year. The pitching staff has been one of the",
 "start. Good, right-handed pitching will stymie them this year,",
 "against both hard and soft pitching, despite dominating the",
 "among some of the best pitchers in all of baseball in WAR. --",
 "to .500. Their starting pitchers have lifted them, with Zac",
 "The Rangers did have better pitching last week, moving them up the",
 "nine innings in 11⅓ innings pitched. -- Lee29. Washington",
 "will do that -- but pitching was a big problem. The Reds"]

There you have it! I hope this article revealed some of Ruby's cool features. Don't be afraid to explore new gems and look for new methods on the Ruby-Doc website.
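
As a recap, the steps in this article can be collected into one method. This is only a sketch under assumptions: snippets_for and its keyword arguments are hypothetical names, the sample text is made up, and the start index is clamped at 0, which the irb session did not do.

```ruby
# Hypothetical helper (names are illustrative, not from the original article).
# Returns whole-word context snippets around every match of `pattern`.
def snippets_for(text, pattern, before: 30, length: 70)
  # Start index of every match, via the to_enum(:scan) technique.
  indices = text.to_enum(:scan, pattern).map { Regexp.last_match.begin(0) }

  indices.map do |i|
    start = [i - before, 0].max          # assumption: clamp instead of wrapping
    words = text.slice(start, length).split(" ")
    words.shift                          # drop the leading (likely partial) word
    words.pop                            # drop the trailing (likely partial) word
    words.join(" ")
  end
end

# Made-up sample text standing in for the article body.
sample = "Last year the starting pitching was thin. " \
         "This year the pitchers look deep and rested."

result = snippets_for(sample, /pitch/i, before: 12, length: 32)
p result   # ["starting pitching was thin.", "year the pitchers look deep"]
```

Note that shift and pop remove the first and last words even when they happen to be complete, which matches the cleanup step used above.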

Reference

Original article (Using Nokogiri and RegEx to Scan a Webpage in Ruby): https://dev.to/jvon1904/using-nokogiri-and-regex-to-scan-a-webpage-in-ruby-45l3

