Stack Overflow Scraping with Rust

It extracts the title, question link, answer count, view count, and votes from Stack Overflow for a given tag and post count. This scraper is inspired by the Kadekillary scraper, with updated libraries and a few more features.

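Libraries used

Reqwest
> An ergonomic, batteries-included HTTP client for Rust

Select
> A Rust library to extract useful data from HTML documents, suitable for web scraping

Clap
> A simple to use, efficient, and full-featured library for parsing command-line arguments and subcommands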

Features

Simple and fast

Asynchronous requests

CLI mode

How to run

Build the executable with cargo build

Run ./target/debug/stackoverflow-scraping-with-rust -t <tag> -c <count>
<tag> is the topic you want information about.
<count> is the number of posts/threads to scrape. Note: the maximum limit is 16.
For example: ./target/debug/stackoverflow-scraping-with-rust -t java -c 1

What are we going to do?

Get the arguments from the command line using the Clap library

Make the request using the Reqwest library

Scrape the response using the Select library

Installing the libraries
Simply add the following libraries to Cargo.toml.

[dependencies]
reqwest = { version = "0.10", features = ["json"] }
tokio = { version = "0.2", features = ["full"] }
select = "0.6.0-alpha.1"
clap = "2.33.3"
rand = "0.8.4"

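Before moving ahead, you need to know about CSS selectors.
What is a selector/locator?
A CSS selector is a combination of an element selector and a value that identifies a web element within a web page.
Which locator to choose depends largely on the application under test.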
ID
In XPath an element's id is matched with "[@id='example']", and in CSS with "#". An ID must be unique within the DOM.
For example:

XPath: //div[@id='example']
CSS: #example

Element type
The previous example showed //div in the XPath. That is the element type, which could be input for a text box or button, img for an image, or a for a link.

XPath: //input
CSS: input

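Direct child
HTML pages are structured like XML, with child elements nested inside parent elements. If you can locate, for example, the first link within a div, you can build a string that reaches it. A direct child is written with "/" in XPath and with ">" in CSS.
For example:

XPath: //div/a
CSS: div > a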

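Child or sub-child
Writing out every nested div is tedious and makes the code brittle. Sometimes you expect the markup to change, or you want to skip layers. If an element may sit inside another element or any of its descendants, use "//" in XPath and just a whitespace in CSS.
For example:

XPath: //div//a
CSS: div a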

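Class
For classes, XPath looks much like the ID case, "[@class='example']", while in CSS it is just ".". Note that the XPath form compares the whole class attribute exactly, whereas the CSS form matches any element whose class list contains example.
For example:

XPath: //div[@class='example']
CSS: .example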
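
To make the three simplest selector forms concrete, here is a toy matcher written only with the standard library. It is a sketch for illustration: the `Element` struct and `matches_simple` function are hypothetical, and this is not how the select crate or a browser implements selectors.

```rust
/// A toy element holding only the fields a simple selector inspects.
struct Element<'a> {
    name: &'a str,
    id: Option<&'a str>,
    classes: Vec<&'a str>,
}

/// Returns true if `el` matches a simple selector:
/// `#id`, `.class`, or a bare element name such as `div`.
fn matches_simple(selector: &str, el: &Element) -> bool {
    if let Some(id) = selector.strip_prefix('#') {
        // "#example" compares against the element's id.
        el.id == Some(id)
    } else if let Some(class) = selector.strip_prefix('.') {
        // ".example" matches if ANY of the element's classes is "example".
        el.classes.iter().any(|c| *c == class)
    } else {
        // A bare name compares against the element type.
        el.name == selector
    }
}

fn main() {
    let el = Element {
        name: "div",
        id: Some("example"),
        classes: vec!["post", "summary"],
    };
    for selector in ["#example", ".summary", "div", "a"].iter() {
        println!("{:<10} matches: {}", selector, matches_simple(selector, &el));
    }
}
```

Combinators such as `div > a` and `div a` additionally need the parent chain of each element, which this sketch leaves out.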

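Step 1 -> Getting the arguments from the command line using the Clap library

We use the Clap library to read the arguments from the command line.
There are three cases:

Only the tag is provided => we fetch the default number of posts (16) for the given tag

Only the count is supplied => we fetch the given number of posts for a random topic, chosen with the rand library

Both count and tag are provided => we fetch the given number of posts for the given tag

First, we initialise the command-line app with its name. Then we declare the two arguments, tag and count, each with a short and a long name. If neither argument is given, the code falls back to 16 posts from a random tag.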

fn main() {
    let matches = App::new("StackOverflow Scraper")
        .version("1.0")
        .author("Praveen Chaudhary <[email protected]>")
        .about("It will scrape questions from stackoverflow depending on the tag.")
        .arg(
            Arg::with_name("tag")
                .short("t")
                .long("tag")
                .takes_value(true)
                .help("takes tag and scrape according to this"),
        )
        .arg(
            Arg::with_name("count")
                .short("c")
                .long("count")
                .takes_value(true)
                .help("gives n count of posts"),
        )
        .get_matches();
        ....
        ....

Once the arguments are declared, we read their values from matches and branch on which of them are present.

    fn main() {
        .....
        .....

        if matches.is_present("tag") && matches.is_present("count") {
            // Both flags given: scrape `count` posts for the requested tag.
            let url = format!(
                "https://stackoverflow.com/questions/tagged/{}?tab=Votes",
                matches.value_of("tag").unwrap()
            );
            let count: i32 = matches.value_of("count").unwrap().parse::<i32>().unwrap();
            hacker_news(&url, count as usize);
        } else if matches.is_present("tag") {
            // Only the tag given: use the default of 16 posts.
            let url = format!(
                "https://stackoverflow.com/questions/tagged/{}?tab=Votes",
                matches.value_of("tag").unwrap()
            );
            hacker_news(&url, 16);
        } else if matches.is_present("count") {
            // Only the count given: pick a random tag.
            let url = get_random_url();
            let count: i32 = matches.value_of("count").unwrap().parse::<i32>().unwrap();
            hacker_news(&url, count as usize);
        } else {
            // Nothing given: 16 posts from a random tag.
            let url = get_random_url();
            hacker_news(&url, 16);
        }
    }

The code above calls the hacker_news function, which, despite its name, requests and scrapes the Stack Overflow page. We will look at it in steps 2 and 3.
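
The four branches can also be folded into one small function that returns the URL and the post count. The following is a minimal sketch, not the author's code: `resolve_request`, `MAX_POSTS`, and the `fallback_tag` parameter (standing in for the random tag) are hypothetical names, and clamping to 16 is an assumption based on the stated maximum limit. It also reports a malformed count instead of calling `unwrap()`.

```rust
use std::num::ParseIntError;

/// Upper bound on posts per run, taken from the note about the maximum
/// limit (assumption: the limit is enforced by clamping).
const MAX_POSTS: usize = 16;

/// Hypothetical helper: turns the optional tag and count values into the
/// URL to fetch and the number of posts to print.
fn resolve_request(
    tag: Option<&str>,
    count: Option<&str>,
    fallback_tag: &str,
) -> Result<(String, usize), ParseIntError> {
    // A missing tag falls back to the caller-supplied one.
    let tag = tag.unwrap_or(fallback_tag);
    // A missing count means the default; a malformed one becomes an error.
    let count = match count {
        Some(raw) => raw.trim().parse::<usize>()?.min(MAX_POSTS),
        None => MAX_POSTS,
    };
    let url = format!(
        "https://stackoverflow.com/questions/tagged/{}?tab=Votes",
        tag
    );
    Ok((url, count))
}

fn main() {
    match resolve_request(Some("java"), Some("1"), "rust") {
        Ok((url, count)) => println!("{} post(s) from {}", count, url),
        Err(err) => eprintln!("invalid count: {}", err),
    }
}
```

With clap 2, `matches.value_of("tag")` and `matches.value_of("count")` already return `Option<&str>`, so they could be passed in directly.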

Step 2 -> Making the request using the Reqwest library

We use the reqwest library to send a GET request to the Stack Overflow URL built from the input tag.

#[tokio::main]
async fn hacker_news(url: &str, count: usize) -> Result<(), reqwest::Error> {
    let resp = reqwest::get(url).await?;
    ....
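
The function returns a `Result`, and the calls in `main` discard it, so a failed request would pass unnoticed. The sketch below shows one way to report the outcome. It uses only the standard library and makes no network request: `ScrapeError`, `scrape`, and `exit_code` are hypothetical stand-ins, and the real function would return `reqwest::Error` instead.

```rust
use std::fmt;

/// Stand-in error type for this sketch.
#[derive(Debug, PartialEq)]
struct ScrapeError(String);

impl fmt::Display for ScrapeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "scrape failed: {}", self.0)
    }
}

impl std::error::Error for ScrapeError {}

/// Hypothetical stand-in for the scraper: it only checks the URL scheme.
fn scrape(url: &str) -> Result<(), ScrapeError> {
    if url.starts_with("https://") {
        Ok(())
    } else {
        Err(ScrapeError(format!("unsupported URL {}", url)))
    }
}

/// Maps the outcome to a process exit code instead of discarding it.
fn exit_code(outcome: Result<(), ScrapeError>) -> i32 {
    match outcome {
        Ok(()) => 0,
        Err(err) => {
            // Errors go to stderr so they do not mix with scraped output.
            eprintln!("{}", err);
            1
        }
    }
}

fn main() {
    let url = "https://stackoverflow.com/questions/tagged/rust?tab=Votes";
    println!("exit code: {}", exit_code(scrape(url)));
}
```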

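Step 3 -> Scraping using the Select library

We use the select crate's predicates, which play the role of CSS selectors, to pull each question post out of the Stack Overflow page.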

#[tokio::main]
async fn hacker_news(url: &str, count: usize) -> Result<(), reqwest::Error> {
    ..... 
    ..... 

    let document = Document::from(&*resp.text().await?);

    for node in document.select(Class("mln24")).take(count) {
        let question = node.select(Class("excerpt")).next().unwrap().text();
        let title_element = node.select(Class("question-hyperlink")).next().unwrap();
        let title = title_element.text();
        let question_link = title_element.attr("href").unwrap();
        let votes = node.select(Class("vote-count-post")).next().unwrap().text();
        let views = node.select(Class("views")).next().unwrap().text();
        let striped_views = views.trim();
        let tags = node
            .select(Attr("class", "post-tag grid--cell"))
            .map(|tag| tag.text())
            .collect::<Vec<_>>();
        let answer = node
            .select(Or(
                Attr("class", "status answered-accepted").descendant(Name("strong")),
                Attr("class", "status answered").descendant(Name("strong")),
            ))
            .next()
            .unwrap()
            .text();
        println!("Question       => {}", question);
        println!("Question-link  => https://stackoverflow.com{}",question_link);
        println!("Question-title => {}", title);
        println!("Votes          => {}", votes);
        println!("Views          => {}", striped_views);
        println!("Tags           => {}", tags.join(" ,"));
        println!("Answers        => {}", answer);
        println!("-------------------------------------------------------------\n");
    }
    Ok(())
}
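
Every `.next().unwrap()` above panics as soon as one of the class names no longer matches anything, which can happen whenever the page markup changes. A minimal sketch of a more tolerant lookup follows, assuming a placeholder string is acceptable for a missing field. `first_or` is a hypothetical helper, and plain vectors stand in for the matched nodes so the example runs without the select crate.

```rust
/// Hypothetical helper: returns the first item of `items`,
/// or `fallback` when the iterator is empty.
fn first_or<I>(mut items: I, fallback: &str) -> String
where
    I: Iterator<Item = String>,
{
    items.next().unwrap_or_else(|| fallback.to_string())
}

fn main() {
    // Stand-ins for the text of matched nodes. With the select crate the
    // iterator would come from something like
    // `node.select(Class("views")).map(|n| n.text())`.
    let views = vec![" 12k views ".to_string()];
    let votes: Vec<String> = Vec::new(); // the selector matched nothing

    println!("Views => {}", first_or(views.into_iter(), "n/a").trim());
    println!("Votes => {}", first_or(votes.into_iter(), "n/a"));
}
```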

Full code

extern crate clap;
extern crate reqwest;
extern crate select;
extern crate tokio;

use clap::{App, Arg};
use rand::seq::SliceRandom;
use select::document::Document;
use select::predicate::{Attr, Class, Name, Or, Predicate};

fn main() {
    let matches = App::new("StackOverflow Scraper")
        .version("1.0")
        .author("Praveen Chaudhary <[email protected]>")
        .about("It will scrape questions from stackoverflow depending on the tag.")
        .arg(
            Arg::with_name("tag")
                .short("t")
                .long("tag")
                .takes_value(true)
                .help("takes tag and scrape according to this"),
        )
        .arg(
            Arg::with_name("count")
                .short("c")
                .long("count")
                .takes_value(true)
                .help("gives n count of posts"),
        )
        .get_matches();

    if matches.is_present("tag") && matches.is_present("count") {
        let url = format!(
            "https://stackoverflow.com/questions/tagged/{}?tab=Votes",
            matches.value_of("tag").unwrap()
        );
        let count: i32 = matches.value_of("count").unwrap().parse::<i32>().unwrap();
        hacker_news(&url, count as usize);
    } else if matches.is_present("tag") {
        let url = format!(
            "https://stackoverflow.com/questions/tagged/{}?tab=Votes",
            matches.value_of("tag").unwrap()
        );
        hacker_news(&url, 16);
    } else if matches.is_present("count") {
        let url = get_random_url();
        let count: i32 = matches.value_of("count").unwrap().parse::<i32>().unwrap();
        hacker_news(&url, count as usize);
    } else {        
        let url = get_random_url();        
        hacker_news(&url, 16);
    }
}

#[tokio::main]
async fn hacker_news(url: &str, count: usize) -> Result<(), reqwest::Error> {
    let resp = reqwest::get(url).await?;
    // println!("body = {:?}", resp.text().await?);
    // assert!(resp.status().is_success());
    let document = Document::from(&*resp.text().await?);

    for node in document.select(Class("mln24")).take(count) {
        let question = node.select(Class("excerpt")).next().unwrap().text();
        let title_element = node.select(Class("question-hyperlink")).next().unwrap();
        let title = title_element.text();
        let question_link = title_element.attr("href").unwrap();
        let votes = node.select(Class("vote-count-post")).next().unwrap().text();
        let views = node.select(Class("views")).next().unwrap().text();
        let striped_views = views.trim();
        let tags = node
            .select(Attr("class", "post-tag grid--cell"))
            .map(|tag| tag.text())
            .collect::<Vec<_>>();
        let answer = node
            .select(Or(
                Attr("class", "status answered-accepted").descendant(Name("strong")),
                Attr("class", "status answered").descendant(Name("strong")),
            ))
            .next()
            .unwrap()
            .text();
        println!("Question       => {}", question);
        println!("Question-link  => https://stackoverflow.com{}",question_link);
        println!("Question-title => {}", title);
        println!("Votes          => {}", votes);
        println!("Views          => {}", striped_views);
        println!("Tags           => {}", tags.join(" ,"));
        println!("Answers        => {}", answer);
        println!("-------------------------------------------------------------\n");
    }
    Ok(())
}

// Getting random tag
fn get_random_url() -> String {
    // "c#" is written percent-encoded ("c%23"); a raw '#' would start the URL fragment.
    let default_tags = vec!["python", "rust", "c%23", "android", "html", "javascript"];
    let random_tag = default_tags.choose(&mut rand::thread_rng()).unwrap();
    let url = format!(
        "https://stackoverflow.com/questions/tagged/{}?tab=Votes",
        random_tag
    );
    url.to_string()
}
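
The tag is interpolated straight into the URL, so a tag containing reserved characters such as `#` or `+` needs percent-encoding first; otherwise everything after `#` is parsed as a fragment and never reaches the server. Below is a minimal standard-library sketch. `encode_tag` and `tagged_url` are hypothetical helpers, and the encoder is deliberately conservative: it escapes every byte outside the RFC 3986 unreserved set.

```rust
/// Percent-encodes every byte that is not an RFC 3986 unreserved character.
fn encode_tag(tag: &str) -> String {
    let mut out = String::with_capacity(tag.len());
    for byte in tag.bytes() {
        match byte {
            // Unreserved characters are copied through unchanged.
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            // Everything else becomes %XX with uppercase hex digits.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Builds the tag listing URL used throughout the article.
fn tagged_url(tag: &str) -> String {
    format!(
        "https://stackoverflow.com/questions/tagged/{}?tab=Votes",
        encode_tag(tag)
    )
}

fn main() {
    // prints https://stackoverflow.com/questions/tagged/c%23?tab=Votes
    println!("{}", tagged_url("c#"));
}
```

An already encoded tag should not be passed through `encode_tag` again, since the `%` itself would be escaped.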

How to run our scraper?

Build the executable with cargo build

Run ./target/debug/stackoverflow-scraping-with-rust -t <tag> -c <count>
<tag> is the topic to scrape.
<count> is the number of posts/threads to scrape. Note: the maximum limit is 16.
For example: ./target/debug/stackoverflow-scraping-with-rust -t java -c 1

Deployment

You can deploy to Heroku with the help of CircleCI.
You can read more about it on the CircleCI Blog.

Web Preview / Output

Google Drive
GitHub link => https://github.com/chaudharypraveen98/stackoverflow-scraping-with-rust

Reference

More material on this topic (Stack Overflow Scraping with Rust) is available in the original article: https://dev.to/chaudharypraveen98/stackoverflow-scraping-with-rust-4624
