Scraping Google Scholar Results

This article shows how to scrape Google Scholar result pages with Node JS using Unirest and Cheerio.



Table of Contents


  • Requirements
  • Scraping Google Organic Scholar Results
  • Scraping Google Scholar Profiles
  • Scraping Google Scholar Cite Results
  • Scraping Google Scholar Author Profile
  • Conclusion

  • Additional Resources

  • Requirements:



    Web Parsing with CSS Selectors



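    Searching for tags in an HTML file is not only difficult but also time-consuming. To make your web scraping journey easier, it is better to use the CSS Selectors Gadget to select the right tags.

    This gadget can help you find the CSS selector that fits your needs. Here is the link to the tutorial, which teaches you how to use the gadget to pick the best CSS selectors.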

    User Agents



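    The User-Agent request header identifies the application, operating system, vendor, and version of the requesting client. Sending a browser-like value helps a request to Google look like a visit from a real user.
    You can also rotate User Agents. Read more about this in the article How to fake and rotate User Agents using Python 3.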
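
    As a rough illustration of rotation in Node JS, the sketch below picks a User-Agent at random for each request. The pool contents and the helper names pickUserAgent and buildHeaders are assumptions made for this example, not part of Unirest; the two strings are simply the ones used in the code later in this article.

```javascript
// Hypothetical pool of User-Agent strings; extend it with current browser strings.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
];

// Pick one entry at random. `random` is injectable so the choice can be tested.
function pickUserAgent(agents = USER_AGENTS, random = Math.random) {
  const index = Math.floor(random() * agents.length);
  return agents[index];
}

// Build a headers object suitable for passing to a request's .headers(...) call.
function buildHeaders() {
  return { "User-Agent": pickUserAgent() };
}

console.log(buildHeaders());
```

    Calling buildHeaders() once per request, instead of hard-coding a single string, is one simple way to vary the header between requests.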

    To further protect your IP from being blocked by Google, you can try these 10 Tips to avoid getting Blocked while Scraping Websites.

    Install Libraries



    Before we begin, install these libraries so that we can prepare our scraper:
  • Unirest JS
  • Cheerio JS

  • Or you can type the commands below in your project terminal to install the libraries:

    npm i unirest
    npm i cheerio
    


    We will use Unirest JS to fetch the HTML data and Cheerio JS to parse it.

    Scraping Google Organic Scholar Results:





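    We will scrape the title, title link, ID, displayed link, snippet, and the other links shown for each result (cited by and versions).
    Here is the complete code for scraping the Google Organic Scholar results 👇🏻: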

    const cheerio = require("cheerio");
    const unirest = require("unirest");

    const getScholarData = async () => {
      try {
        const url = "https://www.google.com/scholar?q=IIT+MUMBAI&hl=en";

        return unirest
          .get(url)
          .headers({
            // Send a browser-like User-Agent with the request.
            "User-Agent":
              "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
          })
          .then((response) => {
            const $ = cheerio.load(response.body);

            const scholar_results = [];

            // Each organic result sits in a .gs_ri container.
            $(".gs_ri").each((i, el) => {
              scholar_results.push({
                title: $(el).find(".gs_rt").text(),
                title_link: $(el).find(".gs_rt a").attr("href"),
                id: $(el).find(".gs_rt a").attr("id"),
                displayed_link: $(el).find(".gs_a").text(),
                snippet: $(el).find(".gs_rs").text().replace(/\n/g, ""),
                cited_by_count: $(el).find(".gs_nph+ a").text(),
                cited_link: "https://scholar.google.com" + $(el).find(".gs_nph+ a").attr("href"),
                versions_count: $(el).find("a~ a+ .gs_nph").text(),
                versions_link: $(el).find("a~ a+ .gs_nph").text() ? "https://scholar.google.com" + $(el).find("a~ a+ .gs_nph").attr("href") : "",
              });
            });

            // Drop fields that came back empty or undefined.
            for (let i = 0; i < scholar_results.length; i++) {
              Object.keys(scholar_results[i]).forEach((key) => {
                if (scholar_results[i][key] === "" || scholar_results[i][key] === undefined) {
                  delete scholar_results[i][key];
                }
              });
            }

            console.log(scholar_results);
          });
      } catch (e) {
        console.log(e);
      }
    };

    getScholarData();
    

    Our result should look like this 👇🏻:

    [
        {
            title: 'Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study.',
            title_link: 'https://search.ebscohost.com/login.aspx?direct=true&profile=ehost&scope=site&authtype=crawler&jrnl=22295984&AN=108373670&h=bqlRj0gjNNQoSuJb5zZxtrAWRoe7e4cT7cfMNTEYxWbUdYAXdv0An55XKjithW%2FT3A9v3vC8m87cvR3EXu%2BdkA%3D%3D&crl=c',
            id: 'TPhPjzP8H_MJ',
            displayed_link: 'SK Gupta, S Sharma - International Journal of Information …, 2015 - search.ebscohost.com',
            snippet: "The rapid advancement in information technology has changed the resources and services of a library. Now day's libraries are not confined only to print resources and traditional library …",
            cited_by_count: 'Cited by 19',
            cited_link: 'https://scholar.google.com/scholar?cites=17518998373872433228&as_sdt=2005&sciodt=0,5&hl=en',
            versions_count: 'All 5 versions',
            versions_link: 'https://scholar.google.com/scholar?cluster=17518998373872433228&hl=en&as_sdt=0,5'
        },
        {
            title: '[PDF][PDF] Design of Solar powered vehicle. project III, Industrial Design Center, IIT Mumbai',
            title_link: 'https://dsource.in/sites/default/files/case-study/solar-powered-rickshaw/introduction/file/solar-powered-rickshaw.pdf',
            id: '_w_nBYVUe8AJ',
            displayed_link: 'UA Athavankar, SR Singh - 2016 - dsource.in',
            snippet: 'The greatest problem that faces the world today is Global warming. It is more apparent here in India than anywhere else, specially Rajasthan where temperatures over the last few years …',
            cited_by_count: 'Cited by 2',
            cited_link: 'https://scholar.google.com/scholar?cites=13869772407723986943&as_sdt=2005&sciodt=0,5&hl=en'
        },
        ....
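
    The cleanup loop in the code above is what makes the second result omit versions_count and versions_link. A minimal sketch of the same idea as a reusable, non-mutating helper follows; the name removeEmptyFields and the sample records are hypothetical and only serve the illustration.

```javascript
// Return a copy of `record` without keys whose value is "" or undefined.
function removeEmptyFields(record) {
  return Object.fromEntries(
    Object.entries(record).filter(
      ([, value]) => value !== "" && value !== undefined
    )
  );
}

// Hypothetical records shaped like the scraped results.
const sample = [
  { title: "A", versions_count: "", versions_link: "" },
  { title: "B", title_link: undefined, cited_by_count: "Cited by 2" },
];

const cleaned = sample.map(removeEmptyFields);
console.log(cleaned);
// [ { title: 'A' }, { title: 'B', cited_by_count: 'Cited by 2' } ]
```

    Because the helper returns new objects instead of deleting keys in place, it could be mapped over any of the result arrays in this article.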
    

    Scraping Google Scholar Profiles





    Now, we will scrape the author's name, profile link, position and departments in the organization, email, and cited-by count.
    Here is the code 👇🏻:

    const unirest = require("unirest");
    const cheerio = require("cheerio");

    const getScholarProfiles = async () => {
      try {
        const url = "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=IIT+MUMBAI";

        return unirest
          .get(url)
          .headers({
            // Send a browser-like User-Agent with the request.
            "User-Agent":
              "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
          })
          .then((response) => {
            const $ = cheerio.load(response.body);

            const scholar_profiles = [];

            // Each author card sits in a .gsc_1usr container.
            $(".gsc_1usr").each((i, el) => {
              scholar_profiles.push({
                name: $(el).find(".gs_ai_name").text(),
                name_link: "https://scholar.google.com" + $(el).find(".gs_ai_name a").attr("href"),
                position: $(el).find(".gs_ai_aff").text(),
                email: $(el).find(".gs_ai_eml").text(),
                departments: $(el).find(".gs_ai_int").text(),
                // The text reads "Cited by N"; keep only N.
                cited_by_count: $(el).find(".gs_ai_cby").text().split(" ")[2],
              });
            });

            // Drop fields that came back empty or undefined.
            for (let i = 0; i < scholar_profiles.length; i++) {
              Object.keys(scholar_profiles[i]).forEach((key) => {
                if (scholar_profiles[i][key] === "" || scholar_profiles[i][key] === undefined) {
                  delete scholar_profiles[i][key];
                }
              });
            }

            console.log(scholar_profiles);
          });
      } catch (e) {
        console.log(e);
      }
    };

    getScholarProfiles();
    

    Our result should look like this 👇🏻:

    
      [
        {
            name: 'Piyali Banerjee',
            name_link: 'https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ',
            position: 'Postdoctoral Researcher in Physics, IIT Bombay',
            email: 'Verified email at iitb.ac.in',
            departments: 'Experimental High Energy Physics Phenomenology ',
            cited_by_count: '230769'
        },
        {
            name: 'Archana Pai',
            name_link: 'https://scholar.google.com/citations?hl=en&user=2Dw4Y9AAAAAJ',
            position: 'IIT Bombay',
            email: 'Verified email at phy.iitb.ac.in',
            departments: 'Gravitational Wave Astronomy Statistical Signal Processing Multimessenger astronomy ',
            cited_by_count: '70703'
        },
        {
            name: 'Krithi Ramamritham',
            name_link: 'https://scholar.google.com/citations?hl=en&user=LFLG5pcAAAAJ',
            position: 'Sai University, Chennai, India (retired from IIT Bombay)',
            email: 'Verified email at iitb.ac.in',
            departments: 'databases real-time systems ICT based  solutions for society ',
            cited_by_count: '23765'
        },
        ....
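
    The counts in these results are strings, either bare digits such as '230769' or labels such as 'Cited by 19' and 'All 5 versions' in the organic results. If numeric values are needed, a small helper along the following lines could convert them. The name parseCount is hypothetical, and the sketch assumes the label contains a single run of digits.

```javascript
// Extract the first run of digits from a label and return it as a number.
// Returns null when the label is missing or contains no digits.
function parseCount(label) {
  if (typeof label !== "string") return null;
  const match = label.match(/\d+/);
  return match ? Number(match[0]) : null;
}

console.log(parseCount("Cited by 19"));    // 19
console.log(parseCount("All 5 versions")); // 5
console.log(parseCount("230769"));         // 230769
console.log(parseCount(undefined));        // null
```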
    


    Scraping Google Scholar Cite Results





    The code block below scrapes the cite results of an organic Scholar search result.

    const cheerio = require("cheerio");
    const unirest = require("unirest");
    
    const getData = async () => {
        try {
        const url =
            "https://scholar.google.com/scholar?q=info:TPhPjzP8H_MJ:scholar.google.com&output=cite";
    
        return unirest
            .get(url)
            .headers({})
            .then((response) => {
            let $ = cheerio.load(response.body);
    
            let cite_results = [];
    
            $("#gs_citt tr").each((i, el) => {
                cite_results.push({
                title: $(el).find(".gs_cith").text(),
                snippet: $(el).find(".gs_citr").text(),
                });
            });
    
            let links = [];
    
            $("#gs_citi .gs_citi").each((i, el) => {
                links.push({
                name: $(el).text(),
                link: $(el).attr("href"),
                });
            });
    
            console.log(cite_results);
            console.log(links);
    
            });
        } catch (e) {
        console.log(e);
        }
    };
    getData();                                
    

    If you look at the target URL, the string after info: is nothing but the ID we obtained by scraping the Google Scholar organic results.
    Our result should look like this 👇🏻:

    
      [
        {
            title: 'MLA',
            snippet: 'Gupta, Sanjay Kumar, and Sanjeev Sharma. "Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study." International Journal of Information Dissemination & Technology 5.1 (2015).'
        },
        {
            title: 'APA',
            snippet: 'Gupta, S. K., & Sharma, S. (2015). Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study. International Journal of Information Dissemination & Technology, 5(1).'
        },
        {
            title: 'Chicago',
            snippet: 'Gupta, Sanjay Kumar, and Sanjeev Sharma. "Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study." International Journal of Information Dissemination & Technology 5, no. 1 (2015).'
        },
        {
            title: 'Harvard',
            snippet: 'Gupta, S.K. and Sharma, S., 2015. Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study. International Journal of Information Dissemination & Technology, 5(1).'
        },
        {
            title: 'Vancouver',
            snippet: 'Gupta SK, Sharma S. Use of Digital Information Resources and Services by the Students of IIT Mumbai Central Library: A Study. International Journal of Information Dissemination & Technology. 2015 Jan 1;5(1).'
        }
      ]
      [
        {
            name: 'BibTeX',
            link: 'https://scholar.googleusercontent.com/scholar.bib?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=4&ct=citation&cd=-1&hl=en'
        },
        {
            name: 'EndNote',
            link: 'https://scholar.googleusercontent.com/scholar.enw?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=3&ct=citation&cd=-1&hl=en'
        },
        {
            name: 'RefMan',
            link: 'https://scholar.googleusercontent.com/scholar.ris?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=2&ct=citation&cd=-1&hl=en'
        },
        {
            name: 'RefWorks',
            link: 'https://scholar.googleusercontent.com/scholar.rfw?q=info:TPhPjzP8H_MJ:scholar.google.com/&output=citation&scisdr=CgU3OnawGAA:AAGBfm0AAAAAYxNDSiRdOB5oj4ETzNEdl0FPaLTdQMOA&scisig=AAGBfm0AAAAAYxNDSvsFzqtEKkNp_fOIv-P0--SSVfpG&scisf=1&ct=citation&cd=-1&hl=en'
        }
      ]
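
    Since the cite URL differs between results only in the ID placed after info:, it can be assembled from the id field of an organic result. The sketch below assumes the URL pattern stays exactly as in the code above; buildCiteUrl is a hypothetical helper name.

```javascript
// Build the cite URL for one organic result, given its `id` field.
function buildCiteUrl(resultId) {
  if (!resultId) {
    throw new Error("resultId is required");
  }
  // encodeURIComponent leaves typical IDs such as "TPhPjzP8H_MJ" unchanged.
  const id = encodeURIComponent(resultId);
  return `https://scholar.google.com/scholar?q=info:${id}:scholar.google.com&output=cite`;
}

console.log(buildCiteUrl("TPhPjzP8H_MJ"));
// https://scholar.google.com/scholar?q=info:TPhPjzP8H_MJ:scholar.google.com&output=cite
```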
    
    


    Scraping Google Scholar Author Profile





    Now, we will scrape the Google Scholar Author Profile.
    First, we will scrape the author's name, position, email, and departments.



    const unirest = require("unirest");
    const cheerio = require("cheerio");
    
    const getAuthorProfileData = async () => {
    try {
        const url = "https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ";
    
        return unirest.get(url)
        .headers({
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
        })
        .then((response) => {
        const $ = cheerio.load(response.body)                                
        let author_results = {};
    
        author_results.name = $("#gsc_prf_in").text();
        author_results.position = $("#gsc_prf_inw+ .gsc_prf_il").text();
        author_results.email = $("#gsc_prf_ivh").text();
        author_results.departments = $("#gsc_prf_int").text();
    
        console.log(author_results);
    })
    } catch (e) {
        console.log(e);
    }
    };
    getAuthorProfileData();
    

    Our result should look like this 👇🏻:

      {
        name: 'Piyali Banerjee',
        position: 'Postdoctoral Researcher in Physics, IIT Bombay',
        email: 'Verified email at iitb.ac.in',
        departments: 'Experimental High Energy PhysicsPhenomenology'
      }
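
    The profile URL used here is the same as the name_link returned for this author by the profiles search, so the two scrapers can be chained. As a sketch under that assumption, the hypothetical helpers below pull the user parameter out of a name_link and rebuild a profile URL from it.

```javascript
// Read the `user` query parameter from a profile link.
function getAuthorId(nameLink) {
  return new URL(nameLink).searchParams.get("user");
}

// Rebuild a profile URL from an author ID; `language` maps to the hl parameter.
function buildAuthorProfileUrl(authorId, language = "en") {
  const params = new URLSearchParams({ hl: language, user: authorId });
  return `https://scholar.google.com/citations?${params.toString()}`;
}

const nameLink = "https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ";
const authorId = getAuthorId(nameLink);

console.log(authorId);                        // cOsxSDEAAAAJ
console.log(buildAuthorProfileUrl(authorId)); // same URL as nameLink
```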
    
    


    Now, we will scrape the articles written by the author from the profile.

    $(".gsc_a_t").each((i,el) => {
        articles.push({
            title: $(el).find(".gsc_a_at").text(),
            link: "https://scholar.google.com" + $(el).find(".gsc_a_at").attr("href"),
            authors: $(el).find(".gsc_a_at+ .gs_gray").text(),
            publication: $(el).find(".gs_gray+ .gs_gray").text()
        })
    }) 
    
    for (let i = 0; i < articles.length; i++) {
        Object.keys(articles[i]).forEach((key) =>
            articles[i][key] === "" || articles[i][key] === undefined
            ? delete articles[i][key]
            : {}
        );
        }
    

    Our result should look like this:

     [
      {
        title: 'Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC',
        link: 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=cOsxSDEAAAAJ&citation_for_view=cOsxSDEAAAAJ:u5HHmVD_uO8C',
        authors: 'G Aad, T Abajyan, B Abbott, J Abdallah, SA Khalek, AA Abdelalim, ...',
        publication: 'Physics Letters B 716 (1), 1-29, 2012'
      },
      {
        title: 'The ATLAS simulation infrastructure',
        link: 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=cOsxSDEAAAAJ&citation_for_view=cOsxSDEAAAAJ:d1gkVwhDpl0C',
        authors: 'G Aad, B Abbott, J Abdallah, AA Abdelalim, A Abdesselam, B Abi, ...',
        publication: 'The European Physical Journal C 70 (3), 823-874, 2010'
      },
    
    


    Now, we will scrape the Google Scholar Author Profile Cited By results, which cover citations, h-index, and i10-index, both overall and since 2017.

    let cited_by = {};
    
    cited_by.table = [];
    cited_by.table[0] = {};
    cited_by.table[0].citations = {};
    cited_by.table[0].citations.all = $("tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std").text();
    cited_by.table[0].citations.since_2017 = $("tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std").text();
    cited_by.table[1] = {};
    cited_by.table[1].h_index = {};
    cited_by.table[1].h_index.all = $("tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std").text();
    cited_by.table[1].h_index.since_2017 = $("tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std").text();
    cited_by.table[2] = {};
    cited_by.table[2].i_index = {};
    cited_by.table[2].i_index.all = $("tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std").text();
    cited_by.table[2].i_index.since_2017 = $("tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std").text();
    

    The result (the contents of cited_by.table) looks like this 👇🏻:

    [
      { citations: { all: '230769', since_2017: '105070' } },
      { h_index: { all: '185', since_2017: '133' } },
      { i_index: { all: '1154', since_2017: '706' } }
    ]
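
    The table is an array of single-key objects whose values are strings. If a flat object with numeric values is easier to work with, it could be reshaped roughly as follows. summarizeCitedBy is a hypothetical helper, and the sample input is the result shown above.

```javascript
// Merge the single-key rows into one object and convert the counts to numbers.
function summarizeCitedBy(table) {
  const summary = {};
  for (const row of table) {
    for (const [metric, values] of Object.entries(row)) {
      summary[metric] = {
        all: Number(values.all),
        since_2017: Number(values.since_2017),
      };
    }
  }
  return summary;
}

const table = [
  { citations: { all: "230769", since_2017: "105070" } },
  { h_index: { all: "185", since_2017: "133" } },
  { i_index: { all: "1154", since_2017: "706" } },
];

console.log(summarizeCitedBy(table));
```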
    

    Here is the complete code to scrape the whole Google Scholar Author Profile page 👇🏻:

        const cheerio = require("cheerio");
        const unirest = require("unirest");
    
        const getAuthorProfileData = async () => {
        try {
        const url = "https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ";
    
        return unirest
        .get(url)
        .headers({
            "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
        })
        .then((response) => {
            let $ = cheerio.load(response.body);
    
            let author_results = {};
            let articles = [];
    
            author_results.name = $("#gsc_prf_in").text();
            author_results.position = $("#gsc_prf_inw+ .gsc_prf_il").text();
            author_results.email = $("#gsc_prf_ivh").text();
            author_results.departments = $("#gsc_prf_int").text();
    
            $("#gsc_a_b .gsc_a_t").each((i, el) => {
                articles.push({
                    title: $(el).find(".gsc_a_at").text(),
                    link: "https://scholar.google.com" + $(el).find(".gsc_a_at").attr("href"),
                    authors: $(el).find(".gsc_a_at+ .gs_gray").text(),
                    publication: $(el).find(".gs_gray+ .gs_gray").text()
                })
            })
    
            for (let i = 0; i < articles.length; i++) {
                Object.keys(articles[i]).forEach((key) =>
                    articles[i][key] === "" || articles[i][key] === undefined
                        ? delete articles[i][key]
                        : {}
                );
            }
    
            let cited_by = {};
    
            cited_by.table = [];
            cited_by.table[0] = {};
            cited_by.table[0].citations = {};
            cited_by.table[0].citations.all = $("tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std").text();
            cited_by.table[0].citations.since_2017 = $("tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std").text();
            cited_by.table[1] = {};
            cited_by.table[1].h_index = {};
            cited_by.table[1].h_index.all = $("tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std").text();
            cited_by.table[1].h_index.since_2017 = $("tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std").text();
            cited_by.table[2] = {};
            cited_by.table[2].i_index = {};
            cited_by.table[2].i_index.all = $("tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std").text();
            cited_by.table[2].i_index.since_2017 = $("tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std").text();
    
            console.log(author_results);
            console.log(articles);
            console.log(cited_by.table);
        })
    
        } catch (e) {
        console.log(e);
        }
        };
        getAuthorProfileData();
    


    Conclusion:



    In this tutorial, we learned how to scrape Google Scholar results using Node JS. Feel free to message me if I missed something or if anything needs clarification. Thanks for reading!

    Additional Resources


  • Scrape Google Organic Search Result
  • Scrape Google Images Results
  • Scrape Google News Results
  • Scrape Google Maps Reviews

  • Author:



    My name is Darshan, and I am the founder of serpdog.io.
