💎 How to crawl a static website with JavaScript in 4 minutes 💥

Prerequisites: some basic knowledge of JavaScript.

์˜ค๋Š˜์˜ ์ฃผ์ œ๋Š” ์ •์  ์›น ์‚ฌ์ดํŠธ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœํ•œ ๋‹ค์Œ ์ด ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋‚˜ ์ปดํ“จํ„ฐ์˜ ํŒŒ์ผ ๋˜๋Š” ์™„์ „ํžˆ ๋‹ค๋ฅธ ๊ฒƒ์œผ๋กœ ๊ตฌ์กฐํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Fetch ํฌ๋กค๋Ÿฌ(Node JS) ์†Œ๊ฐœ





Fetch Crawler is designed to provide a basic, flexible and powerful API for crawling websites.

The crawler provides a simple API for crawling static websites, with the following features:
  • Distributed crawling
  • Configurable parallelism, retries, max requests, and time between requests (so you don't get blocked by the website)...
  • Support for both depth-first search and breadth-first search algorithms
  • Stop after a maximum number of requests has been executed
  • Cheerio automatically injected for scraping
  • Promise support

  • The full documentation is available on GitHub: https://github.com/viclafouch/Fetch-Crawler
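To make the depth-first / breadth-first point concrete, here is a minimal, dependency-free sketch of how the two strategies order the crawl frontier. The toy link graph is invented for illustration and is not part of Fetch Crawler's API:

```javascript
// Toy link graph: page -> links found on that page (invented data).
const links = {
  '/': ['/a', '/b'],
  '/a': ['/a1', '/a2'],
  '/b': ['/b1'],
  '/a1': [], '/a2': [], '/b1': []
}

// Generic crawl order: `pick` decides which frontier entry to expand next.
function crawlOrder (start, pick) {
  const frontier = [start]
  const visited = []
  while (frontier.length > 0) {
    const page = pick(frontier) // shift = BFS queue, pop = DFS stack
    if (visited.includes(page)) continue
    visited.push(page)
    frontier.push(...links[page])
  }
  return visited
}

const bfs = crawlOrder('/', f => f.shift()) // breadth-first: level by level
const dfs = crawlOrder('/', f => f.pop())   // depth-first: one branch at a time

console.log(bfs.join(' ')) // / /a /b /a1 /a2 /b1
console.log(dfs.join(' ')) // / /b /b1 /a /a2 /a1
```

Breadth-first visits every page at one depth before going deeper; depth-first follows a branch of links to the end before backtracking.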

    What sets Fetch Crawler apart is that it manages requests in parallel (for example 10 requests at a time rather than one after another), which saves a lot of time.

    In other words, the library does all the work for you; you only have to configure the various options.

    Step by step:



    First, install the required dependencies:

    # npm i @viclafouch/fetch-crawler
    


    Then import the module in a js file and use the launch method of FetchCrawler. The only required parameter is the link to your website (or page), here https://github.com.

    const FetchCrawler = require('@viclafouch/fetch-crawler')
    
    FetchCrawler.launch({
      url: 'https://github.com'
    })
    


    Then run the following:

    # node example-crawl.js 
    


    If you run this file with Node.js it will work, but nothing will happen until the crawler has finished.

    Let's now move on to the basic options and methods used to extract data from a website (documentation):

    const FetchCrawler = require('@viclafouch/fetch-crawler')
    
    // $ = Cheerio to get the content of the page
    // See https://cheerio.js.org
    const collectContent = $ =>
      $('body')
        .find('h1')
        .text()
        .trim()
    
    // After getting content of the page, do what you want :)
    // Accept async function
    const doSomethingWith = (content, url) => console.log(`Here the title '${content}' from ${url}`)
    
    // Here I start my crawler
    // You can await for it if you want
    FetchCrawler.launch({
      url: 'https://github.com',
      evaluatePage: $ => collectContent($),
      onSuccess: ({ result, url }) => doSomethingWith(result, url),
      onError: ({ error, url }) => console.log('Whouaa something wrong happened :('),
      maxRequest: 20
    })
    


    ์ž, ์œ„์— ํฌํ•จ๋œ ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•๊ณผ ์˜ต์…˜์„ ๊ฒ€ํ† ํ•ด ๋ด…์‹œ๋‹ค.
    evaluatePage : ํŽ˜์ด์ง€์˜ ์ฝ˜ํ…์ธ ๋ฅผ ํƒ์ƒ‰/์กฐ์ž‘ํ•˜๋Š” ๊ธฐ๋Šฅ์ž…๋‹ˆ๋‹ค. Cheerio๋Š” ๋งˆํฌ์—…์„ ๊ตฌ๋ฌธ ๋ถ„์„ํ•˜๊ธฐ ์œ„ํ•ด ์ œ๊ณต๋˜๋ฉฐ ์ด๋ฅผ ์œ„ํ•œ ๊ฐ•๋ ฅํ•œ API๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์›น ํŽ˜์ด์ง€์—์„œ ์›ํ•˜๋Š” ์ •ํ™•ํ•œ ๋ฐ์ดํ„ฐ ์กฐ๊ฐ์„ ์ถ”์ถœํ•˜๋Š” ํŠน์ˆ˜ ๊ธฐ๋Šฅ์„ ๊ตฌ์ถ•ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    onSuccess : evaluatePage๊ฐ€ ์„ฑ๊ณตํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ํ•˜์‹œ๊ฒ ์Šต๋‹ˆ๊นŒ? ์›ํ•˜๋Š” ๋Œ€๋กœ ํ•˜์‹ญ์‹œ์˜ค(๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์— ์ถ”๊ฐ€? ํŒŒ์ผ์— ๋ฐ์ดํ„ฐ ํฌํ•จ? ๋“ฑ..).
    onError : if evaluatePage๋ผ๋Š” ์ฝœ๋ฐฑ์ด ์‹คํŒจํ•ฉ๋‹ˆ๋‹ค.
    maxRequest : ํฌ๋กค๋Ÿฌ๊ฐ€ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ์ตœ๋Œ€ ์š”์ฒญ ์ˆ˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์ œํ•œ์„ ๋น„ํ™œ์„ฑํ™”ํ•˜๋ ค๋ฉด ์ „๋‹ฌ-1ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์œ„์˜ ์˜ˆ์—์„œ๋Š” 20๋ฒˆ์˜ ์š”์ฒญ ํ›„์— ํฌ๋กค๋Ÿฌ๋ฅผ ์ค‘์ง€ํ•˜๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค(์‹คํŒจํ•˜๋”๋ผ๋„).

    For the rest of the configuration options, see the documentation.

    A practical example:



    Let's take a video game website as an example: Instant Gaming.

    ์šฐ๋ฆฌ์˜ ๋ชฉํ‘œ: ์›น ์‚ฌ์ดํŠธ์—์„œ ํŒ๋งค ์ค‘์ธ ๋น„๋””์˜ค ๊ฒŒ์ž„(Xbox)์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณต๊ตฌํ•˜๊ณ  JSON ํŒŒ์ผ๋กœ ์ปดํŒŒ์ผํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ํ”„๋กœ์ ํŠธ์—์„œ ์žฌ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค(์˜ˆ: ์ด ๋ชฉ๋ก์„ ์‹ค์‹œ๊ฐ„์œผ๋กœ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ๋Š” Chrome ํ™•์žฅ ํ”„๋กœ๊ทธ๋žจ).

    ์ด๊ฒƒ์ด ์šฐ๋ฆฌ ํŒŒ์ผexample-crawl.js์— ํฌํ•จ๋œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

    const fs = require('fs')
    const FetchCrawler = require('@viclafouch/fetch-crawler')
    
    // Get all games on xbox platform
    const urlToCrawl = 'https://www.instant-gaming.com/en/search/?type%5B0%5D=xbox'
    let games = []
    
    // I'm getting an array of each game on the page (name, price, cover, discount)
    const collectContent = $ => {
      const content = []
      $('.item.mainshadow').each(function(i, elem) {
        content.push({
          name: $(this)
            .find('.name')
            .text()
            .trim(),
          price: $(this)
            .find('.price')
            .text()
            .trim(),
          discount: $(this)
            .find('.discount')
            .text()
            .trim(),
          cover: $(this)
            .find('.picture')
            .attr('src')
        })
      })
      return content
    }
    
    // Only url including an exact string
    const checkUrl = url => {
      try {
        const link = new URL(url)
        if (link.searchParams.get('type[0]') === 'xbox' && link.searchParams.get('page')) {
          return url
        }
        return false
      } catch (error) {
        return false
      }
    }
    
    // Concat my new games to my array
    const doSomethingWith = content => (games = games.concat(content))
    
    // Await for the crawler, and then save result in a JSON file
    ;(async () => {
      try {
        await FetchCrawler.launch({
          url: urlToCrawl,
          evaluatePage: $ => collectContent($),
          onSuccess: ({ result, url }) => doSomethingWith(result, url),
          preRequest: url => checkUrl(url),
          maxDepth: 4,
          parallel: 6
        })
        const jsonResult = JSON.stringify({ ...games }, null, 2)
        await fs.promises.writeFile('examples/example_4.json', jsonResult)
      } catch (error) {
        console.error(error)
      }
    })()
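The preRequest filter deserves a note: it keeps only paginated Xbox search URLs and rejects everything else, including strings that are not valid URLs at all. Here is the same check as a standalone sketch you can exercise directly (the sample URLs are assumptions for illustration):

```javascript
// Standalone copy of the preRequest filter from the example above:
// accept only paginated Xbox search URLs, reject everything else.
const checkUrl = url => {
  try {
    const link = new URL(url)
    if (link.searchParams.get('type[0]') === 'xbox' && link.searchParams.get('page')) {
      return url
    }
    return false
  } catch (error) {
    // new URL() throws on strings that are not valid URLs
    return false
  }
}

const kept = checkUrl('https://www.instant-gaming.com/en/search/?type%5B0%5D=xbox&page=2')
const droppedPage = checkUrl('https://www.instant-gaming.com/en/about')
const droppedJunk = checkUrl('not a url')

console.log(kept)        // the url itself (will be crawled)
console.log(droppedPage) // false (skipped)
console.log(droppedJunk) // false (invalid URL)
```

Note that `%5B0%5D` in the query string decodes to `[0]`, which is why `searchParams.get('type[0]')` matches.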
    


    Now just launch the crawler and wait a few seconds:

    # node example-crawl.js 
    


    ๋‹ค์Œ์€ JSON ํŒŒ์ผ์ž…๋‹ˆ๋‹ค. https://github.com/viclafouch/Fetch-Crawler/blob/master/examples/example_4.json

    As you can see, we get very clean data in the JSON file. Obviously, the data on the website will change soon, so we could just re-run the crawler every 24 hours.
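If you want to refresh the data every 24 hours, the simplest approach is a plain interval in a long-running script. A sketch, assuming a runCrawler function (a hypothetical name) that wraps the FetchCrawler.launch call and the JSON dump from the example above:

```javascript
const DAY_MS = 24 * 60 * 60 * 1000

// Assumed wrapper around the FetchCrawler.launch call and JSON dump above.
async function runCrawler () {
  console.log(`Crawl started at ${new Date().toISOString()}`)
  // await FetchCrawler.launch({ ... }) and write the JSON file here
}

// Run once at startup, then schedule a run every 24 hours.
runCrawler()
const timer = setInterval(runCrawler, DAY_MS)

// In a real script you would leave the timer running; we clear it here
// only so this self-contained example exits cleanly.
clearInterval(timer)
```

For production use, a system scheduler such as cron invoking `node example-crawl.js` once a day is an equally valid (and more restart-safe) choice.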

    To learn more about the Fetch Crawler package, check out the documentation.

    ...

    Thanks for reading.

    Feel free to contribute to this package with me :)
    I built this package because I needed it for a Google project, and extracting the data was quite difficult.

    ์ข‹์€ ์›นํŽ˜์ด์ง€ ์ฆ๊ฒจ์ฐพ๊ธฐ