BE TIL Day 9 0324
Main Note
https://docs.google.com/document/d/1IZ5yYEtX92E7k2ijoAZZB3W_nBG9MpGPX6OKk_POxLQ/edit
Scraping vs. Crawling
Scraping
- Fetching another site's data a single time.
- Cheerio is the tool used here.
Crawling
- Fetching another site's data repeatedly, on an ongoing basis.
- Puppeteer is the tool used here.
How scraping/crawling works:
Open the developer tools (Inspect) with command + option + i.
The Elements tab shows the page's tags, for example an <em> tag.
--> Fetching that data is scraping; what to do with the data afterwards is up to the developer.
XML
- Before learning about scraping, it helps to know the format that came before JSON: XML.
- XML: Extensible Markup Language
--> It uses </> tags, like HTML (HyperText Markup Language).
--> Examples of XML tags: <Writer/>, <School/>, etc.
- Before JSON, data was written in the <Name>JB</Name> format.
--> Inefficient (an opening and a closing tag are needed to wrap every value).
- With JSON this overhead goes away, but a web page is still received as HTML, in string form.
--> It can be fetched in Postman. GET https://naver.com: the response contains the data of the page's elements.
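To make the XML vs. JSON comparison concrete, here is a minimal sketch that writes the same record both ways. The record and the helper names (`toXml`, `toJson`) are made up for illustration, and the XML helper does not escape special characters.

```javascript
// Hypothetical record used only for illustration.
const sampleRecord = { Name: "JB", School: "Example School" };

// XML style: every value is wrapped in an opening and a closing tag.
// Note: this sketch does not escape characters such as < or &.
function toXml(obj) {
  return Object.entries(obj)
    .map(([key, value]) => `<${key}>${value}</${key}>`)
    .join("");
}

// JSON style: each key appears once, next to its value.
function toJson(obj) {
  return JSON.stringify(obj);
}

console.log(toXml(sampleRecord));
console.log(toJson(sampleRecord));
```

In the XML form each key name is written twice (once to open, once to close), which is the inefficiency the notes point out.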
Scraping
Cheerio
Cheerio parses the HTML string a site returns so that its tags can be selected and read. [tool]
- When a link is sent to another service, for example Discord, a preview image and title appear in the link box.
- When a site is built, meta tags with og properties are added inside its head tag. Discord's developers read these tags.
--> This is how the link preview is created.
- og was created by Facebook, which first wanted to build link previews. og stands for Open Graph.
- If I'm creating my own site and its address is mysite.com, the meta tags should be added to the head tag from the start: <meta property="og:title" content="..." />, <meta property="og:image" content="..." />
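As a rough sketch of what those head tags look like for a site like mysite.com, the helper below turns an object into og meta tags. `buildOgTags`, `escapeAttr`, and the sample values are hypothetical names chosen for this example.

```javascript
// Escape characters that would break an HTML attribute value.
function escapeAttr(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Turn { title: "...", image: "..." } into og meta tags for the head tag.
function buildOgTags(og) {
  return Object.entries(og)
    .map(([key, value]) => `<meta property="og:${key}" content="${escapeAttr(value)}" />`)
    .join("\n");
}

// Hypothetical values for mysite.com.
console.log(buildOgTags({
  title: "My Site",
  image: "https://mysite.com/preview.png",
}));
```

A service that builds link previews would look for exactly these property/content pairs when it scrapes the page.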
Process
- The user uploads a post that contains --> title: "Hi there, this is my title", contents: "The weather's nice today. I want you guys to visit this site: aaa.com"
--> The goal is to show users a link preview (the title and image of the linked site).
- To achieve this, the title and contents are sent to the backend via the API.
--> POST '/boards' => sent as JSON
- The backend developer picks out the link that starts with http from the contents. This is where scraping is needed. (axios.get)
--> The response is stored in a variable.
- Then look for the og meta tags, the same ones visible in the Elements tab of the developer tools.
--> After picking out the needed data, the title, contents, and og values are saved to the database.
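The link-picking step can be sketched with a regular expression. This is one possible approach, not necessarily the one used in class, and `extractFirstUrl` is a hypothetical helper name.

```javascript
// Return the first http(s) link found in the text, or null when there is none.
function extractFirstUrl(contents) {
  const match = contents.match(/https?:\/\/\S+/);
  return match ? match[0] : null;
}

console.log(extractFirstUrl("I want you guys to visit this site: https://naver.com today"));
console.log(extractFirstUrl("I want you guys to visit this site: aaa.com"));
```

A bare address such as aaa.com is not detected because it does not start with http, so the backend has to handle the "no link" case. Punctuation directly attached to the link would also be captured by this simple pattern.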
Practice
import axios from "axios"
import * as cheerio from "cheerio" // namespace import; newer cheerio releases dropped the default export

async function createBoardAPI(mydata) { // mydata <== the data received from the frontend (frontendData)
  // Splitting on spaces gives an array of single words,
  // filter keeps only the words that start with "http",
  // and index 0 picks out just the address itself.
  const targetUrl = mydata.contents.split(" ").filter((el) => el.startsWith("http"))[0]
  if (!targetUrl) return // no link in the post, so there is nothing to scrape

  const aaa = await axios.get(targetUrl) // aaa.data holds the page's HTML as a string
  const $ = cheerio.load(aaa.data) // $ is what selects and controls specific tags

  // .each works like a for loop over every meta tag.
  // _ : index of the meta tag, el : the element itself (e.g. the 3rd meta tag)
  $("meta").each((_, el) => {
    // Only the meta tags whose property contains og: are needed,
    // so the if statement skips all the others.
    const property = $(el).attr("property")
    if (property && property.startsWith("og:")) {
      // split(":") divides "og:title" into ["og", "title"], so index 1 is "title"
      const key = property.split(":")[1] // e.g. title --> key
      const value = $(el).attr("content") // e.g. the word "NAVER" --> value
      console.log(key, value)
    }
  })
}

const frontendData = { // the frontend registers a post with the contents below
  title: "Hi there, this is my title",
  contents: "The weather's nice today. I want you guys to visit this site: https://naver.com that's the one~"
}

createBoardAPI(frontendData)
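The practice code only logs each key and value. To cover the last step of the process (saving title, contents, and og values), the pairs could be collected into one object first. The sketch below assumes the meta tags have already been reduced to plain `{ property, content }` pairs; `collectOgTags`, `buildBoardRecord`, and the sample data are hypothetical.

```javascript
// Collect { property, content } pairs into an object such as { title, image }.
function collectOgTags(metaTags) {
  const og = {};
  for (const { property, content } of metaTags) {
    if (!property || !property.startsWith("og:")) continue; // skip non-og meta tags
    const key = property.slice("og:".length); // "og:title" -> "title"
    og[key] = content;
  }
  return og;
}

// Combine the post with its og data into one record for the database.
function buildBoardRecord(post, og) {
  return { title: post.title, contents: post.contents, og };
}

// Hypothetical sample data standing in for the scraped meta tags.
const sampleMeta = [
  { property: "og:title", content: "NAVER" },
  { property: "og:image", content: "https://example.com/og.png" },
  { property: undefined, content: "width=device-width" },
];
const boardRecord = buildBoardRecord(
  { title: "Hi there, this is my title", contents: "visit https://naver.com" },
  collectOgTags(sampleMeta)
);
console.log(boardRecord);
```

How the record is actually written to the database depends on the project's database layer, which these notes do not cover.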
onclick is an attribute.
property is also an attribute, as in <meta property="og:title" />.
When scraping is repeated continuously, it becomes crawling.
Crawling
Puppeteer is used when the work requires opening a browser and doing something inside it. [tool]
// Legal case over crawling Yeogi Eottae (goodchoice): https://biz.chosun.com/topics/law_firm/2021/09/29/OOBWHWT5ZBF7DESIRKNPYIODLA/
// Indiscriminate crawling requests add to the site's traffic, so it needs much more memory => and eventually more computers
import puppeteer from 'puppeteer'

// page.waitForTimeout was removed in newer Puppeteer versions, so a plain timer is used instead
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function startCrawling() { // every step has to be awaited one by one (open the browser, open a page, ...)
  const browser = await puppeteer.launch({ headless: false }) // the browser window appears
  const page = await browser.newPage() // open a new page
  await page.setViewport({ width: 1280, height: 720 }) // the page size can be set as well
  await page.goto("https://www.goodchoice.kr/product/search/2") // opens in Chromium; Chrome is a browser built on top of Chromium (they are not the same thing)
  await sleep(1000) // give the page some time after connecting

  // $eval targets a single element, $$eval is for selecting several
  // '>' means a direct child tag, e.g. "div > span": the span is a child of the div
  const star = await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.name > div > span", (el) => el.textContent)
  // Only the number in nth-child() differs for another hotel, e.g.
  // #poduct_list_area > li:nth-child(3) > a > div > div.name > div > span
  // => so a for loop can fetch the data for every item
  await sleep(1000)
  const location = (await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.name > p:nth-child(4)", (el) => el.textContent)).trim()
  await sleep(1000)
  const price = await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.price > p > b", (el) => el.textContent)
  await sleep(1000)

  console.log("star:", star)
  console.log("location:", location)
  console.log("price:", price)
  await browser.close() // close the browser once crawling is finished
}

startCrawling()
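Since the selectors for different list items differ only in the nth-child number, they can be generated in a loop. The sketch below only builds the selector strings and does not launch a browser. `buildItemSelector`, `buildSelectors`, and `FIELD_PATHS` are hypothetical names, and the selector shape is copied from the notes, so it may no longer match the live site.

```javascript
// Build the selector for one field of the n-th list item.
function buildItemSelector(n, fieldPath) {
  return `#poduct_list_area > li:nth-child(${n}) > a > div > ${fieldPath}`;
}

// The part of each selector that comes after the shared prefix.
const FIELD_PATHS = {
  star: "div.name > div > span",
  location: "div.name > p:nth-child(4)",
  price: "div.price > p > b",
};

// Selectors for items `from`..`to` (inclusive), one object per item.
function buildSelectors(from, to) {
  const result = [];
  for (let n = from; n <= to; n++) {
    const item = {};
    for (const [field, path] of Object.entries(FIELD_PATHS)) {
      item[field] = buildItemSelector(n, path);
    }
    result.push(item);
  }
  return result;
}

console.log(buildSelectors(2, 3));
```

Inside a crawling function, each generated selector could then be passed to page.$eval in place of the hard-coded strings.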
iframe
- An iframe is a separate page embedded inside the page shown in the browser. (a different core inside the same site)
--> An iframe is a completely different page.
--> The outer shell and the inside are different documents.
- Even if the developer gets a selector with Copy selector, a selector for something inside the iframe doesn't work when run against the outer site.
- EX) If I copied the selector of a $30 product in the market, that data lives inside the market site's iframe.
--> The site being accessed is Naver, but the data has to be pulled out of the iframe.
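One way to reach data inside an iframe with Puppeteer is to find the matching frame first and run the selector on that frame rather than on the page. The sketch below shows only the frame lookup, using a stand-in object so it runs without a browser. `findFrameByUrl`, the URLs, and the selector in the final comment are hypothetical.

```javascript
// Find the first frame whose URL contains `urlPart`, or null if there is none.
// `page` is expected to be a Puppeteer Page; page.frames() lists the main frame and every iframe.
function findFrameByUrl(page, urlPart) {
  return page.frames().find((frame) => frame.url().includes(urlPart)) || null;
}

// Stand-in object so the helper can be tried without launching a browser.
const fakePage = {
  frames: () => [
    { url: () => "https://outer.example.com/" },
    { url: () => "https://outer.example.com/embedded-shop" },
  ],
};

console.log(findFrameByUrl(fakePage, "embedded-shop").url());
// With a real page, the selector is then evaluated on the frame instead of the page:
// const price = await frame.$eval("hypothetical > selector", (el) => el.textContent)
```

This matches the point in the notes: a selector copied from inside the iframe has to be run against the iframe's own document, not the outer site.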
Author And Source
For more on this topic (BE TIL Day 9 0324), see the original post: https://velog.io/@j00b33/BE-TIL-Day-9-0324 Author attribution: the original author's information is included in the original URL, and the copyright belongs to the original author.