[🤗 Course 6.6] Byte-Pair Encoding (BPE) tokenization


Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress text, and was later used by OpenAI for tokenization when pretraining the GPT model. It's used by many Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa.

๐Ÿ’ก ์ด ์„น์…˜์—์„œ๋Š” ์ „์ฒด ๊ตฌํ˜„ ๊ณผ์ •์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ๊นŒ์ง€๋ฅผ ํฌํ•จํ•˜์—ฌ BPE๋ฅผ ์‹ฌ์ธต์ ์œผ๋กœ ๋‹ค๋ฃน๋‹ˆ๋‹ค. ํ† ํฐํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๋Œ€ํ•œ ์ผ๋ฐ˜์ ์ธ ๊ฐœ์š”๋งŒ์„ ์›ํ•˜๋Š” ๊ฒฝ์šฐ ์ด ์žฅ์„ ๊ฑด๋„ˆ๋›ฐ์–ด๋„ ๋ฉ๋‹ˆ๋‹ค.

Training algorithm

BPE ํ•™์Šต์€ ์ •๊ทœํ™” ๋ฐ ์‚ฌ์ „ ํ† ํฐํ™” ๋‹จ๊ณ„๊ฐ€ ์™„๋ฃŒ๋œ ํ›„, ๋ง๋ญ‰์น˜์— ์‚ฌ์šฉ๋œ ๊ณ ์œ ํ•œ ๋‹จ์–ด ์ง‘ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์‹œ์ž‘๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ์ด๋Ÿฌํ•œ ๋‹จ์–ด๋“ค์„ ๊ตฌ์„ฑํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋œ ๋ชจ๋“  ๊ธฐํ˜ธ(๊ธ€์ž)๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ vocabulary๋ฅผ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค. ์•„์ฃผ ๊ฐ„๋‹จํ•œ ์˜ˆ๋กœ์„œ ๋ง๋ญ‰์น˜๊ฐ€ ๋‹ค์Œ ๋‹ค์„ฏ ๋‹จ์–ด๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•ด ๋ด…์‹œ๋‹ค:

"hug", "pug", "pun", "bun", "hugs"

The base vocabulary will then be ["b", "g", "h", "n", "p", "s", "u"]. In practice, the base vocabulary will contain at least all the ASCII characters, and probably some Unicode characters as well. If the text you are tokenizing uses a character that was not in the training corpus, that character will be converted to the "unknown token". This is one reason why many NLP models struggle badly at analyzing content with emojis.

GPT-2 ๋ฐ RoBERTa ํ† ํฌ๋‚˜์ด์ €๋Š” ์ด ๋ฌธ์ œ๋ฅผ ๋งค์šฐ ์˜๋ฆฌํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์–ด๋ฅผ ์œ ๋‹ˆ์ฝ”๋“œ ๋ฌธ์ž๊ฐ€ ์•„๋‹Œ ๋ฐ”์ดํŠธ ๋‹จ์œ„๋กœ ๊ตฌ์„ฑ๋œ ๊ฒƒ์œผ๋กœ ๊ฐ„์ฃผํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์œผ๋กœ ๊ธฐ๋ณธ vocabulary๋Š” ์ž‘์€ ํฌ๊ธฐ(256)๋ฅผ ๊ฐ–์ง€๋งŒ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋“  ๋ฌธ์ž๋“ค์ด ์—ฌ์ „ํžˆ ํฌํ•จ๋  ์ˆ˜ ์žˆ์œผ๋ฉฐ ์•Œ ์ˆ˜ ์—†๋Š” ํ† ํฐ์œผ๋กœ ๋ณ€ํ™˜๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ด ํŠธ๋ฆญ(trick)์„ byte-level BPE ๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

After getting this base vocabulary, we add new tokens until the desired vocabulary size is reached by learning merges, which are rules to combine two elements of the existing vocabulary into a new one. So at the beginning these merges create tokens with two characters, and as training progresses, longer and longer subwords.

ํ† ํฌ๋‚˜์ด์ € ํ•™์Šต ๊ณผ์ •์—์„œ ์–ด๋–ค ๋‹จ๊ณ„์—์„œ๋“  BPE ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ฐ€์žฅ ๋นˆ๋ฒˆํ•˜๊ฒŒ ์ถœํ˜„ํ•˜๋Š” ํ† ํฐ ์Œ์„ ๊ฒ€์ƒ‰ํ•ฉ๋‹ˆ๋‹ค(์—ฌ๊ธฐ์„œ "์Œ"์€ ํ•œ ๋‹จ์–ด์—์„œ ๋‘ ๊ฐœ์˜ ์—ฐ์† ํ† ํฐ์„ ์˜๋ฏธํ•˜๊ณ  ํ† ํฐ์€ ์ฒ˜์Œ์—๋Š” ๋‹จ์ผ ๋ฌธ์ž์ž…๋‹ˆ๋‹ค). ๊ฒ€์ƒ‰๋œ ๊ณ ๋นˆ๋„ ํ† ํฐ ์Œ์ด ๋ณ‘ํ•ฉ๋˜๋ฉฐ ์ด๋Ÿฌํ•œ ๊ณผ์ •์ด ๊ณ„์† ๋ฐ˜๋ณต๋ฉ๋‹ˆ๋‹ค.

์ด์ „ ์˜ˆ์ œ๋กœ ๋Œ์•„๊ฐ€์„œ ๊ฐ ๋‹จ์–ด๋“ค์˜ ์ถœํ˜„๋นˆ๋„๊ฐ€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค๊ณ  ๊ฐ€์ •ํ•ด ๋ด…์‹œ๋‹ค:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

๋ง๋ญ‰์น˜ ๋‚ด์— "hug"๊ฐ€ 10๋ฒˆ, "pug"๊ฐ€ 5๋ฒˆ, "pun"์ด 12๋ฒˆ, "bun"์ด 4๋ฒˆ, "hugs"๊ฐ€ 5๋ฒˆ ์ถœํ˜„ํ•œ๋‹ค๋Š” ์˜๋ฏธ์ž…๋‹ˆ๋‹ค. ๊ฐ ๋‹จ์–ด๋ฅผ ํ† ํฐ์˜ ๋ชฉ๋ก์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ๋„๋ก ๊ฐ ๋‹จ์–ด๋ฅผ ๋ฌธ์ž(์ดˆ๊ธฐ vocabulary๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ๋ฌธ์ž)๋กœ ๋ถ„ํ• ํ•˜์—ฌ ํ•™์Šต์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค:

("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)

๊ทธ๋Ÿฐ ๋‹ค์Œ ๊ฐ ๋ฌธ์ž ์Œ๋“ค์„ ์‚ดํŽด๋ด…์‹œ๋‹ค. ("h", "u")์€ "hug" ๋ฐ "hugs"๋ผ๋Š” ๋‹จ์–ด์— ์กด์žฌํ•˜๋ฏ€๋กœ ๋ง๋ญ‰์น˜์—์„œ ์ด 15๋ฒˆ ์ถœํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ฐ€์žฅ ๋นˆ๋ฒˆํ•œ ์Œ์€ ์•„๋‹™๋‹ˆ๋‹ค. ๊ฐ€์žฅ ๋นˆ๋ฒˆํ•˜๊ฒŒ ์ถœํ˜„ํ•˜๋Š” ์Œ์€ "hug", "pug" ๋ฐ "hugs"์— ์žˆ๋Š” ("u", "g")์ด๋ฉฐ ์ด 20๋ฒˆ ์ถœํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค.

๋”ฐ๋ผ์„œ ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ํ•™์Šตํ•œ ์ฒซ ๋ฒˆ์งธ ๋ณ‘ํ•ฉ ๊ทœ์น™์€ ("u", "g") -> "ug"์ด๋ฉฐ, ์ด๋Š” "ug"๊ฐ€ vocabulary์— ์ถ”๊ฐ€๋˜๊ณ  ์ฝ”ํผ์Šค ๋‚ด์˜ ๋ชจ๋“  ๋‹จ์–ด์—์„œ "u"์™€ "g"๊ฐ€ ๋ณ‘ํ•ฉ๋˜์–ด์•ผ ํ•จ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์ด ๋‹จ๊ณ„๊ฐ€ ๋๋‚˜๋ฉด vocabulary์™€ ๋ง๋ญ‰์น˜๊ฐ€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ณ€๊ฒฝ๋ฉ๋‹ˆ๋‹ค:

Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug"]
Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)

์ด์ œ 2๊ฐœ์˜ ๋ฌธ์ž๋ณด๋‹ค ๋” ๊ธด ํ† ํฐ์ด ์ƒ์„ฑ๋˜๋Š” ๋ช‡ ๊ฐ€์ง€ ์Œ์ด ์ฝ”ํผ์Šค ๋‚ด์— ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ("h", "ug")(๋ง๋ญ‰์น˜์— 15๋ฒˆ ์ถœํ˜„)๊ฐ€ ๊ทธ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด ๋‹จ๊ณ„์—์„œ ๊ฐ€์žฅ ๋นˆ๋ฒˆํ•˜๊ฒŒ ์ถœํ˜„๋œ ์Œ์€ ("u", "n")๋กœ์„œ ๋ง๋ญ‰์น˜์— 16๋ฒˆ ๋‚˜ํƒ€๋‚˜๋ฏ€๋กœ ํ•™์Šต๋œ ๋‘ ๋ฒˆ์งธ ๋ณ‘ํ•ฉ ๊ทœ์น™์€ ("u", "n") -> "un"์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ vocabulary์— ์ถ”๊ฐ€ํ•˜๊ณ  ๊ธฐ์กด์˜ ๋ชจ๋“  ํ•ญ๋ชฉ์„ ๋ณ‘ํ•ฉํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ฉ๋‹ˆ๋‹ค:

Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un"]
Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)

์ด์ œ ๊ฐ€์žฅ ๋นˆ๋ฒˆํ•œ ์Œ์€ ("h", "ug")์ด๋ฏ€๋กœ ๋ณ‘ํ•ฉ ๊ทœ์น™("h", "ug") -> "hug"์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ์ฒ˜์Œ์œผ๋กœ 3๊ธ€์ž๋กœ ๊ตฌ์„ฑ๋œ ํ† ํฐ์ด ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค. ๋ณ‘ํ•ฉ ํ›„ ์ฝ”ํผ์Šค๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]
Corpus: ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)

And we continue like this until we reach the desired vocabulary size.
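Putting the steps together, the three merges learned above can be reproduced with a compact sketch (a simplified version of the full implementation shown later in this section):

```python
from collections import defaultdict

# Toy corpus: word -> frequency, with each word split into characters
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {word: list(word) for word in word_freqs}

merges = []
for _ in range(3):
    # Count every consecutive pair, weighted by word frequency
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        for pair in zip(splits[word], splits[word][1:]):
            pair_freqs[pair] += freq
    # Pick the most frequent pair and record the merge rule
    a, b = max(pair_freqs, key=pair_freqs.get)
    merges.append((a, b))
    # Apply the merge in place to every word's split
    for split in splits.values():
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split[i:i + 2] = [a + b]
            else:
                i += 1

print(merges)          # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(splits["hugs"])  # ['hug', 's']
```

The learned rules match the walkthrough above: first ("u", "g"), then ("u", "n"), then ("h", "ug").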

โœ๏ธ Now your turn! ๋‹ค์Œ ๋ณ‘ํ•ฉ ๊ทœ์น™์€ ๋ฌด์—‡์ผ๊นŒ์š”?

Tokenization algorithm

ํ† ํฐํ™”๋Š” ๋‹ค์Œ ๋‹จ๊ณ„๋ฅผ ์ ์šฉํ•˜์—ฌ ์ƒˆ๋กœ์šด ์ž…๋ ฅ์„ ํ† ํฐํ™”ํ•œ๋‹ค๋Š” ์ ์—์„œ ์•ž์—์„œ ์‚ดํŽด๋ณธ ํ•™์Šต ํ”„๋กœ์„ธ์Šค์™€ ๋ฐ€์ ‘ํ•˜๊ฒŒ ์—ฐ๊ด€๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

  1. Normalization

  2. Pre-tokenization

  3. Splitting the words into individual characters

  4. Applying the merge rules learned, in order, on those splits

์œ„์—์„œ ํ•™์Šต๋œ 3๊ฐ€์ง€ ๋ณ‘ํ•ฉ ๊ทœ์น™์„ ์ ์šฉํ•˜์—ฌ ์˜ˆ๋ฅผ ๋“ค์–ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

("u", "g") -> "ug"
("u", "n") -> "un"
("h", "ug") -> "hug"

"bug"๋ผ๋Š” ๋‹จ์–ด๋Š” ["b", "ug"]๋กœ ํ† ํฐํ™”๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ "mug"๋Š” ๊ธฐ๋ณธ vocabulary์— ๋ฌธ์ž "m"์ด ์—†์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ["[UNK]", "ug"]๋กœ ํ† ํฐํ™”๋ฉ๋‹ˆ๋‹ค. ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ "thug"๋ผ๋Š” ๋‹จ์–ด๋Š” ["[UNK]", "hug"]๋กœ ํ† ํฐํ™”๋ฉ๋‹ˆ๋‹ค. ๋ฌธ์ž "t"๋Š” ๊ธฐ๋ณธ vocabulary์— ์—†์œผ๋ฉฐ ๋ณ‘ํ•ฉ ๊ทœ์น™์„ ์ ์šฉํ•˜๋ฉด ๋จผ์ € "u"์™€ "g"๊ฐ€ ๋ณ‘ํ•ฉ๋œ ๋‹ค์Œ "hu"์™€ "g"๊ฐ€ ๋ณ‘ํ•ฉ๋ฉ๋‹ˆ๋‹ค.

โœ๏ธ Now your turn! "unhug"๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ์–ด๋–ป๊ฒŒ ํ† ํฐํ™”๋ ๊นŒ์š”?

Implementing BPE

์ด์ œ BPE ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ตฌํ˜„ ๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜์—์„œ ์„ค๋ช…ํ•˜๋Š” ์ฝ”๋“œ๋Š” ๋Œ€๊ทœ๋ชจ ๋ง๋ญ‰์น˜์—์„œ ์‹ค์ œ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์ตœ์ ํ™”๋œ ๋ฒ„์ „์ด ์•„๋‹™๋‹ˆ๋‹ค. ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‰ฝ๊ฒŒ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋„๋ก ๊ตฌ์„ฑ๋œ ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค.

๋จผ์ € ๋ง๋ญ‰์น˜๊ฐ€ ํ•„์š”ํ•˜๋ฏ€๋กœ ๋ช‡ ๋ฌธ์žฅ์œผ๋กœ ๊ฐ„๋‹จํ•œ ๋ง๋ญ‰์น˜๋ฅผ ๋งŒ๋“ค์–ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

๋‹ค์Œ์œผ๋กœ, ์œ„ ๋ง๋ญ‰์น˜๋ฅผ ๋‹จ์–ด ๋‹จ์œ„๋กœ ์‚ฌ์ „ ํ† ํฐํ™”(pre-tokenize)ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. GPT-2์—์„œ ์‚ฌ์šฉ๋œ BPE ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๊ตฌํ˜„ํ•˜๊ณ  ์žˆ์œผ๋ฏ€๋กœ ์‚ฌ์ „ ํ† ํฐํ™”(pre-tokenization)์— gpt2 ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

๊ทธ๋Ÿฐ ๋‹ค์Œ ์‚ฌ์ „ ํ† ํฐํ™”๋ฅผ ์ˆ˜ํ–‰ํ•˜๋ฉด์„œ ๋ง๋ญ‰์น˜์— ์žˆ๋Š” ๊ฐ ๋‹จ์–ด์˜ ๋นˆ๋„๋ฅผ ํ•จ๊ป˜ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค:

from collections import defaultdict

word_freqs = defaultdict(int)

for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)

๋‹ค์Œ ๋‹จ๊ณ„๋Š” ๋ง๋ญ‰์น˜์— ์‚ฌ์šฉ๋œ ๋ชจ๋“  ๋ฌธ์ž๋กœ ๊ตฌ์„ฑ๋œ ๊ธฐ๋ณธ vocabulary๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

alphabet = []

for word in word_freqs.keys():
    for letter in word:
        if letter not in alphabet:
            alphabet.append(letter)
alphabet.sort()

print(alphabet)

์ถ”๊ฐ€์ ์œผ๋กœ ํ•ด๋‹น vocabulary์˜ ์‹œ์ž‘ ๋ถ€๋ถ„์— ๋ชจ๋ธ์ด ์‚ฌ์šฉํ•˜๋Š” ํŠน์ˆ˜ ํ† ํฐ์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. GPT-2์˜ ๊ฒฝ์šฐ ์œ ์ผํ•œ ํŠน์ˆ˜ ํ† ํฐ์€ "<|endoftext|>"์ž…๋‹ˆ๋‹ค:

vocab = ["<|endoftext|>"] + alphabet.copy()

์ด์ œ ํ•™์Šต์„ ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ๊ฐ ๋‹จ์–ด๋ฅผ ๊ฐœ๋ณ„ ๋ฌธ์ž๋กœ ๋ถ„ํ• ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

splits = {word: [c for c in word] for word in word_freqs.keys()}

Now that we are ready for training, let's write a function that computes the frequency of each pair. We will need to use this at every step of the training:

def compute_pair_freqs(splits):
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i+1])
            pair_freqs[pair] += freq
    return pair_freqs

Let's have a look at part of this dictionary after the initial splits:

pair_freqs = compute_pair_freqs(splits)

for i, key in enumerate(pair_freqs.keys()):
    print(f"{key}: {pair_freqs[key]}")
    if i > 5:
        break

Now, finding the most frequent pair only takes a quick loop:

best_pair = ""
max_freq = None

for pair, freq in pair_freqs.items():
    if max_freq is None or max_freq < freq:
        best_pair = pair
        max_freq = freq

print(best_pair, max_freq)

๋”ฐ๋ผ์„œ ํ•™์Šตํ•  ์ฒซ ๋ฒˆ์งธ ๋ณ‘ํ•ฉ์€ ('ฤ ', 't') -> 'ฤ t'์ด๊ณ  vocabulary์— 'ฤ t'๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค:

merges = {("ฤ ", "t"): "ฤ t"}
vocab.append("ฤ t")

๊ณ„์†ํ•˜๋ ค๋ฉด splits ๋”•์…”๋„ˆ๋ฆฌ์— ํ•ด๋‹น ๋ณ‘ํ•ฉ์„ ์ ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋‹ค๋ฅธ ํ•จ์ˆ˜๋ฅผ ์ž‘์„ฑํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
            
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [a + b] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

์ด์ œ ์ฒซ๋ฒˆ์งธ ๋ณ‘ํ•ฉ์˜ ๊ฒฐ๊ณผ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

splits = merge_pair("ฤ ", "t", splits)
print(splits["ฤ trained"])

์ด์ œ ์›ํ•˜๋Š” ๋ชจ๋“  ๋ณ‘ํ•ฉ์„ ํ•™์Šตํ•  ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณตํ•˜๋Š” ๋ชจ๋“ˆ์„ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Vocabulary์˜ ํฌ๊ธฐ๋ฅผ 50์œผ๋กœ ์ง€์ •ํ•ด๋ด…์‹œ๋‹ค:

vocab_size = 50

while len(vocab) < vocab_size:
    pair_freqs = compute_pair_freqs(splits)
    best_pair = ""
    max_freq = None
    for pair, freq in pair_freqs.items():
        if max_freq is None or max_freq < freq:
            best_pair = pair
            max_freq = freq
    splits = merge_pair(*best_pair, splits)
    merges[best_pair] = best_pair[0] + best_pair[1]
    vocab.append(best_pair[0] + best_pair[1])

๊ฒฐ๊ณผ์ ์œผ๋กœ 19๊ฐ€์ง€ ๋ณ‘ํ•ฉ ๊ทœ์น™์„ ํ•™์Šตํ–ˆ์Šต๋‹ˆ๋‹ค(์ดˆ๊ธฐ vocabulary์˜ ํฌ๊ธฐ๋Š” ์•ŒํŒŒ๋ฒณ 31 - 30์ž, ํŠน์ˆ˜ ํ† ํฐ ํฌํ•จ):

print(merges)

And the vocabulary is composed of the special token, the initial alphabet, and all the results of the merges:

print(vocab)

๐Ÿ’ก ๋™์ผํ•œ ๋ง๋ญ‰์น˜์—์„œ train_new_from_iterator()๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋˜‘๊ฐ™์€ vocabulary๊ฐ€ ๋„์ถœ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๊ฐ€์žฅ ๋นˆ๋ฒˆํ•˜๊ฒŒ ์ถœํ˜„ํ•œ ์Œ์„ ์„ ํƒํ• ๋•Œ ๊ฐ€์žฅ ๋จผ์ € ๋งˆ์ฃผ์น˜๋Š” ์Œ์„ ์„ ํƒํ•˜๋Š” ๋ฐ˜๋ฉด์—, ๐Ÿค—Tokenizers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ๋‚ด๋ถ€ IDs๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ฒซ ๋ฒˆ์งธ ์Œ์„ ์„ ํƒํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

์ƒˆ๋กœ์šด ํ…์ŠคํŠธ๋ฅผ ํ† ํฐํ™”ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์šฐ์„  ์‚ฌ์ „ ํ† ํฐํ™”(pre-tokenize)ํ•˜๊ณ  ๋ถ„ํ• (split)ํ•œ ๋‹ค์Œ ํ•™์Šตํ•œ ๋ชจ๋“  ๋ณ‘ํ•ฉ ๊ทœ์น™(merge rules)์„ ์ ์šฉํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค:

def tokenize(text):
    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    splits = [[l for l in word] for word in pre_tokenized_text]
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2 :]
                else:
                    i += 1
            splits[idx] = split
    
    return sum(splits, [])

์•ŒํŒŒ๋ฒณ ๋ฌธ์ž๋กœ ๊ตฌ์„ฑ๋œ ๋ชจ๋“  ํ…์ŠคํŠธ๋ฅผ ํ† ํฐํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

tokenize("This is not a token.")

โš ๏ธ ์˜ˆ์™ธ ์ฒ˜๋ฆฌ๋ฅผ ํ•˜์ง€ ์•Š์•˜๊ธฐ ๋•Œ๋ฌธ์— ์•Œ ์ˆ˜ ์—†๋Š” ๋ฌธ์ž(unknown character)๊ฐ€ ์žˆ์œผ๋ฉด ๊ตฌํ˜„์—์„œ ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. GPT-2์—๋Š” ์‹ค์ œ๋กœ ์•Œ ์ˆ˜ ์—†๋Š” ํ† ํฐ์ด ์—†์ง€๋งŒ(๋ฐ”์ดํŠธ ์ˆ˜์ค€ BPE๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ ์•Œ ์ˆ˜ ์—†๋Š” ๋ฌธ์ž๋ฅผ ์–ป๋Š” ๊ฒƒ์€ ๋ถˆ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค), ์—ฌ๊ธฐ์„œ๋Š” ์ดˆ๊ธฐ vocabulary์— ๊ฐ€๋Šฅํ•œ ๋ชจ๋“  ๋ฐ”์ดํŠธ๋ฅผ ํฌํ•จํ•˜์ง€ ์•Š์•˜๊ธฐ ๋•Œ๋ฌธ์— ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ถ€๋ถ„์€ ์ด ์„น์…˜์˜ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋ฏ€๋กœ ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ์ƒ๋žตํ–ˆ์Šต๋‹ˆ๋‹ค.

BPE ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๋Œ€ํ•œ ๋‚ด์šฉ์ด ๋๋‚ฌ์Šต๋‹ˆ๋‹ค! ๋‹ค์Œ์œผ๋กœ WordPiece๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

์ข‹์€ ์›นํŽ˜์ด์ง€ ์ฆ๊ฒจ์ฐพ๊ธฐ