[🤗 Course 6.8] Unigram Tokenization

Unigram ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ AlBERT, T5, mBART, Big Bird ๋ฐ XLNet๊ณผ ๊ฐ™์€ ๋ชจ๋ธ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ํ† ํฐํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ SentencePiece์—์„œ ์ž์ฃผ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ์ด ์„น์…˜์—์„œ๋Š” ์ „์ฒด ๊ตฌํ˜„์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ์„ ํฌํ•จํ•˜์—ฌ Unigram์„ ๊นŠ์ด ์žˆ๊ฒŒ ๋‹ค๋ฃน๋‹ˆ๋‹ค. ํ† ํฐํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๋Œ€ํ•œ ์ผ๋ฐ˜์ ์ธ ๊ฐœ์š”๋ฅผ ์›ํ•˜๋Š” ๊ฒฝ์šฐ ์ƒ๋žตํ•ด๋„ ์ข‹์Šต๋‹ˆ๋‹ค.

Training algorithm

BPE ๋ฐ WordPiece์™€ ๋น„๊ตํ•˜์—ฌ Unigram์€ ๋‹ค๋ฅธ ๋ฐฉํ–ฅ์œผ๋กœ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ํฌ๊ธฐ๊ฐ€ ํฐ vocabulary์—์„œ ์‹œ์ž‘ํ•˜์—ฌ ์›ํ•˜๋Š” vocabulary ํฌ๊ธฐ์— ๋„๋‹ฌํ•  ๋•Œ๊นŒ์ง€ ํ† ํฐ์„ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ vocabulary๋ฅผ ๊ตฌ์ถ•ํ•˜๋Š”๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋ช‡ ๊ฐ€์ง€ ์˜ต์…˜์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์‚ฌ์ „ ํ† ํฐํ™”๋œ ๋‹จ์–ด์—์„œ ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ธ ๋ถ€๋ถ„ ๋ฌธ์ž์—ด์„ ์ทจํ•˜๊ฑฐ๋‚˜ ํฐ ๊ทœ๋ชจ์˜ vocabulary๋ฅผ ๊ฐ€์ง„ ์ดˆ๊ธฐ ๋ง๋ญ‰์น˜์— BPE๋ฅผ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•™์Šต์˜ ๊ฐ ๋‹จ๊ณ„์—์„œ Unigram ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ํ˜„์žฌ vocabulary๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ์˜ ๋ง๋ญ‰์น˜์— ๋Œ€ํ•œ ์†์‹ค(loss)์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ, vocabulary์˜ ๊ฐ ๊ธฐํ˜ธ(symbol)์— ๋Œ€ํ•ด, ํ•ด๋‹น ๊ธฐํ˜ธ๊ฐ€ ์ œ๊ฑฐ๋˜๋ฉด ์ „์ฒด ์†์‹ค์ด ์–ผ๋งˆ๋‚˜ ์ฆ๊ฐ€ํ• ์ง€ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฐ€์žฅ ์ ๊ฒŒ ์ฆ๊ฐ€ํ•˜๋Š” ๊ธฐํ˜ธ๋ฅผ ์ฐพ์Šต๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ์ฐพ์€ ๊ธฐํ˜ธ๋“ค(symbols)์€ ๋ง๋ญ‰์น˜์— ๋Œ€ํ•œ ์ „์ฒด ์†์‹ค์— ๋” ์ ์€ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋ฏ€๋กœ ์–ด๋–ค ์˜๋ฏธ์—์„œ๋Š” "๋œ ํ•„์š”(less needed)"ํ•˜๊ณ  ์ œ๊ฑฐ ๋Œ€์ƒ์œผ๋กœ ๊ฐ€์žฅ ์ ํ•ฉํ•œ ํ›„๋ณด์ž…๋‹ˆ๋‹ค.

์ด๊ฒƒ์€ ๋น„์šฉ์ด ๋งŽ์ด ๋“œ๋Š” ์ž‘์—…์ด๋ฏ€๋กœ ๊ฐ€์žฅ ๋‚ฎ์€ ์†์‹ค์„ ์ดˆ๋ž˜ํ•˜๋Š” ๊ธฐํ˜ธ ํ•˜๋‚˜๋งŒ ์ œ๊ฑฐํ•˜์ง€ ์•Š๊ณ , ์ด๋Ÿฌํ•œ ๊ธฐํ˜ธ๋“ค์˜ pp%(pp๋Š” ์ œ์–ดํ•  ์ˆ˜ ์žˆ๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์ด๊ณ  ์ผ๋ฐ˜์ ์œผ๋กœ 10 ๋˜๋Š” 20์„ ์ง€์ •)๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์€ vocabulary๊ฐ€ ์›ํ•˜๋Š” ํฌ๊ธฐ์— ๋„๋‹ฌํ•  ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณต๋ฉ๋‹ˆ๋‹ค.

Note also that we never remove the base characters, to make sure any word can be tokenized.

์•„์ง๋„ ์„ค๋ช…์ด ์•ฝ๊ฐ„ ๋ชจํ˜ธํ•œ ๋ถ€๋ถ„์ด ์žˆ์ง€์š”. ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ•ต์‹ฌ์€ ๋ง๋ญ‰์น˜์— ๋Œ€ํ•œ ์†์‹ค์„ ๊ณ„์‚ฐํ•˜๊ณ  vocabulary์—์„œ ์ผ๋ถ€ ํ† ํฐ์„ ์ œ๊ฑฐํ•  ๋•Œ ์†์‹ค์ด ์–ด๋–ป๊ฒŒ ๋ณ€ํ•˜๋Š”์ง€ ํ™•์ธํ•˜๋Š” ๊ฒƒ์ด์ง€๋งŒ ์•„์ง ์–ด๋–ป๊ฒŒ ์ˆ˜ํ–‰ํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ๋‚ด์šฉ์„ ์„ค๋ช…ํ•˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค. ์ด ๋ถ€๋ถ„์€ Unigram ๋ชจ๋ธ์˜ ํ† ํฐํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๊ธฐ๋ฐ˜ํ•˜๋ฏ€๋กœ ๋‹ค์Œ์— ์ž์„ธํžˆ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

์—ญ์‹œ ์ด์ „ ์˜ˆ์ œ์˜ ๋ง๋ญ‰์น˜๋ฅผ ์žฌ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

The initial vocabulary is composed of all the strict substrings of the words in the corpus above:

["h", "u", "g", "hu", "ug", "p", "pu", "n", "un", "b", "bu", "s", "hug", "gs", "ugs"]

Tokenization algorithm

Unigram ๋ชจ๋ธ์€ ๊ฐœ๋ณ„ ํ† ํฐ๋“ค์˜ ์ถœํ˜„ ๋ถ„ํฌ๊ฐ€ ์„œ๋กœ ๋…๋ฆฝ์ (i.i.d)์ด๋ผ๋Š” ๊ฐ€์ •์„ ํ•˜๋Š” ์–ธ์–ด ๋ชจ๋ธ ์œ ํ˜•์ž…๋‹ˆ๋‹ค. ํ† ํฐ X์˜ ํ™•๋ฅ ์ด ๋ฌธ๋งฅ์— ์ƒ๊ด€์—†์ด ๋™์ผํ•˜๋‹ค๋Š” ์ ์—์„œ ๊ฐ€์žฅ ๋‹จ์ˆœํ•œ ์–ธ์–ด ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ Unigram ์–ธ์–ด ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ…์ŠคํŠธ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๊ฒฝ์šฐ ํ•ญ์ƒ ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ด๊ณ  ํ”ํ•œ(common) ํ† ํฐ์„ ๋„์ถœํ•ฉ๋‹ˆ๋‹ค.

ํŠน์ • ํ† ํฐ์˜ ํ™•๋ฅ ์€ ๋ง๋ญ‰์น˜ ๋‚ด์—์„œ์˜ ํ•ด๋‹น ํ† ํฐ ์ถœํ˜„ ๋นˆ๋„๋ฅผ vocabulary์— ์กด์žฌํ•˜๋Š” ๋ชจ๋“  ํ† ํฐ๋“ค์˜ ์ถœํ˜„ ๋นˆ๋„์˜ ํ•ฉ์œผ๋กœ ๋‚˜๋ˆˆ ๊ฒƒ์ž…๋‹ˆ๋‹ค(ํ™•๋ฅ ์˜ ํ•ฉ์ด 1์ด ๋˜๋„๋ก ํ•˜๊ธฐ ์œ„ํ•ด). ์˜ˆ๋ฅผ ๋“ค์–ด "ug"๋Š” "hug", "pug" ๋ฐ "hugs"์— ์žˆ์œผ๋ฏ€๋กœ ๋ง๋ญ‰์น˜์—์„œ์˜ ๋นˆ๋„๋Š” 20์ž…๋‹ˆ๋‹ค.

๋‹ค์Œ์€ vocabulary์—์„œ์˜ ๋ชจ๋“  ํ•˜์œ„ ๋‹จ์–ด(subwords)์˜ ๋นˆ๋„์ž…๋‹ˆ๋‹ค:

("h", 15) ("u", 36) ("g", 20) ("hu", 15) ("ug", 20) ("p", 17) ("pu", 17) ("n", 16)
("un", 16) ("b", 4) ("bu", 4) ("s", 5) ("hug", 15) ("gs", 5) ("ugs", 5)

๋”ฐ๋ผ์„œ ๋ชจ๋“  ๋นˆ๋„์˜ ํ•ฉ์€ 210์ด๊ณ  ํ•˜์œ„ ๋‹จ์–ด(subword) "ug"์˜ ํ™•๋ฅ ์€ 20/210์ž…๋‹ˆ๋‹ค.

โœ๏ธ Now your turn! ์œ„์˜ ๋นˆ๋„๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๊ณ  ํ‘œ์‹œ๋œ ๊ฒฐ๊ณผ์™€ ์ดํ•ฉ์ด ์˜ฌ๋ฐ”๋ฅธ์ง€ ๋‹ค์‹œ ํ™•์ธํ•ด๋ณด์„ธ์š”.

Now, to tokenize a given word, we look at all the possible segmentations into tokens and compute the probability of each according to the Unigram model. Since all tokens are considered independent, this probability is just the product of the probabilities of the individual tokens. For instance, the probability of ["p", "u", "g"], one tokenization of "pug", is computed as:

P(["p","u","g"])=P("p")ร—P("u")ร—P("g")=5210ร—36210ร—2210=0.000389P(["p", "u", "g"]) = P("p") \times P("u") \times P("g") = \frac{5}{210} \times \frac{36}{210} \times \frac{2}{210} = 0.000389

์ด์— ๋น„ํ•ด ํ† ํฐํ™” ๊ฒฐ๊ณผ์ธ ["pu", "g"]์˜ ํ™•๋ฅ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

P(["pu","g"])=P("pu")ร—P("g")=5210ร—20210=0.0022676P(["pu", "g"]) = P("pu") \times P("g") = \frac{5}{210} \times \frac{20}{210} = 0.0022676

๋”ฐ๋ผ์„œ, ["pu", "g"]์ด ํ›จ์”ฌ ๋” ์ž์ฃผ ์ถœํ˜„ํ•œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๊ฒ ์ง€์š”. ์ผ๋ฐ˜์ ์œผ๋กœ ๊ฐ€์žฅ ์ ์€ ์ˆ˜์˜ ํ•˜์œ„ ํ† ํฐ๋“ค๋กœ ๊ตฌ์„ฑ๋œ ํ† ํฐํ™” ๊ฒฐ๊ณผ๋Š” ๋น„๊ต์  ๋†’์€ ํ™•๋ฅ (๊ฐ ํ† ํฐ์— ๋Œ€ํ•ด ๋ฐ˜๋ณต๋˜๋Š” 210์œผ๋กœ ๋‚˜๋ˆ„๊ธฐ ๋•Œ๋ฌธ์—)์„ ๊ฐ€์ง€๋ฉฐ, ์ด๋Š” ์šฐ๋ฆฌ๊ฐ€ ์ง๊ด€์ ์œผ๋กœ ์›ํ•˜๋Š” ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค.

Unigram ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ ๋‹จ์–ด์˜ ํ† ํฐํ™”๋Š” ๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ์„ ๋‚˜ํƒ€๋‚ด๋Š” ๋ถ„ํ•  ํ˜•ํƒœ๋กœ ํ† ํฐํ™”๋ฉ๋‹ˆ๋‹ค. "pug"์˜ ์˜ˆ์—์„œ ๊ฐ€๋Šฅํ•œ ๊ฐ ๋ถ„ํ• ์— ๋Œ€ํ•ด ์–ป์„ ์ˆ˜ ์žˆ๋Š” ํ™•๋ฅ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

["p", "u", "g"] : 0.000389
["p", "ug"] : 0.0022676
["pu", "g"] : 0.0022676

๋”ฐ๋ผ์„œ "pug"๋Š” ์œ„ ๋ถ„ํ•  ๋ฐฉ๋ฒ• ์ค‘์—์„œ ["p", "ug"] ๋˜๋Š” ["pu", "g"]๋กœ ํ† ํฐํ™”๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ํฐ ๊ทœ๋ชจ์˜ ๋ง๋ญ‰์น˜์—์„œ๋Š” ๋ถ„ํ•  ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ํ™•๋ฅ ๊ฐ’์ด ๊ฐ™์€ ๊ฒฝ์šฐ๊ฐ€ ๋งค์šฐ ๋“œ๋ญ…๋‹ˆ๋‹ค.

์œ„์˜ ๊ฒฝ์šฐ์—์„œ๋Š” ๊ฐ€๋Šฅํ•œ ๋ชจ๋“  ๋ถ„ํ• ์„ ์ฐพ๊ณ  ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์‰ฌ์› ์ง€๋งŒ, ์ผ๋ฐ˜์ ์œผ๋กœ๋Š” ๋” ์–ด๋ ค์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์‚ฌ์šฉ๋˜๋Š” ๊ณ ์ „์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ Viterbi ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋ณธ์งˆ์ ์œผ๋กœ, ์ฃผ์–ด์ง„ ๋‹จ์–ด์— ๋Œ€ํ•œ ๊ฐ€๋Šฅํ•œ ๋ชจ๋“  ๋ถ„ํ• ๋“ค์„ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋Š” ๊ทธ๋ž˜ํ”„๋ฅผ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋งŒ์ผ ์ฃผ์–ด์ง„ ๋‹จ์–ด ๋‚ด์˜ ๋ฌธ์ž a์—์„œ b๊นŒ์ง€์˜ ํ•˜์œ„ ๋‹จ์–ด(subword)๊ฐ€ vocabulary์— ์กด์žฌํ•œ๋‹ค๋ฉด, ์šฐ๋ฆฌ๋Š” ์ด ๊ทธ๋ž˜ํ”„ ๋‚ด์—์„œ a์—์„œ ์ถœ๋ฐœํ•˜์—ฌ b๊นŒ์ง€ ๊ฐ€๋Š” ๊ทธ๋ž˜ํ”„ ๋‚ด์˜ ๊ฐ€์ง€(branch)๊ฐ€ ์žˆ๋‹ค๊ณ  ๋งํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด ํ•˜์œ„ ๋‹จ์–ด์˜ ํ™•๋ฅ ์„ ํ•ด๋‹น ๊ฐ€์ง€(branch)์— ์ง€์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ทธ๋ž˜ํ”„์—์„œ ์ตœ๊ณ  ์ ์ˆ˜๋ฅผ ์–ป์„ ๊ฒฝ๋กœ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด Viterbi ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹จ์–ด ๋‚ด์˜ ๊ฐ ์œ„์น˜(๋ฌธ์ž)์— ๋Œ€ํ•ด ํ•ด๋‹น ์œ„์น˜์—์„œ ๋๋‚˜๋Š” ๊ฒฝ๋กœ์˜ ์ตœ๊ณ  ์ ์ˆ˜๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋ถ„ํ• (segmentation)์„ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์–ด์˜ ์ฒ˜์Œ ์œ„์น˜๋ถ€ํ„ฐ ๋๊นŒ์ง€ ์ด๋™ํ•˜๋ฉด์„œ, ํ˜„์žฌ ์œ„์น˜์—์„œ ๋๋‚˜๋Š” ๋ชจ๋“  ํ•˜์œ„ ๋‹จ์–ด๋ฅผ ๊ฒ€์‚ฌํ•œ ๋‹ค์Œ ์ด ํ•˜์œ„ ๋‹จ์–ด๊ฐ€ ์‹œ์ž‘ํ•˜๋Š” ์œ„์น˜์—์„œ ์ตœ๊ณ ์˜ ํ† ํฐํ™” ์ ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ตœ์ƒ์˜ ์ ์ˆ˜๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ๋์— ๋„๋‹ฌํ•˜๊ธฐ ์œ„ํ•ด ์„ ํƒํ•œ ๊ฒฝ๋กœ๋ฅผ ํŽผ์น˜๊ธฐ๋งŒ ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

์•ž์—์„œ ๊ตฌ์„ฑํ•œ vocabulary์™€ "unhug"๋ผ๋Š” ๋‹จ์–ด๋ฅผ ์‚ฌ์šฉํ•œ ์˜ˆ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ํ•ด๋‹น ๋‹จ์–ด์˜ ๊ฐ ์œ„์น˜์— ๋Œ€ํ•ด ์ตœ๊ณ  ์ ์ˆ˜๋กœ ๋๋‚˜๋Š” ํ•˜์œ„ ๋‹จ์–ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

Character 0 (u): "u" (score 0.171429)
Character 1 (n): "un" (score 0.076191)
Character 2 (h): "un" "h" (score 0.005442)
Character 3 (u): "un" "hu" (score 0.005442)
Character 4 (g): "un" "hug" (score 0.005442)

์œ„์—์„œ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋งˆ์ง€๋ง‰ ๊ธ€์ž('g')๊นŒ์ง€ ์ง„ํ–‰ํ•œ ๊ฒฐ๊ณผ, ["un", "hug"]๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ์ ์ˆ˜์ธ 0.005442๋ฅผ ๋‚˜ํƒ€๋ƒˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ "unhug"๋Š” ["un", "hug"]๋กœ ํ† ํฐํ™”๋ฉ๋‹ˆ๋‹ค.

โœ๏ธ Now your turn! "huggun"์ด๋ผ๋Š” ๋‹จ์–ด๋ฅผ ํ† ํฐํ™”ํ•˜๊ณ  ํ•ด๋‹น ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•ด๋ณด์„ธ์š”.

Back to training

์ด์ œ ํ† ํฐํ™”๊ฐ€ ์–ด๋–ป๊ฒŒ ์ž‘๋™ํ•˜๋Š”์ง€ ๋ณด์•˜์œผ๋ฏ€๋กœ ํ•™์Šต ๊ณผ์ •์—์„œ ์‚ฌ์šฉ๋œ ์†์‹ค(loss)์— ๋Œ€ํ•ด ์กฐ๊ธˆ ๋” ์ž์„ธํžˆ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ฐ๊ฐ์˜ ์ฃผ์–ด์ง„ ๋‹จ๊ณ„์—์„œ ์ด ์†์‹ค์€ ๋ง๋ญ‰์น˜ ๋‚ด์˜ ๋ชจ๋“  ๋‹จ์–ด๋ฅผ ํ† ํฐํ™”ํ•˜์—ฌ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค. ๊ณ„์‚ฐ ๊ณผ์ •์—์„œ ์•ž์—์„œ ์„ค๋ช…ํ•œ ๊ฒƒ์ฒ˜๋Ÿผ ํ˜„์žฌ vocabulary์™€ ๋ง๋ญ‰์น˜์— ์žˆ๋Š” ๊ฐ ํ† ํฐ์˜ ๋นˆ๋„์— ์˜ํ•ด ๊ฒฐ์ •๋œ ์œ ๋‹ˆ๊ทธ๋žจ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

๋ง๋ญ‰์น˜์˜ ๊ฐ ๋‹จ์–ด๋ณ„๋กœ ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋ฉฐ ์†์‹ค์€ ํ•ด๋‹น ์ ์ˆ˜์˜ ์Œ์˜ ๋กœ๊ทธ ์šฐ๋„(negative log likelihood)์ž…๋‹ˆ๋‹ค. ์ฆ‰, ๋ง๋ญ‰์น˜์— ์žˆ๋Š” ๋ชจ๋“  ๋‹จ์–ด์˜ โˆ’log(P(word))-log(P(word)) ํ•ฉ๊ณ„์ž…๋‹ˆ๋‹ค.

์œ„์—์„œ ์„ค๋ช…ํ•œ ๋ง๋ญ‰์น˜๋ฅผ ๊ฐ€์ง€๊ณ  ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

๊ฐ ๋‹จ์–ด์˜ ํ† ํฐํ™” ๊ฒฐ๊ณผ ๋ฐ ์ ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

"hug": ["hug"] (score 0.071428)
"pug": ["pu", "g"] (score 0.007710)
"pun": ["pu", "n"] (score 0.006168)
"bun": ["bu", "n"] (score 0.001451)
"hugs": ["hug", "s"] (score 0.001701)

๋”ฐ๋ผ์„œ ์†์‹ค(loss)์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

10 * (-log(0.071428)) + 5 * (-log(0.007710)) + 12 * (-log(0.006168)) + 4 * (-log(0.001451)) + 5 * (-log(0.001701)) = 169.8
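 
This figure is easy to double-check in a few lines (natural logarithms, as everywhere in this section; word_scores is our own name):

from math import log

word_scores = {"hug": (10, 0.071428), "pug": (5, 0.007710), "pun": (12, 0.006168),
               "bun": (4, 0.001451), "hugs": (5, 0.001701)}
loss = sum(freq * -log(score) for freq, score in word_scores.values())
print(loss)  # 169.8...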

์ด์ œ ๊ฐ ํ† ํฐ์„ ์ œ๊ฑฐํ•˜๋Š” ๊ฒƒ์ด ์†์‹ค๊ฐ’์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ๊ณ„์‚ฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ์ž‘์—…์€ ์ˆ˜์ž‘์—…์œผ๋กœ ํ•˜๊ธฐ์—๋Š” ์‹œ๊ฐ„์ด ๊ฑธ๋ฆฌ๋ฏ€๋กœ ์—ฌ๊ธฐ์—์„œ๋Š” ๋‘ ๊ฐœ์˜ ํ† ํฐ("pu", "hug")์— ๋Œ€ํ•ด ์ด๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ๋‚˜๋จธ์ง€ ํ”„๋กœ์„ธ์Šค๋Š” ์•„๋ž˜์—์„œ ํ•ด๋‹น ์ž‘์—…์— ๋Œ€ํ•œ ์‹ค์ œ ๊ตฌํ˜„์ด ์™„๋ฃŒ๋˜์—ˆ์„ ๋•Œ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์œ„์—์„œ ์‚ดํŽด๋ณด์•˜๋“ฏ์ด, ์ด ์‹œ์ ์—์„œ "pug"๋Š” ๋™์ผํ•œ ์ ์ˆ˜(0.0022676)๋ฅผ ๊ฐ€์ง„ ๋‘๊ฐœ์˜ ํ† ํฐํ™” ๊ฒฐ๊ณผ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์—ˆ๋‹ค๋Š” ๊ฒƒ์„ ๊ธฐ์–ตํ•˜์‹œ์ง€์š”? ๋ฐ”๋กœ ["p", "ug"]์™€ ["pu", "g"]๊ฐ€ ๊ทธ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ "pu" ํ† ํฐ์„ vocabulary์—์„œ ์ œ๊ฑฐํ•˜๋”๋ผ๋„ ํ† ํฐํ™” ๊ฒฐ๊ณผ๊ฐ€ ๋™์ผํ•œ ์ ์ˆ˜๋ฅผ ๊ฐ€์ง„ ["p", "ug"]๊ฐ€ ๋˜๋ฏ€๋กœ, ๊ฒฐ๋ก ์ ์œผ๋กœ๋Š” ์œ„์—์„œ ๊ณ„์‚ฐํ•œ ๊ฒƒ๊ณผ ๋˜‘๊ฐ™์€ ์†์‹ค(loss)๊ฐ’์ด ๋„์ถœ๋˜๊ฒ ์ง€์š”.

๋ฐ˜๋ฉด์—, "hug"๋ฅผ ์ œ๊ฑฐํ•˜๋ฉด ์†์‹ค๊ฐ’์ด ๋” ์˜ฌ๋ผ๊ฐ€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๊ทธ ์ด์œ ๋Š” "hug"์™€ "hugs"์˜ ํ† ํฐํ™” ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๊ฒ ์ง€์š”:

"hug": ["hu", "g"] (score 0.006802)   # ์œ„์˜ ํ† ํฐํ™” ๊ฒฐ๊ณผ๋ณด๋‹ค ์ ์ˆ˜๊ฐ€ ๋‚ฎ์•„์ง.
"hugs": ["hu", "gs"] (score 0.001701)

As a result, the loss rises by:

- 10 * (-log(0.071428)) + 10 * (-log(0.006802)) = 23.5
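
Or, as a quick check in code: the 10 occurrences of "hug" now cost -log(0.006802) each instead of -log(0.071428):

from math import log

delta = 10 * (-log(0.006802)) - 10 * (-log(0.071428))
print(delta)  # 23.5...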

๊ฒฐ๋ก ์ ์œผ๋กœ, ํ† ํฐ "pu"๋Š” vocabulary์—์„œ ์ œ๊ฑฐ๋˜๊ฒ ์ง€๋งŒ "hug"๋Š” ์ œ๊ฑฐ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

Implementing Unigram

Now let's implement everything we've seen so far in code. As with BPE and WordPiece, this is not an efficient implementation of the Unigram algorithm, but it should help you understand the full algorithm a little better.

์ด์ „๊ณผ ๋™์ผํ•œ ๋ง๋ญ‰์น˜๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

์ด๋ฒˆ์—๋Š” xlnet-base-cased๋ฅผ ๋ชจ๋ธ๋กœ ์‚ฌ์šฉํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

BPE ๋ฐ WordPiece์˜ ๊ฒฝ์šฐ์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ง๋ญ‰์น˜์—์„œ ๊ฐ ๋‹จ์–ด์˜ ์ถœํ˜„ ๋นˆ๋„๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค:

from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

word_freqs

Then we need to initialize our vocabulary to something larger than the vocabulary size we want in the end. It has to include all the basic characters (otherwise we won't be able to tokenize every word), but for the longer substrings we'll only keep the most frequent ones, so we start by sorting them by frequency:

char_freqs = defaultdict(int)
subwords_freqs = defaultdict(int)
for word, freq in word_freqs.items():
    for i in range(len(word)):
        char_freqs[word[i]] += freq
        # ๊ธธ์ด๊ฐ€ ์ ์–ด๋„ 2 ์ด์ƒ์ธ subword๋“ค์„ ์ถ”๊ฐ€ํ•จ.
        for j in range(i + 2, len(word) + 1):
            subwords_freqs[word[i:j]] += freq
            
# Sort subwords by frequency, in descending order
sorted_subwords = sorted(subwords_freqs.items(), key=lambda x: x[1], reverse=True)
sorted_subwords[:10]

ํฌ๊ธฐ๊ฐ€ 300์ธ ์ดˆ๊ธฐ vocabulary๋ฅผ ๊ตฌ์„ฑํ•˜๊ธฐ ์œ„ํ•ด์„œ ์•ž์—์„œ ๋งŒ๋“ค์–ด์ง„ sorted_subwords ์ค‘์—์„œ ๋นˆ๋„๊ฐ€ ๋†’์€ ํ•˜์œ„ ๋‹จ์–ด๋“ค์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค:

token_freqs = list(char_freqs.items()) + sorted_subwords[: 300 - len(char_freqs)]
token_freqs = {token: freq for token, freq in token_freqs}

💡 SentencePiece uses a more efficient algorithm called the Enhanced Suffix Array (ESA) to create the initial vocabulary.

๋‹ค์Œ์œผ๋กœ ๋ชจ๋“  ๋นˆ๋„์˜ ํ•ฉ์„ ๊ณ„์‚ฐํ•˜์—ฌ ๋นˆ๋„๋ฅผ ํ™•๋ฅ ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ํ™•๋ฅ ์˜ ๋กœ๊ทธ๊ฐ’์„ ์ €์žฅํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ž‘์€ ์ˆซ์ž๋ฅผ ๊ณฑํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ๋กœ๊ทธ๋ฅผ ๋”ํ•˜๋Š” ๊ฒƒ์ด ์ˆ˜์น˜์ ์œผ๋กœ ๋” ์•ˆ์ •์ ์ด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๋ชจ๋ธ ์†์‹ค ๊ณ„์‚ฐ์ด ๋‹จ์ˆœํ™”๋ฉ๋‹ˆ๋‹ค:

from math import log

total_sum = sum([freq for token, freq in token_freqs.items()])
model = {token: -log(freq / total_sum) for token, freq in token_freqs.items()}

Now the main function is the one that tokenizes a word using the Viterbi algorithm. As we saw before, the algorithm computes the best segmentation of each substring of the word, which we'll store in a variable named best_segmentations. We store one dictionary per position in the word (from 0 to its total length), with two keys: the start index of the last token in the best segmentation ending at that position, and the score of that segmentation. With the start index of the last token, we can retrieve the full segmentation once the list is completely populated.

๋ชฉ๋ก ์ฑ„์šฐ๊ธฐ๋Š” ๋‹จ 2๊ฐœ์˜ ๋ฃจํ”„๋กœ ์™„๋ฃŒ๋ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ ๋ฃจํ”„๋Š” ๊ฐ ์‹œ์ž‘ ์œ„์น˜๋กœ ์ด๋™ํ•˜๊ณ  ๋‘ ๋ฒˆ์งธ ๋ฃจํ”„๋Š” ํ•ด๋‹น ์‹œ์ž‘ ์œ„์น˜์—์„œ ์‹œ์ž‘ํ•˜๋Š” ๋ชจ๋“  ํ•˜์œ„ ๋ฌธ์ž์—ด์„ ๊ฒ€ํ† ํ•ฉ๋‹ˆ๋‹ค. ํ•˜์œ„ ๋ฌธ์ž์—ด์ด vocabulary์— ์žˆ๋Š” ๊ฒฝ์šฐ ํ•ด๋‹น ๋ ์œ„์น˜๊นŒ์ง€ ๋‹จ์–ด์˜ ์ƒˆ๋กœ์šด ๋ถ„ํ• ์ด ์žˆ์œผ๋ฉฐ ์ด๋ฅผ best_segmentations์— ์žˆ๋Š” ๊ฒƒ๊ณผ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค.

๋ฉ”์ธ ๋ฃจํ”„๊ฐ€ ๋๋‚˜๋ฉด ๋‹จ์–ด์˜ ์‹œ์ž‘ ๋ถ€๋ถ„์— ๋„๋‹ฌํ•  ๋•Œ๊นŒ์ง€ ๋์—์„œ ์‹œ์ž‘ํ•˜์—ฌ ํŠน์ • ์‹œ์ž‘ ์œ„์น˜์—์„œ ๋‹ค์Œ ์œ„์น˜๋กœ ์ด๋™ํ•˜๋ฉด์„œ ํ† ํฐ์„ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค:

def encode_word(word, model):
    # One best segmentation per position in the word; "score" accumulates
    # negative log probabilities, so the empty prefix at position 0 scores 0
    best_segmentations = [{"start": 0, "score": 0}] + [
        {"start": None, "score": None} for _ in range(len(word))
    ]
    for start_idx in range(len(word)):
        # This should be properly filled by the previous steps of the loop
        best_score_at_start = best_segmentations[start_idx]["score"]
        for end_idx in range(start_idx + 1, len(word) + 1):
            token = word[start_idx:end_idx]
            if token in model and best_score_at_start is not None:
                score = model[token] + best_score_at_start
                # If we have found a better segmentation ending at end_idx, we update
                if (
                    best_segmentations[end_idx]["score"] is None
                    or best_segmentations[end_idx]["score"] > score
                ):
                    best_segmentations[end_idx] = {"start": start_idx, "score": score}
    
    segmentation = best_segmentations[-1]
    if segmentation["score"] is None:
        # We did not find a tokenization of the word -> unknown
        return ["<unk>"], None
    
    score = segmentation["score"]
    start = segmentation["start"]
    end = len(word)
    tokens = []
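    # Walk backwards through the recorded start indices to rebuild the tokens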
    while start != 0:
        tokens.insert(0, word[start:end])
        next_start = best_segmentations[start]["start"]
        end = start
        start = next_start
    tokens.insert(0, word[start:end])
    return tokens, score

๋ช‡ ๊ฐœ์˜ ๋‹จ์–ด๋“ค๋กœ ์œ„ ํ•จ์ˆ˜๋ฅผ ํ…Œ์ŠคํŠธํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

print(encode_word("Hopefully", model))
print(encode_word("This", model))

์ด์ œ ๋ง๋ญ‰์น˜์—์„œ ๋ชจ๋ธ์˜ ์†์‹ค์„ ์‰ฝ๊ฒŒ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!

def compute_loss(model):
    loss = 0
    for word, freq in word_freqs.items():
        _, word_loss = encode_word(word, model)
        loss += freq * word_loss
    return loss

ํ˜„์žฌ ๋ชจ๋ธ์—์„œ ์ž‘๋™ํ•˜๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

compute_loss(model)

๊ฐ ํ† ํฐ์˜ ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ๋„ ๊ทธ๋ฆฌ ์–ด๋ ต์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๊ฐ ํ† ํฐ์„ ์‚ญ์ œํ•˜์—ฌ ์–ป์€ ๋ชจ๋ธ์˜ ์†์‹ค์„ ๊ณ„์‚ฐํ•˜๊ธฐ๋งŒ ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค:

import copy

def compute_scores(model):
    scores = {}
    model_loss = compute_loss(model)
    for token, score in model.items():
        # We always keep tokens of length 1
        if len(token) == 1:
            continue
        model_without_token = copy.deepcopy(model)
        _ = model_without_token.pop(token)
        scores[token] = compute_loss(model_without_token) - model_loss
    return scores

๊ฐ ํ† ํฐ์— ๋Œ€ํ•ด์„œ ์œ„ ํ•จ์ˆ˜๋ฅผ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค:

scores = compute_scores(model)
print(scores["ll"])
print(scores["his"])

"ll"์€ "Hopefully"์˜ ํ† ํฐํ™”์— ์‚ฌ์šฉ๋˜๋ฉฐ ์ด๋ฅผ ์ œ๊ฑฐํ•˜๋ฉด ํ† ํฐ "l"์„ ๋Œ€์‹  ๋‘๋ฒˆ ์‚ฌ์šฉํ•˜๊ฒŒ ๋˜๋ฏ€๋กœ ์ถ”๊ฐ€์ ์ธ ์†์‹ค์ด ๋ฐœ์ƒํ•  ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒํ•ฉ๋‹ˆ๋‹ค. "his"๋Š” ๊ทธ ์ž์ฒด๋กœ ํ† ํฐํ™”๋œ "This" ๋‹จ์–ด ๋‚ด์—์„œ๋งŒ ์‚ฌ์šฉ๋˜๋ฏ€๋กœ ์†์‹ค์ด 0์ผ ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ๋Š” ์œ„์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

๐Ÿ’ก ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋งค์šฐ ๋น„ํšจ์œจ์ ์ด๋ฏ€๋กœ, SentencePiece๋Š” ํ† ํฐ X๊ฐ€ ์—†๋Š” ๋ชจ๋ธ ์†์‹ค์˜ ๊ทผ์‚ฌ์น˜๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ฒ˜์Œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” ๋Œ€์‹  ๋‚จ์€ vocabulary์˜ ๋ถ„ํ• ๋กœ ํ† ํฐ X๋ฅผ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฐ ์‹์œผ๋กœ ๋ชจ๋“  ์ ์ˆ˜๋Š” ๋ชจ๋ธ ์†์‹ค๊ณผ ๋™์‹œ์— ํ•œ ๋ฒˆ์— ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

With all of this in place, the last thing we need to do is add the special tokens used by the model to the vocabulary, then loop until we have pruned enough tokens from the vocabulary to reach our desired size:

percent_to_remove = 0.1
while len(model) > 100:
    scores = compute_scores(model)
    sorted_scores = sorted(scores.items(), key=lambda x: x[1])
    # Remove percent_to_remove tokens with the lowest scores.
    for i in range(int(len(model) * percent_to_remove)):
        _ = token_freqs.pop(sorted_scores[i][0])
        
    total_sum = sum([freq for token, freq in token_freqs.items()])
    model = {token: -log(freq / total_sum) for token, freq in token_freqs.items()}

์ž…๋ ฅ ํ…์ŠคํŠธ๋ฅผ ํ† ํฐํ™”ํ•˜๋ ค๋ฉด ์‚ฌ์ „ ํ† ํฐํ™”๋ฅผ ์ ์šฉํ•œ ๋‹ค์Œ encode_word() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค:

def tokenize(text, model):
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in words_with_offsets]
    encoded_words = [encode_word(word, model)[0] for word in pre_tokenized_text]
    return sum(encoded_words, [])

tokenize("This is the Hugging Face course.", model)

์œ ๋‹ˆ๊ทธ๋žจ์— ๋Œ€ํ•ด์„œ ์„ค๋ช…ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ง€๊ธˆ์ฏค์ด๋ฉด ํ† ํฌ๋‚˜์ด์ €์— ๊ด€ํ•œ ์ „๋ฌธ๊ฐ€๊ฐ€ ๋˜์…จ๊ธฐ๋ฅผ ๋ฐ”๋ž๋‹ˆ๋‹ค. ๋‹ค์Œ ์„น์…˜์—์„œ๋Š” ๐Ÿค— Tokenizers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ๋นŒ๋”ฉ ๋ธ”๋ก์„ ํƒ๊ตฌํ•˜๊ณ  ์ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ž์‹ ๋งŒ์˜ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๊ตฌ์ถ•ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๊ณต๋ถ€ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

์ข‹์€ ์›นํŽ˜์ด์ง€ ์ฆ๊ฒจ์ฐพ๊ธฐ