[🤗 Course 6.7] WordPiece tokenization


WordPiece is the tokenization algorithm Google developed to pretrain BERT. It has since been reused in quite a few Transformer models based on BERT, such as DistilBERT, MobileBERT, Funnel Transformers, and MPNet. It's very similar to BPE in terms of training, but the actual tokenization is done differently.

๐Ÿ’ก ์ด ์„น์…˜์—์„œ๋Š” WordPiece๋ฅผ ์‹ฌ์ธต์ ์œผ๋กœ ๋‹ค๋ฃจ๋ฉฐ ์ „์ฒด ๊ตฌํ˜„ ๊ณผ์ •์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ํ† ํฐํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๋Œ€ํ•œ ์ผ๋ฐ˜์ ์ธ ๊ฐœ์š”๋ฅผ ์›ํ•˜๋Š” ๊ฒฝ์šฐ ์ƒ๋žตํ•ด๋„ ์ข‹์Šต๋‹ˆ๋‹ค.

Training algorithm

โš ๏ธ Google์€ WordPiece์˜ ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌํ˜„์„ ์˜คํ”ˆ ์†Œ์Šค๋กœ ๊ณต๊ฐœํ•˜์ง€ ์•Š์•˜์œผ๋ฏ€๋กœ ์ด๋ฒˆ ์„น์…˜์—์„œ๋Š” ๋ฐœํ‘œ๋œ ๋…ผ๋ฌธ ๋‚ด์šฉ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ตฌํ˜„ ๊ณผ์ •์„ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค. ๊ฑฐ์˜ 100% ์ •ํ™•ํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Like BPE, WordPiece starts from a small vocabulary including the special tokens used by the model and the initial alphabet. Since it identifies subwords by adding a prefix (like ## for BERT), each word is initially split by adding that prefix to all the characters inside the word. So, for instance, "word" gets split like this:

w ##o ##r ##d

๋”ฐ๋ผ์„œ ์ดˆ๊ธฐ ์•ŒํŒŒ๋ฒณ์—๋Š” ๋‹จ์–ด์˜ ์‹œ์ž‘ ๋ถ€๋ถ„์— ์žˆ๋Š” ๋ชจ๋“  ๋ฌธ์ž๋“ค(์˜ˆ: 'w')๊ณผ WordPiece ์ ‘๋‘์‚ฌ๊ฐ€ ์„ ํ–‰ํ•˜๋Š” ๋‹จ์–ด ๋‚ด๋ถ€์— ์žˆ๋Š” ๋ฌธ์ž(์˜ˆ: 'o', 'r', 'd')๊ฐ€ ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

Then, again like BPE, WordPiece learns merge rules. The main difference is the way the pair to be merged is selected. Instead of selecting the most frequent pair, WordPiece computes a score for each pair, using the following formula:

score = \frac{freq\_of\_pair}{freq\_of\_first\_element \times freq\_of\_second\_element}

์Œ์˜ ๋นˆ๋„๋ฅผ ๊ฐ ๋ถ€๋ถ„์˜ ๋นˆ๋„์˜ ๊ณฑ์œผ๋กœ ๋‚˜๋ˆ”์œผ๋กœ์จ, ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ฐ ๊ฐœ๋ณ„ ๋ถ€๋ถ„๋“ค์˜ ๋นˆ๋„๊ฐ€ ๋‚ฎ์€ ์Œ์˜ ๋ณ‘ํ•ฉ์— ๋†’์€ ์šฐ์„ ์ˆœ์œ„๋ฅผ ๋ถ€์—ฌํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, vocabulary ๋‚ด์—์„œ์˜ ์ถœํ˜„ ๋นˆ๋„๊ฐ€ ๋†’์€ ("un", "##able") ์Œ์„ ๊ตณ์ด ๋ณ‘ํ•ฉํ•  ํ•„์š”๋Š” ์—†๋Š”๋ฐ, ๊ทธ ์ด์œ ๋Š” "un"๊ณผ "##able" ๊ฐ๊ฐ์ด ๋‹ค๋ฅธ ๋‹จ์–ด ๋‚ด์—์„œ ๋งค์šฐ ๋นˆ๋ฒˆํ•˜๊ฒŒ ์ถœํ˜„ํ•˜์—ฌ ๋†’์€ ๋นˆ๋„๋ฅผ ๋‚˜ํƒ€๋‚ด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด์—, "hu"์™€ "##gging"์€ ๊ฐ๊ฐ์ด ์ž์ฃผ ์‚ฌ์šฉ๋˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ("hu", "##gging")๊ณผ ๊ฐ™์€ ์Œ์€ ์•„๋งˆ๋„ ๋” ๋นจ๋ฆฌ ๋ณ‘ํ•ฉ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค("hugging"์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ ์–ดํœ˜์— ์ž์ฃผ ๋“ฑ์žฅํ•œ๋‹ค๊ณ  ๊ฐ€์ •).

Let's look at the same vocabulary we used in the BPE training example:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

The splits here will be:

("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##g" "##s", 5)

๋”ฐ๋ผ์„œ ์ดˆ๊ธฐ vocabulary๋Š” ["b", "h", "p", "##g", "##n", "##s", "##u"]๊ฐ€ ๋ฉ๋‹ˆ๋‹ค(ํŠน์ˆ˜ ํ† ํฐ์€ ์ผ๋‹จ ์žŠ์–ด๋ฒ„๋ฆฝ์‹œ๋‹ค). ๊ฐ€์žฅ ๋นˆ๋ฒˆํ•œ ์Œ์€ ("##u", "##g")(ํ˜„์žฌ 20ํšŒ)์ด์ง€๋งŒ "##u"์˜ ๊ฐœ๋ณ„ ๋นˆ๋„๊ฐ€ ๋งค์šฐ ๋†’์•„ ์ ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋†’์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค(1/36). "##u"๊ฐ€ ํฌํ•จ๋œ ๋ชจ๋“  ์Œ์€ ์‹ค์ œ๋กœ ๋™์ผํ•œ ์ ์ˆ˜(1/36)๋ฅผ ๊ฐ€์ง€๋ฏ€๋กœ ๊ฐ€์žฅ ์ข‹์€ ์ ์ˆ˜๋Š” ("##g", "##s")์ด ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค(1/20). ์ด๋Š” "##u"๊ฐ€ ์—†๋Š” ์œ ์ผํ•œ ์Œ์ž…๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ํ•™์Šต๋œ ์ฒซ ๋ฒˆ์งธ ๋ณ‘ํ•ฉ์€ ("##g", "##s") -> ("##gs")์ž…๋‹ˆ๋‹ค.

Note that when merging, we remove the ## between the two tokens, so we add "##gs" to the vocabulary and apply the merge in all the words of the corpus:

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs"]
Corpus: ("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##gs", 5)

At this point, "##u" is in all the possible pairs, so they all end up with the same score. In that case, the first pair is merged, so ("h", "##u") -> "hu" is learned. This gives us:

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu"]
Corpus: ("hu" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("hu" "##gs", 5)

Then the next best score is shared by ("hu", "##g") and ("hu", "##gs") (with 1/15, compared to 1/21 for all the other pairs), so the first pair with the biggest score is merged:

Vocabulary: ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"]
Corpus: ("hug", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("hu" "##gs", 5)

And we continue like this until we reach the desired vocabulary size.

โœ๏ธ Now your turn! ๋‹ค์Œ ๋ณ‘ํ•ฉ ๊ทœ์น™์€ ๋ฌด์—‡์ผ๊นŒ์š”?

Tokenization algorithm

Tokenization differs in WordPiece and BPE in that WordPiece only saves the final vocabulary, not the merge rules learned. Starting from the word to tokenize, WordPiece finds the longest subword that is in the vocabulary, then splits on it. For instance, if we use the vocabulary learned in the example above, for the word "hugs" the longest subword starting from the beginning that is in the vocabulary is "hug", so we split there and get ["hug", "##s"]. We then continue with "##s", which is in the vocabulary, so the tokenization of "hugs" is ["hug", "##s"].

With BPE, we would have applied the merges learned in order and tokenized this as ["hu", "##gs"], so the encoding is different.

As another example, let's see how the word "bugs" would be tokenized. "b" is the longest subword starting at the beginning of the word that is in the vocabulary, so we split there and get the intermediate result ["b", "##ugs"]. Then "##u" is the longest subword starting at the beginning of "##ugs" that is in the vocabulary, so we split there and get ["b", "##u", "##gs"]. Finally, "##gs" is in the vocabulary, so ["b", "##u", "##gs"] is the tokenization of "bugs".

When the tokenization gets to a stage where it's not possible to find a subword in the vocabulary, the whole word is tokenized as unknown. For instance, "mug" would be tokenized as ["[UNK]"], as would "bum" (even though we can begin with "b" and "##u", "##m" is not in the vocabulary, so the resulting tokenization is just ["[UNK]"], not ["b", "##u", "[UNK]"]). This is another difference from BPE, which would only classify the individual characters not in the vocabulary as unknown.
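Here is a minimal sketch of that longest-match-first procedure, using the toy vocabulary learned above (the helper name wordpiece_tokenize is ours; the implementation used in the rest of this section comes below):

toy_vocab = ["b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"]

def wordpiece_tokenize(word, vocab):
    tokens = []
    while word:
        # Find the longest prefix of the remaining word that is in the vocabulary.
        i = len(word)
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            return ["[UNK]"]  # no known subword: the whole word becomes unknown
        tokens.append(word[:i])
        word = word[i:]
        if word:
            word = f"##{word}"  # mark the remainder as word-internal
    return tokens

print(wordpiece_tokenize("hugs", toy_vocab))  # ['hug', '##s']
print(wordpiece_tokenize("bugs", toy_vocab))  # ['b', '##u', '##gs']
print(wordpiece_tokenize("mug", toy_vocab))   # ['[UNK]']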

โœ๏ธ Now your turn! "pugs"๋ผ๋Š” ๋‹จ์–ด๋Š” ์–ด๋–ป๊ฒŒ ํ† ํฐํ™”๋ ๊นŒ์š”?

Implementing WordPiece

Now let's take a look at an implementation of the WordPiece algorithm. Like the BPE code, this is just meant to aid understanding, and you won't be able to use it on a big corpus.

We will use the same corpus as in the BPE example:

corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

First, we need to pre-tokenize the corpus into words. Since we are replicating a WordPiece tokenizer (like BERT), we will use the bert-base-cased tokenizer for the pre-tokenization:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Then we compute the frequencies of each word in the corpus as we do the pre-tokenization:

from collections import defaultdict

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1
        
word_freqs

As we saw before, the alphabet is the unique set composed of all the first letters of words, and all the other letters that appear in words prefixed by ##:

alphabet = []
for word in word_freqs.keys():
    if word[0] not in alphabet:
        alphabet.append(word[0])
    for letter in word[1:]:
        if f"##{letter}" not in alphabet:
            alphabet.append(f"##{letter}")
            
alphabet.sort()
print(alphabet)

We also add the special tokens used by the model at the beginning of that vocabulary. In the case of BERT, it's the list ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()

๋‹ค์Œ์œผ๋กœ vocabulary์— ์กด์žฌํ•˜๋Š” ์ ‘๋‘์‚ฌ๊ฐ€ ##์ด ์•„๋‹Œ ๋ชจ๋“  ๋ฌธ์ž๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ๋‹จ์–ด๋ฅผ ๋ถ„ํ• ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}
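At this point each word is stored as its first character followed by its ##-prefixed inner characters; a quick check on "about" (which appears in our corpus) should show this:

print(splits["about"])  # ['a', '##b', '##o', '##u', '##t']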

Now that we are ready for training, let's write a function that computes the score of each pair. We'll need to use it at each step of the training:

def compute_pair_scores(splits):
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    # Count the frequency of each token and of each pair of adjacent tokens.
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq

    # score = freq_of_pair / (freq_of_first_element * freq_of_second_element)
    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

Let's have a look at part of this dictionary after the initial splits:

pair_scores = compute_pair_scores(splits)
for i, key in enumerate(pair_scores.keys()):
    print(f"{key}: {pair_scores[key]}")
    if i >= 5:
        break

์ด์ œ ์ตœ๊ณ ์˜ ์ ์ˆ˜๋ฅผ ๊ฐ€์ง„ ์Œ์„ ์ฐพ๋Š” ๊ฐ„๋‹จํ•œ ๋ฃจํ”„๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

best_pair = ""
max_score = None
for pair, score in pair_scores.items():
    if max_score is None or max_score < score:
        best_pair = pair
        max_score = score

print(best_pair, max_score)

๋”ฐ๋ผ์„œ ํ•™์Šตํ•  ์ฒซ ๋ฒˆ์งธ ๋ณ‘ํ•ฉ์€ ('a', '##b') -> 'ab'์ด๊ณ  vocabulary์— 'ab'๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค:

vocab.append("ab")

๊ณ„์†ํ•˜๋ ค๋ฉด splits ๋”•์…”๋„ˆ๋ฆฌ์— ํ•ด๋‹น ๋ณ‘ํ•ฉ์„ ์ ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋‹ค๋ฅธ ํ•จ์ˆ˜๋ฅผ ์ž‘์„ฑํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

def merge_pair(a, b, splits):
    for word in word_freqs:
        split = splits[word]
        if len(split) == 1:
            continue
        i = 0
        while i < len(split) - 1:
            # Each time the pair (a, b) appears, glue the two tokens together,
            # dropping the "##" prefix of the second one.
            if split[i] == a and split[i + 1] == b:
                merge = a + b[2:] if b.startswith("##") else a + b
                split = split[:i] + [merge] + split[i + 2 :]
            else:
                i += 1
        splits[word] = split
    return splits

And we can have a look at the result of the first merge:

splits = merge_pair("a", "##b", splits)
splits["about"]

์ด์ œ ์›ํ•˜๋Š” ๋ชจ๋“  ๋ณ‘ํ•ฉ์„ ๋ชจ๋‘ ํ•™์Šตํ• ๋•Œ ๊นŒ์ง€ ๋ฐ˜๋ณตํ•˜๋Š”๋ฐ ํ•„์š”ํ•œ ๋ชจ๋“  ๊ฒƒ์„ ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ชฉํ‘œ vocabulary ํฌ๊ธฐ๋ฅผ 70์œผ๋กœ ํ•ฉ์‹œ๋‹ค:

vocab_size = 70
while len(vocab) < vocab_size:
    scores = compute_pair_scores(splits)
    # Find the pair with the best score.
    best_pair, max_score = "", None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    splits = merge_pair(*best_pair, splits)
    # The new token is the pair glued together, without the inner "##" prefix.
    new_token = (
        best_pair[0] + best_pair[1][2:]
        if best_pair[1].startswith("##")
        else best_pair[0] + best_pair[1]
    )
    vocab.append(new_token)

์ƒ์„ฑ๋œ vocabulary๋ฅผ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

print(vocab)

As we can see, compared to BPE, this tokenizer learns parts of words as tokens a bit faster.

๐Ÿ’ก ๋™์ผํ•œ ๋ง๋ญ‰์น˜์—์„œ train_new_from_iterator()๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋˜‘๊ฐ™์€ vocabulary๊ฐ€ ๋‚˜์˜ค์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๐Ÿค—Tokenizers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ํ•™์Šต์„ ์œ„ํ•ด WordPiece๋ฅผ ๊ตฌํ˜„ํ•˜์ง€ ์•Š๊ณ (๋‚ด๋ถ€์— ๋Œ€ํ•ด ์™„์ „ํžˆ ํ™•์‹ ํ•˜์ง€ ๋ชปํ•˜๊ธฐ ๋•Œ๋ฌธ์—) ๋Œ€์‹  BPE๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

To tokenize a new text, we pre-tokenize it, split it, then apply the tokenization algorithm on each word. That is, we look for the biggest subword starting at the beginning of the first word and split it, then we repeat the process on the second part, and so on for the rest of that word and for the following words in the text:

def encode_word(word):
    tokens = []
    while len(word) > 0:
        # Find the longest prefix of the remaining word that is in the vocabulary.
        i = len(word)
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            # No known subword: the whole word is unknown.
            return ["[UNK]"]
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            word = f"##{word}"
    return tokens

Let's test it on one word that's in the vocabulary, and another that isn't:

print(encode_word("Hugging"))
print(encode_word("HOgging"))

์ด์ œ ํ…์ŠคํŠธ๋ฅผ ํ† ํฐํ™”ํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ์ž‘์„ฑํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

def tokenize(text):
    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    encoded_words = [encode_word(word) for word in pre_tokenized_text]
    # Flatten the list of per-word token lists into a single list of tokens.
    return sum(encoded_words, [])

We can try it on any text:

tokenize("This is the Hugging Face course!")

That's it for the WordPiece algorithm! Now let's take a look at Unigram.

์ข‹์€ ์›นํŽ˜์ด์ง€ ์ฆ๊ฒจ์ฐพ๊ธฐ