[🤗 Course 6.5] Normalization and Pre-tokenization

ํŠธ๋žœ์Šคํฌ๋จธ ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉ๋˜๋Š” ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ธ 3๊ฐ€์ง€ ํ•˜์œ„ ๋‹จ์–ด(subwword) ํ† ํฐํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜(Byte-Pair Encoding[BPE], WordPiece, Unigram)์— ๋Œ€ํ•ด ๋” ์ž์„ธํžˆ ์•Œ์•„๋ณด๊ธฐ ์ „์—, ๋จผ์ € ๊ฐ ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ํ…์ŠคํŠธ์— ์ ์šฉํ•˜๋Š” ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๋‹ค์Œ ๊ทธ๋ฆผ์€ ํ† ํฐํ™” ํŒŒ์ดํ”„๋ผ์ธ์˜ ๋‹จ๊ณ„์— ๋Œ€ํ•œ ์ƒ์œ„ ์ˆ˜์ค€์˜ ๊ฐœ์š”๋ฅผ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค:

ํ…์ŠคํŠธ๋ฅผ ํ•˜์œ„ ํ† ํฐ(subtokens)์œผ๋กœ ๋ถ„ํ• ํ•˜๊ธฐ ์ „์—(๋ชจ๋ธ์— ๋”ฐ๋ผ), ํ† ํฌ๋‚˜์ด์ €๋Š” ์ •๊ทœํ™”(normalization) ๋ฐ ์‚ฌ์ „ ํ† ํฐํ™”(pre-tokenization) ๋‘ ๋‹จ๊ณ„๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

Normalization

์ •๊ทœํ™” ๋‹จ๊ณ„์—๋Š” ๋ถˆํ•„์š”ํ•œ ๊ณต๋ฐฑ ์ œ๊ฑฐ, ์†Œ๋ฌธ์ž ๋ณ€ํ™˜(lowercasing) ๋ฐ ์•…์„ผํŠธ ์ œ๊ฑฐ ๋“ฑ๊ณผ ๊ฐ™์€ ๋ช‡๊ฐ€์ง€ ์ผ๋ฐ˜์ ์ธ ์ •์ œ ์ž‘์—…์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค. NFC ๋˜๋Š” NFKC์™€ ๊ฐ™์€ ์œ ๋‹ˆ์ฝ”๋“œ ์ •๊ทœํ™”(Unicode normalization) ์ž‘์—…๊ณผ ๊ฑฐ์˜ ๋™์ผํ•œ ์ž‘์—…์ด ์ด ๊ณผ์ •์—์„œ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.

๐Ÿค—Transformers์˜ tokenizer๋Š” ๐Ÿค—Tokenizers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ ํ•˜๋ถ€ ํ† ํฌ๋‚˜์ด์ €์— ๋Œ€ํ•œ ์•ก์„ธ์Šค๋ฅผ ์ œ๊ณตํ•˜๋Š” backend_tokenizer๋ผ๋Š” ์†์„ฑ์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer.backend_tokenizer))
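
Running this with a fast tokenizer should print something like:

<class 'tokenizers.Tokenizer'>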

ํ† ํฌ๋‚˜์ด์ € ๊ฐ์ฒด์˜ normalizer ์†์„ฑ์—๋Š” normalize_str() ๋ฉ”์„œ๋“œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฉ”์„œ๋“œ๋Š” ์ •๊ทœํ™”๊ฐ€ ์ˆ˜ํ–‰๋˜๋Š” ๋ฐฉ์‹์„ ํ™•์ธํ•˜๋Š”๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

print(tokenizer.backend_tokenizer.normalizer.normalize_str("Hรฉllรฒ hรดw are รผ?"))
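
With the bert-base-uncased checkpoint, this outputs:

hello how are u?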

์ด ์˜ˆ์—์„œ๋Š” bert-base-uncased ์ฒดํฌํฌ์ธํŠธ๋ฅผ ์„ ํƒํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ •๊ทœํ™” ๊ณผ์ •์—์„œ ์†Œ๋ฌธ์žํ™”(lowercasing)๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ์•…์„ผํŠธ๋ฅผ ์ œ๊ฑฐํ–ˆ์Šต๋‹ˆ๋‹ค.

โœ๏ธ Try it out! bert-base-cased ์ฒดํฌํฌ์ธํŠธ์—์„œ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๋กœ๋“œํ•˜๊ณ  ๋™์ผํ•œ ๋ฌธ์ž์—ด์„ ์ž…๋ ฅํ•ด๋ณด์„ธ์š”. ํ† ํฌ๋‚˜์ด์ €์˜ ๋Œ€์†Œ๋ฌธ์ž ๊ตฌ๋ถ„์ด ์žˆ๋Š” ๋ฒ„์ „๊ณผ ์†Œ๋ฌธ์ž ๋ณ€ํ™˜์ด ๋œ ๋ฒ„์ „ ๊ฐ„์— ๋ณผ ์ˆ˜ ์žˆ๋Š” ์ฃผ์š” ์ฐจ์ด์ ์€ ๋ฌด์—‡์ž…๋‹ˆ๊นŒ?

์‚ฌ์ „ํ† ํฐํ™”(Pre-tokenization)

๋‹ค์Œ ์„น์…˜์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด ํ† ํฌ๋‚˜์ด์ €๋Š” ์›์‹œ ํ…์ŠคํŠธ๋งŒ์œผ๋กœ๋Š” ํ•™์Šต๋  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ๋Œ€์‹ ์— ๋จผ์ € ํ…์ŠคํŠธ๋ฅผ ๋‹จ์–ด์™€ ๊ฐ™์€ ์ž‘์€ ๊ฐœ์ฒด๋“ค๋กœ ๋ถ„ํ• ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ์‚ฌ์ „ ํ† ํฐํ™”(pre-tokenization) ๋‹จ๊ณ„๊ฐ€ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค. 2์žฅ์—์„œ ๋ณด์•˜๋“ฏ์ด ๋‹จ์–ด ๊ธฐ๋ฐ˜ ํ† ํฌ๋‚˜์ด์ €๋Š” ์›์‹œ ํ…์ŠคํŠธ๋ฅผ ๋‹จ์ˆœํžˆ ๊ณต๋ฐฑ๊ณผ ๊ตฌ๋‘์ ์„ ๊ธฐ์ค€์œผ๋กœ ๋‹จ์–ด๋กœ ๋ถ„ํ• ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋‹จ์–ด๋“ค์€ ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ํ•™์Šต ๊ณผ์ •์—์„œ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋Š” ํ•˜์œ„ ํ† ํฐ(subtokens)์˜ ๊ฒฝ๊ณ„๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

๋น ๋ฅธ ํ† ํฌ๋‚˜์ด์ €(fast tokenizer)๊ฐ€ ์‚ฌ์ „ ํ† ํฐํ™”(pre-tokenization)๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ณผ์ •์„ ๋ณด๋ ค๋ฉด tokenizer ๊ฐ์ฒด์˜ pre_tokenizer ์†์„ฑ์ด ๊ฐ€์ง„ pre_tokenize_str() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")
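
For the BERT tokenizer, this returns a list of (word, offsets) tuples along these lines:

[('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]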

ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ์˜คํ”„์…‹(offsets)์„ ์–ด๋–ป๊ฒŒ ์œ ์ง€ํ•˜๊ณ  ์žˆ๋Š”์ง€์— ์ฃผ๋ชฉํ•˜์„ธ์š”. ์ด๋Š” ์ด์ „ ์„น์…˜์—์„œ ์‚ฌ์šฉํ•œ ์˜คํ”„์…‹ ๋งคํ•‘์„ ์ œ๊ณตํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ํ† ํฌ๋‚˜์ด์ €๋Š” ๋‘ ๊ฐœ์˜ ๊ณต๋ฐฑ("are"์™€ "you" ์‚ฌ์ด์— ์žˆ๋Š”)์„ ๋ฌด์‹œํ•˜๊ณ  ํ•˜๋‚˜์˜ ๊ณต๋ฐฑ์œผ๋กœ ๋ฐ”๊พธ์ง€๋งŒ, "are"์™€ "you" ์‚ฌ์ด์˜ ์˜คํ”„์…‹ ์ ํ”„(14์—์„œ 16)๋Š” ๊ณ„์† ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.

Since we're using a BERT tokenizer, the pre-tokenization involves splitting on whitespace and punctuation. Other tokenizers can have different rules for this step. For example, if we use the GPT-2 tokenizer:

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")
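
This time the output looks roughly like this:

[('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)), ('Ġ', (14, 15)), ('Ġyou', (15, 19)), ('?', (19, 20))]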

์œ„ ์ฝ”๋“œ์˜ ์‹คํ–‰ ๊ฒฐ๊ณผ์™€ ๊ฐ™์ด, ๊ณต๋ฐฑ๊ณผ ๊ตฌ๋‘์ ์—์„œ๋„ ๋ถ„ํ• ๋˜์ง€๋งŒ ๊ณต๋ฐฑ์€ ์—†์• ์ง€ ์•Š๊ณ  ฤ  ๊ธฐํ˜ธ๋กœ ๋Œ€์ฒดํ•˜๋ฏ€๋กœ ํ† ํฐ์„ ๋””์ฝ”๋”ฉํ•˜๋ฉด ์›๋ž˜ ๊ณต๋ฐฑ์„ ๋ณต๊ตฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Also note that, unlike the BERT tokenizer, this tokenizer does not ignore the double space.

For a last example, let's have a look at the T5 tokenizer, which is based on the SentencePiece algorithm:

tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")
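
The output should be along these lines:

[('▁Hello,', (0, 6)), ('▁how', (7, 10)), ('▁are', (11, 14)), ('▁you?', (16, 20))]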

GPT-2 ํ† ํฌ๋‚˜์ด์ €์™€ ๊ฐ™์ด T5 ํ† ํฌ๋‚˜์ด์ €๋Š” ๊ณต๋ฐฑ์„ ์œ ์ง€ํ•˜๊ณ  ํŠน์ • ํ† ํฐ(_)์œผ๋กœ ๋Œ€์ฒดํ•˜์ง€๋งŒ ๊ตฌ๋‘์ ์ด ์•„๋‹Œ ๊ณต๋ฐฑ์—์„œ๋งŒ ํ† ํฐ์„ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ธฐ๋ณธ์ ์œผ๋กœ ๋ฌธ์žฅ ์‹œ์ž‘ ๋ถ€๋ถ„("Hello" ์•ž๋ถ€๋ถ„)์— ๊ณต๋ฐฑ์„ ์ถ”๊ฐ€ํ•˜๊ณ  "are"์™€ "you" ์‚ฌ์ด์˜ ์ด์ค‘ ๊ณต๋ฐฑ์„ ๋ฌด์‹œํ–ˆ์Šต๋‹ˆ๋‹ค.

์ง€๊ธˆ๊นŒ์ง€ ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ํ…์ŠคํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฐ๊ฐ์˜ ๋ฐฉ๋ฒ•์„ ์กฐ๊ธˆ ์‚ดํŽด๋ณด์•˜์œผ๋ฏ€๋กœ ํ•˜๋ถ€์˜ ํ† ํฐํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ž์ฒด๋ฅผ ๊ณต๋ถ€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์šฐ์„  ๋‹ค์–‘ํ•œ ๋ชฉ์ ์œผ๋กœ ๊ด‘๋ฒ”์œ„ํ•˜๊ฒŒ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” SentencePiece๋ฅผ ๊ฐ„๋‹จํžˆ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋‹ค์Œ 3๊ฐœ์˜ ์„น์…˜์—์„œ ํ•˜์œ„ ๋‹จ์–ด(subword) ํ† ํฐํ™”์— ์‚ฌ์šฉ๋˜๋Š” ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ž‘๋™ํ•˜๋Š” ๋ฐฉ์‹์„ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

SentencePiece

SentencePiece๋Š” ๋‹ค์Œ ์„ธ ์„น์…˜์—์„œ ๋ณด๊ฒŒ ๋  ๋ชจ๋“  ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ํ…์ŠคํŠธ ์ „์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ํ† ํฐํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. ํ…์ŠคํŠธ๋ฅผ ์ผ๋ จ์˜ ์œ ๋‹ˆ์ฝ”๋“œ ๋ฌธ์ž๋“ค๋กœ ๊ฐ„์ฃผํ•˜๊ณ  ๊ณต๋ฐฑ์„ ํŠน์ˆ˜ ๋ฌธ์ž์ธ _๋กœ ์น˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค. Unigram ์•Œ๊ณ ๋ฆฌ์ฆ˜(์„น์…˜ 7 ์ฐธ์กฐ)๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋ฉด ์‚ฌ์ „ ํ† ํฐํ™”(pre-tokenization) ๋‹จ๊ณ„๊ฐ€ ํ•„์š”ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ๊ณต๋ฐฑ ๋ฌธ์ž๊ฐ€ ์‚ฌ์šฉ๋˜์ง€ ์•Š๋Š” ์–ธ์–ด(์˜ˆ: ์ค‘๊ตญ์–ด ๋˜๋Š” ์ผ๋ณธ์–ด)์— ๋งค์šฐ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.

SentencePiece์˜ ๋˜ ๋‹ค๋ฅธ ์ฃผ์š” ๊ธฐ๋Šฅ์€ ๊ฐ€์—ญ์  ํ† ํฐํ™”(reversible tokenization)์ž…๋‹ˆ๋‹ค. ๊ณต๋ฐฑ์— ๋Œ€ํ•œ ํŠน๋ณ„ํ•œ ์ฒ˜๋ฆฌ๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์— ํ† ํฐ ๋””์ฝ”๋”ฉ์€ ํ† ํฐ์„ ์—ฐ๊ฒฐํ•˜๊ณ  _s๋ฅผ ๊ณต๋ฐฑ์œผ๋กœ ๋ฐ”๊พธ๋Š” ๊ฒƒ์œผ๋กœ ๊ฐ„๋‹จํžˆ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ ์ •๊ทœํ™”๋œ ํ…์ŠคํŠธ๊ฐ€ ๋„์ถœ๋ฉ๋‹ˆ๋‹ค. ์•ž์—์„œ ๋ณด์•˜๋“ฏ์ด BERT ํ† ํฌ๋‚˜์ด์ €๋Š” ๋ฐ˜๋ณต๋˜๋Š” ๊ณต๋ฐฑ์„ ์ œ๊ฑฐํ•˜๋ฏ€๋กœ ํ† ํฐํ™”๋Š” ๋˜๋Œ๋ฆด ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

Overview of the tokenization algorithms

์ดํ›„ ์„น์…˜์—์„œ๋Š” ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ํ•˜์œ„ ๋‹จ์–ด ํ† ํฐํ™”(subword tokenization) ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ BPE(GPT-2 ๋“ฑ์—์„œ ์‚ฌ์šฉ), WordPiece(์˜ˆ: BERT์—์„œ ์‚ฌ์šฉ) ๋ฐ Unigram(T5 ๋ฐ ๊ทธ ์™ธ.)์— ๋Œ€ํ•ด ์ž์„ธํžˆ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค. ๋ณธ๊ฒฉ์ ์œผ๋กœ ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ๊ฐ๊ฐ์˜ ์ž‘๋™ ๋ฐฉ์‹์— ๋Œ€ํ•œ ๊ฐ„๋žตํ•œ ๊ฐœ์š”๋ฅผ ์•„๋ž˜ ํ‘œ์—์„œ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค์Œ ์„น์…˜์„ ๊ฐ๊ฐ ์ฝ์€ ํ›„ ๊ทธ๋ž˜๋„ ์ดํ•ด๊ฐ€ ๋˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ์—๋Š” ์ฃผ์ €ํ•˜์ง€ ๋ง๊ณ  ์ด ํ‘œ๋กœ ๋‹ค์‹œ ๋Œ์•„์˜ค์„ธ์š”.

๋ชจ๋ธBPEWordPieceUnigram
ํ•™์Šต ๊ณผ์ •(Training)์†Œ๊ทœ๋ชจ vocabulary์—์„œ ์‹œ์ž‘ํ•˜์—ฌ ํ† ํฐ ๋ณ‘ํ•ฉ ๊ทœ์น™์„ ๋ฐฐ์›๋‹ˆ๋‹ค.์†Œ๊ทœ๋ชจ vocabulary์—์„œ ์‹œ์ž‘ํ•˜์—ฌ ํ† ํฐ ๋ณ‘ํ•ฉ ๊ทœ์น™์„ ๋ฐฐ์›๋‹ˆ๋‹ค.๋Œ€๊ทœ๋ชจ vocabulary์—์„œ ์‹œ์ž‘ํ•˜์—ฌ ํ† ํฐ์„ ์ œ๊ฑฐํ•˜๋Š” ๊ทœ์น™์„ ๋ฐฐ์›๋‹ˆ๋‹ค.
ํ•™์Šต ๋‹จ๊ณ„(Training step)๊ฐ€์žฅ ๋นˆ๋ฒˆํ•˜๊ฒŒ ๋ฐœ์ƒ๋˜๋Š” ํ† ํฐ ์Œ์„ ๋ณ‘ํ•ฉํ•ฉ๋‹ˆ๋‹ค.ํ† ํฐ ์Œ์˜ ๋นˆ๋„์ˆ˜์— ๊ธฐ๋ฐ˜ํ•œ ์ตœ๊ณ  ์ ์ˆ˜๋ฅผ ๊ฐ€์ง„ ์Œ์— ํ•ด๋‹นํ•˜๋Š” ํ† ํฐ์„ ๋ณ‘ํ•ฉํ•˜๊ณ  ๊ฐ ๊ฐœ๋ณ„ ํ† ํฐ์˜ ๋นˆ๋„๊ฐ€ ๋‚ฎ์€ ์Œ์— ํŠน๊ถŒ์„ ๋ถ€์—ฌํ•ฉ๋‹ˆ๋‹ค.์ „์ฒด ์ฝ”ํผ์Šค์—์„œ ๊ณ„์‚ฐ๋œ ์†์‹ค(loss)์„ ์ตœ์†Œํ™”ํ•˜๋Š” vocabulary์˜ ๋ชจ๋“  ํ† ํฐ์„ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.
ํ•™์Šต ๊ฒฐ๊ณผ(Learns)ํ† ํฐ ๋ณ‘ํ•ฉ ๊ทœ์น™ ๋ฐ vocabularyVocabulary๊ฐ ํ† ํฐ์— ๋Œ€ํ•œ ์ ์ˆ˜๊ฐ€ ์žˆ๋Š” vocabulary
์ธ์ฝ”๋”ฉ(Encoding)๋‹จ์–ด๋ฅผ ๋ฌธ์ž๋“ค๋กœ ๋ถ„ํ• ํ•˜๊ณ  ํ•™์Šต ๊ณผ์ •์—์„œ ์Šต๋“ํ•œ ๋ณ‘ํ•ฉ ๊ทœ์น™ ์ ์šฉVocabulary์˜ ์‹œ์ž‘ ๋ถ€๋ถ„์—์„œ ์‹œ์ž‘ํ•˜์—ฌ ๊ฐ€์žฅ ๊ธด ํ•˜์œ„ ๋‹จ์–ด(longest subword)๋ฅผ ์ฐพ์€ ๋‹ค์Œ ๋‚˜๋จธ์ง€ ๋‹จ์–ด์— ๋Œ€ํ•ด ๋™์ผํ•œ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•™์Šต ๊ณผ์ •์—์„œ ํš๋“ํ•œ ์ ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ€์žฅ ๊ฐ€๋Šฅ์„ฑ ์žˆ๋Š” ํ† ํฐ ๋ถ„ํ• ์„ ์ฐพ์Œ

์ด์ œ BPE์— ๋Œ€ํ•ด์„œ ์ž์„ธํžˆ ์•Œ์•„๋ด…์‹œ๋‹ค!

์ข‹์€ ์›นํŽ˜์ด์ง€ ์ฆ๊ฒจ์ฐพ๊ธฐ