[🤗 Course 6.9] Building a tokenizer, block by block

As we saw in the previous sections, tokenization is carried out in several steps:

  • Normalization (any cleanup of the text that is deemed necessary, such as removing spaces or accents, Unicode normalization, etc.)

  • Pre-tokenization (splitting the input into words)

  • Running the input through the model (using the pre-tokenized words to produce a sequence of tokens)

  • Post-processing (adding the special tokens of the tokenizer, generating the attention mask and token type IDs)

As a reminder, that is the overall pipeline we will now rebuild piece by piece.

The 🤗 Tokenizers library has been built to provide several options for each of those steps, which you can mix and match as needed. In this section, rather than training a new tokenizer from an existing one as described in section 2, we'll see how to build a tokenizer completely from scratch. You'll then be able to build any kind of tokenizer you can think of!

More precisely, the library is built around a central Tokenizer class, with the building blocks for each step regrouped in submodules:

  • normalizers contains all the possible types of Normalizer you can use (complete list here).

  • pre_tokenizers contains all the possible types of PreTokenizer you can use (complete list here).

  • models contains the various types of Model you can use, like BPE, WordPiece, and Unigram (complete list here).

  • trainers contains all the different types of Trainer you can use to train your model on a corpus (one per type of model; complete list here).

  • post_processors contains the various types of PostProcessor you can use (complete list here).

  • decoders contains the various types of Decoder you can use to decode the outputs of tokenization (complete list here).

You can find the whole list of building blocks here.

Acquiring a corpus

To train our new tokenizer, we will use a small corpus of text (so the examples run fast). The steps for acquiring the corpus are similar to the ones we took at the beginning of this chapter, but this time we'll use the WikiText-2 dataset:

from datasets import load_dataset


dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")


def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

The get_training_corpus() function is a generator that yields batches of 1,000 texts, which we will use to train the tokenizer.

🤗 Tokenizers can also be trained on text files directly. Here's how we can generate a text file containing all the texts/inputs from WikiText-2 that we can use locally:

with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")

Next we'll show you how to build BERT, GPT-2, and XLNet tokenizers, block by block. That will give us an example of each of the three main tokenization algorithms: WordPiece, BPE, and Unigram. Let's start with BERT!

Building a WordPiece tokenizer from scratch

To build a tokenizer with the 🤗 Tokenizers library, we start by instantiating a Tokenizer object with a model, then set its normalizer, pre_tokenizer, post_processor, and decoder attributes to the values we want.

For this example, we'll create a Tokenizer with a WordPiece model:

from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

We have to specify the unk_token so the model knows what to return when it encounters characters it hasn't seen before. Other arguments we can set here include the vocab of our model (we're going to train the model, so we don't need to set this) and max_input_chars_per_word, which specifies a maximum length for each word (words longer than the value passed will be split).
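
For reference, here is a minimal sketch of what passing those optional arguments could look like; the tiny vocabulary is purely hypothetical, since in this chapter we train the vocabulary instead:

# Illustrative only: a WordPiece model built from a made-up vocabulary,
# with an explicit per-word length limit, instead of being trained as below.
wordpiece_with_vocab = models.WordPiece(
    vocab={"[UNK]": 0, "let": 1, "##'s": 2},  # hypothetical token -> id mapping
    unk_token="[UNK]",
    max_input_chars_per_word=100,  # words longer than this are split
)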

The first step of tokenization is normalization, so let's begin with that. Since BERT is widely used, there is a BertNormalizer with the classic options we can set for BERT: lowercase and strip_accents, which are self-explanatory; clean_text, which removes all control characters and replaces repeating spaces with a single one; and handle_chinese_chars, which places spaces around Chinese characters. To replicate the bert-base-uncased tokenizer, we can just set this normalizer:

tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
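
The other options mentioned above can be set explicitly as well. Here is a sketch of the full call; the default values shown are our assumption of the library's defaults, so double-check them against the documentation:

bert_normalizer = normalizers.BertNormalizer(
    lowercase=True,
    clean_text=True,            # strip control characters and collapse repeated spaces
    handle_chinese_chars=True,  # put spaces around Chinese characters
    strip_accents=None,         # None lets the lowercase setting decide, as in the original BERT
)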

Generally speaking, though, when building a new tokenizer you won't have access to such a handy normalizer already implemented in the 🤗 Tokenizers library, so let's see how to create the BERT normalizer by hand. The library provides a Lowercase normalizer and a StripAccents normalizer, and you can compose several normalizers using a Sequence:

tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)

Note that we're also using an NFD Unicode normalizer here; otherwise the StripAccents normalizer won't properly recognize the accented characters and thus won't strip them out.

As we've seen before, we can use the normalize_str() method of the normalizer to check out the effect it has on a given text:

print(tokenizer.normalizer.normalize_str("Hรฉllรฒ hรดw are รผ?"))

To go further: if you test the two previous normalizers on a string containing the Unicode character u"\u0085", you will notice that they are not exactly equivalent. To keep the normalizers.Sequence version from getting too complicated, we haven't included the regex replacements that the BertNormalizer applies when its clean_text argument is set to True, which is the default behavior. But don't worry: it is possible to obtain exactly the same normalization without using the handy BertNormalizer, by adding two normalizers.Replace steps to the normalizers sequence.
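
As a rough sketch of that idea (the patterns below are our approximation of what clean_text does, not the exact regexes the library uses internally, so edge cases may still differ):

from tokenizers import Regex

bert_like_normalizer = normalizers.Sequence(
    [
        normalizers.Replace(Regex(r"[\p{Cc}\p{Cf}]"), ""),  # drop control/format characters
        normalizers.Replace(Regex(r"\s+"), " "),            # collapse runs of whitespace into one space
        normalizers.NFD(),
        normalizers.Lowercase(),
        normalizers.StripAccents(),
    ]
)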

Next is the pre-tokenization step. Again, there is a prebuilt BertPreTokenizer that we can use:

tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

Or we can build it from scratch:

tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

Note that the Whitespace pre-tokenizer splits on whitespace and all characters that are not letters, digits, or the underscore character, so it effectively splits on whitespace and punctuation:

tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

If you only want to split on whitespace, you should use the WhitespaceSplit pre-tokenizer instead:

pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

Like with normalizers, you can use a Sequence to compose several pre-tokenizers:

pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()]
)
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")

The next step in the tokenization pipeline is running the inputs through the model. We already specified our model in the initialization, but it still needs to be trained, which will require a WordPieceTrainer. The main thing to remember when instantiating a trainer in 🤗 Tokenizers is that you need to pass it all the special tokens you intend to use; otherwise it won't add them to the vocabulary, since they are not in the training corpus:

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

As well as specifying the vocab_size and special_tokens, we can set the min_frequency (the number of times a token must appear to be included in the vocabulary) or change the continue_subword_prefix (if we want to use something different from ##).
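
Purely as an illustration, a trainer using those extra arguments might look like this (the values are arbitrary; we keep using the trainer defined above for the actual training):

custom_trainer = trainers.WordPieceTrainer(
    vocab_size=25000,
    special_tokens=special_tokens,
    min_frequency=2,               # a token must appear at least twice to enter the vocabulary
    continue_subword_prefix="##",  # the default prefix for non-initial subwords
)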

To train our model using the iterator we defined earlier, we just have to execute this command:

tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

We can also train our tokenizer on text files, which would look like this (we reinitialize the model with an empty WordPiece beforehand):

tokenizer.model = models.WordPiece(unk_token="[UNK]")
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

In both cases, we can then test the tokenizer on a text by calling the encode() method:

encoding = tokenizer.encode("Let's test this tokenizer")
print(encoding.tokens)

The encoding obtained is an Encoding object, which contains all the necessary outputs of the tokenizer in its various attributes: ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, and overflowing.
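
For example, we can quickly inspect a few of those attributes:

print(encoding.ids)
print(encoding.type_ids)
print(encoding.attention_mask)
print(encoding.offsets)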

The last step in the tokenization pipeline is post-processing. We need to add the [CLS] token at the beginning and the [SEP] token at the end (or after each sentence, if we have a pair of sentences). We will use a TemplateProcessor for this, but first we need to know the IDs of the [CLS] and [SEP] tokens in the vocabulary:

cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)

To write the template for the TemplateProcessor, we have to specify how to treat a single sentence and a pair of sentences. For both, we write the special tokens we want to use; the first (or single) sentence is represented by $A, while the second sentence (if encoding a pair) is represented by $B. For each of these (special tokens and sentences), we also specify the corresponding token type ID after a colon.

The classic BERT template is thus defined as follows:

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)

Note that we need to pass along the IDs of the special tokens, so the tokenizer can properly convert them to their IDs.

Once this is added, going back to our previous example gives:

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

And on a pair of sentences, we get this result:

encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)

We've almost finished building this tokenizer. The last step is to include a decoder:

tokenizer.decoder = decoders.WordPiece(prefix="##")

Let's test it on our previous encoding:

tokenizer.decode(encoding.ids)

Great! We can now save our tokenizer in a single JSON file like this:

tokenizer.save("tokenizer.json")

We can then reload that file in a Tokenizer object with the from_file() method:

new_tokenizer = Tokenizer.from_file("tokenizer.json")

To use this tokenizer in 🤗 Transformers, we have to wrap it in a PreTrainedTokenizerFast. We can either use the generic class or, if our tokenizer corresponds to an existing model, use that class (here, BertTokenizerFast). If you apply this lesson to build a brand-new tokenizer, you will have to use the first option.

To wrap the tokenizer in a PreTrainedTokenizerFast, we can either pass the tokenizer we built as a tokenizer_object or pass the tokenizer file we saved as tokenizer_file. The key thing to remember is that we have to set all the special tokens manually, since the PreTrainedTokenizerFast class cannot infer from the tokenizer object which token is the mask token, which is the [CLS] token, and so on:

from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
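
Alternatively, as mentioned above, we can point the wrapper at the JSON file we saved earlier instead of passing the object; a sketch, assuming tokenizer.json is the file written by tokenizer.save() above:

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",  # the file produced by tokenizer.save() earlier
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)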

If you are using a specific tokenizer class (like BertTokenizerFast), you will only need to specify the special tokens that are different from the default ones (here, none):

from transformers import BertTokenizerFast

wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)

You can then use this tokenizer like any other 🤗 Transformers tokenizer. You can save it with the save_pretrained() method, or upload it to the Hub with the push_to_hub() method.
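
For instance (the directory and repository names below are placeholders):

wrapped_tokenizer.save_pretrained("my-wordpiece-tokenizer")  # writes the tokenizer files to that directory
# wrapped_tokenizer.push_to_hub("my-wordpiece-tokenizer")    # requires being logged in to the Hugging Face Hub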

Now that we've seen how to build a WordPiece tokenizer, let's do the same for a BPE tokenizer. We'll go a bit faster since you know all the steps, and only highlight the differences.

Building a BPE tokenizer from scratch

Let's now build a GPT-2 tokenizer. As for the BERT tokenizer, we start by initializing a Tokenizer with a BPE model:

tokenizer = Tokenizer(models.BPE())

As with BERT, we could initialize this model with an existing vocabulary if we had one (in that case we would need to pass the vocab and merges), but since we will train from scratch, we don't need to. We also don't need to specify an unk_token, because GPT-2 uses byte-level BPE, which doesn't require it.
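
If we did have an existing vocabulary, the initialization could look roughly like this; the file names are placeholders for GPT-2-style vocab.json and merges.txt files:

# Illustrative only: load a pre-existing byte-level BPE vocabulary instead of training one.
bpe_from_files = models.BPE.from_file("vocab.json", "merges.txt")
tokenizer_from_files = Tokenizer(bpe_from_files)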

GPT-2 does not use a normalizer, so we skip that step and go directly to the pre-tokenization:

tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

The option we added to ByteLevel here is to not add a space at the beginning of a sentence (which is the default otherwise). We can take a look at the pre-tokenization of an example text like before:

tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")

Next is the model, which needs training. For GPT-2, the only special token is the end-of-text token:

trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

Like with the WordPieceTrainer, as well as the vocab_size and special_tokens, we can specify the min_frequency if we want to, or if we have an end-of-word suffix (like </w>), we can set it with end_of_word_suffix.
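
As an illustration (with arbitrary values, not used for the training above):

custom_bpe_trainer = trainers.BpeTrainer(
    vocab_size=25000,
    special_tokens=["<|endoftext|>"],
    min_frequency=2,            # ignore pairs that appear fewer than 2 times
    end_of_word_suffix="</w>",  # only if you want an explicit end-of-word marker
)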

This tokenizer can also be trained on text files:

tokenizer.model = models.BPE()
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

Let's have a look at the tokenization of a sample text:

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

We apply the byte-level post-processing for the GPT-2 tokenizer as follows:

tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

The trim_offsets=False option indicates to the post-processor that we should leave the offsets of tokens that begin with 'Ġ' as they are: this way the start of the offsets will point to the space before the word, not the first character of the word (since the space is technically part of the token). Let's have a look at the result with the text we just encoded, where 'Ġtest' is the token at index 4:

sentence = "Let's test this tokenizer."
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]

Finally, we add a byte-level decoder:

tokenizer.decoder = decoders.ByteLevel()

We can double-check that it works properly:

tokenizer.decode(encoding.ids)

Great, we're done! Now we can save the tokenizer like before, and wrap it in a PreTrainedTokenizerFast or GPT2TokenizerFast if we want to use it in 🤗 Transformers:

from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)
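
# Or, since this tokenizer corresponds to an existing model, with the model-specific class: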
from transformers import GPT2TokenizerFast

wrapped_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)

As the last example, we'll now show you how to build a Unigram tokenizer from scratch.

Building a Unigram tokenizer from scratch

Let's now build an XLNet tokenizer. As for the previous tokenizers, we start by initializing a Tokenizer with a Unigram model:

tokenizer = Tokenizer(models.Unigram())

Again, we could initialize this model with a vocabulary if we had one.
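
For reference, that initialization would look roughly like the sketch below; the (token, log-probability) pairs are made up, and the second argument (the index of the unknown token in the list) is an assumption about the constructor, so check it against the library documentation:

made_up_vocab = [("<unk>", 0.0), ("▁hello", -3.2), ("▁world", -3.5)]
tokenizer_with_vocab = Tokenizer(models.Unigram(made_up_vocab, 0))  # 0 = index of "<unk>"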

For the normalization, XLNet uses a few replacements (which come from SentencePiece):

from tokenizers import Regex

tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
    ]
)

These rules replace `` and '' with ", replace any sequence of two or more spaces with a single space, and remove the accents from the texts to tokenize.

The pre-tokenizer to use for any SentencePiece tokenizer is Metaspace:

tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

We can have a look at the pre-tokenization of an example text like before:

tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer!")

Next is the model, which needs training. XLNet has quite a few special tokens:

special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

A very important argument not to forget for the UnigramTrainer is the unk_token. We can also pass other arguments specific to the Unigram algorithm, such as the shrinking_factor for each step where we remove tokens (defaults to 0.75) or the max_piece_length to specify the maximum length of a given token (defaults to 16).
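
As an illustration, with the defaults mentioned above spelled out (not used for the actual training):

custom_unigram_trainer = trainers.UnigramTrainer(
    vocab_size=25000,
    special_tokens=special_tokens,
    unk_token="<unk>",
    shrinking_factor=0.75,  # controls how much of the vocabulary is kept at each pruning step
    max_piece_length=16,    # maximum length of a single token
)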

This tokenizer too can be trained on text files:

tokenizer.model = models.Unigram()
tokenizer.train(["wikitext-2.txt"], trainer=trainer)

Let's have a look at the tokenization of a sample text:

encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)

A peculiarity of XLNet is that it puts the <cls> token at the end of the sentence, with a token type ID of 2 (to distinguish it from the other tokens); as a result, the padding is on the left. We can deal with all the special tokens and token type IDs with a template, like for BERT, but first we have to get the IDs of the <cls> and <sep> tokens:

cls_token_id = tokenizer.token_to_id("<cls>")
sep_token_id = tokenizer.token_to_id("<sep>")
print(cls_token_id, sep_token_id)

The template looks like this:

tokenizer.post_processor = processors.TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)],
)

And we can check that it works properly by encoding a pair of sentences:

encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences!")
print(encoding.tokens)
print(encoding.type_ids)

Finally, we add a Metaspace decoder:

tokenizer.decoder = decoders.Metaspace()

And we're done with this tokenizer! We can save it like before, and wrap it in a PreTrainedTokenizerFast or XLNetTokenizerFast if we want to use it in 🤗 Transformers. One thing to note when using PreTrainedTokenizerFast is that, on top of specifying the special tokens, we also need to tell the 🤗 Transformers library to pad on the left (padding_side="left"):

from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)
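
# Or, since this tokenizer corresponds to an existing model, with the model-specific class: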
from transformers import XLNetTokenizerFast

wrapped_tokenizer = XLNetTokenizerFast(tokenizer_object=tokenizer)

Now that you've seen how the various building blocks are used to build existing tokenizers, you should be able to write any tokenizer you want with the 🤗 Tokenizers library and use it in 🤗 Transformers.
