[๐Ÿค— ๊ฐ•์ขŒ 6.3] "๋น ๋ฅธ(fast)" ํ† ํฌ๋‚˜์ด์ €์˜ ํŠน๋ณ„ํ•œ ๋Šฅ๋ ฅ

27164 ๋‹จ์–ด fast tokenizerfast tokenizer

์ด ์„น์…˜์—์„œ๋Š” ๐Ÿค—Transformers์—์„œ ํ† ํฌ๋‚˜์ด์ €์˜ ๊ธฐ๋Šฅ์„ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ์ง€๊ธˆ๊นŒ์ง€๋Š” ์ž…๋ ฅ์„ ํ† ํฐํ™”ํ•˜๊ฑฐ๋‚˜ ํ† ํฐ ์•„์ด๋””๋ฅผ ๋‹ค์‹œ ํ…์ŠคํŠธ๋กœ ๋””์ฝ”๋”ฉํ•˜๋Š”๋ฐ๋งŒ ์‚ฌ์šฉํ–ˆ์ง€๋งŒ ํ† ํฌ๋‚˜์ด์ €, ํŠนํžˆ ๐Ÿค—Tokenizers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ ์ง€์›ํ•˜๋Š” ํ† ํฌ๋‚˜์ด์ €๋Š” ํ›จ์”ฌ ๋” ๋งŽ์€ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์ถ”๊ฐ€ ๊ธฐ๋Šฅ์„ ์„ค๋ช…ํ•˜๊ธฐ ์œ„ํ•ด 1์žฅ์—์„œ ์ฒ˜์Œ ์ ‘ํ•œ token-classification(NER) ๋ฐ question-answering ํŒŒ์ดํ”„๋ผ์ธ์˜ ๊ฒฐ๊ณผ๋ฅผ ์žฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค.

๋‹ค์Œ ๋…ผ์˜์—์„œ ์šฐ๋ฆฌ๋Š” ์ข…์ข… "๋Š๋ฆฐ(slow)" ํ† ํฌ๋‚˜์ด์ €์™€ "๋น ๋ฅธ(fast)" ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๊ตฌ๋ถ„ํ•ด์„œ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค. "๋Š๋ฆฐ(slow)" ํ† ํฌ๋‚˜์ด์ €๋Š” ๐Ÿค—Transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋‚ด๋ถ€์—์„œ Python์œผ๋กœ ์ž‘์„ฑ๋œ ๊ฒƒ์ด๊ณ , ๋น ๋ฅธ ๋ฒ„์ „์€ Rust๋กœ ์ž‘์„ฑ๋˜์–ด ๐Ÿค—Tokenizers์—์„œ ์ œ๊ณตํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์•ฝ๋ฌผ ๊ฒ€ํ† (drug review) ๋ฐ์ดํ„ฐ์…‹์„ ํ† ํฐํ™”ํ•˜๋Š”๋ฐ ๋น ๋ฅธ ํ˜น์€ ๋Š๋ฆฐ ํ† ํฌ๋‚˜์ด์ €์˜ ์‹คํ–‰ ์†๋„๋ฅผ ๋ณด๊ณ ํ•œ 5์žฅ์˜ ํ‘œ๋ฅผ ๊ธฐ์–ตํ•œ๋‹ค๋ฉด ์šฐ๋ฆฌ๊ฐ€ ์ด๋ฅผ ๋น ๋ฅด๊ณ  ๋Š๋ฆฐ ๊ฒƒ์œผ๋กœ ๋ถ€๋ฅด๋Š” ์ด์œ ๋ฅผ ์•Œ ์ˆ˜ ์žˆ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๋‹จ์ผ ๋ฌธ์žฅ์„ ํ† ํฐํ™”ํ•  ๋•Œ ๋™์ผํ•œ ํ† ํฌ๋‚˜์ด์ €์˜ ๋Š๋ฆฐ ๋ฒ„์ „๊ณผ ๋น ๋ฅธ ๋ฒ„์ „ ๊ฐ„์˜ ์†๋„ ์ฐจ์ด๊ฐ€ ํ•ญ์ƒ ๋‚˜๋Š” ๊ฒƒ์€ ์•„๋‹™๋‹ˆ๋‹ค. ์‚ฌ์‹ค, ๋น ๋ฅธ ๋ฒ„์ „์€ ์‹ค์ œ๋กœ ๋” ๋Š๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ๋งŽ์€ ํ…์ŠคํŠธ๋ฅผ ๋™์‹œ์— ํ† ํฐํ™”ํ•  ๋•Œ๋งŒ ์ฐจ์ด๋ฅผ ๋ช…ํ™•ํ•˜๊ฒŒ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฐฐ์น˜ ์ธ์ฝ”๋”ฉ (Batch encoding)

ํ† ํฌ๋‚˜์ด์ €์˜ ์ถœ๋ ฅ์€ ๋‹จ์ˆœํ•œ Python ๋”•์…”๋„ˆ๋ฆฌ๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๊ฐ€ ์–ป๋Š” ๊ฒƒ์€ ์‹ค์ œ๋กœ ํŠน๋ณ„ํ•œ BatchEncoding ๊ฐ์ฒด์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ๋”•์…”๋„ˆ๋ฆฌ์˜ ํ•˜์œ„ ํด๋ž˜์Šค์ด์ง€๋งŒ(์ด๊ฒƒ์ด ์ด์ „์— ์šฐ๋ฆฌ๊ฐ€ ๋ฌธ์ œ์—†์ด ํ•ด๋‹น ๊ฒฐ๊ณผ๋ฅผ ์ƒ‰์ธํ™”ํ•  ์ˆ˜ ์žˆ์—ˆ๋˜ ์ด์œ ์ž…๋‹ˆ๋‹ค), ๋น ๋ฅธ ํ† ํฌ๋‚˜์ด์ €์—์„œ ์ฃผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ์ถ”๊ฐ€ ๋ฉ”์„œ๋“œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ณ‘๋ ฌํ™”(parallelization) ๊ธฐ๋Šฅ ์™ธ์—๋„, ๋น ๋ฅธ ํ† ํฌ๋‚˜์ด์ €์˜ ์ฃผ์š” ๊ธฐ๋Šฅ์€ ์ตœ์ข… ํ† ํฐ์ด ์›๋ณธ ํ…์ŠคํŠธ์—์„œ ์–ด๋””์— ์œ„์น˜ํ•˜๋Š”์ง€ ๋ฒ”์œ„(span)๋ฅผ ํ•ญ์ƒ ์ถ”์ ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ์˜คํ”„์…‹ ๋งคํ•‘(offset mapping) ์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ์ฐจ๋ก€๋Œ€๋กœ ๊ฐ ๋‹จ์–ด๋ฅผ ์ƒ์„ฑ๋œ ํ† ํฐ์— ๋งคํ•‘ํ•˜๊ฑฐ๋‚˜ ์›๋ณธ ํ…์ŠคํŠธ์˜ ๊ฐ ๋ฌธ์ž๋ฅผ ๋‚ด๋ถ€ ํ† ํฐ์— ๋งคํ•‘ํ•˜๊ฑฐ๋‚˜ ๊ทธ ๋ฐ˜๋Œ€๋กœ ๋งคํ•‘ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™์€ ๊ธฐ๋Šฅ๋“ค์ž…๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(type(encoding))  # <class 'transformers.tokenization_utils_base.BatchEncoding'>

์œ„์—์„œ ์–ธ๊ธ‰ํ–ˆ๋“ฏ์ด, ํ† ํฌ๋‚˜์ด์ €์˜ ์ถœ๋ ฅ์—์„œ BatchEncoding ๊ฐ์ฒด๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค. AutoTokenizer ํด๋ž˜์Šค๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ ๋น ๋ฅธ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์„ ํƒํ•˜๋ฏ€๋กœ ์ด BatchEncoding ๊ฐ์ฒด๊ฐ€ ์ œ๊ณตํ•˜๋Š” ์ถ”๊ฐ€ ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ๋น ๋ฅธ์ง€ ๋Š๋ฆฐ์ง€ ํ™•์ธํ•˜๋Š” ๋‘ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์šฐ์„ , ํ† ํฌ๋‚˜์ด์ €์˜ is_fast ์†์„ฑ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

tokenizer.is_fast  # True

๋˜๋Š” encoding์˜ is_fast ์†์„ฑ์„ ํ™•์ธํ•˜๋Š” ๋ฐฉ๋ฒ•๋„ ์žˆ์Šต๋‹ˆ๋‹ค:

encoding.is_fast  # True

๋น ๋ฅธ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๊ฐ€์ง€๊ณ  ์šฐ๋ฆฌ๊ฐ€ ๋ฌด์—‡์„ ํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ๋ด…์‹œ๋‹ค. ์ฒซ์งธ, ํ† ํฐ ์•„์ด๋””๋ฅผ ๋‹ค์‹œ ํ† ํฐ์œผ๋กœ ๋ณ€ํ™˜ํ•˜์ง€ ์•Š๊ณ ๋„ ํ† ํฐ์— ์•ก์„ธ์Šคํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

encoding.tokens()
# ['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at',
#  'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']

์ด ๊ฒฝ์šฐ ์ธ๋ฑ์Šค 5์˜ ํ† ํฐ์€ ##yl์ด๋ฉฐ, ์ด๋Š” ์›๋ž˜ ๋ฌธ์žฅ์—์„œ "Sylvain"์ด๋ผ๋Š” ๋‹จ์–ด์˜ ์ผ๋ถ€์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ word_ids() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ํ† ํฐ์ด ์œ ๋ž˜๋œ ํ•ด๋‹น ๋‹จ์–ด์˜ ์ธ๋ฑ์Šค๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

encoding.word_ids()
# [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

ํ† ํฌ๋‚˜์ด์ €์˜ ํŠน์ˆ˜ ํ† ํฐ [CLS] ๋ฐ [SEP]๊ฐ€ None์œผ๋กœ ๋งคํ•‘๋œ ๋‹ค์Œ, ๊ฐœ๋ณ„ ํ† ํฐ๋“ค์ด ํ•ด๋‹น ํ† ํฐ์ด ์œ ๋ž˜ํ•œ ๋‹จ์–ด์— ๋งคํ•‘๋˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฉ”์„œ๋“œ๋Š” ๋‘ ๊ฐœ์˜ ํ† ํฐ์ด ๊ฐ™์€ ๋‹จ์–ด์— ์žˆ๋Š”์ง€ ์•„๋‹ˆ๋ฉด ํ† ํฐ์ด ๋‹จ์–ด์˜ ์‹œ์ž‘ ๋ถ€๋ถ„์— ์žˆ๋Š”์ง€ ํ™•์ธํ•˜๋Š”๋ฐ ํŠนํžˆ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ## ์ ‘๋‘์‚ฌ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ์ด๋Š” BERT์™€ ๊ฐ™์€ ์œ ํ˜•์˜ ํ† ํฌ๋‚˜์ด์ €์—์„œ๋งŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ ์†๋„๊ฐ€ ๋น ๋ฅธ ๋ชจ๋“  ์œ ํ˜•์˜ ํ† ํฌ๋‚˜์ด์ €์—์„œ ์œ ํšจํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ ์žฅ์—์„œ๋Š” ์ด ๊ธฐ๋Šฅ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐœ์ฒด๋ช… ์ธ์‹(NER) ๋ฐ ํ’ˆ์‚ฌ(POS) ํƒœ๊น…๊ณผ ๊ฐ™์€ ์ž‘์—…์—์„œ ๊ฐ ๋‹จ์–ด์— ํ•ด๋‹นํ•˜๋Š” ๋ ˆ์ด๋ธ”์„ ํ† ํฐ์— ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณผ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋งˆ์Šคํฌ ์–ธ์–ด ๋ชจ๋ธ๋ง(masked language modeling), ์ฆ‰ ์ „์ฒด ๋‹จ์–ด ๋งˆ์Šคํ‚น(whole word masking)์ด๋ผ๊ณ  ํ•˜๋Š” ๊ธฐ๋ฒ•์—์„œ, ๋™์ผํ•œ ๋‹จ์–ด์—์„œ ๋ถ„๋ฆฌ๋œ ๋ชจ๋“  ํ† ํฐ๋“ค์„ ๋งˆ์Šคํ‚นํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.

๋‹จ์–ด๊ฐ€ ๋ฌด์—‡์ธ์ง€์— ๋Œ€ํ•œ ๊ฐœ๋…์€ ๋ณต์žกํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, "I'll"("I will"์˜ ์ถ•์•ฝํ˜•)์€ ํ•˜๋‚˜์˜ ๋‹จ์–ด์ผ๊นŒ์š”? ์•„๋‹ˆ๋ฉด ๋‘๊ฐœ์ผ๊นŒ์š”? ์ด๋Š” ์‹ค์ œ๋กœ ํ† ํฌ๋‚˜์ด์ €์™€ ์ ์šฉ๋˜๋Š” ์‚ฌ์ „ ํ† ํฐํ™”(pre-tokenization) ์ž‘์—…์— ๋”ฐ๋ผ ๋‹ค๋ฆ…๋‹ˆ๋‹ค. ์ผ๋ถ€ ํ† ํฌ๋‚˜์ด์ €๋Š” ๊ณต๋ฐฑ ๊ธฐ์ค€์œผ๋กœ ๋ถ„ํ• ํ•˜๋ฏ€๋กœ ์ด๋ฅผ ํ•œ ๋‹จ์–ด๋กœ ๊ฐ„์ฃผํ•ฉ๋‹ˆ๋‹ค. ๋˜ ๋‹ค๋ฅธ ํ† ํฌ๋‚˜์ด์ €๋“ค์€ ๊ณต๋ฐฑ ์œ„์— ๊ตฌ๋‘์ ์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ๋‘ ๋‹จ์–ด๋กœ ๊ฐ„์ฃผํ•ฉ๋‹ˆ๋‹ค.

๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, ํ† ํฐ์„ ๊ฐ€์ ธ์˜จ ๋ฌธ์žฅ์— ํ•ด๋‹น ํ† ํฐ์„ ๋งคํ•‘ํ•˜๋Š”๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” sentence_ids() ๋ฉ”์„œ๋“œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค(์ด ๊ฒฝ์šฐ ํ† ํฌ๋‚˜์ด์ €์—์„œ ๋ฐ˜ํ™˜๋œ token_type_ids๊ฐ€ ๋™์ผํ•œ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค).

๋งˆ์ง€๋ง‰์œผ๋กœ word_to_chars() ๋˜๋Š” token_to_chars() ๋ฐ char_to_word() ๋˜๋Š” char_to_token() ๋ฉ”์„œ๋“œ๋ฅผ ํ†ตํ•ด ๋ชจ๋“  ๋‹จ์–ด ๋˜๋Š” ํ† ํฐ์„ ์›๋ณธ ํ…์ŠคํŠธ์˜ ๋ฌธ์ž์— ๋งคํ•‘ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ ๊ทธ ๋ฐ˜๋Œ€๋กœ๋„ ๋งคํ•‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, word_ids() ๋ฉ”์„œ๋“œ๋Š” ##yl์ด ์ธ๋ฑ์Šค 3์— ์žˆ๋Š” ๋‹จ์–ด์˜ ์ผ๋ถ€๋ผ๊ณ  ์•Œ๋ ค์คฌ์ง€๋งŒ ์ •ํ™•ํžˆ ๋ฌธ์žฅ ๋‚ด์—์„œ ์–ด๋–ค ๋‹จ์–ด์— ํ•ด๋‹นํ•˜๋Š” ๊ฑธ๊นŒ์š”? ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

start, end = encoding.word_to_chars(3)
example[start:end]  # 'Sylvain'

์ด์ „์— ์–ธ๊ธ‰ํ–ˆ๋“ฏ์ด ์ด ๋ชจ๋“  ๊ฒƒ์€ ๋น ๋ฅธ ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ์˜คํ”„์…‹(offset) ๋ชฉ๋ก์—์„œ ๊ฐ ํ† ํฐ์ด ๊ฐ€์ ธ์˜จ ํ…์ŠคํŠธ ๋ฒ”์œ„(span)๋ฅผ ์ถ”์ ํ•œ๋‹ค๋Š” ์‚ฌ์‹ค์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ๊ตฌ๋™๋ฉ๋‹ˆ๋‹ค. ํ™œ์šฉ ๋ฐฉ๋ฒ•์„ ์„ค๋ช…ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์Œ์œผ๋กœ token-classification ํŒŒ์ดํ”„๋ผ์ธ์˜ ๊ฒฐ๊ณผ๋ฅผ ์ˆ˜๋™์œผ๋กœ ๋ณต์ œํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

token-classification ํŒŒ์ดํ”„๋ผ์ธ์˜ ๋‚ด๋ถ€ ๋™์ž‘

1์žฅ์—์„œ ์šฐ๋ฆฌ๋Š” ๐Ÿค—Transformers์˜ pipeline() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ, ํ…์ŠคํŠธ์˜ ์–ด๋Š ๋ถ€๋ถ„์ด ์‚ฌ๋žŒ(person), ์œ„์น˜(location) ๋˜๋Š” ์กฐ์ง(organization)๊ณผ ๊ฐ™์€ ์—”ํ„ฐํ‹ฐ(entities)์— ํ•ด๋‹นํ•˜๋Š”์ง€ ์‹๋ณ„ํ•˜๋Š” ์ž‘์—…์ธ NER์„ ์ฒ˜์Œ์œผ๋กœ ์‚ดํŽด๋ดค์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ 2์žฅ์—์„œ ํŒŒ์ดํ”„๋ผ์ธ์ด ์›์‹œ ํ…์ŠคํŠธ๋ฅผ ๋Œ€์ƒ์œผ๋กœ ์˜ˆ์ธกํ•˜๋Š”๋ฐ ํ•„์š”ํ•œ ์„ธ ๋‹จ๊ณ„ ์ฆ‰, ํ† ํฐํ™”(tokenization), ๋ชจ๋ธ์„ ํ†ตํ•œ ์ž…๋ ฅ ์ „๋‹ฌ, ํ›„์ฒ˜๋ฆฌ(post-processing)๋ฅผ ์–ด๋–ป๊ฒŒ ๊ทธ๋ฃนํ™”ํ•˜๋Š”์ง€๋ฅผ ๋ณด์•˜์Šต๋‹ˆ๋‹ค. token-classification ํŒŒ์ดํ”„๋ผ์ธ์˜ ์ฒ˜์Œ ๋‘ ๋‹จ๊ณ„๋Š” ๋‹ค๋ฅธ ํŒŒ์ดํ”„๋ผ์ธ๊ณผ ๋™์ผํ•˜์ง€๋งŒ ํ›„์ฒ˜๋ฆฌ(post-processing)๋Š” ์กฐ๊ธˆ ๋” ๋ณต์žกํ•ฉ๋‹ˆ๋‹ค. ํ•œ๋ฒˆ ์‚ดํŽด๋ด…์‹œ๋‹ค!

ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ ๊ธฐ๋ณธ ์‹คํ–‰ ๊ฒฐ๊ณผ ๋„์ถœํ•˜๊ธฐ

๋จผ์ €, ์ˆ˜์ž‘์—…์œผ๋กœ ๋น„๊ตํ•  ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋„๋ก token-classification ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌํ˜„ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ์‚ฌ์šฉ๋˜๋Š” ๋ชจ๋ธ์€ dbmdz/bert-large-cased-finetuned-conll03-english์ž…๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ์€ ๋ฌธ์žฅ์— ๋Œ€ํ•ด NER๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

from transformers import pipeline

token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
# [{'entity': 'I-PER', 'score': ..., 'index': 4, 'word': 'S', 'start': 11, 'end': 12},
#  ...,
#  {'entity': 'I-LOC', 'score': ..., 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

๋ชจ๋ธ์€ "Sylvain"์—์„œ ๋ถ„๋ฆฌ๋œ ๊ฐ ํ† ํฐ๋“ค์„ ๋ชจ๋‘ ์‚ฌ๋žŒ(person)์œผ๋กœ, "Hugging Face"์—์„œ ๋ถ„๋ฆฌ๋œ ๊ฐ ํ† ํฐ๋“ค์„ ๋ชจ๋‘ ์กฐ์ง(organization)์œผ๋กœ, "Brooklyn" ํ† ํฐ์„ ์œ„์น˜(location)๋กœ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์‹๋ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค. ํŒŒ์ดํ”„๋ผ์ธ์— ๋™์ผํ•œ ์—”ํ„ฐํ‹ฐ์— ํ•ด๋‹นํ•˜๋Š” ํ† ํฐ์„ ๊ทธ๋ฃนํ™”ํ•˜๋„๋ก ์š”์ฒญํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค:

from transformers import pipeline

token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")
# [{'entity_group': 'PER', 'score': ..., 'word': 'Sylvain', 'start': 11, 'end': 18},
#  {'entity_group': 'ORG', 'score': ..., 'word': 'Hugging Face', 'start': 33, 'end': 45},
#  {'entity_group': 'LOC', 'score': ..., 'word': 'Brooklyn', 'start': 49, 'end': 57}]

aggregation_strategy๋ฅผ ์œ„์™€ ๊ฐ™์ด ์ง€์ •ํ•˜๋ฉด ํ† ํฐ๋“ค์ด ํ•˜๋‚˜๋กœ ํ•ฉ์ณ์ง„ ์—”ํ„ฐํ‹ฐ์— ๋Œ€ํ•ด ์ƒˆ๋กญ๊ฒŒ ๊ณ„์‚ฐ๋œ ์Šค์ฝ”์–ด๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. "simple"์˜ ๊ฒฝ์šฐ ์Šค์ฝ”์–ด๋Š” ํ•ด๋‹น ๊ฐœ์ฒด๋ช… ๋‚ด์˜ ๊ฐ ํ† ํฐ์— ๋Œ€ํ•œ ์Šค์ฝ”์–ด์˜ ํ‰๊ท ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, "Sylvain"์˜ ์Šค์ฝ”์–ด๋Š” ์ด์ „ ์˜ˆ์—์„œ S, ##yl, ##va ๋ฐ ##in ํ† ํฐ์— ๋Œ€ํ•ด ๊ณ„์‚ฐ๋œ ์Šค์ฝ”์–ด์˜ ํ‰๊ท ์ž…๋‹ˆ๋‹ค. ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋‹ค๋ฅธ ์ง€์ •์ž๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • "first", where the score of each entity is the score of the first token of that entity (so for "Sylvain" it would be 0.993828, the score of the token S).

  • "max", where the score of each entity is the maximum score among the tokens in that entity (so for "Hugging Face" it would be 0.98879766, the score of "Face").

  • "average", where the score of each entity is the average of the scores of the words composing that entity (not the tokens; so for "Sylvain" there would be no difference from the "simple" strategy, but "Hugging Face" would have a score of 0.9819, the average of the scores for "Hugging", 0.975, and "Face", 0.98879).
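
Here is the comparison sketch, reusing the default checkpoint (each loop iteration reloads the pipeline, which is slow but keeps the example short):

from transformers import pipeline

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Compare how each strategy scores the grouped entities.
for strategy in ["simple", "first", "max", "average"]:
    token_classifier = pipeline("token-classification", aggregation_strategy=strategy)
    print(strategy, token_classifier(example))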

์ด์ œ pipeline() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  ์ด๋Ÿฌํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป๋Š” ๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค!

์ž…๋ ฅ(inputs)์—์„œ ์˜ˆ์ธก(predictions)๊นŒ์ง€

๋จผ์ € ์ž…๋ ฅ์„ ํ† ํฐํ™”ํ•˜๊ณ  ๋ชจ๋ธ์„ ํ†ตํ•ด ์ „๋‹ฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” 2์žฅ์—์„œ ์„ค๋ช…ํ•œ ๋‚ด์šฉ๊ณผ ๋™์ผํ•˜๊ฒŒ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค. AutoXxx ํด๋ž˜์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ† ํฌ๋‚˜์ด์ €์™€ ๋ชจ๋ธ์„ ์ธ์Šคํ„ด์Šคํ™”ํ•œ ํ›„์— ์ด๋ฅผ ์˜ˆ์ œ์—์„œ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)

์—ฌ๊ธฐ์—์„œ AutoModelForTokenClassification์„ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ž…๋ ฅ ์‹œํ€€์Šค์˜ ๊ฐ ํ† ํฐ์— ๋Œ€ํ•ด ํ•˜๋‚˜์˜ logits ์„ธํŠธ๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค:

print(inputs["input_ids"].shape)  # torch.Size([1, 19])
print(outputs.logits.shape)       # torch.Size([1, 19, 9])

19๊ฐœ์˜ ํ† ํฐ์œผ๋กœ ๊ตฌ์„ฑ๋œ 1๊ฐœ์˜ ์‹œํ€€์Šค๊ฐ€ ์žˆ๋Š” ๋ฐฐ์น˜(batch)๊ฐ€ ์žˆ๊ณ  ๋ชจ๋ธ์—๋Š” 9๊ฐœ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ๋ ˆ์ด๋ธ”์ด ์กด์žฌํ•˜๋ฏ€๋กœ ๋ชจ๋ธ์˜ ์ถœ๋ ฅ์€ 1 x 19 x 9์˜ ๋ชจ์–‘์„ ๊ฐ–์Šต๋‹ˆ๋‹ค. text-classification ํŒŒ์ดํ”„๋ผ์ธ๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ softmax ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•ด๋‹น logits์„ ํ™•๋ฅ ๋กœ ๋ณ€ํ™˜ํ•˜๊ณ  argmax๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค(softmax๋Š” ์ˆœ์„œ๋ฅผ ๋ณ€๊ฒฝํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— logits์— ๋Œ€ํ•ด์„œ argmax๋ฅผ ์ทจํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค):

import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
print(probabilities)  # 19 lists of 9 probabilities each
print(predictions)    # [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]

model.config.id2label ์†์„ฑ์—๋Š” ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜๋Š”๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋ ˆ์ด๋ธ”์— ๋Œ€ํ•œ ์ธ๋ฑ์Šค ๋งคํ•‘์ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

model.config.id2label
# {0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER',
#  5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}

์œ„์—์„œ ๋ณด๋“ฏ์ด ์ด 9๊ฐœ์˜ ๋ ˆ์ด๋ธ”์ด ์žˆ์Šต๋‹ˆ๋‹ค. O๋Š” ๊ฐœ์ฒด๋ช…์— ํฌํ•จ๋˜์ง€ ์•Š๋Š” ํ† ํฐ์— ๋Œ€ํ•œ ๋ ˆ์ด๋ธ”("outside"๋ฅผ ๋‚˜ํƒ€๋ƒ„)์ด๊ณ  ๊ฐ ๊ฐœ์ฒด๋ช… ์œ ํ˜•, ์ฆ‰ ๊ธฐํƒ€(miscellaneous), ์ธ๋ช…(person), ๊ธฐ๊ด€๋ช…(organization), ์ง€๋ช…(location) ๊ฐ๊ฐ์— ๋Œ€ํ•ด ๋‘ ๊ฐœ์˜ ๋ ˆ์ด๋ธ”์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋ ˆ์ด๋ธ” B-XXX๋Š” ํ† ํฐ์ด ๊ฐœ์ฒด๋ช… XXX์˜ ์‹œ์ž‘ ๋ถ€๋ถ„์— ์žˆ์Œ์„ ๋‚˜ํƒ€๋‚ด๊ณ , ๋ ˆ์ด๋ธ” I-XXX๋Š” ํ† ํฐ์ด ๊ฐœ์ฒด๋ช… XXX์˜ ๋‚ด๋ถ€์— ์žˆ์Œ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์ด์ „ ์˜ˆ์‹œ์—์„œ ์šฐ๋ฆฌ๋Š” ๋ชจ๋ธ์ด ํ† ํฐ "S"๋ฅผ B-PER(์ธ๋ช… ๊ฐœ์ฒด๋ช…์˜ ์‹œ์ž‘)์œผ๋กœ ๋ถ„๋ฅ˜ํ•˜๊ณ  "##yl", "##va" ๋ฐ "##in" ํ† ํฐ์„ I-PER(์ธ๋ช… ๊ฐœ์ฒด๋ช…์˜ ๋‚ด๋ถ€)๋กœ ๋ถ„๋ฅ˜ํ•˜๊ธฐ๋ฅผ ๊ธฐ๋Œ€ํ–ˆ์„ ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ ์œ„ ๊ฒฐ๊ณผ์—์„œ ๋ณด๋Š” ๋ฐ”์™€ ๊ฐ™์ด 4๊ฐœ ํ† ํฐ ๋ชจ๋‘์— I-PER์ด๋ผ๋Š” ๋ ˆ์ด๋ธ”์„ ๋ถ€์—ฌํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— ์˜ˆ์ธก์— ์˜ค๋ฅ˜๊ฐ€ ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ๊ทธ๋ ‡์ง€ ์•Š์„ ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ ์ด๋Ÿฌํ•œ B- ๋ฐ I- ๋ ˆ์ด๋ธ” ํ‘œ๊ธฐ ๋ฐฉ์‹์—๋Š” IOB1 ๋ฐ IOB2์˜ ๋‘ ๊ฐ€์ง€ ํ˜•์‹์ด ์žˆ์Šต๋‹ˆ๋‹ค. IOB2 ํ˜•์‹(์•„๋ž˜ ๊ทธ๋ฆผ์—์„œ ๋ถ„ํ™์ƒ‰ ํƒœ๊ทธ)์€ ์šฐ๋ฆฌ๊ฐ€ ๋„์ž…ํ•œ ํ˜•์‹์ธ ๋ฐ˜๋ฉด, IOB1 ํ˜•์‹(ํŒŒ๋ž€์ƒ‰ ํƒœ๊ทธ)์—์„œ B-๋กœ ์‹œ์ž‘ํ•˜๋Š” ๋ ˆ์ด๋ธ”์€ ๋™์ผํ•œ ์œ ํ˜•์˜ ์ธ์ ‘ํ•œ ๋‘ ์—”ํ„ฐํ‹ฐ๋ฅผ ๊ตฌ๋ถ„ํ•˜๋Š” ๋ฐ๋งŒ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ์€ IOB1 ํ˜•์‹์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋ฏธ์„ธ ์กฐ์ •๋˜์—ˆ์œผ๋ฏ€๋กœ "S" ํ† ํฐ์— ๋ ˆ์ด๋ธ” I-PER์„ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค.

์ด ๋งต์„ ์‚ฌ์šฉํ•˜์—ฌ token-classification ํŒŒ์ดํ”„๋ผ์ธ์˜ ๊ฒฐ๊ณผ๋ฅผ (๊ฑฐ์˜ ์™„์ „ํžˆ) ์žฌํ˜„ํ•  ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. O๋กœ ๋ถ„๋ฅ˜๋˜์ง€ ์•Š์€ ๊ฐ ํ† ํฐ์˜ ์ ์ˆ˜์™€ ๋ ˆ์ด๋ธ”๋งŒ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {"entity": label, "score": probabilities[idx][pred], "word": tokens[idx]}
        )
        
print(results)

์ด ๊ฒฐ๊ณผ๋Š” ์ด์ „ ๊ฒฐ๊ณผ์™€ ๋งค์šฐ ์œ ์‚ฌํ•˜์ง€๋งŒ ํ•œ๊ฐ€์ง€ ์ฐจ์ด์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ํŒŒ์ดํ”„๋ผ์ธ์€ ๋˜ํ•œ ์›๋ณธ ๋ฌธ์žฅ์—์„œ ๊ฐ ์—”ํ„ฐํ‹ฐ์˜ ์‹œ์ž‘๊ณผ ๋์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ œ๊ณตํ–ˆ์Šต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์—์„œ ์˜คํ”„์…‹ ๋งคํ•‘(offset mapping)์ด ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ์˜คํ”„์…‹(offset)์„ ์–ป์œผ๋ ค๋ฉด ์ž…๋ ฅ์— ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์ ์šฉํ•  ๋•Œ return_offsets_mapping=True๋ฅผ ์„ค์ •ํ•˜๊ธฐ๋งŒ ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค:

inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
inputs_with_offsets["offset_mapping"]
# [(0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18), (19, 22),
#  (23, 24), (25, 29), (30, 32), (33, 35), (35, 40), (41, 45), (46, 48), (49, 57),
#  (57, 58), (0, 0)]

๊ฐ ํŠœํ”Œ์€ ๊ฐ ํ† ํฐ์— ํ•ด๋‹นํ•˜๋Š” ํ…์ŠคํŠธ ๋ฒ”์œ„์ด๋ฉฐ, ์—ฌ๊ธฐ์„œ (0, 0)์€ ํŠน์ˆ˜ ํ† ํฐ์šฉ์œผ๋กœ ์˜ˆ์•ฝ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ด์ „์— ์ธ๋ฑ์Šค 5์˜ ํ† ํฐ์ด "##yl"์ด๊ณ  ํ•ด๋‹น ์˜คํ”„์…‹์ด (12, 14)๋กœ ์ง€์ •๋˜์–ด ์žˆ๋Š” ๊ฒƒ์„ ๋ณด์•˜์Šต๋‹ˆ๋‹ค. ์ด ์˜คํ”„์…‹์œผ๋กœ ์Šฌ๋ผ์ด์‹ฑ์„ ํ•˜๋ฉด:

example[12:14]  # 'yl'

'##' ์—†์ด ์ ์ ˆํ•œ ํ…์ŠคํŠธ ๋ฒ”์œ„(text span)๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค.

์ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด์ œ ์ด์ „ ๊ฒฐ๊ณผ์˜ ์žฌํ˜„(reproduction)์„ ์™„๋ฃŒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != 'O':
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )
        
print(results)

์ด ๊ฒฐ๊ณผ๋Š” ์šฐ๋ฆฌ๊ฐ€ ํŒŒ์ดํ”„๋ผ์ธ ์‹คํ–‰์„ ํ†ตํ•ด ์–ป์€ ๊ฒฐ๊ณผ์™€ ๋™์ผํ•ฉ๋‹ˆ๋‹ค!

์—”ํ„ฐํ‹ฐ ๊ทธ๋ฃนํ™”

์˜คํ”„์…‹์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์—”ํ„ฐํ‹ฐ์˜ ์‹œ์ž‘ ๋ฐ ๋ ํ‚ค๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๊ฒƒ์€ ํŽธ๋ฆฌํ•˜์ง€๋งŒ ํ•ด๋‹น ์ •๋ณด๊ฐ€ ๊ผญ ํ•„์š”ํ•œ ๊ฒƒ์€ ์•„๋‹™๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์—”ํ„ฐํ‹ฐ ํ† ํฐ์„ ๊ทธ๋ฃนํ™”ํ•˜๋ ค๋Š” ๊ฒฝ์šฐ ์˜คํ”„์…‹์„ ์‚ฌ์šฉํ•˜๋ฉด ์ง€์ €๋ถ„ํ•œ ์ฝ”๋“œ๋ฅผ ๋งŽ์ด ์ ˆ์•ฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด Hu, ##gging ๋ฐ Face ํ† ํฐ์„ ํ•˜๋‚˜๋กœ ๊ทธ๋ฃนํ™”ํ•˜๋ ค๋Š” ๊ฒฝ์šฐ, ์ฒ˜์Œ ๋‘ ํ† ํฐ์€ ## ์—†์ด ํ•ฉ์ณ์•ผ ํ•˜๊ณ  Face๋Š” ##๋กœ ์‹œ์ž‘ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์•ž์— ๊ณต๋ฐฑ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ๊ฒฐํ•ฉํ•ด์•ผ ํ•œ๋‹ค๋Š” ํŠน์ˆ˜ ๊ทœ์น™์„ ๋งŒ๋“ค์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ๊ทœ์น™๋“ค์€ ํŠน์ • ์œ ํ˜•์˜ ํ† ํฌ๋‚˜์ด์ €์—์„œ๋งŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. SentencePiece ๋˜๋Š” Byte-Pair-Encoding ํ† ํฌ๋‚˜์ด์ €(์ด ์žฅ์˜ ๋’ท๋ถ€๋ถ„์—์„œ ์„ค๋ช…)์—์„œ๋Š” ๋˜๋‹ค๋ฅธ ๊ทœ์น™ ์ง‘ํ•ฉ์„ ์ž‘์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์˜คํ”„์…‹์„ ์‚ฌ์šฉํ•˜๋ฉด ๋ชจ๋“  ์‚ฌ์šฉ์ž ์ •์˜ ์ฝ”๋“œ๊ฐ€ ์‚ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ๋‹จ์ง€ ์ฒซ ๋ฒˆ์งธ ํ† ํฐ์œผ๋กœ ์‹œ์ž‘ํ•˜๊ณ  ๋งˆ์ง€๋ง‰ ํ† ํฐ์œผ๋กœ ๋๋‚˜๋Š” ์›๋ณธ ํ…์ŠคํŠธ์˜ ๋ฒ”์œ„๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ Hu, ##gging ๋ฐ Face ํ† ํฐ์˜ ๊ฒฝ์šฐ, ์ด๋ฅผ ํ•ฉ์น˜๊ธฐ ์œ„ํ•ด์„œ 33๋ฒˆ์งธ ๋ฌธ์ž(Hu์˜ ์‹œ์ž‘ ๋ถ€๋ถ„)์—์„œ ์‹œ์ž‘ํ•˜์—ฌ 45๋ฒˆ์งธ ๋ฌธ์ž(Face์˜ ๋ ๋ถ€๋ถ„) ์•ž๊นŒ์ง€ ์Šฌ๋ผ์ด์‹ฑ์„ ํ•ด์ฃผ๋ฉด ๋ฉ๋‹ˆ๋‹ค:

example[33:45]  # 'Hugging Face'

ํŠน์ • ์—”ํ„ฐํ‹ฐ์— ํฌํ•จ๋œ ํ† ํฐ๋“ค์„ ๊ทธ๋ฃนํ™”ํ•˜๋Š” ๋™์•ˆ ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ํ›„์ฒ˜๋ฆฌํ•˜๋Š” ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๊ธฐ ์œ„ํ•ด B-XXX ๋˜๋Š” I-XXX๋กœ ๋ ˆ์ด๋ธ”์ด ์ง€์ •๋  ์ˆ˜ ์žˆ๋Š” ์ฒซ ๋ฒˆ์งธ ์—”ํ„ฐํ‹ฐ๋ฅผ ์ œ์™ธํ•˜๊ณ  ์—ฐ์†์ ์ด๊ณ  I-XXX๋กœ ๋ ˆ์ด๋ธ”์ด ์ง€์ •๋œ ์—”ํ„ฐํ‹ฐ๋ฅผ ํ•จ๊ป˜ ๊ทธ๋ฃนํ™”ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ O, ์ƒˆ๋กœ์šด ์œ ํ˜•์˜ ์—”ํ„ฐํ‹ฐ ๋˜๋Š” ๋™์ผํ•œ ์œ ํ˜•์˜ ์—”ํ„ฐํ‹ฐ๊ฐ€ ์‹œ์ž‘๋˜๊ณ  ์žˆ์Œ์„ ์•Œ๋ฆฌ๋Š” B-XXX๋ฅผ ๋ฐ›์œผ๋ฉด ์—”ํ„ฐํ‹ฐ ๊ทธ๋ฃนํ™”๋ฅผ ์ค‘์ง€ํ•ฉ๋‹ˆ๋‹ค:

import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label != "O":
        # Remove the B- or I- prefix
        label = label[2:]
        start, end = offsets[idx]
        all_scores = [probabilities[idx][pred]]
        idx += 1

        # Grab all the following tokens labeled with I-label
        while (
            idx < len(predictions)
            and model.config.id2label[predictions[idx]] == f"I-{label}"
        ):
            all_scores.append(probabilities[idx][predictions[idx]])
            _, end = offsets[idx]
            idx += 1

        # The score is the mean of all the scores of the tokens in that grouped entity
        score = np.mean(all_scores).item()
        word = example[start:end]
        results.append(
            {
                "entity_group": label,
                "score": score,
                "word": word,
                "start": start,
                "end": end,
            }
        )
    else:
        idx += 1

print(results)

์ด๋ ‡๊ฒŒ ํŒŒ์ดํ”„๋ผ์ธ ์‹คํ–‰ ๊ฒฐ๊ณผ์™€ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค!

์ด ์˜คํ”„์…‹์ด ๋งค์šฐ ์œ ์šฉํ•˜๊ฒŒ ํ™œ์šฉ๋˜๋Š” ๋˜ ๋‹ค๋ฅธ ์˜ˆ๋Š” ์งˆ์˜ ์‘๋‹ต(question answering)์ž…๋‹ˆ๋‹ค. ๋‹ค์Œ ์„น์…˜์—์„œ ํŒŒ์ดํ”„๋ผ์ธ์„ ์ž์„ธํžˆ ๊ณต๋ถ€ํ•˜๋ฉด์„œ ๐Ÿค—Transformers ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์— ์žˆ๋Š” ํ† ํฌ๋‚˜์ด์ €์˜ ๋งˆ์ง€๋ง‰ ๊ธฐ๋Šฅ์ธ ์ž…๋ ฅ์„ ์ฃผ์–ด์ง„ ๊ธธ์ด๋กœ ์ž๋ฅผ ๋•Œ ๋„˜์น˜๋Š” ํ† ํฐ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

์ข‹์€ ์›นํŽ˜์ด์ง€ ์ฆ๊ฒจ์ฐพ๊ธฐ