[🤗 Course 2.6] Putting Together What We Learned in Chapter 2

In the last few sections, we did our best to carry out most of the work by hand: we looked at how tokenizers operate, covering tokenization, the conversion to input IDs, padding, truncation, and attention masks.

However, as we saw in Section 2, the 🤗 Transformers API can handle all of this for us with the high-level functions we'll dive into here. When you call the tokenizer directly on a sentence, you get back inputs that are ready to be passed to the model:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

Here, the model_inputs variable contains everything a model needs in order to run properly. For DistilBERT, that means the input IDs and the attention mask. Whatever the model, the tokenizer object knows which inputs it expects and supplies them automatically.
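
If you want to peek at what the tokenizer actually produced, you can inspect the returned object directly. A minimal sketch: the return value is a dict-like BatchEncoding, and for this checkpoint it should hold the input IDs and the attention mask:

print(model_inputs.keys())
# Expected for this DistilBERT checkpoint: dict_keys(['input_ids', 'attention_mask'])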

์•„๋ž˜์˜ ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด tokenizer ๋ฉ”์„œ๋“œ๋Š” ๋งค์šฐ ๊ฐ•๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์šฐ์„  ๋‹จ์ผ ์‹œํ€€์Šค๋ฅผ ํ† ํฐํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

It also handles several sequences at a time, with no change to the API:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)
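
With more than one sequence, the input_ids field is a list of lists, one per sequence. Since the two sentences tokenize to different numbers of tokens, the inner lists have different lengths; a quick way to check (a small sketch reusing model_inputs from above):

for ids in model_inputs["input_ids"]:
    # Each sequence keeps its own length, since no padding was applied yet.
    print(len(ids))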

It can pad according to several strategies:

# Pads the sequences up to the longest sequence in the batch.
model_inputs = tokenizer(sequences, padding="longest")

# Pads the sequences up to the model's maximum length
# (512 for BERT or DistilBERT).
model_inputs = tokenizer(sequences, padding="max_length")

# Pads the sequences up to the specified maximum length.
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

It can also truncate sequences:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# ๋ชจ๋ธ ์ตœ๋Œ€ ๊ธธ์ด(model max length)๋ณด๋‹ค ๊ธด ์‹œํ€€์Šค๋ฅผ ์ž๋ฆ…๋‹ˆ๋‹ค.
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# ์ง€์ •๋œ ์ตœ๋Œ€ ๊ธธ์ด๋ณด๋‹ค ๊ธด ์‹œํ€€์Šค๋ฅผ ์ž๋ฆ…๋‹ˆ๋‹ค.
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
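
Note that truncation only affects sequences that exceed the limit: with max_length=8, the long first sentence should be cut down to exactly 8 tokens, while the short "So have I!" is left untouched. A quick check:

print([len(ids) for ids in model_inputs["input_ids"]])
# Expect the first length to be 8; the second sequence was already shorter than that.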

Special tokens

ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ๋ฐ˜ํ™˜ํ•œ ์ž…๋ ฅ ์‹๋ณ„์ž(input IDs)๋ฅผ ์‚ดํŽด๋ณด๋ฉด ์ด์ „๊ณผ ์•ฝ๊ฐ„ ๋‹ค๋ฅด๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

One new token ID was added at the beginning and another at the end. Let's decode the two sequences of IDs above to see what these stand for:

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))
[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.

ํ† ํฌ๋‚˜์ด์ €๋Š” ์‹œ์ž‘ ๋ถ€๋ถ„์— ํŠน์ˆ˜ ๋‹จ์–ด [CLS]๋ฅผ ์ถ”๊ฐ€ํ•˜๊ณ  ๋์— ํŠน์ˆ˜ ๋‹จ์–ด [SEP]๋ฅผ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋ชจ๋ธ์ด ํ•ด๋‹น ํŠน์ˆ˜ ํ† ํฐ๋“ค๋กœ ์‚ฌ์ „ ํ•™์Šต(pre-training)๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ถ”๋ก ์— ๋Œ€ํ•ด ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์œผ๋ ค๋ฉด ์ด๋ฅผ ์ถ”๊ฐ€ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ผ๋ถ€ ๋ชจ๋ธ์€ ์ด๋“ค ํŠน์ˆ˜ ํ† ํฐ๋“ค์ด๋‚˜ ๋‹ค๋ฅธ ํŠน๋ณ„ํ•œ ๋‹จ์–ด๋“ค์„ ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ ์–ด๋–ค ๋ชจ๋ธ์€ ์‹œ์ž‘ ๋ถ€๋ถ„์—๋งŒ ์ด๋Ÿฌํ•œ ํŠน์ˆ˜ ๋‹จ์–ด๋ฅผ ์ถ”๊ฐ€ํ•˜๊ฑฐ๋‚˜ ํ˜น์€ ๋ ๋ถ€๋ถ„์—๋งŒ ์ถ”๊ฐ€ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ์–ด์จŒ๋“  ํ† ํฌ๋‚˜์ด์ €๋Š” ์ž…๋ ฅ์ด ์˜ˆ์ƒ๋˜๋Š” ํ† ํฐ๋“ค์„ ์ด๋ฏธ ์•Œ๊ณ  ์žˆ์œผ๋ฉฐ ์ด๋ฅผ ์ฒ˜๋ฆฌํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Wrapping up: from tokenizer to model

Now that we've seen all the individual steps the tokenizer object goes through when it processes text, let's look one final time at how its main API handles multiple sequences (padding!), very long sequences (truncation!), and several types of tensors:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
print(output)
SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
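
These logits are raw scores, not probabilities. Although it's beyond the recap above, a common final step (continuing the snippet, so model and output are already in scope) is to run them through a softmax and read the label names off the model config:

# One row of probabilities per input sequence.
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
print(predictions)
# Maps each column index to a class name (e.g. NEGATIVE/POSITIVE for this checkpoint).
print(model.config.id2label)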

์ข‹์€ ์›นํŽ˜์ด์ง€ ์ฆ๊ฒจ์ฐพ๊ธฐ