[🤗 Course 2.6] Putting Together What We Learned in Chapter 2

In the last few sections, we did our best to carry out most of the work by hand: we looked at how tokenizers operate, covering tokenization, the conversion to input IDs, padding, truncation, and attention masks.

However, as we saw in Section 2, the 🤗 Transformers API can handle all of this for us with the high-level functions we'll dive into here. When you call the tokenizer directly on a sentence, you get back inputs that are ready to be passed to the model:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

Here, the model_inputs variable contains everything a model needs in order to run properly. For DistilBERT, that means the input IDs and the attention mask. Whatever the model, the tokenizer object knows which inputs it expects and supplies them automatically.
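
If you want to peek at what the tokenizer actually produced, you can inspect the returned object directly. A minimal sketch: the return value is a dict-like BatchEncoding, and for this checkpoint it should hold the input IDs and the attention mask:

print(model_inputs.keys())
# Expected for this DistilBERT checkpoint: dict_keys(['input_ids', 'attention_mask'])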

์•„๋ž˜์˜ ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด tokenizer ๋ฉ”์„œ๋“œ๋Š” ๋งค์šฐ ๊ฐ•๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์šฐ์„  ๋‹จ์ผ ์‹œํ€€์Šค๋ฅผ ํ† ํฐํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

It also handles several sequences at a time, with no change to the API:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)
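
With more than one sequence, the input_ids field is a list of lists, one per sequence. Since the two sentences tokenize to different numbers of tokens, the inner lists have different lengths; a quick way to check (a small sketch reusing model_inputs from above):

for ids in model_inputs["input_ids"]:
    # Each sequence keeps its own length, since no padding was applied yet.
    print(len(ids))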

It can pad according to several strategies:

# Pads the sequences up to the longest sequence in the batch.
model_inputs = tokenizer(sequences, padding="longest")

# Pads the sequences up to the model's maximum length
# (512 for BERT or DistilBERT).
model_inputs = tokenizer(sequences, padding="max_length")

# Pads the sequences up to the specified maximum length.
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

It can also truncate sequences:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# ๋ชจ๋ธ ์ตœ๋Œ€ ๊ธธ์ด(model max length)๋ณด๋‹ค ๊ธด ์‹œํ€€์Šค๋ฅผ ์ž๋ฆ…๋‹ˆ๋‹ค.
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# ์ง€์ •๋œ ์ตœ๋Œ€ ๊ธธ์ด๋ณด๋‹ค ๊ธด ์‹œํ€€์Šค๋ฅผ ์ž๋ฆ…๋‹ˆ๋‹ค.
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
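
Note that truncation only affects sequences that exceed the limit: with max_length=8, the long first sentence should be cut down to exactly 8 tokens, while the short "So have I!" is left untouched. A quick check:

print([len(ids) for ids in model_inputs["input_ids"]])
# Expect the first length to be 8; the second sequence was already shorter than that.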

Special tokens

ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ๋ฐ˜ํ™˜ํ•œ ์ž…๋ ฅ ์‹๋ณ„์ž(input IDs)๋ฅผ ์‚ดํŽด๋ณด๋ฉด ์ด์ „๊ณผ ์•ฝ๊ฐ„ ๋‹ค๋ฅด๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

One new token ID was added at the beginning and another at the end. Let's decode the two sequences of IDs above to see what these stand for:

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))
[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.

ํ† ํฌ๋‚˜์ด์ €๋Š” ์‹œ์ž‘ ๋ถ€๋ถ„์— ํŠน์ˆ˜ ๋‹จ์–ด [CLS]๋ฅผ ์ถ”๊ฐ€ํ•˜๊ณ  ๋์— ํŠน์ˆ˜ ๋‹จ์–ด [SEP]๋ฅผ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋ชจ๋ธ์ด ํ•ด๋‹น ํŠน์ˆ˜ ํ† ํฐ๋“ค๋กœ ์‚ฌ์ „ ํ•™์Šต(pre-training)๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ถ”๋ก ์— ๋Œ€ํ•ด ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์œผ๋ ค๋ฉด ์ด๋ฅผ ์ถ”๊ฐ€ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ผ๋ถ€ ๋ชจ๋ธ์€ ์ด๋“ค ํŠน์ˆ˜ ํ† ํฐ๋“ค์ด๋‚˜ ๋‹ค๋ฅธ ํŠน๋ณ„ํ•œ ๋‹จ์–ด๋“ค์„ ์ถ”๊ฐ€ํ•˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ ์–ด๋–ค ๋ชจ๋ธ์€ ์‹œ์ž‘ ๋ถ€๋ถ„์—๋งŒ ์ด๋Ÿฌํ•œ ํŠน์ˆ˜ ๋‹จ์–ด๋ฅผ ์ถ”๊ฐ€ํ•˜๊ฑฐ๋‚˜ ํ˜น์€ ๋ ๋ถ€๋ถ„์—๋งŒ ์ถ”๊ฐ€ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ์–ด์จŒ๋“  ํ† ํฌ๋‚˜์ด์ €๋Š” ์ž…๋ ฅ์ด ์˜ˆ์ƒ๋˜๋Š” ํ† ํฐ๋“ค์„ ์ด๋ฏธ ์•Œ๊ณ  ์žˆ์œผ๋ฉฐ ์ด๋ฅผ ์ฒ˜๋ฆฌํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Wrapping up: from tokenizer to model

Now that we've seen all the individual steps the tokenizer object goes through when it processes text, let's look one final time at how its main API handles multiple sequences (padding!), very long sequences (truncation!), and several types of tensors:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
print(output)
SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
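
These logits are raw scores, not probabilities. Although it's beyond the recap above, a common final step (continuing the snippet, so model and output are already in scope) is to run them through a softmax and read the label names off the model config:

# One row of probabilities per input sequence.
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
print(predictions)
# Maps each column index to a class name (e.g. NEGATIVE/POSITIVE for this checkpoint).
print(model.config.id2label)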

์ข‹์€ ์›นํŽ˜์ด์ง€ ์ฆ๊ฒจ์ฐพ๊ธฐ