[🤗 강좌 6.4] QA 파이프라인에서의 "빠른(fast)" 토크나이저

이제 question-answering 파이프라인을 살펴보고, 이전 섹션에서 그룹화된 엔터티를 구하기 위해서 수행했던 것처럼 오프셋을 활용하여 컨텍스트에서 입력 질문에 대한 답변을 직접 구하는 방법을 살펴보겠습니다. 그런 다음 절단(truncation)될 수 밖에 없는 매우 긴 컨텍스트를 처리하는 방법을 볼 것입니다. 질의 응답(question answering) 작업에 관심이 없다면 이 섹션을 건너뛸 수 있습니다.

`question-answering` 파이프라인 사용하기

1장에서 보았듯이 우리는 질문에 대한 답을 얻기 위해 다음과 같은 question-answering 파이프라인을 사용할 수 있습니다:

from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch, and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)

모델이 허용하는 최대 길이보다 긴 텍스트를 자르거나 분할할 수 없는(따라서 문서 끝에 있는 정보를 놓칠 수 있는) 다른 파이프라인과 달리, 이 파이프라인은 매우 긴 컨텍스트를 처리할 수 있으며 질문에 대한 답이 컨텍스트의 마지막에 있더라도 그 답변을 추출할 수 있습니다:

long_context = """
🤗 Transformers: State of the Art NLP

🤗 Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

🤗 Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, context=long_context)

이 모든 작업을 어떻게 수행하는지 봅시다!

질의 응답을 위한 사전학습 모델 사용하기

다른 파이프라인과 마찬가지로 우선 입력을 토큰화한 다음 모델로 전달합니다. question-answering 파이프라인에 디폴트로 사용되는 체크포인트는 distillbert-base-cased-distilled-squad입니다. 체크포인트 이름 내의 "squad"는 모델이 미세 조정된 데이터셋의 명칭입니다. 7장에서 SQuAD 데이터셋에 대해 더 이야기할 것입니다:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

위 코드에서 질문과 컨텍스트를 순서대로 배치시켜 쌍(pair)으로 토큰화합니다. 아래 그림을 보면 이해가 빠를겁니다.

질의 응답 모델은 지금까지 본 모델과 조금 다르게 작동합니다. 위의 그림을 예시로 보면, 모델은 정답 시작 토큰의 인덱스(여기서는 21)와 정답 마지막 토큰의 인덱스(여기서는 24)를 예측하도록 학습되었습니다. 해당 모델이 하나의 로짓(logits) 텐서를 반환하지 않고 두 개의 텐서를 반환하는 이유입니다. 하나는 정답의 시작 토큰에 해당하는 로짓(logit)이고 다른 하나는 정답의 마지막 토큰에 해당하는 로짓(logit)입니다. 이 경우 66개의 토큰이 포함된 입력이 하나가 존재하므로 다음을 얻습니다:

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)

이러한 로짓들을 확률로 변환하기 위해 softmax 함수를 적용해야 하나, 그 전에 컨텍스트(context)가 아닌 토큰 인덱스를 마스킹(masking)해야 합니다. 입력이 [CLS] question [SEP] context [SEP]이므로 질문에 포함된 토큰과 [SEP] 토큰을 마스킹해야 합니다. 그러나 일부 모델에서는 컨텍스트에 답이 없음을 나타내기 위해 사용할수도 있으므로 [CLS] 토큰은 마스킹하지 않고 그대로 유지합니다.

나중에 softmax를 적용할 것이기 때문에 마스킹(masking)하려는 로짓을 큰 음수로 바꾸면 됩니다. 여기서는 -10000을 사용합니다:

import torch

sequence_ids = inputs.sequence_ids()
# 컨텍스트 토큰들을 제외하고는 모두 마스킹한다.
mask = [i != 1 for i in sequence_ids]
# [CLS] 토큰은 마스킹하지 않는다.
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

이제 예측하고 싶지 않은 위치에 해당하는 로짓을 적절하게 마스킹했으므로 softmax를 적용할 수 있습니다:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

이 단계에서 시작 및 종료 확률의 argmax를 취할 수 있지만 시작 인덱스가 종료 인덱스보다 클 수 있으므로 몇 가지 예방 조치를 더 취해야 합니다. start_index <= end_index를 만족하는 가능한 start_index 및 end_index의 확률을 계산한 다음 가장 높은 확률을 가진 튜플 (start_index, end_index)을 선택합니다.

"The answer starts at start_index" 및 "The answer ends at end_index" 이벤트가 독립적이라고 가정할 때, 답변이 start_index에서 시작하여 end_index에서 끝날 확률은 다음과 같습니다:

$start\_probabilities[start\_index] \times end\_probabilities[end\_index]$

따라서 모든 점수를 계산하려면 start_index <= end_index을 만족하는 모든 $start\_probabilities[start\_index] \times end\_probabilities[end\_index]$

scores = start_probabilities[:, None] * end_probabilities[None, :]

그런 다음 start_index > end_index를 만족하는 값들을 0으로 설정하여 값을 마스킹합니다(다른 확률은 모두 양수임). torch.triu() 함수는 인수로 전달된 2D 텐서의 위쪽 삼각형 부분을 반환하므로 해당 마스킹을 수행할 수 있습니다:

scores = torch.triu(scores)

이제 최대값의 인덱스만 구하면 됩니다. PyTorch는 평탄화된 텐서(flattened tensor)의 인덱스를 반환하므로 나머지 없는 나누기, // 와 나머지 연산, %을 사용하여 start_index 및 end_index를 가져와야 합니다:

max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])

아직 완료되지 않았지만 적어도 추출된 응답에 대한 정확한 점수는 계산했습니다(이전 섹션의 첫 번째 결과와 비교하여 이를 확인할 수 있습니다).

✏️ Try it out! 가장 가능성이 높은 5개의 응답에 대한 시작 및 종료 인덱스를 출력해 봅시다.

응답들의 토큰 단위 start_index 및 end_index를 구했기 때문에, 이를 기반으로 이제 컨텍스트 내에서의 문자 단위 인덱스로 변환해야 합니다. 여기에서 오프셋(offset)이 매우 유용할 것입니다. 토큰 분류(token classification) 작업에서처럼 이들 인덱스들을 사용할 수 있습니다:

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

이제 결과를 표시하기 위해서 출력 형식을 지정할 수 있습니다:

result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index]
}
print(result)

좋습니다! 위 결과는 앞에서 실행했던 파이프라인의 결과와 동일합니다.

✏️ Try it out! 이전에 계산한 최고 점수를 사용하여 가능성이 가장 높은 5개의 응답을 표시해 봅시다. 결과를 확인하기 위해서 이전의 파이프라인으로 돌아가서 호출할 때 top_k=5를 전달하면 됩니다.

길이가 긴 컨텍스트 다루기

위에서 예제로 사용한 질문 및 길이가 긴 컨텍스트를 토큰화 해보면 question-answering 파이프라인에서 사용된 최대 길이(384)보다 더 많은 토큰들이 출력됩니다:

inputs = tokenizer(question, long_context)
print(len(inputs["input_ids"]))

따라서 최대 길이만큼 입력을 절단해야 합니다. 이를 수행할 수 있는 방법은 여러 가지가 있지만 우선 주의해야할 것은 질문을 절단해서는 안되고 컨텍스트만 절단해야 한다는 점입니다. 컨텍스트는 두 번째 문장이므로 "only_second" 절단 옵션을 사용할 수 있습니다. 이때 발생하는 문제는 질문에 대한 정답이 잘려져 나간 컨텍스트에 있을 수 있다는 것입니다. 예를 들어, 아래 예시에서 정답이 컨텍스트의 끝 부분에 있는 질문을 입력했다면, 해당 질문에 대한 답변은 존재하지 않습니다:

inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))

이것은 모델이 정답을 선택하는데 어려움을 겪을 것임을 의미합니다. 이 문제를 해결하기 위해 question-answering 파이프라인을 사용하면 컨텍스트를 더 작은 청크로 분할하여 최대 길이를 지정할 수 있습니다. 정답을 찾을 수 있도록 컨텍스트를 잘못된 위치에서 분할하지 않도록 하기 위해 청크 사이에 약간의 겹침(overlap)도 포함됩니다.

return_overflowing_tokens=True를 추가하여 토크나이저("빠른" 또는 "느린")가 이를 수행하도록 할 수 있으며 stride 인수로 원하는 겹침 정도를 지정할 수 있습니다. 다음은 길이가 비교적 짧은 문장을 이용한 예시입니다:

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

우리가 볼 수 있듯이, 입력 문장은 inputs["input_ids"]의 각 항목이 최대 6개의 토큰을 갖는 청크들로 분할되었습니다(마지막 항목이 다른 항목과 같은 크기가 되도록 padding을 추가해야 합니다. ) 각 항목 사이에 2개씩의 토큰이 겹칩니다.

토큰화 결과를 자세히 살펴보겠습니다:

print(inputs.keys())

예상대로 input_IDs와 attention_mask가 담겨져 있습니다. 마지막 키인 overflow_to_sample_mapping은 각 결과가 어느 문장에 해당하는지 알려주는 맵(map)입니다. 여기에는 우리가 토크나이저로 전달한 (유일한) 문장에서 나온 7개의 결과가 있습니다:

print(inputs["overflow_to_sample_mapping"])

이것은 여러 문장을 함께 토큰화할 때 더 유용합니다. 예를 들어,

sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])

결과는 첫번째 문장이 이전과 같이 7개의 청크로 분할되고 다음 4개의 청크가 두 번째 문장에서 온다는 것을 의미합니다.

이제 길이가 긴 컨텍스트로 돌아가 보겠습니다. 기본적으로 question-answering 파이프라인은 앞에서 언급한 것처럼 최대 길이 384와 모델이 미세 조정된 방식과 동일한 128의 보폭(stride)을 사용합니다. 파이프라인을 호출할 때 max_seq_len 및 stride 인수를 전달하여 해당 매개변수를 조정할 수 있습니다. 따라서 토큰화할 때 이러한 매개변수를 사용합니다. 또한 패딩(padding)을 추가하고(텐서를 구성할 수 있도록 샘플들의 길이를 동일하게 하기 위해서) 오프셋을 요청할 것입니다:

inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

위에서 inputs에는 모델로 입력되는 input_IDs와 attention_mask 뿐만 아니라, 방금 언급한 오프셋(offset) 및 overflow_to_sample_mapping이 포함됩니다. 마지막 두가지는 모델에서 사용하는 매개변수가 아니므로 텐서로 변환하기 전에 inputs에서 이를 제거(pop)합니다:

_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)

길이가 긴 컨텍스트는 두 개로 분할되었습니다. 따라서, 모델의 출력은 두가지 종류의 시작 및 마지막 로짓(logits)으로 구성됩니다:

outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)

이전과 마찬가지로 softmax를 취하기 전에 컨텍스트의 일부가 아닌 토큰을 먼저 마스킹합니다. 또한 모든 패딩 토큰을 마스킹합니다(attention_mask로 표시된 대로):

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [1 != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

그런 다음 softmax를 사용하여 로짓(logits)을 확률로 변환할 수 있습니다:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

다음 단계는 앞에서 길이가 짧은 컨텍스트에 대해 수행한 작업과 유사하지만 청크가 2개이므로 이를 반복합니다. 가능한 모든 답변(answer spans)에 점수를 부여한 다음 가장 좋은 점수를 받은 답변을 선택합니다:

candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()
    
    start_idx = idx // scores.shape[0]
    end_idx = idx % scores.shape[0]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)

위에서 출력된 2개의 후보는 모델이 각 청크(chunk, 길이가 길어서 분할된 컨텍스트)에서 찾을 수 있었던 최상의 답변에 해당합니다. 모델은 정답이 두번째라고 확실히 더 확신합니다(좋은 징조입니다!). 이제 두 토큰 범위(token spans)를 컨텍스트의 문자 범위(character spans)에 매핑하기만 하면 됩니다. 우리는 가장 확실한 답변을 얻기 위해 두번째 후보만 매핑하면 되지만, 첫번째 청크에서 모델이 선택한 답변을 보는 것도 재미있을 것 같습니다.

✏️ Try it out! 위의 코드를 수정하여 가장 가능성이 높은 5개의 답변에 대한 점수와 범위를 반환해 보세요(청크별로가 아니라 총합적으로).

우리가 이전에 가져온 오프셋은 실제로는 오프셋 리스트(list of offsets)이며 텍스트 청크당 하나의 리스트가 존재합니다:

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start":start_char, "end":end_char, "score":score}
    print(result)

첫번째 결과를 무시하면 이 길이가 긴 컨텍스트(long_context)에 대한 파이프라인 실행 결과와 동일한 결과를 얻습니다.

✏️ Try it out! 이전에 계산한 최고 점수를 사용하여 가장 가능성이 높은 5개의 답변을 표시해보세요(각 청크가 아닌 전체 컨텍스트에 대해). 결과를 확인하기 위해서 첫번째 파이프라인으로 돌아가서 호출할 때 top_k=5를 전달해봅시다.

이것으로 토크나이저의 기능에 대한 심층 분석을 마칩니다. 다음 장에서 일반적인 NLP 작업에서 모델을 미세 조정하는 방법을 보여줄 때 여기서 배운 모든 것들을 실제로 적용할 것입니다.

Author And Source

이 문제에 관하여([🤗 강좌 6.4] QA 파이프라인에서의 "빠른(fast)" 토크나이저), 우리는 이곳에서 더 많은 자료를 발견하고 링크를 클릭하여 보았다 https://velog.io/@spasis/강좌-6.4-QA-파이프라인에서의-빠른fast-토크나이저

우수한 개발자 콘텐츠 발견에 전념 (Collection and Share based on the CC Protocol.)

[🤗 강좌 6.4] QA 파이프라인에서의 "빠른(fast)" 토크나이저

question-answering 파이프라인 사용하기

질의 응답을 위한 사전학습 모델 사용하기

길이가 긴 컨텍스트 다루기

Author And Source

좋은 웹페이지 즐겨찾기

`question-answering` 파이프라인 사용하기