[🤗 Course 6.4] "Fast" tokenizers in the QA pipeline

We will now dive into the question-answering pipeline and see how to leverage the offsets to grab the answer to the question at hand from the context, a bit like we did for the grouped entities in the previous section. Then we will see how we can deal with very long contexts that end up being truncated. You can skip this section if you are not interested in the question answering task.

question-answering ํŒŒ์ดํ”„๋ผ์ธ ์‚ฌ์šฉํ•˜๊ธฐ

1์žฅ์—์„œ ๋ณด์•˜๋“ฏ์ด ์šฐ๋ฆฌ๋Š” ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต์„ ์–ป๊ธฐ ์œ„ํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ question-answering ํŒŒ์ดํ”„๋ผ์ธ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

from transformers import pipeline

question_answerer = pipeline("question-answering")
context = """
๐Ÿค— Transformers is backed by the three most popular deep learning libraries โ€” Jax, PyTorch, and TensorFlow โ€” with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back ๐Ÿค— Transformers?"
question_answerer(question=question, context=context)
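
This should output something like the following (the exact score may vary slightly with the model and library versions):

{'score': 0.97773, 'start': 78, 'end': 105, 'answer': 'Jax, PyTorch and TensorFlow'}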

๋ชจ๋ธ์ด ํ—ˆ์šฉํ•˜๋Š” ์ตœ๋Œ€ ๊ธธ์ด๋ณด๋‹ค ๊ธด ํ…์ŠคํŠธ๋ฅผ ์ž๋ฅด๊ฑฐ๋‚˜ ๋ถ„ํ• ํ•  ์ˆ˜ ์—†๋Š”(๋”ฐ๋ผ์„œ ๋ฌธ์„œ ๋์— ์žˆ๋Š” ์ •๋ณด๋ฅผ ๋†“์น  ์ˆ˜ ์žˆ๋Š”) ๋‹ค๋ฅธ ํŒŒ์ดํ”„๋ผ์ธ๊ณผ ๋‹ฌ๋ฆฌ, ์ด ํŒŒ์ดํ”„๋ผ์ธ์€ ๋งค์šฐ ๊ธด ์ปจํ…์ŠคํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต์ด ์ปจํ…์ŠคํŠธ์˜ ๋งˆ์ง€๋ง‰์— ์žˆ๋”๋ผ๋„ ๊ทธ ๋‹ต๋ณ€์„ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

long_context = """
๐Ÿค— Transformers: State of the Art NLP

๐Ÿค— Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction,
question answering, summarization, translation, text generation and more in over 100 languages.
Its aim is to make cutting-edge NLP easier to use for everyone.

๐Ÿค— Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets and
then share them with the community on our model hub. At the same time, each python module defining an architecture is fully standalone and
can be modified to enable quick research experiments.

Why should I use transformers?

1. Easy-to-use state-of-the-art models:
  - High performance on NLU and NLG tasks.
  - Low barrier to entry for educators and practitioners.
  - Few user-facing abstractions with just three classes to learn.
  - A unified API for using all our pretrained models.
  - Lower compute costs, smaller carbon footprint:

2. Researchers can share trained models instead of always retraining.
  - Practitioners can reduce compute time and production costs.
  - Dozens of architectures with over 10,000 pretrained models, some in more than 100 languages.

3. Choose the right framework for every part of a model's lifetime:
  - Train state-of-the-art models in 3 lines of code.
  - Move a single model between TF2.0/PyTorch frameworks at will.
  - Seamlessly pick the right framework for training, evaluation and production.

4. Easily customize a model or an example to your needs:
  - We provide examples for each architecture to reproduce the results published by its original authors.
  - Model internals are exposed as consistently as possible.
  - Model files can be used independently of the library for quick experiments.

๐Ÿค— Transformers is backed by the three most popular deep learning libraries โ€” Jax, PyTorch and TensorFlow โ€” with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question_answerer(question=question, context=long_context)
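
Even with the answer sitting at the very end of the long context, the pipeline still finds it; the output should look roughly like:

{'score': 0.97149, 'start': 1892, 'end': 1919, 'answer': 'Jax, PyTorch and TensorFlow'}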

Let's see how it does all of this!

์งˆ์˜ ์‘๋‹ต์„ ์œ„ํ•œ ์‚ฌ์ „ํ•™์Šต ๋ชจ๋ธ ์‚ฌ์šฉํ•˜๊ธฐ

๋‹ค๋ฅธ ํŒŒ์ดํ”„๋ผ์ธ๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์šฐ์„  ์ž…๋ ฅ์„ ํ† ํฐํ™”ํ•œ ๋‹ค์Œ ๋ชจ๋ธ๋กœ ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค. question-answering ํŒŒ์ดํ”„๋ผ์ธ์— ๋””ํดํŠธ๋กœ ์‚ฌ์šฉ๋˜๋Š” ์ฒดํฌํฌ์ธํŠธ๋Š” distillbert-base-cased-distilled-squad์ž…๋‹ˆ๋‹ค. ์ฒดํฌํฌ์ธํŠธ ์ด๋ฆ„ ๋‚ด์˜ "squad"๋Š” ๋ชจ๋ธ์ด ๋ฏธ์„ธ ์กฐ์ •๋œ ๋ฐ์ดํ„ฐ์…‹์˜ ๋ช…์นญ์ž…๋‹ˆ๋‹ค. 7์žฅ์—์„œ SQuAD ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด ๋” ์ด์•ผ๊ธฐํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

์œ„ ์ฝ”๋“œ์—์„œ ์งˆ๋ฌธ๊ณผ ์ปจํ…์ŠคํŠธ๋ฅผ ์ˆœ์„œ๋Œ€๋กœ ๋ฐฐ์น˜์‹œ์ผœ ์Œ(pair)์œผ๋กœ ํ† ํฐํ™”ํ•ฉ๋‹ˆ๋‹ค. ์•„๋ž˜ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด ์ดํ•ด๊ฐ€ ๋น ๋ฅผ๊ฒ๋‹ˆ๋‹ค.

์งˆ์˜ ์‘๋‹ต ๋ชจ๋ธ์€ ์ง€๊ธˆ๊นŒ์ง€ ๋ณธ ๋ชจ๋ธ๊ณผ ์กฐ๊ธˆ ๋‹ค๋ฅด๊ฒŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ์œ„์˜ ๊ทธ๋ฆผ์„ ์˜ˆ์‹œ๋กœ ๋ณด๋ฉด, ๋ชจ๋ธ์€ ์ •๋‹ต ์‹œ์ž‘ ํ† ํฐ์˜ ์ธ๋ฑ์Šค(์—ฌ๊ธฐ์„œ๋Š” 21)์™€ ์ •๋‹ต ๋งˆ์ง€๋ง‰ ํ† ํฐ์˜ ์ธ๋ฑ์Šค(์—ฌ๊ธฐ์„œ๋Š” 24)๋ฅผ ์˜ˆ์ธกํ•˜๋„๋ก ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ํ•ด๋‹น ๋ชจ๋ธ์ด ํ•˜๋‚˜์˜ ๋กœ์ง“(logits) ํ…์„œ๋ฅผ ๋ฐ˜ํ™˜ํ•˜์ง€ ์•Š๊ณ  ๋‘ ๊ฐœ์˜ ํ…์„œ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ์ด์œ ์ž…๋‹ˆ๋‹ค. ํ•˜๋‚˜๋Š” ์ •๋‹ต์˜ ์‹œ์ž‘ ํ† ํฐ์— ํ•ด๋‹นํ•˜๋Š” ๋กœ์ง“(logit)์ด๊ณ  ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” ์ •๋‹ต์˜ ๋งˆ์ง€๋ง‰ ํ† ํฐ์— ํ•ด๋‹นํ•˜๋Š” ๋กœ์ง“(logit)์ž…๋‹ˆ๋‹ค. ์ด ๊ฒฝ์šฐ 66๊ฐœ์˜ ํ† ํฐ์ด ํฌํ•จ๋œ ์ž…๋ ฅ์ด ํ•˜๋‚˜๊ฐ€ ์กด์žฌํ•˜๋ฏ€๋กœ ๋‹ค์Œ์„ ์–ป์Šต๋‹ˆ๋‹ค:

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
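
Since we have a single input of 66 tokens, both tensors come out with the shape:

torch.Size([1, 66]) torch.Size([1, 66])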

์ด๋Ÿฌํ•œ ๋กœ์ง“๋“ค์„ ํ™•๋ฅ ๋กœ ๋ณ€ํ™˜ํ•˜๊ธฐ ์œ„ํ•ด softmax ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•ด์•ผ ํ•˜๋‚˜, ๊ทธ ์ „์— ์ปจํ…์ŠคํŠธ(context)๊ฐ€ ์•„๋‹Œ ํ† ํฐ ์ธ๋ฑ์Šค๋ฅผ ๋งˆ์Šคํ‚น(masking)ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ž…๋ ฅ์ด [CLS] question [SEP] context [SEP]์ด๋ฏ€๋กœ ์งˆ๋ฌธ์— ํฌํ•จ๋œ ํ† ํฐ๊ณผ [SEP] ํ† ํฐ์„ ๋งˆ์Šคํ‚นํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ผ๋ถ€ ๋ชจ๋ธ์—์„œ๋Š” ์ปจํ…์ŠคํŠธ์— ๋‹ต์ด ์—†์Œ์„ ๋‚˜ํƒ€๋‚ด๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ• ์ˆ˜๋„ ์žˆ์œผ๋ฏ€๋กœ [CLS] ํ† ํฐ์€ ๋งˆ์Šคํ‚นํ•˜์ง€ ์•Š๊ณ  ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.

๋‚˜์ค‘์— softmax๋ฅผ ์ ์šฉํ•  ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— ๋งˆ์Šคํ‚น(masking)ํ•˜๋ ค๋Š” ๋กœ์ง“์„ ํฐ ์Œ์ˆ˜๋กœ ๋ฐ”๊พธ๋ฉด ๋ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” -10000์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

import torch

sequence_ids = inputs.sequence_ids()
# ์ปจํ…์ŠคํŠธ ํ† ํฐ๋“ค์„ ์ œ์™ธํ•˜๊ณ ๋Š” ๋ชจ๋‘ ๋งˆ์Šคํ‚นํ•œ๋‹ค.
mask = [i != 1 for i in sequence_ids]
# [CLS] ํ† ํฐ์€ ๋งˆ์Šคํ‚นํ•˜์ง€ ์•Š๋Š”๋‹ค.
mask[0] = False
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000

Now that we have properly masked the logits corresponding to positions we do not want to predict, we can apply the softmax:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)[0]

์ด ๋‹จ๊ณ„์—์„œ ์‹œ์ž‘ ๋ฐ ์ข…๋ฃŒ ํ™•๋ฅ ์˜ argmax๋ฅผ ์ทจํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ์‹œ์ž‘ ์ธ๋ฑ์Šค๊ฐ€ ์ข…๋ฃŒ ์ธ๋ฑ์Šค๋ณด๋‹ค ํด ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ๋ช‡ ๊ฐ€์ง€ ์˜ˆ๋ฐฉ ์กฐ์น˜๋ฅผ ๋” ์ทจํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. start_index <= end_index๋ฅผ ๋งŒ์กฑํ•˜๋Š” ๊ฐ€๋Šฅํ•œ start_index ๋ฐ end_index์˜ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•œ ๋‹ค์Œ ๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ์„ ๊ฐ€์ง„ ํŠœํ”Œ (start_index, end_index)์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.

"The answer starts at start_index" ๋ฐ "The answer ends at end_index" ์ด๋ฒคํŠธ๊ฐ€ ๋…๋ฆฝ์ ์ด๋ผ๊ณ  ๊ฐ€์ •ํ•  ๋•Œ, ๋‹ต๋ณ€์ด start_index์—์„œ ์‹œ์ž‘ํ•˜์—ฌ end_index์—์„œ ๋๋‚  ํ™•๋ฅ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

$$start\_probabilities[start\_index] \times end\_probabilities[end\_index]$$

๋”ฐ๋ผ์„œ ๋ชจ๋“  ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋ ค๋ฉด start_index <= end_index์„ ๋งŒ์กฑํ•˜๋Š” ๋ชจ๋“  start_probabilities[start_index]ร—end_probabilities[end_index]start\_probabilities[start\_index] \times end\_probabilities[end\_index]

scores = start_probabilities[:, None] * end_probabilities[None, :]

๊ทธ๋Ÿฐ ๋‹ค์Œ start_index > end_index๋ฅผ ๋งŒ์กฑํ•˜๋Š” ๊ฐ’๋“ค์„ 0์œผ๋กœ ์„ค์ •ํ•˜์—ฌ ๊ฐ’์„ ๋งˆ์Šคํ‚นํ•ฉ๋‹ˆ๋‹ค(๋‹ค๋ฅธ ํ™•๋ฅ ์€ ๋ชจ๋‘ ์–‘์ˆ˜์ž„). torch.triu() ํ•จ์ˆ˜๋Š” ์ธ์ˆ˜๋กœ ์ „๋‹ฌ๋œ 2D ํ…์„œ์˜ ์œ„์ชฝ ์‚ผ๊ฐํ˜• ๋ถ€๋ถ„์„ ๋ฐ˜ํ™˜ํ•˜๋ฏ€๋กœ ํ•ด๋‹น ๋งˆ์Šคํ‚น์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

scores = torch.triu(scores)

์ด์ œ ์ตœ๋Œ€๊ฐ’์˜ ์ธ๋ฑ์Šค๋งŒ ๊ตฌํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. PyTorch๋Š” ํ‰ํƒ„ํ™”๋œ ํ…์„œ(flattened tensor)์˜ ์ธ๋ฑ์Šค๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋ฏ€๋กœ ๋‚˜๋จธ์ง€ ์—†๋Š” ๋‚˜๋ˆ„๊ธฐ, // ์™€ ๋‚˜๋จธ์ง€ ์—ฐ์‚ฐ, %์„ ์‚ฌ์šฉํ•˜์—ฌ start_index ๋ฐ end_index๋ฅผ ๊ฐ€์ ธ์™€์•ผ ํ•ฉ๋‹ˆ๋‹ค:

max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
print(scores[start_index, end_index])
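
If everything went well, this prints a value close to 0.98 (as a 0-dim tensor; call .item() on it to get a plain Python float).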

์•„์ง ์™„๋ฃŒ๋˜์ง€ ์•Š์•˜์ง€๋งŒ ์ ์–ด๋„ ์ถ”์ถœ๋œ ์‘๋‹ต์— ๋Œ€ํ•œ ์ •ํ™•ํ•œ ์ ์ˆ˜๋Š” ๊ณ„์‚ฐํ–ˆ์Šต๋‹ˆ๋‹ค(์ด์ „ ์„น์…˜์˜ ์ฒซ ๋ฒˆ์งธ ๊ฒฐ๊ณผ์™€ ๋น„๊ตํ•˜์—ฌ ์ด๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค).

โœ๏ธ Try it out! ๊ฐ€์žฅ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์€ 5๊ฐœ์˜ ์‘๋‹ต์— ๋Œ€ํ•œ ์‹œ์ž‘ ๋ฐ ์ข…๋ฃŒ ์ธ๋ฑ์Šค๋ฅผ ์ถœ๋ ฅํ•ด ๋ด…์‹œ๋‹ค.

์‘๋‹ต๋“ค์˜ ํ† ํฐ ๋‹จ์œ„ start_index ๋ฐ end_index๋ฅผ ๊ตฌํ–ˆ๊ธฐ ๋•Œ๋ฌธ์—, ์ด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ด์ œ ์ปจํ…์ŠคํŠธ ๋‚ด์—์„œ์˜ ๋ฌธ์ž ๋‹จ์œ„ ์ธ๋ฑ์Šค๋กœ ๋ณ€ํ™˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์—์„œ ์˜คํ”„์…‹(offset)์ด ๋งค์šฐ ์œ ์šฉํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค. ํ† ํฐ ๋ถ„๋ฅ˜(token classification) ์ž‘์—…์—์„œ์ฒ˜๋Ÿผ ์ด๋“ค ์ธ๋ฑ์Šค๋“ค์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

inputs_with_offsets = tokenizer(question, context, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = context[start_char:end_char]

Now we just have to format everything to get our result:

result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index]
}
print(result)
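
This should print something like the following (the score will display as a tensor unless you convert it with .item()):

{'answer': 'Jax, PyTorch and TensorFlow', 'start': 78, 'end': 105, 'score': 0.97773}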

Great! This is the same result as the pipeline gave us earlier!

โœ๏ธ Try it out! ์ด์ „์— ๊ณ„์‚ฐํ•œ ์ตœ๊ณ  ์ ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ€๋Šฅ์„ฑ์ด ๊ฐ€์žฅ ๋†’์€ 5๊ฐœ์˜ ์‘๋‹ต์„ ํ‘œ์‹œํ•ด ๋ด…์‹œ๋‹ค. ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด์„œ ์ด์ „์˜ ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ ๋Œ์•„๊ฐ€์„œ ํ˜ธ์ถœํ•  ๋•Œ top_k=5๋ฅผ ์ „๋‹ฌํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

๊ธธ์ด๊ฐ€ ๊ธด ์ปจํ…์ŠคํŠธ ๋‹ค๋ฃจ๊ธฐ

์œ„์—์„œ ์˜ˆ์ œ๋กœ ์‚ฌ์šฉํ•œ ์งˆ๋ฌธ ๋ฐ ๊ธธ์ด๊ฐ€ ๊ธด ์ปจํ…์ŠคํŠธ๋ฅผ ํ† ํฐํ™” ํ•ด๋ณด๋ฉด question-answering ํŒŒ์ดํ”„๋ผ์ธ์—์„œ ์‚ฌ์šฉ๋œ ์ตœ๋Œ€ ๊ธธ์ด(384)๋ณด๋‹ค ๋” ๋งŽ์€ ํ† ํฐ๋“ค์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

inputs = tokenizer(question, long_context)
print(len(inputs["input_ids"]))
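
With this checkpoint's tokenizer, the count comes out well above that limit (the exact number depends on the tokenizer version):

461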

๋”ฐ๋ผ์„œ ์ตœ๋Œ€ ๊ธธ์ด๋งŒํผ ์ž…๋ ฅ์„ ์ ˆ๋‹จํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์€ ์—ฌ๋Ÿฌ ๊ฐ€์ง€๊ฐ€ ์žˆ์ง€๋งŒ ์šฐ์„  ์ฃผ์˜ํ•ด์•ผํ•  ๊ฒƒ์€ ์งˆ๋ฌธ์„ ์ ˆ๋‹จํ•ด์„œ๋Š” ์•ˆ๋˜๊ณ  ์ปจํ…์ŠคํŠธ๋งŒ ์ ˆ๋‹จํ•ด์•ผ ํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ์ปจํ…์ŠคํŠธ๋Š” ๋‘ ๋ฒˆ์งธ ๋ฌธ์žฅ์ด๋ฏ€๋กœ "only_second" ์ ˆ๋‹จ ์˜ต์…˜์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋•Œ ๋ฐœ์ƒํ•˜๋Š” ๋ฌธ์ œ๋Š” ์งˆ๋ฌธ์— ๋Œ€ํ•œ ์ •๋‹ต์ด ์ž˜๋ ค์ ธ ๋‚˜๊ฐ„ ์ปจํ…์ŠคํŠธ์— ์žˆ์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์•„๋ž˜ ์˜ˆ์‹œ์—์„œ ์ •๋‹ต์ด ์ปจํ…์ŠคํŠธ์˜ ๋ ๋ถ€๋ถ„์— ์žˆ๋Š” ์งˆ๋ฌธ์„ ์ž…๋ ฅํ–ˆ๋‹ค๋ฉด, ํ•ด๋‹น ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต๋ณ€์€ ์กด์žฌํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค:

inputs = tokenizer(question, long_context, max_length=384, truncation="only_second")
print(tokenizer.decode(inputs["input_ids"]))

์ด๊ฒƒ์€ ๋ชจ๋ธ์ด ์ •๋‹ต์„ ์„ ํƒํ•˜๋Š”๋ฐ ์–ด๋ ค์›€์„ ๊ฒช์„ ๊ฒƒ์ž„์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด question-answering ํŒŒ์ดํ”„๋ผ์ธ์„ ์‚ฌ์šฉํ•˜๋ฉด ์ปจํ…์ŠคํŠธ๋ฅผ ๋” ์ž‘์€ ์ฒญํฌ๋กœ ๋ถ„ํ• ํ•˜์—ฌ ์ตœ๋Œ€ ๊ธธ์ด๋ฅผ ์ง€์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ •๋‹ต์„ ์ฐพ์„ ์ˆ˜ ์žˆ๋„๋ก ์ปจํ…์ŠคํŠธ๋ฅผ ์ž˜๋ชป๋œ ์œ„์น˜์—์„œ ๋ถ„ํ• ํ•˜์ง€ ์•Š๋„๋ก ํ•˜๊ธฐ ์œ„ํ•ด ์ฒญํฌ ์‚ฌ์ด์— ์•ฝ๊ฐ„์˜ ๊ฒน์นจ(overlap)๋„ ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

We can have the tokenizer (fast or slow) do this for us by adding return_overflowing_tokens=True, and we can specify the overlap we want with the stride argument. Here is an example using a relatively short sentence:

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
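
Assuming the standard tokenization of this checkpoint, the chunks print as:

[CLS] This sentence is not [SEP]
[CLS] is not too long [SEP]
[CLS] too long but we [SEP]
[CLS] but we are going [SEP]
[CLS] are going to split [SEP]
[CLS] to split it anyway [SEP]
[CLS] it anyway. [SEP]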

As we can see, the sentence has been split into chunks in such a way that each entry in inputs["input_ids"] has at most 6 tokens (we would need to add padding to have the last entry be the same size as the others), and there is an overlap of 2 tokens between consecutive entries.

Let's take a closer look at the result of the tokenization:

print(inputs.keys())
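
The output shows three keys:

dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])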

์˜ˆ์ƒ๋Œ€๋กœ input_IDs์™€ attention_mask๊ฐ€ ๋‹ด๊ฒจ์ ธ ์žˆ์Šต๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰ ํ‚ค์ธ overflow_to_sample_mapping์€ ๊ฐ ๊ฒฐ๊ณผ๊ฐ€ ์–ด๋Š ๋ฌธ์žฅ์— ํ•ด๋‹นํ•˜๋Š”์ง€ ์•Œ๋ ค์ฃผ๋Š” ๋งต(map)์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์—๋Š” ์šฐ๋ฆฌ๊ฐ€ ํ† ํฌ๋‚˜์ด์ €๋กœ ์ „๋‹ฌํ•œ (์œ ์ผํ•œ) ๋ฌธ์žฅ์—์„œ ๋‚˜์˜จ 7๊ฐœ์˜ ๊ฒฐ๊ณผ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

print(inputs["overflow_to_sample_mapping"])
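
All 7 chunks map back to sample 0:

[0, 0, 0, 0, 0, 0, 0]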

์ด๊ฒƒ์€ ์—ฌ๋Ÿฌ ๋ฌธ์žฅ์„ ํ•จ๊ป˜ ํ† ํฐํ™”ํ•  ๋•Œ ๋” ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด,

sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

print(inputs["overflow_to_sample_mapping"])
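
This time the output covers both sentences:

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]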

The result means that the first sentence is split into 7 chunks as before, and the next 4 chunks come from the second sentence.

์ด์ œ ๊ธธ์ด๊ฐ€ ๊ธด ์ปจํ…์ŠคํŠธ๋กœ ๋Œ์•„๊ฐ€ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ธฐ๋ณธ์ ์œผ๋กœ question-answering ํŒŒ์ดํ”„๋ผ์ธ์€ ์•ž์—์„œ ์–ธ๊ธ‰ํ•œ ๊ฒƒ์ฒ˜๋Ÿผ ์ตœ๋Œ€ ๊ธธ์ด 384์™€ ๋ชจ๋ธ์ด ๋ฏธ์„ธ ์กฐ์ •๋œ ๋ฐฉ์‹๊ณผ ๋™์ผํ•œ 128์˜ ๋ณดํญ(stride)์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ํŒŒ์ดํ”„๋ผ์ธ์„ ํ˜ธ์ถœํ•  ๋•Œ max_seq_len ๋ฐ stride ์ธ์ˆ˜๋ฅผ ์ „๋‹ฌํ•˜์—ฌ ํ•ด๋‹น ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์กฐ์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ํ† ํฐํ™”ํ•  ๋•Œ ์ด๋Ÿฌํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ํŒจ๋”ฉ(padding)์„ ์ถ”๊ฐ€ํ•˜๊ณ (ํ…์„œ๋ฅผ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ๋„๋ก ์ƒ˜ํ”Œ๋“ค์˜ ๊ธธ์ด๋ฅผ ๋™์ผํ•˜๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด์„œ) ์˜คํ”„์…‹์„ ์š”์ฒญํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค:

inputs = tokenizer(
    question,
    long_context,
    stride=128,
    max_length=384,
    padding="longest",
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

์œ„์—์„œ inputs์—๋Š” ๋ชจ๋ธ๋กœ ์ž…๋ ฅ๋˜๋Š” input_IDs์™€ attention_mask ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ๋ฐฉ๊ธˆ ์–ธ๊ธ‰ํ•œ ์˜คํ”„์…‹(offset) ๋ฐ overflow_to_sample_mapping์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰ ๋‘๊ฐ€์ง€๋Š” ๋ชจ๋ธ์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ์•„๋‹ˆ๋ฏ€๋กœ ํ…์„œ๋กœ ๋ณ€ํ™˜ํ•˜๊ธฐ ์ „์— inputs์—์„œ ์ด๋ฅผ ์ œ๊ฑฐ(pop)ํ•ฉ๋‹ˆ๋‹ค:

_ = inputs.pop("overflow_to_sample_mapping")
offsets = inputs.pop("offset_mapping")

inputs = inputs.convert_to_tensors("pt")
print(inputs["input_ids"].shape)
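
With padding to the longest sample, both chunks come out at the maximum length:

torch.Size([2, 384])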

๊ธธ์ด๊ฐ€ ๊ธด ์ปจํ…์ŠคํŠธ๋Š” ๋‘ ๊ฐœ๋กœ ๋ถ„ํ• ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ, ๋ชจ๋ธ์˜ ์ถœ๋ ฅ์€ ๋‘๊ฐ€์ง€ ์ข…๋ฅ˜์˜ ์‹œ์ž‘ ๋ฐ ๋งˆ์ง€๋ง‰ ๋กœ์ง“(logits)์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค:

outputs = model(**inputs)

start_logits = outputs.start_logits
end_logits = outputs.end_logits
print(start_logits.shape, end_logits.shape)
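
Each set of logits has one row per chunk:

torch.Size([2, 384]) torch.Size([2, 384])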

Like before, we first mask the tokens that are not part of the context before taking the softmax. We also mask all the padding tokens (as flagged by the attention mask):

sequence_ids = inputs.sequence_ids()
# Mask everything apart from the tokens of the context
mask = [i != 1 for i in sequence_ids]
# Unmask the [CLS] token
mask[0] = False
# Mask all the [PAD] tokens
mask = torch.logical_or(torch.tensor(mask)[None], (inputs["attention_mask"] == 0))

start_logits[mask] = -10000
end_logits[mask] = -10000

๊ทธ๋Ÿฐ ๋‹ค์Œ softmax๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋กœ์ง“(logits)์„ ํ™•๋ฅ ๋กœ ๋ณ€ํ™˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

start_probabilities = torch.nn.functional.softmax(start_logits, dim=-1)
end_probabilities = torch.nn.functional.softmax(end_logits, dim=-1)

๋‹ค์Œ ๋‹จ๊ณ„๋Š” ์•ž์—์„œ ๊ธธ์ด๊ฐ€ ์งง์€ ์ปจํ…์ŠคํŠธ์— ๋Œ€ํ•ด ์ˆ˜ํ–‰ํ•œ ์ž‘์—…๊ณผ ์œ ์‚ฌํ•˜์ง€๋งŒ ์ฒญํฌ๊ฐ€ 2๊ฐœ์ด๋ฏ€๋กœ ์ด๋ฅผ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค. ๊ฐ€๋Šฅํ•œ ๋ชจ๋“  ๋‹ต๋ณ€(answer spans)์— ์ ์ˆ˜๋ฅผ ๋ถ€์—ฌํ•œ ๋‹ค์Œ ๊ฐ€์žฅ ์ข‹์€ ์ ์ˆ˜๋ฅผ ๋ฐ›์€ ๋‹ต๋ณ€์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค:

candidates = []
for start_probs, end_probs in zip(start_probabilities, end_probabilities):
    scores = start_probs[:, None] * end_probs[None, :]
    idx = torch.triu(scores).argmax().item()
    
    start_idx = idx // scores.shape[1]
    end_idx = idx % scores.shape[1]
    score = scores[start_idx, end_idx].item()
    candidates.append((start_idx, end_idx, score))

print(candidates)
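
The output should look roughly like this (two tuples of start index, end index, and score, one per chunk):

[(0, 18, 0.33867), (173, 184, 0.97149)]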

์œ„์—์„œ ์ถœ๋ ฅ๋œ 2๊ฐœ์˜ ํ›„๋ณด๋Š” ๋ชจ๋ธ์ด ๊ฐ ์ฒญํฌ(chunk, ๊ธธ์ด๊ฐ€ ๊ธธ์–ด์„œ ๋ถ„ํ• ๋œ ์ปจํ…์ŠคํŠธ)์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์—ˆ๋˜ ์ตœ์ƒ์˜ ๋‹ต๋ณ€์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋ธ์€ ์ •๋‹ต์ด ๋‘๋ฒˆ์งธ๋ผ๊ณ  ํ™•์‹คํžˆ ๋” ํ™•์‹ ํ•ฉ๋‹ˆ๋‹ค(์ข‹์€ ์ง•์กฐ์ž…๋‹ˆ๋‹ค!). ์ด์ œ ๋‘ ํ† ํฐ ๋ฒ”์œ„(token spans)๋ฅผ ์ปจํ…์ŠคํŠธ์˜ ๋ฌธ์ž ๋ฒ”์œ„(character spans)์— ๋งคํ•‘ํ•˜๊ธฐ๋งŒ ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ๊ฐ€์žฅ ํ™•์‹คํ•œ ๋‹ต๋ณ€์„ ์–ป๊ธฐ ์œ„ํ•ด ๋‘๋ฒˆ์งธ ํ›„๋ณด๋งŒ ๋งคํ•‘ํ•˜๋ฉด ๋˜์ง€๋งŒ, ์ฒซ๋ฒˆ์งธ ์ฒญํฌ์—์„œ ๋ชจ๋ธ์ด ์„ ํƒํ•œ ๋‹ต๋ณ€์„ ๋ณด๋Š” ๊ฒƒ๋„ ์žฌ๋ฏธ์žˆ์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.

โœ๏ธ Try it out! ์œ„์˜ ์ฝ”๋“œ๋ฅผ ์ˆ˜์ •ํ•˜์—ฌ ๊ฐ€์žฅ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์€ 5๊ฐœ์˜ ๋‹ต๋ณ€์— ๋Œ€ํ•œ ์ ์ˆ˜์™€ ๋ฒ”์œ„๋ฅผ ๋ฐ˜ํ™˜ํ•ด ๋ณด์„ธ์š”(์ฒญํฌ๋ณ„๋กœ๊ฐ€ ์•„๋‹ˆ๋ผ ์ดํ•ฉ์ ์œผ๋กœ).

์šฐ๋ฆฌ๊ฐ€ ์ด์ „์— ๊ฐ€์ ธ์˜จ ์˜คํ”„์…‹์€ ์‹ค์ œ๋กœ๋Š” ์˜คํ”„์…‹ ๋ฆฌ์ŠคํŠธ(list of offsets)์ด๋ฉฐ ํ…์ŠคํŠธ ์ฒญํฌ๋‹น ํ•˜๋‚˜์˜ ๋ฆฌ์ŠคํŠธ๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค:

for candidate, offset in zip(candidates, offsets):
    start_token, end_token, score = candidate
    start_char, _ = offset[start_token]
    _, end_char = offset[end_token]
    answer = long_context[start_char:end_char]
    result = {"answer": answer, "start": start_char, "end": end_char, "score": score}
    print(result)
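
The printed results should look roughly like this (the first answer spans the title of the first chunk):

{'answer': '\n🤗 Transformers: State of the Art NLP', 'start': 0, 'end': 37, 'score': 0.33867}
{'answer': 'Jax, PyTorch and TensorFlow', 'start': 1892, 'end': 1919, 'score': 0.97149}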

If we ignore the first result, we get the same result as our pipeline for this long context!

โœ๏ธ Try it out! ์ด์ „์— ๊ณ„์‚ฐํ•œ ์ตœ๊ณ  ์ ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ€์žฅ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์€ 5๊ฐœ์˜ ๋‹ต๋ณ€์„ ํ‘œ์‹œํ•ด๋ณด์„ธ์š”(๊ฐ ์ฒญํฌ๊ฐ€ ์•„๋‹Œ ์ „์ฒด ์ปจํ…์ŠคํŠธ์— ๋Œ€ํ•ด). ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด์„œ ์ฒซ๋ฒˆ์งธ ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ ๋Œ์•„๊ฐ€์„œ ํ˜ธ์ถœํ•  ๋•Œ top_k=5๋ฅผ ์ „๋‹ฌํ•ด๋ด…์‹œ๋‹ค.

And that concludes our deep dive into the tokenizer's capabilities. We will put all of this into practice again in the next chapter, when we show you how to fine-tune a model on a range of common NLP tasks.

์ข‹์€ ์›นํŽ˜์ด์ง€ ์ฆ๊ฒจ์ฐพ๊ธฐ