[Paper Review] FastText

FastText: Enriching Word Vectors with Subword Information

Proposed by Bojanowski et al. (Facebook AI) in 2017
Fast: trains quickly on large corpora, although the paper reports training about 1.5x slower than its skip-gram baseline
Word Vectors: a word representation method that maps natural-language words to vectors
Subword: morphological features are extracted from the subword units (character level) of each word
ex) eat, eaten, eating
All three words contain the subword 'eat' -> the model learns similar meanings for all three

What are Word Representations?

Encode the meaning of each word in a vector space -> represent each word as a continuous real-valued vector

Example)

Let's vectorize Ronaldo, Messi, and Dicaprio.

	isRonaldo	isMessi	isDicaprio
Ronaldo	1	0	0
Messi	0	1	0
Dicaprio	0	0	1

With one-hot encoding as above, nothing can be learned about how the three words relate to one another.
Adding a little information lets us represent them as follows.

	isFootballer	isActor
Ronaldo	1	0
Messi	1	0
Dicaprio	0	1

Plotted in a 2-D vector space, Ronaldo and Messi both land on the point (1, 0), while Dicaprio sits at (0, 1).
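The difference between the two tables can be made concrete with cosine similarity. The following is a minimal sketch, not part of the original post: the helper name cosine and the dictionary names are chosen here for illustration, and the vectors are copied from the two tables above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Rows of the one-hot table: [isRonaldo, isMessi, isDicaprio]
one_hot = {
    "Ronaldo": [1, 0, 0],
    "Messi": [0, 1, 0],
    "Dicaprio": [0, 0, 1],
}

# Rows of the feature table: [isFootballer, isActor]
features = {
    "Ronaldo": [1, 0],
    "Messi": [1, 0],
    "Dicaprio": [0, 1],
}

# One-hot vectors are mutually orthogonal, so every pair looks unrelated.
one_hot_sim = cosine(one_hot["Ronaldo"], one_hot["Messi"])

# With shared features, the two footballers become identical and the actor stays apart.
footballers_sim = cosine(features["Ronaldo"], features["Messi"])
mixed_sim = cosine(features["Ronaldo"], features["Dicaprio"])

print(one_hot_sim, footballers_sim, mixed_sim)
```

With only two hand-made features the two footballers cannot be told apart at all, which is one reason to want more dimensions.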

Using even more information (a larger vector dimension) lets us represent the three words more precisely.

	isFootballer	isActor	Popularity	Gender	Height	Weight
Ronaldo	1	0	1	…	…	…
Messi	1	0	1	…	…	…
Dicaprio	0	1	1	…	…	…

However, collecting this kind of information by hand is difficult.
-> Can we build a model that learns word meanings from a large corpus?

Can we have neural networks comb through a large corpus of text and generate word representations automatically?

Word2Vec Review

Proposed by Mikolov et al. in 2013
A method for learning vector representations from a massive corpus

Two main training methods

CBOW(Continuous Bag of Words)
Skip-gram
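CBOW predicts a center word from its surrounding context, while skip-gram predicts each context word from the center word. The sketch below shows how training examples for each could be extracted from one sentence. It is a simplified illustration: the function names, the example sentence, and the fixed window of 1 are assumptions made here, and real implementations add details such as subsampling and randomly shrunk windows.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: skip-gram predicts each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """(context_words, center) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = "i am eating an apple".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print(sg)
print(cb)
```

Skip-gram produces one training pair per (center, context) combination, whereas CBOW produces one example per position.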

Limitations of Word2Vec

OOV (Out Of Vocabulary): even if 'eat' appears in the training data, no word vector can be produced for 'eating' if it does not
Morphological meaning: 'eat' and 'eating' are close in meaning, but word2vec treats them as independent, unrelated words

FastText

A method proposed to address the limitations above

1. Skip-gram w/ Negative Sampling
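Why negative sampling? To reduce computation!

With a softmax output layer, the update that matters involves only the surrounding words (within the window), yet scores must be computed for every word in the vocabulary
multi-class classification (softmax) -> binary classification (binary logistic loss)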
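The saving can be seen by comparing the two losses side by side. This is a minimal sketch under stated assumptions: the function names, the toy 2-dimensional vectors, and the hand-picked negative indices are all illustrative, and in practice the negatives are sampled from a frequency-based distribution rather than fixed.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def softmax_loss(center, context_index, output_vectors):
    """Full softmax cross-entropy: scores every word in the vocabulary."""
    scores = [dot(center, w) for w in output_vectors]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[context_index]


def negative_sampling_loss(center, context_index, negative_indices, output_vectors):
    """Binary logistic loss: scores only the true context word and the sampled negatives."""
    loss = -log_sigmoid(dot(center, output_vectors[context_index]))
    for k in negative_indices:
        loss -= log_sigmoid(-dot(center, output_vectors[k]))
    return loss


# Toy setup (assumed values): a vocabulary of 6 words with 2-dimensional vectors.
center = [0.5, -0.2]
output_vectors = [
    [0.4, 0.1], [-0.3, 0.2], [0.1, 0.1],
    [0.0, -0.5], [0.2, 0.3], [-0.1, -0.1],
]

full = softmax_loss(center, 0, output_vectors)                       # 6 dot products
sampled = negative_sampling_loss(center, 0, [3, 5], output_vectors)  # 3 dot products
print(full, sampled)
```

The softmax cost grows with the vocabulary size, while the negative-sampling cost grows only with the number of negatives (the negative parameter, default 5 in the list near the end of this post).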

2. Sub-word Generation
[Main Idea]
'eats': appears frequently in the corpus
'eating': appears rarely in the corpus
-> the meaning of 'eating' can be inferred through the shared n-gram 'eat'

Generating subwords
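In the paper, each word is wrapped in the boundary symbols < and > and then split into character n-grams, with the whole wrapped word kept as one extra sequence. The sketch below follows that description. The function name char_ngrams is chosen here for illustration, and the range of 3 to 6 characters matches the min_n and max_n defaults listed near the end of this post.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of '<word>' for n in [min_n, max_n], plus the whole wrapped word."""
    token = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    if token not in grams:
        grams.append(token)  # special sequence standing for the word itself
    return grams


# With n = 3 only, 'where' yields its five trigrams plus '<where>'.
where_trigrams = char_ngrams("where", min_n=3, max_n=3)

# 'eats' and 'eating' overlap on the n-grams that cover 'eat'.
shared = set(char_ngrams("eats")) & set(char_ngrams("eating"))

print(where_trigrams)
print(sorted(shared))
```

The boundary symbols matter: the trigram 'eat' inside 'eating' is a different unit from the whole word '<eat>'.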

The word vector for 'eating' is the sum of its n-gram vectors and the vector for the whole word itself
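The sum described above can be sketched with a small hashed n-gram table. Everything concrete here is an assumption for illustration: the table sizes BUCKETS and DIM, the random initial vectors, the tiny word_table vocabulary, and the use of a 32-bit FNV-1a hash to map n-grams to buckets. The point is only that a word missing from the vocabulary still receives a vector from its n-grams.

```python
import random

BUCKETS = 1000  # assumed small table; the bucket default listed later is 2000000
DIM = 4         # assumed tiny dimension for readability


def fnv1a(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of text."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of '<word>' (whole-word vectors live in word_table instead)."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]


rng = random.Random(0)
# One vector per hash bucket; several n-grams may collide in the same bucket.
ngram_table = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(BUCKETS)]
# Whole-word vectors exist only for words seen in training (hypothetical vocabulary).
word_table = {w: [rng.uniform(-0.5, 0.5) for _ in range(DIM)] for w in ("eat", "eats")}


def word_vector(word):
    """Whole-word vector (zeros if out of vocabulary) plus the sum of n-gram vectors."""
    vec = list(word_table.get(word, [0.0] * DIM))
    for gram in char_ngrams(word):
        row = ngram_table[fnv1a(gram) % BUCKETS]
        vec = [a + b for a, b in zip(vec, row)]
    return vec


in_vocab = word_vector("eats")      # word vector + n-gram vectors
out_of_vocab = word_vector("eating")  # n-gram vectors only, yet still defined
print(in_vocab)
print(out_of_vocab)
```

Hashing keeps memory bounded regardless of how many distinct n-grams the corpus contains, at the cost of occasional collisions.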

Because the n-gram vectors are trained as well, training was about 1.5x slower than skip-gram, but the model outperformed the CBOW and skip-gram baselines on word-similarity tasks

Implementation

gensim FastText: https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html
Pre-trained word vectors(157 langs): https://fasttext.cc/docs/en/crawl-vectors.html
FastText tutorial: https://github.com/ukairia777/tensorflow-nlp-tutorial/blob/main/09.%20Word%20Embedding/9-6.%20fasttext.ipynb

from gensim.models.fasttext import FastText
from gensim.test.utils import datapath

# Path to the training corpus bundled with gensim's test data
corpus_file = datapath('lee_background.cor')

# create the model with 100-dimensional vectors
model = FastText(vector_size=100)

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(
    corpus_file=corpus_file,
    epochs=model.epochs,
    total_examples=model.corpus_count,
    total_words=model.corpus_total_words,
)

print(model)

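Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec. In gensim's FastText constructor, model, loss, and threads are set through sg (0 = CBOW, 1 = skip-gram), hs / negative, and workers; the other names are passed as written: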

model: Training architecture. Allowed values: cbow, skipgram (Default cbow)
vector_size: Dimensionality of vector embeddings to be learnt (Default 100)
alpha: Initial learning rate (Default 0.025)
window: Context window size (Default 5)
min_count: Ignore words with number of occurrences below this (Default 5)
loss: Training objective. Allowed values: ns, hs, softmax (Default ns)
sample: Threshold for downsampling higher-frequency words (Default 0.001)
negative: Number of negative words to sample, for ns (Default 5)
epochs: Number of epochs (Default 5)
sorted_vocab: Sort vocab by descending frequency (Default 1)
threads: Number of threads to use (Default 12)

fastText also has three parameters of its own:

min_n: min length of char ngrams (Default 3)
max_n: max length of char ngrams (Default 6)
bucket: number of buckets used for hashing ngrams (Default 2000000)

Reference

Piotr Bojanowski et al., “Enriching Word Vectors with Subword Information”
Tomas Mikolov et al., “Efficient Estimation of Word Representations in Vector Space”
https://amitness.com/2020/06/fasttext-embeddings/
https://www.youtube.com/watch?v=7UA21vg4kKE

Author And Source

Source: [Paper Review] FastText, https://velog.io/@ejjeong/Paper-Review-FastText
