CS224N (2) Word Vectors and Word Senses


Word2vec

Review

  • Iterate through each word of the whole corpus
  • Predict surrounding words using word vectors
  • Calculate J(\theta) and \nabla_{\theta}{J(\theta)}
  • Update θ\theta so you can predict well
  • Each row of U\in \Reals^{V\times d} is a word vector (one vector per vocabulary word)
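
To make the prediction step concrete, here is a minimal numpy sketch of the naive softmax P(o\mid c) computed from toy U (outside-vector) and V (center-vector) matrices; the names, sizes, and random values are illustrative assumptions, not the lecture's code.

import numpy as np

np.random.seed(0)
vocab_size, d = 5, 4                        # toy vocabulary size and embedding dimension
U = np.random.randn(vocab_size, d)          # row u_w: "outside" vector of word w
V = np.random.randn(vocab_size, d)          # row v_w: "center" vector of word w

def p_outside_given_center(o, c):
    # Naive softmax: P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
    scores = U @ V[c]                       # u_w^T v_c for every word w in the vocabulary
    scores = scores - scores.max()          # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_outside_given_center(o=2, c=0))     # probability of outside word 2 given center word 0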

Q. How do we avoid putting too much weight on high-frequency words (like 'the', 'of', and 'and')?
A. Discard the first (largest) component of the word vectors! That dominant direction mainly carries frequency information.
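
A minimal sketch of that idea (in the spirit of "all-but-the-top" post-processing): center the vectors and project out the first principal component. The matrix W and the use of SVD here are my own illustrative choices, not the lecture's recipe.

import numpy as np

def remove_top_component(W):
    # W: (num_words x d) matrix of word vectors.
    # Center the vectors, then project out the first principal component,
    # which tends to correlate strongly with word frequency.
    W = W - W.mean(axis=0)
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    pc1 = Vt[0]                               # first principal direction, shape (d,)
    return W - np.outer(W @ pc1, pc1)

W = np.random.randn(1000, 50)                 # stand-in for trained word vectors
W_clean = remove_top_component(W)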

Stochastic Gradient Descent

  • Problem: \nabla_{\theta}{J(\theta)} is a sum over every window in the corpus, so computing it exactly is very expensive.
  • Stochastic Gradient Descent (SGD): repeatedly sample windows, and update after each one!
while True:
    window = sample_window(corpus)                    # sample one window from the corpus
    theta_grad = evaluate_gradient(J, window, theta)  # gradient of J on this window only
    theta = theta - alpha * theta_grad                # small step; alpha is the learning rate

Negative Sampling

  • Problem: the naive softmax
    P(o\mid c)= \frac{\exp(u_o^Tv_c)}{\sum_{w\in V}\exp(u_w^Tv_c)}
    requires a sum over the entire vocabulary in the denominator, which is expensive to compute.
  • Negative Sampling: instead of the full softmax, train binary logistic regressions that score the true (center, outside) pair against K randomly selected negative (noise) pairs.
  • J_{neg-sample}(\bm{o}, \bm{v}_c, \bm{U})=-\log(\sigma(\bm{u}_o^T\bm{v}_c))-\sum\limits_{k=1}^{K}\log(\sigma(-\bm{u}_k^T\bm{v}_c))
  • Negatives are drawn from P(w)=U(w)^{3/4}/Z, where U(w) is the unigram distribution and Z is a normalizing constant; the 3/4 power up-weights rarer words.
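
As a concrete illustration, here is a small numpy sketch of the negative-sampling loss above for a single (center, outside) pair, with K negatives drawn from U(w)^{3/4}/Z; the function and variable names are illustrative assumptions, not word2vec's actual implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(u_o, v_c, U, unigram_counts, K=5, seed=0):
    # J = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)
    rng = np.random.default_rng(seed)
    probs = unigram_counts ** 0.75
    probs = probs / probs.sum()                        # P(w) = U(w)^{3/4} / Z
    neg_ids = rng.choice(len(probs), size=K, p=probs)  # sample K noise words
    loss = -np.log(sigmoid(u_o @ v_c))                 # pull the true pair together
    loss -= np.log(sigmoid(-U[neg_ids] @ v_c)).sum()   # push the K noise pairs apart
    return loss

# toy usage
vocab_size, d = 10, 8
U = np.random.randn(vocab_size, d)
v_c = np.random.randn(d)
counts = np.random.randint(1, 100, size=vocab_size).astype(float)
print(neg_sample_loss(u_o=U[3], v_c=v_c, U=U, unigram_counts=counts))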

GloVe

Word-document co-occurrence matrix

  • A word-document co-occurrence matrix will give general topics, leading to "Latent Semantic Analysis" (LSA)

  • Example of simple co-occurrence matrix:
    I like deep learning.
    I like NLP.
    I enjoy flying.

  • Co-occurrence counts:

    | counts | I | like | enjoy | deep | learning | NLP | flying | . |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | I | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
    | like | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
    | enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
    | deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
    | learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
    | NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
    | flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
    | . | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
  • Problems with simple co-occurrence vectors:
    1) Increase in size with vocabulary
    2) Very high dimensional, requires a lot of storage
    3) Sparsity issues with vectors
    ⇒ Naive solution: Singular Value Decomposition (SVD) (see the sketch after this list)
    ⇒ Better solution: GloVe
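
Here is the sketch referenced above: a minimal numpy example that builds the window-size-1 co-occurrence matrix for the three sentences and applies truncated SVD as the naive dimensionality reduction. The tokenization and the choice of k are assumptions for illustration only.

import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
sents = [s.split() for s in corpus]
vocab = sorted({w for s in sents for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Window-based co-occurrence counts (window size 1, symmetric)
X = np.zeros((len(vocab), len(vocab)))
for s in sents:
    for i, w in enumerate(s):
        for j in (i - 1, i + 1):
            if 0 <= j < len(s):
                X[idx[w], idx[s[j]]] += 1

# Naive solution: truncated SVD to get dense k-dimensional word vectors
k = 2
Uw, S, Vt = np.linalg.svd(X)
word_vectors = Uw[:, :k] * S[:k]              # one k-dimensional row per word
print(dict(zip(vocab, word_vectors.round(2))))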

Count-based vs Direct-prediction

  • The methods of word representation can be categorized as "Count-based" or "Direct-prediction"

  • Count-based (LSA, HAL; COALS, Hellinger-PCA)
    Advantages:
    - Fast training
    - Efficient usage of statistics
    Disadvantages:
    - Primarily used to capture word similarity
    - Disproportionate importance given to large counts
  • Direct-prediction (Word2vec Skip-gram/CBOW; NNLM, HLBL, RNN)
    Advantages:
    - Generate improved performance on other tasks
    - Can capture complex patterns beyond word similarity
    Disadvantages:
    - Scales with corpus size
    - Inefficient usage of statistics
  • In other words, count-based methods can capture document-level characteristics because they take the statistics of the whole corpus as input, while direct-prediction methods are better at representing the syntactic and semantic features of individual words.

GloVe

  • GloVe is designed to combine the advantages of both count-based and direct-prediction methods.

  • Insight: Ratio of co-occurrence probabilities can encode meaning components.

  • Example (probabilities and their ratio):

    |  | x = solid | x = gas | x = water | x = random |
    | --- | --- | --- | --- | --- |
    | P(x\mid ice) | large | small | large | small |
    | P(x\mid steam) | small | large | large | small |
    | P(x\mid ice)/P(x\mid steam) | large | small | \approx 1 | \approx 1 |
  • "Ratio" makes both-related words (water) or un-related words (fashion) to be close to 1.

Q. How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space?
A. Log-bilinear model!

  • Log-bilinear Model ... (What is bilinear?)

    1. Co-occurrence probability: P_{ij}=P(j\mid i)=\frac{X_{ij}}{X_i}
    2. Ratio of co-occurrence probabilities: \frac{P_{ik}}{P_{jk}}=F(w_{i}, w_{j}, w_k)
    3. We want F to represent the information present in the ratio P_{ik}/P_{jk} in the word vector space.
    4. Since P_{ik}/P_{jk} is a scalar while the arguments are vectors, the most natural choice is to compare w_i and w_j by their difference and combine it with w_k via a dot product: F((w_i-w_j)^Tw_k)=\frac{P_{ik}}{P_{jk}}
    5. The distinction between a word and a context word is arbitrary, and we are free to exchange the two roles.
      Thus, F((w_i-w_j)^Tw_k)=\frac{F(w_i^Tw_k)}{F(w_j^Tw_k)}=\frac{P_{ik}}{P_{jk}}
    6. We require that F be a homomorphism between the groups (\Reals,+) and (\Reals_{>0}, \times), which is satisfied by F=\exp
    7. Taking the logarithm, we finally get the log-bilinear model:
      w_i \cdot w_j = \log{P(i \mid j)}
  • Cost Function of GloVe
    J=\sum\limits_{i,j=1}^{V}f(X_{ij})(w_i^T\tilde{w}_j+b_i+\tilde{b}_j-\log{X_{ij}})^2

    • f is a weighting function designed to cap the effect of very common words.
    • Note: b_i and \tilde{b}_j are bias terms for the center word i and the context word j.
    • For more details, check the original GloVe paper; a small numeric sketch of one term of this cost follows this list.
  • vs. Word2vec
    One practitioner's comparison: see the Stack Overflow link in the references.
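
As referenced above, here is a minimal sketch of one term of the GloVe cost, using the weighting function f(x)=(x/x_{max})^{\alpha} (with x_{max}=100 and \alpha=3/4 as in the paper); the vectors, biases, and count are made-up toy values, and this is an illustrative computation rather than the official implementation.

import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    # Weighting function from the GloVe paper: caps the influence of very frequent pairs
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    # One term of J: f(X_ij) * (w_i^T w~_j + b_i + b~_j - log X_ij)^2
    return f(x_ij) * (w_i @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)) ** 2

# toy usage with random vectors and a made-up co-occurrence count
d = 5
w_i, w_tilde_j = np.random.randn(d), np.random.randn(d)
print(glove_term(w_i, w_tilde_j, b_i=0.1, b_tilde_j=-0.2, x_ij=42.0))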


If there is something wrong in my writing or understanding, please comment and make corrections!


[references]
1. https://youtu.be/kEMJRjEdNzM
2. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture02-wordvecs2.pdf
3. https://aclanthology.org/D14-1162/
4. https://qr.ae/pGPB2h
5. https://youtu.be/cYzp5IWqCsg
6. https://stackoverflow.com/questions/56071689/whats-the-major-difference-between-glove-and-word2vec
