CS224N (2) Word Vectors and Word Senses


Word2vec

Review

  • Iterate through each word of the whole corpus
  • Predict surrounding words using word vectors
  • Calculate J(\theta) and \nabla_{\theta}{J(\theta)}
  • Update θ\theta so you can predict well
  • Each row of U\in \Reals^{V\times d} is a word vector (one vector per vocabulary word)
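
To make the prediction step concrete, here is a minimal numpy sketch of the naive softmax P(o\mid c) computed from toy U (outside-vector) and V (center-vector) matrices; the names, sizes, and random values are illustrative assumptions, not the lecture's code.

import numpy as np

np.random.seed(0)
vocab_size, d = 5, 4                        # toy vocabulary size and embedding dimension
U = np.random.randn(vocab_size, d)          # row u_w: "outside" vector of word w
V = np.random.randn(vocab_size, d)          # row v_w: "center" vector of word w

def p_outside_given_center(o, c):
    # Naive softmax: P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
    scores = U @ V[c]                       # u_w^T v_c for every word w in the vocabulary
    scores = scores - scores.max()          # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_outside_given_center(o=2, c=0))     # probability of outside word 2 given center word 0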

Q. How do we avoid putting too much weight on high-frequency words (like 'the', 'of', and 'and')?
A. Discard the first (largest) component of the word vectors! That dominant direction mainly carries frequency information.
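
A minimal sketch of that idea (in the spirit of "all-but-the-top" post-processing): center the vectors and project out the first principal component. The matrix W and the use of SVD here are my own illustrative choices, not the lecture's recipe.

import numpy as np

def remove_top_component(W):
    # W: (num_words x d) matrix of word vectors.
    # Center the vectors, then project out the first principal component,
    # which tends to correlate strongly with word frequency.
    W = W - W.mean(axis=0)
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    pc1 = Vt[0]                               # first principal direction, shape (d,)
    return W - np.outer(W @ pc1, pc1)

W = np.random.randn(1000, 50)                 # stand-in for trained word vectors
W_clean = remove_top_component(W)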

Stochastic Gradient Descent

  • Problem: \nabla_{\theta}{J(\theta)} is a sum over every window in the corpus, so computing it exactly is very expensive.
  • Stochastic Gradient Descent (SGD): repeatedly sample windows, and update after each one!
while True:
    window = sample_window(corpus)                    # sample one window from the corpus
    theta_grad = evaluate_gradient(J, window, theta)  # gradient of J on this window only
    theta = theta - alpha * theta_grad                # small step; alpha is the learning rate

Negative Sampling

  • Problem: the naive softmax
    P(o\mid c)= \frac{\exp(u_o^Tv_c)}{\sum_{w\in V}\exp(u_w^Tv_c)}
    requires a sum over the entire vocabulary in the denominator, which is expensive to compute.
  • Negative Sampling: instead of the full softmax, train binary logistic regressions that score the true (center, outside) pair against K randomly selected negative (noise) pairs.
  • J_{neg-sample}(\bm{o}, \bm{v}_c, \bm{U})=-\log(\sigma(\bm{u}_o^T\bm{v}_c))-\sum\limits_{k=1}^{K}\log(\sigma(-\bm{u}_k^T\bm{v}_c))
  • Negatives are drawn from P(w)=U(w)^{3/4}/Z, where U(w) is the unigram distribution and Z is a normalizing constant; the 3/4 power up-weights rarer words.
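
As a concrete illustration, here is a small numpy sketch of the negative-sampling loss above for a single (center, outside) pair, with K negatives drawn from U(w)^{3/4}/Z; the function and variable names are illustrative assumptions, not word2vec's actual implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(u_o, v_c, U, unigram_counts, K=5, seed=0):
    # J = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)
    rng = np.random.default_rng(seed)
    probs = unigram_counts ** 0.75
    probs = probs / probs.sum()                        # P(w) = U(w)^{3/4} / Z
    neg_ids = rng.choice(len(probs), size=K, p=probs)  # sample K noise words
    loss = -np.log(sigmoid(u_o @ v_c))                 # pull the true pair together
    loss -= np.log(sigmoid(-U[neg_ids] @ v_c)).sum()   # push the K noise pairs apart
    return loss

# toy usage
vocab_size, d = 10, 8
U = np.random.randn(vocab_size, d)
v_c = np.random.randn(d)
counts = np.random.randint(1, 100, size=vocab_size).astype(float)
print(neg_sample_loss(u_o=U[3], v_c=v_c, U=U, unigram_counts=counts))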

GloVe

Word-document co-occurrence matrix

  • A word-document co-occurrence matrix will give general topics, leading to "Latent Semantic Analysis" (LSA)

  • Example of simple co-occurrence matrix:
    I like deep learning.
    I like NLP.
    I enjoy flying.

  • Co-occurrence counts:

    | counts | I | like | enjoy | deep | learning | NLP | flying | . |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | I | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
    | like | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
    | enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
    | deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
    | learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
    | NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
    | flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
    | . | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
  • Problems with simple co-occurrence vectors:
    1) Increase in size with vocabulary
    2) Very high dimensional, requires a lot of storage
    3) Sparsity issues with vectors
    ⇒ Naive solution: Singular Value Decomposition (SVD) (see the sketch after this list)
    ⇒ Better solution: GloVe
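
Here is the sketch referenced above: a minimal numpy example that builds the window-size-1 co-occurrence matrix for the three sentences and applies truncated SVD as the naive dimensionality reduction. The tokenization and the choice of k are assumptions for illustration only.

import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
sents = [s.split() for s in corpus]
vocab = sorted({w for s in sents for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Window-based co-occurrence counts (window size 1, symmetric)
X = np.zeros((len(vocab), len(vocab)))
for s in sents:
    for i, w in enumerate(s):
        for j in (i - 1, i + 1):
            if 0 <= j < len(s):
                X[idx[w], idx[s[j]]] += 1

# Naive solution: truncated SVD to get dense k-dimensional word vectors
k = 2
Uw, S, Vt = np.linalg.svd(X)
word_vectors = Uw[:, :k] * S[:k]              # one k-dimensional row per word
print(dict(zip(vocab, word_vectors.round(2))))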

Count-based vs Direct-prediction

  • The methods of word representation can be categorized as "Count-based" or "Direct-prediction"

  • Count-based (LSA, HAL; COALS, Hellinger-PCA)
    Advantages:
    - Fast training
    - Efficient usage of statistics
    Disadvantages:
    - Primarily used to capture word similarity
    - Disproportionate importance given to large counts
  • Direct-prediction (Word2vec Skip-gram/CBOW; NNLM, HLBL, RNN)
    Advantages:
    - Generate improved performance on other tasks
    - Can capture complex patterns beyond word similarity
    Disadvantages:
    - Scales with corpus size
    - Inefficient usage of statistics
  • In other words, count-based methods can capture document-level characteristics because they take the statistics of the whole corpus as input, while direct-prediction methods are better at representing the syntactic and semantic features of individual words.

GloVe

  • GloVe is designed to combine the advantages of both count-based and direct-prediction methods.

  • Insight: Ratio of co-occurrence probabilities can encode meaning components.

  • Example (probabilities and their ratio):

    |  | x = solid | x = gas | x = water | x = random |
    | --- | --- | --- | --- | --- |
    | P(x\mid ice) | large | small | large | small |
    | P(x\mid steam) | small | large | large | small |
    | P(x\mid ice)/P(x\mid steam) | large | small | \approx 1 | \approx 1 |
  • "Ratio" makes both-related words (water) or un-related words (fashion) to be close to 1.

Q. How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space?
A. Log-bilinear model!

  • Log-bilinear Model ... (What is bilinear?)

    1. Co-occurrence probability: P_{ij}=P(j\mid i)=\frac{X_{ij}}{X_i}
    2. Ratio of co-occurrence probabilities: \frac{P_{ik}}{P_{jk}}=F(w_{i}, w_{j}, w_k)
    3. We want F to represent the information present in the ratio P_{ik}/P_{jk} in the word vector space.
    4. Since P_{ik}/P_{jk} is a scalar while the arguments are vectors, the most natural choice is to compare w_i and w_j by their difference and combine it with w_k via a dot product: F((w_i-w_j)^Tw_k)=\frac{P_{ik}}{P_{jk}}
    5. The distinction between a word and a context word is arbitrary, and we are free to exchange the two roles.
      Thus, F((w_i-w_j)^Tw_k)=\frac{F(w_i^Tw_k)}{F(w_j^Tw_k)}=\frac{P_{ik}}{P_{jk}}
    6. We require that F be a homomorphism between the groups (\Reals,+) and (\Reals_{>0}, \times), which is satisfied by F=\exp
    7. Taking the logarithm, we finally get the log-bilinear model:
      w_i \cdot w_j = \log{P(i \mid j)}
  • Cost Function of GloVe
    J=\sum\limits_{i,j=1}^{V}f(X_{ij})(w_i^T\tilde{w}_j+b_i+\tilde{b}_j-\log{X_{ij}})^2

    • f is a weighting function designed to cap the effect of very common words.
    • Note: b_i and \tilde{b}_j are bias terms for the center word i and the context word j.
    • For more details, check the original GloVe paper; a small numeric sketch of one term of this cost follows this list.
  • vs. Word2vec
    One practitioner's comparison: see the Stack Overflow link in the references.
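
As referenced above, here is a minimal sketch of one term of the GloVe cost, using the weighting function f(x)=(x/x_{max})^{\alpha} (with x_{max}=100 and \alpha=3/4 as in the paper); the vectors, biases, and count are made-up toy values, and this is an illustrative computation rather than the official implementation.

import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    # Weighting function from the GloVe paper: caps the influence of very frequent pairs
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    # One term of J: f(X_ij) * (w_i^T w~_j + b_i + b~_j - log X_ij)^2
    return f(x_ij) * (w_i @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)) ** 2

# toy usage with random vectors and a made-up co-occurrence count
d = 5
w_i, w_tilde_j = np.random.randn(d), np.random.randn(d)
print(glove_term(w_i, w_tilde_j, b_i=0.1, b_tilde_j=-0.2, x_ij=42.0))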


If there is something wrong in my writing or understanding, please comment and make corrections!


[references]
1. https://youtu.be/kEMJRjEdNzM
2. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture02-wordvecs2.pdf
3. https://aclanthology.org/D14-1162/
4. https://qr.ae/pGPB2h
5. https://youtu.be/cYzp5IWqCsg
6. https://stackoverflow.com/questions/56071689/whats-the-major-difference-between-glove-and-word2vec
