[Review] Sequence to Sequence Learning with Neural Networks

Research Paper Link: https://arxiv.org/pdf/1409.3215.pdf
Github Repo: https://github.com/google/seq2seq
Official Introduction: https://www.tensorflow.org/tutorials/seq2seq

Abstract


DNNs (Deep Neural Networks) work very well on large labelled datasets, but they cannot be used to map sequences to sequences.
In this paper, the authors present a general end-to-end approach to sequence learning that makes minimal assumptions about the sequence structure.
They conducted their experiments on the task of English-to-French translation.
Finally, they found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduces many short-term dependencies between the source and the target sentence, which makes the optimization problem easier.

Content

  • Introduction
  • The model
  • Experiments
  • Dataset details
  • Decoding and Rescoring
  • Reversing the Source Sentences
  • Training details
  • Parallelizations
  • Experimental Results
  • Related Work
  • Conclusion


    1. Introduction


    Using LSTMs, the authors built a model for machine translation.
    An encoder first reads the source sentence and builds a "thought" vector, a sequence of numbers that represents the sentence's meaning; a decoder then processes that sentence vector to emit a translation.
    This is often referred to as the encoder-decoder architecture.
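
    To make this concrete, here is a minimal PyTorch sketch of the encoder-decoder idea (my own illustration, not the authors' implementation): one LSTM encodes the source sentence into its final hidden state, and a second LSTM is initialised with that state and predicts the target sentence one token at a time. All sizes and names below are hypothetical.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        """Minimal encoder-decoder: two separate LSTMs joined only by the final hidden state."""
        def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512, num_layers=4):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb_dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
            self.decoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
            self.out = nn.Linear(hidden_dim, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            # Encode: keep only the final (h, c) state -- the "thought vector".
            _, thought = self.encoder(self.src_emb(src_ids))
            # Decode: start from the thought vector and predict the next target token at each step.
            dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), thought)
            return self.out(dec_out)                     # logits over the target vocabulary

    model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)      # the paper uses 160k/80k vocabularies
    src = torch.randint(0, 1000, (2, 7))                 # a batch of 2 source sentences, 7 tokens each
    tgt = torch.randint(0, 1000, (2, 9))                 # the corresponding target prefixes
    logits = model(src, tgt)                             # shape: (2, 9, 1000)

    Training then maximises the log-probability of each correct target token given the source, which is the objective described in section 3.2 below.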

    Thought Vectors


    This vector has been called, by various people, an "embedding", a "representational vector" or a "latent vector". But Geoff Hinton, in a stroke of marketing genius, gave it the name "thought vector".

    Link: http://gabgoh.github.io/ThoughtVectors/

    2. The model


    Their model is based on the one introduced by Alex Graves.
    On top of it, they added several features:
    1. Two separate LSTMs (encoder and decoder)
    2. Four stacked LSTM layers (a deep RNN)
    3. Feeding the source sentence in reversed order
    4. Beam search at the output layer
    So, to deeply understand their architecture, let's have a look at his research paper.
    Title: Generating Sequences With Recurrent Neural Networks
    Author: Alex Graves
    Date: 5/6/2014

    Abstract


    This paper shows how Long Short-Term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text and online handwriting.

    RNN Prediction Architecture


    [Figure: deep recurrent neural network prediction architecture with skip connections]

    Input: $X = (x_1, x_2 , ... , x_T)$
    Output: $Y = (y_1, y_2 , ... , y_T)$
    Hidden: $h^n = (h^n_1, h^n_2, ... , h^n_T)$
    Predictive distribution: $\Pr(x_{t+1}|y_t)$, parameterised by the output vector $y_t$
    * At the first step ($t=1$), the input $x_1$ is normally a null (all-zero) vector, so the first output $y_1$ gives the prediction for $x_2$

    Skip Connections (from Input to all hidden layers and all hidden layers to outputs)


    Input to Hidden skip connections (for the first hidden layer, n = 1, the layer-below term is absent)
    h^n_t = H(W^{xh^n}x_t + W^{h^{n-1}h^n}h^{n-1}_{t} + W^{h^n h^n}h^{n}_{t-1} + b^n_h)
    
    Hidden to Output skip connections
    Cost function: negative log-likelihood (defined below)
    Output layer ($y_t$): softmax over the K possible next symbols
    \hat y_t = b_y + \sum ^N_{n=1} W^{h^ny}h^n_t\\
    y_t = Y(\hat y_t)\\
    Pr(x) = \prod^T_{t=1} Pr(x_{t+1}|y_t)\\
    L(x) = - \sum ^T_{t=1} \log Pr(x_{t+1}|y_t)\\
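
    Putting the two sets of skip connections together, here is a toy NumPy forward pass (my own sketch: plain tanh units stand in for the LSTM cells used in the paper, and all sizes are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    T, x_dim, h_dim, y_dim, N = 5, 8, 16, 10, 3                       # toy sequence length, sizes, N hidden layers

    W_xh  = [rng.normal(0, 0.1, (h_dim, x_dim)) for _ in range(N)]    # input -> layer n (skip connections)
    W_dn  = [rng.normal(0, 0.1, (h_dim, h_dim)) for _ in range(N)]    # layer n-1 -> layer n (unused for n = 1)
    W_rec = [rng.normal(0, 0.1, (h_dim, h_dim)) for _ in range(N)]    # layer n at t-1 -> layer n at t
    b_h   = [np.zeros(h_dim) for _ in range(N)]
    W_hy  = [rng.normal(0, 0.1, (y_dim, h_dim)) for _ in range(N)]    # layer n -> output (skip connections)
    b_y   = np.zeros(y_dim)

    x = rng.normal(size=(T, x_dim))
    h_prev = [np.zeros(h_dim) for _ in range(N)]
    y_hat = []

    for t in range(T):
        h = []
        for n in range(N):
            pre = W_xh[n] @ x[t] + W_rec[n] @ h_prev[n] + b_h[n]      # input skip + recurrence
            if n > 0:
                pre += W_dn[n] @ h[n - 1]                             # connection from the layer below
            h.append(np.tanh(pre))                                    # H(.) = tanh here, LSTM in the paper
        y_hat.append(b_y + sum(W_hy[n] @ h[n] for n in range(N)))     # output skips from every hidden layer
        h_prev = h

    y_hat = np.stack(y_hat)                                           # (T, y_dim) pre-softmax outputs \hat y_t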
    

    Prediction Network

    Pr(x_{t+1} = k|y_t) = y^k_t = \frac{\exp(\hat y^k_t)}{\sum^K_{k'=1} \exp(\hat y^{k'}_t)}\\
    Hence\\
    L(x) = - \sum ^T_{t=1} \log y^{x_{t+1}}_t \quad \Rightarrow \quad \frac{\partial L(x)}{\partial \hat y^k_t} = y^k_t - \delta_{k, x_{t+1}}
    
    The partial derivatives of the loss with respect to the network weights can be efficiently calculated with backpropagation through time, and the network can then be trained with stochastic gradient descent.
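
    The same computation in NumPy, with made-up numbers, shows how cheaply the gradient at the output layer is obtained (the predicted probabilities minus the one-hot target):

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()                # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    y_hat_t = np.array([1.2, -0.3, 0.5, 2.0])    # \hat y_t: pre-softmax scores over K = 4 symbols
    y_t = softmax(y_hat_t)                       # y_t^k = Pr(x_{t+1} = k | y_t)

    k_next = 3                                   # the symbol that actually occurs next, x_{t+1}
    loss_t = -np.log(y_t[k_next])                # this step's contribution to L(x)

    grad = y_t.copy()
    grad[k_next] -= 1.0                          # dL/d\hat y_t^k = y_t^k - delta_{k, x_{t+1}}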

    3. Experiments


    3.1 Dataset Details


    They used the WMT’14 English-to-French machine translation task.
    They chose this translation task and this specific training-set subset because of the public availability of a tokenized training and test set, together with 1000-best lists from the baseline SMT.
  • SMT: Statistical Machine Translation

    3.2 Decoding and Rescoring


    Training mainly involves maximising the log probability of a correct translation $T$ given the source sentence $S$, averaged over the training set $\mathcal{S}$:
    \frac{1}{|\mathcal{S}|}\sum_{(T,S) \in \mathcal{S}} \log p(T|S)\\
    
    Once training finishes, translation is performed by finding the most likely translation:
    \hat T = \arg\max_T \, p(T|S)
    
    A left-to-right beam-search decoder is used to approximate this argmax.
    The beam size was tested as well, and it turned out that a beam of size 2 already provides most of the benefit of beam search.
    * What is beam search? At each decoding step it keeps only the B most probable partial translations, extends each of them by one word, and re-prunes, instead of greedily committing to a single hypothesis; a toy sketch follows below.
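
    Here is a model-agnostic sketch of beam search in Python (the function name, the toy scorer, and all parameters are my own illustration):

    import math

    def beam_search(step_log_probs, beam_size=2, max_len=10, eos=0):
        """Keep the `beam_size` best partial hypotheses at every decoding step.

        `step_log_probs(prefix)` must return {token: log_prob} for the next token
        given the partial hypothesis `prefix` (a tuple of token ids).
        """
        beams = [((), 0.0)]                      # (prefix, cumulative log-probability)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                for tok, lp in step_log_probs(prefix).items():
                    hyp = (prefix + (tok,), score + lp)
                    (finished if tok == eos else candidates).append(hyp)
            if not candidates:
                break
            beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
        return max(finished + beams, key=lambda h: h[1])

    # Toy next-token distribution: token 0 is <EOS>, tokens 1 and 2 are ordinary words.
    toy = lambda prefix: {0: math.log(0.2), 1: math.log(0.5), 2: math.log(0.3)}
    best_prefix, best_log_prob = beam_search(toy, beam_size=2, max_len=5)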

    3.3 Reversing the Source Sentences
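

    As summarised in the Abstract, the trick is simply to reverse the word order of every source sentence while leaving the target sentences untouched. The first words of the source then end up close to the first words of the target, which introduces many short-term dependencies between the two sentences and makes the optimisation problem easier; this change alone improved the LSTM's performance markedly. A minimal sketch of the preprocessing step, assuming whitespace-tokenised sentence pairs:

    def reverse_sources(pairs):
        """Reverse the token order of each source sentence; the targets stay unchanged."""
        return [(src[::-1], tgt) for src, tgt in pairs]

    pairs = [("I am a student".split(), "je suis etudiant".split())]
    print(reverse_sources(pairs))
    # [(['student', 'a', 'am', 'I'], ['je', 'suis', 'etudiant'])]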



    3.4 Training Details


    Layers: 4
    Cells per layer: 1,000
    Input vocabulary: 160,000 words
    Output vocabulary: 80,000 words => a softmax over 80,000 words at every time step
    Number of parameters in the net: 384M
    Weight initialisation: uniform distribution between -0.08 and 0.08
    Optimisation: SGD (stochastic gradient descent) without momentum, with a fixed learning rate of 0.7
    Batch size: 128
    Hard constraint on the gradient norm (to avoid exploding gradients): for each batch, let $g$ be the gradient divided by the batch size of 128, then
    s = ||g||_2\\
    \text{if } s > 5, \text{ then } g \leftarrow \frac{5g}{s}
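
    A NumPy sketch of this gradient-norm constraint (my own illustration, with a made-up gradient vector):

    import numpy as np

    def clip_gradient(batch_gradient, batch_size=128, threshold=5.0):
        """Rescale the averaged gradient so that its L2 norm never exceeds `threshold`."""
        g = batch_gradient / batch_size          # g: the gradient divided by the batch size
        s = np.linalg.norm(g)                    # s = ||g||_2
        if s > threshold:
            g = threshold * g / s                # g <- 5g / s when s > 5
        return g

    grad = np.random.default_rng(0).normal(size=1000) * 1000.0   # pretend summed batch gradient
    clipped = clip_gradient(grad)                                 # now ||clipped||_2 <= 5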
    

    3.5 Parallelisation


    A C++ implementation of the deep LSTM with the configuration from the previous section processes approximately 1,700 words per second on a single GPU. This was too slow for the authors' purposes, so they parallelised the model across an 8-GPU machine. Each layer of the LSTM was executed on a different GPU and communicated its activations to the next GPU/layer as soon as they were computed. The model has 4 LSTM layers, each of which resides on a separate GPU. The remaining 4 GPUs were used to parallelise the softmax, so each GPU was responsible for multiplying by a 1000 × 20000 matrix. The resulting implementation reached a speed of 6,300 words per second (English and French combined) with a minibatch size of 128. Training took about ten days with this implementation.

    3.6 Experimental Results


    See sec 3.3.

    3.7 Performance on Long Sentences


    See sec 3.3.

    3.8 Model Analysis



    4. Related Work


    RNNLM(recurrent neural network language model)
    T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048, 2010.
    NNLM
    M. Auli, M. Galley, C. Quirk, and G. Zweig. Joint language and translation modeling with recurrent
    neural networks. In EMNLP, 2013.
    J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul. Fast and robust neural network
    joint models for statistical machine translation. In ACL, 2014.
    End-to-End Training
    K. M. Hermann and P. Blunsom. Multilingual distributed representations without word alignment. In
    ICLR, 2014.

    5. Conclusion


    They conclude that this work shows a lot of potential and invites further follow-up; in particular, they were surprised by how much reversing the source sentences helped and by the LSTM's ability to translate long sentences correctly.
