[Review] Sequence to Sequence Learning with Neural Networks

Research Paper Link: https://arxiv.org/pdf/1409.3215.pdf
Github Repo: https://github.com/google/seq2seq
Official Introduction: https://www.tensorflow.org/tutorials/seq2seq

Abstract


DNNs (Deep Neural Networks) work very well on large labelled datasets, but they cannot be used to map sequences to sequences.
In this paper, the authors present a general end-to-end approach to sequence learning that makes minimal assumptions about the sequence structure.
They conducted their experiments on the task of English-to-French translation.
Finally, they found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduces many short-term dependencies between the source and the target sentence, which makes the optimization problem easier.

Content

  • Introduction
  • The model
  • Experiments
  • Dataset details
  • Decoding and Rescoring
  • Reversing the Source Sentences
  • Training details
  • Parallelizations
  • Experimental Results
  • Related Work
  • Conclusion


    1. Introduction


    Using LSTMs, the authors built a model for machine translation.
    An encoder first reads the source sentence and builds a "thought" vector, a sequence of numbers that represents the sentence's meaning; a decoder then processes that sentence vector to emit a translation.
    This is often referred to as the encoder-decoder architecture.
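
    To make this concrete, here is a minimal PyTorch sketch of the encoder-decoder idea (my own illustration, not the authors' implementation): one LSTM encodes the source sentence into its final hidden state, and a second LSTM is initialised with that state and predicts the target sentence one token at a time. All sizes and names below are hypothetical.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        """Minimal encoder-decoder: two separate LSTMs joined only by the final hidden state."""
        def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512, num_layers=4):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb_dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
            self.decoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
            self.out = nn.Linear(hidden_dim, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            # Encode: keep only the final (h, c) state -- the "thought vector".
            _, thought = self.encoder(self.src_emb(src_ids))
            # Decode: start from the thought vector and predict the next target token at each step.
            dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), thought)
            return self.out(dec_out)                     # logits over the target vocabulary

    model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)      # the paper uses 160k/80k vocabularies
    src = torch.randint(0, 1000, (2, 7))                 # a batch of 2 source sentences, 7 tokens each
    tgt = torch.randint(0, 1000, (2, 9))                 # the corresponding target prefixes
    logits = model(src, tgt)                             # shape: (2, 9, 1000)

    Training then maximises the log-probability of each correct target token given the source, which is the objective described in section 3.2 below.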

    Thought Vectors


    This vector has been called, by various people, an "embedding", a "representational vector" or a "latent vector". But Geoff Hinton, in a stroke of marketing genius, gave it the name "thought vector".

    Link: http://gabgoh.github.io/ThoughtVectors/

    2. The model


    Their model is based on the one introduced by Alex Graves.
    On top of it, they added several features:
    1. Two separate LSTMs (encoder and decoder)
    2. Four stacked LSTM layers (a deep RNN)
    3. Feeding the source sentence in reversed order
    4. Beam search at the output layer
    So, to deeply understand their architecture, let's have a look at his research paper.
    Title: Generating Sequences With Recurrent Neural Networks
    Author: Alex Graves
    Date: 5/6/2014

    Abstract


    This paper shows how Long Short-Term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text and online handwriting.

    RNN Prediction Architecture


    [Figure: deep recurrent neural network prediction architecture with skip connections]

    Input: $X = (x_1, x_2 , ... , x_T)$
    Output: $Y = (y_1, y_2 , ... , y_T)$
    Hidden: $h^n = (h^n_1, h^n_2, ... , h^n_T)$
    Predictive distribution: $\Pr(x_{t+1}|y_t)$, parameterised by the output vector $y_t$
    * At the first step ($t=1$), the input $x_1$ is normally a null (all-zero) vector, so the first output $y_1$ gives the prediction for $x_2$

    Skip Connections (from Input to all hidden layers and all hidden layers to outputs)


    Input to Hidden skip connections (for the first hidden layer, n = 1, the layer-below term is absent)
    h^n_t = H(W^{xh^n}x_t + W^{h^{n-1}h^n}h^{n-1}_{t} + W^{h^n h^n}h^{n}_{t-1} + b^n_h)
    
    Hidden to Output skip connections
    Cost function: negative log-likelihood (defined below)
    Output layer ($y_t$): softmax over the K possible next symbols
    \hat y_t = b_y + \sum ^N_{n=1} W^{h^ny}h^n_t\\
    y_t = Y(\hat y_t)\\
    Pr(x) = \prod^T_{t=1} Pr(x_{t+1}|y_t)\\
    L(x) = - \sum ^T_{t=1} \log Pr(x_{t+1}|y_t)\\
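
    Putting the two sets of skip connections together, here is a toy NumPy forward pass (my own sketch: plain tanh units stand in for the LSTM cells used in the paper, and all sizes are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    T, x_dim, h_dim, y_dim, N = 5, 8, 16, 10, 3                       # toy sequence length, sizes, N hidden layers

    W_xh  = [rng.normal(0, 0.1, (h_dim, x_dim)) for _ in range(N)]    # input -> layer n (skip connections)
    W_dn  = [rng.normal(0, 0.1, (h_dim, h_dim)) for _ in range(N)]    # layer n-1 -> layer n (unused for n = 1)
    W_rec = [rng.normal(0, 0.1, (h_dim, h_dim)) for _ in range(N)]    # layer n at t-1 -> layer n at t
    b_h   = [np.zeros(h_dim) for _ in range(N)]
    W_hy  = [rng.normal(0, 0.1, (y_dim, h_dim)) for _ in range(N)]    # layer n -> output (skip connections)
    b_y   = np.zeros(y_dim)

    x = rng.normal(size=(T, x_dim))
    h_prev = [np.zeros(h_dim) for _ in range(N)]
    y_hat = []

    for t in range(T):
        h = []
        for n in range(N):
            pre = W_xh[n] @ x[t] + W_rec[n] @ h_prev[n] + b_h[n]      # input skip + recurrence
            if n > 0:
                pre += W_dn[n] @ h[n - 1]                             # connection from the layer below
            h.append(np.tanh(pre))                                    # H(.) = tanh here, LSTM in the paper
        y_hat.append(b_y + sum(W_hy[n] @ h[n] for n in range(N)))     # output skips from every hidden layer
        h_prev = h

    y_hat = np.stack(y_hat)                                           # (T, y_dim) pre-softmax outputs \hat y_t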
    

    Prediction Network

    Pr(x_{t+1} = k|y_t) = y^k_t = \frac{\exp(\hat y^k_t)}{\sum^K_{k'=1} \exp(\hat y^{k'}_t)}\\
    Hence\\
    L(x) = - \sum ^T_{t=1} \log y^{x_{t+1}}_t \quad \Rightarrow \quad \frac{\partial L(x)}{\partial \hat y^k_t} = y^k_t - \delta_{k, x_{t+1}}
    
    The partial derivatives of the loss with respect to the network weights can be efficiently calculated with backpropagation through time, and the network can then be trained with stochastic gradient descent.
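
    The same computation in NumPy, with made-up numbers, shows how cheaply the gradient at the output layer is obtained (the predicted probabilities minus the one-hot target):

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()                # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    y_hat_t = np.array([1.2, -0.3, 0.5, 2.0])    # \hat y_t: pre-softmax scores over K = 4 symbols
    y_t = softmax(y_hat_t)                       # y_t^k = Pr(x_{t+1} = k | y_t)

    k_next = 3                                   # the symbol that actually occurs next, x_{t+1}
    loss_t = -np.log(y_t[k_next])                # this step's contribution to L(x)

    grad = y_t.copy()
    grad[k_next] -= 1.0                          # dL/d\hat y_t^k = y_t^k - delta_{k, x_{t+1}}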

    3. Experiments


    3.1 Dataset Details


    They used the WMT’14 English-to-French machine translation task.
    They chose this translation task and this specific training-set subset because of the public availability of a tokenized training and test set, together with 1000-best lists from the baseline SMT.
  • SMT: Statistical Machine Translation

    3.2 Decoding and Rescoring


    Training mainly involves maximising the log probability of a correct translation $T$ given the source sentence $S$, averaged over the training set $\mathcal{S}$:
    \frac{1}{|\mathcal{S}|}\sum_{(T,S) \in \mathcal{S}} \log p(T|S)\\
    
    Once training finishes, translation is performed by finding the most likely translation:
    \hat T = \arg\max_T \, p(T|S)
    
    A left-to-right beam-search decoder is used to approximate this argmax.
    The beam size was tested as well, and it turned out that a beam of size 2 already provides most of the benefit of beam search.
    * What is beam search? At each decoding step it keeps only the B most probable partial translations, extends each of them by one word, and re-prunes, instead of greedily committing to a single hypothesis; a toy sketch follows below.
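
    Here is a model-agnostic sketch of beam search in Python (the function name, the toy scorer, and all parameters are my own illustration):

    import math

    def beam_search(step_log_probs, beam_size=2, max_len=10, eos=0):
        """Keep the `beam_size` best partial hypotheses at every decoding step.

        `step_log_probs(prefix)` must return {token: log_prob} for the next token
        given the partial hypothesis `prefix` (a tuple of token ids).
        """
        beams = [((), 0.0)]                      # (prefix, cumulative log-probability)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                for tok, lp in step_log_probs(prefix).items():
                    hyp = (prefix + (tok,), score + lp)
                    (finished if tok == eos else candidates).append(hyp)
            if not candidates:
                break
            beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
        return max(finished + beams, key=lambda h: h[1])

    # Toy next-token distribution: token 0 is <EOS>, tokens 1 and 2 are ordinary words.
    toy = lambda prefix: {0: math.log(0.2), 1: math.log(0.5), 2: math.log(0.3)}
    best_prefix, best_log_prob = beam_search(toy, beam_size=2, max_len=5)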

    3.3 Reversing the Source Sentences
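

    As summarised in the Abstract, the trick is simply to reverse the word order of every source sentence while leaving the target sentences untouched. The first words of the source then end up close to the first words of the target, which introduces many short-term dependencies between the two sentences and makes the optimisation problem easier; this change alone improved the LSTM's performance markedly. A minimal sketch of the preprocessing step, assuming whitespace-tokenised sentence pairs:

    def reverse_sources(pairs):
        """Reverse the token order of each source sentence; the targets stay unchanged."""
        return [(src[::-1], tgt) for src, tgt in pairs]

    pairs = [("I am a student".split(), "je suis etudiant".split())]
    print(reverse_sources(pairs))
    # [(['student', 'a', 'am', 'I'], ['je', 'suis', 'etudiant'])]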



    3.4 Training Details


    Layers: 4
    Cells per layer: 1,000
    Input vocabulary: 160,000 words
    Output vocabulary: 80,000 words => a softmax over 80,000 words at every time step
    Number of parameters in the net: 384M
    Weight initialisation: uniform distribution between -0.08 and 0.08
    Optimisation: SGD (stochastic gradient descent) without momentum, with a fixed learning rate of 0.7
    Batch size: 128
    Hard constraint on the gradient norm (to avoid exploding gradients): for each batch, let $g$ be the gradient divided by the batch size of 128, then
    s = ||g||_2\\
    \text{if } s > 5, \text{ then } g \leftarrow \frac{5g}{s}
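
    A NumPy sketch of this gradient-norm constraint (my own illustration, with a made-up gradient vector):

    import numpy as np

    def clip_gradient(batch_gradient, batch_size=128, threshold=5.0):
        """Rescale the averaged gradient so that its L2 norm never exceeds `threshold`."""
        g = batch_gradient / batch_size          # g: the gradient divided by the batch size
        s = np.linalg.norm(g)                    # s = ||g||_2
        if s > threshold:
            g = threshold * g / s                # g <- 5g / s when s > 5
        return g

    grad = np.random.default_rng(0).normal(size=1000) * 1000.0   # pretend summed batch gradient
    clipped = clip_gradient(grad)                                 # now ||clipped||_2 <= 5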
    

    3.5 Parallelisation


    A C++ implementation of the deep LSTM with the configuration from the previous section processes approximately 1,700 words per second on a single GPU. This was too slow for the authors' purposes, so they parallelised the model across an 8-GPU machine. Each layer of the LSTM was executed on a different GPU and communicated its activations to the next GPU/layer as soon as they were computed. The model has 4 LSTM layers, each of which resides on a separate GPU. The remaining 4 GPUs were used to parallelise the softmax, so each GPU was responsible for multiplying by a 1000 × 20000 matrix. The resulting implementation reached a speed of 6,300 words per second (English and French combined) with a minibatch size of 128. Training took about ten days with this implementation.

    3.6 Experimental Results


    See sec 3.3.

    3.7 Performance on Long Sentences


    See sec 3.3.

    3.8 Model Analysis



    4. Related Work


    RNNLM(recurrent neural network language model)
    T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048, 2010.
    NNLM
    M. Auli, M. Galley, C. Quirk, and G. Zweig. Joint language and translation modeling with recurrent
    neural networks. In EMNLP, 2013.
    J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul. Fast and robust neural network
    joint models for statistical machine translation. In ACL, 2014.
    End-to-End Training
    K. M. Hermann and P. Blunsom. Multilingual distributed representations without word alignment. In
    ICLR, 2014.

    5. Conclusion


    They conclude that this work shows a lot of potential and invites further follow-up; in particular, they were surprised by how much reversing the source sentences helped and by the LSTM's ability to translate long sentences correctly.
