A paper on a new, simple network architecture, the Transformer, based solely on attention mechanisms.
The NIPS 2017 accepted paper Attention Is All You Need introduces the Transformer, a model architecture that relies entirely on attention mechanisms to draw global dependencies between input and output. The paper was written by researchers from Google: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.
The Transformer – Attention is all you need
What problem is the paper attempting to solve?
Recurrent neural networks (RNNs), long short-term memory networks (LSTMs) and gated RNNs are the most popular approaches for sequence modelling tasks such as machine translation and language modelling. However, these recurrent models handle a sequence word by word, in strictly sequential order. This sequentiality is an obstacle to parallelizing the computation. Moreover, when sequences are very long, the model is prone to forgetting the content of distant positions in the sequence, or to mixing it up with the content of positions that follow.
Recent work has achieved significant improvements in computational efficiency and model performance through factorization tricks and conditional computation, but these techniques do not remove the fundamental constraint of sequential computation.
Attention mechanisms are one way to mitigate this forgetting problem, because they allow dependencies to be modelled without regard to their distance in the input or output sequences. Thanks to this property, attention has become an integral part of sequence modelling and transduction models. In most cases, however, attention mechanisms are used in conjunction with a recurrent network.
Paper summary
The Transformer proposed in this paper is a model architecture that relies entirely on attention mechanisms to draw global dependencies between input and output. It allows for significantly more parallelization and reaches a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Neural sequence transduction models generally have an encoder-decoder structure. The encoder maps an input sequence of symbol representations to a sequence of continuous representations. Given those representations, the decoder generates an output sequence of symbols one element at a time, consuming its previously generated symbols as additional input at each step, i.e. it is auto-regressive.
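As a rough illustration of this encoder-decoder loop, here is a minimal sketch in plain Python. The encode and decode_step functions and the BOS/EOS token ids are hypothetical stand-ins for illustration only, not components from the paper.

```python
# Minimal sketch of auto-regressive encoder-decoder transduction.
# `encode` and `decode_step` are hypothetical stand-ins for the real model.

BOS, EOS = 0, 1  # assumed special token ids: begin / end of sequence

def encode(src_tokens):
    # Stand-in encoder: map the input symbols to "continuous representations".
    return [float(t) for t in src_tokens]

def decode_step(memory, generated):
    # Stand-in decoder step: given the encoder output and everything generated
    # so far, produce the next output symbol.
    if len(generated) > len(memory):
        return EOS
    return int(memory[len(generated) - 1])

def translate(src_tokens, max_len=50):
    memory = encode(src_tokens)               # the encoder runs once over the whole input
    generated = [BOS]
    for _ in range(max_len):                  # the decoder emits one symbol at a time,
        nxt = decode_step(memory, generated)  # conditioned on its own previous outputs
        generated.append(nxt)
        if nxt == EOS:
            break
    return generated[1:]

print(translate([7, 8, 9]))  # [7, 8, 9, 1] with these toy stand-ins (1 = EOS)
```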
(Figure: architecture of the Transformer)
The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder; a simplified encoder layer is sketched below.
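To make the "stacked self-attention plus point-wise feed-forward" structure concrete, here is a minimal single-head NumPy sketch of one encoder layer. The sizes (d_model, d_ff), the random untrained weights, and the omission of multi-head attention, dropout, masking and positional encodings are simplifying assumptions; this is not the paper's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 32, 5          # assumed toy sizes: model dim, feed-forward dim, sequence length

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

# Random (untrained) parameters for one encoder layer.
Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(4))
W1, b1 = rng.normal(0, 0.1, (d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.1, (d_ff, d_model)), np.zeros(d_model)

def encoder_layer(x):
    # Sub-layer 1: (single-head) self-attention, with residual connection and layer norm.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_model)) @ v
    x = layer_norm(x + attn @ Wo)
    # Sub-layer 2: position-wise feed-forward network, applied to every position independently.
    ff = np.maximum(0, x @ W1 + b1) @ W2 + b2
    return layer_norm(x + ff)

x = rng.normal(size=(n, d_model))    # n positions, each represented by a d_model-dim vector
print(encoder_layer(x).shape)        # (5, 8): same shape in, same shape out
```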
The authors motivate the use of self-attention with three criteria.
One is the total computational complexity per layer.
Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
The third is the path length between long-range dependencies in the network.
The Transformer uses two different types of attention functions (both are sketched in the code below):
Scaled Dot-Product Attention computes the attention function on a set of queries simultaneously, packed together into a matrix Q; the keys and values are likewise packed into matrices K and V.
Multi-Head Attention allows the model to jointly attend to information from different representation subspaces at different positions, by running several scaled dot-product attention functions in parallel and concatenating their outputs.
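For concreteness, the following NumPy sketch implements scaled dot-product attention and a small multi-head wrapper around it. The head count h, the model dimension d_model and the random projection matrices are assumptions for illustration; in the real model these projections are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    return softmax(scores) @ V           # weighted sum of the values

def multi_head_attention(x, h=4, d_model=8):
    # Project the input h times with different (here: random, untrained) matrices,
    # attend in each subspace in parallel, then concatenate and project back.
    d_k = d_model // h
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.normal(0, 0.1, (d_model, d_k)) for _ in range(3))
        heads.append(scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv))
    Wo = rng.normal(0, 0.1, (h * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ Wo

x = rng.normal(size=(5, 8))              # self-attention over 5 positions, d_model = 8
print(multi_head_attention(x).shape)     # (5, 8)
```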
A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.
In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with the sentence representations used in machine translation.
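Concretely, the paper's comparison of the two layer types, for sequence length n and representation dimensionality d, comes down to:
Complexity per layer: self-attention O(n^2 · d) versus recurrent O(n · d^2).
Minimum number of sequential operations: self-attention O(1) versus recurrent O(n).
Maximum path length between any two positions: self-attention O(1) versus recurrent O(n).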