This is a comprehensive implementation of the Transformer architecture as described in the "Attention Is All You Need" paper. Here's a detailed breakdown of the translation model:
The model uses a standard Transformer architecture with the following components:
Encoder: Comprising N layers, each with multi-head attention and feed-forward networks.
Decoder: Similar to the encoder, with an additional attention mechanism to attend to the encoder output.
Attention Mechanisms: Scaled dot-product attention used throughout, which determines where the model focuses during translation (and can be visualized).
The PositionalEmbeddings class adds position information to the input embeddings. It creates a matrix of positional encodings using sine and cosine functions. These encodings are added to the input embeddings to give the model information about the sequence order.
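The sine/cosine scheme above can be sketched in plain Python. This is a stdlib-only illustration of the formulas from the paper, not the actual `PositionalEmbeddings` module (which would precompute this as a tensor):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from the paper:

    pe[pos][2i]   = sin(pos / 10000^(2i / d_model))
    pe[pos][2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimensions use sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe
```

Because each dimension is a sinusoid of a different wavelength, the encoding of any position is a fixed linear function of the encodings of nearby positions, which is what lets the model reason about relative order.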
The LayerNormalization class implements layer normalization, which helps stabilize the learning process. It normalizes the inputs across the features.
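The normalization step can be sketched for a single feature vector. The real module also learns a per-feature scale (gamma) and bias (beta); here they are fixed to 1 and 0 for clarity:

```python
def layer_norm(x, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance.

    eps guards against division by zero when the variance is tiny.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]
```

Unlike batch normalization, the statistics are computed per position across the feature dimension, so the operation is independent of batch size and sequence length.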
The FeedForwardBlock class implements the position-wise feed-forward networks used in both the encoder and decoder. It consists of two linear transformations with a ReLU activation in between.
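For one position, the feed-forward block computes `relu(x @ W1 + b1) @ W2 + b2`. A minimal sketch with plain lists (the real block applies this to every position in parallel and includes dropout):

```python
def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN for a single position vector x.

    w1: d_model x d_ff, w2: d_ff x d_model (as nested lists).
    """
    # First linear layer followed by ReLU.
    hidden = [max(0.0, sum(xi * w1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    # Second linear layer projects back to d_model.
    return [sum(h * w2[i][j] for i, h in enumerate(hidden)) + b2[j]
            for j in range(len(b2))]
```

In the paper the inner dimension d_ff is 2048 against a model dimension of 512, so the block expands, applies the nonlinearity, and contracts.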
The MultiHeadAttentionBlock class implements the multi-head attention mechanism. It projects the input into query, key, and value vectors, splits them into multiple heads, computes scaled dot-product attention, and concatenates the results.
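The core of each head is scaled dot-product attention: `softmax(Q K^T / sqrt(d_k)) V`. A single-head, stdlib-only sketch (the real block adds the Q/K/V/output projections, masking, and the head split/concat):

```python
import math

def attention(q, k, v):
    """Scaled dot-product attention for one head.

    q, k: seq_len x d_k, v: seq_len x d_v (nested lists).
    """
    d_k = len(q[0])
    out = []
    for qi in q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out
```

The sqrt(d_k) scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.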
The ResidualConnection class implements the residual connections used throughout the model. It applies layer normalization to the input, passes the result through the sublayer (attention or feed-forward), applies dropout to the sublayer output, and adds that back to the original input.
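This pre-norm wiring, `x + dropout(sublayer(norm(x)))`, can be sketched in one line; the function arguments here are stand-ins for the actual modules:

```python
def residual_connection(x, sublayer, norm, dropout):
    """Pre-norm residual: x + dropout(sublayer(norm(x))).

    x is a feature vector; sublayer, norm, and dropout are callables
    standing in for the attention/FFN, LayerNormalization, and
    dropout modules.
    """
    return [a + b for a, b in zip(x, dropout(sublayer(norm(x))))]
```

Note this applies normalization before the sublayer; the original paper normalizes after the residual addition, but the pre-norm variant is common in implementations because it trains more stably.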
The EncoderBlock class combines self-attention and feed-forward layers with residual connections to form a single encoder layer.
The Encoder class stacks multiple encoder blocks and applies a final layer normalization.
The DecoderBlock class is similar to the encoder block but includes an additional cross-attention layer that attends to the encoder output.
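The decoder's self-attention additionally needs a causal mask so that, during training, each position can only attend to earlier positions. A minimal sketch of building such a mask (masked-out scores are set to a large negative value before the softmax so future tokens get zero weight):

```python
def causal_mask(size):
    """Lower-triangular boolean mask for decoder self-attention.

    True marks positions a query may attend to: position i can
    see positions 0..i, but nothing after it.
    """
    return [[j <= i for j in range(size)] for i in range(size)]
```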
The Decoder class stacks multiple decoder blocks and applies a final layer normalization.
The ProjectionLayer class converts the decoder output to logits over the target vocabulary.
The Transformer class combines all the above components into a complete model. It includes methods for encoding, decoding, and projecting.
The code follows the original Transformer architecture closely, including details like scaling the embeddings by the square root of the model dimension, using layer normalization, and initializing parameters with Xavier uniform initialization.
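Two of these details are small enough to show directly. Xavier uniform initialization samples from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)), and the paper multiplies embeddings by sqrt(d_model) before adding the positional encodings; both are sketched below (function names are illustrative):

```python
import math

def xavier_uniform_bound(fan_in, fan_out):
    """Bound a of the Xavier/Glorot uniform distribution U(-a, a)."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def scale_embedding(vec, d_model):
    """Embedding scaling from the paper: multiply by sqrt(d_model)."""
    return [v * math.sqrt(d_model) for v in vec]
```

The scaling keeps the embedding magnitudes comparable to the positional encodings so neither term dominates their sum.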