> [!WARNING]
> This repository was developed for academic and personal purposes, to better understand the underlying architecture of the Transformer and to reuse it in future small projects.
You can install the package by either:

- using pip

  ```bash
  pip install git+https://github.com/RistoAle97/yati
  ```

  This will not install the `dev` dependencies listed in `pyproject.toml`.

- cloning the repository and installing the dependencies

  ```bash
  git clone https://github.com/RistoAle97/yati
  pip install -e yati
  pip install yati[dev]  # if you want to contribute to this project
  ```

  A quick sanity check of the installation is sketched right below.
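If you want to confirm that the install succeeded, the minimal check below queries the installed distribution's version through Python's standard `importlib.metadata`. It only assumes the distribution name is `yati`, as used in the commands above; it is not part of the repository's documented usage.

```python
# Minimal install check (assumption: the distribution is named "yati",
# as in the pip commands above; not part of the repository's documented usage).
from importlib.metadata import PackageNotFoundError, version

try:
    print(f"yati version: {version('yati')}")
except PackageNotFoundError:
    print("yati does not appear to be installed")
```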
> [!NOTE]
> Some implementation choices differ from the original paper [1]:
>
> - The source and target embeddings are shared, so a unified vocabulary is needed (e.g., a single vocabulary covering all the languages in an NMT task).
> - The embeddings are tied to the output linear layer (i.e., they share their weights).
> - Pre-normalization is employed instead of post-normalization [2].
> - Layer normalization [3] is performed at the end of both the encoder and decoder stacks.
> - There is no softmax layer, since the softmax is already applied inside PyTorch's CrossEntropyLoss (a minimal sketch illustrating these choices follows this note).
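To make these choices concrete, here is a small, self-contained sketch built on plain PyTorch modules rather than on this repository's actual classes; the name `TinySeq2SeqTransformer` and all hyperparameters are illustrative only. It shows one embedding shared by source and target, an output projection tied to that embedding, pre-normalization with a final `LayerNorm` on both stacks (the `nn.Transformer` defaults with `norm_first=True`), and raw logits handed to `CrossEntropyLoss` with no softmax layer.

```python
import torch
import torch.nn as nn


class TinySeq2SeqTransformer(nn.Module):
    """Illustrative sketch of the listed choices, not the repository's API."""

    def __init__(self, vocab_size: int, d_model: int = 512, nhead: int = 8,
                 num_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1):
        super().__init__()
        # Single embedding shared by source and target (unified vocabulary).
        self.embedding = nn.Embedding(vocab_size, d_model)
        # norm_first=True -> pre-normalization inside each layer; by default
        # nn.Transformer also applies a final LayerNorm at the end of both stacks.
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            norm_first=True,
            batch_first=True,
        )
        # Output projection tied to the embedding (they share the same weight matrix).
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embedding.weight

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Positional encodings and the sqrt(d_model) embedding scaling are omitted for brevity.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(self.embedding(src), self.embedding(tgt), tgt_mask=tgt_mask)
        return self.lm_head(out)  # raw logits: no softmax layer here


model = TinySeq2SeqTransformer(vocab_size=32000)
src = torch.randint(0, 32000, (2, 10))   # (batch, source length)
tgt = torch.randint(0, 32000, (2, 12))   # (batch, target length)
logits = model(src, tgt[:, :-1])         # teacher forcing: shifted decoder input
# CrossEntropyLoss applies log-softmax internally, hence no softmax in the model.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
```

Sharing one embedding between source and target is what makes the unified vocabulary necessary, and tying it to the output projection removes a separate vocabulary-sized weight matrix.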
Below is a comparison between the original Transformer and the one implemented in this repository.
| Original | This repository |
| --- | --- |
| *(architecture diagram)* | *(architecture diagram)* |
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

[2] Nguyen, Toan Q., and Julian Salazar. "Transformers without tears: Improving the normalization of self-attention." arXiv preprint arXiv:1910.05895 (2019).

[3] Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer normalization." arXiv preprint arXiv:1607.06450 (2016).
Some additional nice reads:
This project is MIT licensed.