Implementation of the paper "How Many Layers and Why? An Analysis of the Model Depth in Transformers". In this study, we investigate the role of multiple layers in deep transformer models. We design a variant of ALBERT that dynamically adapts the number of layers applied to each token of the input.
We augment a multi-layer transformer encoder with a halting mechanism, directly adapted from Graves (2016), that dynamically adjusts the number of layers applied to each token. At each iteration, we compute, for each token, the probability that it stops updating its state.
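For illustration, below is a minimal sketch of such a per-token halting mechanism in the style of Adaptive Computation Time (Graves, 2016). The module and variable names (`HaltingEncoder`, `eps`, etc.) are our own illustrative choices and are not taken from the released code; the sketch assumes a single weight-shared encoder layer, as in ALBERT.

```python
import torch
import torch.nn as nn


class HaltingEncoder(nn.Module):
    """Applies a shared transformer layer repeatedly, letting each token halt independently."""

    def __init__(self, layer: nn.Module, hidden_size: int, max_iterations: int, eps: float = 0.01):
        super().__init__()
        self.layer = layer                        # a single (ALBERT-style, weight-shared) encoder layer
        self.halting = nn.Linear(hidden_size, 1)  # produces the per-token halting probability
        self.max_iterations = max_iterations
        self.threshold = 1.0 - eps

    def forward(self, hidden_states: torch.Tensor):
        batch_size, seq_len, _ = hidden_states.shape
        halting_prob = hidden_states.new_zeros(batch_size, seq_len)  # cumulative halting probability
        remainders = hidden_states.new_zeros(batch_size, seq_len)
        n_updates = hidden_states.new_zeros(batch_size, seq_len)
        state = hidden_states

        for _ in range(self.max_iterations):
            # Probability of halting at this iteration, one scalar per token.
            p = torch.sigmoid(self.halting(state)).squeeze(-1)
            still_running = (halting_prob < self.threshold).float()
            new_halted = ((halting_prob + p * still_running) > self.threshold).float() * still_running
            still_running = ((halting_prob + p * still_running) <= self.threshold).float() * still_running

            halting_prob = halting_prob + p * still_running
            remainders = remainders + new_halted * (1.0 - halting_prob)
            halting_prob = halting_prob + new_halted * remainders
            n_updates = n_updates + still_running + new_halted

            # Tokens that just halted contribute their remainder; running tokens contribute p.
            update_weight = (p * still_running + new_halted * remainders).unsqueeze(-1)
            state = self.layer(state) * update_weight + state * (1.0 - update_weight)

            if still_running.sum() == 0:  # every token has halted
                break

        # Ponder cost from Graves (2016): encourages tokens to halt early; weighted by tau in the loss.
        ponder_cost = (n_updates + remainders).mean()
        return state, ponder_cost
```

As a usage example under the same assumptions, wrapping a single `nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)` (a stand-in for the shared ALBERT layer) with `HaltingEncoder(layer, hidden_size=128, max_iterations=6)` reproduces the iteration budget of the tiny configuration below.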
We pre-train three models, tiny, small, and base; their hyper-parameters, pre-training accuracies, and average number of iterations per token type are detailed below.
Models | tiny | small | base |
---|---|---|---|
τ (ACT penalty) | 1e-3 | 5e-4 | 2.5e-4 |
Max iterations | 6 | 12 | 24 |
MLM (Acc.) | 55.4 | 57.1 | 57.4 |
SOP (Acc.) | 80.9 | 83.9 | 84.3 |
Avg. iterations: all tokens | 3.8 | 7.1 | 10.0 |
Avg. iterations: all unmasked tokens | 3.5 | 6.5 | 9.2 |
Avg. iterations: [MASK/MASK] | 5.8 | 10.9 | 16.0 |
Avg. iterations: [MASK/random] | 5.8 | 10.9 | 16.0 |
Avg. iterations: [MASK/original] | 4.0 | 7.4 | 10.5 |
Avg. iterations: [CLS] | 6.0 | 12.0 | 22.5 |
Avg. iterations: [SEP] | 2.5 | 7.6 | 8.4 |
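As a rough illustration of how τ enters training, the sketch below combines the masked language modelling (MLM) and sentence-order prediction (SOP) losses with the ACT ponder cost. The function and argument names are assumptions made for the example, not the repository's API.

```python
import torch


def pretraining_loss(mlm_loss: torch.Tensor,
                     sop_loss: torch.Tensor,
                     ponder_cost: torch.Tensor,
                     tau: float = 2.5e-4) -> torch.Tensor:
    """Total pre-training objective: MLM + SOP plus the ACT penalty weighted by tau.

    tau corresponds to the first row of the table above (e.g. 2.5e-4 for the base model);
    a larger tau penalizes extra iterations more strongly, pushing tokens to halt earlier.
    """
    return mlm_loss + sop_loss + tau * ponder_cost
```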
We provide a set of demonstration notebooks showing how to use the model:
Notebook | Link |
---|---|
Pre-training ALBERT (ACT Penalty) | |
Fine-tuning ALBERT (GLUE & HP search) | |
If you use our iterative transformer model in a scientific publication or an industrial application, please cite the following paper:
```bibtex
@inproceedings{simoulin-crabbe-2021-many,
    title = "How Many Layers and Why? {A}n Analysis of the Model Depth in Transformers",
    author = "Simoulin, Antoine and
      Crabb{\'e}, Benoit",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-srw.23",
    doi = "10.18653/v1/2021.acl-srw.23",
    pages = "221--228",
}
```
Alex Graves. 2016. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983.