Implementation of the paper "How Many Layers and Why? An Analysis of the Model Depth in Transformers". In this study, we investigate the role of the multiple layers in deep transformer models. We design a variant of ALBERT that dynamically adapts the number of layers for each token of the input.
We augment a multi-layer transformer encoder with a halting mechanism, which allows dynamically adjusting the number of layers for each token. We directly adapted this mechanism from Graves (2016). At each iteration, we compute a probability for each token to stop updating its state.
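The halting loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the names `adaptive_depth_encode`, `layer`, `halt_weights`, `halt_bias`, and `epsilon` are hypothetical, the halting probability is assumed to be a sigmoid of a linear score of the token state, and the output is the halting-weighted mean of intermediate states as in Graves (2016).

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_depth_encode(
    states: List[Vector],
    layer: Callable[[List[Vector]], List[Vector]],  # stand-in for the shared layer
    halt_weights: Vector,                           # hypothetical halting-unit weights
    halt_bias: float,
    max_iterations: int,
    epsilon: float = 0.01,
) -> Tuple[List[Vector], List[int], List[float]]:
    """Apply one shared layer repeatedly, halting each token independently."""
    states = [list(s) for s in states]   # work on a copy of the input
    n_tokens, dim = len(states), len(states[0])
    cumulative = [0.0] * n_tokens        # halting probability spent so far
    remainders = [0.0] * n_tokens        # weight given to the final update
    n_updates = [0] * n_tokens           # number of layers applied per token
    halted = [False] * n_tokens
    outputs = [[0.0] * dim for _ in range(n_tokens)]

    for step in range(1, max_iterations + 1):
        if all(halted):
            break
        # Halted tokens are still passed to the layer as context.
        candidates = layer(states)
        for t in range(n_tokens):
            if halted[t]:
                continue                 # frozen: this token's state is no longer updated
            states[t] = candidates[t]
            score = sum(w * x for w, x in zip(halt_weights, states[t])) + halt_bias
            h = sigmoid(score)
            # Stop when the accumulated probability reaches 1 - epsilon,
            # or when the iteration budget is exhausted.
            is_last = step == max_iterations or cumulative[t] + h >= 1.0 - epsilon
            weight = 1.0 - cumulative[t] if is_last else h
            for d in range(dim):
                outputs[t][d] += weight * states[t][d]
            n_updates[t] = step
            if is_last:
                remainders[t] = weight
                halted[t] = True
            else:
                cumulative[t] += h
    return outputs, n_updates, remainders


# Toy usage: an identity "layer" and a halting unit that reads the single feature.
identity = lambda s: [list(v) for v in s]
_, depths, _ = adaptive_depth_encode([[5.0], [-5.0]], identity, [1.0], 0.0, 6)
print(depths)  # the first token halts early, the second uses the full budget
```

Because the per-token weights sum to one, the output stays on the same scale as the intermediate states regardless of how many iterations a token uses.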
We pretrain three models (tiny, small, and base). The table below details their hyper-parameters and pre-training results.
| Models | tiny | small | base |
|---|---|---|---|
| τ | 1e-3 | 5e-4 | 2.5e-4 |
| Max iterations | 6 | 12 | 24 |
| mlm (Acc.) | 55.4 | 57.1 | 57.4 |
| sop (Acc.) | 80.9 | 83.9 | 84.3 |
| All tokens | 3.8 | 7.1 | 10.0 |
| All unmasked tokens | 3.5 | 6.5 | 9.2 |
| [MASK/MASK] | 5.8 | 10.9 | 16.0 |
| [MASK/random] | 5.8 | 10.9 | 16.0 |
| [MASK/original] | 4.0 | 7.4 | 10.5 |
| [CLS] | 6.0 | 12.0 | 22.5 |
| [SEP] | 2.5 | 7.6 | 8.4 |
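To illustrate how a penalty such as τ could enter training, here is a small sketch. It assumes τ plays the role of the time-penalty weight from Graves (2016), where each token's ponder cost is its number of updates plus its final remainder. The function name `act_penalized_loss`, its arguments, and the choice to average the ponder cost over tokens are assumptions made for illustration; they are not taken from this repository.

```python
from typing import Sequence


def act_penalized_loss(
    task_loss: float,
    n_updates: Sequence[int],     # layers applied to each token
    remainders: Sequence[float],  # remainder probability of each token's last update
    tau: float,                   # assumed ACT time-penalty weight
) -> float:
    """Add a ponder-cost penalty to a task loss (e.g. the MLM + SOP loss)."""
    ponder_costs = [n + r for n, r in zip(n_updates, remainders)]
    mean_ponder = sum(ponder_costs) / len(ponder_costs)  # averaging is an assumption
    return task_loss + tau * mean_ponder


# Two tokens: one halts after 1 layer, the other after 6.
print(act_penalized_loss(2.0, [1, 6], [1.0, 0.5], tau=1e-3))
```

Under this reading, a larger τ makes additional iterations more expensive and so pushes tokens to halt earlier.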
We provide a set of demonstration notebooks to use the model:
| Notebook | Link |
|---|---|
| Pre-training ALBERT (ACT Penalty) | |
| Fine-tuning ALBERT (glue & hp search) | |
If you use our iterative transformer model for your scientific publication or your industrial applications, please cite the following paper:
title = "How Many Layers and Why? {A}n Analysis of the Model Depth in Transformers",
author = "Simoulin, Antoine and
Crabb{\'e}, Benoit",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "",
doi = "10.18653/v1/2021.acl-srw.23",
pages = "221--228",
Alex Graves. 2016. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983.