Releases: huggingface/transformers

v4.4.0: S2T, M2M100, I-BERT, mBART-50, DeBERTa-v2, XLSR-Wav2Vec2

16 Mar 15:39

SpeechToText

Two new models are released as part of the S2T implementation: Speech2TextModel and Speech2TextForConditionalGeneration, in PyTorch.

Speech2Text is a speech model that accepts a float tensor of log-mel filter-bank features extracted from the speech signal. It’s a transformer-based seq2seq model, so the transcripts/translations are generated autoregressively.

The Speech2Text model was proposed in fairseq S2T: Fast Speech-to-Text Modeling with fairseq by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=speech_to_text
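As a quick illustration of the new API, here is a minimal transcription sketch. It assumes the facebook/s2t-small-librispeech-asr checkpoint from the filter above and stands a dummy silent waveform in for real 16 kHz audio; it is a sketch, not a definitive recipe.

```python
import numpy as np
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

# Checkpoint taken from the Hub filter above; any S2T checkpoint should work the same way.
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")

# Dummy 1-second silent waveform at 16 kHz; replace with real audio (the feature
# extractor relies on torchaudio to compute the log-mel filter-bank features).
waveform = np.zeros(16_000, dtype=np.float32)
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")

# Transcripts are generated autoregressively from the filter-bank features.
generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```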

M2M100

Two new models are released as part of the M2M100 implementation: M2M100Model and M2M100ForConditionalGeneration, in PyTorch.

M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks.

The M2M100 model was proposed in Beyond English-Centric Multilingual Machine Translation by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=m2m_100
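For reference, a minimal translation sketch with the new classes, assuming the facebook/m2m100_418M checkpoint from the filter above:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Checkpoint taken from the Hub filter above.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="fr")

encoded = tokenizer("La vie est comme une boîte de chocolat.", return_tensors="pt")

# Force the first generated token to the target language id to translate French -> English.
generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```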

I-BERT

Six new models are released as part of the I-BERT implementation: IBertModel, IBertForMaskedLM, IBertForSequenceClassification, IBertForMultipleChoice, IBertForTokenClassification and IBertForQuestionAnswering, in PyTorch.

I-BERT is a quantized version of RoBERTa running inference up to four times faster.

The I-BERT framework in PyTorch makes it possible to identify the best parameters for quantization. Once the model is exported to a framework that supports int8 execution (such as TensorRT), a speedup of up to 4x can be observed, with no loss in performance thanks to the parameter search.

The I-BERT model was proposed in I-BERT: Integer-only BERT Quantization by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney and Kurt Keutzer.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=ibert
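As a rough sketch of that workflow; the checkpoint name and the quant_mode override below are assumptions based on the released I-BERT configuration, not a prescribed recipe:

```python
from transformers import IBertModel, RobertaTokenizer

# I-BERT checkpoints reuse the RoBERTa tokenizer; the checkpoint name below is one of
# the converted RoBERTa-base checkpoints and is used here purely as an example.
tokenizer = RobertaTokenizer.from_pretrained("kssteven/ibert-roberta-base")

# quant_mode=False runs the usual full-precision fine-tuning; reloading with
# quant_mode=True switches the forward pass to the integer-only simulation.
model = IBertModel.from_pretrained("kssteven/ibert-roberta-base", quant_mode=True)

outputs = model(**tokenizer("Integer-only inference with I-BERT.", return_tensors="pt"))
```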

mBART-50

mBART-50 is created from the original mbart-large-cc25 checkpoint by extending its embedding layers with randomly initialized vectors for an extra set of 25 language tokens; it is then pretrained on 50 languages.

The MBart model was presented in Multilingual Translation with Extensible Multilingual Pretraining and Finetuning by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=mbart-50
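A minimal translation sketch with the new mBART-50 tokenizer, assuming the many-to-many checkpoint from the filter above:

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Many-to-many checkpoint taken from the Hub filter above.
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt", src_lang="en_XX"
)

encoded = tokenizer(
    "The head of the United Nations says there is no military solution in Syria.",
    return_tensors="pt",
)

# Translate English -> French by forcing the target language token first.
generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```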

DeBERTa-v2

Five new models are released as part of the DeBERTa-v2 implementation: DebertaV2Model, DebertaV2ForMaskedLM, DebertaV2ForSequenceClassification, DebertaV2ForTokenClassification and DebertaV2ForQuestionAnswering, in PyTorch.

The DeBERTa model was proposed in DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google’s BERT model released in 2018 and Facebook’s RoBERTa model released in 2019.

It builds on RoBERTa with disentangled attention and an enhanced mask decoder, trained with half of the data used in RoBERTa.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=deberta-v2
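A minimal loading sketch, assuming the microsoft/deberta-v2-xlarge checkpoint from the filter above; note that the classification head is randomly initialized until the model is fine-tuned:

```python
from transformers import DebertaV2Tokenizer, DebertaV2ForSequenceClassification

# Checkpoint taken from the Hub filter above.
tokenizer = DebertaV2Tokenizer.from_pretrained("microsoft/deberta-v2-xlarge")
model = DebertaV2ForSequenceClassification.from_pretrained("microsoft/deberta-v2-xlarge")

inputs = tokenizer("DeBERTa-v2 uses disentangled attention.", return_tensors="pt")
logits = model(**inputs).logits  # head weights are freshly initialized until fine-tuned
```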

Wav2Vec2

XLSR-Wav2Vec2

The XLSR-Wav2Vec2 model was proposed in Unsupervised Cross-Lingual Representation Learning For Speech Recognition by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.

The checkpoint corresponding to that model is added to the model hub: facebook/wav2vec2-large-xlsr-53

Training script

A fine-tuning script showcasing how the Wav2Vec2 model can be trained has been added.

Further improvements

The Wav2Vec2 architecture has been stabilized through several changes. This release also introduces feature extractors and feature processors as the pre-processing components of multi-modal speech models.

AMP & XLA Support for TensorFlow models

Most of the TensorFlow models are now compatible with automatic mixed precision (AMP) and have XLA support; a usage sketch follows the list of PRs below.

  • Add AMP for TF Albert #10141 (@jplu)
  • Unlock XLA test for TF ConvBert #10207 (@jplu)
  • Making TF BART-like models XLA and AMP compliant #10191 (@jplu)
  • Making TF XLM-like models XLA and AMP compliant #10211 (@jplu)
  • Make TF CTRL compliant with XLA and AMP #10209 (@jplu)
  • Making TF GPT2 compliant with XLA and AMP #10230 (@jplu)
  • Making TF Funnel compliant with AMP #10216 (@jplu)
  • Making TF Lxmert model compliant with AMP #10257 (@jplu)
  • Making TF MobileBert model compliant with AMP #10259 (@jplu)
  • Making TF MPNet model compliant with XLA #10260 (@jplu)
  • Making TF T5 model compliant with AMP and XLA #10262 (@jplu)
  • Making TF TransfoXL model compliant with AMP #10264 (@jplu)
  • Making TF OpenAI GPT model compliant with AMP and XLA #10261 (@jplu)
  • Rework the AMP for TF XLNet #10274 (@jplu)
  • Making TF Longformer-like models compliant with AMP #10233 (@jplu)
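As referenced above, a rough sketch of turning both features on with standard TensorFlow switches, shown here with TF ALBERT; the exact mixed-precision API depends on your TensorFlow version:

```python
import tensorflow as tf
from transformers import AlbertTokenizer, TFAlbertModel

# Automatic mixed precision: fp16 compute with fp32 variables.
# On TF >= 2.4 this lives at tf.keras.mixed_precision.set_global_policy("mixed_float16").
tf.keras.mixed_precision.experimental.set_policy("mixed_float16")

# Enable XLA compilation for eligible graphs.
tf.config.optimizer.set_jit(True)

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = TFAlbertModel.from_pretrained("albert-base-v2")

outputs = model(tokenizer("AMP and XLA compliant forward pass.", return_tensors="tf"))
```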

SageMaker Trainer for model parallelism

We are rolling out experimental support for model parallelism on SageMaker with a new SageMakerTrainer that can be used in place of the regular Trainer. This is a temporary class that will be removed in a future version; the end goal is to have Trainer support this feature out of the box.
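A minimal sketch of the swap, with the caveat that the import path and argument class shown here are assumptions based on this experimental release and will change once Trainer supports the feature natively; model and train_dataset are placeholders for your own objects:

```python
# Experimental API: the transformers.sagemaker import path is an assumption for this
# release and is expected to disappear once Trainer gains native support.
from transformers.sagemaker import SageMakerTrainer, SageMakerTrainingArguments

training_args = SageMakerTrainingArguments(output_dir="./output")

trainer = SageMakerTrainer(
    model=model,                  # placeholder: any PyTorch transformers model
    args=training_args,
    train_dataset=train_dataset,  # placeholder: your tokenized training dataset
)
trainer.train()
```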

General improvements and bugfixes

Read more

v4.3.3: Patch release

24 Feb 20:16

This patch fixes an issue with the conversion of ConvBERT models: #10314.

v4.3.2: Patch release

09 Feb 19:13

This patch release fixes the RAG model (#10094) and the detection of whether faiss is available (#10103).

v4.3.1: Patch release

09 Feb 09:01

This patch release modifies the API of the Wav2Vec2 model: Wav2Vec2ForCTC has been added as a replacement for Wav2Vec2ForMaskedLM. Wav2Vec2ForMaskedLM is kept for backwards compatibility but is deprecated.
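For illustration, a minimal transcription sketch with the new class, assuming the facebook/wav2vec2-base-960h checkpoint and standing a dummy silent waveform in for real 16 kHz audio:

```python
import numpy as np
import torch
from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC

tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Dummy 1-second silent waveform at 16 kHz; replace with real audio.
waveform = np.zeros(16_000, dtype=np.float32)
input_values = tokenizer(waveform, return_tensors="pt").input_values

# The CTC head emits per-frame logits; greedy-decode them into a transcription.
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(tokenizer.batch_decode(predicted_ids))
```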

v4.3.0: Wav2Vec2, ConvBERT, BORT, Amazon SageMaker

08 Feb 17:45

Wav2Vec2 from facebook (@patrickvonplaten)

Two new models are released as part of the Wav2Vec2 implementation: Wav2Vec2Model and Wav2Vec2ForMaskedLM, in PyTorch.

Wav2Vec2 is a multi-modal model, combining speech and text. It's the first multi-modal model of its kind that we welcome in Transformers.

The Wav2Vec2 model was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=wav2vec2

Available notebooks:

Contributions:

Future Additions

  • Enable fine-tuning and pretraining for Wav2Vec2
  • Add example script with dependency to wav2letter/flashlight
  • Add Encoder-Decoder Wav2Vec2 model

ConvBERT

The ConvBERT model was proposed in ConvBERT: Improving BERT with Span-based Dynamic Convolution by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.

Six new models are released as part of the ConvBERT implementation: ConvBertModel, ConvBertForMaskedLM, ConvBertForSequenceClassification, ConvBertForTokenClassification, ConvBertForQuestionAnswering and ConvBertForMultipleChoice. These models are available both in PyTorch and TensorFlow.

Contributions:

BORT

The BORT model was proposed in Optimal Subarchitecture Extraction for BERT by Amazon's Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for BERT, which the authors refer to as “Bort”.

The BORT model can be loaded directly into the BERT architecture, so all BERT model heads are available for BORT.

Contributions:

Trainer now supports Amazon SageMaker’s data parallel library (@sgugger)

When executing a script with Trainer on Amazon SageMaker with SageMaker's data parallelism library enabled, Trainer will automatically use the smdistributed library. All maintained examples have been tested with this functionality. Here is an overview of the SageMaker data parallelism library.

  • When on SageMaker use their env variables for saves #9876 (@sgugger)

Community page

A new Community Page has been added to the docs. It contains all the notebooks contributed by the community, as well as some community projects built around Transformers. Feel free to open a PR if you want your project to be showcased!

Additional model architectures

DeBERTa now has more model heads available.

BART, mBART, Marian, Pegasus and Blenderbot now have decoder-only model architectures. They can therefore be used in decoder-only settings.

Breaking changes

None.

General improvements and bugfixes

Read more

v4.3.0.rc1: Wav2Vec2, ConvBERT, BORT, Amazon SageMaker

04 Feb 21:20

Wav2Vec2 from facebook (@patrickvonplaten)

Two new models are released as part of the Wav2Vec2 implementation: Wav2Vec2Model and Wav2Vec2ForMaskedLM, in PyTorch.

Wav2Vec2 is a multi-modal model, combining speech and text. It's the first multi-modal model of its kind that we welcome in Transformers.

The Wav2Vec2 model was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=wav2vec2

Available notebooks:

Contributions:

Future Additions

  • Enable fine-tuning and pretraining for Wav2Vec2
  • Add example script with dependency to wav2letter/flashlight
  • Add Encoder-Decoder Wav2Vec2 model

ConvBERT

The ConvBERT model was proposed in ConvBERT: Improving BERT with Span-based Dynamic Convolution by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan.

Six new models are released as part of the ConvBERT implementation: ConvBertModel, ConvBertForMaskedLM, ConvBertForSequenceClassification, ConvBertForTokenClassification, ConvBertForQuestionAnswering and ConvBertForMultipleChoice. These models are available both in PyTorch and TensorFlow.

Contributions:

BORT

The BORT model was proposed in Optimal Subarchitecture Extraction for BERT by Amazon's Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for BERT, which the authors refer to as “Bort”.

The BORT model can be loaded directly into the BERT architecture, so all BERT model heads are available for BORT.

Contributions:

Trainer now supports Amazon SageMaker’s data parallel library (@sgugger)

When executing a script with Trainer on Amazon SageMaker with SageMaker's data parallelism library enabled, Trainer will automatically use the smdistributed library. All maintained examples have been tested with this functionality. Here is an overview of the SageMaker data parallelism library.

  • When on SageMaker use their env variables for saves #9876 (@sgugger)

Community page

A new Community Page has been added to the docs. It contains all the notebooks contributed by the community, as well as some community projects built around Transformers. Feel free to open a PR if you want your project to be showcased!

Additional model architectures

DeBERTa now has more model heads available.

BART, mBART, Marian, Pegasus and Blenderbot now have decoder-only model architectures. They can therefore be used in decoder-only settings.

Breaking changes

None.

General improvements and bugfixes

Read more

v4.2.2: Patch release

21 Jan 08:14

This patch contains two fixes:

v4.2.1: Patch release

14 Jan 13:22

This patch contains three fixes:

v4.2.0: LED from AllenAI, Generation Scores, TensorFlow 2x speedup, faster import

13 Jan 15:13

v4.2.0: LED from AllenAI, encoder-decoder templates, fast imports

LED from AllenAI (@patrickvonplaten)

Four new models are released as part of the LED implementation: LEDModel, LEDForConditionalGeneration, LEDForSequenceClassification, LEDForQuestionAnswering, in PyTorch. The first two models have a TensorFlow version.

LED is the encoder-decoder variant of the Longformer model by AllenAI.

The LED model was proposed in Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=led
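A minimal long-document summarization sketch, assuming the allenai/led-base-16384 checkpoint from the filter above (a base checkpoint, so the output is only meaningful after fine-tuning):

```python
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

# Checkpoint taken from the Hub filter above.
tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

long_document = "Transformers provides thousands of pretrained models. " * 200
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=4096)

# LED uses Longformer-style local attention; give the first token global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    global_attention_mask=global_attention_mask,
    max_length=128,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
```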

Available notebooks:

Contributions:

Generation Scores & other outputs (@patrickvonplaten)

The PyTorch generation function can now return:

  • scores - the logits generated at each step
  • attentions - all attention weights at each generation step
  • hidden_states - all hidden states at each generation step

simply by setting return_dict_in_generate=True in the config or passing it as an argument to .generate().
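For example, a minimal sketch with GPT-2 (any generative checkpoint works the same way):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Today is a nice day and", return_tensors="pt").input_ids

# Request the richer output object together with the extra tensors.
outputs = model.generate(
    input_ids,
    max_length=20,
    return_dict_in_generate=True,
    output_scores=True,
    output_attentions=True,
    output_hidden_states=True,
)

print(outputs.sequences.shape)  # generated token ids
print(len(outputs.scores))      # one logits tensor per generated step
```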

Tweet:

Notebooks for a better explanation:

PR:

  • Add flags to return scores, hidden states and / or attention weights in GenerationMixin #9150 (@SBrandeis)

TensorFlow improvements

TensorFlow BERT-like model improvements (@jplu)

The TensorFlow versions of the BERT-like models have been updated and are now twice as fast as the previous versions.

  • Improve BERT-like models performance with better self attention #9124 (@jplu)

Better integration in TensorFlow Serving (@jplu)

This version introduces a new API for TensorFlow saved models, which can now be exported with model.save_pretrained("path", saved_model=True) and easily loaded into a TensorFlow Serving environment.
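A minimal export sketch; the saved_model/1 sub-directory used in the reload step reflects how the SavedModel is written at the time of this release and may differ in later versions:

```python
import tensorflow as tf
from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-uncased")

# Writes a TensorFlow SavedModel next to the usual weights, ready for TF Serving.
model.save_pretrained("exported_bert", saved_model=True)

# Sanity check: the exported model can be reloaded with plain TensorFlow.
reloaded = tf.saved_model.load("exported_bert/saved_model/1")
```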

DeepSpeed integration (@stas00)

Initial support for DeepSpeed to accelerate distributed training on several GPUs. This is an experimental feature that hasn't been fully tested yet, but early results are very encouraging (see this comment). Stay tuned for more details in the coming weeks!

Model templates (@patrickvonplaten)

The encoder-decoder version of the templates is now part of Transformers! Adding an encoder-decoder model is made very easy with this addition. More information can be found in the README.

Faster import (@sgugger)

The initialization process has been changed to only import what is required. Therefore, when using only PyTorch models, TensorFlow will not be imported and vice-versa. In the best cases, importing a transformers model now takes only a few hundred milliseconds (~200ms), compared to several seconds (~3s) in previous versions.

Documentation highlights (@Qbiwan, @NielsRogge)

Some models now have improved documentation. The LayoutLM model has seen a general overhaul in its documentation thanks to @NielsRogge.

The tokenizer-only models Bertweet, Herbert and Phobert now have their own documentation pages thanks to @Qbiwan.

Breaking changes

There are no breaking changes between the previous version and this one.
This will be the first version to require TensorFlow >= 2.3.

General improvements and bugfixes

Read more

v4.1.1: TAPAS, MPNet, model parallelization, Sharded DDP, conda, multi-part downloads.

17 Dec 17:09

TAPAS (@NielsRogge)

Four new models are released as part of the TAPAS implementation: TapasModel, TapasForQuestionAnswering, TapasForMaskedLM and TapasForSequenceClassification, in PyTorch.

TAPAS is a question answering model, used to answer queries given a table. It is a multi-modal model, combining the text of the query with tabular data.

The TAPAS model was proposed in TAPAS: Weakly Supervised Table Parsing via Pre-training by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
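A minimal table question answering sketch, assuming a WTQ fine-tuned checkpoint; note that TAPAS additionally requires the torch-scatter dependency:

```python
import pandas as pd
from transformers import TapasTokenizer, TapasForQuestionAnswering

# WTQ fine-tuned checkpoint used as an example; TAPAS expects the table as a
# pandas DataFrame whose cells are all strings.
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-wtq")
model = TapasForQuestionAnswering.from_pretrained("google/tapas-base-finetuned-wtq")

table = pd.DataFrame({"City": ["Paris", "Berlin"], "Population": ["2,148,000", "3,769,000"]})
queries = ["Which city has the larger population?"]

inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
outputs = model(**inputs)

# Convert cell-selection and aggregation logits back to table coordinates.
predicted_coords, predicted_agg = tokenizer.convert_logits_to_predictions(
    inputs, outputs.logits.detach(), outputs.logits_aggregation.detach()
)
```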

MPNet (@StillKeepTry)

Six new models are released as part of the MPNet implementation: MPNetModel, MPNetForMaskedLM, MPNetForSequenceClassification, MPNetForMultipleChoice, MPNetForTokenClassification, MPNetForQuestionAnswering, in both PyTorch and TensorFlow.

MPNet introduces a novel self-supervised objective named masked and permuted language modeling for language understanding. It inherits the advantages of both masked language modeling (MLM) and permuted language modeling (PLM), addresses their limitations, and further reduces the inconsistency between the pre-training and fine-tuning paradigms.

The MPNet model was proposed in MPNet: Masked and Permuted Pre-training for Language Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu.

  • MPNet: Masked and Permuted Pre-training for Language Understanding #8971 (@StillKeepTry)
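A minimal fill-mask sketch with the new classes, assuming the microsoft/mpnet-base pretrained checkpoint:

```python
from transformers import MPNetTokenizer, MPNetForMaskedLM

tokenizer = MPNetTokenizer.from_pretrained("microsoft/mpnet-base")
model = MPNetForMaskedLM.from_pretrained("microsoft/mpnet-base")

text = f"MPNet combines masked and permuted language {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits

# Decode the top prediction for the masked position.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_index].argmax(-1)))
```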

Model parallel (@alexorona)

Model parallelism is introduced, allowing users to load very large models on two or more GPUs by spreading the model layers over them. This can allow GPU training even for very large models.
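A rough sketch of the new API as it is exposed on GPT-2 (and T5) in this release; the two-GPU split below is just an example device map:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# Spread the 48 transformer blocks of gpt2-xl across two GPUs; the split is up to you.
device_map = {
    0: list(range(0, 24)),
    1: list(range(24, 48)),
}
model.parallelize(device_map)

# ... train or generate as usual, with inputs placed on the first device ...

model.deparallelize()  # move everything back to CPU when finished
```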

Conda release (@LysandreJik)

Transformers welcomes its first conda releases, with v4.0.0, v4.0.1 and v4.1.0. The conda packages are now officially maintained on the huggingface channel.

Multi-part uploads (@julien-c)

For the first time, very large models can be uploaded to the model hub by using multi-part uploads.

New examples and reorganization (@sgugger)

We introduced a refactored SQuAD example & notebook, which is faster and simpler than the previous scripts.

The examples directory has been reorganized to introduce a separation between "examples", which are maintained examples showcasing how to do one specific task, and "research projects", which are bigger projects maintained by the community.

Introduction of fairscale with Sharded DDP (@sgugger)

We introduce support for fairscale's ShardedDDP in the Trainer, allowing reduced memory usage when training models in a distributed fashion.

Barthez (@moussaKam)

The BARThez model is a French variant of the BART model. We welcome its specific tokenizer to the library, along with multiple checkpoints on the model hub.

General improvements and bugfixes

Read more