
Releases: huggingface/transformers

v4.9.0: TensorFlow examples, CANINE, tokenizer training, ONNX rework

22 Jul 10:53


ONNX rework

This version introduces a new package, transformers.onnx, which can be used to export models to ONNX. Unlike the previous implementation, this approach is designed as an easily extensible package in which users can define their own ONNX configurations for the models they wish to export.

python -m transformers.onnx --model=bert-base-cased onnx/bert-base-cased/
Validating ONNX model...
        -[✓] ONNX model outputs' name match reference model ({'pooler_output', 'last_hidden_state'})
        - Validating ONNX Model output "last_hidden_state":
                -[✓] (2, 8, 768) matches (2, 8, 768)
                -[✓] all values close (atol: 0.0001)
        - Validating ONNX Model output "pooler_output":
                -[✓] (2, 768) matches (2, 768)
                -[✓] all values close (atol: 0.0001)
All good, model saved at: onnx/bert-base-cased/model.onnx
  • [RFC] Laying down building stone for more flexible ONNX export capabilities #11786 (@mfuntowicz)
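
Once exported, the model can be run with any ONNX-compatible runtime. As a minimal sketch (not part of the release itself), assuming onnxruntime is installed and using the output path from the command above:

from onnxruntime import InferenceSession
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
session = InferenceSession("onnx/bert-base-cased/model.onnx")

# ONNX Runtime expects NumPy inputs, so we tokenize with return_tensors="np"
inputs = tokenizer("Using BERT through ONNX Runtime!", return_tensors="np")
outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
print(outputs[0].shape)  # (batch_size, sequence_length, hidden_size)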

CANINE model

Four new models are released as part of the CANINE implementation: CanineForSequenceClassification, CanineForMultipleChoice, CanineForTokenClassification and CanineForQuestionAnswering, in PyTorch.

The CANINE model was proposed in CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It’s among the first papers that train a Transformer without using an explicit tokenization step (such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece). Instead, the model is trained directly at a Unicode character level. Training at a character level inevitably comes with a longer sequence length, which CANINE solves with an efficient downsampling strategy, before applying a deep Transformer encoder.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=canine
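
A minimal usage sketch, assuming the google/canine-s checkpoint from the Hub; since CANINE operates directly on characters, the tokenizer needs no subword vocabulary:

import torch
from transformers import CanineTokenizer, CanineModel

tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineModel.from_pretrained("google/canine-s")

inputs = tokenizer(["Life is like a box of chocolates."], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# one hidden state per input character (plus special tokens)
print(outputs.last_hidden_state.shape)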

Tokenizer training

This version introduces a new method to train a tokenizer from scratch, based on the configuration of an existing tokenizer:

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
# We train on batches of texts, 1,000 at a time here.
batch_size = 1000
corpus = (dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size))

tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=20000)
  • Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) #12420 (@SaulLu)
  • Easily train a new fast tokenizer from a given one #12361 (@sgugger)
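
Continuing the snippet above, the resulting tokenizer behaves like any other fast tokenizer and can be saved and reloaded (the directory name is just an example):

# Save the newly trained tokenizer and reload it later
new_tokenizer.save_pretrained("my-new-gpt2-tokenizer")
reloaded_tokenizer = AutoTokenizer.from_pretrained("my-new-gpt2-tokenizer")
print(reloaded_tokenizer.tokenize("This text is split with the retrained vocabulary."))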

TensorFlow examples

TFTrainer is now deprecated in favor of Keras. v4.9.0 also marks the end of a long rework of the TensorFlow examples, which are now more Keras-idiomatic, clearer, and more robust.
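
As a rough illustration of the Keras-first style the reworked examples follow (a schematic sketch only; the checkpoint, data and hyperparameters below are placeholders rather than values taken from the example scripts):

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Build a tiny tf.data.Dataset from tokenized text and labels
texts = ["I loved this movie!", "Utterly boring."]
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="np")
train_dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).batch(2)

# Standard Keras workflow: compile with an optimizer and a loss, then fit
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(train_dataset, epochs=1)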

TensorFlow implementations

HuBERT is now implemented in TensorFlow.

Breaking changes

When load_best_model_at_end was set to True in the TrainingArguments, a save_strategy different from the eval_strategy was accepted, but the save_strategy was silently overwritten by the eval_strategy (tracking the best model requires an evaluation at every save). This caused a lot of confusion, with users not understanding why the script was not doing what it was told. This situation now raises an error asking you to set save_strategy and eval_strategy to the same value, and, when that value is "steps", to make save_steps a round multiple of eval_steps.
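
For example, a configuration satisfying the new check could look like the following sketch (the option is exposed as evaluation_strategy on TrainingArguments; all values are placeholders):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    load_best_model_at_end=True,
    evaluation_strategy="steps",
    save_strategy="steps",   # must match the evaluation strategy
    eval_steps=500,
    save_steps=1000,         # a round multiple of eval_steps
)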

General improvements and bugfixes


Patch release: v4.8.2

30 Jun 12:42


v4.8.1: Patch release

24 Jun 14:15


  • Fix default for TensorBoard folder
  • Ray Tune install #12338
  • Tests fixes for Torch FX #12336

v4.8.0 Integration with the Hub and Flax/JAX support

23 Jun 17:28


Integration with the Hub

Our example scripts and Trainer are now optimized for publishing your model on the Hugging Face Hub, with TensorBoard training metrics and an automatically authored model card containing all the relevant metadata, including evaluation results.

Trainer Hub integration

Use --push_to_hub to create a model repo for your training; it will be saved with all relevant metadata at the end of training.

Other flags, shown in the sketch after this list, are:

  • push_to_hub_model_id to control the repo name
  • push_to_hub_organization to specify an organization
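
A sketch of the equivalent TrainingArguments, with placeholder repo and organization names:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-finetuned-squad",
    push_to_hub=True,
    push_to_hub_model_id="bert-finetuned-squad",  # controls the repo name
    push_to_hub_organization="my-organization",   # optional organization
)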

Visualizing training metrics on huggingface.co (based on TensorBoard)

By default, if you have tensorboard installed, the training scripts log to it, and the logging traces folder is conveniently located inside your model output directory, so you can push the traces to your model repo by default.

Any model repo that contains TensorBoard traces will spawn a TensorBoard server, which makes it very convenient to see how the training went! This Hub feature is in beta, so let us know if anything looks weird :)

See this model repo

Model card generation

The model card contains information about the datasets used, the evaluation results, and more.

Many users were already adding their eval results to their model cards in Markdown format, but this is a more structured way of adding them, which will make them easier to parse and, for example, represent in leaderboards such as the ones on Papers With Code!

We use a format specified in collaboration with [Papers With Code](https://github.com/huggingface/huggingface_hub/blame/main/modelcard.md), see also this repo.

Model, tokenizer and configurations

All models, tokenizers and configurations now have a revamped push_to_hub() method, as well as a push_to_hub argument in their save_pretrained() method. The workflow of this method has changed a bit to be more git-like, with a local clone of the repo in a folder of the working directory, to make it easier to apply patches (use use_temp_dir=True to clone in a temporary folder for the same behavior as the experimental API).
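
A minimal sketch of both workflows (repo names are placeholders, and pushing requires being logged in to the Hub):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Push directly, cloning into a temporary folder...
model.push_to_hub("my-finetuned-bert", use_temp_dir=True)
tokenizer.push_to_hub("my-finetuned-bert", use_temp_dir=True)

# ...or push as part of saving locally.
model.save_pretrained("my-finetuned-bert", push_to_hub=True)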

Flax/JAX support

Flax/JAX is becoming a fully supported backend of the Transformers library, with more models getting an implementation. BART, CLIP and T5 join the already existing models; find the whole list here.
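
A minimal sketch of running one of the Flax models, using bert-base-cased as an example checkpoint (pass from_pt=True if a checkpoint only ships PyTorch weights):

from transformers import AutoTokenizer, FlaxBertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = FlaxBertModel.from_pretrained("bert-base-cased")

# Flax models consume NumPy/JAX arrays rather than PyTorch tensors
inputs = tokenizer("Flax is becoming a fully supported backend.", return_tensors="np")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)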

General improvements and bug fixes

v4.7.0: DETR, RoFormer, ByT5, HuBERT, support for torch 1.9.0

17 Jun 16:20


DETR (@NielsRogge)

Three new models are released as part of the DETR implementation: DetrModel, DetrForObjectDetection and DetrForSegmentation, in PyTorch.

DETR consists of a convolutional backbone followed by an encoder-decoder Transformer that can be trained end-to-end for object detection. It removes much of the complexity of models like Faster R-CNN and Mask R-CNN, which rely on region proposals, non-maximum suppression, and anchor generation. Moreover, DETR can be naturally extended to perform panoptic segmentation by simply adding a mask head on top of the decoder outputs.

DETR can use any backbone from the timm library.

The DETR model was proposed in End-to-End Object Detection with Transformers by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=detr
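
A minimal inference sketch, assuming the facebook/detr-resnet-50 checkpoint and a local image file:

import torch
from PIL import Image
from transformers import DetrFeatureExtractor, DetrForObjectDetection

feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("street_scene.jpg")  # placeholder path
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.logits: class scores per object query
# outputs.pred_boxes: normalized (center_x, center_y, width, height) boxes
print(outputs.logits.shape, outputs.pred_boxes.shape)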

ByT5 (@patrickvonplaten)

A new tokenizer is released as part of the ByT5 implementation: ByT5Tokenizer. It can be used with the T5 family of models.

The ByT5 model was presented in ByT5: Towards a token-free future with pre-trained byte-to-byte models by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?search=byt5
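
A minimal sketch, assuming the google/byt5-small checkpoint; the tokenizer simply maps UTF-8 bytes to ids, so there is no subword vocabulary to manage:

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

inputs = tokenizer("The dog chases a ball in the park.", return_tensors="pt")
print(inputs.input_ids[0][:10])  # byte-level ids

outputs = model.generate(**inputs, max_length=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))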

RoFormer (@JunnYu)

Fourteen new models are released as part of the RoFormer implementation: RoFormerModel, RoFormerForCausalLM, RoFormerForMaskedLM, RoFormerForSequenceClassification, RoFormerForTokenClassification, RoFormerForQuestionAnswering and RoFormerForMultipleChoice in PyTorch, and TFRoFormerModel, TFRoFormerForCausalLM, TFRoFormerForMaskedLM, TFRoFormerForSequenceClassification, TFRoFormerForTokenClassification, TFRoFormerForQuestionAnswering and TFRoFormerForMultipleChoice in TensorFlow.

RoFormer is a BERT-like autoencoding model with rotary position embeddings. Rotary position embeddings have shown improved performance on classification tasks with long texts. The RoFormer model was proposed in RoFormer: Enhanced Transformer with Rotary Position Embedding by Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu.

  • Add new model RoFormer (use rotary position embedding ) #11684 (@JunnYu)

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=roformer

HuBERT (@patrickvonplaten)

HuBERT is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.

HuBERT was proposed in HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.

Two new models are released as part of the HuBERT implementation: HubertModel and HubertForCTC, in PyTorch.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=hubert
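
A minimal speech-recognition sketch, assuming the facebook/hubert-large-ls960-ft checkpoint and a 16 kHz waveform (a silent placeholder array is used here):

import numpy as np
import torch
from transformers import Wav2Vec2Processor, HubertForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

waveform = np.zeros(16000, dtype=np.float32)  # placeholder: one second of silence
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))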

Hugging Face Course - Part 1

On Monday, June 14th, 2021, we released the first part of the Hugging Face Course. The course focuses on the Hugging Face ecosystem, including transformers. Most of the material in the course is now linked from the transformers documentation, which also includes videos explaining individual concepts.

TensorFlow additions

The Wav2Vec2 model can now be used in TensorFlow.

PyTorch 1.9 support

Notebooks

General improvements and bugfixes


v4.6.1: Patch release

20 May 14:57

  • Fix regression in models for sequence classification used for regression tasks #11785
  • Fix checkpoint deletion when load_best_model_at_end = True #11748
  • Fix evaluation in question answering examples #11746
  • Fix release utils #11784

v4.6.0: ViT, DeiT, CLIP, LUKE, BigBirdPegasus, MegatronBERT

12 May 15:07


Transformers aren't just for text - they can handle a huge range of input types, and there's been a flurry of papers and new models in the last few months applying them to vision tasks that had traditionally been dominated by convolutional networks. With this release, we're delighted to announce that several state-of-the-art pretrained vision and multimodal text+vision transformer models are now accessible in the huggingface/transformers repo. Give them a try!

ViT (@NielsRogge)

Two new models are released as part of the ViT implementation: ViTModel and ViTForImageClassification, in PyTorch.

ViT is a Transformer-based model that obtains state-of-the-art results on image classification tasks. The accompanying paper was the first to successfully train a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures.

The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=vit
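
A minimal image-classification sketch, assuming the google/vit-base-patch16-224 checkpoint and a local image file (at this release the preprocessing class is ViTFeatureExtractor):

import torch
from PIL import Image
from transformers import ViTFeatureExtractor, ViTForImageClassification

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cat.jpg")  # placeholder path
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])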

DeiT (@NielsRogge)

Three new models are released as part of the DeiT implementation: DeiTModel, DeiTForImageClassification and DeiTForImageClassificationWithTeacher, in PyTorch.

DeiT is an image transformer model similar to the ViT model. DeiT (data-efficient image transformers) models are more efficiently trained transformers for image classification, requiring far less data and far less computing resources compared to the original ViT models.

The DeiT model was proposed in Training data-efficient image transformers & distillation through attention by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=deit

CLIP (@patil-suraj)

Three new models are released as part of the CLIP implementation: CLIPModel, CLIPVisionModel and CLIPTextModel, in PyTorch.

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.

The CLIP model was proposed in Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=clip
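
A minimal zero-shot classification sketch, assuming the openai/clip-vit-base-patch32 checkpoint and a local image file:

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder path
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarity as probabilities
print(dict(zip(texts, probs[0].tolist())))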

BigBirdPegasus (@vasudevgupta7)

BigBird is a sparse-attention-based Transformer that extends Transformer-based models such as BERT to much longer sequences. In addition to sparse attention, BigBird also applies global and random attention to the input sequence. Theoretically, it has been shown that applying sparse, global, and random attention approximates full attention while being computationally much more efficient for longer sequences. Thanks to its ability to handle longer context, BigBird has shown improved performance on various long-document NLP tasks, such as question answering and summarization, compared to BERT or RoBERTa.

The BigBird model was proposed in Big Bird: Transformers for Longer Sequences by Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and others.

  • Add BigBirdPegasus #10991 (@vasudevgupta7)

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=bigbird_pegasus

LUKE (@NielsRogge, @ikuyamada)

LUKE is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism, which helps improve performance on various downstream tasks involving reasoning about entities, such as named entity recognition, extractive and cloze-style question answering, entity typing, and relation classification.

The LUKE model was proposed in LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=luke
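
A minimal sketch, assuming the studio-ousia/luke-base checkpoint; entity spans are given as character offsets into the text:

import torch
from transformers import LukeTokenizer, LukeModel

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7)]  # character span of "Beyoncé"
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)         # word token representations
print(outputs.entity_last_hidden_state.shape)  # entity representations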

Megatron (@jdemouth)

The MegatronBERT model is added to the library, giving access to the 345M-parameter variants.

It comes with nine different models: MegatronBertModel, MegatronBertForMaskedLM, MegatronBertForCausalLM, MegatronBertForNextSentencePrediction, MegatronBertForPreTraining, MegatronBertForSequenceClassification, MegatronBertForMultipleChoice, MegatronBertForTokenClassification and MegatronBertForQuestionAnswering, in PyTorch.

The MegatronBERT model was proposed in Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.

Hub integration in Transformers

The Hugging Face Hub now integrates better with transformers through two newly added features:

  • Models, configurations and tokenizers now have a push_to_hub method to automatically push their state to the hub.

  • The Trainer can now automatically push its underlying model, configuration and tokenizer in a similar fashion. Additionally, it is able to create a draft model card on the fly with the training hyperparameters and evaluation results.

  • Auto modelcard #11599 (@sgugger)

  • Trainer push to hub #11328 (@sgugger)

DeepSpeed ZeRO Stage 3 & ZeRO-Infinity

The Trainer now integrates two additional stages of ZeRO: ZeRO stage 3 for parameter partitioning, and ZeRO-Infinity, which extends CPU offload with NVMe offload.
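
As an illustrative sketch only (real configurations set many more options, and ZeRO-Infinity additionally configures NVMe offload), ZeRO stage 3 can be enabled by pointing the Trainer at a DeepSpeed config file:

import json
from transformers import TrainingArguments

# Minimal, illustrative DeepSpeed config enabling ZeRO stage 3
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f)

# The script is then typically launched with the deepspeed launcher
args = TrainingArguments(output_dir="out", deepspeed="ds_config.json")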

Flax

Flax support is getting more robust, with model code stabilizing and new models being added to the library.

TensorFlow

We welcome @Rocketknight1 as a TensorFlow contributor. This version includes a brand new TensorFlow example based on Keras, which will be followed by examples covering most tasks.
Additionally, more TensorFlow setups are covered by adding support for AMD-based GPUs and M1 Macs.

Pipelines

Two new pipelines are added.

Notebooks

  • [Community notebooks] Add Wav2Vec notebook for creating captions for YT Clips #11142 (@Muennighoff)
  • add bigbird-pegasus evaluation notebook #11654 (@vasudevgupta7)
  • Vit notebooks + vit/deit fixes #11309 (@NielsRogge)

General improvements and bugfixes


v4.5.1: Patch release

13 Apr 15:25
  • Fix pipeline when used with private models (#11123)
  • Fix loading an architecture in another (#11207)

v4.5.0: BigBird, GPT Neo, Examples, Flax support

06 Apr 16:50


BigBird (@vasudevgupta7)

Seven new models are released as part of the BigBird implementation: BigBirdModel, BigBirdForPreTraining, BigBirdForMaskedLM, BigBirdForCausalLM, BigBirdForSequenceClassification, BigBirdForMultipleChoice and BigBirdForQuestionAnswering, in PyTorch.

BigBird is a sparse-attention-based Transformer that extends Transformer-based models such as BERT to much longer sequences. In addition to sparse attention, BigBird also applies global and random attention to the input sequence.

The BigBird model was proposed in Big Bird: Transformers for Longer Sequences by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed.

It is released with an accompanying blog post: Understanding BigBird's Block Sparse Attention

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=big_bird
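
A minimal sketch on a long input, assuming the google/bigbird-roberta-base checkpoint:

import torch
from transformers import BigBirdTokenizer, BigBirdModel

tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdModel.from_pretrained("google/bigbird-roberta-base")

long_text = " ".join(["BigBird handles long documents with sparse attention."] * 200)
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)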

GPT Neo (@patil-suraj)

Two new models are released as part of the GPT Neo implementation: GPTNeoModel, GPTNeoForCausalLM in PyTorch.

GPT-Neo is the code name for a family of transformer-based language models loosely styled around the GPT architecture. EleutherAI's primary goal is to replicate a GPT-3 DaVinci-sized model and open-source it to the public.

The implementation within Transformers is a GPT2-like causal language model trained on the Pile dataset.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=gpt_neo
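
A minimal generation sketch, assuming the EleutherAI/gpt-neo-1.3B checkpoint (GPT Neo reuses the GPT-2 tokenizer):

from transformers import GPT2Tokenizer, GPTNeoForCausalLM

tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

inputs = tokenizer("In a shocking finding, scientists discovered", return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, max_length=50, temperature=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))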

Examples

New features have been added to some existing examples, and new examples have been introduced.

Raw training loop examples

Based on the accelerate library, examples that completely expose the training loop are now provided, making them easy to customize if you want to try a new research idea.
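
A schematic sketch of the training-loop style these examples use, assuming model, optimizer and train_dataloader have already been created:

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for epoch in range(3):
    for batch in train_dataloader:
        outputs = model(**batch)    # batches include labels, so the model returns a loss
        loss = outputs.loss
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
        optimizer.zero_grad()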

Standardize examples with Trainer

Thanks to the amazing contributions of @bhadreshpsavani, all examples using the Trainer are now standardized: they all support the predict stage and return/save metrics in the same fashion.

Trainer & SageMaker Model Parallelism

The Trainer now supports SageMaker model parallelism out of the box; the old SageMakerTrainer is deprecated as a consequence and will be removed in version 5.

FLAX

FLAX support has been widened to all model heads of the BERT architecture, and a general conversion script allows PyTorch checkpoints to be used in FLAX.

Auto models now have a FLAX implementation.

General improvements and bugfixes


Patch release v4.4.2

18 Mar 19:17
  • Add support for detecting intel-tensorflow version
  • Fix distributed evaluation on SageMaker