Releases: huggingface/transformers

v4.15.0

22 Dec 19:34

New Model additions

WavLM

WavLM was proposed in WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.

WavLM sets a new SOTA on the SUPERB benchmark.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=wavlm
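
As a quick orientation, here is a minimal sketch of extracting WavLM representations; the microsoft/wavlm-base-plus checkpoint name is one example from the hub query above.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, WavLMModel

feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

audio = np.zeros(16000, dtype=np.float32)  # stand-in for one second of 16 kHz speech
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
```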

Wav2Vec2Phoneme

Wav2Vec2Phoneme was proposed in Simple and Effective Zero-shot Cross-lingual Phoneme Recognition by Qiantong Xu, Alexei Baevski, Michael Auli.
Wav2Vec2Phoneme allows phoneme classification as part of automatic speech recognition.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=phoneme-recognition
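
A minimal sketch of phoneme recognition, assuming the facebook/wav2vec2-lv-60-espeak-cv-ft checkpoint from the hub query above (encoding text with this tokenizer additionally requires the phonemizer package; decoding, as below, does not):

```python
import numpy as np
import torch
from transformers import (
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2PhonemeCTCTokenizer,
)

checkpoint = "facebook/wav2vec2-lv-60-espeak-cv-ft"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
tokenizer = Wav2Vec2PhonemeCTCTokenizer.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

audio = np.zeros(16000, dtype=np.float32)  # stand-in for real 16 kHz speech
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
phonemes = tokenizer.batch_decode(torch.argmax(logits, dim=-1))  # phoneme strings
```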

UniSpeech-SAT

UniSpeech-SAT was proposed in UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.

UniSpeech-SAT is especially good at speaker-related tasks.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=unispeech-sat

UniSpeech

UniSpeech was proposed in UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=unispeech

New Tasks

Speaker Diarization and Verification

Wav2Vec2-like architectures now have speaker diarization and speaker verification heads added to them.
You can try out the new task here: https://huggingface.co/spaces/microsoft/wavlm-speaker-verification

  • Add Speaker Diarization and Verification heads by @anton-l in #14723
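
A minimal sketch of the new speaker verification head, assuming the microsoft/wavlm-base-plus-sv checkpoint that backs the demo Space above:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, WavLMForXVector

feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

# Two stand-in 16 kHz utterances to compare
utterances = [np.zeros(16000, dtype=np.float32), np.zeros(16000, dtype=np.float32)]
inputs = feature_extractor(utterances, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    embeddings = model(**inputs).embeddings  # one x-vector per utterance

# High cosine similarity suggests the two utterances share a speaker
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)
```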

Full Changelog: v4.14.0...v4.15.0

v4.14.1: Patch release

15 Dec 19:02

Fixes a circular import when TensorFlow and ONNX are both installed (#14787)

v4.14.0: Perceiver, Keras model cards

15 Dec 17:27

Perceiver

The Perceiver model was released in the previous version, v4.13.0:

Eight new models are released as part of the Perceiver implementation: PerceiverModel, PerceiverForMaskedLM, PerceiverForSequenceClassification, PerceiverForImageClassificationLearned, PerceiverForImageClassificationFourier, PerceiverForImageClassificationConvProcessing, PerceiverForOpticalFlow, PerceiverForMultimodalAutoencoding, in PyTorch.

The Perceiver IO model was proposed in Perceiver IO: A General Architecture for Structured Inputs & Outputs by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=perceiver

Version v4.14.0 adds support for Perceiver in multiple pipelines, including the fill mask and sequence classification pipelines.
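
For instance, a minimal sketch of the fill-mask pipeline with Perceiver; since Perceiver operates on raw bytes, a single mask token stands for a single character (the checkpoint name is one example from the hub):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="deepmind/language-perceiver")
# Masks one byte; the model should recover the missing "F"
print(fill_mask("Paris is the capital of " + fill_mask.tokenizer.mask_token + "rance."))
```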

Keras model cards

The Keras push-to-hub callback now generates model cards when pushing to the model hub. In addition to the callback, model cards are generated by default by the model.push_to_hub() method.

Full Changelog: v4.13.0...v4.14.0

v4.13.0: Perceiver, ImageGPT, mLUKE, Vision-Text dual encoders, QDQBert, new documentation frontend

09 Dec 16:07

New Model additions

Perceiver

Eight new models are released as part of the Perceiver implementation: PerceiverModel, PerceiverForMaskedLM, PerceiverForSequenceClassification, PerceiverForImageClassificationLearned, PerceiverForImageClassificationFourier, PerceiverForImageClassificationConvProcessing, PerceiverForOpticalFlow, PerceiverForMultimodalAutoencoding, in PyTorch.

The Perceiver IO model was proposed in Perceiver IO: A General Architecture for Structured Inputs & Outputs by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=perceiver

mLUKE

The mLUKE tokenizer is added. The tokenizer can be used for the multilingual variant of LUKE.

The mLUKE model was proposed in mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka. It's a multilingual extension of the LUKE model trained on the basis of XLM-RoBERTa.

  • Add mLUKE by @Ryou0634 in #14640

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=luke
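
A minimal sketch of the new tokenizer, assuming the studio-ousia/mluke-base checkpoint; mLUKE-style inputs mark entity mentions with character-level spans:

```python
from transformers import LukeModel, MLukeTokenizer

tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")
model = LukeModel.from_pretrained("studio-ousia/mluke-base")

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7), (17, 28)]  # character spans of "Beyoncé" and "Los Angeles"
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
outputs = model(**inputs)
# outputs.entity_last_hidden_state holds one contextualized vector per entity
```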

ImageGPT

Three new models are released as part of the ImageGPT integration: ImageGPTModel, ImageGPTForCausalImageModeling, ImageGPTForImageClassification, in PyTorch.

The ImageGPT model was proposed in Generative Pretraining from Pixels by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. ImageGPT (iGPT) is a GPT-2-like model trained to predict the next pixel value, allowing for both unconditional and conditional image generation.

Compatible checkpoints can be found on the hub: https://huggingface.co/models?other=imagegpt
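
A minimal sketch of unconditional generation with ImageGPTForCausalImageModeling, assuming the openai/imagegpt-small checkpoint:

```python
import torch
from transformers import ImageGPTFeatureExtractor, ImageGPTForCausalImageModeling

feature_extractor = ImageGPTFeatureExtractor.from_pretrained("openai/imagegpt-small")
model = ImageGPTForCausalImageModeling.from_pretrained("openai/imagegpt-small")

# Unconditional generation starts from the special start-of-sequence token
context = torch.full((1, 1), model.config.vocab_size - 1, dtype=torch.long)
pixels = model.generate(
    input_ids=context, max_length=model.config.n_positions + 1, do_sample=True, top_k=40
)
# feature_extractor.clusters maps the generated color-cluster ids back to RGB values
```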

QDQBert

Eight new models are released as part of the QDQBert implementation: QDQBertModel, QDQBertLMHeadModel, QDQBertForMaskedLM, QDQBertForSequenceClassification, QDQBertForNextSentencePrediction, QDQBertForMultipleChoice, QDQBertForTokenClassification, QDQBertForQuestionAnswering, in PyTorch.

The QDQBERT model can be referenced in Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.

  • Add QDQBert model and quantization examples of SQUAD task by @shangz-ai in #14066

Semantic Segmentation models

The semantic segmentation models' API is unstable and bound to change between this version and the next.

The first semantic segmentation models are added. In semantic segmentation, the goal is to predict a class label for every pixel of an image. The models that are added are SegFormer (by NVIDIA) and BEiT (by Microsoft Research). BEiT was already available in the library, but this release includes the model with a semantic segmentation head.

The SegFormer model was proposed in SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. The model consists of a hierarchical Transformer encoder and a lightweight all-MLP decode head to achieve great results on image segmentation benchmarks such as ADE20K and Cityscapes.

The BEiT model was proposed in BEiT: BERT Pre-Training of Image Transformers by Hangbo Bao, Li Dong, Furu Wei. Rather than pre-training the model to predict the class of an image (as done in the original ViT paper), BEiT models are pre-trained to predict visual tokens from the codebook of OpenAI’s DALL-E model given masked patches.
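
A minimal sketch of per-pixel prediction with SegFormer, assuming the nvidia/segformer-b0-finetuned-ade-512-512 checkpoint fine-tuned on ADE20K:

```python
import torch
from PIL import Image
from transformers import SegformerFeatureExtractor, SegformerForSemanticSegmentation

checkpoint = "nvidia/segformer-b0-finetuned-ade-512-512"
feature_extractor = SegformerFeatureExtractor.from_pretrained(checkpoint)
model = SegformerForSemanticSegmentation.from_pretrained(checkpoint)

image = Image.open("scene.jpg")  # any RGB image
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, num_labels, height/4, width/4)
segmentation = logits.argmax(dim=1)  # per-pixel class ids at reduced resolution
```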

Vision-text dual encoder

Adds the VisionTextDualEncoder model in PyTorch and Flax, which can load any pre-trained vision (ViT, DeiT, BEiT, CLIP's vision model) and text (BERT, RoBERTa) model from the library for vision-text tasks like CLIP.

This model pairs a vision and a text encoder and adds projection layers that project the embeddings into a shared embedding space of similar dimensions, which can then be used to align the two modalities.
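
A minimal sketch of pairing two pre-trained encoders; the ViT and BERT checkpoint names are ordinary examples:

```python
from transformers import (
    BertTokenizer,
    ViTFeatureExtractor,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "google/vit-base-patch16-224", "bert-base-uncased"
)
processor = VisionTextDualEncoderProcessor(
    ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224"),
    BertTokenizer.from_pretrained("bert-base-uncased"),
)
# The projection layers start out randomly initialized, so the model should be
# fine-tuned with a CLIP-style contrastive objective before use.
```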

CodeParrot

CodeParrot, a model trained to generate code, has been open-sourced in the research projects by @lvwerra.

Language model support for ASR

See https://huggingface.co/patrickvonplaten/wav2vec2-xlsr-53-es-kenlm for more information.

Flax-specific additions

Adds a Flax version of the vision encoder-decoder model, as well as a Flax version of GPT-J.

TensorFlow-specific additions

Vision transformers are here! Convnets are so 2012, now that ML is converging on self-attention as a universal model.

Want to handle real-world tables, where text and data are positioned in a 2D grid? TAPAS is now here for both TensorFlow and PyTorch.

Automatic checkpointing and cloud saves to the HuggingFace Hub during training are now live, allowing you to resume training when it's interrupted, even if your initial instance is terminated. This is an area of very active development - watch this space for future developments, including automatic model card creation and more.

Auto-processors

A new class to automatically select processors is added: AutoProcessor. It can be used for all models that require a processor, in both computer vision and audio.
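
A minimal sketch; the checkpoint name is an example, and AutoProcessor resolves to the matching processor class from the checkpoint's configuration:

```python
from transformers import AutoProcessor

# Resolves to a Wav2Vec2Processor for this audio checkpoint, and to the
# appropriate processor class for vision checkpoints
processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
```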

New documentation frontend

A new documentation frontend is out for the transformers library! The goal of this documentation is to be better aligned with the rest of our website, and it contains tools to improve readability. The documentation can now be written in Markdown rather than RST.

LayoutLM Improvements

The LayoutLMv2 feature extractor now supports non-English languages, and LayoutXLM gets its own processor.

  • LayoutLMv2FeatureExtractor now supports non-English languages when applying Tesseract OCR. by @Xargonus in #14514
  • Add LayoutXLMProcessor (and LayoutXLMTokenizer, LayoutXLMTokenizerFast) by @NielsRogge in #14115

Trainer Improvements

You can now take advantage of Ampere hardware with the Trainer:

  • --bf16 - do training or eval in bfloat16 mixed precision
  • --bf16_full_eval - do eval in full bfloat16
  • --tf32 - toggle TF32 mode on or off
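
The same flags are available programmatically through TrainingArguments; a minimal sketch (bf16 and TF32 require Ampere GPUs and a recent PyTorch build):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,            # train/eval in bfloat16 mixed precision
    bf16_full_eval=True,  # run evaluation fully in bfloat16
    tf32=True,            # enable the TF32 matmul mode on Ampere
)
```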

Improvements and bugfixes


v4.12.5: Patch release

17 Nov 16:38

Reverts a commit that introduced other issues:

  • Revert "Experimenting with adding proper get_config() and from_config() methods (#14361)"

v4.12.4: Patch release

16 Nov 22:31
  • Fix gradient_checkpointing backward compatibility (#14408)
  • [Wav2Vec2] Make sure that gradient checkpointing is only run if needed (#14407)
  • Experimenting with adding proper get_config() and from_config() methods (#14361)
  • enhance rewrite state_dict missing _metadata (#14348)
  • Support for TF >= 2.7 (#14345)
  • improve rewrite state_dict missing _metadata (#14276)
  • Fix of issue #13327: Wrong weight initialization for TF t5 model (#14241)

v4.12.3: Patch release

03 Nov 13:05

  • Add PushToHubCallback in main init (#14246)
  • Supports huggingface_hub >= 0.1.0

v4.12.2: Patch release

29 Oct 18:52

Fixes an issue with the image segmentation pipeline and PyTorch's inference mode.

v4.12.1: Patch release

29 Oct 18:43

Enables torch 1.10.0

v4.12.0: TrOCR, SEW & SEW-D, Unispeech & Unispeech-SAT, BARTPho

28 Oct 16:57

TrOCR and VisionEncoderDecoderModel

One new model is released as part of the TrOCR implementation: TrOCRForCausalLM, in PyTorch. It comes along with a new VisionEncoderDecoderModel class, which allows mixing and matching any vision Transformer encoder with any text Transformer decoder, similar to the existing SpeechEncoderDecoderModel class.

The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.

The TrOCR model consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition in an end-to-end manner.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=trocr
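
A minimal sketch of end-to-end OCR, assuming the microsoft/trocr-base-handwritten checkpoint from the hub query above:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open("line_of_text.png").convert("RGB")  # a cropped text-line image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```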

SEW & SEW-D

SEW and SEW-D (Squeezed and Efficient Wav2Vec) were proposed in Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

SEW and SEW-D models use a Wav2Vec-style feature encoder and introduce temporal downsampling to reduce the input length to the transformer encoder. SEW-D additionally replaces the transformer encoder with a DeBERTa one. Both models achieve significant inference speedups without sacrificing speech recognition quality.

Compatible checkpoints are available on the Hub: https://huggingface.co/models?other=sew and https://huggingface.co/models?other=sew-d

DistilHuBERT

DistilHuBERT was proposed in DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT, by Heng-Jui Chang, Shu-wen Yang, Hung-yi Lee.

DistilHuBERT is a distilled version of the HuBERT model. Using only two transformer layers, the model scores competitively on the SUPERB benchmark tasks.

A compatible checkpoint is available on the Hub: https://huggingface.co/ntu-spml/distilhubert

TensorFlow improvements

Several bug fixes and UX improvements have been made for TensorFlow.

Keras callback

Introduction of a Keras callback to push to the hub each epoch, or after a given number of steps.
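
A minimal sketch of wiring up the callback; the output directory and hub_model_id values are illustrative:

```python
from transformers.keras_callbacks import PushToHubCallback

push_callback = PushToHubCallback(
    output_dir="./model_checkpoints",     # local staging directory
    save_strategy="epoch",                # push at the end of every epoch
    hub_model_id="my-username/my-model",  # illustrative repository name
)
# model.fit(train_dataset, epochs=3, callbacks=[push_callback])
```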

Updates on the encoder-decoder framework

The encoder-decoder framework is now available in TensorFlow, allowing mixing and matching different encoders and decoders together into a single encoder-decoder architecture!

  • Add TFEncoderDecoderModel + Add cross-attention to some TF models by @ydshieh in #13222

Besides this, the EncoderDecoderModel classes have been updated to work like models such as BART and T5. From now on, users no longer need to pass decoder_input_ids to the model themselves; instead, they are created automatically from the labels (namely by shifting them one position to the right, replacing -100 with the pad_token_id, and prepending the decoder_start_token_id). Note that this may result in training discrepancies when fine-tuning a model trained with versions prior to v4.12.0 that set decoder_input_ids = labels.

  • Fix EncoderDecoderModel classes to be more like BART and T5 by @NielsRogge in #14139
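
A minimal sketch of the new behavior; the BERT checkpoints are illustrative, and no decoder_input_ids are passed:

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("A long article to summarize", return_tensors="pt")
labels = tokenizer("A short summary", return_tensors="pt").input_ids

# No decoder_input_ids: the model now derives them by shifting the labels
loss = model(input_ids=inputs.input_ids, labels=labels).loss
```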

Auto-model API

To make it easier to extend the Transformers library, every Auto class now has a new register method that allows you to register your own custom models, configurations, or tokenizers. See more in the documentation.

  • Add an API to register objects to Auto classes by @sgugger in #13989
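
A minimal sketch of the register API; NewModelConfig and NewModel are hypothetical user-defined classes:

```python
from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel

class NewModelConfig(PretrainedConfig):
    model_type = "new-model"

class NewModel(PreTrainedModel):
    config_class = NewModelConfig

AutoConfig.register("new-model", NewModelConfig)  # map the model type to the config
AutoModel.register(NewModelConfig, NewModel)      # map the config to the model class
```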

Bug fixes and improvements
