Releases: huggingface/transformers

v4.12.2: Patch release

29 Oct 18:52

Fixes an issue with the image segmentation pipeline and PyTorch's inference mode.

v4.12.1: Patch release

29 Oct 18:43

Enables torch 1.10.0

v4.12.0: TrOCR, SEW & SEW-D, Unispeech & Unispeech-SAT, BARTPho

28 Oct 16:57

TrOCR and VisionEncoderDecoderModel

One new model is released as part of the TrOCR implementation: TrOCRForCausalLM, in PyTorch. It comes with a new VisionEncoderDecoderModel class, which lets you mix and match any vision Transformer encoder with any text Transformer decoder, similar to the existing SpeechEncoderDecoderModel class.

The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.

The TrOCR model consists of an image transformer encoder and an autoregressive text transformer to perform optical character recognition in an end-to-end manner.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=trocr

SEW & SEW-D

SEW and SEW-D (Squeezed and Efficient Wav2Vec) were proposed in Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

SEW and SEW-D models use a Wav2Vec-style feature encoder and introduce temporal downsampling to shorten the sequence processed by the transformer encoder. SEW-D additionally replaces the transformer encoder with a DeBERTa one. Both models achieve significant inference speedups without sacrificing speech recognition quality.

Compatible checkpoints are available on the Hub: https://huggingface.co/models?other=sew and https://huggingface.co/models?other=sew-d

DistilHuBERT

DistilHuBERT was proposed in DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT, by Heng-Jui Chang, Shu-wen Yang, Hung-yi Lee.

DistilHuBERT is a distilled version of the HuBERT model. Using only two transformer layers, the model scores competitively on the SUPERB benchmark tasks.

A compatible checkpoint is available on the Hub: https://huggingface.co/ntu-spml/distilhubert

TensorFlow improvements

Several bug fixes and UX improvements for TensorFlow

Keras callback

A new Keras callback pushes the model to the Hub every epoch, or after a given number of steps.

Updates on the encoder-decoder framework

The encoder-decoder framework is now available in TensorFlow, so you can mix and match different encoders and decoders in a single encoder-decoder architecture!

  • Add TFEncoderDecoderModel + Add cross-attention to some TF models by @ydshieh in #13222

Besides this, the EncoderDecoderModel classes have been updated to work similarly to models like BART and T5. Users no longer need to pass decoder_input_ids to the model themselves. Instead, they are created automatically from the labels (by shifting them one position to the right, replacing -100 with the pad_token_id, and prepending the decoder_start_token_id). Note that this may cause training discrepancies when fine-tuning a model that was trained with a version earlier than 4.12.0 and with decoder_input_ids = labels.

  • Fix EncoderDecoderModel classes to be more like BART and T5 by @NielsRogge in #14139
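The label-shifting rule described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name shift_labels_right and the token ids used in the example are assumptions made for the sketch.

```python
def shift_labels_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder input ids from labels (hypothetical helper for illustration)."""
    shifted = []
    for row in labels:
        # Prepend the start token and drop the last label: a shift right by one.
        new_row = [decoder_start_token_id] + list(row[:-1])
        # -100 marks positions ignored by the loss; it is not a valid token id,
        # so it is replaced by the padding token.
        new_row = [pad_token_id if token == -100 else token for token in new_row]
        shifted.append(new_row)
    return shifted


# Token ids below are made up for the example.
labels = [[11, 12, 13, -100, -100]]
decoder_input_ids = shift_labels_right(labels, pad_token_id=0, decoder_start_token_id=2)
print(decoder_input_ids)  # [[2, 11, 12, 13, 0]]
```

A model that was previously trained with decoder_input_ids = labels saw unshifted inputs, which is why the note above warns about discrepancies when fine-tuning it further.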

Speech improvements

Auto-model API

To make it easier to extend the Transformers library, every Auto class now has a register method that lets you register your own custom models, configurations, or tokenizers. See more in the documentation.

  • Add an API to register objects to Auto classes by @sgugger in #13989
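The idea behind such a register method can be sketched as a mapping from configuration classes to model classes. The sketch below is a toy illustration only: ModelRegistry, MyConfig, and MyModel are hypothetical names, and the code does not reproduce the library's Auto classes or their signatures.

```python
class ModelRegistry:
    """Hypothetical stand-in for an Auto class with a register method."""

    _mapping = {}

    @classmethod
    def register(cls, config_class, model_class):
        # Refuse to silently overwrite an existing registration.
        if config_class in cls._mapping:
            raise ValueError(f"{config_class.__name__} is already registered")
        cls._mapping[config_class] = model_class

    @classmethod
    def from_config(cls, config):
        # Look up the model class that was registered for this config type.
        model_class = cls._mapping.get(type(config))
        if model_class is None:
            raise ValueError(f"Unrecognized configuration class {type(config).__name__}")
        return model_class(config)


class MyConfig:
    model_type = "my-model"


class MyModel:
    def __init__(self, config):
        self.config = config


ModelRegistry.register(MyConfig, MyModel)
model = ModelRegistry.from_config(MyConfig())
print(type(model).__name__)  # MyModel
```

Once a custom class is registered, generic loading code can stay unchanged and only the registration call needs to be added.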

Bug fixes and improvements

v4.11.3: Patch release

06 Oct 17:00

This patch release fixes a few issues encountered since the release of v4.11.2:

  • [DPR] Correct init (#13796)
  • Fix warning situation: "UserWarning: max_length is ignored when padding=True" (#13829)
  • Bart: check if decoder_inputs_embeds is set (#13800)
  • include megatron_gpt2 in installed modules (#13834)
  • Fixing 1-length special tokens cut. (#13862)
  • Fixing empty prompts for text-generation when BOS exists. (#13859)
  • Fixing question-answering with long contexts (#13873)
  • Fixing GPU for token-classification in a better way. (#13856)
  • Fixing Backward compatiblity for zero-shot (#13855)
  • Fix hp search for non sigopt backends (#13897)
  • Fix trainer logging_nan_inf_filter in torch_xla mode #13896 (@ymwangg)
  • [Trainer] Fix nan-loss condition #13911 (@anton-l)

v4.11.2: Patch release

30 Sep 15:55

Fix the Trainer API on TPU:

v4.11.1: Patch release

29 Sep 16:06

Patch release with a few bug fixes:

  • [Wav2Vec2] Better error message (#13777)
  • Fix LayoutLM ONNX test error (#13710)
  • Fix warning for gradient_checkpointing (#13767)
  • Implement len in IterableDatasetShard (#13780)
  • Fix length of IterableDatasetShard and add test (#13792)

v4.11.0: GPT-J, Speech2Text2, FNet, Pipeline GPU utilization, dynamic model code loading

27 Sep 18:20

GPT-J

Three new models are released as part of the GPT-J implementation: GPTJModel, GPTJForCausalLM, GPTJForSequenceClassification, in PyTorch.

The GPT-J model was released in the kingoflolz/mesh-transformer-jax repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like causal language model trained on the Pile dataset.

It was contributed by @StellaAthena, @kurumuz, @EricHallahan, and @leogao2.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=gptj

SpeechEncoderDecoder & Speech2Text2

One new model is released as part of the Speech2Text2 implementation: Speech2Text2ForCausalLM, in PyTorch.

The Speech2Text2 model is used together with Wav2Vec2 for Speech Translation models proposed in Large-Scale Self- and Semi-Supervised Learning for Speech Translation by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.

Speech2Text2 is a decoder-only transformer model that can be used with any speech encoder-only model, such as Wav2Vec2 or HuBERT, for speech-to-text tasks. Please refer to the SpeechEncoderDecoder class to see how to combine Speech2Text2 with any speech encoder-only model.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=speech2text2

FNet

Eight new models are released as part of the FNet implementation: FNetModel, FNetForPreTraining, FNetForMaskedLM, FNetForNextSentencePrediction, FNetForSequenceClassification, FNetForMultipleChoice, FNetForTokenClassification, FNetForQuestionAnswering, in PyTorch.

The FNet model was proposed in FNet: Mixing Tokens with Fourier Transforms by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. The model replaces the self-attention layer in a BERT model with a Fourier transform and keeps only the real part of the result. The model is significantly faster than BERT because it has fewer parameters and is more memory efficient. It reaches about 92-97% of the accuracy of its BERT counterparts on the GLUE benchmark and trains much faster.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=fnet
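The Fourier mixing step can be sketched in plain Python. This is a minimal illustration that assumes the transform is applied along the hidden dimension and then along the sequence dimension, as described in the FNet paper. The function names dft and fourier_mixing are made up for the sketch, and the naive O(n^2) transform stands in for the fast implementation a real model would use.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a 1-D sequence."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * freq * k / n) for k in range(n))
        for freq in range(n)
    ]


def fourier_mixing(x):
    """Transform along the hidden dimension, then along the sequence
    dimension, and keep only the real part of the result."""
    # x is a list of token vectors with shape (seq_len, hidden_size).
    along_hidden = [dft(row) for row in x]
    # Transpose so that each inner list runs along the sequence dimension.
    columns = [list(column) for column in zip(*along_hidden)]
    along_sequence = [dft(column) for column in columns]
    # Transpose back to (seq_len, hidden_size) and drop the imaginary part.
    return [[value.real for value in row] for row in zip(*along_sequence)]


tokens = [[1.0, 1.0], [1.0, 1.0]]
mixed = fourier_mixing(tokens)
print([[round(value, 6) for value in row] for row in mixed])
```

The mixing step has no learned weights, which is consistent with the smaller parameter count mentioned above.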

TensorFlow improvements

Several bug fixes and UX improvements for TensorFlow:

  • Users should notice far fewer unnecessary warnings and less 'console spam' in general while using Transformers with TensorFlow.
  • TensorFlow models should be less picky about the specific integer dtypes (int32/int64) that are passed as input.

Changes to compile() and train_step()

  • You can now compile our TensorFlow models without passing a loss argument! If you do, the model will compute loss internally during the forward pass and then use this value to fit() on. This makes it much more convenient to get the right loss, particularly since many models have unique losses for certain tasks that are easy to overlook and annoying to reimplement. Remember to pass your labels as the "labels" key of your input dict when doing this, so that they're accessible to the model during the forward pass. There is no change to the behavior if you pass a loss argument, so all old code should remain unaffected by this change.

Associated PRs:

Pipelines

Pipeline refactor

The pipelines underwent a large refactor that should make contributing pipelines much simpler, and much less error-prone. As part of this refactor, PyTorch-based pipelines are now optimized for GPU performance based on PyTorch's Datasets and DataLoaders.

See below for an example leveraging the superb dataset.

import datasets
import tqdm
from transformers import pipeline
# KeyDataset was importable from this path around v4.11; adjust if your version differs.
from transformers.pipelines.base import KeyDataset

pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
dataset = datasets.load_dataset("superb", name="asr", split="test")

# KeyDataset (PyTorch only) yields just the value stored under the given key of each
# dataset item, since the `target` part of the dataset is not needed here.
for out in tqdm.tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
    # {"text": ....}
    # ....

Audio classification pipeline

A new pipeline is also available for audio classification.

  • Add the AudioClassificationPipeline #13342 (@anton-l)
  • Enabling automatic loading of tokenizer with pipeline for audio-classification. #13376 (@Narsil)

Setters for common properties

Version v4.11.0 introduces setters for common configuration properties. Configurations come from different implementations, so they often use different names for the same property.

For example, BertConfig has the hidden_size attribute, while GPT2Config has the n_embd attribute, and the two are essentially the same.

The newly introduced setters allow setting such properties through a standardized naming scheme, even on configuration objects that do not have them by default.

See the following code sample for an example:

from transformers import GPT2Config

config = GPT2Config()

config.hidden_size = 4  # Failed previously
config = GPT2Config(hidden_size=4)  # Failed previously

config.n_embd  # returns 4
config.hidden_size  # returns 4

  • Update model configs - Allow setters for common properties #13026 (@nreimers)
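One way such name aliasing could work is to redirect reads and writes on the standardized name to the attribute the configuration actually stores. The sketch below is an illustration under that assumption, not the library's code: AliasedConfig, its attribute_map contents, and the default value are hypothetical.

```python
class AliasedConfig:
    """Hypothetical config that exposes a standardized name as an alias."""

    # Maps the standardized name to the attribute this config actually stores.
    attribute_map = {"hidden_size": "n_embd"}

    def __init__(self, **kwargs):
        self.n_embd = 768  # default chosen for illustration only
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __setattr__(self, key, value):
        # Redirect writes on an alias to the underlying attribute.
        key = type(self).attribute_map.get(key, key)
        super().__setattr__(key, value)

    def __getattr__(self, key):
        # Called only when normal lookup fails, which is the case for aliases.
        attribute_map = type(self).attribute_map
        if key in attribute_map:
            return getattr(self, attribute_map[key])
        raise AttributeError(key)


config = AliasedConfig(hidden_size=4)
print(config.n_embd, config.hidden_size)  # 4 4
```

Because only one value is stored, the two names can never drift apart.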

Dynamic model code loading

This release adds an experimental feature that supports model code files hosted on the Hub. A walkthrough is available in the PR description.

⚠️ This means that code files will be fetched from the Hub and executed locally. An additional argument, trust_remote_code, is required when instantiating the model from the Hub. We strongly encourage you to also specify a revision when using code from another user's or organization's repository.

Trainer

The Trainer has received several new features, the main one being that models are uploaded to the Hub each time you save them locally (you can specify another strategy). This push is asynchronous, so training continues normally without interruption.

Also:

  • The SigOpt optimization framework is now integrated in the Trainer API as an opt-in component.
  • The Trainer API now supports fine-tuning on distributed CPUs.

Associated PRs:

  • Push to hub when saving checkpoints #13503 (@sgugger)
  • Add SigOpt HPO to transformers trainer api #13572 (@kding1)
  • Add cpu distributed fine-tuning support for transformers Trainer API #13574 (@kding1)

Model size CPU memory usage reduction

Loading a model with PyTorch's torch.load requires twice the model's size in CPU memory. Version v4.11.0 ships an experimental feature that loads a model while using only the model's size in memory.

Enable it by passing the low_cpu_mem_usage=True argument when loading PyTorch pretrained models.

  • 1x model size CPU memory usage for from_pretrained #13466 (@stas00)

GPT-Neo: simplified local attention

The GPT-Neo local attention was greatly simplified with no loss of performance.

Breaking changes

We strive for no breaking changes between releases - however, some bugs are not discovered for long periods of time, and users may eventually rely on such bugs. We document here such changes that may affect users when updating to a recent version.

Order of overflowing tokens

The overflowing tokens returned by the slow tokenizers were returned in the wrong order. This is changed in the PR below.

Non-prefixed tokens for token classification pipeline

Updates the behavior of aggregation_strategy to more closely mimic the deprecated grouped_entities pipeline argument.

  • Fixing backward compatiblity for non prefixed tokens (B-, I-). #13493 (@Narsil)

Inputs normalization for Wav2Vec2 feature extractor

The changes in v4.10 (#12804) introduced a bug in inputs normalization for non-padded tensors that affected Wav2Vec2 fine-tuning.
This is fixed in the PR below.

General bug fixes and improvements

v4.10.3: Patch release

22 Sep 20:00

Patches an issue with the serialization of the TrainingArguments

v4.10.2: Patch release

10 Sep 16:37

v4.10.1: Patch release

10 Sep 14:40