Releases: huggingface/transformers
v4.12.2: Patch release
Fixes an issue with the image segmentation pipeline and PyTorch's inference mode.
v4.12.1: Patch release
Enables torch 1.10.0
v4.12.0: TrOCR, SEW & SEW-D, Unispeech & Unispeech-SAT, BARTPho
TrOCR and VisionEncoderDecoderModel
One new model is released as part of the TrOCR implementation: TrOCRForCausalLM, in PyTorch. It comes along with a new VisionEncoderDecoderModel class, which allows mixing and matching any vision Transformer encoder with any text Transformer as decoder, similar to the existing SpeechEncoderDecoderModel class.
The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
The TrOCR model consists of an image transformer encoder and an autoregressive text transformer to perform optical character recognition in an end-to-end manner.
- Add TrOCR + VisionEncoderDecoderModel by @NielsRogge in #13874
Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=trocr
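A minimal sketch of the mix-and-match API (the checkpoint names below are illustrative assumptions; any compatible vision encoder and text decoder should work):
from transformers import VisionEncoderDecoderModel

# Combine an assumed ViT encoder with an assumed BERT decoder into a single
# encoder-decoder model; cross-attention layers are added to the decoder as needed.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # vision Transformer encoder (assumed checkpoint)
    "bert-base-uncased",                  # text Transformer decoder (assumed checkpoint)
)
model.save_pretrained("./vit-bert")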
SEW & SEW-D
SEW and SEW-D (Squeezed and Efficient Wav2Vec) were proposed in Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
SEW and SEW-D models use a Wav2Vec-style feature encoder and introduce temporal downsampling to reduce the length of the transformer encoder. SEW-D additionally replaces the transformer encoder with a DeBERTa one. Both models achieve significant inference speedups without sacrificing the speech recognition quality.
Compatible checkpoints are available on the Hub: https://huggingface.co/models?other=sew and https://huggingface.co/models?other=sew-d
DistilHuBERT
DistilHuBERT was proposed in DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT, by Heng-Jui Chang, Shu-wen Yang, Hung-yi Lee.
DistilHuBERT is a distilled version of the HuBERT model. Using only two transformer layers, the model scores competitively on the SUPERB benchmark tasks.
A compatible checkpoint is available on the Hub: https://huggingface.co/ntu-spml/distilhubert
TensorFlow improvements
Several bug fixes and UX improvements for TensorFlow
Keras callback
Introduction of a Keras callback to push to the hub each epoch, or after a given number of steps:
- Keras callback to push to hub each epoch, or after N steps by @Rocketknight1 in #13773
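A minimal sketch of using the callback (the output directory and datasets below are placeholders, and model is assumed to be a compiled Keras Transformers model):
from transformers.keras_callbacks import PushToHubCallback

# Push the model to the Hub at the end of every epoch; with save_strategy="steps"
# and save_steps=N it pushes every N training steps instead.
push_to_hub_callback = PushToHubCallback(output_dir="./model_checkpoints", save_strategy="epoch")
model.fit(train_dataset, validation_data=eval_dataset, epochs=3, callbacks=[push_to_hub_callback])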
Updates on the encoder-decoder framework
The encoder-decoder framework is now available in TensorFlow, allowing mixing and matching different encoders and decoders together into a single encoder-decoder architecture!
Besides this, the EncoderDecoderModel classes have been updated to work like models such as BART and T5. From now on, users no longer need to pass decoder_input_ids to the model themselves. Instead, they are created automatically based on the labels (namely by shifting them one position to the right, replacing -100 by the pad_token_id and prepending the decoder_start_token_id). Note that this may result in training discrepancies when fine-tuning a model trained with versions prior to 4.12.0 that set decoder_input_ids = labels.
- Fix EncoderDecoderModel classes to be more like BART and T5 by @NielsRogge in #14139
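For example, a sketch assuming a BERT-to-BERT EncoderDecoderModel (the checkpoint name is a placeholder):
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("An input sentence.", return_tensors="pt")
labels = tokenizer("A target sentence.", return_tensors="pt").input_ids

# Since v4.12.0, decoder_input_ids are created internally by shifting `labels`,
# so only the labels need to be passed for training.
loss = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels).loss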
Speech improvements
- Add DistilHuBERT by @anton-l in #14174
- [Speech Examples] Add pytorch speech pretraining by @patrickvonplaten in #13877
- [Speech Examples] Add new audio feature by @patrickvonplaten in #14027
- Add ASR colabs by @patrickvonplaten in #14067
- [ASR] Make speech recognition example more general to load any tokenizer by @patrickvonplaten in #14079
- [Examples] Add an official audio classification example by @anton-l in #13722
- [Examples] Use Audio feature in speech classification by @anton-l in #14052
Auto-model API
To make it easier to extend the Transformers library, every Auto class has a new register method that allows you to register your own custom models, configurations, or tokenizers. See more in the documentation.
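A minimal sketch of the register API (the custom classes below are hypothetical):
from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel

class CustomConfig(PretrainedConfig):
    model_type = "custom-model"

class CustomModel(PreTrainedModel):
    config_class = CustomConfig

# Register the hypothetical config and model so AutoConfig/AutoModel can resolve them.
AutoConfig.register("custom-model", CustomConfig)
AutoModel.register(CustomConfig, CustomModel)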
Bug fixes and improvements
- Fix filtering in test fetcher utils by @sgugger in #13766
- Fix warning for gradient_checkpointing by @sgugger in #13767
- Implement len in IterableDatasetShard by @sgugger in #13780
- [Wav2Vec2] Better error message by @patrickvonplaten in #13777
- Fix LayoutLM ONNX test error by @nishprabhu in #13710
- Enable readme link synchronization by @qqaatw in #13785
- Fix length of IterableDatasetShard and add test by @sgugger in #13792
- [docs/gpt-j] addd instructions for how minimize CPU RAM usage by @patil-suraj in #13795
- [examples run_glue.py] missing requirements scipy, sklearn by @stas00 in #13768
- [examples/flax] use Repository API for push_to_hub by @patil-suraj in #13672
- Fix gather for TPU by @sgugger in #13813
- [testing] auto-replay captured streams by @stas00 in #13803
- Add MultiBERTs conversion script by @gchhablani in #13077
- [Examples] Improve mapping in accelerate examples by @patrickvonplaten in #13810
- [DPR] Correct init by @patrickvonplaten in #13796
- skip gptj slow generate tests by @patil-suraj in #13809
- Fix warning situation: UserWarning: max_length is ignored when padding=True" by @shirayu in #13829
- Updating CITATION.cff to fix GitHub citation prompt BibTeX output. by @arfon in #13833
- Add TF notebooks by @Rocketknight1 in #13793
- Bart: check if decoder_inputs_embeds is set by @silviu-oprea in #13800
- include megatron_gpt2 in installed modules by @stas00 in #13834
- Delete MultiBERTs conversion script by @gchhablani in #13852
- Remove a duplicated bullet point in the GPT-J doc by @yaserabdelaziz in #13851
- Add Mistral GPT-2 Stability Tweaks by @siddk in #13573
- Fix broken link to distill models in docs by @Randl in #13848
- ✨ update image classification example by @nateraw in #13824
- Update no_* argument (HfArgumentParser) by @BramVanroy in #13865
- Update Tatoeba conversion by @Traubert in #13757
- Fixing 1-length special tokens cut. by @Narsil in #13862
- Fix flax summarization example: save checkpoint after each epoch and push checkpoint to the hub by @ydshieh in #13872
- Fixing empty prompts for text-generation when BOS exists. by @Narsil in #13859
- Improve error message when loading models from Hub by @aphedges in #13836
- Initial support for symbolic tracing with torch.fx allowing dynamic axes by @michaelbenayoun in #13579
- Allow dataset to be an optional argument for (Distributed)LengthGroupedSampler by @ZhaofengWu in #13820
- Fixing question-answering with long contexts by @Narsil in #13873
- fix(integrations): consider test metrics by @borisdayma in #13888
- fix: replace asserts by value error by @m5l14i11 in #13894
- Update parallelism.md by @hyunwoongko in #13892
- Autodocument the list of ONNX-supported models by @sgugger in #13884
- Fixing GPU for token-classification in a better way. by @Narsil in #13856
- Update FSNER code in examples->...
v4.11.3: Patch release
This patch release fixes a few issues encountered since the release of v4.11.2:
- [DPR] Correct init (#13796)
- Fix warning situation: UserWarning: max_length is ignored when padding=True" (#13829)
- Bart: check if decoder_inputs_embeds is set (#13800)
- include megatron_gpt2 in installed modules (#13834)
- Fixing 1-length special tokens cut. (#13862)
- Fixing empty prompts for text-generation when BOS exists. (#13859)
- Fixing question-answering with long contexts (#13873)
- Fixing GPU for token-classification in a better way. (#13856)
- Fixing Backward compatiblity for zero-shot (#13855)
- Fix hp search for non sigopt backends (#13897)
- Fix trainer logging_nan_inf_filter in torch_xla mode #13896 (@ymwangg)
- [Trainer] Fix nan-loss condition #13911 (@anton-l)
v4.11.2: Patch release
v4.11.1: Patch release
v4.11.0: GPT-J, Speech2Text2, FNet, Pipeline GPU utilization, dynamic model code loading
GPT-J
Three new models are released as part of the GPT-J implementation: GPTJModel, GPTJForCausalLM, and GPTJForSequenceClassification, in PyTorch.
The GPT-J model was released in the kingoflolz/mesh-transformer-jax repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like causal language model trained on the Pile dataset.
It was contributed by @StellaAthena, @kurumuz, @EricHallahan, and @leogao2.
- GPT-J-6B #13022 (@StellaAthena)
Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=gptj
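A short generation sketch (assuming the EleutherAI/gpt-j-6B checkpoint from the Hub link above; note the full-precision weights are very large):
from transformers import AutoTokenizer, GPTJForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("The Pile is a large, diverse dataset", return_tensors="pt")
# Greedy generation up to 40 tokens total; adjust generation parameters as needed.
generated_ids = model.generate(**inputs, max_length=40)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))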
SpeechEncoderDecoder & Speech2Text2
One new model is released as part of the Speech2Text2 implementation: Speech2Text2ForCausalLM, in PyTorch.
The Speech2Text2 model is used together with Wav2Vec2 for Speech Translation models proposed in Large-Scale Self- and Semi-Supervised Learning for Speech Translation by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
Speech2Text2 is a decoder-only transformer model that can be used with any speech encoder-only model, such as Wav2Vec2 or HuBERT, for speech-to-text tasks. Please refer to the SpeechEncoderDecoder class for how to combine Speech2Text2 with any speech encoder-only model.
- Add SpeechEncoderDecoder & Speech2Text2 #13186 (@patrickvonplaten)
Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=speech2text2
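For instance, a sketch assuming the facebook/s2t-wav2vec2-large-en-de speech-translation checkpoint from the paper above is available on the Hub:
from transformers import SpeechEncoderDecoderModel, Speech2Text2Processor

# Wav2Vec2 encoder + Speech2Text2 decoder packaged as a single SpeechEncoderDecoderModel.
model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")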
FNet
Eight new models are released as part of the FNet implementation: FNetModel, FNetForPreTraining, FNetForMaskedLM, FNetForNextSentencePrediction, FNetForSequenceClassification, FNetForMultipleChoice, FNetForTokenClassification, and FNetForQuestionAnswering, in PyTorch.
The FNet model was proposed in FNet: Mixing Tokens with Fourier Transforms by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. The model replaces the self-attention layer in a BERT model with a Fourier transform, keeping only the real parts of the transform. The model is significantly faster than BERT because it has fewer parameters and is more memory efficient. It achieves about 92-97% of the accuracy of its BERT counterparts on the GLUE benchmark and trains much faster.
- Add FNet #13045 (@gchhablani)
Compatible checkpoints can be found on the Hub: https://huggingface.co/models?other=fnet
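A quick masked-language-modeling sketch (assuming google/fnet-base is among the compatible checkpoints linked above):
import torch
from transformers import FNetForMaskedLM, FNetTokenizer

tokenizer = FNetTokenizer.from_pretrained("google/fnet-base")
model = FNetForMaskedLM.from_pretrained("google/fnet-base")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring token for the masked position.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_index].argmax(dim=-1)))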
TensorFlow improvements
Several bug fixes and UX improvements for TensorFlow:
- Users should notice much fewer unnecessary warnings and less 'console spam' in general while using Transformers with TensorFlow.
- TensorFlow models should be less picky about the specific integer dtypes (int32/int64) that are passed as input.
Changes to compile() and train_step()
- You can now compile our TensorFlow models without passing a loss argument! If you do so, the model will compute the loss internally during the forward pass and then use that value in fit(). This makes it much more convenient to get the right loss, particularly since many models have unique losses for certain tasks that are easy to overlook and annoying to reimplement. Remember to pass your labels as the "labels" key of your input dict when doing this, so that they're accessible to the model during the forward pass (see the sketch below). There is no change in behavior if you do pass a loss argument, so all old code should remain unaffected by this change.
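A sketch of what this looks like in practice (the checkpoint and tf_train_dataset below are assumptions; tf_train_dataset is expected to yield dicts that include a "labels" key):
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
# No loss passed to compile(): the model computes its task loss internally during
# the forward pass, using the "labels" key of the input dict.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))
model.fit(tf_train_dataset, epochs=2)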
Associated PRs:
- Modified TF train_step #13678 (@Rocketknight1)
- Fix Tensorflow T5 with int64 input #13479 (@Rocketknight1)
- MarianMT int dtype fix #13496 (@Rocketknight1)
- Removed console spam from misfiring warnings #13625 (@Rocketknight1)
Pipelines
Pipeline refactor
The pipelines underwent a large refactor that should make contributing pipelines much simpler, and much less error-prone. As part of this refactor, PyTorch-based pipelines are now optimized for GPU performance based on PyTorch's Datasets and DataLoaders.
See below for an example leveraging the superb dataset.
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
import datasets
import tqdm

pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0)
dataset = datasets.load_dataset("superb", name="asr", split="test")

# KeyDataset (only `pt`) will simply return the item in the dict returned by the dataset item
# as we're not interested in the `target` part of the dataset.
for out in tqdm.tqdm(pipe(KeyDataset(dataset, "file"))):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
    # {"text": ....}
    # ....
Audio classification pipeline
An additional pipeline is available for audio classification.
- Add the AudioClassificationPipeline #13342 (@anton-l)
- Enabling automatic loading of tokenizer with pipeline for audio-classification. #13376 (@Narsil)
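For example (a sketch; the checkpoint name and audio file path are assumptions):
from transformers import pipeline

audio_classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")
predictions = audio_classifier("path/to/audio.wav")
print(predictions)  # [{"score": ..., "label": ...}, ...]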
Setters for common properties
Version v4.11.0 introduces setters for common configuration properties. Different configurations have different properties, as they come from different implementations.
One such example is the BertConfig having the hidden_size attribute, while the GPT2Config has the n_embd attribute; the two are essentially the same.
The newly introduced setters allow setting such properties through a standardized naming scheme, even on configuration objects that do not have them by default.
See the following code sample for an example:
from transformers import GPT2Config
config = GPT2Config()
config.hidden_size = 4 # Failed previously
config = GPT2Config(hidden_size=4)  # Failed previously
config.n_embd  # returns 4
config.hidden_size # returns 4
Dynamic model code loading
An experimental feature adding support for custom model code hosted on the Hub is added as part of this release. A walkthrough is available in the PR description.
The trust_remote_code argument is required when instantiating the model from the Hub. We heavily encourage you to also specify a revision if using code from another user's or organization's repository.
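For example (a sketch; the repository name and revision below are placeholders):
from transformers import AutoModel

# trust_remote_code opts in to executing the modeling code hosted in the repository;
# pinning a revision (ideally a commit hash) guards against later changes to that code.
model = AutoModel.from_pretrained(
    "some-user/model-with-custom-code",  # hypothetical repository
    trust_remote_code=True,
    revision="main",
)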
Trainer
The Trainer has received several new features, the main one being that models are uploaded to the Hub each time you save them locally (you can specify another strategy). This push is asynchronous, so training continues normally without interruption.
Also:
- The SigOpt optimization framework is now integrated in the Trainer API as an opt-in component.
- The Trainer API now supports fine-tuning on distributed CPUs.
Associated PRs:
- Push to hub when saving checkpoints #13503 (@sgugger)
- Add SigOpt HPO to transformers trainer api #13572 (@kding1)
- Add cpu distributed fine-tuning support for transformers Trainer API #13574 (@kding1)
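A minimal sketch of the checkpoint-pushing behavior described above (model, train_dataset, and the output directory are assumptions):
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="my-finetuned-model",
    push_to_hub=True,          # upload to the Hub whenever a checkpoint is saved
    save_strategy="epoch",     # save (and therefore push) at the end of each epoch
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()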
Model size CPU memory usage reduction
Loading a model with PyTorch's torch.load temporarily requires twice the model's size in CPU memory. An experimental feature allowing model loading while requiring only the model size in terms of memory usage is out in version v4.11.0.
It can be used by passing the low_cpu_mem_usage=True argument to from_pretrained with PyTorch pretrained models.
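For example (a sketch using an assumed checkpoint name):
from transformers import AutoModelForSeq2SeqLM

# Avoids keeping both the randomly-initialized weights and the loaded state dict
# in memory at the same time.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large", low_cpu_mem_usage=True)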
GPT-Neo: simplified local attention
The GPT-Neo local attention was greatly simplified with no loss of performance.
- [GPT-Neo] Simplify local attention #13491 (@finetuneanon, @patil-suraj)
Breaking changes
We strive for no breaking changes between releases - however, some bugs are not discovered for long periods of time, and users may eventually rely on such bugs. We document here such changes that may affect users when updating to a recent version.
Order of overflowing tokens
The overflowing tokens returned by the slow tokenizers were returned in the wrong order. This is changed in the PR below.
- Correct order of overflowing_tokens for slow tokenizer #13179 (@Apoorvgarg-creator)
Non-prefixed tokens for token classification pipeline
Updates the behavior of aggregation_strategy to more closely mimic the deprecated grouped_entities pipeline argument.
Inputs normalization for Wav2Vec2 feature extractor
The changes in v4.10 (#12804) introduced a bug in inputs normalization for non-padded tensors that affected Wav2Vec2 fine-tuning.
This is fixed in the PR below.
- [Wav2Vec2] Fix normalization for non-padded tensors #13512 (@patrickvonplaten)
General bug fixes and improvements
- Fixes for the documentation #13361 (@sgugger)
- fix wrong 'cls' masking for bigbird qa model output #13143 (@donggyukimc)
- Improve T5 docs #13240 (@NielsRogge)
- Fix tokenizer saving during training with Trainer #12806 (@SaulLu)
- Fix DINO #13369 (@NielsRogge)
- Properly register missing submodules in main init #13372 (@sgugger)
- Add Hubert to the AutoFeatureExtractor #13366 (@anton-l)
- Add missing feature extractors #13374 (@LysandreJik)
- Fix RemBERT tokenizer initialization #13375 (@LysandreJik)
- [Flax] Fix BigBird #13380 (@patrickvonplaten)
- [GPU Tests] Fix SpeechEncoderDecoder GPU tests #13383 (@patrickvonplaten)
- Fix name and get_class method in AutoFeatureExtractor #13385 (@sgugger)
- [Flax/run_hybrid_clip] Fix duplicating images when captions_per_image exceeds th...
v4.10.3: Patch release
Patches an issue with the serialization of the TrainingArguments
v4.10.2: Patch release
- [Wav2Vec2] Fix dtype 64 bug #13517 (@patrickvonplaten)