Releases: huggingface/transformers
Transformers v5
Transformers v5 release notes
- Highlights
- Significant API changes: dynamic weight loading, tokenization
- Backwards Incompatible Changes
- Bugfixes and improvements
We have a migration guide, continuously updated on the main branch; please check it out in case you're facing issues: migration guide.
Highlights
We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and it is significant: 1200 commits have been pushed to main since the latest minor release. This release removes a lot of long-overdue deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.
We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.
This release is the full v5 release. It sets in motion something bigger: starting with v5, we'll release a minor version every week, rather than every 5 weeks. Expect v5.1 to follow next week, then v5.2 the week after, and so on.
We're moving forward with this change to ensure you have access to models as soon as they're supported in the library, rather than a few weeks after.
In order to install this release, please do so with the following:
pip install transformers

For us to deliver the best package possible, it is imperative that we get feedback on how the toolkit is currently working for you. Please try it out, and open an issue in case you run into an inconsistency or a bug.
Transformers version 5 is a community endeavor, and we couldn't have shipped such a massive release without the help of the entire community.
Significant API changes
Dynamic weight loading
We introduce a new weight loading API in transformers, which significantly improves on the previous API. This
weight loading API is designed to apply operations to the checkpoints loaded by transformers.
Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge,
and split the layers according to how they're defined in this new API. These operations are often a necessity when
working with quantization or parallelism algorithms.
This new API is centered around the new WeightConverter class:
class WeightConverter(WeightTransform):
    operations: list[ConversionOps]
    source_keys: Union[str, list[str]]
    target_keys: Union[str, list[str]]

The weight converter is designed to apply a list of operations to the source keys, resulting in the target keys. A common operation on attention layers is to fuse the query, key, and value layers. Doing so with this API amounts to defining the following conversion:
conversion = WeightConverter(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],  # The input layers
    "self_attn.qkv_proj",  # The single layer as output
    operations=[Concatenate(dim=0)],
)

In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single layer.
This allows us to define a mapping from each architecture to a list of weight conversions, which can apply arbitrary transformations to the layers themselves. This significantly simplified the from_pretrained method and helped us remove a lot of technical debt accumulated over the past few years.
This results in several improvements:
- Much cleaner definition of transformations applied to the checkpoint
- Reversible transformations, so loading and saving a checkpoint should result in the same checkpoint
- Faster model loading thanks to scheduling of tensor materialization
- Enables complex mixes of transformations that wouldn't otherwise be possible (such as quantization + MoEs, or TP + MoEs)
Linked PR: #41580
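As a rough illustration of the per-architecture mapping mentioned above, here is a minimal sketch. Only WeightConverter, Concatenate, and the qkv fusion come from these notes; the import path, the mapping variable name, and the gate/up fusion are assumptions made for illustration.

```python
# Hedged sketch: the import path and the gate/up fusion are assumptions.
from transformers import WeightConverter, Concatenate  # import path assumed

LLAMA_LIKE_CONVERSIONS = [
    # Fuse the three attention projections into a single qkv projection
    WeightConverter(
        ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
        "self_attn.qkv_proj",
        operations=[Concatenate(dim=0)],
    ),
    # Hypothetical second conversion: fuse the MLP gate and up projections
    WeightConverter(
        ["mlp.gate_proj", "mlp.up_proj"],
        "mlp.gate_up_proj",
        operations=[Concatenate(dim=0)],
    ),
]
```

Because each conversion declares its source and target keys explicitly, the same list can be walked in reverse when saving, which is what makes the transformations reversible.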
Tokenization
Just as we moved towards a single backend library for model definition, we want our tokenizers, and the Tokenizer object, to be a lot more intuitive. With v5, tokenizer definition is much simpler: you can now initialize an empty LlamaTokenizer and train it directly on your corpus.
Defining a new tokenizer object should be as simple as this:
from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE


class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }
        else:
            self._vocab = vocab
        self._merges = merges
        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True)
        )
        # _get_prepend_scheme is an internal transformers helper
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme=_get_prepend_scheme(self.add_prefix_space, self), split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )

Once the tokenizer is defined as above, you can instantiate it with Llama5Tokenizer(). Doing so returns an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet 😉).
This is the main motivation behind the tokenization refactor: we want tokenizers to behave similarly to models, whether trained or empty, and to contain exactly what is defined in their class definition.
Backend Architecture Changes: moving away from the slow/fast tokenizer separation
Up to now, transformers maintained two parallel implementations for many tokenizers:
- "Slow" tokenizers (
tokenization_<model>.py) - Python-based implementations, often using SentencePiece as the backend. - "Fast" tokenizers (
tokenization_<model>_fast.py) - Rust-based implementations using the 🤗 tokenizers library.
In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available:
- TokenizersBackend (preferred): Rust-based tokenizers from the 🤗 tokenizers library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem:
  - handling additional tokens
  - a full Python API for setting and updating
  - automatic parallelization
  - automatic offsets
  - customization
  - training
- SentencePieceBackend: for tokenizers requiring the sentencepiece library. It inherits from PythonBackend.
- PythonBackend: a Python implementation of the features provided by tokenizers; it basically allows adding tokens.
- MistralCommonBackend: relies on the mistral-common tokenization library. (Previously known as the MistralCommonTokenizer.)
The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use AutoTokenizer.from_pretrained() as before. This keeps transformers future-proof and modular, making it easy to support additional backends in the future.
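For example (the checkpoint below is only an example; the concrete class you get back depends on the files it ships and the libraries you have installed):

```python
from transformers import AutoTokenizer

# The backend is selected automatically from the available files/dependencies.
tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

# Inspect which backend class was picked; with the `tokenizers` library
# installed and a tokenizer.json available, a TokenizersBackend subclass
# is expected here.
print(type(tok).__name__)
print(tok("Hello world")["input_ids"])
```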
Defining a tokenizer outside of the existing backends
We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design a tokenizer at a higher level, without relying on those backends.
To do so, you can import the PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding.
If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:
- encode
- decode
- vocab_size
- get_vocab
- convert_tokens_to_ids
- convert_ids_to_tokens
- from_pretrained
- save_pretrained
- among a few others
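As an illustration, a minimal backend-free tokenizer could look like the sketch below. It assumes the hook methods of the v4 PreTrainedTokenizer contract (_tokenize, _convert_token_to_id, _convert_id_to_token) carry over to PythonBackend unchanged, so treat it as a sketch rather than a reference implementation.

```python
from transformers import PythonBackend


class WhitespaceTokenizer(PythonBackend):
    """Toy backend-free tokenizer that simply splits text on whitespace."""

    def __init__(self, vocab=None, unk_token="<unk>", **kwargs):
        # Explicit token -> id mapping; anything unknown falls back to unk_token.
        self._vocab = dict(vocab) if vocab is not None else {str(unk_token): 0}
        self._ids_to_tokens = {i: t for t, i in self._vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self._vocab)

    def get_vocab(self):
        return dict(self._vocab)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self._vocab.get(token, self._vocab[str(self.unk_token)])

    def _convert_id_to_token(self, index):
        return self._ids_to_tokens.get(index, str(self.unk_token))


# Hypothetical usage; the ids depend entirely on the vocab passed in.
tok = WhitespaceTokenizer(vocab={"<unk>": 0, "hello": 1, "world": 2})
print(tok.encode("hello world", add_special_tokens=False))  # expected: [1, 2]
```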
API Changes
1. Direct tokenizer initialization with vocab and merges
Starting with v5, we now enable initializing blank, untrained tokenizers-backed tokenizers:
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()

This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus, as shown in the tokenizers documentation and sketched below.
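A hedged sketch of that training flow, assuming train_new_from_iterator (available on fast tokenizers in v4) is still exposed by the tokenizers backend in v5:

```python
from transformers import LlamaTokenizer

# Toy corpus; in practice, stream text from your dataset.
corpus = ["hello world", "hello transformers", "tokenizers are fun"]

tokenizer = LlamaTokenizer()  # empty, untrained tokenizer

# Assumed to behave as in v4: trains a new tokenizer with the same pipeline
# (pre-tokenizer, model type, special tokens) on the provided iterator.
trained = tokenizer.train_new_from_iterator(iter(corpus), vocab_size=64)

print(trained.tokenize("hello world"))
```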
These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:
from transformers import LlamaTokenizer

vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]
tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)

This tokenizer will behave as a Llama-like tokenizer, with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab; therefore enabling the comp...
Patch release v4.57.6
What's Changed
Another fix for Qwen VL models, whose associated model type could not be loaded correctly - this works together with #41808 from the previous patch release.
- Fixed incorrect model_type for qwen2vl and qwen2.5vl when config is saved and loaded again by @i3hz in #41758
Full Changelog: v4.57.5...v4.57.6
Release candidate v5.0.0rc3
Release candidate v5.0.0rc3
New models:
- [GLM-4.7] GLM-Lite Supoort by @zRzRzRzRzRzRzR in #43031
- [GLM-Image] AR Model Support for GLM-Image by @zRzRzRzRzRzRzR in #43100
- Add LWDetr model by @sbucaille in #40991
- Add LightOnOCR model implementation by @baptiste-aubertin in #41621
What's Changed
We are getting closer and closer to the official release!
This RC is focused on removing more deprecated code, fixing some minor issues, and updating docs.
- Update Japanese README to match English version by @lilin-1 in #43069
- [docs] Deploying by @stevhliu in #42263
- [docs] inference engines by @stevhliu in #42932
- Fix typos: Remove duplicate duplicate words words by @efeecllk in #43040
- [style] Rework ruff rules and update all files by @Cyrilvallez in #43144
- [CB] Minor fix in kwargs by @remi-or in #43147
- [Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT by @sniper35 in #43068
- Fix some deprecated practices in torch 2.9 by @Cyrilvallez in #43167
- Fix Fuyu processor width dimension bug in _get_num_multimodal_tokens by @Abhinavexists in #43137
- Inherit from PreTrainedTokenizerBase by @juliendenize in #43143
- Generation config boolean defaults by @zucchini-nlp in #43000
- Fix failing BartModelIntegrationTest by @Sai-Suraj-27 in #43160
- fix failure of llava/pixtral by @sywangyi in #42985
- GemmaTokenizer: remove redundant whitespace pre-tokenizer by @vaibhav-research in #43106
- Support auto_doctring in Processors by @yonigozlan in #42101
- Fix failing BitModelIntegrationTest by @Sai-Suraj-27 in #43164
- [Fp8] Fix experts by @vasqu in #43154
- Docs: improve wording for documentation build instructions by @Sailnagale in #43007
- [makefile] Cleanup and improve the rules by @Cyrilvallez in #43171
- Some new models added stuff that was already removed by @Cyrilvallez in #43179
- Fixes and compilation warning in torchao docs by @merveenoyan in #42909
- [cache] Remove all deprecated classes by @Cyrilvallez in #43168
- Bump huggingface_hub minimal version by @Wauplin in #43188
- Rework check_config_attributes.py by @Cyrilvallez in #43191
- Fix generation config validation by @zucchini-nlp in #43175
- [style] Use 'x | y' syntax for processors as well by @Wauplin in #43189
- Remove deprecated objects by @Cyrilvallez in #43170
- fix chunked prefill implementation issue-43082 by @marcndo in #43132
- Reduce add_dates verbosity by @yonigozlan in #43184
- Add support for MiniMax-M2 by @rogeryoungh in #42028
- Fix failing salesforce-ctrl, xlm & gpt-neo model generation tests by @Sai-Suraj-27 in #43180
- Less verbose library helpers by @Cyrilvallez in #43197
- run all test files on CircleCI by @ydshieh in #43146
- Clamp temperature to >=1.0 for Dia generation by @Haseebasif7 in #43029
- Fix spelling typos in comments and code by @raimbekovm in #43046
- [docs] llama.cpp by @stevhliu in #43185
- [docs] gptq formatting fix by @victorywwong in #43216
- Grouped beam search from config params by @zucchini-nlp in #42472
- [Generate] Allow custom config values in generate config by @vasqu in #43181
- Fix failing Pix2StructIntegrationTest by @Sai-Suraj-27 in #43229
- Fix missing UTF-8 encoding in check_repo.py for Windows compatibility by @aarushisingh04 in #43123
- [Tokenizer] Change default value of return_dict to True in doc string for apply_chat_template by @kashif in #43223
- Fix failing PhiIntegrationTests by @Sai-Suraj-27 in #43214
- Use HF_TOKEN directly and remove require_read_token by @ydshieh in #43233
- Fix failing Owlv2ModelIntegrationTest & OwlViTModelIntegrationTest by @Sai-Suraj-27 in #43182
- Fix flashattn wrt quantized models by @SunMarc in #43145
- Remove unused imports by @cyyever in #43078
- Fix unsafe torch.load() in _load_rng_state allowing arbitrary code execution by @ColeMurray in #43140
- Reapply modular to examples by @Cyrilvallez in #43234
- More robust diff checks in add_dates by @yonigozlan in #43199
- docs: fix grammatical error in README.md by @davidfertube in #43236
- Fix typo: seperately → separately in lw_detr converter by @skyvanguard in #43235
- Qwen-VL video processor accepts min/max pixels by @zucchini-nlp in #43228
- Deprecate dtype per sub config by @zucchini-nlp in #42990
- Remove more deprecated objects/args by @Cyrilvallez in #43195
- [CB] Soft-reset offloading by @remi-or in #43150
- Make benchmark-v2 to be device agnostic, to support more torch built-in devices like xpu by @yao-matrix in #43153
- Fix benchmark script by @Cyrilvallez in #43253
- Adding to run slow by @IlyasMoutawwakil in #43250
- Fix failing Vip-llava model integration test by @Sai-Suraj-27 in #43252
- Remove deprecated and unused position_ids in all apply_rotary_pos_emb by @Cyrilvallez in #43255
- fix _get_test_info in testing_utils.py by @ydshieh in #43259
- Fix failing Hiera, SwiftFormer & LED Model integration tests by @Sai-Suraj-27 in #43225
- [style] Fix init isort and align makefile and CI by @Cyrilvallez in #43260
- [docs] tensorrt-llm by @stevhliu in #43176
- [consistency] Ensure models are added to the _toctree.yml by @Cyrilvallez in #43264
- Fix failing PegasusX, Mvp & LED model integration tests by @Sai-Suraj-27 in #43245
- [CB] Ensure parallel decoding test passes using FA by @remi-or in #43277
- fix crash in when running FSDP2+TP by @sywangyi in #43226
- [ci] Fixing some failing tests for important models by @Abdennacer-Badaoui in #43231
New Contributors
- @efeecllk made their first contribution in #43040
- @sniper35 made their first contribution in #43068
- @Abhinavexists made their first contribution in #43137
- @vaibhav-research made their first contribution in #43106
- @Sailnagale made their first contribution in #43007
- @rogeryoungh made their first contribution in #42028
- @Haseebasif7 made their first contribution in #43029
- @victorywwong made their first contribution in #43216
- @aarushisingh04 made their first contributi...
Patch release v4.57.5
What's Changed
Should not have said last patch 😉 These should be the last remaining fixes that got lost in between patches and the transition to v5.
- QwenVL: add skipped keys in setattr as well by @zucchini-nlp in #41808
- Fix lr_scheduler_parsing by @SunMarc in #41322
Full Changelog: v4.57.4...v4.57.5
Patch release v4.57.4
What's Changed
Last patch release for v4: We have a few small fixes for remote generation methods (e.g. group beam search), vLLM, and an offline tokenizer fix (if it's already been cached).
- Grouped beam search from config params by @zucchini-nlp in #42472
- Handle decorator with optional arguments better @hmellor in #42512
- fix: make mistral base check conditional to fix offline loading by @Killusions in #42880
New Contributors
- @Killusions made their first contribution in #42880
Full Changelog: v4.57.3...v4.57.4
Release candidate 5.0.0rc2
What's Changed
This release candidate is focused on fixing AutoTokenizer, expanding the dynamic weight loading support, and improving performance with MoEs!
MoEs and performance:
- batched and grouped experts implementations by @IlyasMoutawwakil in #42697
- Optimize MoEs for decoding using batched_mm by @IlyasMoutawwakil in #43126
Tokenization:
The main issue with the tokenization refactor was that the tokenizer_class is now "enforced", when in most cases it is wrong. This took a while to properly isolate, and we now try to use TokenizersBackend whenever we can. #42894 has a much more detailed description of the big changes!
- use TokenizersBackend by @ArthurZucker in #42894
- Fix convert_tekken_tokenizer by @juliendenize in #42592
- refactor more tokenizers - v5 guide update by @itazap in #42768
- [Tokenizers] Change treatment of special tokens by @vasqu in #42903
Core
Here we focused on boosting the performance of loading weights on device!
- [saving] Simplify general logic by @Cyrilvallez in #42766
- Do not rely on config for inferring model dtype by @Cyrilvallez in #42838
- Improve BatchFeature: stack list and lists of torch tensors by @yonigozlan in #42750
- Remove tied weights from internal attribute if they are not tied by @Cyrilvallez in #42871
- Enforce call to post_init and fix all of them by @Cyrilvallez in #42873
- Simplify tie weights logic by @Cyrilvallez in #42895
- Add buffers to _init_weights for ALL models by @Cyrilvallez in #42309
- [loading] Really initialize on meta device for huge perf gains by @Cyrilvallez in #42941
- Do not use accelerate hooks if the device_map has only 1 device by @Cyrilvallez in #43019
- Move missing weights and non-persistent buffers to correct device earlier by @Cyrilvallez in #43021
New models
- Sam: Perception Encoder Audiovisual by @eustlb in #42905
- adds jais2 model support by @sarathc-cerebras in #42684
- Add Pixio pre-trained models by @LiheYoung in #42795
- [Ernie 4.5] Ernie VL models by @vasqu in #39585
- [loading][TP] Fix device placement at loading-time, and simplify sharding primitives by @Cyrilvallez in #43003
- GLM-ASR Support by @zRzRzRzRzRzRzR in #42875
Quantization
- [Devstral] Make sure FP8 conversion works correctly by @patrickvonplaten in #42715
- Fp8 dq by @SunMarc in #42926
- [Quantization] Removing misleading int8 quantization in Finegrained FP8 by @MekkCyber in #42945
- Fix deepspeed + quantization by @SunMarc in #43006
Breaking changes
Mostly around processors!
- 🚨 Fix ConvNeXt image processor default interpolation to BICUBIC by @lukepayyapilli in #42934
- 🚨 Fix EfficientNet image processor default interpolation to BICUBIC by @lukepayyapilli in #42956
- Add fast version of convert_segmentation_map_to_binary_masks to EoMT by @simonreise in #43073
- 🚨 Fix MobileViT image processor default interpolation to BICUBIC by @lukepayyapilli in #43024
Thanks again to everyone!
New Contributors
- @ZX-ModelCloud made their first contribution in #42833
- @AYou0207 made their first contribution in #42863
- @wasertech made their first contribution in #42864
- @preetam1407 made their first contribution in #42685
- @Taise228 made their first contribution in #41416
- @CandiedCode made their first contribution in #42885
- @sarathc-cerebras made their first contribution in #42684
- @nandan2003 made their first contribution in #42318
- @LiheYoung made their first contribution in #42795
- @majiayu000 made their first contribution in #42928
- @lukepayyapilli made their first contribution in #42934
- @leaderofARS made their first contribution in #42966
- @qianyue76 made their first contribution in #43095
- @stefgina made their first contribution in #43033
- @HuiyingLi made their first contribution in #43084
- @raimbekovm made their first contribution in #43038
- @PredictiveManish made their first contribution in #43053
- @pushkar-hue made their first contribution in #42736
- @vykhovanets made their first contribution in #43042
- @tanmay2004 made their first contribution in #42737
- @atultw made their first contribution in #43061
Full Changelog: v5.0.0rc1...v5.0.0rc2
Release candidate 5.0.0rc1
What's Changed
This release candidate was focused mostly on quantization support with the new dynamic weight loader, and a few notable 🚨 breaking changes 🚨:
- Default dtype for any model when using from_pretrained is now auto!
  - Default auto 🚨 🚨 by @ArthurZucker in #42805
- Default shard size when saving a model is now 50GB. This is now as fast as before thanks to xet, and is just more convenient on the Hub.
  - 🚨🚨 [saving] Default to 50GB shards, and remove non-safe serialization by @Cyrilvallez in #42734
- Kwargs: they are fundamental to enable integration with vLLM and other tools.
  - Every model forward() should have **kwargs by @Rocketknight1 in #42603
Dynamic weight loader updates:
Mostly QoL improvements and fixes, plus restored support for CPU offloading.
- mark params as _is_hf_initialized with DS Zero3 from weight conversion by @winglian in #42626
- [loading] Allow loading to happen without threading by @Cyrilvallez in #42619
- [loading] Correctly load params during offloading & careful memory considerations by @Cyrilvallez in #42632
- allow registration of custom checkpoint conversion mappings by @winglian in #42634
New models:
- Add FastVLM by @camilla-deckard in #41112
- Lasr model by @eustlb in #42648
- [Model] Add PaddleOCR-VL Model Support by @zhang-prog in #42178
Some notable quantization fixes:
Mostly added support for fbgemm and quanto.
- Fix fp8 + some enhancement by @SunMarc in #42455
- Fix eetq quanto quant methods by @SunMarc in #42557
- [Quantization] per tensor quantization kernel by @MekkCyber in #42560
- [Quantization] fix fbgemm by @MekkCyber in #42561
- [Quantization] Fix FP8 experts replacing by @MekkCyber in #42654
- [Quantization] Fix Static FP8 Quantization by @MekkCyber in #42775
- [core] fix fp-quant by @MekkCyber in #42613
Peft:
The dynamic weight loader broke a few small things; these PRs add the necessary glue for all models except MoEs.
- FIX Error when trying to load non-LoRA PEFT by @BenjaminBossan in #42663
- Fix PEFT integration with new weight loader by @Cyrilvallez in #42701
Misc
Tokenization needed more refactoring; this time it's a lot cleaner!
- Refactor-tokenization-more by @ArthurZucker in #42563
- Only default rope_parameters to empty dict if there is something to put in it by @hmellor in #42651
We omitted a lot of other commits for clarity, but thanks to everyone and the new contributors!
New Contributors
- @camilla-deckard made their first contribution in #41112
- @Aaraviitkgp made their first contribution in #42466
- @ngazagna-qc made their first contribution in #40691
- @arrdel made their first contribution in #42577
- @marconaguib made their first contribution in #42587
- @Xiao-Chenguang made their first contribution in #42436
- @Furkan-rgb made their first contribution in #42465
- @mertunsall made their first contribution in #42615
- @anranlee99 made their first contribution in #42438
- @UserChen666 made their first contribution in #42335
- @efazal made their first contribution in #41723
- @Harrisonyong made their first contribution in #36416
- @hawon223 made their first contribution in #42384
- @Bissmella made their first contribution in #42647
- @AgainstEntropy made their first contribution in #42689
- @dongluw made their first contribution in #42642
- @hqkqn32 made their first contribution in #42620
- @zhang-prog made their first contribution in #42178
Full Changelog: v5.0.0rc0...v5.0.0rc1
Transformers v5.0.0rc0
Transformers v5 release notes
- Highlights
- Significant API changes: dynamic weight loading, tokenization
- Backwards Incompatible Changes
- Bugfixes and improvements
Highlights
We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and it is significant: 800 commits have been pushed to main since the latest minor release. This release removes a lot of long-overdue deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.
We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.
This release is a release candidate (RC). It is not the final v5 release, and we will push it to PyPI as a pre-release. This means that the current release is purely opt-in, as installing transformers without specifying this exact release will install the latest version instead (v4.57.3 as of writing).
In order to install this release, please do so with the following:
pip install transformers --pre

For us to deliver the best package possible, it is imperative that we get feedback on how the toolkit is currently working for you. Please try it out, and open an issue in case you run into an inconsistency or a bug.
Transformers version 5 is a community endeavor, and this is the last mile. Let's ship this together!
Significant API changes
Note
👀 Nothing is final and things are still actively in movement. We have a section dedicated to what is planned for future release candidates, yet is known not to work in the RC0. Look for "Disclaimers for the RC0".
We'll be eagerly awaiting your feedback in our GitHub issues!
Dynamic weight loading
We introduce a new weight loading API in transformers, which significantly improves on the previous API. This
weight loading API is designed to apply operations to the checkpoints loaded by transformers.
Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge,
and split the layers according to how they're defined in this new API. These operations are often a necessity when
working with quantization or parallelism algorithms.
This new API is centered around the new WeightConverter class:
class WeightConverter(WeightTransform):
    operations: list[ConversionOps]
    source_keys: Union[str, list[str]]
    target_keys: Union[str, list[str]]

The weight converter is designed to apply a list of operations to the source keys, resulting in the target keys. A common operation on attention layers is to fuse the query, key, and value layers. Doing so with this API amounts to defining the following conversion:
conversion = WeightConverter(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],  # The input layers
    "self_attn.qkv_proj",  # The single layer as output
    operations=[Concatenate(dim=0)],
)

In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single layer.
This allows us to define a mapping from each architecture to a list of weight conversions, which can apply arbitrary transformations to the layers themselves. This significantly simplified the from_pretrained method and helped us remove a lot of technical debt accumulated over the past few years.
This results in several improvements:
- Much cleaner definition of transformations applied to the checkpoint
- Reversible transformations, so loading and saving a checkpoint should result in the same checkpoint
- Faster model loading thanks to scheduling of tensor materialization
- Enables complex mixes of transformations that wouldn't otherwise be possible (such as quantization + MoEs, or TP + MoEs)
While this is being implemented, expect varying levels of support across different release candidates.
Linked PR: #41580
Tokenization
Just as we moved towards a single backend library for model definition, we want our tokenizers, and the Tokenizer object, to be a lot more intuitive. With v5, tokenizer definition is much simpler: you can now initialize an empty LlamaTokenizer and train it directly on your corpus.
Defining a new tokenizer object should be as simple as this:
from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE


class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }
        else:
            self._vocab = vocab
        if merges is not None:
            self._merges = merges
        else:
            # Derive merges from the vocabulary when none are provided
            self._merges = generate_merges(self._vocab)
        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True)
        )
        # _get_prepend_scheme is an internal transformers helper
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme=_get_prepend_scheme(self.add_prefix_space, self), split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )

Once the tokenizer is defined as above, you can instantiate it with Llama5Tokenizer(). Doing so returns an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet 😉).
This is the main motivation behind the tokenization refactor: we want tokenizers to behave similarly to models, whether trained or empty, and to contain exactly what is defined in their class definition.
Backend Architecture Changes: moving away from the slow/fast tokenizer separation
Up to now, transformers maintained two parallel implementations for many tokenizers:
- "Slow" tokenizers (
tokenization_<model>.py) - Python-based implementations, often using SentencePiece as the backend. - "Fast" tokenizers (
tokenization_<model>_fast.py) - Rust-based implementations using the 🤗 tokenizers library.
In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available:
- TokenizersBackend (preferred): Rust-based tokenizers from the 🤗 tokenizers library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem:
  - handling additional tokens
  - a full Python API for setting and updating
  - automatic parallelization
  - automatic offsets
  - customization
  - training
- SentencePieceBackend: for tokenizers requiring the sentencepiece library. It inherits from PythonBackend.
- PythonBackend: a Python implementation of the features provided by tokenizers; it basically allows adding tokens.
- MistralCommonBackend: relies on the mistral-common tokenization library. (Previously known as the MistralCommonTokenizer.)
The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use AutoTokenizer.from_pretrained() as before. This keeps transformers future-proof and modular, making it easy to support additional backends in the future.
Defining a tokenizer outside of the existing backends
We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design a tokenizer at a higher level, without relying on those backends.
To do so, you can import the PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding.
If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:
- encode
- decode
- vocab_size
- get_vocab
- convert_tokens_to_ids
- convert_ids_to_tokens
- from_pretrained
- save_pretrained
- among a few others
API Changes
1. Direct tokenizer initialization with vocab and merges
Starting with v5, we now enable initializing blank, untrained tokenizers-backed tokenizers:
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()

This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus, as shown in the tokenizers documentation.
These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:
from transformers import LlamaTokenizer

vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]
tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)

This tokenizer will behave as a Llama-like toke...
Patch release v4.57.3
There was a hidden bug when loading models with local_files_only=True and a typo related to the recent patch.
The main fix is: b605555.
We are really sorry that this slipped through; our CIs just did not catch it.
As it affects a lot of users, we are going to yank the previous release.
Patch Release v4.57.2
This patch most notably fixes an issue on some Mistral tokenizers. It contains the following commits: