Fix MetaspacePreTokenizer prepend_scheme + xlm-roberta-base support #11
Merged

DePasqualeOrg merged 5 commits into main on Mar 4, 2026
Conversation
…xSpace

MetaspacePreTokenizer.preTokenize() never prepended the replacement character (▁) when add_prefix_space was absent from the tokenizer config, even when prepend_scheme was set to "always". This broke XLM-RoBERTa and any SentencePiece Unigram model relying on Metaspace with prepend_scheme: "always".

The fix aligns with the canonical Rust implementation (huggingface/tokenizers PR #1357), where prepend_scheme is the sole authority:

- init: resolves prependScheme from an explicit prepend_scheme key first, falling back to add_prefix_space for backward compatibility
- preTokenize: switches on prependScheme directly, removing the addPrefixSpace gate

Cherry-picked from huggingface/swift-transformers#319 (f3d5cbf).
addPrefixSpace property and PrependScheme.from(rawValue:) are no longer used — prependScheme is now resolved directly in init.
Cherry-picked from huggingface/swift-transformers#320 (31be7a5).
Test downloads the kredor/punctuate-all tokenizer and verifies correct XLM-RoBERTa tokenization when prepend_scheme is used without add_prefix_space. Cherry-picked from huggingface/swift-transformers#321 (67baef8).
Use downloadModel + AutoTokenizer.from(directory:) instead of AutoTokenizer.from(pretrained:), which doesn't exist in our fork.
Cherry-picks upstream PRs #319, #320, #321 with minor adaptations for our fork.
Problem
MetaspacePreTokenizer.preTokenize() gated prepending the replacement character (▁) on the addPrefixSpace config flag. When add_prefix_space was absent from the tokenizer config (defaulting to false), the replacement was never prepended, even when prepend_scheme was set to "always". This broke XLM-RoBERTa and any SentencePiece Unigram model relying on Metaspace with prepend_scheme: "always".

Fix
The fix aligns with the canonical Rust implementation (huggingface/tokenizers PR #1357), where prepend_scheme is the sole authority:

- init: resolves prependScheme from an explicit prepend_scheme key first, falling back to add_prefix_space for backward compatibility (defaulting to .always when both are absent).
- preTokenize: switches on prependScheme directly, removing the addPrefixSpace gate.

Changes
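The resolution order above can be sketched as follows. This is a simplified illustration of the described behavior, not the fork's actual code; the type and property names are placeholders.

```swift
import Foundation

enum PrependScheme: String {
    case always, first, never
}

struct MetaspaceSketch {
    let replacement: Character = "▁"
    let prependScheme: PrependScheme

    // init: an explicit prepend_scheme wins; add_prefix_space is only a
    // legacy fallback; default is .always when both keys are absent.
    init(prependSchemeRaw: String?, addPrefixSpace: Bool?) {
        if let raw = prependSchemeRaw, let scheme = PrependScheme(rawValue: raw) {
            prependScheme = scheme
        } else if let addPrefix = addPrefixSpace {
            prependScheme = addPrefix ? .always : .never
        } else {
            prependScheme = .always
        }
    }

    // preTokenize (simplified): switch on prependScheme directly,
    // with no separate addPrefixSpace gate.
    func normalize(_ text: String, isFirstSection: Bool = true) -> String {
        var s = text.replacingOccurrences(of: " ", with: String(replacement))
        switch prependScheme {
        case .always:
            if !s.hasPrefix(String(replacement)) { s = String(replacement) + s }
        case .first:
            if isFirstSection, !s.hasPrefix(String(replacement)) {
                s = String(replacement) + s
            }
        case .never:
            break
        }
        return s
    }
}
```

With prepend_scheme: "always" and no add_prefix_space key in the config, "Hello world" normalizes to "▁Hello▁world", which is what XLM-RoBERTa expects; before the fix, the absent add_prefix_space flag suppressed the leading ▁.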
Cherry-picked from upstream:

- "Xlm-RobertaTokenizer" model mapping (case variant needed for FacebookAI/xlm-roberta-base) + integration test.
- Integration test against kredor/punctuate-all verifying correct XLM-RoBERTa tokenization with prepend_scheme.

Additional changes

- Removed the addPrefixSpace stored property (no longer read after init) and the PrependScheme.from(rawValue:) static method (replaced by inline resolution).
- Use downloadModel + AutoTokenizer.from(directory:) instead of AutoTokenizer.from(pretrained:), which doesn't exist in our fork.
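The adapted integration test could look roughly like this. This is a hedged sketch: the exact downloadModel signature and the assertion are assumptions, not the fork's actual test code.

```swift
import XCTest

final class XLMRobertaMetaspaceTests: XCTestCase {
    func testPrependSchemeAlwaysWithoutAddPrefixSpace() async throws {
        // Download the tokenizer files locally, then load from the directory,
        // since AutoTokenizer.from(pretrained:) does not exist in this fork.
        // `downloadModel(_:)` returning a local URL is an assumed helper.
        let modelDir = try await downloadModel("kredor/punctuate-all")
        let tokenizer = try await AutoTokenizer.from(directory: modelDir)

        // With prepend_scheme: "always" and add_prefix_space absent from the
        // config, Metaspace must still prepend ▁ to the first token.
        let tokens = tokenizer.tokenize(text: "Hello world")
        XCTAssertEqual(tokens.first?.hasPrefix("▁"), true)
    }
}
```

The point of loading via a directory rather than a hub identifier is that the fork's AutoTokenizer only exposes local loading; the download step is factored out into the shared downloadModel helper.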