
Fix MetaspacePreTokenizer prepend_scheme + xlm-roberta-base support#11

Merged
DePasqualeOrg merged 5 commits into main from fix-metaspace-prepend-scheme
Mar 4, 2026

Conversation

@DePasqualeOrg
Owner

Cherry-picks upstream PRs #319, #320, #321 with minor adaptations for our fork.

Problem

MetaspacePreTokenizer.preTokenize() gated prepending the replacement character (▁) on the addPrefixSpace config flag. When add_prefix_space was absent from the tokenizer config (defaulting to false), the replacement was never prepended – even when prepend_scheme was set to "always". This broke XLM-RoBERTa and any SentencePiece Unigram model relying on Metaspace with prepend_scheme: "always".

Fix

The fix aligns with the canonical Rust implementation (tokenizers PR #1357), where prepend_scheme is the sole authority:

  • init: resolves prependScheme from explicit prepend_scheme first, falling back to add_prefix_space for backward compatibility (defaulting to .always when both are absent).
  • preTokenize: switches on prependScheme directly, removing the addPrefixSpace gate.
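A minimal sketch of the resolution logic described above, assuming a simplified stand-in for the tokenizer config (the `MetaspaceConfig` struct and field names here are illustrative, not the fork's actual types):

```swift
// Sketch of prepend_scheme resolution: the explicit prepend_scheme value
// wins; add_prefix_space is only consulted as a legacy fallback; and when
// both are absent the scheme defaults to .always.

enum PrependScheme: String {
    case always, never, first
}

// Simplified stand-in for the relevant tokenizer.json fields.
struct MetaspaceConfig {
    var prependScheme: String?   // "prepend_scheme"
    var addPrefixSpace: Bool?    // legacy "add_prefix_space"
}

func resolvePrependScheme(_ config: MetaspaceConfig) -> PrependScheme {
    // prepend_scheme is the sole authority when present.
    if let raw = config.prependScheme, let scheme = PrependScheme(rawValue: raw) {
        return scheme
    }
    // Backward compatibility: fall back to add_prefix_space.
    if let addPrefix = config.addPrefixSpace {
        return addPrefix ? .always : .never
    }
    // Both absent: default to .always.
    return .always
}
```

With this shape, preTokenize can switch on the resolved `PrependScheme` directly, so a config that says `prepend_scheme: "always"` prepends ▁ regardless of whether add_prefix_space was ever set.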

Changes

Cherry-picked from upstream

  • #319 – Bug fix for MetaspacePreTokenizer + 8 unit tests covering all prepend_scheme / add_prefix_space combinations.
  • #320 – Adds "Xlm-RobertaTokenizer" model mapping (case variant needed for FacebookAI/xlm-roberta-base) + integration test.
  • #321 – Integration test for kredor/punctuate-all verifying correct XLM-RoBERTa tokenization with prepend_scheme.

Additional changes

  • Removed dead code left by the bug fix: addPrefixSpace stored property (no longer read after init) and PrependScheme.from(rawValue:) static method (replaced by inline resolution).
  • Adapted integration tests to use downloadModel + AutoTokenizer.from(directory:) instead of AutoTokenizer.from(pretrained:), which doesn't exist in our fork.

beshkenadze and others added 5 commits March 4, 2026 15:17
…xSpace

MetaspacePreTokenizer.preTokenize() never prepended the replacement
character (▁) when add_prefix_space was absent from the tokenizer
config, even when prepend_scheme was set to "always". This broke
XLM-RoBERTa and any SentencePiece Unigram model relying on Metaspace
with prepend_scheme: "always".

The fix aligns with the canonical Rust implementation
(huggingface/tokenizers PR #1357) where prepend_scheme is the sole
authority:

- init: resolves prependScheme from explicit prepend_scheme first,
  falling back to add_prefix_space for backward compatibility
- preTokenize: uses switch on prependScheme directly, removing the
  addPrefixSpace gate

Cherry-picked from huggingface/swift-transformers#319 (f3d5cbf).
addPrefixSpace property and PrependScheme.from(rawValue:) are no longer
used — prependScheme is now resolved directly in init.
Test downloads kredor/punctuate-all tokenizer and verifies correct
XLM-RoBERTa tokenization when prepend_scheme is used without
addPrefixSpace.

Cherry-picked from huggingface/swift-transformers#321 (67baef8).
Use downloadModel + AutoTokenizer.from(directory:) instead of
AutoTokenizer.from(pretrained:) which doesn't exist in our fork.
@DePasqualeOrg DePasqualeOrg merged commit e786f06 into main Mar 4, 2026
3 checks passed

3 participants