
Fix MetaspacePreTokenizer prepend_scheme + xlm-roberta-base support#11

Merged
DePasqualeOrg merged 5 commits into main from fix-metaspace-prepend-scheme
Mar 4, 2026

Conversation

@DePasqualeOrg
Owner

Cherry-picks upstream PRs #319, #320, #321 with minor adaptations for our fork.

Problem

MetaspacePreTokenizer.preTokenize() gated prepending the replacement character (▁) on the addPrefixSpace config flag. When add_prefix_space was absent from the tokenizer config (defaulting to false), the replacement was never prepended – even when prepend_scheme was set to "always". This broke XLM-RoBERTa and any SentencePiece Unigram model relying on Metaspace with prepend_scheme: "always".

Fix

The fix aligns with the canonical Rust implementation (tokenizers PR #1357), where prepend_scheme is the sole authority:

  • init: resolves prependScheme from explicit prepend_scheme first, falling back to add_prefix_space for backward compatibility (defaulting to .always when both are absent).
  • preTokenize: switches on prependScheme directly, removing the addPrefixSpace gate.
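A minimal sketch of the resolution logic described above, assuming a simplified stand-in for the tokenizer config (the `MetaspaceConfig` struct and field names here are illustrative, not the fork's actual types):

```swift
// Sketch of prepend_scheme resolution: the explicit prepend_scheme value
// wins; add_prefix_space is only consulted as a legacy fallback; and when
// both are absent the scheme defaults to .always.

enum PrependScheme: String {
    case always, never, first
}

// Simplified stand-in for the relevant tokenizer.json fields.
struct MetaspaceConfig {
    var prependScheme: String?   // "prepend_scheme"
    var addPrefixSpace: Bool?    // legacy "add_prefix_space"
}

func resolvePrependScheme(_ config: MetaspaceConfig) -> PrependScheme {
    // prepend_scheme is the sole authority when present.
    if let raw = config.prependScheme, let scheme = PrependScheme(rawValue: raw) {
        return scheme
    }
    // Backward compatibility: fall back to add_prefix_space.
    if let addPrefix = config.addPrefixSpace {
        return addPrefix ? .always : .never
    }
    // Both absent: default to .always.
    return .always
}
```

With this shape, preTokenize can switch on the resolved `PrependScheme` directly, so a config that says `prepend_scheme: "always"` prepends ▁ regardless of whether add_prefix_space was ever set.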

Changes

Cherry-picked from upstream

  • #319 – Bug fix for MetaspacePreTokenizer + 8 unit tests covering all prepend_scheme / add_prefix_space combinations.
  • #320 – Adds "Xlm-RobertaTokenizer" model mapping (case variant needed for FacebookAI/xlm-roberta-base) + integration test.
  • #321 – Integration test for kredor/punctuate-all verifying correct XLM-RoBERTa tokenization with prepend_scheme.

Additional changes

  • Removed dead code left by the bug fix: addPrefixSpace stored property (no longer read after init) and PrependScheme.from(rawValue:) static method (replaced by inline resolution).
  • Adapted integration tests to use downloadModel + AutoTokenizer.from(directory:) instead of AutoTokenizer.from(pretrained:), which doesn't exist in our fork.

beshkenadze and others added 5 commits March 4, 2026 15:17
…xSpace

MetaspacePreTokenizer.preTokenize() never prepended the replacement
character (▁) when add_prefix_space was absent from the tokenizer
config, even when prepend_scheme was set to "always". This broke
XLM-RoBERTa and any SentencePiece Unigram model relying on Metaspace
with prepend_scheme: "always".

The fix aligns with the canonical Rust implementation
(huggingface/tokenizers PR #1357) where prepend_scheme is the sole
authority:

- init: resolves prependScheme from explicit prepend_scheme first,
  falling back to add_prefix_space for backward compatibility
- preTokenize: uses switch on prependScheme directly, removing the
  addPrefixSpace gate

Cherry-picked from huggingface/swift-transformers#319 (f3d5cbf).
addPrefixSpace property and PrependScheme.from(rawValue:) are no longer
used — prependScheme is now resolved directly in init.
Test downloads kredor/punctuate-all tokenizer and verifies correct
XLM-RoBERTa tokenization when prepend_scheme is used without
addPrefixSpace.

Cherry-picked from huggingface/swift-transformers#321 (67baef8).
Use downloadModel + AutoTokenizer.from(directory:) instead of
AutoTokenizer.from(pretrained:) which doesn't exist in our fork.
@DePasqualeOrg DePasqualeOrg merged commit e786f06 into main Mar 4, 2026
3 checks passed

3 participants