fix(auto): Map deepseek_v2 and deepseek_v3 to LlamaTokenizer by BillionClaw · Pull Request #44783 · huggingface/transformers

BillionClaw · 2026-03-17T05:58:54Z

This PR fixes the DeepSeek tokenizer issue where spaces were lost during decoding in Transformers v5.

Problem

DeepSeek V2 and V3 models use SentencePiece tokenization (like Llama) but were falling back to the generic TokenizersBackend in v5. This caused incorrect decoding where spaces were lost. For example, encoding and decoding "How are you doing?" would produce "Howareyoudoing?".
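The reported symptom amounts to a failed encode/decode round trip. A minimal, library-independent sketch of such a check follows; `roundtrip_ok` and the toy codecs are hypothetical names standing in for a real tokenizer's encode and decode steps, not part of Transformers.

```python
from typing import Callable, List


def roundtrip_ok(
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    text: str,
) -> bool:
    """Return True when decode(encode(text)) reproduces text exactly."""
    return decode(encode(text)) == text


# Toy codecs (hypothetical): split on spaces, then rejoin with or without them.
def toy_encode(text: str) -> List[str]:
    return text.split(" ")


def good_decode(tokens: List[str]) -> str:
    # Restores the word boundaries that toy_encode removed.
    return " ".join(tokens)


def lossy_decode(tokens: List[str]) -> str:
    # Mimics the reported symptom: word boundaries are dropped.
    return "".join(tokens)


sample = "How are you doing?"
print(roundtrip_ok(toy_encode, good_decode, sample))   # True
print(roundtrip_ok(toy_encode, lossy_decode, sample))  # False
print(lossy_decode(toy_encode(sample)))                # Howareyoudoing?
```

With a real tokenizer, the same check would compare the decoded string against the input text after encoding it to ids.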

Root Cause

In Transformers v5, the tokenizer mapping system changed. Models not explicitly mapped in TOKENIZER_MAPPING_NAMES fall back to TokenizersBackend. DeepSeek models need the special SentencePiece handling provided by LlamaTokenizer (Metaspace pretokenizer with ▁ replacement and proper decoder chain).
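To make the Metaspace idea concrete, here is a toy pure-Python sketch of the ▁ replacement and its matching decoder, as this description characterizes them. It is an illustration only, not the `tokenizers` implementation; all function names are hypothetical, and the real pre-tokenizer has options that are not modeled here.

```python
from typing import List

MARKER = "\u2581"  # the "▁" character used to mark word boundaries


def metaspace_pretokenize(text: str) -> List[str]:
    """Replace each space with the marker and split so every piece starts with it."""
    marked = MARKER + text.replace(" ", MARKER)
    # Split on the marker, then re-attach it to the front of each piece.
    return [MARKER + piece for piece in marked.split(MARKER)[1:]]


def metaspace_decode(pieces: List[str]) -> str:
    """Join pieces, map markers back to spaces, and drop the added leading space."""
    text = "".join(pieces).replace(MARKER, " ")
    return text[1:] if text.startswith(" ") else text


def decode_without_replacement(pieces: List[str]) -> str:
    """A mismatched decoder that strips the marker instead of restoring spaces."""
    return "".join(pieces).replace(MARKER, "")


pieces = metaspace_pretokenize("How are you doing?")
print(pieces)                              # ['▁How', '▁are', '▁you', '▁doing?']
print(metaspace_decode(pieces))            # How are you doing?
print(decode_without_replacement(pieces))  # Howareyoudoing?
```

The last line shows how a pre-tokenizer and decoder that disagree about the marker can produce the space-free output described above.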

Fix

Explicitly map deepseek_v2 and deepseek_v3 model types to LlamaTokenizer in TOKENIZER_MAPPING_NAMES.
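The lookup-with-fallback behavior described here can be sketched with a plain dictionary. This is a simplified stand-in, assuming a mapping from model type to a tokenizer class name; `TOY_TOKENIZER_MAPPING_NAMES` and `resolve_tokenizer_name` are hypothetical, and the real `TOKENIZER_MAPPING_NAMES` in Transformers may have a different shape.

```python
from collections import OrderedDict

# Illustrative stand-in for the mapping; the entry and value shape are assumptions.
TOY_TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        ("llama", "LlamaTokenizer"),
    ]
)

FALLBACK = "TokenizersBackend"  # generic backend named in this description


def resolve_tokenizer_name(model_type: str, mapping=TOY_TOKENIZER_MAPPING_NAMES) -> str:
    """Return the mapped tokenizer class name, or the generic fallback."""
    return mapping.get(model_type, FALLBACK)


# Before the change: unmapped model types fall through to the generic backend.
before = resolve_tokenizer_name("deepseek_v3")

# The proposed change amounts to adding two entries of this kind.
TOY_TOKENIZER_MAPPING_NAMES["deepseek_v2"] = "LlamaTokenizer"
TOY_TOKENIZER_MAPPING_NAMES["deepseek_v3"] = "LlamaTokenizer"

after = resolve_tokenizer_name("deepseek_v3")
print(before, "->", after)  # TokenizersBackend -> LlamaTokenizer
```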

Verification

Verified that deepseek_v2 and deepseek_v3 now resolve to LlamaTokenizer
Ran existing tokenizer auto tests - all pass
The fix is minimal (2 lines) and follows the existing pattern for similar models

Fixes #44779

Tagging: @ArthurZucker @itazap (tokenizers)

DeepSeek V2 and V3 models use SentencePiece tokenization (like Llama) but were falling back to the generic TokenizersBackend in v5. This caused incorrect decoding where spaces were lost (e.g., "How are you?" becoming "Howareyou?"). This fix explicitly maps deepseek_v2 and deepseek_v3 model types to LlamaTokenizer, which properly handles the SentencePiece-style tokenization with Metaspace and the correct decoder chain. Fixes huggingface#44779

github-actions · 2026-03-17T06:00:02Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

github-actions · 2026-03-17T06:13:00Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44783&sha=4af4cd

ArthurZucker

This is just wrong; see this comment: #44779 (comment)

ArthurZucker reviewed Mar 17, 2026

BillionClaw wants to merge 1 commit into huggingface:main from BillionClaw:clawoss/fix/deepseek-tokenizer-mapping-44779
