fix(auto): Map deepseek_v2 and deepseek_v3 to LlamaTokenizer #44783

Open
BillionClaw wants to merge 1 commit into huggingface:main from
BillionClaw:clawoss/fix/deepseek-tokenizer-mapping-44779


@BillionClaw

This PR fixes the DeepSeek tokenizer issue where spaces were lost during decoding in Transformers v5.

Problem

DeepSeek V2 and V3 models use SentencePiece tokenization (like Llama) but were falling back to the generic TokenizersBackend in v5. This caused incorrect decoding where spaces were lost. For example, encoding and decoding "How are you doing?" would produce "Howareyoudoing?".
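The symptom can be phrased as a simple round-trip check. The sketch below is tokenizer-agnostic and does not load a DeepSeek checkpoint: `roundtrip` and both toy tokenizers are hypothetical helpers written for illustration, and the "lossy" one only imitates the reported behaviour rather than reproducing what TokenizersBackend actually does.

```python
from typing import Callable, List


def roundtrip(encode: Callable[[str], List], decode: Callable[[List], str], text: str) -> str:
    """Encode `text`, decode the result, and return the decoded string."""
    return decode(encode(text))


# Toy tokenizer imitating the reported bug: whitespace is only used to split
# the input and is never stored in the tokens, so decoding cannot restore it.
def lossy_encode(text: str) -> List[str]:
    return text.split()


def lossy_decode(tokens: List[str]) -> str:
    return "".join(tokens)


# Toy character-level tokenizer that keeps every space as its own token.
def char_encode(text: str) -> List[str]:
    return list(text)


def char_decode(tokens: List[str]) -> str:
    return "".join(tokens)


text = "How are you doing?"
lossy_result = roundtrip(lossy_encode, lossy_decode, text)
kept_result = roundtrip(char_encode, char_decode, text)

print(repr(lossy_result))  # spaces are gone
print(repr(kept_result))   # identical to the input
```

With a real tokenizer object, `encode` and `decode` would presumably be its own `encode` and `decode` methods, with special tokens skipped on decode so that a BOS token does not break the comparison.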

Root Cause

In Transformers v5, the tokenizer mapping system changed. Models not explicitly mapped in TOKENIZER_MAPPING_NAMES fall back to TokenizersBackend. DeepSeek models need the special SentencePiece handling provided by LlamaTokenizer (Metaspace pretokenizer with ▁ replacement and proper decoder chain).
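To make the Metaspace idea concrete, here is a pure-Python toy model of it. This is not the `tokenizers` library implementation: the function names are made up for illustration, the sketch ignores the optional prefix marker that real SentencePiece-style tokenizers may prepend, and `whitespace_pretokenize` is just one assumed way in which spaces could end up being dropped.

```python
from typing import List

META = "\u2581"  # the "▁" marker used by SentencePiece-style tokenizers


def metaspace_pretokenize(text: str) -> List[str]:
    """Toy Metaspace: mark each space with META, then split so that every
    piece after the first starts with its marker."""
    marked = text.replace(" ", META)
    pieces: List[str] = []
    current = ""
    for ch in marked:
        if ch == META and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces


def metaspace_decode(pieces: List[str]) -> str:
    """Toy decoder: concatenate the pieces and turn META back into spaces."""
    return "".join(pieces).replace(META, " ")


def whitespace_pretokenize(text: str) -> List[str]:
    """Contrast case (assumption): split on whitespace and discard it."""
    return text.split()


text = "How are you doing?"
pieces = metaspace_pretokenize(text)
restored = metaspace_decode(pieces)
dropped = "".join(whitespace_pretokenize(text))

print(pieces)
print(repr(restored))
print(repr(dropped))
```

Because the space travels inside the pieces as `▁`, the decoder has enough information to restore it. In the contrast case the space never reaches the token sequence, so no decoder could bring it back.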

Fix

Explicitly map deepseek_v2 and deepseek_v3 model types to LlamaTokenizer in TOKENIZER_MAPPING_NAMES.
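The shape of the change, as this PR describes it, can be sketched with a toy lookup table. Everything below is a stand-in: `TOY_TOKENIZER_MAPPING_NAMES` and `resolve_tokenizer_name` are hypothetical, the real table in Transformers may store its values differently, and the fallback behaviour is modelled only from the description above.

```python
from collections import OrderedDict

FALLBACK = "TokenizersBackend"  # generic backend named in the PR description

# Toy stand-in for the mapping table (assumed shape: model type -> class name).
TOY_TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        ("llama", "LlamaTokenizer"),
    ]
)


def resolve_tokenizer_name(model_type: str) -> str:
    """Return the mapped tokenizer class name, or the generic fallback."""
    return TOY_TOKENIZER_MAPPING_NAMES.get(model_type, FALLBACK)


# Before the change: unmapped model types fall through to the fallback.
before = {mt: resolve_tokenizer_name(mt) for mt in ("deepseek_v2", "deepseek_v3")}

# The two-line change proposed here, applied to the toy table.
TOY_TOKENIZER_MAPPING_NAMES["deepseek_v2"] = "LlamaTokenizer"
TOY_TOKENIZER_MAPPING_NAMES["deepseek_v3"] = "LlamaTokenizer"

after = {mt: resolve_tokenizer_name(mt) for mt in ("deepseek_v2", "deepseek_v3")}

print(before)
print(after)
```

The sketch only shows how an explicit entry overrides the fallback. It says nothing about whether LlamaTokenizer is the right target class for these checkpoints.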

Verification

  • Verified that deepseek_v2 and deepseek_v3 now resolve to LlamaTokenizer
  • Ran existing tokenizer auto tests - all pass
  • The fix is minimal (2 lines) and follows the existing pattern for similar models

Fixes #44779

Tagging: @ArthurZucker @itazap (tokenizers)

DeepSeek V2 and V3 models use SentencePiece tokenization (like Llama)
but were falling back to the generic TokenizersBackend in v5. This
caused incorrect decoding where spaces were lost (e.g., "How are you?"
becoming "Howareyou?").

This fix explicitly maps deepseek_v2 and deepseek_v3 model types to
LlamaTokenizer, which properly handles the SentencePiece-style
tokenization with Metaspace and the correct decoder chain.

Fixes huggingface#44779
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44783&sha=4af4cd

Collaborator

@ArthurZucker ArthurZucker left a comment

This is just wrong; see this comment: #44779 (comment)

Development

Successfully merging this pull request may close these issues.

Deepseek tokenizer produces incorrect results as of v5 (works in v4)

2 participants