DeepSeek tokenizer produces incorrect results as of v5 (works in v4) #44779

### System Info

- `transformers` version: 5.3.0
- Platform: Linux-6.6.113+-x86_64-with-glibc2.35
- Python version: 3.12.12
- Huggingface_hub version: 1.6.0
- Safetensors version: 0.7.0
- Accelerate version: 1.13.0
- Accelerate config: 	not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.10.0+cpu (NA)
- Using distributed or parallel set-up in script?: <fill in>

### Who can help?

@ArthurZucker and @itazap

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

```py
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-R1')

text = "How are you doing?"
print(tokenizer.encode(text))
print(tokenizer.tokenize(text))
print(tokenizer.decode(tokenizer.encode(text)))
```

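On `transformers` 5.3.0 this produces the output below. The leading BOS token id is missing, the words are tokenized without their `Ġ` space prefix, and the decoded string has lost its spaces: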
```
[4117, 591, 12829, 62552, 33]
['How', 'are', 'you', 'doing', '?']
Howareyoudoing?
```

### Expected behavior

After downgrading to `transformers==4.57.6`, the same code produces the expected output:

```
[0, 4117, 477, 440, 4843, 33]
['How', 'Ġare', 'Ġyou', 'Ġdoing', '?']
<｜begin▁of▁sentence｜>How are you doing?
```
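To make the regression easy to check across versions, the comparison above can be reduced to a round-trip test. This is only a minimal sketch: `roundtrip_report` is a hypothetical helper name, not part of `transformers`, and it assumes the tokenizer accepts `add_special_tokens` on `encode` and `skip_special_tokens` on `decode`, so that the BOS token does not affect the comparison.

```python
def roundtrip_report(tokenizer, text):
    """Encode and decode `text`, reporting whether it survives unchanged.

    `roundtrip_report` is a hypothetical helper for this issue, not a
    transformers API. Special tokens are excluded on both sides so that
    only the handling of ordinary text (e.g. spaces) is compared.
    """
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return {"ids": ids, "decoded": decoded, "ok": decoded == text}
```

Calling `roundtrip_report(tokenizer, text)` with the `tokenizer` and `text` from the reproduction should give `ok` as `False` whenever the decoded string loses its spaces, as in the 5.3.0 output above.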
