System Info
- transformers version: 5.3.0
- Platform: Linux-6.6.113+-x86_64-with-glibc2.35
- Python version: 3.12.12
- Huggingface_hub version: 1.6.0
- Safetensors version: 0.7.0
- Accelerate version: 1.13.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.10.0+cpu (NA)
- Using distributed or parallel set-up in script?:
Who can help?
@ArthurZucker and @itazap
Reproduction
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-R1')
text = "How are you doing?"
print(tokenizer.encode(text))
print(tokenizer.tokenize(text))
print(tokenizer.decode(tokenizer.encode(text)))
produces
[4117, 591, 12829, 62552, 33]
['How', 'are', 'you', 'doing', '?']
Howareyoudoing?
Expected behavior
After downgrading to transformers==4.57.6, running the same code as above produces
[0, 4117, 477, 440, 4843, 33]
['How', 'Ġare', 'Ġyou', 'Ġdoing', '?']
<|begin▁of▁sentence|>How are you doing?
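For context on the 'Ġ' markers in the expected output: DeepSeek-R1 uses a byte-level BPE tokenizer, where 'Ġ' (U+0120) is the byte-to-unicode image of a leading space (byte 0x20). The sketch below reimplements the GPT-2 style byte-to-unicode table to illustrate this convention (it is an illustration of the mapping, not the transformers code path that regressed):

```python
# Sketch of the GPT-2 style byte-to-unicode mapping used by byte-level BPE
# tokenizers. Every byte must map to a visible character; bytes that are
# already printable keep their character, the rest are shifted up by 256.
def bytes_to_unicode():
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable byte: assign the next code point above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()
# The space byte (0x20 = 32) is non-printable, so it is the 33rd shifted
# byte and maps to chr(256 + 32) = U+0120 = 'Ġ'. Tokens like 'Ġare' thus
# encode a word-initial space; dropping them (as in the buggy output
# above) loses the spaces between words on decode.
print(mapping[ord(" ")])
```

This is why the 4.57.6 tokenization shows 'Ġare', 'Ġyou', 'Ġdoing' while the 5.3.0 output has bare 'are', 'you', 'doing' and decodes without spaces.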