Open
Description
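I am loading qwen3_embedding_0.6b with SentenceTransformer.
The tokenizer reports eos_token "<|im_end|>" with eos_token_id 151645, but the token ID appended to the end of every tokenized input is 151643, which is the pad_token "<|endoftext|>".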
sentence-transformers: 5.1.1
The model was downloaded from ModelScope and has not been modified.
Code:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("./qwen3_embedding_0.6b", tokenizer_kwargs={"padding_side": "left"})
tokenizer = model.tokenizer
text = ["!", "!#012", "<|im_end|>"]
text_tokens = tokenizer(text)
print(f"tokenizer.eos_token: {tokenizer.eos_token}")
print(f"tokenizer.eos_token_id: {tokenizer.eos_token_id}")
print(f"tokenizer.pad_token: {tokenizer.pad_token}")
print(f"tokenizer.pad_token_id: {tokenizer.pad_token_id}")
print(f"text_tokens.input_ids: {text_tokens.input_ids}")
print(f"text_tokens.attention_mask: {text_tokens.attention_mask}")
Output:
tokenizer.eos_token: <|im_end|>
tokenizer.eos_token_id: 151645
tokenizer.pad_token: <|endoftext|>
tokenizer.pad_token_id: 151643
text_tokens.input_ids: [[0, 151643], [0, 2, 15, 16, 17, 151643], [151645, 151643]]
text_tokens.attention_mask: [[1, 1], [1, 1, 1, 1, 1, 1], [1, 1]]
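To make the mismatch explicit, here is a minimal sketch that checks the last ID of each sequence against both special-token IDs. It uses only the values copied from the output above, so it runs without the model; the helper name `trailing_ids` is made up for illustration.

```python
# IDs copied from the printed output above.
EOS_TOKEN_ID = 151645  # <|im_end|>
PAD_TOKEN_ID = 151643  # <|endoftext|>
input_ids = [[0, 151643], [0, 2, 15, 16, 17, 151643], [151645, 151643]]


def trailing_ids(batch):
    """Return the last token ID of every sequence in the batch."""
    return [seq[-1] for seq in batch]


last_ids = trailing_ids(input_ids)
print(f"last ids: {last_ids}")
print(f"all end with eos_token_id: {all(i == EOS_TOKEN_ID for i in last_ids)}")
print(f"all end with pad_token_id: {all(i == PAD_TOKEN_ID for i in last_ids)}")
```

Note that the third input, "<|im_end|>", is itself encoded as 151645, so the eos_token mapping is intact. Only the automatically appended token differs.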
"./qwen3_embedding_0.6b/tokenizer_config.json" sets "eos_token": "<|im_end|>", so I expected 151645 to be the appended token rather than 151643.
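One place that might be worth checking, as an assumption and not something confirmed in this report, is whether the appended token comes from a post-processor template in tokenizer.json and not from the eos_token setting. The sketch below lists the special tokens named in any TemplateProcessing entry. `template_special_tokens` and `compare` are hypothetical helpers, and the `example` dict is illustrative only, not copied from the real file.

```python
import json
from pathlib import Path


def template_special_tokens(post_processor):
    """Collect special tokens named in any TemplateProcessing 'single' template."""
    if not post_processor:
        return []
    if post_processor.get("type") == "Sequence":
        # A Sequence post-processor wraps a list of child processors.
        found = []
        for child in post_processor.get("processors", []):
            found.extend(template_special_tokens(child))
        return found
    if post_processor.get("type") == "TemplateProcessing":
        return [
            piece["SpecialToken"]["id"]
            for piece in post_processor.get("single", [])
            if "SpecialToken" in piece
        ]
    return []


def compare(model_dir):
    """Hypothetical helper: print the configured eos_token next to the template tokens."""
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "tokenizer_config.json").read_text(encoding="utf-8"))
    tok = json.loads((model_dir / "tokenizer.json").read_text(encoding="utf-8"))
    print("configured eos_token:", config.get("eos_token"))
    print("template special tokens:", template_special_tokens(tok.get("post_processor")))


# Illustrative structure only; the real tokenizer.json may look different.
example = {
    "type": "Sequence",
    "processors": [
        {"type": "ByteLevel"},
        {
            "type": "TemplateProcessing",
            "single": [
                {"Sequence": {"id": "A", "type_id": 0}},
                {"SpecialToken": {"id": "<|endoftext|>", "type_id": 0}},
            ],
        },
    ],
}
print(template_special_tokens(example))
```

Calling `compare("./qwen3_embedding_0.6b")` on the local model directory would show whether the two settings disagree. If the template list is empty, the appended token comes from somewhere else.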
Thank you very much.