Bug
In src/examples/huggingface/convert_checkpoint_to_hf.py (the non-hybrid converter), the -t/--tokenizer CLI flag is ignored when a sibling tokenizer/ directory exists next to the checkpoint.
Lines 143-154:
tokenizer_path = Path(original_checkpoint_path).parent / "tokenizer"
huggingface_tokenizer = None
if tokenizer_path.exists():
    log.info(f"Saving preexisting tokenizer from {tokenizer_path}")
    huggingface_tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    huggingface_tokenizer.save_pretrained(output_path)
    ...
else:
    tokenizer_id = tokenizer_id or tokenizer_config.identifier
    ...
The tokenizer_id (from -t) is only consulted in the else branch, so if the checkpoint was saved with a tokenizer directory (common for SFT checkpoints), the explicit -t override is silently ignored.
Expected behavior: -t should take precedence over the sibling tokenizer/ directory when explicitly provided.
Note: The hybrid converter (convert_checkpoint_to_hf_hybrid.py) does not have this issue — it always respects -t.
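One way the expected precedence could look is sketched below. This is not the converter's actual code: `resolve_tokenizer_source` and its return labels are hypothetical names introduced only for illustration, and the sketch assumes that `tokenizer_id` is `None` when -t is not passed. It only decides which tokenizer source to use and leaves loading and saving to the caller.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_tokenizer_source(
    original_checkpoint_path: str,
    tokenizer_id: Optional[str],
    config_identifier: Optional[str],
) -> Tuple[str, str]:
    """Hypothetical helper: pick the tokenizer source in priority order.

    Returns (kind, source), where kind is "explicit", "sibling_dir", or "config".
    """
    # 1. An explicitly provided -t/--tokenizer value takes precedence.
    if tokenizer_id:
        return ("explicit", tokenizer_id)

    # 2. Otherwise fall back to a tokenizer/ directory next to the checkpoint.
    tokenizer_path = Path(original_checkpoint_path).parent / "tokenizer"
    if tokenizer_path.exists():
        return ("sibling_dir", str(tokenizer_path))

    # 3. Finally, use the identifier recorded in the tokenizer config.
    if config_identifier:
        return ("config", config_identifier)

    raise ValueError("No tokenizer source found: pass -t or provide a tokenizer/ directory")
```

The returned source is either a model identifier or a local path, and `AutoTokenizer.from_pretrained` accepts both, so the existing load-and-save calls could stay unchanged apart from the argument they receive.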