Commit 3ceacbe
Add HuggingFace tokenizer.json format support (#17)
* feat: add HuggingFace tokenizer.json format loading support
Add HuggingFaceTokenizerLoader for parsing HF tokenizer.json files,
supporting both BPE (Gemma) and Unigram (Llama) SentencePiece models.
TokenizerJsonLoader now auto-detects HF format and dispatches accordingly.
Closes #10
https://claude.ai/code/session_01Qt65mqTaXaENRyhCKQYjzJ
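Illustrative note (not part of the change itself): the auto-detect-and-dispatch behaviour described above could look roughly like the sketch below. It is written in Python for brevity and is not the project's code. It assumes only the publicly documented tokenizer.json layout, where a top-level `model` object carries a `type` of `BPE` (vocab as a token-to-id map plus `merges`) or `Unigram` (vocab as a list of `[piece, score]` pairs). The function names and the `"native"` fallback label are hypothetical.

```python
import json


def detect_hf_model_type(doc):
    """Return "BPE" or "Unigram" if doc looks like an HF tokenizer.json, else None."""
    model = doc.get("model") if isinstance(doc, dict) else None
    if not isinstance(model, dict):
        return None
    model_type = model.get("type")
    vocab = model.get("vocab")
    # BPE: vocab is a token -> id map and a merges list is present.
    if model_type == "BPE" and isinstance(vocab, dict) and "merges" in model:
        return "BPE"
    # Unigram: vocab is a list of [piece, score] pairs.
    if model_type == "Unigram" and isinstance(vocab, list):
        return "Unigram"
    return None


def load_tokenizer_json(text):
    """Hypothetical dispatcher: returns (format label, vocab size or None)."""
    doc = json.loads(text)
    kind = detect_hf_model_type(doc)
    if kind == "BPE":
        return ("hf-bpe", len(doc["model"]["vocab"]))
    if kind == "Unigram":
        return ("hf-unigram", len(doc["model"]["vocab"]))
    # Anything else is left to the pre-existing (non-HF) loading path.
    return ("native", None)
```

Checking the shape of `vocab` in addition to `type` keeps a file that merely happens to contain a `model` key from being routed to the HF path by mistake.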
* refactor: simplify HuggingFace loader after code review
- Extract specialContents set into _HfMetadata (avoid duplicate construction)
- Eliminate double map lookup in BPE merge score resolution
- Extract shared _buildTrainerSpec/_buildNormalizerSpec helpers
- Define special token variant constants at file level
- Remove unnecessary fromMap/_fromJsonMap indirection
- Unify test fixture builders with shared defaults
https://claude.ai/code/session_01Qt65mqTaXaENRyhCKQYjzJ
---------
Co-authored-by: Claude <noreply@anthropic.com>
1 parent b2f9d49 commit 3ceacbe
4 files changed
Lines changed: 1008 additions & 0 deletions
File tree
- lib
- src/sentencepiece/serialization
- test