Commit 3ceacbe

brody-0125 and claude authored
Add HuggingFace tokenizer.json format support (#17)
* feat: add HuggingFace tokenizer.json format loading support

  Add HuggingFaceTokenizerLoader for parsing HF tokenizer.json files, supporting both BPE (Gemma) and Unigram (Llama) SentencePiece models. TokenizerJsonLoader now auto-detects HF format and dispatches accordingly.

  Closes #10

  https://claude.ai/code/session_01Qt65mqTaXaENRyhCKQYjzJ

* refactor: simplify HuggingFace loader after code review

  - Extract specialContents set into _HfMetadata (avoid duplicate construction)
  - Eliminate double map lookup in BPE merge score resolution
  - Extract shared _buildTrainerSpec/_buildNormalizerSpec helpers
  - Define special token variant constants at file level
  - Remove unnecessary fromMap/_fromJsonMap indirection
  - Unify test fixture builders with shared defaults

  https://claude.ai/code/session_01Qt65mqTaXaENRyhCKQYjzJ

---------

Co-authored-by: Claude <noreply@anthropic.com>
1 parent b2f9d49 commit 3ceacbe

4 files changed

Lines changed: 1008 additions & 0 deletions

lib/dart_sentencepiece_tokenizer.dart

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ export 'src/sentencepiece/sentencepiece_tokenizer.dart'
         SpTruncationConfig,
         SpTruncationDirection,
         ModelType;
+export 'src/sentencepiece/serialization/huggingface_json.dart'
+    show HuggingFaceTokenizerLoader;
 export 'src/sentencepiece/serialization/tokenizer_json.dart'
     show
         SentencePieceTokenizerJson,
