Prepare v1.3.1 release: CHANGELOG, README, and pubspec updates (#18)

brody-0125 · claude · web-flow · commit 147bd0b24d40 · 2026-04-04T01:09:19.000+09:00
Add a CHANGELOG entry for HuggingFace tokenizer.json format support (PR #17), update the README with a new loading section and API reference, and bump the version to 1.3.1 for the pub.dev release.

https://claude.ai/code/session_01WuXLBfrYomYDizonffKcz5

Co-authored-by: Claude <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,23 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.3.1] - 2026-04-03
+
+### Added
+
+- **HuggingFace `tokenizer.json` Format Support**
+  - `HuggingFaceTokenizerLoader` class for loading HuggingFace `tokenizer.json` files directly
+    - `fromJsonString()` / `fromMap()` - Parse from JSON string or pre-parsed map
+    - `fromJsonFile()` / `fromJsonFileSync()` - Load from file (async/sync)
+  - Supports both **Unigram** (Llama) and **BPE** (Gemma) model types
+  - Automatic detection of special tokens (unk, bos, eos, pad) from `added_tokens` section
+  - Normalizer settings inference (`addDummyPrefix`, `escapeWhitespaces`) from HuggingFace normalizer config
+  - Post-processor configuration parsing (`addBosToken`, `addEosToken`) from `TemplateProcessing`
+  - Byte fallback detection from decoder configuration
+  - Added tokens handling beyond base vocabulary
+  - `TokenizerJsonLoader.isHuggingFaceFormat()` - Helper to detect HuggingFace format
+  - Auto-detection in `TokenizerJsonLoader` - Automatically delegates to `HuggingFaceTokenizerLoader` when HuggingFace format is detected
+
 ## [1.3.0] - 2026-02-02
 
 ### Added
diff --git a/README.md b/README.md
@@ -16,13 +16,14 @@ A lightweight, pure Dart implementation of SentencePiece tokenizer. Supports BPE
 - **Batch Processing** - Sequential and parallel (Isolate-based) batch encoding
 - **Streaming API** - HuggingFace TextStreamer compatible for real-time LLM output
 - **HuggingFace Compatible** - JSON serialization, dynamic token addition, tokenize() API
-- **Well Tested** - 274 tests with 100% pass rate
+- **HuggingFace tokenizer.json** - Load tokenizers directly from HuggingFace `tokenizer.json` format
+- **Well Tested** - 274+ tests with 100% pass rate
 
 ## Installation
 
 ```yaml
 dependencies:
-  dart_sentencepiece_tokenizer: ^1.3.0
+  dart_sentencepiece_tokenizer: ^1.3.1
 ```
 
 ## Quick Start
@@ -242,6 +243,42 @@ final loadedSync = TokenizerJsonLoader.fromJsonFileSync('tokenizer.json');
 final fromString = TokenizerJsonLoader.fromJsonString(jsonString);
 ```
 
+### HuggingFace tokenizer.json Loading (v1.3.1+)
+
+Load tokenizers directly from HuggingFace `tokenizer.json` files. This covers models such as Gemma and Llama that distribute their tokenizers in this format.
+
+```dart
+// Auto-detection via TokenizerJsonLoader (recommended)
+final tokenizer = await TokenizerJsonLoader.fromJsonFile('tokenizer.json');
+
+// Or use HuggingFaceTokenizerLoader directly
+final hfLoaded = await HuggingFaceTokenizerLoader.fromJsonFile(
+  'tokenizer.json',
+);
+
+// From JSON string
+final hfFromString = HuggingFaceTokenizerLoader.fromJsonString(jsonString);
+
+// With custom config override
+final hfWithConfig = HuggingFaceTokenizerLoader.fromJsonString(
+  jsonString,
+  config: SentencePieceConfig.gemma,
+);
+
+// Check format before loading (jsonDecode requires `import 'dart:convert';`)
+final data = jsonDecode(jsonString) as Map<String, dynamic>;
+if (TokenizerJsonLoader.isHuggingFaceFormat(data)) {
+  final hfFromMap = HuggingFaceTokenizerLoader.fromMap(data);
+}
+```
+
+**Supported features:**
+- Unigram and BPE model types
+- Special token detection (unk, bos, eos, pad)
+- Normalizer and post-processor inference
+- Byte fallback from decoder configuration
+- Added tokens beyond base vocabulary
+
 ### Decoding
 
 ```dart
@@ -403,9 +440,19 @@ final customTokenizer = SentencePieceTokenizer.fromModelFileSync(
 
 | Method | Description |
 |--------|-------------|
-| `fromJsonString(json)` | Load from JSON string |
-| `fromJsonFile(path)` | Load from JSON file (async) |
-| `fromJsonFileSync(path)` | Load from JSON file (sync) |
+| `fromJsonString(json)` | Load from JSON string (auto-detects format) |
+| `fromJsonFile(path)` | Load from JSON file (async, auto-detects format) |
+| `fromJsonFileSync(path)` | Load from JSON file (sync, auto-detects format) |
+| `isHuggingFaceFormat(data)` | Check whether a parsed JSON map is in HuggingFace format |
+
+### HuggingFaceTokenizerLoader
+
+| Method | Description |
+|--------|-------------|
+| `fromJsonString(json)` | Load from HuggingFace JSON string |
+| `fromMap(data)` | Load from pre-parsed JSON map |
+| `fromJsonFile(path)` | Load from file (async) |
+| `fromJsonFileSync(path)` | Load from file (sync) |
 
 ### Encoding
 
@@ -450,7 +497,9 @@ Download SentencePiece models from HuggingFace:
 - [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/tokenizer.model)
 - [Gemma](https://huggingface.co/google/gemma-7b/resolve/main/tokenizer.model)
 
-Format: Binary protobuf (.model files from SentencePiece C++ library).
+**Supported formats:**
+- Binary protobuf (`.model` files from SentencePiece C++ library)
+- HuggingFace `tokenizer.json` (auto-detected, v1.3.1+)
 
 ## Testing
 
diff --git a/pubspec.yaml b/pubspec.yaml
@@ -1,6 +1,6 @@
 name: dart_sentencepiece_tokenizer
 description: A lightweight, pure Dart implementation of SentencePiece tokenizer. Supports BPE (Gemma) and Unigram (Llama) algorithms.
-version: 1.3.0
+version: 1.3.1
 repository: https://github.com/brody-0125/dart_sentencepiece_tokenizer
 issue_tracker: https://github.com/brody-0125/dart_sentencepiece_tokenizer/issues
 topics: