Commit 147bd0b

brody-0125 and claude authored
Prepare v1.3.1 release: CHANGELOG, README, and pubspec updates (#18)
Add CHANGELOG entry for HuggingFace tokenizer.json format support (PR #17), update README with new loading section and API reference, and bump version to 1.3.1 for pub.dev release.

https://claude.ai/code/session_01WuXLBfrYomYDizonffKcz5

Co-authored-by: Claude <noreply@anthropic.com>
1 parent 3ceacbe commit 147bd0b

3 files changed

Lines changed: 73 additions & 7 deletions

CHANGELOG.md

Lines changed: 17 additions & 0 deletions
@@ -5,6 +5,23 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.3.1] - 2026-04-03
+
+### Added
+
+- **HuggingFace `tokenizer.json` Format Support**
+- `HuggingFaceTokenizerLoader` class for loading HuggingFace tokenizer.json files directly
+- `fromJsonString()` / `fromMap()` - Parse from JSON string or pre-parsed map
+- `fromJsonFile()` / `fromJsonFileSync()` - Load from file (async/sync)
+- Supports both **Unigram** (Llama) and **BPE** (Gemma) model types
+- Automatic detection of special tokens (unk, bos, eos, pad) from `added_tokens` section
+- Normalizer settings inference (addDummyPrefix, escapeWhitespaces) from HuggingFace normalizer config
+- Post-processor configuration parsing (addBosToken, addEosToken) from TemplateProcessing
+- Byte fallback detection from decoder configuration
+- Added tokens handling beyond base vocabulary
+- `TokenizerJsonLoader.isHuggingFaceFormat()` - Helper to detect HuggingFace format
+- Auto-detection in `TokenizerJsonLoader` - Automatically delegates to `HuggingFaceTokenizerLoader` when HuggingFace format is detected
+
 ## [1.3.0] - 2026-02-02
 
 ### Added
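Reviewer note: the 1.3.1 changelog entry describes format auto-detection and special-token detection from `added_tokens`, but this commit only touches documentation. As a rough illustration of what those two steps might involve, here is a minimal, self-contained sketch. It does not use the package and is not its implementation. The function names `looksLikeHuggingFaceFormat` and `detectSpecialTokens`, the `roleByContent` table, and the sample JSON are all hypothetical. The sketch assumes the usual `tokenizer.json` layout: a top-level `model` object with a `type` string and a `vocab`, plus an `added_tokens` list whose entries carry `id`, `content` and `special`.

```dart
import 'dart:convert';

// Hypothetical check (an assumption, not the package's logic): treat a map as
// HuggingFace format when it has a `model` object with a string `type` and a
// `vocab` that is either a map (BPE) or a list of [piece, score] pairs (Unigram).
bool looksLikeHuggingFaceFormat(Map<String, dynamic> data) {
  final model = data['model'];
  if (model is! Map) return false;
  final vocab = model['vocab'];
  return model['type'] is String && (vocab is Map || vocab is List);
}

// Hypothetical table mapping common token strings to roles.
const roleByContent = <String, String>{
  '<unk>': 'unk',
  '<s>': 'bos',
  '<bos>': 'bos',
  '</s>': 'eos',
  '<eos>': 'eos',
  '<pad>': 'pad',
};

// Collects role -> id for entries in `added_tokens` that are flagged special
// and whose content matches the table above. The first match per role wins.
Map<String, int> detectSpecialTokens(Map<String, dynamic> data) {
  final added = data['added_tokens'];
  if (added is! List) return {};
  final found = <String, int>{};
  for (final entry in added) {
    if (entry is! Map || entry['special'] != true) continue;
    final content = entry['content'];
    final id = entry['id'];
    if (content is! String || id is! int) continue;
    final role = roleByContent[content];
    if (role != null) found.putIfAbsent(role, () => id);
  }
  return found;
}

void main() {
  // Illustrative input only; real tokenizer.json files are much larger.
  const json = '''
{
  "added_tokens": [
    {"id": 0, "content": "<pad>", "special": true},
    {"id": 1, "content": "<eos>", "special": true},
    {"id": 2, "content": "<bos>", "special": true},
    {"id": 3, "content": "<unk>", "special": true},
    {"id": 4, "content": "<custom>", "special": false}
  ],
  "model": {"type": "BPE", "vocab": {"<pad>": 0}, "merges": []}
}
''';
  final data = jsonDecode(json) as Map<String, dynamic>;
  print(looksLikeHuggingFaceFormat(data)); // true
  print(detectSpecialTokens(data)); // {pad: 0, eos: 1, bos: 2, unk: 3}
  print(looksLikeHuggingFaceFormat({'pieces': []})); // false
}
```

Matching on token content is only one possible heuristic. The actual loader may rely on other signals in the file, so this sketch should not be read as a description of `TokenizerJsonLoader.isHuggingFaceFormat()`.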

README.md

Lines changed: 55 additions & 6 deletions
@@ -16,13 +16,14 @@ A lightweight, pure Dart implementation of SentencePiece tokenizer. Supports BPE
 - **Batch Processing** - Sequential and parallel (Isolate-based) batch encoding
 - **Streaming API** - HuggingFace TextStreamer compatible for real-time LLM output
 - **HuggingFace Compatible** - JSON serialization, dynamic token addition, tokenize() API
-- **Well Tested** - 274 tests with 100% pass rate
+- **HuggingFace tokenizer.json** - Load tokenizers directly from HuggingFace `tokenizer.json` format
+- **Well Tested** - 274+ tests with 100% pass rate
 
 ## Installation
 
 ```yaml
 dependencies:
-  dart_sentencepiece_tokenizer: ^1.3.0
+  dart_sentencepiece_tokenizer: ^1.3.1
 ```
 
 ## Quick Start
@@ -242,6 +243,42 @@ final loadedSync = TokenizerJsonLoader.fromJsonFileSync('tokenizer.json');
 final fromString = TokenizerJsonLoader.fromJsonString(jsonString);
 ```
 
+### HuggingFace tokenizer.json Loading (v1.3.1+)
+
+Load tokenizers directly from HuggingFace `tokenizer.json` format, enabling compatibility with models like Gemma and Llama that distribute tokenizers in this format.
+
+```dart
+// Auto-detection via TokenizerJsonLoader (recommended)
+final tokenizer = await TokenizerJsonLoader.fromJsonFile('tokenizer.json');
+
+// Or use HuggingFaceTokenizerLoader directly
+final tokenizer = await HuggingFaceTokenizerLoader.fromJsonFile(
+  'tokenizer.json',
+);
+
+// From JSON string
+final tokenizer = HuggingFaceTokenizerLoader.fromJsonString(jsonString);
+
+// With custom config override
+final tokenizer = HuggingFaceTokenizerLoader.fromJsonString(
+  jsonString,
+  config: SentencePieceConfig.gemma,
+);
+
+// Check format before loading
+final data = jsonDecode(jsonString) as Map<String, dynamic>;
+if (TokenizerJsonLoader.isHuggingFaceFormat(data)) {
+  final tokenizer = HuggingFaceTokenizerLoader.fromMap(data);
+}
+```
+
+**Supported features:**
+- Unigram and BPE model types
+- Special token detection (unk, bos, eos, pad)
+- Normalizer and post-processor inference
+- Byte fallback from decoder configuration
+- Added tokens beyond base vocabulary
+
 ### Decoding
 
 ```dart
@@ -403,9 +440,19 @@ final customTokenizer = SentencePieceTokenizer.fromModelFileSync(
 
 | Method | Description |
 |--------|-------------|
-| `fromJsonString(json)` | Load from JSON string |
-| `fromJsonFile(path)` | Load from JSON file (async) |
-| `fromJsonFileSync(path)` | Load from JSON file (sync) |
+| `fromJsonString(json)` | Load from JSON string (auto-detects format) |
+| `fromJsonFile(path)` | Load from JSON file (async, auto-detects format) |
+| `fromJsonFileSync(path)` | Load from JSON file (sync, auto-detects format) |
+| `isHuggingFaceFormat(data)` | Check if JSON map is HuggingFace format |
+
+### HuggingFaceTokenizerLoader
+
+| Method | Description |
+|--------|-------------|
+| `fromJsonString(json)` | Load from HuggingFace JSON string |
+| `fromMap(data)` | Load from pre-parsed JSON map |
+| `fromJsonFile(path)` | Load from file (async) |
+| `fromJsonFileSync(path)` | Load from file (sync) |
 
 ### Encoding
 
@@ -450,7 +497,9 @@ Download SentencePiece models from HuggingFace:
 - [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/tokenizer.model)
 - [Gemma](https://huggingface.co/google/gemma-7b/resolve/main/tokenizer.model)
 
-Format: Binary protobuf (.model files from SentencePiece C++ library).
+**Supported formats:**
+- Binary protobuf (`.model` files from SentencePiece C++ library)
+- HuggingFace `tokenizer.json` (auto-detected, v1.3.1+)
 
 ## Testing
 
pubspec.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 name: dart_sentencepiece_tokenizer
 description: A lightweight, pure Dart implementation of SentencePiece tokenizer. Supports BPE (Gemma) and Unigram (Llama) algorithms.
-version: 1.3.0
+version: 1.3.1
 repository: https://github.com/brody-0125/dart_sentencepiece_tokenizer
 issue_tracker: https://github.com/brody-0125/dart_sentencepiece_tokenizer/issues
 topics:
