Prepare v1.3.1 release: CHANGELOG, README, and pubspec updates (#18)
Add CHANGELOG entry for HuggingFace tokenizer.json format support
(PR #17), update README with new loading section and API reference,
and bump version to 1.3.1 for pub.dev release.
https://claude.ai/code/session_01WuXLBfrYomYDizonffKcz5
Co-authored-by: Claude <noreply@anthropic.com>
README.md: 55 additions & 6 deletions
@@ -16,13 +16,14 @@ A lightweight, pure Dart implementation of SentencePiece tokenizer. Supports BPE
 - **Batch Processing** - Sequential and parallel (Isolate-based) batch encoding
 - **Streaming API** - HuggingFace TextStreamer compatible for real-time LLM output
 - **HuggingFace Compatible** - JSON serialization, dynamic token addition, tokenize() API
-- **Well Tested** - 274 tests with 100% pass rate
+- **HuggingFace tokenizer.json** - Load tokenizers directly from HuggingFace `tokenizer.json` format
+- **Well Tested** - 274+ tests with 100% pass rate
 
 ## Installation
 
 ```yaml
 dependencies:
-  dart_sentencepiece_tokenizer: ^1.3.0
+  dart_sentencepiece_tokenizer: ^1.3.1
 ```
 
 ## Quick Start
@@ -242,6 +243,42 @@ final loadedSync = TokenizerJsonLoader.fromJsonFileSync('tokenizer.json');
 final fromString = TokenizerJsonLoader.fromJsonString(jsonString);
 ```
 
+### HuggingFace tokenizer.json Loading (v1.3.1+)
+
+Load tokenizers directly from HuggingFace `tokenizer.json` format, enabling compatibility with models like Gemma and Llama that distribute tokenizers in this format.
+
+```dart
+// Auto-detection via TokenizerJsonLoader (recommended)
+final tokenizer = await TokenizerJsonLoader.fromJsonFile('tokenizer.json');
+
+// Or use HuggingFaceTokenizerLoader directly
+final tokenizer = await HuggingFaceTokenizerLoader.fromJsonFile(
+  'tokenizer.json',
+);
+
+// From JSON string
+final tokenizer = HuggingFaceTokenizerLoader.fromJsonString(jsonString);
+
+// With custom config override
+final tokenizer = HuggingFaceTokenizerLoader.fromJsonString(
+  jsonString,
+  config: SentencePieceConfig.gemma,
+);
+
+// Check format before loading
+final data = jsonDecode(jsonString) as Map<String, dynamic>;
+if (TokenizerJsonLoader.isHuggingFaceFormat(data)) {
+  final tokenizer = HuggingFaceTokenizerLoader.fromMap(data);
+}
+```
+
+**Supported features:**
+- Unigram and BPE model types
+- Special token detection (unk, bos, eos, pad)
+- Normalizer and post-processor inference
+- Byte fallback from decoder configuration
+- Added tokens beyond base vocabulary
+
 ### Decoding
 
 ```dart
@@ -403,9 +440,19 @@ final customTokenizer = SentencePieceTokenizer.fromModelFileSync(
 
 | Method | Description |
 |--------|-------------|
-| `fromJsonString(json)` | Load from JSON string |
-| `fromJsonFile(path)` | Load from JSON file (async) |
-| `fromJsonFileSync(path)` | Load from JSON file (sync) |
+| `fromJsonString(json)` | Load from JSON string (auto-detects format) |
+| `fromJsonFile(path)` | Load from JSON file (async, auto-detects format) |
+| `fromJsonFileSync(path)` | Load from JSON file (sync, auto-detects format) |
+| `isHuggingFaceFormat(data)` | Check if JSON map is HuggingFace format |
+
+### HuggingFaceTokenizerLoader
+
+| Method | Description |
+|--------|-------------|
+| `fromJsonString(json)` | Load from HuggingFace JSON string |
+| `fromMap(data)` | Load from pre-parsed JSON map |
+| `fromJsonFile(path)` | Load from file (async) |
+| `fromJsonFileSync(path)` | Load from file (sync) |
 
 ### Encoding
 
@@ -450,7 +497,9 @@ Download SentencePiece models from HuggingFace: