Releases: brody-0125/dart_sentencepiece_tokenizer
v1.3.2
What's New
GitHub Actions CI Pipeline (#19)
Automated continuous integration is now configured for the project:
- Analyze job: Enforces dart format consistency and dart analyze --fatal-infos with zero tolerance for warnings.
- Test job: Runs the full test suite across a matrix of Dart stable and Dart 3.10.7 (minimum supported SDK version).
- Minimal permissions (contents: read) and concurrency groups to cancel stale runs.
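For readers who want a similar setup, the jobs described above could be expressed roughly as the workflow below. This is a sketch, not the project's actual workflow file: the workflow name, job names, triggers, runner image, action versions, and the concurrency group expression are all assumptions. Only the dart format and dart analyze --fatal-infos checks, the stable/3.10.7 matrix, the contents: read permission, and the cancel-stale-runs behavior come from the notes above.

```yaml
# Hypothetical .github/workflows/ci.yml; names and versions are assumptions.
name: CI

on:
  push:
  pull_request:

# Minimal token permissions for the whole workflow.
permissions:
  contents: read

# Cancel an in-progress run when a newer one starts for the same ref.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dart-lang/setup-dart@v1
      - run: dart pub get
      # Fails if any file would be changed by the formatter.
      - run: dart format --output=none --set-exit-if-changed .
      # Treats info-level diagnostics as failures.
      - run: dart analyze --fatal-infos

  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        sdk: [stable, "3.10.7"]
    steps:
      - uses: actions/checkout@v4
      - uses: dart-lang/setup-dart@v1
        with:
          sdk: ${{ matrix.sdk }}
      - run: dart pub get
      - run: dart test
```

Pinning the minimum SDK in the matrix alongside stable catches accidental use of newer language or library features that the declared lower bound does not support.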
Improvements
- Code formatting: Applied dart format to 23 source files for consistent code style across the codebase.
- Static analysis cleanup: Resolved all dart analyze --fatal-infos issues:
  - Removed deprecated avoid_returning_null_for_future lint rule.
  - Added curly braces to if statements, const constructors, and final local variables where required.
- Documentation (#21): Added inline comments clarifying google/sentencepiece proto spec compliance for default token IDs (unkId=0, bosId=1, eosId=2, padId=-1).
Notes
This is a maintenance release with no API changes or breaking changes. Focus: CI infrastructure, code hygiene, and documentation clarity.
Full Changelog: v1.3.1...v1.3.2
v1.3.1 — HuggingFace tokenizer.json Native Support
Load HuggingFace tokenizers directly from tokenizer.json — no conversion step required.
What's New
HuggingFace tokenizer.json Format Support
You can now load any HuggingFace tokenizer.json file without converting it to SentencePiece .model format first. This makes it straightforward to use tokenizers published on the HuggingFace Hub.
// Load from file
final tokenizer = await HuggingFaceTokenizerLoader.fromJsonFile('tokenizer.json');
// Load from a pre-parsed map
final tokenizer = HuggingFaceTokenizerLoader.fromMap(jsonMap);
// Auto-detection — works transparently with TokenizerJsonLoader
final tokenizer = await TokenizerJsonLoader.fromJsonFile('tokenizer.json');
Supported model types:
- Unigram — Llama, T5, ALBERT, XLNet, and other Unigram-based models
- BPE — Gemma, GPT-2, RoBERTa, and other BPE-based models
Automatic configuration inference:
- Special tokens (unk, bos, eos, pad) are detected from the added_tokens section
- Normalizer settings (addDummyPrefix, escapeWhitespaces) are inferred from the HuggingFace normalizer config
- Post-processor flags (addBosToken, addEosToken) are parsed from TemplateProcessing
- Byte fallback behavior is detected from the decoder configuration
- Tokens beyond the base vocabulary are handled automatically
Format detection:
TokenizerJsonLoader.isHuggingFaceFormat() lets you check whether a JSON map uses the HuggingFace format. When you call TokenizerJsonLoader.fromJsonFile(), HuggingFace format is detected and delegated automatically — no code changes needed if you already use TokenizerJsonLoader.
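To give a feel for what a format check of this kind involves, here is a self-contained sketch. It is not the package's implementation of isHuggingFaceFormat(): the helper name looksLikeHuggingFaceJson is hypothetical, and the heuristic (a top-level model object whose type is Unigram or BPE, as HuggingFace tokenizer.json files typically have) is an assumption chosen to match the two model types listed above.

```dart
import 'dart:convert';

/// Hypothetical helper, NOT part of dart_sentencepiece_tokenizer.
/// Assumption: a HuggingFace tokenizer.json has a top-level "model"
/// object whose "type" names the algorithm.
bool looksLikeHuggingFaceJson(Map<String, dynamic> json) {
  final model = json['model'];
  if (model is! Map) return false;
  final type = model['type'];
  return type == 'Unigram' || type == 'BPE';
}

void main() {
  // Minimal stand-in for a HuggingFace BPE tokenizer.json.
  final hf = jsonDecode(
    '{"version":"1.0","added_tokens":[],'
    '"model":{"type":"BPE","vocab":{},"merges":[]}}',
  ) as Map<String, dynamic>;

  // Some other JSON layout without a "model" object.
  final other = jsonDecode('{"pieces":[]}') as Map<String, dynamic>;

  print(looksLikeHuggingFaceJson(hf)); // true
  print(looksLikeHuggingFaceJson(other)); // false
}
```

In application code you would call the package's own TokenizerJsonLoader.isHuggingFaceFormat() rather than a hand-rolled check like this one, since it knows the exact rules the loader uses.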
Install / Upgrade
dependencies:
  dart_sentencepiece_tokenizer: ^1.3.1
Full Changelog: https://github.com/brody-0125/dart_sentencepiece_tokenizer/blob/develop/CHANGELOG.md
1.3.0
What's Changed
- feat: add HuggingFace TextStreamer compatible streaming API by @brody-0125 in #8
Full Changelog: 1.2.2...1.3.0
1.2.2
What's Changed
- feat: optimize memory usage and refactor tests for v1.2.2 by @brody-0125 in #7
Full Changelog: 1.2.1...1.2.2
1.2.1
What's Changed
- feat: add JSON Serialization API, Dynamic Token Addition API and Optimized BPE Algorithm by @brody-0125 in #5
Full Changelog: 1.2.0...1.2.1
1.1.0~1.2.0
What's Changed
- feat: improve BPE Algorithm by @brody-0125 in #1
- feat: improve BPE Algorithm (#1) by @brody-0125 in #2
- feat: add JSON Serialization API, Dynamic Token Addition API and Optimized BPE Algorithm by @brody-0125 in #3
Full Changelog: 1.0.0...1.2.0