Releases · brody-0125/dart_sentencepiece_tokenizer

07 Apr 16:27

brody-0125

v1.3.2

891f2b3

v1.3.2 Latest


What's New

GitHub Actions CI Pipeline (#19)

Automated continuous integration is now configured for the project:

Analyze job — Enforces dart format consistency and dart analyze --fatal-infos with zero tolerance for warnings.
Test job — Runs the full test suite across a matrix of Dart stable and Dart 3.10.7 (minimum supported SDK version).
The workflow runs with minimal permissions (contents: read) and uses concurrency groups to cancel stale runs.

Improvements

Code formatting — Applied dart format to 23 source files for consistent code style across the codebase.
Static analysis cleanup — Resolved all dart analyze --fatal-infos issues:
- Removed deprecated avoid_returning_null_for_future lint rule.
- Added curly braces to if statements, and added const constructors and final local variables where required.
Documentation (#21) — Added inline comments clarifying google/sentencepiece proto spec compliance for default token IDs (unkId=0, bosId=1, eosId=2, padId=-1).
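
To make the default IDs above concrete, here is a minimal Dart sketch. The constant names mirror the ones quoted in the note, but the `isEnabled` helper is hypothetical and not part of the package; it only illustrates the SentencePiece convention that a negative ID (here padId=-1) means the token is disabled.

```dart
// Default special-token IDs as quoted in the release note.
const int unkId = 0;
const int bosId = 1;
const int eosId = 2;
const int padId = -1; // negative: padding token is disabled by default

/// Hypothetical helper (an assumption for illustration):
/// a special token is considered enabled when its ID is non-negative.
bool isEnabled(int id) => id >= 0;

void main() {
  final ids = <String, int>{
    'unkId': unkId,
    'bosId': bosId,
    'eosId': eosId,
    'padId': padId,
  };
  ids.forEach((name, id) {
    final state = isEnabled(id) ? 'enabled' : 'disabled';
    print('$name=$id ($state)');
  });
}
```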

Notes

This is a maintenance release with no API changes or breaking changes. Focus: CI infrastructure, code hygiene, and documentation clarity.

Full Changelog: v1.3.1...v1.3.2


03 Apr 16:12

brody-0125

v1.3.1

147bd0b

v1.3.1 — HuggingFace tokenizer.json Native Support

Load HuggingFace tokenizers directly from tokenizer.json — no conversion step required.

What's New

HuggingFace `tokenizer.json` Format Support

You can now load a HuggingFace tokenizer.json file (Unigram or BPE model) without first converting it to SentencePiece .model format. This makes it straightforward to use tokenizers published on the HuggingFace Hub.

// Load from file
final fromFile = await HuggingFaceTokenizerLoader.fromJsonFile('tokenizer.json');

// Load from a pre-parsed map
final fromMap = HuggingFaceTokenizerLoader.fromMap(jsonMap);

// Auto-detection: TokenizerJsonLoader handles the HuggingFace format transparently
final autoDetected = await TokenizerJsonLoader.fromJsonFile('tokenizer.json');

Supported model types:

Unigram — Llama, T5, ALBERT, XLNet, and other Unigram-based models
BPE — Gemma, GPT-2, RoBERTa, and other BPE-based models

Automatic configuration inference:

Special tokens (unk, bos, eos, pad) are detected from the added_tokens section
Normalizer settings (addDummyPrefix, escapeWhitespaces) are inferred from the HuggingFace normalizer config
Post-processor flags (addBosToken, addEosToken) are parsed from TemplateProcessing
Byte fallback behavior is detected from the decoder configuration
Tokens beyond the base vocabulary are handled automatically
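
The first and third inference steps above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `inferSpecialTokenIds`, `inferBosEos`, and the list of candidate token strings are all hypothetical. Only the `added_tokens` and `TemplateProcessing` field layout follows the HuggingFace tokenizer.json format.

```dart
import 'dart:convert';

/// Hypothetical helper: maps unk/bos/eos/pad to IDs by scanning the
/// `added_tokens` list for entries flagged as special.
Map<String, int> inferSpecialTokenIds(Map<String, dynamic> json) {
  // Assumed token strings for each role; real tokenizers may use others.
  const candidates = <String, List<String>>{
    'unk': ['<unk>'],
    'bos': ['<s>', '<bos>'],
    'eos': ['</s>', '<eos>'],
    'pad': ['<pad>'],
  };
  final result = <String, int>{};
  final added = (json['added_tokens'] as List<dynamic>?) ?? const [];
  for (final entry in added) {
    final token = entry as Map<String, dynamic>;
    if (token['special'] != true) continue;
    final content = token['content'] as String;
    final id = token['id'] as int;
    candidates.forEach((role, names) {
      // Keep the first match for each role.
      if (names.contains(content)) result.putIfAbsent(role, () => id);
    });
  }
  return result;
}

/// Hypothetical helper: reports whether the `single` template of a
/// TemplateProcessing post-processor starts or ends with a special token.
({bool addBos, bool addEos}) inferBosEos(Map<String, dynamic> json) {
  final post = json['post_processor'] as Map<String, dynamic>?;
  if (post == null || post['type'] != 'TemplateProcessing') {
    return (addBos: false, addEos: false);
  }
  final single = (post['single'] as List<dynamic>?) ?? const [];
  bool isSpecial(dynamic item) =>
      (item as Map<String, dynamic>).containsKey('SpecialToken');
  return (
    addBos: single.isNotEmpty && isSpecial(single.first),
    addEos: single.isNotEmpty && isSpecial(single.last),
  );
}

void main() {
  // A trimmed-down, made-up tokenizer.json fragment for demonstration.
  const source = r'''
{
  "added_tokens": [
    {"id": 0, "content": "<unk>", "special": true},
    {"id": 1, "content": "<s>", "special": true},
    {"id": 2, "content": "</s>", "special": true}
  ],
  "post_processor": {
    "type": "TemplateProcessing",
    "single": [
      {"SpecialToken": {"id": "<s>", "type_id": 0}},
      {"Sequence": {"id": "A", "type_id": 0}}
    ]
  }
}
''';
  final json = jsonDecode(source) as Map<String, dynamic>;
  print(inferSpecialTokenIds(json));
  final flags = inferBosEos(json);
  print('addBos=${flags.addBos}, addEos=${flags.addEos}');
}
```

Matching on token strings is a simplification. A production loader would likely also consult the model section and the post-processor to resolve ambiguous cases.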

Format detection:

TokenizerJsonLoader.isHuggingFaceFormat() lets you check whether a JSON map uses the HuggingFace format. When you call TokenizerJsonLoader.fromJsonFile(), HuggingFace format is detected and delegated automatically — no code changes needed if you already use TokenizerJsonLoader.
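
As a rough idea of what such a format check could look like, the sketch below tests for the top-level `model` object that HuggingFace tokenizer.json files carry. The function `looksLikeHuggingFaceFormat` is hypothetical and the heuristic is an assumption; the release note does not describe how TokenizerJsonLoader.isHuggingFaceFormat() decides.

```dart
import 'dart:convert';

/// Hypothetical heuristic (not the package's actual check): treat a map as
/// HuggingFace format when it has a `model` object with a `vocab` entry and
/// a `type` of Unigram or BPE, the two model types listed as supported.
bool looksLikeHuggingFaceFormat(Map<String, dynamic> json) {
  final model = json['model'];
  if (model is! Map<String, dynamic>) return false;
  final type = model['type'];
  return model.containsKey('vocab') && (type == 'Unigram' || type == 'BPE');
}

void main() {
  // Made-up inputs for demonstration only.
  final hf = jsonDecode('{"model": {"type": "BPE", "vocab": {"a": 0}}}')
      as Map<String, dynamic>;
  final other = jsonDecode('{"foo": 1}') as Map<String, dynamic>;

  print(looksLikeHuggingFaceFormat(hf));
  print(looksLikeHuggingFaceFormat(other));
}
```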

Install / Upgrade

dependencies:
  dart_sentencepiece_tokenizer: ^1.3.1

Full Changelog: https://github.com/brody-0125/dart_sentencepiece_tokenizer/blob/develop/CHANGELOG.md


02 Feb 14:12

brody-0125

1.3.0

b2f9d49

1.3.0

What's Changed

feat: add HuggingFace TextStreamer compatible streaming API by @brody-0125 in #8

Full Changelog: 1.2.2...1.3.0

Contributors

brody-0125


02 Feb 14:11

brody-0125

1.2.2

4e75dff

1.2.2

What's Changed

feat: optimize memory usage and refactor tests for v1.2.2 by @brody-0125 in #7

Full Changelog: 1.2.1...1.2.2

Contributors

brody-0125


27 Jan 16:47

brody-0125

1.2.1

2f5c652

1.2.1

What's Changed

feat: add JSON Serialization API, Dynamic Token Addition API and Optimized BPE Algorithm by @brody-0125 in #5

Full Changelog: 1.2.0...1.2.1

Contributors

brody-0125


17 Jan 12:49

brody-0125

1.2.0

929d90e

1.1.0~1.2.0

What's Changed

feat: improve BPE Algorithm by @brody-0125 in #1
feat: improve BPE Algorithm (#1) by @brody-0125 in #2
feat: add JSON Serialization API, Dynamic Token Addition API and Optimized BPE Algorithm by @brody-0125 in #3

Full Changelog: 1.0.0...1.2.0

Contributors

brody-0125


02 Jan 16:23

brody-0125

1.0.0

a4d9833

1.0.0

Full Changelog: https://github.com/brody-0125/dart_sentencepiece_tokenizer/commits/1.0.0

