- Add `tokenizer.json` loading support for HuggingFace tokenizer files
  - `WordPieceTokenizer.fromTokenizerJson()` - async file loading
  - `WordPieceTokenizer.fromTokenizerJsonSync()` - sync file loading
  - `WordPieceTokenizer.fromTokenizerJsonString()` - load from JSON string
- Add `Vocabulary.fromMap()` factory for token-to-ID map construction
- Automatically extract normalizer, post-processor, and added tokens from JSON
- Support optional `configOverride` parameter for advanced configuration
- 25 new tests including `vocab.txt` vs `tokenizer.json` equivalence verification
- Add comprehensive dartdoc documentation to all public API elements
- Document library, classes, methods, and properties following Effective Dart guidelines
- Improve pub.dev documentation score (target: 20%+ API documentation)
- Initial release
- Pure Dart implementation of BERT WordPiece tokenizer
- 100% HuggingFace tokenizers compatibility
- Memory-efficient typed arrays (Int32List, Uint8List)
- Single text and sentence pair encoding
- Batch encoding (sequential and parallel with Isolates)
- Padding and truncation support
- Offset mapping (char-to-token, token-to-char, word-to-tokens)
- Vocabulary access and token conversion utilities