All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Training pipeline substantially refactored: all text preprocessing (including regex chunking and pretokenization) is now performed by a parallelized Rust-side workflow. This reduces training time by over 80% and cuts memory usage by approximately 50%.
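To illustrate what regex chunking/pretokenization means here, below is a minimal pure-Python sketch using a simplified GPT-2-style pattern. The pattern and the `pretokenize` helper are illustrative assumptions only; the project's actual patterns and parallel Rust pipeline are more involved:

```python
import re

# Simplified GPT-2-style pretokenization pattern (illustrative, not the
# project's real pattern): contractions, words, numbers, punctuation, spaces.
PAT = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

def pretokenize(text: str) -> list[str]:
    """Split text into chunks; BPE merges never cross chunk boundaries."""
    return PAT.findall(text)

print(pretokenize("Hello world, it's 2024!"))
# → ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```

Chunking like this is embarrassingly parallel, which is why moving it to a multithreaded Rust workflow pays off at training scale.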
- Ensured `indicatif` progress resources are properly cleaned up after batch operations.
- Optional terminal progress indicators for training and batch encode/decode operations.
- `save()` now uses the tokenizer's stored regex pattern (`self.pat`) instead of requiring a `reg_pat` parameter. The pattern is now persisted and restored correctly by `load()`.
- Converted Rust extension public API docs to reStructuredText to align with Python-side docs.
- Added `indicatif` as a Rust dependency for progress tracking.
- Added progress indicator logic for training, batch encode, and batch decode operations, including special-token workflows.
- Updated `Cargo.toml` with library metadata (`license`, `repository`, `homepage`, and `documentation`).
- Updated tokenizer model versioning logic to mirror the package version automatically.
- Removed the `_decorators` module; training-time visibility is now provided by progress indicators.
- Parallel encode/decode processing pipeline.
- Support for user-defined special token IDs, with validation for duplicates and overwrites.
- `save()` now raises a `TrainingError` when attempting to save an untrained tokenizer, preventing silent writes of empty files.
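A rough sketch of the duplicate/overwrite validation described above for user-defined special token IDs. The helper and exception names here are hypothetical, chosen for illustration; only the behavior (reject duplicate IDs and silent overwrites) reflects the entry:

```python
class SpecialTokenError(ValueError):
    """Hypothetical error type for conflicting special-token registrations."""

def register_special_tokens(specials, existing=None, overwrite=False):
    """Validate user-supplied special-token IDs before accepting them.

    Rejects re-registering an existing token (unless overwrite=True) and
    assigning an ID that is already taken by a different token.
    """
    existing = dict(existing or {})
    seen_ids = set(existing.values())
    for token, tid in specials.items():
        if token in existing and not overwrite:
            raise SpecialTokenError(f"token {token!r} already registered")
        if tid in seen_ids and existing.get(token) != tid:
            raise SpecialTokenError(f"id {tid} already in use")
        existing[token] = tid
        seen_ids.add(tid)
    return existing
```

Validating up front keeps ID collisions from surfacing later as silent decode corruption.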
- Ported the majority of the Python encode and decode logic to Rust.
- Streamlined Python `Tokenizer` internals to eliminate redundant code following the migration of hot-path logic to Rust.
- Added internal helper methods in the `Tokenizer` ABC for wiring Rust logic to the Python API.
- Moved `decode_batch()` to the `Tokenizer` ABC.
- Added an `encode_batch()` method in `Tokenizer` subclasses.
- Bumped the package version from `v0.1.2` to `v0.2.0`.
- Updated the tokenizer version from `0.1.0` to `0.2.0` for consistency with the package version.
- Updated the README quick start section with examples of parallel encoding and special-token workflows.
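As a sketch of the batch API shape, here is a minimal `encode_batch()` that maps a stand-in encoder over many texts concurrently while preserving input order. This is purely illustrative: the real parallelism lives in Rust, and the `encode` stub here (raw UTF-8 bytes) is an assumption, not the project's encoder:

```python
from concurrent.futures import ThreadPoolExecutor

def encode(text: str) -> list[int]:
    # Stand-in for the Rust-backed encoder: just raw UTF-8 byte values.
    return list(text.encode("utf-8"))

def encode_batch(texts: list[str], max_workers: int = 4) -> list[list[int]]:
    """Encode many texts concurrently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(encode, texts))
```

Doing this in Rust sidesteps the GIL, which is why the batch paths were moved there rather than threaded in Python.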
- Improved encoding and decoding performance across single-text, batch, and special-token workloads.
- Regex patterns for StarCoder, BLOOM, and Falcon were removed from the presets, as the remaining patterns cover most use cases.
- The `_bpe.py` module and its `slow_bpe_merge()` and `slow_bpe_merge_with_freq_update()` functions have been removed.
- Fixed duplicate symbol exports in `__init__.py` and `__init__.pyi` that caused redundant auto-imports and member completions in IDEs.
- Updated the `TokenPattern` documentation by turning the regex pattern table into a list and removing the Source column.
- Clarified that `BasicTokenizer` lossy decoding refers to UTF-8 reconstruction during `decode()` (uses `errors="replace"`).
- Added Python version classifiers to fix PyPI badge display on GitHub.
- Byte-level BPE tokenizer with Rust-accelerated training and encoding via PyO3/maturin.
- O(N log V) training and O(N log N) encoding algorithms based on Algorithm 2.
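For readers unfamiliar with BPE training, a naive reference version of the merge loop is sketched below. Note this is the textbook O(N·merges) formulation for illustration only, not the heap-based O(N log V) algorithm the entry refers to:

```python
from collections import Counter

def train_bpe(data: bytes, num_merges: int) -> dict:
    """Naive reference BPE trainer: repeatedly merge the most frequent
    adjacent pair into a new token id (illustrative, not the fast path)."""
    ids = list(data)          # start from raw byte values 0..255
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges[(a, b)] = next_id
        # Replace every occurrence of (a, b) with the new token id.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges
```

The fast variant replaces the full recount each iteration with a priority queue over pair frequencies, which is where the log factors come from.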
- `RegexTokenizer` with built-in patterns from GPT-2, GPT-4, GPT-4o, LLaMA 3, Qwen 2, DeepSeek, StarCoder, Falcon, and BLOOM.
- `BasicTokenizer` as a minimal reference implementation.
- Custom regex pattern support via `get_tokenizer(custom_pattern=...)`.
- Special token registration and handling with four built-in strategies: `all`, `none`, `none-raise`, and `custom`.
- Versioned `.model`/`.vocab` serialization format with `save()` and `load()`.
- `from_pretrained()` factory for loading saved tokenizers with auto-detected type.
- `TokenPattern` enum for looking up pre-defined patterns by name.
- Custom exception hierarchy rooted in `ByteTokError`.
- CI/CD pipeline for building platform wheels (Linux, macOS, Windows) and publishing to PyPI.
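The four special-token strategies can be pictured with a small splitting helper. The function name, signature, and error choice below are assumptions made for illustration; only the four strategy behaviors come from the entry above:

```python
import re

def split_on_specials(text, specials, allowed="all"):
    """Split `text` around registered special tokens per strategy.

    Strategies (mirroring the four built-ins):
      - "all":        every registered special token is honored;
      - "none":       specials are treated as ordinary text;
      - "none-raise": raise if any special token appears in the input;
      - a set/list of token strings: honor only those ("custom").
    """
    if allowed == "all":
        active = set(specials)
    elif allowed == "none":
        active = set()
    elif allowed == "none-raise":
        for tok in specials:
            if tok in text:
                raise ValueError(f"special token {tok!r} found in input")
        active = set()
    else:  # "custom": caller passes an explicit collection of tokens
        active = set(allowed)
    if not active:
        return [text]
    # Longest tokens first so overlapping specials match greedily.
    pattern = "(" + "|".join(
        re.escape(t) for t in sorted(active, key=len, reverse=True)
    ) + ")"
    return [part for part in re.split(pattern, text) if part]
```

In the real encoder the special-token segments are then mapped straight to their IDs while the remaining segments go through the regular BPE path.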
