All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Training pipeline substantially refactored: all text preprocessing (including regex chunking and pretokenization) is now performed by a parallelized Rust-side workflow. This reduces training time by over 80% and cuts memory usage by approximately 50%.
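To illustrate what regex chunking/pretokenization means here, below is a minimal pure-Python sketch using a simplified GPT-2-style pattern. The pattern and the `pretokenize` helper are illustrative assumptions only; the project's actual patterns and parallel Rust pipeline are more involved:

```python
import re

# Simplified GPT-2-style pretokenization pattern (illustrative, not the
# project's real pattern): contractions, words, numbers, punctuation, spaces.
PAT = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

def pretokenize(text: str) -> list[str]:
    """Split text into chunks; BPE merges never cross chunk boundaries."""
    return PAT.findall(text)

print(pretokenize("Hello world, it's 2024!"))
# → ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```

Chunking like this is embarrassingly parallel, which is why moving it to a multithreaded Rust workflow pays off at training scale.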
- Ensured `indicatif` progress resources are properly cleaned up after batch operations.
- Optional terminal progress indicators for training and batch encode/decode operations.
- `save()` now uses the tokenizer's stored regex pattern (`self.pat`) instead of requiring a `reg_pat` parameter. The pattern is now persisted and restored correctly by `load()`.
- Converted Rust extension public API docs to reStructuredText to align with Python-side docs.
- Added `indicatif` as a Rust dependency for progress tracking.
- Added progress indicator logic for training, batch encode, and batch decode operations, including special-token workflows.
- Updated `Cargo.toml` with library metadata (`license`, `repository`, `homepage`, and `documentation`).
- Updated tokenizer model versioning logic to mirror the package version automatically.
- Removed the `_decorators` module; training-time visibility is now provided by progress indicators.
- Parallel encode/decode processing pipeline.
- Support for user-defined special token IDs, with validation for duplicates and overwrites.
- `save()` now raises a `TrainingError` when attempting to save an untrained tokenizer, preventing silent writes of empty files.
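A rough sketch of the duplicate/overwrite validation described above for user-defined special token IDs. The helper and exception names here are hypothetical, chosen for illustration; only the behavior (reject duplicate IDs and silent overwrites) reflects the entry:

```python
class SpecialTokenError(ValueError):
    """Hypothetical error type for conflicting special-token registrations."""

def register_special_tokens(specials, existing=None, overwrite=False):
    """Validate user-supplied special-token IDs before accepting them.

    Rejects re-registering an existing token (unless overwrite=True) and
    assigning an ID that is already taken by a different token.
    """
    existing = dict(existing or {})
    seen_ids = set(existing.values())
    for token, tid in specials.items():
        if token in existing and not overwrite:
            raise SpecialTokenError(f"token {token!r} already registered")
        if tid in seen_ids and existing.get(token) != tid:
            raise SpecialTokenError(f"id {tid} already in use")
        existing[token] = tid
        seen_ids.add(tid)
    return existing
```

Validating up front keeps ID collisions from surfacing later as silent decode corruption.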
- Ported the majority of the Python encode and decode logic to Rust.
- Streamlined Python `Tokenizer` internals to eliminate redundant code following the migration of hot-path logic to Rust.
- Added internal helper methods in the `Tokenizer` ABC for wiring Rust logic to the Python API.
- Moved `decode_batch()` to the `Tokenizer` ABC.
- Added an `encode_batch()` method in `Tokenizer` subclasses.
- Bumped the package version from `v0.1.2` to `v0.2.0`.
- Updated the tokenizer version from `0.1.0` to `0.2.0` for consistency with the package version.
- Updated the README quick start section with examples of parallel encoding and special-token workflows.
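As a sketch of the batch API shape, here is a minimal `encode_batch()` that maps a stand-in encoder over many texts concurrently while preserving input order. This is purely illustrative: the real parallelism lives in Rust, and the `encode` stub here (raw UTF-8 bytes) is an assumption, not the project's encoder:

```python
from concurrent.futures import ThreadPoolExecutor

def encode(text: str) -> list[int]:
    # Stand-in for the Rust-backed encoder: just raw UTF-8 byte values.
    return list(text.encode("utf-8"))

def encode_batch(texts: list[str], max_workers: int = 4) -> list[list[int]]:
    """Encode many texts concurrently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(encode, texts))
```

Doing this in Rust sidesteps the GIL, which is why the batch paths were moved there rather than threaded in Python.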
- Improved encoding and decoding performance across single-text, batch, and special-token workloads.
- Regex patterns for StarCoder, BLOOM, and Falcon were removed from the presets, as the remaining patterns cover most use cases.
- The `_bpe.py` module and its `slow_bpe_merge()` and `slow_bpe_merge_with_freq_update()` functions have been removed.
- Fixed duplicate symbol exports in `__init__.py` and `__init__.pyi` that caused redundant auto-imports and member completions in IDEs.
- Updated the `TokenPattern` documentation by turning the regex pattern table into a list and removing the Source column.
- Clarified that `BasicTokenizer` lossy decoding refers to UTF-8 reconstruction during `decode()` (uses `errors="replace"`).
- Added Python version classifiers to fix PyPI badge display on GitHub.
- Byte-level BPE tokenizer with Rust-accelerated training and encoding via PyO3/maturin.
- O(N log V) training and O(N log N) encoding algorithms based on Algorithm 2.
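For readers unfamiliar with BPE training, a naive reference version of the merge loop is sketched below. Note this is the textbook O(N·merges) formulation for illustration only, not the heap-based O(N log V) algorithm the entry refers to:

```python
from collections import Counter

def train_bpe(data: bytes, num_merges: int) -> dict:
    """Naive reference BPE trainer: repeatedly merge the most frequent
    adjacent pair into a new token id (illustrative, not the fast path)."""
    ids = list(data)          # start from raw byte values 0..255
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges[(a, b)] = next_id
        # Replace every occurrence of (a, b) with the new token id.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges
```

The fast variant replaces the full recount each iteration with a priority queue over pair frequencies, which is where the log factors come from.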
- `RegexTokenizer` with built-in patterns from GPT-2, GPT-4, GPT-4o, LLaMA 3, Qwen 2, DeepSeek, StarCoder, Falcon, and BLOOM.
- `BasicTokenizer` as a minimal reference implementation.
- Custom regex pattern support via `get_tokenizer(custom_pattern=...)`.
- Special token registration and handling with four built-in strategies: `all`, `none`, `none-raise`, and `custom`.
- Versioned `.model`/`.vocab` serialization format with `save()` and `load()`.
- `from_pretrained()` factory for loading saved tokenizers with auto-detected type.
- `TokenPattern` enum for looking up pre-defined patterns by name.
- Custom exception hierarchy rooted in `ByteTokError`.
- CI/CD pipeline for building platform wheels (Linux, macOS, Windows) and publishing to PyPI.
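The four special-token strategies can be pictured with a small splitting helper. The function name, signature, and error choice below are assumptions made for illustration; only the four strategy behaviors come from the entry above:

```python
import re

def split_on_specials(text, specials, allowed="all"):
    """Split `text` around registered special tokens per strategy.

    Strategies (mirroring the four built-ins):
      - "all":        every registered special token is honored;
      - "none":       specials are treated as ordinary text;
      - "none-raise": raise if any special token appears in the input;
      - a set/list of token strings: honor only those ("custom").
    """
    if allowed == "all":
        active = set(specials)
    elif allowed == "none":
        active = set()
    elif allowed == "none-raise":
        for tok in specials:
            if tok in text:
                raise ValueError(f"special token {tok!r} found in input")
        active = set()
    else:  # "custom": caller passes an explicit collection of tokens
        active = set(allowed)
    if not active:
        return [text]
    # Longest tokens first so overlapping specials match greedily.
    pattern = "(" + "|".join(
        re.escape(t) for t in sorted(active, key=len, reverse=True)
    ) + ")"
    return [part for part in re.split(pattern, text) if part]
```

In the real encoder the special-token segments are then mapped straight to their IDs while the remaining segments go through the regular BPE path.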
