ByteTok implements byte-level Byte Pair Encoding (BPE) with a Rust-accelerated core for training and encoding. Text is first converted to raw bytes, then merged according to learned pair statistics.
The training pipeline first pretokenizes the corpus, deduplicates identical pieces, and tracks their frequencies as weighted counts. Merge steps then operate over those weighted pieces instead of repeatedly rescanning the full token stream, which cuts redundant work while preserving the same merge decisions.
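The weighted-piece approach described above can be illustrated with a toy sketch. This is not ByteTok's Rust implementation, just a minimal Python illustration of the idea: deduplicate pretokenized pieces, count adjacent byte pairs once per unique piece scaled by its frequency, and merge the most frequent pair each round.

```python
# Toy sketch of weighted-piece BPE training (illustration only,
# not ByteTok's actual implementation).
from collections import Counter

def train_bpe(corpus: str, num_merges: int) -> dict:
    # Pretokenize (here: naive whitespace split), then deduplicate
    # pieces and keep their frequencies as weights.
    weights = Counter(corpus.split())
    # Each unique piece becomes a tuple of raw byte values.
    pieces = {piece: tuple(piece.encode("utf-8")) for piece in weights}
    merges = {}
    next_id = 256  # byte values 0..255 form the base vocabulary
    for _ in range(num_merges):
        # Count adjacent pairs once per unique piece, scaled by weight,
        # instead of rescanning the full token stream.
        pair_counts = Counter()
        for piece, toks in pieces.items():
            for pair in zip(toks, toks[1:]):
                pair_counts[pair] += weights[piece]
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges[best] = next_id
        # Apply the chosen merge to every unique piece.
        for piece, toks in pieces.items():
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(next_id)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            pieces[piece] = tuple(out)
        next_id += 1
    return merges

merges = train_bpe("low low low lower lowest", num_merges=3)
```

Because every duplicate piece contributes through a single weighted entry, each merge round costs time proportional to the number of *unique* pieces rather than the corpus length, while producing the same merge decisions.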
If this methodology seems familiar to you, that's because ByteTok's current training algorithm draws inspiration from Hugging Face's implementation!
- High-performance Rust-powered training, encoding, and decoding: Engineered from the ground up with a parallel processing pipeline for efficient handling of large-scale NLP datasets (1GB+) with the aim of enabling rapid processing for modern LLM applications.
- Built-in regex patterns: Choose from pre-tokenization regex presets covering GPT-2, GPT-4, GPT-4o, LLaMA 3, Qwen 2, and DeepSeek.
- Custom regex patterns: Supported alongside the built-in presets.
- Special token strategies: Control how special tokens are handled during encoding.
- Serialization: Supports versioned `.model`/`.vocab` file formats for saving tokenizer state, as well as easy loading via a `from_pretrained()` function.
This project started as a weekend experiment with BPE for text compression. I later needed a tokenizer for my custom GPT, which was bottlenecked by context length due to character-level encoding. I wanted a simple API that did four things correctly at a reasonable speed:
- Train on custom text
- Save learned encodings
- Encode text
- Decode text
Feel free to check out robust libraries such as OpenAI's tiktoken and Google's sentencepiece, which are widely adopted in production environments. tiktoken resembles ByteTok the most, but note that ByteTok provides a training pipeline, which tiktoken lacks.
In contrast, ByteTok was developed with a different focus: it prioritizes simplicity and usability by offering a clear API that efficiently maps strings to lists of token IDs, all without burdening users with overly complex configuration or excessive parameters.
These benchmarks were conducted on a Linux x86_64 system equipped with an Intel Core i7-12700H processor (20 cores @ 4.70 GHz) and 32GB DDR5 RAM. Encoding and decoding throughput represent the speed of encode_batch() and decode_batch() operations, respectively.
Dataset: Sci-Fi Books (Gutenberg)
| Corpus Size | Vocab Size | Training Time | Encoding Throughput | Decoding Throughput | Compression Ratio | Size Reduction |
|---|---|---|---|---|---|---|
| 132.36 MB | 10,000 | 32.4 secs | 14.13 MB/sec | 80.9M tokens/sec | 3.53x | 71.6% |
| 216.96 MB | 25,000 | 1.26 mins | 13.65 MB/sec | 83.8M tokens/sec | 3.66x | 72.7% |
| 216.96 MB | 50,000 | 1.38 mins | 12.86 MB/sec | 81.6M tokens/sec | 3.80x | 73.7% |
| 326.96 MB | 50,000 | 2.09 mins | 12.43 MB/sec | 81.6M tokens/sec | 3.84x | 74.0% |
| 420.36 MB | 100,000 | 4.06 mins | 12.00 MB/sec | 84.7M tokens/sec | 3.96x | 74.7% |
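As a sanity check on the last two columns, size reduction follows directly from compression ratio via `reduction = 1 - 1/ratio` (small discrepancies come from the ratios being rounded to two decimals):

```python
# Size reduction derived from compression ratio: reduction = 1 - 1/ratio.
for ratio in (3.53, 3.66, 3.80, 3.84, 3.96):
    print(f"{ratio}x -> {100 * (1 - 1 / ratio):.1f}% smaller")
```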
- Python >= 3.12
Install from PyPI:
```shell
# with pip
pip install bytetok

# or with uv (recommended)
uv add bytetok
```

If you want to develop or build from source, you will need the Rust toolchain via rustup.
```shell
# clone the repository
git clone https://github.com/VihangaFTW/bytetok.git

# install with uv
uv sync

# or build with maturin
uv sync --group dev
uv run maturin develop --release
```

Here you will find the primary workflows for using ByteTok tokenizers. For detailed API usage and additional features, see the full documentation in the Wiki.
The API has been designed with simplicity in mind:
```python
import bytetok as btok

# Create a tokenizer with a built-in pattern (default: gpt4o).
tokenizer = btok.get_tokenizer("gpt4o")

# Train on text.
tokenizer.train("your training corpus here...", vocab_size=1000)

# Encode and decode.
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)
assert text == "Hello, world!"

# Save and reload.
tokenizer.save("my_tokenizer")
reloaded = btok.from_pretrained("my_tokenizer.model")
```

Custom regex patterns can be used for pre-tokenization:
```python
import bytetok as btok

# Create a tokenizer with a custom pattern.
# For example, split on whitespace and punctuation.
tokenizer = btok.get_tokenizer(custom_pattern=r"\w+|[^\w\s]")
```

For best results, it is recommended to choose from the built-in presets, which have been extensively validated.
ByteTok supports parallel encoding and decoding for faster processing of large batches of text.
Use `encode_batch` to encode large collections of texts in parallel, then decode the resulting list of token sequences in parallel with `decode_batch`:
```python
import bytetok as btok

tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)

# Encode a batch of texts in parallel.
texts = ["First document...", "Second document...", "Third document..."]
encoded = tokenizer.encode_batch(texts, show_progress=False)

# Decode the batch in parallel.
decoded = tokenizer.decode_batch(encoded, errors="replace", show_progress=False)
assert decoded[0] == "First document..."
```

Register special tokens after training, then encode with a strategy to control how they are handled:
```python
import bytetok as btok

tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)

# Set special tokens (IDs must be >= vocab size).
tokenizer.set_special_tokens({"<|endoftext|>": 15005, "<|pad|>": 13005})

# Encode with strategy: "all" allows special tokens in text; "none" ignores them.
strategy = btok.get_strategy("all")
tokens = tokenizer.encode("Hello<|endoftext|>world", strategy=strategy)

# Batch encoding with special tokens.
encoded = tokenizer.encode_batch(
    ["Doc one.", "Doc two<|pad|>padding", "Doc three."],
    strategy=strategy,
)
```

ByteTok automatically checks for conflicts when special tokens would replace existing tokens in the vocabulary, or when duplicate IDs are supplied.
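The kind of validation described above can be sketched as follows. Note that the function name and error messages here are illustrative, not ByteTok's internals:

```python
# Illustrative sketch of special-token conflict checks (not ByteTok's
# actual implementation): IDs must not collide with the base vocabulary
# and must not duplicate each other.
def validate_special_tokens(specials: dict[str, int], vocab_size: int) -> None:
    ids = list(specials.values())
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate special-token IDs")
    for token, token_id in specials.items():
        if token_id < vocab_size:
            raise ValueError(
                f"{token!r} id {token_id} overlaps the base vocabulary"
            )

# IDs at or above vocab_size with no duplicates pass the check.
validate_special_tokens({"<|endoftext|>": 15005, "<|pad|>": 13005}, vocab_size=1000)
```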
For a complete list of special token strategies, see the Wiki documentation.
ByteTok is inspired by Andrej Karpathy's minbpe. A walkthrough of the minbpe repository is available on his YouTube channel here.
This project is licensed under the MIT License. See the LICENSE file for details.