BudTikTok is a next-generation, production-ready tokenization library designed to bridge the gap between high-performance systems and the HuggingFace ecosystem. It offers a 5-10x performance advantage over standard HuggingFace tokenizers while maintaining 95% API and format compatibility.
- SIMD Acceleration: Runtime-detected optimization for AVX-512, AVX2, SSE4.2 (x86_64) and NEON, SVE (ARM64).
- Parallel Execution: Native Rayon integration for multi-threaded batch encoding and decoding.
- Intelligent Caching: Multi-level cache with CLOCK eviction and sharded access for high concurrency.
- Lazy Evaluation: Zero-copy pipeline design that only computes what is necessary.
- CUDA Support: Fully integrated GPU tokenization pipeline.
- Multi-GPU: Automatic load balancing across available GPUs.
- Async Pipeline: Overlapped CPU-GPU data transfer for maximum throughput.
- Drop-in Replacement: Compatible with standard tokenizer.json files.
- Post-Processing: Native support for BERT, RoBERTa, and Template post-processors.
- Model Support:
  - WordPiece (BERT, DistilBERT, Electra)
  - BPE (GPT-2, RoBERTa, Llama-2)
  - Unigram (Albert, T5)
  - WordLevel
- Gap Analysis: See BUDTIKTOK_HF_GAP_ANALYSIS.md for detailed compatibility report.
- Pre-tokenized Requests: Native support for pre-tokenized inputs to bypass redundant processing.
- Token Budget Routing: Intelligent routing based on token budgets for efficient batching.
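The CLOCK eviction policy named in the caching bullet above can be illustrated with a small, single-threaded sketch. This is not BudTikTok's implementation: the class name `ClockCache`, its methods, and the choice to insert new entries with a cleared reference bit are all assumptions made for illustration, and sharding and multi-level lookup are left out.

```python
class ClockCache:
    """Illustrative fixed-capacity cache with CLOCK (second-chance) eviction.

    Hypothetical sketch only; not the library's cache.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.slots = []   # each slot is [key, value, referenced_bit]
        self.index = {}   # key -> position in self.slots
        self.hand = 0     # position the clock hand currently points at

    def get(self, key):
        pos = self.index.get(key)
        if pos is None:
            return None
        self.slots[pos][2] = True  # a hit earns the entry a second chance
        return self.slots[pos][1]

    def put(self, key, value):
        pos = self.index.get(key)
        if pos is not None:
            # Update in place and mark as recently used.
            self.slots[pos][1] = value
            self.slots[pos][2] = True
            return
        if len(self.slots) < self.capacity:
            # Free space left: append without evicting.
            self.index[key] = len(self.slots)
            self.slots.append([key, value, False])
            return
        # Sweep: clear reference bits until an unreferenced slot is found.
        while self.slots[self.hand][2]:
            self.slots[self.hand][2] = False
            self.hand = (self.hand + 1) % self.capacity
        victim_key = self.slots[self.hand][0]
        del self.index[victim_key]
        self.slots[self.hand] = [key, value, False]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity


cache = ClockCache(capacity=2)
cache.put("hello", [7592])
cache.put("world", [2088])
cache.get("hello")          # marks "hello" as referenced
cache.put("again", [2153])  # evicts "world", the unreferenced entry
```

CLOCK approximates LRU with one bit per entry and no list reordering on hits, which is one reason it is commonly paired with sharded, highly concurrent caches.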
pip install budtiktok
Add budtiktok to your Cargo.toml:
[dependencies]
budtiktok = { git = "https://github.com/BudEcosystem/budtiktok.git" }
from budtiktok import Tokenizer
# Load from a standard tokenizer.json file
tokenizer = Tokenizer.from_file("tokenizer.json")
# Encode text
encoding = tokenizer.encode("Hello, world!")
print(f"Tokens: {encoding.tokens}")
print(f"IDs: {encoding.ids}")
# Decode IDs back to text
decoded = tokenizer.decode(encoding.ids)
print(f"Decoded: {decoded}")
use budtiktok::TokenizerPipeline;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Load from a standard tokenizer.json file
let tokenizer = TokenizerPipeline::from_file("tokenizer.json")?;
// Encode text
let encoding = tokenizer.encode("Hello, world!", true)?;
println!("Tokens: {:?}", encoding.get_tokens());
println!("IDs: {:?}", encoding.get_ids());
// Decode IDs back to text
let decoded = tokenizer.decode(encoding.get_ids(), true)?;
println!("Decoded: {}", decoded);
Ok(())
}
use budtiktok::{TokenizerPipeline, GpuConfig};
// Enable GPU with auto-detection
let config = GpuConfig::auto();
let tokenizer = TokenizerPipeline::from_file_with_gpu("tokenizer.json", config)?;
// Tokenize on GPU (transparently handles batching).
// `texts` is assumed to be a caller-provided collection of input strings.
let encodings = tokenizer.encode_batch(&texts, true)?;
BudTikTok employs a Pipeline Wrapper pattern:
┌─────────────────────────────────────────────────────────────────┐
│                        TokenizerPipeline                        │
│ ┌───────────┐   ┌──────────────┐   ┌───────┐   ┌──────────────┐ │
│ │Normalizer │ → │PreTokenizer  │ → │ Model │ → │PostProcessor │ │
│ │ (Option)  │   │  (Option)    │   │       │   │  (Option)    │ │
│ └───────────┘   └──────────────┘   └───────┘   └──────────────┘ │
│                                        │                        │
│ ┌─────────────────┐            ┌─────────┐                      │
│ │ AddedVocabulary │────────────│ Decoder │                      │
│ │ (Aho-Corasick)  │            │(Option) │                      │
│ └─────────────────┘            └─────────┘                      │
└─────────────────────────────────────────────────────────────────┘
This design allows for:
- Lazy Evaluation: Components like normalizers are only applied when necessary.
- Zero-Copy Optimizations: Extensive use of Cow<str> and memory mapping.
- Lock-Free Concurrency: RwLock for read-heavy vocabulary access and Arc for shared immutable components.
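As a rough illustration of the Pipeline Wrapper pattern described above, the sketch below chains a required model stage with optional normalizer, pre-tokenizer, and post-processor stages, skipping any stage that is absent. It is written in Python for brevity; `PipelineSketch` and the toy stage functions are hypothetical and do not mirror the real `TokenizerPipeline` API.

```python
from typing import Callable, List, Optional


class PipelineSketch:
    """Toy pipeline: only the model stage is required (hypothetical names)."""

    def __init__(
        self,
        model: Callable[[str], List[str]],
        normalizer: Optional[Callable[[str], str]] = None,
        pre_tokenizer: Optional[Callable[[str], List[str]]] = None,
        post_processor: Optional[Callable[[List[str]], List[str]]] = None,
    ):
        self.model = model
        self.normalizer = normalizer
        self.pre_tokenizer = pre_tokenizer
        self.post_processor = post_processor

    def encode(self, text: str) -> List[str]:
        # Each optional stage runs only if it was configured.
        if self.normalizer is not None:
            text = self.normalizer(text)
        if self.pre_tokenizer is not None:
            pieces = self.pre_tokenizer(text)
        else:
            pieces = [text]
        tokens: List[str] = []
        for piece in pieces:
            tokens.extend(self.model(piece))
        if self.post_processor is not None:
            tokens = self.post_processor(tokens)
        return tokens


def whole_word_model(piece: str) -> List[str]:
    # Stand-in for a real model such as WordPiece or BPE.
    return [piece]


def bert_style_post(tokens: List[str]) -> List[str]:
    # Wrap the sequence in special tokens, BERT style.
    return ["[CLS]", *tokens, "[SEP]"]


full = PipelineSketch(
    model=whole_word_model,
    normalizer=str.lower,
    pre_tokenizer=str.split,
    post_processor=bert_style_post,
)
bare = PipelineSketch(model=whole_word_model)

print(full.encode("Hello World"))
print(bare.encode("Hello World"))
```

Keeping every stage except the model optional means a configuration that omits, say, the normalizer pays nothing for it at encode time.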
- Arc StringInterner - Eliminated potential use-after-free risks with reference-counted string storage
- Thread-safe vocabulary - All interned strings are protected by Arc for safe concurrent access
- Zero unsafe transmute in hot paths - Safety-critical code now uses bounded lifetimes
- save() and from_pretrained() - Full compatibility with HuggingFace tokenizer.json format
- Roundtrip verified - All tokenizer types (BPE, WordPiece, Unigram) pass serialization tests
- Model versioning - Deterministic config hashes for cache key generation
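One common way to get a deterministic config hash for cache keys is to serialize the configuration canonically and hash the bytes. The sketch below assumes sorted-key JSON and SHA-256; the README does not say which serialization or hash function BudTikTok uses, and `config_hash` is a hypothetical helper name.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Return a hex digest that depends only on the config's contents.

    Assumption: canonical form is compact JSON with sorted keys.
    """
    canonical = json.dumps(
        config,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace differences
        ensure_ascii=True,       # stable byte representation
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = config_hash({"model": "WordPiece", "lowercase": True})
b = config_hash({"lowercase": True, "model": "WordPiece"})
c = config_hash({"model": "WordPiece", "lowercase": False})
print(a == b)  # same contents, different key order
print(a == c)  # different contents
```

Using such a digest as part of a cache key keeps entries produced under one tokenizer configuration from being served under another.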
- AVX2 character classification - Real SIMD intrinsics for 32-byte parallel character classification
- PSHUFB nibble technique - Ultra-fast ASCII class lookup
- Verified correctness - SIMD paths produce identical results to scalar (test-verified)
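The nibble lookup idea behind PSHUFB-based classification can be emulated one byte at a time: each byte is split into a high and a low nibble, each nibble indexes a 16-entry table, and the two results are ANDed. The Python sketch below is a scalar emulation only, with class names and table layout chosen for illustration; a real SIMD path would perform the same two lookups on 16 or 32 bytes per instruction. It also checks the table path against a plain scalar reference, in the spirit of the verification bullet above.

```python
DIGIT, SPACE, ALPHA, OTHER = 1, 2, 4, 0

# One bit per (high-nibble set x low-nibble set) rectangle of the ASCII table.
R_DIGIT = 0x01    # high 3,      low 0-9   -> '0'..'9'
R_CTRL_WS = 0x02  # high 0,      low 9-13  -> \t \n \v \f \r
R_BLANK = 0x04    # high 2,      low 0     -> ' '
R_ALPHA_A = 0x08  # high 4 or 6, low 1-15  -> 'A'..'O', 'a'..'o'
R_ALPHA_P = 0x10  # high 5 or 7, low 0-10  -> 'P'..'Z', 'p'..'z'


def build_tables():
    lo = [0] * 16
    hi = [0] * 16
    rectangles = [
        (R_DIGIT, [3], range(0, 10)),
        (R_CTRL_WS, [0], range(9, 14)),
        (R_BLANK, [2], [0]),
        (R_ALPHA_A, [4, 6], range(1, 16)),
        (R_ALPHA_P, [5, 7], range(0, 11)),
    ]
    for bit, high_nibbles, low_nibbles in rectangles:
        for h in high_nibbles:
            hi[h] |= bit
        for l in low_nibbles:
            lo[l] |= bit
    return lo, hi


LO_TABLE, HI_TABLE = build_tables()


def classify_nibble(byte: int) -> int:
    # A rectangle bit survives the AND only if both nibbles belong to it.
    hit = LO_TABLE[byte & 0x0F] & HI_TABLE[byte >> 4]
    if hit & R_DIGIT:
        return DIGIT
    if hit & (R_CTRL_WS | R_BLANK):
        return SPACE
    if hit & (R_ALPHA_A | R_ALPHA_P):
        return ALPHA
    return OTHER


def classify_scalar(byte: int) -> int:
    # Straightforward reference used to verify the table-driven path.
    ch = chr(byte)
    if "0" <= ch <= "9":
        return DIGIT
    if ch in " \t\n\v\f\r":
        return SPACE
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        return ALPHA
    return OTHER


mismatches = [b for b in range(256) if classify_nibble(b) != classify_scalar(b)]
print(f"mismatching bytes: {mismatches}")
```

Because every class is a union of nibble rectangles with its own bit, two table lookups and one AND are enough, and that shape is what maps well onto byte-shuffle instructions.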
- Structured Vocabulary type - Replaced raw Vec<String> with type-safe vocabulary management
- Centralized special tokens - SpecialTokens struct for [CLS], [SEP], [PAD], etc.
- Improved type safety - Compile-time guarantees for vocabulary operations
407 tests passed (budtiktok-core)
3 serialization roundtrip tests passed
Workspace compilation: Success
This project is licensed under the Apache-2.0 license.