BudTikTok: High-Performance HF-Compatible Tokenization

BudTikTok is a production-ready tokenization library that bridges high-performance systems and the HuggingFace ecosystem. It reports a 5-10x performance advantage over the standard HuggingFace tokenizers while maintaining 95% API and format compatibility; the known gaps are listed in BUDTIKTOK_HF_GAP_ANALYSIS.md.

🚀 Key Features

⚡ Extreme Performance

SIMD Acceleration: Runtime-detected optimization for AVX-512, AVX2, SSE4.2 (x86_64) and NEON, SVE (ARM64).
Parallel Execution: Native Rayon integration for multi-threaded batch encoding and decoding.
Intelligent Caching: Multi-level cache with CLOCK eviction and sharded access for high concurrency.
Lazy Evaluation: Zero-copy pipeline design that only computes what is necessary.
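
To make the CLOCK eviction policy mentioned above concrete, here is a minimal single-threaded sketch in Python. It is not BudTikTok's implementation: the class name `ClockCache`, its methods, and the choice to insert new entries with the reference bit cleared are assumptions made for illustration, and sharding and the multi-level layout are left out.

```python
from typing import Any, Dict, Hashable, List, Optional


class ClockCache:
    """Fixed-capacity cache with CLOCK (second-chance) eviction.

    Hypothetical illustration only; not the BudTikTok cache.
    """

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Each occupied slot holds [key, value, referenced_bit].
        self.slots: List[Optional[list]] = [None] * capacity
        self.index: Dict[Hashable, int] = {}  # key -> slot position
        self.hand = 0                         # the clock hand

    def get(self, key: Hashable) -> Any:
        pos = self.index.get(key)
        if pos is None:
            return None
        self.slots[pos][2] = True  # mark as recently used
        return self.slots[pos][1]

    def put(self, key: Hashable, value: Any) -> None:
        pos = self.index.get(key)
        if pos is not None:
            self.slots[pos][1] = value
            self.slots[pos][2] = True
            return
        # Sweep: clear reference bits until an empty or unreferenced slot is found.
        while True:
            slot = self.slots[self.hand]
            if slot is None:
                break
            if not slot[2]:
                del self.index[slot[0]]  # evict the victim
                break
            slot[2] = False              # give it a second chance
            self.hand = (self.hand + 1) % self.capacity
        self.slots[self.hand] = [key, value, False]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity


cache = ClockCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # sets the reference bit on "a"
cache.put("c", 3)  # "a" gets a second chance, "b" is evicted
```

CLOCK approximates LRU with one bit per entry and no list reordering on reads, which is why it tends to suit read-heavy, concurrent caches.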

🎮 GPU Acceleration

CUDA Support: Fully integrated GPU tokenization pipeline.
Multi-GPU: Automatic load balancing across available GPUs.
Async Pipeline: Overlapped CPU-GPU data transfer for maximum throughput.

🔌 Full HuggingFace Compatibility

Drop-in Replacement: Compatible with standard tokenizer.json files.
Post-Processing: Native support for BERT, RoBERTa, and Template post-processors.
Model Support:
- WordPiece (BERT, DistilBERT, ELECTRA)
- BPE (GPT-2, RoBERTa, Llama-2)
- Unigram (ALBERT, T5)
- WordLevel
Gap Analysis: See BUDTIKTOK_HF_GAP_ANALYSIS.md for a detailed compatibility report.
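
As a reference point for the WordPiece model listed above, the following Python sketch shows the greedy longest-match-first algorithm as it is commonly described for BERT. It is a plain scalar illustration, not BudTikTok's optimized code; the function name and the toy vocabulary are made up for the example.

```python
from typing import List, Set


def wordpiece_tokenize(
    word: str,
    vocab: Set[str],
    unk_token: str = "[UNK]",
    prefix: str = "##",
    max_chars: int = 100,
) -> List[str]:
    """Split one pre-tokenized word into WordPiece sub-tokens."""
    if len(word) > max_chars:
        return [unk_token]
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "[UNK]"}
pieces = wordpiece_tokenize("unaffable", toy_vocab)  # ['un', '##aff', '##able']
unknown = wordpiece_tokenize("xyz", toy_vocab)       # ['[UNK]']
```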

🧠 LatentBud Integration

Pre-tokenized Requests: Native support for pre-tokenized inputs to bypass redundant processing.
Token Budget Routing: Intelligent routing based on token budgets for efficient batching.
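
This README does not describe the routing policy itself, so the sketch below is only one plausible reading of "token budget" batching: pre-tokenized requests are packed, in arrival order, into batches whose total token count stays within a budget. The function name and the greedy policy are assumptions, not LatentBud's API.

```python
from typing import List


def batch_by_token_budget(requests: List[List[int]], budget: int) -> List[List[List[int]]]:
    """Greedily pack pre-tokenized requests into batches of at most `budget` tokens."""
    batches: List[List[List[int]]] = []
    current: List[List[int]] = []
    used = 0
    for ids in requests:
        n = len(ids)
        if n > budget:
            raise ValueError(f"request of {n} tokens exceeds budget {budget}")
        # Close the current batch if this request would overflow it.
        if current and used + n > budget:
            batches.append(current)
            current, used = [], 0
        current.append(ids)
        used += n
    if current:
        batches.append(current)
    return batches


# Token IDs are already known, so no re-tokenization is needed to size the batches.
requests = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
batches = batch_by_token_budget(requests, budget=5)
```

Because the inputs arrive as token IDs, batch sizes can be computed from `len(ids)` alone, which is the benefit the pre-tokenized request path is meant to provide.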

📦 Installation

Python

pip install budtiktok

Rust

Add budtiktok to your Cargo.toml:

[dependencies]
budtiktok = { git = "https://github.com/BudEcosystem/budtiktok.git" }

🛠️ Usage

Python

from budtiktok import Tokenizer

# Load from a standard tokenizer.json file
tokenizer = Tokenizer.from_file("tokenizer.json")

# Encode text
encoding = tokenizer.encode("Hello, world!")
print(f"Tokens: {encoding.tokens}")
print(f"IDs: {encoding.ids}")

# Decode IDs back to text
decoded = tokenizer.decode(encoding.ids)
print(f"Decoded: {decoded}")

Rust

use budtiktok::TokenizerPipeline;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load from a standard tokenizer.json file
    let tokenizer = TokenizerPipeline::from_file("tokenizer.json")?;

    // Encode text
    let encoding = tokenizer.encode("Hello, world!", true)?;
    println!("Tokens: {:?}", encoding.get_tokens());
    println!("IDs: {:?}", encoding.get_ids());

    // Decode IDs back to text
    let decoded = tokenizer.decode(encoding.get_ids(), true)?;
    println!("Decoded: {}", decoded);

    Ok(())
}

GPU Tokenization

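use budtiktok::{GpuConfig, TokenizerPipeline};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Enable GPU with auto-detection
    let config = GpuConfig::auto();
    let tokenizer = TokenizerPipeline::from_file_with_gpu("tokenizer.json", config)?;

    // Example inputs (the exact input type accepted by encode_batch is assumed here)
    let texts = vec!["Hello, world!", "Tokenize this batch on the GPU."];

    // Tokenize on GPU (transparently handles batching)
    let encodings = tokenizer.encode_batch(&texts, true)?;

    Ok(())
}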

🏗️ Architecture

BudTikTok employs a Pipeline Wrapper pattern:

┌─────────────────────────────────────────────────────────────────┐
│                      TokenizerPipeline                          │
│  ┌───────────┐  ┌──────────────┐  ┌───────┐  ┌──────────────┐  │
│  │Normalizer │→ │PreTokenizer  │→ │ Model │→ │PostProcessor │  │
│  │ (Option)  │  │  (Option)    │  │       │  │  (Option)    │  │
│  └───────────┘  └──────────────┘  └───────┘  └──────────────┘  │
│                                       ↓                         │
│  ┌─────────────────┐            ┌─────────┐                     │
│  │ AddedVocabulary │←───────────│ Decoder │                     │
│  │ (Aho-Corasick)  │            │(Option) │                     │
│  └─────────────────┘            └─────────┘                     │
└─────────────────────────────────────────────────────────────────┘

This design allows for:

Lazy Evaluation: Components like normalizers are only applied when necessary.
Zero-Copy Optimizations: Extensive use of Cow<str> and memory mapping.
Low-Contention Concurrency: RwLock for read-heavy vocabulary access (many concurrent readers, exclusive writers) and Arc for shared immutable components.
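
The optional stages and the lazy, copy-avoiding behaviour described above can be sketched in Python as follows. This is an analogy rather than the Rust implementation: `Pipeline` and `lowercase_if_needed` are hypothetical names, and returning the input string unchanged stands in for what `Cow::Borrowed` achieves in Rust.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


def lowercase_if_needed(text: str) -> str:
    """Return the same object when no work is needed, else a lowercased copy."""
    if text.isascii() and not any("A" <= c <= "Z" for c in text):
        return text       # fast path: nothing to change, nothing allocated
    return text.lower()   # slow path: allocate a normalized copy


@dataclass
class Pipeline:
    """Model is mandatory; every other stage is optional and skipped when absent."""
    model: Callable[[str], List[str]]
    normalizer: Optional[Callable[[str], str]] = None
    pre_tokenizer: Optional[Callable[[str], List[str]]] = None
    post_processor: Optional[Callable[[List[str]], List[str]]] = None

    def encode(self, text: str) -> List[str]:
        if self.normalizer is not None:
            text = self.normalizer(text)
        words = self.pre_tokenizer(text) if self.pre_tokenizer is not None else [text]
        tokens = [tok for word in words for tok in self.model(word)]
        if self.post_processor is not None:
            tokens = self.post_processor(tokens)
        return tokens


full = Pipeline(
    model=lambda word: [word],  # trivial stand-in for a real model
    normalizer=lowercase_if_needed,
    pre_tokenizer=str.split,
    post_processor=lambda toks: ["[CLS]"] + toks + ["[SEP]"],
)
bare = Pipeline(model=lambda word: [word])

full_tokens = full.encode("Hello world")  # ['[CLS]', 'hello', 'world', '[SEP]']
bare_tokens = bare.encode("Hi there")     # ['Hi there']
```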

🆕 Recent Updates (Dec 2025)

Memory Safety Refactor

Arc StringInterner - Eliminated potential use-after-free risks with reference-counted string storage
Thread-safe vocabulary - All interned strings are protected by Arc for safe concurrent access
Zero unsafe transmute in hot paths - Safety-critical code now uses bounded lifetimes

Full HuggingFace Serialization

save() and from_pretrained() - Full compatibility with HuggingFace tokenizer.json format
Roundtrip verified - All tokenizer types (BPE, WordPiece, Unigram) pass serialization tests
Model versioning - Deterministic config hashes for cache key generation
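
One common way to obtain a deterministic config hash is to serialize the configuration canonically (sorted keys, fixed separators) and hash the bytes. The sketch below assumes SHA-256 and a JSON-like config dict; the README does not state which hash or canonical form BudTikTok uses, and `config_hash` is a hypothetical helper.

```python
import hashlib
import json
from typing import Any, Dict


def config_hash(config: Dict[str, Any]) -> str:
    """Hash a tokenizer config so that equal configs always yield the same key."""
    canonical = json.dumps(
        config,
        sort_keys=True,         # key order must not affect the hash
        separators=(",", ":"),  # no whitespace variation
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = {"model": {"type": "WordPiece", "unk_token": "[UNK]"}, "version": "1.0"}
b = {"version": "1.0", "model": {"unk_token": "[UNK]", "type": "WordPiece"}}
c = {"version": "1.0", "model": {"unk_token": "[UNK]", "type": "BPE"}}

key_a, key_b, key_c = config_hash(a), config_hash(b), config_hash(c)
```

Here `key_a` and `key_b` are equal despite the different key order, while `key_c` differs, so the digest can serve as a cache key that changes only when the configuration does.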

Enhanced SIMD Performance

AVX2 character classification - AVX2 intrinsics classify 32 bytes in parallel
PSHUFB nibble technique - Ultra-fast ASCII class lookup
Verified correctness - SIMD paths produce identical results to scalar (test-verified)
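
The nibble-lookup idea behind the PSHUFB technique can be emulated in scalar Python: each byte is split into a low and a high nibble, each nibble indexes a 16-entry table, and the two results are ANDed. On AVX2 hardware the two table lookups would be byte shuffles applied to 32 bytes at once. The character classes, bit assignments, and names below are assumptions chosen for illustration, not BudTikTok's actual tables.

```python
from typing import List, Sequence, Tuple

# Each group gets one bit: (high nibble, allowed low nibbles).
GROUPS: Sequence[Tuple[int, Tuple[int, ...]]] = (
    (0x0, (0x9, 0xA, 0xD)),   # bit 0: tab, line feed, carriage return
    (0x2, (0x0,)),            # bit 1: space
    (0x3, tuple(range(10))),  # bit 2: ASCII digits 0-9
)
WHITESPACE = 0b011
DIGIT = 0b100


def build_tables(groups: Sequence[Tuple[int, Tuple[int, ...]]]) -> Tuple[List[int], List[int]]:
    """Build the two 16-entry tables that a byte shuffle would index."""
    lo_table, hi_table = [0] * 16, [0] * 16
    for bit, (hi, los) in enumerate(groups):
        hi_table[hi] |= 1 << bit
        for lo in los:
            lo_table[lo] |= 1 << bit
    return lo_table, hi_table


LO, HI = build_tables(GROUPS)


def classify(data: bytes) -> List[int]:
    """Scalar emulation: one table lookup per nibble, then a bitwise AND."""
    return [LO[b & 0x0F] & HI[b >> 4] for b in data]


def classify_reference(data: bytes) -> List[int]:
    """Plain branching version used to check the table-driven path."""
    out = []
    for b in data:
        if b in (0x09, 0x0A, 0x0D):
            out.append(0b001)
        elif b == 0x20:
            out.append(0b010)
        elif 0x30 <= b <= 0x39:
            out.append(0b100)
        else:
            out.append(0)
    return out


classes = classify(b"a 7\n")  # [0, 2, 4, 1]
```

Giving each (class, high-nibble) pair its own bit keeps the AND exact: a bit survives only when the high nibble selects that group and the low nibble belongs to it.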

Vocabulary Refactor

Structured Vocabulary type - Replaced raw Vec<String> with type-safe vocabulary management
Centralized special tokens - SpecialTokens struct for [CLS], [SEP], [PAD], etc.
Improved type safety - Compile-time guarantees for vocabulary operations

Test Results

407 tests passed (budtiktok-core)
3 serialization roundtrip tests passed
Workspace compilation: Success

📄 License

This project is licensed under the Apache-2.0 license.
