BudTikTok is a high-performance disaggregated tokenization system designed to outperform HuggingFace Tokenizers, TEI, BlazeText, and Snowflake across all hardware platforms.
- Project Setup
- Core Infrastructure
- Tokenization Algorithms
- SIMD Acceleration
- GPU Tokenization
- Distributed Architecture
- LatentBud Integration
- Resilience and Operations
- Test Suites (TDD)
- Profiling Tools
- Accuracy Testing Suite
- Comparison Benchmarking Tool
- Deployment
- Documentation
| Symbol | Meaning |
|---|---|
| [ ] | Not started |
| [~] | In progress |
| [x] | Completed |
| [!] | Blocked |
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 1.1.1 | Create Cargo workspace | Create Cargo.toml with workspace members: budtiktok-core, budtiktok-simd, budtiktok-gpu, budtiktok-ipc, budtiktok-coordinator, budtiktok-cli, budtiktok-bench. Use Rust 2021 edition. Set resolver = "2". | None | `[workspace]`<br>`members = ["crates/*"]`<br>`resolver = "2"`<br>`[workspace.dependencies]`<br>`tokio = { version = "1.35", features = ["full"] }`<br>`ahash = "0.8"`<br>`serde = { version = "1.0", features = ["derive"] }` | [x] |
| 1.1.2 | Configure workspace dependencies | Add shared dependencies at workspace level: tokio, rayon, ahash, serde, tracing, thiserror, anyhow, bytes, parking_lot, crossbeam. Pin versions for reproducibility. | 1.1.1 | Use workspace inheritance: `tokio.workspace = true` in member crates | [x] |
| 1.1.3 | Create crate structure | Create all member crates with proper Cargo.toml, src/lib.rs. Set up re-exports in main budtiktok facade crate. Configure feature flags: simd, gpu, distributed, full. | 1.1.1 | Directory structure: `crates/{core,simd,gpu,ipc,coordinator,cli,bench}/` | [x] |
| 1.1.4 | Configure build.rs | Create build script for: SIMD feature detection at compile time, generate version info, embed git hash, detect CUDA/ROCm. Output cargo:rustc-cfg directives. | 1.1.3 | Use cc crate for C compilation if needed. Check for avx512f, avx2, sse4.2, neon | [x] |
| 1.1.5 | Set up rustfmt and clippy | Create rustfmt.toml with project style. Create .clippy.toml with lint configuration. Add `#![deny(clippy::all)]` to all crates. Configure CI to fail on warnings. | 1.1.3 | Enable pedantic lints, disable overly noisy ones | [x] |
| 1.1.6 | Configure memory allocators | Add jemalloc (Linux) and mimalloc (macOS/Windows) as optional global allocators. Create feature flags: jemalloc, mimalloc. Benchmark both. Default to jemalloc on Linux. | 1.1.3 | `#[cfg(all(target_os = "linux", feature = "jemalloc"))] #[global_allocator] static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;` | [x] |
| 1.1.7 | Set up error handling | Create budtiktok-core/src/error.rs with BudError enum using thiserror. Define error types: VocabError, TokenizeError, IoError, ConfigError, IpcError. Implement From traits. | 1.1.3 | Use `#[error("...")]` attributes for messages. Add `#[from]` for conversions. | [x] |
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 1.2.1 | Create GitHub Actions workflow | Main workflow: build, test, lint, benchmark. Matrix: ubuntu-latest, macos-latest, windows-latest. Targets: x86_64-unknown-linux-gnu, aarch64-unknown-linux-gnu, x86_64-apple-darwin, aarch64-apple-darwin. | 1.1.5 | Use actions-rs/toolchain, actions-rs/cargo. Cache target directory. | [x] |
| 1.2.2 | Add benchmark CI | Run benchmarks on every PR. Compare against baseline. Post results as PR comment. Store historical data. Alert on >10% regression. | 1.2.1 | Use criterion with `--save-baseline`. Compare with critcmp. | [x] |
| 1.2.3 | Add coverage reporting | Generate code coverage with cargo-llvm-cov. Upload to Codecov. Set minimum coverage threshold (80%). Fail CI if below threshold. | 1.2.1 | Use llvm-cov for accurate Rust coverage. Exclude test files. | [x] |
| 1.2.4 | Add security scanning | Run cargo-audit for vulnerability scanning. Run cargo-deny for license compliance. Fail CI on critical vulnerabilities. | 1.2.1 | Configure deny.toml with allowed licenses: MIT, Apache-2.0, BSD | [x] |
| 1.2.5 | Create release workflow | Automated releases on git tag. Build release binaries for all platforms. Create GitHub release with changelog. Publish to crates.io. Build and push Docker images. | 1.2.1 | Use cargo-dist for binary distribution. Sign releases. | [x] |
| 1.2.6 | Add Docker build pipeline | Multi-stage Dockerfile for minimal image size. Separate images for coordinator and worker. GPU-enabled image with CUDA runtime. ARM64 and AMD64 variants. | 1.2.1 | Base on rust:1.75-slim for build, debian:bookworm-slim for runtime | [x] |
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 2.1.1 | Implement ASCII fast path | Check if string is pure ASCII using SIMD-style u64 operations. If all bytes < 128, skip normalization entirely. Return Cow::Borrowed for zero-copy. | 1.1.7 | `pub fn is_ascii_fast(s: &str) -> bool { let bytes = s.as_bytes(); let mut chunks = bytes.chunks_exact(8); for chunk in chunks.by_ref() { let word = u64::from_ne_bytes(chunk.try_into().unwrap()); if word & 0x8080808080808080 != 0 { return false; } } chunks.remainder().iter().all(\|&b\| b < 128) }` | |
| 2.1.2 | Generate Unicode data tables | Parse UnicodeData.txt from Unicode 15.1. Generate: canonical decomposition map, composition pairs, canonical combining classes, quick check properties. Use build.rs or pre-generate. | 1.1.4 | Using unicode-normalization crate for data tables. Custom tables deferred for future optimization. | [x] |
| 2.1.3 | Implement NFC Quick Check | Return Yes/No/Maybe without full normalization. Track CCC ordering. Use lookup table for quick check property. Return early on first No. | 2.1.2 | `pub enum IsNormalized { Yes, No, Maybe } pub fn is_nfc_quick(s: &str) -> IsNormalized { let mut result = IsNormalized::Yes; let mut last_ccc = 0; for ch in s.chars() { if ch.is_ascii() { last_ccc = 0; continue; } let ccc = canonical_combining_class(ch); if last_ccc > ccc && ccc != 0 { return IsNormalized::No; } match quick_check_nfc(ch) { QC::No => return IsNormalized::No, QC::Maybe => result = IsNormalized::Maybe, QC::Yes => {} } last_ccc = ccc; } result }` | [x] |
| 2.1.4 | Implement Bloom filter for Latin-1 | Create 256-bit bloom filter for precomposed Latin-1 characters (0xC0-0x17F). O(1) check before full composition. ~5% false positive rate acceptable. | 2.1.2 | `const PRECOMPOSED_BLOOM: [u64; 4] = [...]; /* generated */ fn might_be_precomposed(ch: char) -> bool { let cp = ch as u32; if cp < 0xC0 \|\| …` | |
| 2.1.5 | Implement canonical decomposition | Recursive decomposition following Unicode algorithm. Handle Hangul algorithmic decomposition (no table needed). Use stack-based iteration to avoid recursion overhead. | 2.1.2 | Hangul decomposition: SBase=0xAC00, LBase=0x1100, VBase=0x1161, TBase=0x11A7. Decompose syllable to L+V+T jamo. | [x] |
| 2.1.6 | Implement canonical composition | Compose starter + combining character pairs. Handle composition exclusions. Use two-stage table for pair lookup. Handle blocked combining characters. | 2.1.2 | Using unicode-normalization crate's `.nfc()` for composition. Hangul composition implemented in `compose_hangul()`. | [x] |
| 2.1.7 | Implement NFD/NFKD/NFKC | NFD: decompose canonically. NFKD: decompose with compatibility. NFC: decompose + compose. NFKC: decompose compat + compose. Share core logic. | 2.1.5, 2.1.6 | Add compatibility decomposition table (~6K entries). Use same algorithm with different table. | [x] |
| 2.1.8 | Add SIMD normalization path | Use AVX2 to process 32 bytes at once. Identify characters needing normalization. Fall back to scalar for complex sequences. | 2.1.7, 4.2.1 | SIMD identifies ASCII (fast) vs needs-work (slow path). Scalar handles actual normalization. | [x] |
| 2.1.9 | Implement streaming normalization | For very long strings, process in chunks. Maintain state across chunk boundaries. Handle combining sequences split across chunks. | 2.1.7 | Buffer incomplete combining sequences. Flush on end of stream. | [x] |
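
The ASCII fast path from task 2.1.1 can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `normalize_with` and its `slow_path` callback are hypothetical names standing in for the real Unicode normalizer.

```rust
use std::borrow::Cow;

/// Returns true when every byte of `s` is below 0x80.
/// Eight bytes are tested at once by masking the high bit of each byte.
pub fn is_ascii_fast(s: &str) -> bool {
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;
    let mut chunks = s.as_bytes().chunks_exact(8);
    for chunk in chunks.by_ref() {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        if u64::from_ne_bytes(buf) & HIGH_BITS != 0 {
            return false;
        }
    }
    // Fewer than eight bytes remain; check them one by one.
    chunks.remainder().iter().all(|&b| b < 128)
}

/// Hypothetical wrapper: borrows pure-ASCII input unchanged and only
/// allocates when `slow_path` (the assumed Unicode normalizer) must run.
pub fn normalize_with<'a>(
    input: &'a str,
    slow_path: impl FnOnce(&str) -> String,
) -> Cow<'a, str> {
    if is_ascii_fast(input) {
        Cow::Borrowed(input)
    } else {
        Cow::Owned(slow_path(input))
    }
}
```

Borrowing on the fast path means callers that only read the result never pay for an allocation on ASCII input.
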
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 2.2.1 | Define CategoryFlags bitfield | Create CategoryFlags(u32) with bits for all 30 Unicode General Categories. Implement is_letter(), is_mark(), is_number(), is_punctuation(), is_symbol(), is_separator(), is_other() as single bitwise AND. | 1.1.7 | `#[derive(Clone, Copy)] pub struct CategoryFlags(u32); impl CategoryFlags { pub const LETTER_UPPERCASE: u32 = 1 << 0; pub const LETTER_LOWERCASE: u32 = 1 << 1; /* ... 28 more */ pub const LETTER: u32 = Self::LETTER_UPPERCASE \| Self::LETTER_LOWERCASE \| …` | |
| 2.2.2 | Generate ASCII lookup table | Compile-time generate const ASCII_CATEGORIES: [CategoryFlags; 128]. Direct O(1) lookup for ASCII. Include in binary. | 2.2.1 | `const ASCII_CATEGORIES: [CategoryFlags; 128] = { let mut table = [CategoryFlags(0); 128]; /* 0-31: Cc (control); const blocks need while, not for */ let mut i = 0; while i < 32 { table[i] = CategoryFlags(CategoryFlags::OTHER_CONTROL); i += 1; } /* 32: Zs (space) */ table[32] = CategoryFlags(CategoryFlags::SEPARATOR_SPACE); /* ... populate all */ table };` | [x] |
| 2.2.3 | Generate Unicode lookup tables | Two-stage table for BMP (0-0xFFFF). Three-stage for supplementary planes. Compress via perfect hashing or trie. Target <50KB binary size. | 2.2.1, 2.1.2 | Using runtime lookups via Rust std lib and unicode-categories. ASCII fast path via const table. | [x] |
| 2.2.4 | Implement thread-local cache | 128-entry direct-mapped cache per thread. Key: char, Value: CategoryFlags. No locking needed. LRU approximation via CLOCK. | 2.2.3 | `thread_local! { static CACHE: RefCell<[(char, CategoryFlags); 128]> = RefCell::new([('\0', CategoryFlags(0)); 128]); } pub fn get_category(ch: char) -> CategoryFlags { if (ch as u32) < 128 { return ASCII_CATEGORIES[ch as usize]; } CACHE.with(\|cache\| …` | |
| 2.2.5 | Implement specialized predicates | is_whitespace(): Zs + Cc whitespace. is_punctuation(): all P categories. is_cjk(): CJK ranges. is_emoji(): emoji ranges. All as single bitwise ops or range checks. | 2.2.1 | CJK ranges: 0x4E00-0x9FFF, 0x3400-0x4DBF, 0x20000-0x2A6DF, etc. Use `matches!` for efficient codegen. | [x] |
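
The bitfield from 2.2.1 and the compile-time ASCII table from 2.2.2 could look roughly like the sketch below. It is an illustration under simplifying assumptions: only a few of the 30 categories are modelled, and `PUNCT_OR_SYMBOL` and `ascii_category` are hypothetical names that lump the remaining ASCII characters together.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CategoryFlags(u32);

impl CategoryFlags {
    pub const LETTER_UPPERCASE: u32 = 1 << 0; // Lu
    pub const LETTER_LOWERCASE: u32 = 1 << 1; // Ll
    pub const NUMBER_DECIMAL: u32 = 1 << 2; // Nd
    pub const SEPARATOR_SPACE: u32 = 1 << 3; // Zs
    pub const OTHER_CONTROL: u32 = 1 << 4; // Cc
    pub const PUNCT_OR_SYMBOL: u32 = 1 << 5; // placeholder for the P* and S* categories
    pub const LETTER: u32 = Self::LETTER_UPPERCASE | Self::LETTER_LOWERCASE;

    /// One bitwise AND answers "is this any kind of letter?".
    pub const fn is_letter(self) -> bool {
        self.0 & Self::LETTER != 0
    }

    pub const fn is_control(self) -> bool {
        self.0 & Self::OTHER_CONTROL != 0
    }
}

/// Built at compile time; `while` is used because `for` is not allowed in const blocks.
const ASCII_CATEGORIES: [CategoryFlags; 128] = {
    let mut table = [CategoryFlags(0); 128];
    let mut i = 0;
    while i < 128 {
        table[i] = CategoryFlags(match i as u8 {
            0..=31 | 127 => CategoryFlags::OTHER_CONTROL,
            b' ' => CategoryFlags::SEPARATOR_SPACE,
            b'0'..=b'9' => CategoryFlags::NUMBER_DECIMAL,
            b'A'..=b'Z' => CategoryFlags::LETTER_UPPERCASE,
            b'a'..=b'z' => CategoryFlags::LETTER_LOWERCASE,
            _ => CategoryFlags::PUNCT_OR_SYMBOL,
        });
        i += 1;
    }
    table
};

/// Direct O(1) lookup; returns None for bytes outside the ASCII range.
pub fn ascii_category(b: u8) -> Option<CategoryFlags> {
    ASCII_CATEGORIES.get(b as usize).copied()
}
```
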
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 2.3.1 | Implement CLOCK cache | Fixed-size cache with second-chance eviction. O(1) amortized insert/lookup. Thread-safe with parking_lot::RwLock. Better than LRU for workloads with scans. | 1.1.7 | `pub struct ClockCache<K: Hash + Eq, V> { slots: Vec<Slot<K, V>>, hand: AtomicUsize, capacity: usize, hasher: ahash::RandomState, } struct Slot<K, V> { key: Option<K>, value: Option<V>, referenced: AtomicBool, } impl<K: Hash + Eq + Clone, V: Clone> ClockCache<K, V> { pub fn get(&self, key: &K) -> Option<V> { let hash = self.hasher.hash_one(key); let idx = (hash as usize) % self.capacity; /* linear probe for key */ for i in 0..8 { let slot_idx = (idx + i) % self.capacity; if self.slots[slot_idx].key.as_ref() == Some(key) { self.slots[slot_idx].referenced.store(true, Relaxed); return self.slots[slot_idx].value.clone(); } } None } }` | [x] |
| 2.3.2 | Implement sharded cache | Shard cache into N segments (default: num_cpus). Each shard independently locked. Reduces contention for concurrent access. | 2.3.1 | `pub struct ShardedCache<K, V> { shards: Vec<ClockCache<K, V>>, hasher: ahash::RandomState, } impl<K: Hash + Eq + Clone, V: Clone> ShardedCache<K, V> { pub fn get(&self, key: &K) -> Option<V> { let shard = self.shard_for(key); shard.get(key) } fn shard_for(&self, key: &K) -> &ClockCache<K, V> { /* reuse one stored hasher: a fresh RandomState per call is seeded differently and would map the same key to different shards */ let hash = self.hasher.hash_one(key); &self.shards[(hash as usize) % self.shards.len()] } }` | [x] |
| 2.3.3 | Create multi-level cache | L1: Per-word token IDs (10K entries). L2: Subword lookups (50K entries). L3: Unicode composition (thread-local, 128). L4: Unicode category (thread-local, 128). | 2.3.2 | Each level has different key/value types. L1 caches full tokenization result. L2 caches individual subword matches. | [x] |
| 2.3.4 | Add cache statistics | Track: hits, misses, insertions, evictions, hit_rate. Atomic counters for thread safety. Export to Prometheus format. | 2.3.1 | `pub struct CacheStats { hits: AtomicU64, misses: AtomicU64, insertions: AtomicU64, evictions: AtomicU64, } impl CacheStats { pub fn hit_rate(&self) -> f64 { let hits = self.hits.load(Relaxed); let total = hits + self.misses.load(Relaxed); if total == 0 { 0.0 } else { hits as f64 / total as f64 } } }` | [x] |
| 2.3.5 | Implement cache warmup | Pre-populate cache with frequent tokens from vocabulary. Use frequency information if available. Support async warmup on startup. | 2.3.3 | Load top 1000 most frequent words, tokenize them, cache results. | [x] |
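
Second-chance (CLOCK) eviction from 2.3.1 is easiest to see in a single-threaded form. The sketch below is an assumption-laden simplification: it uses a `std::collections::HashMap` index instead of linear probing, plain `bool` reference bits instead of atomics, and takes `&mut self` rather than locking.

```rust
use std::collections::HashMap;
use std::hash::Hash;

struct Slot<K, V> {
    key: K,
    value: V,
    referenced: bool,
}

pub struct ClockCache<K, V> {
    slots: Vec<Slot<K, V>>,
    index: HashMap<K, usize>, // key -> slot position
    hand: usize,              // next slot the clock hand will inspect
    capacity: usize,
}

impl<K: Hash + Eq + Clone, V: Clone> ClockCache<K, V> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { slots: Vec::with_capacity(capacity), index: HashMap::new(), hand: 0, capacity }
    }

    /// A hit sets the reference bit, which grants the entry a second chance.
    pub fn get(&mut self, key: &K) -> Option<V> {
        let &i = self.index.get(key)?;
        self.slots[i].referenced = true;
        Some(self.slots[i].value.clone())
    }

    pub fn insert(&mut self, key: K, value: V) {
        if let Some(&i) = self.index.get(&key) {
            self.slots[i].value = value;
            self.slots[i].referenced = true;
            return;
        }
        if self.slots.len() < self.capacity {
            self.index.insert(key.clone(), self.slots.len());
            self.slots.push(Slot { key, value, referenced: false });
            return;
        }
        // Sweep: clear reference bits until an unreferenced victim is found.
        loop {
            let i = self.hand;
            self.hand = (self.hand + 1) % self.capacity;
            if self.slots[i].referenced {
                self.slots[i].referenced = false;
            } else {
                self.index.remove(&self.slots[i].key);
                self.index.insert(key.clone(), i);
                self.slots[i] = Slot { key, value, referenced: false };
                return;
            }
        }
    }
}
```

The sweep always terminates because every pass over a referenced slot clears its bit, so after at most one full rotation some slot is evictable.
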
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 2.4.1 | Implement tokenizer.json parser | Parse HuggingFace tokenizer.json format. Extract: model type, vocab, merges, special tokens, normalizer config, pre-tokenizer config. Use serde_json. | 1.1.7 | Define structs matching HF schema: TokenizerConfig, Model, Vocab, Merges, AddedToken, Normalizer, PreTokenizer. Handle all variants. | [x] |
| 2.4.2 | Create vocabulary structures | Vocab: bidirectional map token↔id. Use `AHashMap<String, u32>` and `Vec<String>` for O(1) both directions. Memory-map large vocabularies. | 2.4.1 | `pub struct Vocab { token_to_id: AHashMap<String, u32>, id_to_token: Vec<String>, } impl Vocab { pub fn get_id(&self, token: &str) -> Option<u32> { self.token_to_id.get(token).copied() } pub fn get_token(&self, id: u32) -> Option<&str> { self.id_to_token.get(id as usize).map(\|s\| s.as_str()) } }` | |
| 2.4.3 | Implement Trie data structure | Byte-indexed trie for prefix matching. 256-ary nodes for O(1) child lookup. Store token IDs at leaf/intermediate nodes. Support common prefix iteration. | 2.4.2 | `pub struct Trie { nodes: Vec<TrieNode>, } struct TrieNode { children: [u32; 256], /* 0 = no child, >0 = node index */ token_id: Option<u32>, /* Some if this is end of token */ } impl Trie { pub fn insert(&mut self, token: &[u8], id: u32) { ... } pub fn get(&self, token: &[u8]) -> Option<u32> { ... } pub fn common_prefix_search(&self, text: &[u8]) -> impl Iterator<Item = (usize, u32)> { ... } }` | [x] |
| 2.4.4 | Implement cache-oblivious Trie | Reorder trie nodes using van Emde Boas layout for better cache performance. Parent and children in same cache line when possible. | 2.4.3 | BFS order for top levels, then recursive for subtrees. Measure cache miss rate with perf. | [x] |
| 2.4.5 | Build Aho-Corasick automaton | For special token matching. Use aho-corasick crate. Configure LeftmostLongest match semantics. Pre-build from added tokens with normalized=false. | 2.4.1 | `use aho_corasick::{AhoCorasick, AhoCorasickBuilder, MatchKind}; let special_tokens: Vec<&str> = added_tokens.iter() .filter(\|t\| …` | |
| 2.4.6 | Implement Double-Array Trie | Alternative trie with better cache locality. Two arrays: base[] and check[]. O(1) transitions. More compact than 256-ary. | 2.4.2 | Double-array trie: base[s] + c = t and check[t] = s for transition s→t on character c. Build using optimal base value search. | [x] |
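
The elided method bodies of the 256-ary trie in 2.4.3 might be filled in along these lines. This is a sketch, not the project's code: `common_prefix_search` returns a `Vec` here instead of an iterator to keep the example short.

```rust
pub struct Trie {
    nodes: Vec<TrieNode>,
}

struct TrieNode {
    children: [u32; 256],   // 0 = no child (node 0 is the root and is never a child)
    token_id: Option<u32>,  // Some if a token ends at this node
}

impl TrieNode {
    fn new() -> Self {
        Self { children: [0; 256], token_id: None }
    }
}

impl Trie {
    pub fn new() -> Self {
        Self { nodes: vec![TrieNode::new()] }
    }

    pub fn insert(&mut self, token: &[u8], id: u32) {
        let mut node = 0usize;
        for &b in token {
            let next = self.nodes[node].children[b as usize];
            node = if next == 0 {
                // Allocate a new node and link it from the current one.
                let new_index = self.nodes.len();
                self.nodes.push(TrieNode::new());
                self.nodes[node].children[b as usize] = new_index as u32;
                new_index
            } else {
                next as usize
            };
        }
        self.nodes[node].token_id = Some(id);
    }

    pub fn get(&self, token: &[u8]) -> Option<u32> {
        let mut node = 0usize;
        for &b in token {
            match self.nodes[node].children[b as usize] {
                0 => return None,
                next => node = next as usize,
            }
        }
        self.nodes[node].token_id
    }

    /// Every stored token that is a prefix of `text`, as (byte length, id), shortest first.
    pub fn common_prefix_search(&self, text: &[u8]) -> Vec<(usize, u32)> {
        let mut matches = Vec::new();
        let mut node = 0usize;
        for (i, &b) in text.iter().enumerate() {
            match self.nodes[node].children[b as usize] {
                0 => break,
                next => node = next as usize,
            }
            if let Some(id) = self.nodes[node].token_id {
                matches.push((i + 1, id));
            }
        }
        matches
    }
}
```

Each node costs 1 KiB for its child array, which is the memory overhead the double-array variant in 2.4.6 is meant to reduce.
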
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 2.5.1 | Implement arena allocator | Per-batch arena using bumpalo. Allocate all temporary data from arena. Reset arena between batches. Near-zero allocation overhead. | 1.1.6 | `thread_local! { static ARENA: RefCell<Bump> = RefCell::new(Bump::with_capacity(1024 * 1024)); } pub fn with_arena<T>(f: impl FnOnce(&Bump) -> T) -> T { ARENA.with(\|arena\| …` | |
| 2.5.2 | Implement string interner | Deduplicate repeated strings in vocabulary. Single storage for each unique string. Return interned &str references. | 2.4.2 | Use string-interner crate or custom implementation with `HashSet<Box<str>>`. | [x] |
| 2.5.3 | Add memory pool for encodings | Pre-allocate Encoding structs. Reuse across requests. Avoid repeated allocation of vectors. | 2.5.1 | Pool of `Vec<u32>` for token IDs, reuse with clear() instead of new allocation. | [x] |
| 2.5.4 | Implement Cow-based strings | Use `Cow<str>` throughout to avoid allocations. Borrowed for unchanged strings, Owned only when modified. | 1.1.7 | `pub fn normalize<'a>(&self, input: &'a str) -> Cow<'a, str> { if is_ascii_fast(input) { Cow::Borrowed(input) } else { Cow::Owned(self.normalize_unicode(input)) } }` | [x] |
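
The buffer pool from 2.5.3 can be illustrated with a small free list. `IdBufferPool`, `acquire`, `release`, and `max_pooled` are hypothetical names chosen for this sketch; the source only states that `Vec<u32>` buffers are reused via `clear()`.

```rust
/// Free list of token-ID buffers. Released buffers keep their capacity,
/// so a later `acquire` usually avoids a fresh heap allocation.
pub struct IdBufferPool {
    free: Vec<Vec<u32>>,
    max_pooled: usize, // upper bound on retained buffers, to cap idle memory
}

impl IdBufferPool {
    pub fn new(max_pooled: usize) -> Self {
        Self { free: Vec::new(), max_pooled }
    }

    /// Hands out a pooled buffer, or a new empty one if the pool is empty.
    pub fn acquire(&mut self) -> Vec<u32> {
        self.free.pop().unwrap_or_default()
    }

    /// Clears the buffer (length 0, capacity kept) and returns it to the pool.
    pub fn release(&mut self, mut buf: Vec<u32>) {
        buf.clear();
        if self.free.len() < self.max_pooled {
            self.free.push(buf);
        }
    }

    pub fn pooled(&self) -> usize {
        self.free.len()
    }
}
```
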
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 3.1.1 | Implement core WordPiece algorithm | Greedy longest-match from left to right. Try longest substring first, shrink until vocab match. Handle continuation prefix (##). Return [UNK] if no match. | 2.4.2 | `pub fn tokenize_word(&self, word: &str) -> Vec<u32> { if let Some(id) = self.vocab.get_id(word) { return vec![id]; /* whole-word match */ } let mut tokens = Vec::with_capacity(word.len() / 3); let mut start = 0; while start < word.len() { let mut end = word.len(); let mut found = false; while start < end { let substr = &word[start..end]; let lookup = if start > 0 { format!("{}{}", self.prefix, substr) } else { substr.to_string() }; if let Some(id) = self.vocab.get_id(&lookup) { tokens.push(id); found = true; break; } /* shrink by one char (UTF-8 aware) */ end = word[..end].char_indices().last() .map(\|(i, _)\| …` | |
| 3.1.2 | Add byte-length optimization | If byte_len <= max_chars, skip char counting (majority case). Only count chars if byte length exceeds limit. | 3.1.1 | `if word.len() <= self.max_input_chars_per_word { /* safe to proceed: char count <= byte length <= limit */ } else if word.chars().count() > self.max_input_chars_per_word { return vec![self.unk_id]; }` | [x] |
| 3.1.3 | Implement BERT normalizer | Clean text, handle Chinese chars, strip accents, lowercase. Configurable via options. | 2.1.7 | clean_text: remove control chars (0x00-0x1F except whitespace), replace \t\n\r with space. handle_chinese_chars: add space around CJK. strip_accents: NFD + filter Mn category. lowercase: to_lowercase(). | [x] |
| 3.1.4 | Implement BERT pre-tokenizer | Split on whitespace, isolate punctuation. Handle degree symbols and special punctuation correctly. | 2.2.5 | Split on is_whitespace(), then for each word, isolate is_punctuation() characters as separate tokens. | [x] |
| 3.1.5 | Implement BERT post-processor | Insert [CLS] at start, [SEP] at end. For pairs: [CLS] A [SEP] B [SEP]. Generate type_ids (0 for first, 1 for second). | 3.1.1 | `pub fn post_process(&self, encoding: Encoding, pair: Option<Encoding>) -> Encoding { let mut ids = vec![self.cls_id]; ids.extend(encoding.ids); ids.push(self.sep_id); let mut type_ids = vec![0; ids.len()]; if let Some(pair) = pair { let pair_start = ids.len(); ids.extend(pair.ids); ids.push(self.sep_id); type_ids.extend(vec![1; ids.len() - pair_start]); } /* ... build full Encoding */ }` | [x] |
| 3.1.6 | Add caching integration | Check cache before tokenization. Cache results for cache-worthy strings (<256 chars). Use word as cache key. | 2.3.3, 3.1.1 | `pub fn tokenize_cached(&self, word: &str) -> Vec<u32> { if word.len() <= 256 { if let Some(cached) = self.cache.get(word) { return cached; } } let result = self.tokenize_word(word); if word.len() <= 256 { self.cache.insert(word.to_string(), result.clone()); } result }` | [x] |
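
Greedy longest-match WordPiece (3.1.1) can be sketched end to end with a plain `HashMap` vocabulary. The `WordPiece` struct, its fields, and `WordPiece::new` are assumptions for this example; the project's `Vocab` type and whole-word shortcut are not reproduced.

```rust
use std::collections::HashMap;

pub struct WordPiece {
    vocab: HashMap<String, u32>,
    unk_id: u32,
    prefix: String, // continuation prefix, "##" for BERT-style vocabularies
}

impl WordPiece {
    pub fn new(vocab: HashMap<String, u32>, unk_id: u32, prefix: &str) -> Self {
        Self { vocab, unk_id, prefix: prefix.to_string() }
    }

    /// Repeatedly takes the longest vocabulary entry that matches at `start`.
    /// If any position has no match, the whole word becomes a single [UNK].
    pub fn tokenize_word(&self, word: &str) -> Vec<u32> {
        let mut tokens = Vec::new();
        let mut candidate = String::new(); // reused lookup buffer
        let mut start = 0;
        while start < word.len() {
            let mut end = word.len();
            let mut found = None;
            while start < end {
                candidate.clear();
                if start > 0 {
                    candidate.push_str(&self.prefix);
                }
                candidate.push_str(&word[start..end]);
                if let Some(&id) = self.vocab.get(candidate.as_str()) {
                    found = Some(id);
                    break;
                }
                // Shrink by one character, staying on a UTF-8 boundary.
                end = word[..end]
                    .char_indices()
                    .last()
                    .map(|(i, _)| i)
                    .unwrap_or(start);
            }
            match found {
                Some(id) => {
                    tokens.push(id);
                    start = end;
                }
                None => return vec![self.unk_id],
            }
        }
        tokens
    }
}
```

Reusing one `String` for lookups avoids the per-candidate `format!` allocation, which matters because the inner loop can run once per character.
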
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 3.2.1 | Implement Viterbi decoder | Dynamic programming to find optimal segmentation. Track best score and backpointer at each position. Use Trie for efficient prefix enumeration. | 2.4.3 | `pub fn encode_viterbi(&self, text: &str) -> Vec<u32> { let bytes = text.as_bytes(); let n = bytes.len(); let mut best_score = vec![f64::NEG_INFINITY; n + 1]; let mut best_prev = vec![0usize; n + 1]; let mut best_token = vec![0u32; n + 1]; best_score[0] = 0.0; for i in 0..n { if best_score[i] == f64::NEG_INFINITY { continue; } for (len, token_id) in self.trie.common_prefix_search(&bytes[i..]) { let j = i + len; let score = best_score[i] + self.scores[token_id as usize]; if score > best_score[j] { best_score[j] = score; best_prev[j] = i; best_token[j] = token_id; } } } /* backtrack */ let mut tokens = Vec::new(); let mut pos = n; while pos > 0 { tokens.push(best_token[pos]); pos = best_prev[pos]; } tokens.reverse(); tokens }` | [x] |
| 3.2.2 | Add byte fallback | When no token matches, fall back to single byte token `<0xNN>`. Ensure all inputs are tokenizable. Handle partial UTF-8. | 3.2.1 | For each position without valid token, use `<0x{:02X}>` format byte token. Pre-compute byte token IDs 0-255. | [x] |
| 3.2.3 | Implement Metaspace pre-tokenizer | Replace spaces with ▁ (U+2581). Add ▁ at start of words. Handle add_prefix_space option. | 2.1.7 | `pub fn pre_tokenize(&self, text: &str) -> String { let mut result = String::with_capacity(text.len() + text.matches(' ').count()); if self.add_prefix_space && !text.starts_with(' ') { result.push('▁'); } for ch in text.chars() { if ch == ' ' { result.push('▁'); } else { result.push(ch); } } result }` | [x] |
| 3.2.4 | Implement N-best decoding | A* search for top-N segmentations. Use priority queue ordered by score. Limit agenda size to prevent memory explosion. | 3.2.1 | `pub fn encode_nbest(&self, text: &str, n: usize) -> Vec<Vec<u32>> { let mut agenda: BinaryHeap<Hypothesis> = BinaryHeap::new(); /* ... A* search with hypothesis expansion; return top n complete hypotheses */ }` | [x] |
| 3.2.5 | Implement stochastic sampling | Forward-filtering backward-sampling. Compute alpha (log-sum-exp) scores. Sample from distribution. Use temperature parameter. | 3.2.1 | `pub fn sample(&self, text: &str, temperature: f64) -> Vec<u32> { let alpha = self.forward_pass(text); /* log-sum-exp to each position */ self.backward_sample(&alpha, temperature) }` | [x] |
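
The Viterbi recurrence from 3.2.1 can be demonstrated without the trie. The sketch below assumes a `HashMap` from token text to `(id, log_prob)` and a hypothetical `max_token_bytes` bound, and it returns `None` when no segmentation exists instead of applying byte fallback.

```rust
use std::collections::HashMap;

/// Finds the segmentation of `text` with the highest total log-probability.
/// `vocab` maps token text to (token id, log probability).
pub fn encode_viterbi(
    text: &str,
    vocab: &HashMap<&str, (u32, f64)>,
    max_token_bytes: usize,
) -> Option<Vec<u32>> {
    let n = text.len();
    let mut best_score = vec![f64::NEG_INFINITY; n + 1];
    let mut best_prev = vec![0usize; n + 1];
    let mut best_token = vec![0u32; n + 1];
    best_score[0] = 0.0;

    for i in 0..n {
        // Unreachable positions (including non-boundary bytes) are skipped.
        if best_score[i] == f64::NEG_INFINITY {
            continue;
        }
        let limit = n.min(i + max_token_bytes);
        for j in (i + 1)..=limit {
            if !text.is_char_boundary(j) {
                continue;
            }
            if let Some(&(id, log_prob)) = vocab.get(&text[i..j]) {
                let score = best_score[i] + log_prob;
                if score > best_score[j] {
                    best_score[j] = score;
                    best_prev[j] = i;
                    best_token[j] = id;
                }
            }
        }
    }

    if best_score[n] == f64::NEG_INFINITY {
        return None; // no sequence of vocabulary tokens covers the text
    }
    // Follow the backpointers from the end, then reverse.
    let mut tokens = Vec::new();
    let mut pos = n;
    while pos > 0 {
        tokens.push(best_token[pos]);
        pos = best_prev[pos];
    }
    tokens.reverse();
    Some(tokens)
}
```

Checking that the final position is reachable before backtracking avoids emitting a meaningless token when the text cannot be segmented.
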
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 3.3.1 | Build Aho-Corasick automaton for BPE | Create AC automaton from all vocabulary tokens. Used for efficient suffix enumeration in O(n) algorithm. | 2.4.5 | Build from all vocab tokens, not just special tokens. | [x] |
| 3.3.2 | Build compatibility table | For each token pair (a, b), check if merge exists. Store merged token if compatible. HashMap for O(1) lookup. | 2.4.1 | `pub struct CompatibilityTable { table: AHashMap<(u32, u32), u32>, } impl CompatibilityTable { /* returns None if a merge references a token missing from the vocabulary */ pub fn from_merges(merges: &[(String, String)], vocab: &Vocab) -> Option<Self> { let mut table = AHashMap::new(); for (a, b) in merges { let a_id = vocab.get_id(a)?; let b_id = vocab.get_id(b)?; let merged = format!("{}{}", a, b); let merged_id = vocab.get_id(&merged)?; table.insert((a_id, b_id), merged_id); } Some(Self { table }) } pub fn get(&self, a: u32, b: u32) -> Option<u32> { self.table.get(&(a, b)).copied() } }` | [x] |
| 3.3.3 | Implement O(n) BPE encoder | Use DP with AC automaton. Track (token_count, last_token) at each position. Enumerate suffixes, check compatibility, update DP. | 3.3.1, 3.3.2 | `pub fn encode_linear(&self, text: &[u8]) -> Vec<u32> { let n = text.len(); let mut dp: Vec<Option<(usize, u32)>> = vec![None; n + 1]; dp[0] = Some((0, u32::MAX)); /* (count, last_token) */ for i in 0..n { let Some((count, last_token)) = dp[i] else { continue }; for mat in self.ac.find_overlapping(&text[i..]) { let token_id = mat.pattern().as_u32(); let end = i + mat.end(); /* check compatibility */ let compatible = last_token == u32::MAX \|\| …` | |
| 3.3.4 | Implement byte-level BPE | Map bytes 0-255 to printable Unicode (GPT-2 style). Process bytes instead of chars. | 3.3.3 | `lazy_static! { static ref BYTE_TO_CHAR: [char; 256] = { let mut arr = ['\0'; 256]; let mut n = 0u32; for i in 0..=255u8 { arr[i as usize] = match i { b'!'..=b'~' \| 0xA1..=0xAC \| …` | |
| 3.3.5 | Add dropout support | Randomly skip merges during training/augmentation. Use probability parameter. For inference, disable. | 3.3.3 | Add `dropout: Option<f32>` parameter. If Some, skip merge with given probability using rand::random(). | [x] |
| 3.3.6 | Implement fallback O(n log n) algorithm | Traditional heap-based merge for comparison/validation. OctonaryHeap for efficiency. | 2.4.2 | For short sequences or validation, use merge-based algorithm. Compare results with O(n) for correctness. | [x] |
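
The GPT-2 style byte-to-character mapping from 3.3.4 follows a fixed rule: printable bytes map to the code point with the same value, and every other byte is assigned the next free code point starting at U+0100. A standalone sketch, with `byte_to_char_table` as a hypothetical function name in place of the `lazy_static` table:

```rust
/// Builds the 256-entry byte -> char table used by byte-level BPE.
pub fn byte_to_char_table() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut next_free = 0u32; // offset of the next unused code point above U+00FF
    // Note the inclusive range: `0..256u8` would not compile because 256 overflows u8.
    for b in 0..=255u8 {
        let printable = matches!(b, b'!'..=b'~' | 0xA1..=0xAC | 0xAE..=0xFF);
        table[b as usize] = if printable {
            char::from(b)
        } else {
            let c = char::from_u32(256 + next_free).expect("code point is valid");
            next_free += 1;
            c
        };
    }
    table
}
```

With this table the space byte becomes `Ġ` (U+0120) and the newline byte becomes `Ċ` (U+010A), the two symbols that are familiar from GPT-2 vocabulary files.
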
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 3.4.1 | Define Tokenizer trait | Core trait for all tokenizers. Methods: encode, encode_batch, decode, decode_batch, vocab_size, token_to_id, id_to_token. | 1.1.7 | `pub trait Tokenizer: Send + Sync { fn encode(&self, text: &str, add_special_tokens: bool) -> Result<Encoding>; fn encode_batch(&self, texts: &[&str], add_special_tokens: bool) -> Result<Vec<Encoding>>; fn decode(&self, ids: &[u32], skip_special_tokens: bool) -> Result<String>; fn vocab_size(&self) -> usize; fn token_to_id(&self, token: &str) -> Option<u32>; fn id_to_token(&self, id: u32) -> Option<&str>; }` | [x] |
| 3.4.2 | Implement Encoding struct | Output of tokenization. Fields: ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing. | 3.4.1 | `pub struct Encoding { pub ids: Vec<u32>, pub type_ids: Vec<u32>, pub tokens: Vec<String>, pub offsets: Vec<(usize, usize)>, pub attention_mask: Vec<u32>, pub special_tokens_mask: Vec<u32>, pub overflowing: Vec<Encoding>, }` | [x] |
| 3.4.3 | Implement auto-detection | Detect tokenizer type from tokenizer.json. Look for model.type field. Return appropriate implementation. | 2.4.1 | `pub fn from_file(path: &Path) -> Result<Box<dyn Tokenizer>> { let config: TokenizerConfig = serde_json::from_reader(File::open(path)?)?; match config.model.type_.as_str() { "WordPiece" => Ok(Box::new(WordPieceTokenizer::from_config(config)?)), "BPE" => Ok(Box::new(BpeTokenizer::from_config(config)?)), "Unigram" => Ok(Box::new(UnigramTokenizer::from_config(config)?)), _ => Err(Error::UnsupportedModel), } }` | [x] |
| 3.4.4 | Implement truncation | Strategies: LongestFirst, OnlyFirst, OnlySecond. Handle max_length. Generate overflow for stride. | 3.4.2 | `pub fn truncate(&mut self, max_length: usize, stride: usize, strategy: TruncationStrategy) { if self.ids.len() <= max_length { return; } let overflow_len = self.ids.len() - max_length; /* ... split and store in overflowing */ }` | [x] |
| 3.4.5 | Implement padding | Strategies: BatchLongest, Fixed(usize). Direction: Left, Right. Pad to multiple_of if specified. | 3.4.2 | `pub fn pad(&mut self, target_length: usize, pad_id: u32, direction: PaddingDirection) { while self.ids.len() < target_length { match direction { Left => { self.ids.insert(0, pad_id); self.attention_mask.insert(0, 0); } Right => { self.ids.push(pad_id); self.attention_mask.push(0); } } } }` | [x] |
| 3.4.6 | Implement batch encoding | Process multiple texts efficiently. Use rayon for parallelism when batch > threshold. Apply padding uniformly. | 3.4.1, 3.4.5 | `fn encode_batch(&self, texts: &[&str], add_special_tokens: bool) -> Result<Vec<Encoding>> { let encodings: Vec<_> = if texts.len() > 8 { texts.par_iter().map(\|t\| …` | |
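
Uniform batch padding (3.4.5 and 3.4.6) with the BatchLongest strategy and an optional `multiple_of` can be sketched as follows. The two-field `Encoding` and the `pad_batch` function are simplifications assumed for this example; the sketch also assumes `ids` and `attention_mask` have equal length in every encoding.

```rust
#[derive(Clone, Copy)]
pub enum PaddingDirection {
    Left,
    Right,
}

/// Reduced to the two fields that padding touches in this sketch.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Encoding {
    pub ids: Vec<u32>,
    pub attention_mask: Vec<u32>,
}

/// Pads every encoding to the longest length in the batch,
/// rounded up to `multiple_of` when one is given.
pub fn pad_batch(
    batch: &mut [Encoding],
    pad_id: u32,
    direction: PaddingDirection,
    multiple_of: Option<usize>,
) {
    let longest = batch.iter().map(|e| e.ids.len()).max().unwrap_or(0);
    let target = match multiple_of {
        Some(m) if m > 0 => (longest + m - 1) / m * m,
        _ => longest,
    };
    for enc in batch.iter_mut() {
        let missing = target - enc.ids.len();
        match direction {
            PaddingDirection::Right => {
                enc.ids.resize(target, pad_id);
                enc.attention_mask.resize(target, 0);
            }
            PaddingDirection::Left => {
                // Build the padded prefix once instead of inserting at index 0 repeatedly.
                let mut ids = vec![pad_id; missing];
                ids.append(&mut enc.ids);
                enc.ids = ids;
                let mut mask = vec![0u32; missing];
                mask.append(&mut enc.attention_mask);
                enc.attention_mask = mask;
            }
        }
    }
}
```

Left padding builds the prefix in one allocation, which keeps the cost linear in the target length rather than quadratic as with repeated `insert(0, ..)`.
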
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 4.1.1 | Implement CPU feature detection | Detect SIMD capabilities at runtime. Cache results in static. Support x86_64 (AVX-512, AVX2, SSE4.2) and aarch64 (NEON, SVE). | 1.1.4 | `pub struct CpuFeatures { pub avx512f: bool, pub avx512bw: bool, pub avx2: bool, pub sse42: bool, pub neon: bool, pub sve: bool, } lazy_static! { pub static ref CPU: CpuFeatures = CpuFeatures { #[cfg(target_arch = "x86_64")] avx512f: is_x86_feature_detected!("avx512f"), #[cfg(target_arch = "x86_64")] avx2: is_x86_feature_detected!("avx2"), /* ... */ }; }` | [x] |
| 4.1.2 | Create dispatch macro | Macro to generate function variants for each SIMD level. Auto-select at runtime. | 4.1.1 | Identifier concatenation uses the `paste` crate, since `macro_rules!` has no `##` operator: `macro_rules! simd_dispatch { ($name:ident, $fn:ident, $($arg:ident: $ty:ty),*) => { paste::paste! { pub fn $name($($arg: $ty),*) { #[cfg(target_arch = "x86_64")] { if CPU.avx512f { return unsafe { [<$fn _avx512>]($($arg),*) }; } if CPU.avx2 { return unsafe { [<$fn _avx2>]($($arg),*) }; } } [<$fn _scalar>]($($arg),*) } } }; }` | [x] |
| 4.1.3 | Define SimdBackend trait | Trait for SIMD implementations. Methods for each accelerated operation. Implementations for each SIMD level. | 4.1.1 | `pub trait SimdBackend: Send + Sync { fn classify_chars(&self, input: &[u8], output: &mut [u8]); fn find_delimiters(&self, input: &[u8]) -> Vec<usize>; fn validate_utf8(&self, input: &[u8]) -> Result<(), usize>; /* ... */ }` | [x] |
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 4.2.1 | Implement AVX-512 char classification | Process 64 bytes per iteration. Use _mm512_cmpeq_epi8_mask for comparisons. Return 64-bit mask. | 4.1.3 | `#[target_feature(enable = "avx512f", enable = "avx512bw")] unsafe fn classify_whitespace_avx512(input: &[u8]) -> u64 { let chunk = _mm512_loadu_si512(input.as_ptr() as *const _); let space = _mm512_set1_epi8(b' ' as i8); let tab = _mm512_set1_epi8(b'\t' as i8); let newline = _mm512_set1_epi8(b'\n' as i8); let space_mask = _mm512_cmpeq_epi8_mask(chunk, space); let tab_mask = _mm512_cmpeq_epi8_mask(chunk, tab); let nl_mask = _mm512_cmpeq_epi8_mask(chunk, newline); space_mask \| tab_mask \| nl_mask }` | |
| 4.2.2 | Implement AVX-512 pre-tokenization | Find word boundaries in 64-byte chunks. Extract positions from bitmask. | 4.2.1 | `pub fn pretokenize_avx512(input: &[u8]) -> Vec<usize> { let mut boundaries = Vec::new(); let mut i = 0; while i + 64 <= input.len() { let mut mask = unsafe { classify_whitespace_avx512(&input[i..]) }; /* extract positions from mask */ while mask != 0 { let pos = mask.trailing_zeros() as usize; boundaries.push(i + pos); mask &= mask - 1; /* clear lowest set bit */ } i += 64; } /* handle remainder with scalar */ boundaries }` | [x] |
| 4.2.3 | Implement AVX-512 UTF-8 validation | Use simdutf algorithm. Validate continuation byte patterns. 64 bytes per iteration. | 4.1.3 | Port algorithm from simdutf paper. Check: ASCII (high bit 0), 2-byte (110xxxxx 10xxxxxx), 3-byte, 4-byte sequences. | [x] |
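
The bitmask-to-positions step from 4.2.2 does not depend on AVX-512 itself, so it can be shown with a portable stand-in for the classifier. In the sketch below, `whitespace_mask_scalar` is a hypothetical scalar replacement for `classify_whitespace_avx512`; it also handles the final partial chunk, which the SIMD version leaves to a scalar tail.

```rust
/// Scalar stand-in for the 64-byte SIMD classifier: bit i is set when
/// chunk[i] is a space, tab, or newline. Accepts chunks of up to 64 bytes.
fn whitespace_mask_scalar(chunk: &[u8]) -> u64 {
    debug_assert!(chunk.len() <= 64);
    let mut mask = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if matches!(b, b' ' | b'\t' | b'\n') {
            mask |= 1u64 << i;
        }
    }
    mask
}

/// Returns the byte offsets of all whitespace characters in `input`.
pub fn whitespace_positions(input: &[u8]) -> Vec<usize> {
    let mut positions = Vec::new();
    for (chunk_index, chunk) in input.chunks(64).enumerate() {
        let base = chunk_index * 64;
        let mut mask = whitespace_mask_scalar(chunk);
        while mask != 0 {
            // The lowest set bit gives the next position within the chunk.
            positions.push(base + mask.trailing_zeros() as usize);
            mask &= mask - 1; // clear the lowest set bit
        }
    }
    positions
}
```

The extraction loop runs once per set bit rather than once per byte, so sparse masks are cheap to decode.
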
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 4.3.1 | Implement AVX2 char classification | Process 32 bytes per iteration. Use _mm256_cmpeq_epi8 and _mm256_movemask_epi8. | 4.1.3 | `#[target_feature(enable = "avx2")] unsafe fn classify_whitespace_avx2(input: &[u8]) -> u32 { let chunk = _mm256_loadu_si256(input.as_ptr() as *const _); let space = _mm256_set1_epi8(b' ' as i8); let cmp = _mm256_cmpeq_epi8(chunk, space); _mm256_movemask_epi8(cmp) as u32 }` | [x] |
| 4.3.2 | Implement Teddy algorithm | SIMD fingerprinting for special token matching. Group tokens by first 2-3 bytes. | 2.4.5 | Use _mm256_shuffle_epi8 for fingerprint lookup. Match against buckets. Verify candidates. | [x] |
| 4.3.3 | Implement AVX2 hash computation | Parallel hash computation for vocabulary lookup. Use AES instructions if available. | 2.4.2 | Compute 8 hashes in parallel using _mm256 operations. | [x] |
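
Runtime dispatch between the AVX2 classifier of 4.3.1 and a scalar fallback can be sketched with `std::arch` and `is_x86_feature_detected!`. The function names `space_mask`, `space_mask_avx2`, and `space_mask_scalar` are assumptions for this example, and taking a fixed `&[u8; 32]` guarantees the 32-byte load stays in bounds.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Portable reference: bit i of the result is set when chunk[i] is a space.
pub fn space_mask_scalar(chunk: &[u8; 32]) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in chunk.iter().enumerate() {
        mask |= ((b == b' ') as u32) << i;
    }
    mask
}

/// AVX2 version: one 32-byte compare followed by a movemask.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn space_mask_avx2(chunk: &[u8; 32]) -> u32 {
    let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
    let eq = _mm256_cmpeq_epi8(v, _mm256_set1_epi8(b' ' as i8));
    _mm256_movemask_epi8(eq) as u32
}

/// Picks the fastest available implementation at runtime.
pub fn space_mask(chunk: &[u8; 32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified on this CPU just above.
            return unsafe { space_mask_avx2(chunk) };
        }
    }
    space_mask_scalar(chunk)
}
```

Comparing the dispatched result against the scalar reference, as in a property test, is a cheap way to validate each SIMD variant.
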
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 4.4.1 | Implement SSE4.2 string operations | Use PCMPESTRI/PCMPESTRM for delimiter scanning. 16 bytes per iteration. | 4.1.3 | `#[target_feature(enable = "sse4.2")] unsafe fn find_delimiter_sse42(input: &[u8], delims: &[u8; 16]) -> Option<usize> { let haystack = _mm_loadu_si128(input.as_ptr() as *const _); let needles = _mm_loadu_si128(delims.as_ptr() as *const _); let idx = _mm_cmpestri( needles, delims.iter().take_while(\|&&b\| …` | |
| 4.4.2 | Implement SSE4.2 range checks | Use PCMPISTRM with ranges for Unicode detection. | 4.1.3 | Define ranges in 128-bit register. Use _SIDD_CMP_RANGES mode. | [x] |
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 4.5.1 | Implement NEON char classification | Process 16 bytes per iteration. Use vceqq_u8 for comparisons. | 4.1.3 | `#[cfg(target_arch = "aarch64")] #[target_feature(enable = "neon")] unsafe fn classify_whitespace_neon(input: &[u8]) -> u64 { let chunk = vld1q_u8(input.as_ptr()); let space = vdupq_n_u8(b' '); let cmp = vceqq_u8(chunk, space); /* narrow to a 64-bit mask holding 4 bits per input byte; truncating it to u16 would drop 12 of the 16 lanes */ let narrow = vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4); vget_lane_u64(vreinterpret_u64_u8(narrow), 0) }` | [x] |
| 4.5.2 | Implement NEON UTF-8 validation | ARM NEON version of simdutf algorithm. Use vtbl for table lookups. |
4.1.3 | Similar to AVX2 but with NEON intrinsics. | [x] |
| 4.5.3 | Add SVE/SVE2 support | Scalable vectors for newer ARM. Vector-length agnostic code. | 4.5.1 | Use sve intrinsics when available. Support 128-2048 bit vectors. |
[x] |
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 4.6.1 | Implement SWAR byte operations | Process 8 bytes as u64. Find zero bytes, specific bytes without SIMD. | 4.1.3 | rust #[inline] fn has_zero_byte(x: u64) -> bool { const LO: u64 = 0x0101_0101_0101_0101; const HI: u64 = 0x8080_8080_8080_8080; x.wrapping_sub(LO) & !x & HI != 0 } #[inline] fn has_byte(x: u64, byte: u8) -> bool { has_zero_byte(x ^ (0x0101_0101_0101_0101 * byte as u64)) } |
[x] |
| 4.6.2 | Implement branchless classification | No branches in hot path. Use arithmetic comparison. | 4.1.3 | ```rust #[inline] fn is_ascii_whitespace_branchless(b: u8) -> bool { /* space=32, tab=9, newline=10, cr=13 */ let is_space = (b == 32) as u8; let is_tab = (b == 9) as u8; let is_nl = (b == 10) as u8; let is_cr = (b == 13) as u8; (is_space \| is_tab \| is_nl \| is_cr) != 0 } ``` |
| 4.6.3 | Implement loop unrolling | Process 4-8 elements per iteration manually. | 4.1.3 | Use chunks_exact(4) and process all 4 without loop. |
[x] |
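
Tasks 4.6.2 and 4.6.3 combine naturally: a branchless per-byte test feeding a manually unrolled loop over `chunks_exact(4)`. The sketch below is one possible shape, not the project's implementation. `count_whitespace_unrolled` is a hypothetical helper name, and whether the compiler emits truly branch-free machine code still depends on the target and optimization level.

```rust
/// Branchless ASCII whitespace test: each comparison yields a bool and the
/// non-short-circuiting `|` combines them without an early exit.
#[inline]
pub fn is_ascii_whitespace_branchless(b: u8) -> bool {
    (b == b' ') | (b == b'\t') | (b == b'\n') | (b == b'\r')
}

/// Counts whitespace bytes, handling four bytes per loop iteration.
pub fn count_whitespace_unrolled(input: &[u8]) -> usize {
    let mut chunks = input.chunks_exact(4);
    let mut count = 0usize;
    for c in &mut chunks {
        // All four lanes are evaluated unconditionally.
        count += is_ascii_whitespace_branchless(c[0]) as usize
            + is_ascii_whitespace_branchless(c[1]) as usize
            + is_ascii_whitespace_branchless(c[2]) as usize
            + is_ascii_whitespace_branchless(c[3]) as usize;
    }
    // Scalar tail for the 0 to 3 bytes that did not fill a chunk.
    for &b in chunks.remainder() {
        count += is_ascii_whitespace_branchless(b) as usize;
    }
    count
}
```

The explicit tail loop matters for the "partial chunks, empty input" edge cases listed under 9.1.8.
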
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 5.1.1 | Add CubeCL dependencies | Add cubecl with CUDA, ROCm, WGPU backends. Configure feature flags. | 1.1.2 | toml [dependencies] cubecl = { version = "0.2", features = ["cuda", "wgpu"] } Note: the implementation uses cudarc instead of CubeCL. | [x] |
| 5.1.2 | Create GPU device abstraction | Detect available GPUs. Create device handles. Support multi-GPU. | 5.1.1 | rust pub struct GpuDevice { backend: GpuBackend, device_id: usize, memory: usize, } pub fn detect_gpus() -> Vec<GpuDevice> { ... } |
[x] |
| 5.1.3 | Implement memory management | Allocate device memory. Manage pinned host memory. Implement memory pool. | 5.1.2 | Pre-allocate pinned buffers for input/output. Reuse across batches. | [x] |
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 5.2.1 | Implement GPU vocabulary lookup | Upload vocab to GPU. Hash-based lookup. Each thread processes one word. | 5.1.3, 2.4.2 | FNV-1a hash with linear probing. PTX kernel compiled via nvrtc. | [x] |
| 5.2.2 | Implement GPU pre-tokenization | Parallel boundary detection. Stream compaction for positions. | 5.1.3 | Each thread checks one byte for whitespace. CPU fallback for complex cases. | [x] |
| 5.2.3 | Implement GPU WordPiece | Parallelize across words. Each thread tokenizes one word. | 5.2.1 | Uses VocabLookupKernel for subword tokenization with greedy longest match. | [x] |
| 5.2.4 | Implement async pipeline | Overlap CPU-GPU transfers with computation. Double buffering. | 5.1.3 | DoubleBuffer struct for async pipeline. Pinned buffers for fast transfers. | [x] |
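
Task 5.2.1 names FNV-1a hashing with linear probing for the GPU vocabulary table. A CPU-side reference of that scheme is handy for checking kernel output. The sketch below is such a reference under stated assumptions: `VocabTable` and its methods are hypothetical names, the 64-bit FNV variant is assumed, and the layout is not claimed to match the PTX kernel.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a: XOR the byte in, then multiply by the prime.
pub fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h = FNV_OFFSET;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Open-addressing table with linear probing; capacity must be a power of two.
pub struct VocabTable {
    slots: Vec<Option<(Vec<u8>, u32)>>,
    mask: usize,
}

impl VocabTable {
    pub fn with_capacity_pow2(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        VocabTable { slots: vec![None; capacity], mask: capacity - 1 }
    }

    /// Inserts or overwrites a token; returns false when the table is full.
    pub fn insert(&mut self, token: &[u8], id: u32) -> bool {
        let mut i = fnv1a(token) as usize & self.mask;
        for _ in 0..self.slots.len() {
            let taken_by_other = match &self.slots[i] {
                Some((k, _)) => k.as_slice() != token,
                None => false,
            };
            if !taken_by_other {
                self.slots[i] = Some((token.to_vec(), id));
                return true;
            }
            i = (i + 1) & self.mask; // probe the next slot
        }
        false
    }

    /// Looks a token up; an empty slot ends the probe sequence.
    pub fn get(&self, token: &[u8]) -> Option<u32> {
        let mut i = fnv1a(token) as usize & self.mask;
        for _ in 0..self.slots.len() {
            match &self.slots[i] {
                None => return None,
                Some((k, id)) if k.as_slice() == token => return Some(*id),
                Some(_) => i = (i + 1) & self.mask,
            }
        }
        None
    }
}
```

Power-of-two capacity lets the probe wrap with a mask instead of a modulo, which is also cheap to express in a GPU kernel.
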
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 5.3.1 | Create GpuTokenizer wrapper | Implement Tokenizer trait. Manage GPU resources. Handle fallback to CPU. | 5.2.3, 3.4.1 | GpuTokenizer with GpuBackend enum. Returns error for small batches (CPU fallback signal). | [x] |
| 5.3.2 | Implement batch size optimization | Profile throughput vs batch size. Auto-tune optimal batch size. | 5.3.1 | Run warmup batches at sizes [8, 16, 32, 64, 128, 256]. Select size with best throughput. | [x] |
| 5.3.3 | Add multi-GPU support | Distribute batches across GPUs. Load balance. | 5.3.1 | GpuLoadBalancer with RoundRobin, LeastLoaded, Sequential strategies. | [x] |
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 6.1.1 | Detect NUMA topology | Use libnuma or /sys/devices/system/node. Get nodes, CPUs per node, memory. | 1.1.7 | rust pub struct NumaTopology { nodes: Vec<NumaNode>, } pub struct NumaNode { id: usize, cpus: Vec<usize>, memory_mb: usize, } pub fn detect_numa() -> NumaTopology { ... } |
[ ] |
| 6.1.2 | Implement CPU affinity | Bind threads to specific CPUs. Use sched_setaffinity on Linux. | 6.1.1 | rust #[cfg(target_os = "linux")] pub fn set_affinity(cpus: &[usize]) -> Result<()> { use libc::{sched_setaffinity, cpu_set_t, CPU_SET, CPU_ZERO}; let mut set: cpu_set_t = unsafe { std::mem::zeroed() }; let rc = unsafe { CPU_ZERO(&mut set); for &cpu in cpus { CPU_SET(cpu, &mut set); } sched_setaffinity(0, std::mem::size_of_val(&set), &set) }; /* sched_setaffinity returns -1 on failure; surface errno instead of silently returning Ok */ if rc != 0 { return Err(std::io::Error::last_os_error().into()); } Ok(()) } | [ ] |
| 6.1.3 | Implement NUMA memory binding | Allocate memory on local NUMA node. Use mbind or numa_alloc_local. |
6.1.1 | ```rust pub fn numa_alloc_local(size: usize) -> *mut u8 { unsafe { libc::mmap(null_mut(), size, PROT_READ | PROT_WRITE, MAP_PRIVATE |
| 6.1.4 | Create per-NUMA Tokio runtime | Separate runtime per NUMA node. Pin runtime threads to node CPUs. | 6.1.2 | ```rust pub fn create_numa_runtimes(topology: &NumaTopology) -> Vec { topology.nodes.iter().map( | node |
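
For task 6.1.1, the per-node CPU sets under `/sys/devices/system/node/nodeN/cpulist` are stored as comma-separated ranges such as `0-3,8-11`. A small parser for that format might look like the sketch below. `parse_cpulist` is a hypothetical helper, and reading the file itself is left to the caller.

```rust
/// Parses a Linux cpulist string such as "0-3,8-11" into CPU indices.
/// Returns None when a part is not a number or a range is reversed.
pub fn parse_cpulist(s: &str) -> Option<Vec<usize>> {
    let mut cpus = Vec::new();
    for part in s.trim().split(',') {
        if part.is_empty() {
            continue; // tolerate an empty file or a trailing comma
        }
        let mut bounds = part.splitn(2, '-');
        let lo: usize = bounds.next()?.trim().parse().ok()?;
        let hi: usize = match bounds.next() {
            Some(h) => h.trim().parse().ok()?,
            None => lo, // single CPU, not a range
        };
        if hi < lo {
            return None;
        }
        cpus.extend(lo..=hi);
    }
    Some(cpus)
}
```

The resulting vector can populate `NumaNode::cpus` and later feed the affinity call from 6.1.2.
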
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 6.2.1 | Implement shared memory ring buffer | Memory-mapped ring buffer. Lock-free producer/consumer. Variable-size messages. | 2.5.1 | rust pub struct ShmRingBuffer { mmap: MmapMut, write_pos: AtomicU64, read_pos: AtomicU64, capacity: usize, } impl ShmRingBuffer { pub fn send(&self, data: &[u8]) -> Result<()> { let len = data.len() as u64; let pos = self.write_pos.fetch_add(len + 8, SeqCst); // Write length prefix, then data // Memory fence // Update readable position } } |
[ ] |
| 6.2.2 | Implement flume channels | High-performance bounded/unbounded channels. For same-process communication. | 1.1.2 | rust use flume::{bounded, Sender, Receiver}; pub struct TokenizerChannel { tx: Sender<TokenizeRequest>, rx: Receiver<TokenizeResponse>, } |
[ ] |
| 6.2.3 | Implement gRPC service | Define protobuf schema. Use tonic for Rust gRPC. Support streaming. | 1.1.2 | protobuf service BudTokenizer { rpc Tokenize(TokenizeRequest) returns (TokenizeResponse); rpc TokenizeBatch(stream TokenizeRequest) returns (stream TokenizeResponse); } |
[ ] |
| 6.2.4 | Implement RDMA support | Use rust-ibverbs. RDMA WRITE for zero-copy transfer. | 1.1.2 | Register memory regions. Exchange QP info. Post RDMA WRITE operations. | [ ] |
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 6.3.1 | Implement worker registry | Track tokenizer workers. Store endpoint, capabilities, load. Support dynamic registration. | 6.2.3 | rust pub struct WorkerRegistry { workers: DashMap<WorkerId, WorkerInfo>, } pub struct WorkerInfo { endpoint: String, capabilities: Capabilities, load: AtomicU64, last_heartbeat: AtomicU64, } |
[ ] |
| 6.3.2 | Implement load balancer | Multiple strategies: round-robin, least-loaded, token-aware. | 6.3.1 | rust pub trait LoadBalancer: Send + Sync { fn select_worker(&self, request: &TokenizeRequest) -> WorkerId; } pub struct TokenAwareBalancer { registry: Arc<WorkerRegistry>, } impl LoadBalancer for TokenAwareBalancer { fn select_worker(&self, request: &TokenizeRequest) -> WorkerId { let estimated_tokens = estimate_tokens(&request.text); // Select worker with lowest (load + queued_tokens) } } |
[~] |
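| 6.3.3 | Implement health monitoring | Periodic health checks. Detect failed workers. Auto-remove unhealthy. | 6.3.1 | rust pub struct HealthMonitor { registry: Arc<WorkerRegistry>, check_interval: Duration, timeout: Duration, } impl HealthMonitor { pub async fn run(&self) { loop { /* Snapshot IDs and endpoints first: removing from a DashMap while iterating it, or holding its guard across an .await, can deadlock */ let mut targets = Vec::new(); for entry in self.registry.workers.iter() { targets.push((entry.key().clone(), entry.value().endpoint.clone())); } for (id, endpoint) in targets { if !self.check_health(&endpoint).await { self.registry.workers.remove(&id); } } tokio::time::sleep(self.check_interval).await; } } } | [~] |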
| 6.3.4 | Implement request router | Accept HTTP/gRPC requests. Route to workers. Aggregate responses. | 6.3.2, 6.2.3 | Handle /tokenize and /tokenize_batch endpoints. Forward to selected worker. Return results. |
[ ] |
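
The token-aware strategy in 6.3.2 leaves its selection step as a comment. One way to read "lowest (load + queued_tokens)" is sketched below. `WorkerLoad` is a hypothetical reduced view of a worker, and the four-bytes-per-token estimate is an assumption chosen only to make the example concrete.

```rust
/// Hypothetical reduced view of a worker: its ID and the tokens already queued on it.
#[derive(Debug, Clone, PartialEq)]
pub struct WorkerLoad {
    pub id: u32,
    pub queued_tokens: u64,
}

/// Rough token estimate: about one token per four bytes, never less than one.
pub fn estimate_tokens(text: &str) -> u64 {
    (text.len() as u64 / 4).max(1)
}

/// Picks the worker with the fewest queued tokens and charges the estimate to it.
/// Ties go to the first worker in the slice. Returns None when there are no workers.
pub fn select_worker(workers: &mut [WorkerLoad], text: &str) -> Option<u32> {
    let cost = estimate_tokens(text);
    let best = workers.iter_mut().min_by_key(|w| w.queued_tokens)?;
    best.queued_tokens += cost;
    Some(best.id)
}
```

Charging the estimate at selection time keeps a burst of long requests from all landing on the same worker before its reported load catches up.
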
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 7.1.1 | Define PreTokenizedRequest | Struct for pre-tokenized data. Fields: request_id, token_ids, attention_mask, token_type_ids. | 3.4.2 | rust #[derive(Serialize, Deserialize)] pub struct PreTokenizedRequest { pub request_id: u64, pub token_ids: Vec<u32>, pub attention_mask: Vec<u32>, pub token_type_ids: Option<Vec<u32>>, pub priority: u8, } |
[x] |
| 7.1.2 | Implement efficient serialization | Use bincode for compact binary format. Support zero-copy where possible. | 7.1.1 | rust impl PreTokenizedRequest { pub fn serialize(&self) -> Vec<u8> { bincode::serialize(self).unwrap() } pub fn deserialize(data: &[u8]) -> Result<Self> { bincode::deserialize(data).map_err(Into::into) } } |
[x] |
| 7.1.3 | Add schema versioning | Version field in serialization. Support backward compatibility. | 7.1.1 | Prefix with version byte. Support reading old versions. | [x] |
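
Task 7.1.3 describes prefixing the payload with a version byte and dispatching on it when reading. The sketch below shows that pattern with a hand-rolled little-endian layout. It is not bincode's wire format, and `Frame`, `encode`, `decode`, and `SCHEMA_VERSION` are hypothetical names covering only two of the `PreTokenizedRequest` fields.

```rust
pub const SCHEMA_VERSION: u8 = 1;

/// Hypothetical reduced request: just an ID and the token IDs.
#[derive(Debug, PartialEq)]
pub struct Frame {
    pub request_id: u64,
    pub token_ids: Vec<u32>,
}

/// Layout: [version: u8][request_id: u64 LE][count: u32 LE][count x u32 LE].
pub fn encode(frame: &Frame) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + 8 + 4 + 4 * frame.token_ids.len());
    out.push(SCHEMA_VERSION);
    out.extend_from_slice(&frame.request_id.to_le_bytes());
    out.extend_from_slice(&(frame.token_ids.len() as u32).to_le_bytes());
    for id in &frame.token_ids {
        out.extend_from_slice(&id.to_le_bytes());
    }
    out
}

/// Reads the version byte first, then hands the rest to the matching decoder.
pub fn decode(data: &[u8]) -> Result<Frame, String> {
    let (&version, body) = data.split_first().ok_or("empty input")?;
    match version {
        1 => decode_v1(body),
        other => Err(format!("unsupported schema version {}", other)),
    }
}

fn decode_v1(body: &[u8]) -> Result<Frame, String> {
    if body.len() < 12 {
        return Err("truncated header".to_string());
    }
    let mut id = [0u8; 8];
    id.copy_from_slice(&body[..8]);
    let mut n = [0u8; 4];
    n.copy_from_slice(&body[8..12]);
    let count = u32::from_le_bytes(n) as usize;
    let payload = &body[12..];
    if count.checked_mul(4) != Some(payload.len()) {
        return Err("length mismatch".to_string());
    }
    let token_ids = payload
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Ok(Frame { request_id: u64::from_le_bytes(id), token_ids })
}
```

Supporting an older version later means adding another match arm, while unknown versions fail loudly instead of being misparsed.
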
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 7.2.1 | Implement TokenBudgetRouter | Match LatentBud's max_batch_tokens. Track padded token count. Flush when exceeded. | 7.1.1 | rust pub struct TokenBudgetRouter { max_batch_tokens: usize, current_batch: Vec<PreTokenizedRequest>, current_max_len: usize, } impl TokenBudgetRouter { pub fn add(&mut self, req: PreTokenizedRequest) -> Option<Vec<PreTokenizedRequest>> { let req_len = req.token_ids.len(); let new_max = self.current_max_len.max(req_len); let new_padded = new_max * (self.current_batch.len() + 1); if new_padded > self.max_batch_tokens && !self.current_batch.is_empty() { let batch = std::mem::take(&mut self.current_batch); self.current_batch.push(req); self.current_max_len = req_len; return Some(batch); } self.current_batch.push(req); self.current_max_len = new_max; None } } |
[x] |
| 7.2.2 | Add timeout-based flushing | Flush batch after timeout even if not full. Configurable timeout. | 7.2.1 | Use tokio::time::timeout to flush after e.g. 10ms of no new requests. |
[x] |
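
The padded-token rule in 7.2.1 is easiest to follow with numbers: a batch costs `max_len * batch_size` tokens, so one long request can push an otherwise small batch over budget. The sketch below reuses the routing logic from the table with a hypothetical `Req` stand-in for `PreTokenizedRequest` and adds an explicit `flush`, which a timeout-based flusher (7.2.2) could call.

```rust
/// Hypothetical stand-in for PreTokenizedRequest: only what the router needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Req {
    pub id: u64,
    pub len: usize,
}

pub struct TokenBudgetRouter {
    max_batch_tokens: usize,
    current_batch: Vec<Req>,
    current_max_len: usize,
}

impl TokenBudgetRouter {
    pub fn new(max_batch_tokens: usize) -> Self {
        TokenBudgetRouter { max_batch_tokens, current_batch: Vec::new(), current_max_len: 0 }
    }

    /// Adds a request; returns the previous batch when the new request would
    /// push the padded token count (longest length times batch size) over budget.
    pub fn add(&mut self, req: Req) -> Option<Vec<Req>> {
        let new_max = self.current_max_len.max(req.len);
        let new_padded = new_max * (self.current_batch.len() + 1);
        if new_padded > self.max_batch_tokens && !self.current_batch.is_empty() {
            let full = self.flush();
            self.current_max_len = req.len;
            self.current_batch.push(req);
            return full;
        }
        self.current_batch.push(req);
        self.current_max_len = new_max;
        None
    }

    /// Hands out whatever is pending, for example when a flush timer fires.
    pub fn flush(&mut self) -> Option<Vec<Req>> {
        if self.current_batch.is_empty() {
            return None;
        }
        self.current_max_len = 0;
        Some(std::mem::take(&mut self.current_batch))
    }
}
```

With a budget of 32, three requests of length 10 fit (3 x 10 = 30). Adding a fourth of length 12 would cost 4 x 12 = 48, so the first three are flushed and the fourth starts a new batch.
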
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 7.3.1 | Add pre-tokenized endpoint | New endpoint in LatentBud: /v1/embeddings/pretokenized. Accept PreTokenizedRequest. |
7.1.1 | Modify LatentBud's router to handle pre-tokenized input. | [ ] |
| 7.3.2 | Modify BatchHandler | Detect pre-tokenized requests. Skip tokenization. Pass directly to model. | 7.3.1 | Add flag to request indicating pre-tokenized. | [ ] |
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 8.1.1 | Implement CircuitBreaker | States: Closed, Open, HalfOpen. Trip on consecutive failures. Reset after success. | 1.1.7 | rust pub struct CircuitBreaker { state: AtomicU8, failure_count: AtomicU32, last_failure_time: AtomicU64, threshold: u32, reset_timeout: Duration, } impl CircuitBreaker { pub async fn call<F, T, E>(&self, f: F) -> Result<T, CircuitBreakerError<E>> where F: Future<Output = Result<T, E>>, { match self.state.load(Acquire) { CLOSED => { match f.await { Ok(v) => { self.failure_count.store(0, Relaxed); Ok(v) } Err(e) => { let count = self.failure_count.fetch_add(1, Relaxed) + 1; if count >= self.threshold { self.trip(); } Err(CircuitBreakerError::Inner(e)) } } } OPEN => { if self.should_try_reset() { self.state.store(HALF_OPEN, Release); // Try one request... } else { Err(CircuitBreakerError::Open) } } // ... } } } |
[ ] |
| 8.1.2 | Integrate with coordinator | Wrap embedder calls. One breaker per embedder. | 8.1.1, 6.3.4 | Each embedder endpoint gets its own CircuitBreaker instance. | [ ] |
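
The Closed, Open, and HalfOpen transitions from 8.1.1 can be exercised without async or atomics. The sketch below is a deliberately simplified, single-threaded version: it takes `&mut self` and a synchronous closure, which is an assumption made for clarity and differs from the atomic, `Future`-based design in the table.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Closed,
    Open,
    HalfOpen,
}

pub struct CircuitBreaker {
    state: State,
    failures: u32,
    threshold: u32,
    reset_timeout: Duration,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(threshold: u32, reset_timeout: Duration) -> Self {
        CircuitBreaker { state: State::Closed, failures: 0, threshold, reset_timeout, opened_at: None }
    }

    pub fn state(&self) -> State {
        self.state
    }

    /// Runs `f` unless the breaker is open. Returns None when the call is rejected.
    pub fn call<T, E>(&mut self, f: impl FnOnce() -> Result<T, E>) -> Option<Result<T, E>> {
        if self.state == State::Open {
            match self.opened_at {
                // Timeout elapsed: let exactly this call through as a probe.
                Some(t) if t.elapsed() >= self.reset_timeout => self.state = State::HalfOpen,
                _ => return None,
            }
        }
        let result = f();
        match &result {
            Ok(_) => {
                self.failures = 0;
                self.state = State::Closed;
            }
            Err(_) => {
                self.failures += 1;
                // A failed probe re-opens immediately; otherwise trip at the threshold.
                if self.state == State::HalfOpen || self.failures >= self.threshold {
                    self.state = State::Open;
                    self.opened_at = Some(Instant::now());
                }
            }
        }
        Some(result)
    }
}
```

Returning `None` for rejected calls keeps the caller's error type untouched. The table's design expresses the same distinction with `CircuitBreakerError::Open`.
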
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 8.2.1 | Implement exponential backoff | Base delay × 2^attempt. Cap at max delay. Add jitter. | 1.1.7 | rust pub struct RetryPolicy { max_attempts: u32, base_delay: Duration, max_delay: Duration, jitter: bool, } impl RetryPolicy { pub async fn execute<F, T, E>(&self, mut f: impl FnMut() -> F) -> Result<T, E> where F: Future<Output = Result<T, E>>, { /* Always runs at least once, so max_attempts == 0 cannot fall through to a panic */ let mut attempt = 0; loop { match f().await { Ok(v) => return Ok(v), Err(e) if attempt + 1 >= self.max_attempts => return Err(e), Err(_) => { tokio::time::sleep(self.calculate_delay(attempt)).await; attempt += 1; } } } } fn calculate_delay(&self, attempt: u32) -> Duration { let base = self.base_delay.as_millis() as u64; let max = self.max_delay.as_millis() as u64; /* Saturate instead of overflowing for large attempt counts */ let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX); let delay = base.saturating_mul(factor).min(max); /* The +1 avoids a modulo-by-zero panic when delay is under 4 ms */ let jitter = if self.jitter { rand::random::<u64>() % (delay / 4 + 1) } else { 0 }; Duration::from_millis(delay + jitter) } } | [ ] |
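
The rule "base delay × 2^attempt, capped at max delay" from 8.2.1 produces a schedule that is easy to check by hand. The sketch below isolates the delay arithmetic from the retry loop. `backoff_delay` and `with_jitter` are hypothetical helpers, and the jitter source is passed in as a number so the function stays deterministic. A real caller would supply a random value.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    // Saturate the multiplier so very large attempt counts cannot overflow.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Adds up to 25% extra delay; `unit_random` is expected to lie in [0, 1].
pub fn with_jitter(delay: Duration, unit_random: f64) -> Duration {
    delay + delay.mul_f64(0.25 * unit_random.clamp(0.0, 1.0))
}
```

With a 100 ms base and a 1 s cap, the schedule runs 100 ms, 200 ms, 400 ms, 800 ms, and then stays at 1 s.
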
| ID | Task | Details | Dependencies | Algorithm/Implementation | Status |
|---|---|---|---|---|---|
| 8.3.1 | Define Prometheus metrics | Counters, histograms, gauges for all key metrics. | 1.1.2 | rust use prometheus::{register_histogram, register_int_counter, register_int_gauge, Histogram, IntCounter, IntGauge}; lazy_static! { /* The register_* macros add each metric to the default registry; a metric built with ::new() alone never shows up in prometheus::gather() */ pub static ref REQUESTS_TOTAL: IntCounter = register_int_counter!("budtiktok_requests_total", "Total requests").unwrap(); pub static ref TOKENS_PROCESSED: IntCounter = register_int_counter!("budtiktok_tokens_processed", "Tokens processed").unwrap(); pub static ref TOKENIZATION_DURATION: Histogram = register_histogram!("budtiktok_tokenization_duration_seconds", "Tokenization latency", vec![0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]).unwrap(); pub static ref QUEUE_DEPTH: IntGauge = register_int_gauge!("budtiktok_queue_depth", "Queue depth").unwrap(); } | [ ] |
| 8.3.2 | Implement metrics endpoint | Serve /metrics in Prometheus format. | 8.3.1 | rust use prometheus::{Encoder, TextEncoder}; async fn metrics_handler() -> impl IntoResponse { let encoder = TextEncoder::new(); let metric_families = prometheus::gather(); let mut buffer = Vec::new(); encoder.encode(&metric_families, &mut buffer).unwrap(); /* format_type() carries the exposition-format version alongside text/plain */ ([(header::CONTENT_TYPE, encoder.format_type().to_string())], buffer) } | [ ] |
| 8.3.3 | Add tracing integration | Structured logging with tracing. Span propagation. Export to Jaeger. | 1.1.2 | Use tracing, tracing-subscriber, tracing-opentelemetry. Add #[instrument] to key functions. |
[ ] |
| ID | Test Category | Test Cases | Implementation Details | Status |
|---|---|---|---|---|
| 9.1.1 | Unicode Normalization | - ASCII passthrough: is_ascii_fast("hello") returns true - ASCII with high bytes: is_ascii_fast("héllo") returns false - NFC quick check: is_nfc_quick("café") returns Yes - NFC quick check combining: is_nfc_quick("cafe\u{0301}") returns No - Full NFC normalization: nfc("cafe\u{0301}") == "café" - NFD decomposition: nfd("é") == "e\u{0301}" - Hangul decomposition: nfd("한") == "\u{1112}\u{1161}\u{11AB}" - Streaming normalization: chunks produce same result as full | tests/unit_tests/unicode_normalization.rs - 20+ test cases covering all specified scenarios. | [x] |
| 9.1.2 | Unicode Categories | - ASCII category lookup: get_category('a') == Letter_Lowercase - Punctuation: is_punctuation('!') == true - CJK detection: is_cjk('中') == true - Emoji detection: covers common emoji - Category flags: bitwise operations work correctly - Cache effectiveness: second lookup faster | tests/unit_tests/unicode_categories.rs - 20+ test cases for categories, CJK, punctuation, flags. | [x] |
| 9.1.3 | Token Cache | - Insert and retrieve - CLOCK eviction works correctly - Concurrent access safety - Stats tracking accurate - Sharded cache reduces contention - Multi-level cache hierarchy |
tests/unit_tests/cache.rs - 459 lines of cache tests with concurrency scenarios. |
[x] |
| 9.1.4 | Vocabulary | - Load tokenizer.json - Token to ID lookup - ID to token lookup - Trie prefix search - Aho-Corasick matching - Double-array trie consistency |
tests/unit_tests/vocabulary.rs - 544 lines testing vocab, trie, and AC matching. |
[x] |
| 9.1.5 | WordPiece Algorithm | - Basic tokenization: "hello" -> ["hello"] - Subword split: "unhappiness" -> ["un", "##happiness"] - Unknown word: "xyz123" -> ["[UNK]"] - Max chars limit - Continuation prefix - Cache integration - BERT normalizer - BERT pre-tokenizer - Post-processing with [CLS]/[SEP] |
tests/unit_tests/wordpiece.rs - 682 lines with comprehensive WordPiece tests. |
[x] |
| 9.1.6 | Unigram Algorithm | - Viterbi optimal segmentation - Byte fallback for unknown chars - N-best returns multiple paths - Sampling produces valid segmentations - Metaspace pre-tokenization - Score-based selection |
tests/unit_tests/unigram.rs - 681 lines testing Viterbi, n-best, sampling. |
[x] |
| 9.1.7 | BPE Algorithm | - O(n) produces same result as O(n²) - Compatibility table correct - Byte-level encoding - Merge ordering - Dropout produces different results |
tests/unit_tests/bpe.rs - 844 lines testing O(n) BPE, byte-level, dropout. |
[x] |
| 9.1.8 | SIMD Implementations | - AVX-512 matches scalar - AVX2 matches scalar - SSE4.2 matches scalar - NEON matches scalar - SWAR matches scalar - Edge cases: partial chunks, empty input |
tests/unit_tests/simd.rs + tests/simd_validation.rs + tests/isa_consistency.rs - 16+ ISA consistency tests. |
[x] |
| 9.1.9 | Memory Management | - Arena allocates correctly - Arena reset works - String interner deduplicates - Memory pool reuse - No memory leaks (use miri) |
tests/unit_tests/memory.rs - 766 lines testing arena, interner, pools. |
[x] |
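
The Hangul case in 9.1.1 is worth a worked example because the composed syllable and its decomposed jamo render identically, which makes a wrong expectation hard to spot. Hangul decomposition is purely arithmetic in the Unicode Standard. The sketch below implements that arithmetic with the standard's constants, under the hypothetical helper name `decompose_hangul`.

```rust
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant
const V_BASE: u32 = 0x1161; // first vowel
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
const S_COUNT: u32 = 19 * N_COUNT; // 11172 syllables in total

/// Returns the canonical decomposition of a precomposed Hangul syllable,
/// or None when `c` lies outside the Hangul syllable block.
pub fn decompose_hangul(c: char) -> Option<String> {
    let s_index = (c as u32).checked_sub(S_BASE)?;
    if s_index >= S_COUNT {
        return None;
    }
    let l = L_BASE + s_index / N_COUNT;
    let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
    let t = T_BASE + s_index % T_COUNT;
    let mut out = String::new();
    out.push(char::from_u32(l)?);
    out.push(char::from_u32(v)?);
    if t != T_BASE {
        out.push(char::from_u32(t)?); // trailing consonant only when present
    }
    Some(out)
}
```

For "한" (U+D55C) the syllable index is 10588, giving leading index 18, vowel index 0, and trailing index 4, which is the sequence U+1112 U+1161 U+11AB.
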
| ID | Test Category | Test Cases | Implementation Details | Status |
|---|---|---|---|---|
| 9.2.1 | End-to-End Tokenization | - Load real tokenizer, encode text, verify output - Batch encoding consistency - Truncation and padding - Special token handling - Offset tracking accuracy |
tests/integration_tests/tokenization.rs - End-to-end tokenization tests. |
[x] |
| 9.2.2 | Distributed Pipeline | - Start coordinator and workers - Send requests through full pipeline - Verify results match direct tokenization - Test with varying batch sizes |
tests/integration_tests/distributed.rs - 16 distributed pipeline tests. |
[x] |
| 9.2.3 | Failure Recovery | - Kill worker mid-request, verify retry - Circuit breaker trips and recovers - Network partition handling - Timeout handling |
Partial - distributed tests cover some scenarios. Circuit breaker not fully implemented. | [~] |
| 9.2.4 | GPU Integration | - CPU and GPU produce same results - Multi-GPU distribution - Fallback to CPU when GPU unavailable - Memory management |
GPU module not implemented (Section 5). | [ ] |
| 9.2.5 | LatentBud Integration | - Pre-tokenized requests accepted - Token budget routing correct - End-to-end embedding pipeline |
tests/integration_tests/latentbud.rs - LatentBud integration tests. |
[x] |
| ID | Test Category | Test Cases | Implementation Details | Status |
|---|---|---|---|---|
| 9.3.1 | Output Regression | - Fixed inputs produce fixed outputs - Store expected outputs in fixtures - Detect any output change |
tests/regression.rs::regression_tests::outputs::* - Golden file testing. |
[x] |
| 9.3.2 | Performance Regression | - Benchmark must not regress >10% - Track p50, p95, p99 latency - Track throughput tokens/sec |
tests/regression.rs::regression_tests::performance::* - Performance baseline tests. |
[x] |
| 9.3.3 | Memory Regression | - Peak memory must not increase >10% - No memory leaks |
tests/regression.rs::regression_tests::memory::* - Memory leak and allocation tests. |
[x] |
| ID | Test Category | Properties | Implementation Details | Status |
|---|---|---|---|---|
| 9.4.1 | Tokenization Properties | - encode(decode(ids)) produces valid (possibly different) ids - decode(encode(text)) preserves semantics - batch encode == individual encodes - truncated + overflow == original |
tests/property_tests/tokenization.rs - Property-based tokenization tests. |
[x] |
| 9.4.2 | SIMD Properties | - All SIMD implementations produce identical results - Results match scalar baseline - No out-of-bounds access |
tests/property_tests/simd.rs - 16+ SIMD property tests. |
[x] |
| 9.4.3 | Concurrency Properties | - Concurrent access produces consistent results - No data races - Cache remains consistent |
tests/property_tests/concurrency.rs - 11 concurrency property tests. |
[x] |
| ID | Test Target | Approach | Implementation Details | Status |
|---|---|---|---|---|
| 9.5.1 | Tokenizer Input | Fuzz arbitrary UTF-8 strings | fuzz/fuzz_targets/fuzz_tokenize.rs + fuzz_unicode.rs - Fuzzing tokenizer and Unicode functions. |
[x] |
| 9.5.2 | tokenizer.json Parser | Fuzz JSON input | fuzz/fuzz_targets/fuzz_json_parser.rs - Fuzzing JSON config parser. |
[x] |
| 9.5.3 | IPC Deserialization | Fuzz serialized messages | fuzz/fuzz_targets/fuzz_ipc.rs - Fuzzing IPC message deserialization. |
[x] |
| ID | Task | Details | How to Use | When to Use | Status |
|---|---|---|---|---|---|
| 10.1.1 | Set up perf integration | Configure for Linux perf profiling. Add debug symbols to release builds. |
bash # Build with debug symbols RUSTFLAGS="-C debuginfo=2" cargo build --release # Record profile perf record -g ./target/release/budtiktok-bench perf report |
When investigating CPU hotspots. Use after benchmarks show unexpected slowdowns. | [ ] |
| 10.1.2 | Set up flamegraph | Install cargo-flamegraph. Configure for both CPU and memory. | bash cargo install flamegraph cargo flamegraph --bin budtiktok-bench Writes flamegraph.svg to the current directory; open it in a browser. | Visualizing where time is spent. Use to identify optimization targets. | [ ] |
| 10.1.3 | Set up samply | Modern sampling profiler for Rust. Cross-platform. | bash cargo install samply samply record ./target/release/budtiktok-bench Opens Firefox Profiler UI. |
Cross-platform profiling. Better UI than flamegraph. | [ ] |
| 10.1.4 | Set up cachegrind | Valgrind tool for cache analysis. | bash valgrind --tool=cachegrind ./target/release/budtiktok-bench cg_annotate cachegrind.out.* |
When cache misses suspected. Use after SIMD optimizations. | [ ] |
| 10.1.5 | Set up DHAT | Valgrind tool for heap profiling. | bash valgrind --tool=dhat ./target/release/budtiktok-bench View dhat.out.* in browser. |
Finding allocation hotspots. Use when memory usage too high. | [ ] |
| 10.1.6 | Set up heaptrack | Modern heap profiler. Better than massif. | bash heaptrack ./target/release/budtiktok-bench heaptrack_gui heaptrack.*.gz |
Detailed heap analysis. Finding memory leaks. | [ ] |
| 10.1.7 | Add tracing instrumentation | Add #[instrument] attributes. Configure span recording. |
rust #[tracing::instrument(skip(self))] pub fn encode(&self, text: &str) -> Encoding { ... } Enable with RUST_LOG=trace. |
Understanding control flow. Distributed tracing. | [ ] |
| ID | Task | Details | How to Use | When to Use | Status |
|---|---|---|---|---|---|
| 10.2.1 | Add timing instrumentation | Measure key operations. Export as metrics. | rust pub struct Profiler { tokenization_ns: AtomicU64, normalization_ns: AtomicU64, lookup_ns: AtomicU64, } impl Profiler { pub fn time<T>(&self, metric: &AtomicU64, f: impl FnOnce() -> T) -> T { let start = Instant::now(); let result = f(); metric.fetch_add(start.elapsed().as_nanos() as u64, Relaxed); result } } |
Always in debug builds. Enable via feature flag in release. | [ ] |
| 10.2.2 | Add cache statistics | Track hit rates per cache level. | Export via /debug/cache_stats endpoint. |
Tuning cache sizes. Identifying cache-unfriendly patterns. | [ ] |
| 10.2.3 | Add SIMD path tracking | Track which SIMD path is used. | Log on startup: "Using AVX-512 backend". Add counter for fallbacks. | Verifying SIMD is used. Debugging performance issues. | [ ] |
| 10.2.4 | Create profiling CLI | budtiktok profile subcommand. Run benchmarks, collect profiles. |
bash budtiktok profile --input data/test.txt --output profile.json |
Easy profiling without setup. | [ ] |
# Profiling Guide
## When to Profile
1. **Before optimization**: Establish baseline
2. **After major changes**: Verify no regression
3. **When benchmarks regress**: Identify cause
4. **Before release**: Ensure production-ready
## Profiling Workflow
### 1. CPU Profiling (samply/flamegraph)
```bash
# Quick flamegraph
cargo flamegraph --bin budtiktok-bench -- --benchmark wordpiece
# Detailed with samply
samply record ./target/release/budtiktok-bench
# Opens interactive UI
```
What to look for:
- Functions taking >10% of time
- Unexpected standard library calls
- Memory allocation (`malloc`, `__rust_alloc`)
- Lock contention (`parking_lot`, `pthread_mutex`)
### 2. Cache Analysis (cachegrind)
```bash
valgrind --tool=cachegrind ./target/release/budtiktok-bench
cg_annotate --auto=yes cachegrind.out.*
```
What to look for:
- D1 miss rate >5% (L1 data cache)
- LL miss rate >1% (Last level cache)
- Functions with high miss counts
### 3. Heap Profiling (heaptrack)
```bash
heaptrack ./target/release/budtiktok-bench
heaptrack_gui heaptrack.*.gz
```
What to look for:
- Allocation hotspots
- Memory growth over time
- Temporary allocations in hot paths
### 4. SIMD Verification
```bash
# Check CPU features
cat /proc/cpuinfo | grep -E 'avx|sse|neon'
# Verify runtime detection
RUST_LOG=debug ./target/release/budtiktok-bench 2>&1 | grep -i simd
# Disassemble to verify SIMD instructions
objdump -d target/release/budtiktok-bench | grep -E 'vmov|vpcmp|vadd'
```
## Optimization Priority
- Algorithmic: O(n) vs O(n²) - biggest impact
- Memory access: Cache locality, prefetching
- SIMD: Vectorize hot loops
- Allocation: Reduce allocations in hot paths
- Parallelism: Use all cores effectively
---
## 11. Accuracy Testing Suite
### 11.1 HuggingFace Gold Standard
| ID | Task | Details | Implementation | Status |
|----|------|---------|----------------|--------|
| 11.1.1 | Create accuracy test framework | Compare BudTikTok output against HuggingFace tokenizers. Report any differences. | ```rust pub struct AccuracyTest { hf_tokenizer: tokenizers::Tokenizer, bud_tokenizer: Box<dyn Tokenizer>, } impl AccuracyTest { /* Returns Result so that `?` can propagate encode errors from either tokenizer */ pub fn compare(&self, text: &str) -> Result<AccuracyResult> { let hf_result = self.hf_tokenizer.encode(text, true)?; let bud_result = self.bud_tokenizer.encode(text, true)?; Ok(AccuracyResult { input: text.to_string(), hf_ids: hf_result.get_ids().to_vec(), bud_ids: bud_result.ids.clone(), match_: hf_result.get_ids() == bud_result.ids.as_slice(), }) } } ``` | `[ ]` |
| 11.1.2 | Create test datasets | Curate datasets covering edge cases. | - ASCII text (Wikipedia English)<br>- Unicode text (Wikipedia multilingual)<br>- Code (GitHub samples)<br>- Edge cases (empty, whitespace, special chars)<br>- Long documents<br>- Many short strings | `[ ]` |
| 11.1.3 | Implement accuracy CLI | `budtiktok accuracy` command. | ```bash budtiktok accuracy \ --tokenizer bert-base-uncased \ --dataset tests/data/wikipedia_sample.txt \ --output accuracy_report.json ``` | `[ ]` |
| 11.1.4 | Add CI accuracy checks | Run accuracy tests in CI. Fail on any mismatch. | GitHub Action that runs accuracy tests against all supported tokenizers. | `[ ]` |
### 11.2 Model Coverage
| ID | Model | Tokenizer Type | Test Data | Status |
|----|-------|---------------|-----------|--------|
| 11.2.1 | bert-base-uncased | WordPiece | Wikipedia EN, GLUE | `[ ]` |
| 11.2.2 | bert-base-multilingual-cased | WordPiece | Wikipedia multi | `[ ]` |
| 11.2.3 | gpt2 | BPE | OpenWebText | `[ ]` |
| 11.2.4 | roberta-base | BPE | BookCorpus | `[ ]` |
| 11.2.5 | xlm-roberta-base | Unigram | CC-100 | `[ ]` |
| 11.2.6 | t5-base | Unigram | C4 | `[ ]` |
| 11.2.7 | llama-2-7b | BPE | Wikipedia | `[ ]` |
| 11.2.8 | mistral-7b | BPE | Wikipedia | `[ ]` |
| 11.2.9 | BAAI/bge-small-en-v1.5 | WordPiece | MTEB | `[ ]` |
| 11.2.10 | sentence-transformers/all-MiniLM-L6-v2 | WordPiece | STS | `[ ]` |
### 11.3 Edge Case Tests
| ID | Category | Test Cases | Status |
|----|----------|-----------|--------|
| 11.3.1 | Empty/Whitespace | Empty string, single space, multiple spaces, tabs, newlines, mixed | `[ ]` |
| 11.3.2 | Unicode | Combining characters, ZWJ sequences, RTL text, surrogates, BOM | `[ ]` |
| 11.3.3 | Long Text | 1K, 10K, 100K, 1M characters | `[ ]` |
| 11.3.4 | Special Characters | All ASCII punctuation, math symbols, currency, emoji | `[ ]` |
| 11.3.5 | Code | Python, JavaScript, Rust, C++, with special syntax | `[ ]` |
| 11.3.6 | CJK | Chinese, Japanese, Korean text, mixed with ASCII | `[ ]` |
| 11.3.7 | Arabic/Hebrew | RTL scripts, mixed with LTR | `[ ]` |
| 11.3.8 | Normalization | Pre-composed vs decomposed Unicode, NFKC equivalents | `[ ]` |
---
## 12. Comparison Benchmarking Tool
### 12.1 Benchmark Framework
| ID | Task | Details | Implementation | Status |
|----|------|---------|----------------|--------|
| 12.1.1 | Create benchmark harness | Unified framework for comparing tokenizers. | ```rust pub struct BenchmarkHarness { tokenizers: Vec<(String, Box<dyn Tokenizer>)>, datasets: Vec<Dataset>, } impl BenchmarkHarness { pub fn run(&self) -> BenchmarkResults { let mut results = BenchmarkResults::new(); for (name, tokenizer) in &self.tokenizers { for dataset in &self.datasets { let metrics = self.benchmark_one(tokenizer.as_ref(), dataset); results.add(name, &dataset.name, metrics); } } results } fn benchmark_one(&self, tokenizer: &dyn Tokenizer, dataset: &Dataset) -> Metrics { /* Warmup */ for _ in 0..100 { let _ = tokenizer.encode(&dataset.samples[0], true); } /* Measure */ let mut latencies = Vec::with_capacity(dataset.samples.len()); let mut total_tokens = 0usize; let start = Instant::now(); for sample in &dataset.samples { let t0 = Instant::now(); let enc = tokenizer.encode(sample, true).unwrap(); latencies.push(t0.elapsed()); total_tokens += enc.ids.len(); /* count produced tokens, not samples */ } let total_time = start.elapsed(); Metrics { throughput_samples_per_sec: dataset.samples.len() as f64 / total_time.as_secs_f64(), throughput_tokens_per_sec: total_tokens as f64 / total_time.as_secs_f64(), latency_p50: percentile(&latencies, 0.50), latency_p95: percentile(&latencies, 0.95), latency_p99: percentile(&latencies, 0.99), } } } ``` | `[ ]` |
| 12.1.2 | Add HuggingFace tokenizers backend | Wrap HF tokenizers for comparison. | ```rust pub struct HfTokenizer { inner: tokenizers::Tokenizer, } impl Tokenizer for HfTokenizer { fn encode(&self, text: &str, add_special_tokens: bool) -> Result<Encoding> { let enc = self.inner.encode(text, add_special_tokens)?; Ok(Encoding { ids: enc.get_ids().to_vec(), // ... convert other fields }) } } ``` | `[ ]` |
| 12.1.3 | Add TEI backend | Benchmark against TEI's tokenization. | Use TEI's gRPC API or embed its tokenization code. | `[ ]` |
| 12.1.4 | Add BlazeText backend | Benchmark against BlazeText. | If available as library, link directly. Otherwise benchmark via CLI. | `[ ]` |
| 12.1.5 | Create benchmark CLI | `budtiktok benchmark` command. | ```bash budtiktok benchmark \ --tokenizers hf,budtiktok,tei \ --models bert-base-uncased,gpt2,t5-base \ --datasets wikipedia,code,multilingual \ --output results.json \ --format markdown ``` | `[ ]` |
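
The harness in 12.1.1 calls a `percentile` helper that is not defined in this plan. A nearest-rank version is sketched below as one reasonable choice. It returns `Option<Duration>` to make the empty-input case explicit, which is an assumption that differs slightly from the call sites in the table.

```rust
use std::time::Duration;

/// Nearest-rank percentile: the smallest sample such that at least `q` of all
/// samples are less than or equal to it. `q` is clamped to [0, 1].
pub fn percentile(latencies: &[Duration], q: f64) -> Option<Duration> {
    if latencies.is_empty() {
        return None;
    }
    let mut sorted = latencies.to_vec();
    sorted.sort();
    // Rank is 1-based; q = 0 maps to the minimum.
    let rank = (q.clamp(0.0, 1.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

Nearest-rank always returns a value that was actually measured, which keeps p99 meaningful for small sample counts where interpolation would invent latencies.
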
### 12.2 Benchmark Datasets
| ID | Dataset | Description | Size | Status |
|----|---------|-------------|------|--------|
| 12.2.1 | Wikipedia EN | English Wikipedia articles | 10K samples | `[ ]` |
| 12.2.2 | Wikipedia Multi | Multilingual Wikipedia | 10K samples, 10 languages | `[ ]` |
| 12.2.3 | Code | GitHub code samples | 10K samples, 5 languages | `[ ]` |
| 12.2.4 | Short Text | Tweets, queries | 100K samples, <50 chars | `[ ]` |
| 12.2.5 | Long Text | Documents | 1K samples, >10K chars | `[ ]` |
| 12.2.6 | ShareGPT | Conversational data | 10K samples | `[ ]` |
### 12.3 Benchmark Reports
| ID | Task | Details | Implementation | Status |
|----|------|---------|----------------|--------|
| 12.3.1 | Generate markdown report | Human-readable comparison tables. | ```markdown # BudTikTok Benchmark Results ## Throughput (samples/sec) \| Model \| HuggingFace \| TEI \| BlazeText \| BudTikTok \| Speedup \| \|-------\|-------------\|-----\|-----------\|-----------\|--------\| \| bert-base \| 10,000 \| 12,000 \| 50,000 \| 150,000 \| **15x** \| ## Latency (p99, microseconds) ... ``` | `[ ]` |
| 12.3.2 | Generate JSON report | Machine-readable for CI integration. | Store full metrics, system info, versions. | `[ ]` |
| 12.3.3 | Generate charts | Visualization of results. | Use plotters crate or output SVG. Bar charts for throughput, line charts for latency distribution. | `[ ]` |
| 12.3.4 | Add to CI | Run benchmarks in CI, track over time. | Store results in GitHub artifacts. Generate comparison over time. | `[ ]` |
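
The report tasks above rely on two small calculations: a p99 latency and a speedup ratio. A minimal sketch of both is shown here, assuming latencies are collected as microseconds in a `u64` slice and that speedup is measured against a single baseline backend. `percentile` and `throughput_row` are hypothetical helper names, and nearest-rank is one of several valid percentile definitions.

```rust
/// Nearest-rank percentile over latencies in microseconds.
/// Returns `None` for an empty slice or a `p` outside 0..=100.
pub fn percentile(samples_us: &[u64], p: f64) -> Option<u64> {
    if samples_us.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples_us.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    // Nearest-rank: rank = ceil(p * n / 100), clamped to 1..=n.
    let rank = (p * n as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.clamp(1, n) - 1])
}

/// Render one markdown table row: model, baseline and candidate throughput
/// (samples/sec), and the candidate's speedup over the baseline.
pub fn throughput_row(model: &str, baseline: f64, candidate: f64) -> String {
    let speedup = if baseline > 0.0 {
        format!("**{:.1}x**", candidate / baseline)
    } else {
        // Avoid dividing by zero when the baseline did not run.
        "n/a".to_string()
    };
    format!("| {model} | {baseline:.0} | {candidate:.0} | {speedup} |")
}
```

Sorting a copy keeps the caller's sample order intact, which matters if the same samples also feed a latency-over-time chart.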
### 12.4 Benchmark Configuration
```toml
# benchmark.toml
[harness]
warmup_iterations = 100
measurement_iterations = 10000
timeout_per_sample_ms = 1000
[tokenizers]
hf = { enabled = true, path = "huggingface/tokenizers" }
tei = { enabled = true, endpoint = "http://localhost:8080" }
blazetext = { enabled = false, reason = "not available" }
budtiktok = { enabled = true, simd = "auto", gpu = false }
[datasets]
wikipedia_en = { path = "data/wikipedia_en.txt", max_samples = 10000 }
code = { path = "data/code_samples.txt", max_samples = 10000 }
[models]
bert = { path = "bert-base-uncased/tokenizer.json" }
gpt2 = { path = "gpt2/tokenizer.json" }
t5 = { path = "t5-base/tokenizer.json" }
[output]
format = ["json", "markdown", "svg"]
output_dir = "benchmark_results"
```
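
The `[harness]` keys in the configuration suggest a measurement loop with a warmup phase, a timed phase, and a per-sample time limit. The sketch below shows one possible shape for that loop using only the standard library. `HarnessConfig` and `measure` are hypothetical names, and the timeout is checked after each call returns, so it flags slow samples without interrupting them.

```rust
use std::time::{Duration, Instant};

/// Mirrors the `[harness]` table; field names follow the TOML keys.
/// (Hypothetical struct for illustration.)
pub struct HarnessConfig {
    pub warmup_iterations: usize,
    pub measurement_iterations: usize,
    pub timeout_per_sample_ms: u64,
}

/// Run `f` for the configured warmup and measurement iterations and
/// return one latency per measured iteration.
///
/// An iteration that exceeds the timeout aborts the run with an error.
/// The check happens after `f` returns; the call itself is not interrupted.
pub fn measure<F: FnMut()>(cfg: &HarnessConfig, mut f: F) -> Result<Vec<Duration>, String> {
    // Warmup runs are executed but not recorded.
    for _ in 0..cfg.warmup_iterations {
        f();
    }

    let timeout = Duration::from_millis(cfg.timeout_per_sample_ms);
    let mut latencies = Vec::with_capacity(cfg.measurement_iterations);
    for i in 0..cfg.measurement_iterations {
        let start = Instant::now();
        f();
        let elapsed = start.elapsed();
        if elapsed > timeout {
            return Err(format!(
                "iteration {i} took {elapsed:?}, over the {timeout:?} limit"
            ));
        }
        latencies.push(elapsed);
    }
    Ok(latencies)
}
```

With the values from `benchmark.toml`, a call would run the closure 100 times unrecorded and then 10,000 times recorded.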
## 13. Deployment
### 13.1 Docker
| ID | Task | Details | Implementation | Status |
|---|---|---|---|---|
| 13.1.1 | Create Dockerfile | Multi-stage build. Minimal runtime. | ```dockerfile FROM rust:1.75-slim AS builder WORKDIR /app COPY . . RUN cargo build --release --features full FROM debian:bookworm-slim COPY --from=builder /app/target/release/budtiktok /usr/local/bin/ EXPOSE 8080 ENTRYPOINT ["budtiktok", "serve"] ``` | [ ] |
| 13.1.2 | Create GPU Dockerfile | CUDA runtime support. Assumes a `builder` stage like the one in 13.1.1. | ```dockerfile FROM nvidia/cuda:12.3.1-runtime-ubuntu22.04 COPY --from=builder /app/target/release/budtiktok /usr/local/bin/ ENV NVIDIA_VISIBLE_DEVICES=all ``` | [ ] |
| 13.1.3 | Create docker-compose | Multi-container deployment. | Coordinator + workers + monitoring stack. | [ ] |
### 13.2 Kubernetes
| ID | Task | Details | Implementation | Status |
|---|---|---|---|---|
| 13.2.1 | Create Helm chart | Deployments, Services, ConfigMaps. | Chart with coordinator, workers, HPA, PDB. | [ ] |
| 13.2.2 | Create Kubernetes operator | Custom resource, auto-scaling. | Use kube-rs to build operator. | [ ] |
| 13.2.3 | Add monitoring stack | Prometheus, Grafana dashboards. | Pre-configured dashboards for BudTikTok metrics. | [ ] |
## 14. Documentation
### 14.1 API Documentation
| ID | Task | Details | Status |
|---|---|---|---|
| 14.1.1 | Write rustdoc | Document all public APIs. | [ ] |
| 14.1.2 | Create examples | Code examples for common use cases. | [ ] |
| 14.1.3 | Publish to docs.rs | Automated on release. | [ ] |
### 14.2 User Guides
| ID | Task | Details | Status |
|---|---|---|---|
| 14.2.1 | Quick start guide | Get started in 5 minutes. | [ ] |
| 14.2.2 | Configuration reference | All config options. | [ ] |
| 14.2.3 | Deployment guide | Single-node, multi-node, Kubernetes. | [ ] |
| 14.2.4 | Performance tuning guide | NUMA, caching, SIMD. | [ ] |
| 14.2.5 | Migration guide | From HuggingFace tokenizers. | [ ] |

| Section | Total Tasks | Completed |
|---|---|---|
| 1. Project Setup | 13 | 13 |
| 2. Core Infrastructure | 35 | 35 |
| 3. Tokenization Algorithms | 27 | 27 |
| 4. SIMD Acceleration | 18 | 18 |
| 5. GPU Tokenization | 10 | 10 |
| 6. Distributed Architecture | 15 | 2 |
| 7. LatentBud Integration | 6 | 5 |
| 8. Resilience and Operations | 8 | 0 |
| 9. Test Suites (TDD) | 19 | 18 |
| 10. Profiling Tools | 11 | 0 |
| 11. Accuracy Testing | 14 | 0 |
| 12. Benchmarking Tool | 10 | 0 |
| 13. Deployment | 6 | 0 |
| 14. Documentation | 8 | 0 |
| Total | 200 | 128 |
- 1.1.1 Create Cargo workspace
- 2.1.1 ASCII fast path
- 2.2.1 CategoryFlags
- 2.4.1 tokenizer.json parser
- 3.1.1 WordPiece algorithm
- 3.4.1 Tokenizer trait
- 9.1.5 WordPiece unit tests
- 11.1.1 Accuracy test framework
- 12.1.1 Benchmark harness
- 4.1.1 SIMD detection
- 4.2.1 AVX-512 implementation

Estimated tasks to MVP: 50

Last Updated: 2025-12-17

Latest: 128/200 tasks done. Section 5 (GPU Tokenization) is now complete, with a CUDA backend built on cudarc: GpuTokenizer with multi-GPU support, a vocabulary lookup kernel, pre-tokenization, WordPiece, and batch size auto-tuning. 45+ GPU tests are passing.

Remaining: Distributed (Section 6), Resilience (Section 8), Profiling (Section 10), Accuracy Testing (Section 11), Benchmarking (Section 12), Deployment (Section 13), Documentation (Section 14).