Commit 4370ddb

feat: Add real attention, KV cache, RoPE, and tokenizer to BitNet backend
Resolves the three blocking gaps that prevented end-to-end inference:

1. **Real attention layer** (was pass-through placeholder):
   - AttentionWeights struct with Q/K/V/O ternary projections
   - GQA (Grouped Query Attention) with configurable num_heads / num_kv_heads
   - Pre-computed RoPE cos/sin tables (apply_rope)
   - Per-layer KV cache for autoregressive generation
   - forward_token() for efficient single-token inference with cache
   - forward_layer_cached() with full attention computation
   - forward_layer_nocache() legacy path for backwards compatibility

2. **Tokenizer integration** (was raw bytes → token IDs):
   - load_tokenizer_from_gguf() extracts vocab + merges from GGUF metadata
   - Byte-level fallback tokenizer (260 tokens) when GGUF has no vocab
   - TokenizerBridge implements crate-level Tokenizer trait
   - tok() accessor for direct tokenizer access

3. **generate() uses tokenizer** (was returning [token_id] strings):
   - Encodes prompt via BPE tokenizer before forward pass
   - Decodes generated tokens back to text
   - generate_cached() for KV-cached autoregressive generation
   - get_embeddings() now uses tokenizer for text encoding
   - reset_cache() to clear KV state between sequences

Tests: 174/174 bitnet tests pass (9 new: RoPE, KV cache, tokenizer roundtrip, attention weights, byte-level fallback, cache operations)

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
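
For readers unfamiliar with the "pre-computed RoPE cos/sin tables" item in the commit message, here is a minimal sketch of the general technique. It is not the code from this commit: the `RopeTable` type, its `new` and `apply` methods, the adjacent-pair rotation layout, and the frequency base passed by the caller (commonly 10000) are all assumptions made for illustration. The commit's own `apply_rope` may use a different layout or signature.

```rust
/// Hypothetical pre-computed rotary position embedding (RoPE) table.
/// Stores cos/sin for every (position, frequency index) pair, row-major.
pub struct RopeTable {
    cos: Vec<f32>, // length = max_seq_len * half_dim
    sin: Vec<f32>, // length = max_seq_len * half_dim
    half_dim: usize,
}

impl RopeTable {
    /// Builds the table once so per-token inference only does lookups.
    /// `base` is the frequency base (assumed; 10000.0 is a common choice).
    pub fn new(head_dim: usize, max_seq_len: usize, base: f32) -> Self {
        assert!(head_dim % 2 == 0, "head_dim must be even");
        let half_dim = head_dim / 2;
        let mut cos = Vec::with_capacity(max_seq_len * half_dim);
        let mut sin = Vec::with_capacity(max_seq_len * half_dim);
        for pos in 0..max_seq_len {
            for i in 0..half_dim {
                // theta_i = base^(-2i / head_dim)
                let freq = base.powf(-2.0 * i as f32 / head_dim as f32);
                let angle = pos as f32 * freq;
                cos.push(angle.cos());
                sin.push(angle.sin());
            }
        }
        Self { cos, sin, half_dim }
    }

    /// Rotates one head vector in place for the given position.
    /// Assumes adjacent-pair layout: (x[0], x[1]), (x[2], x[3]), ...
    /// Panics if `pos` is beyond the table or `head` has the wrong length.
    pub fn apply(&self, head: &mut [f32], pos: usize) {
        assert_eq!(head.len(), 2 * self.half_dim);
        let row = pos * self.half_dim;
        for i in 0..self.half_dim {
            let (c, s) = (self.cos[row + i], self.sin[row + i]);
            let (x0, x1) = (head[2 * i], head[2 * i + 1]);
            // Standard 2D rotation by the position-dependent angle.
            head[2 * i] = x0 * c - x1 * s;
            head[2 * i + 1] = x0 * s + x1 * c;
        }
    }
}
```

In this sketch the same table would be applied to each query head and each key head before the attention scores are computed. Because every pair is a pure rotation, the vector norm is unchanged, and position 0 leaves the input as-is.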
1 parent 828e500 commit 4370ddb

1 file changed

Lines changed: 667 additions & 57 deletions
