Commit 4370ddb
feat: Add real attention, KV cache, RoPE, and tokenizer to BitNet backend
Resolves the three blocking gaps that prevented end-to-end inference:
1. **Real attention layer** (was pass-through placeholder):
- AttentionWeights struct with Q/K/V/O ternary projections
- GQA (Grouped Query Attention) with configurable num_heads / num_kv_heads
- Pre-computed RoPE cos/sin tables (apply_rope)
- Per-layer KV cache for autoregressive generation
- forward_token() for efficient single-token inference with cache
- forward_layer_cached() with full attention computation
- forward_layer_nocache() legacy path for backwards compatibility
2. **Tokenizer integration** (was raw bytes → token IDs):
- load_tokenizer_from_gguf() extracts vocab + merges from GGUF metadata
- Byte-level fallback tokenizer (260 tokens) when GGUF has no vocab
- TokenizerBridge implements crate-level Tokenizer trait
- tok() accessor for direct tokenizer access
3. **generate() uses tokenizer** (was returning [token_id] strings):
- Encodes prompt via BPE tokenizer before forward pass
- Decodes generated tokens back to text
- generate_cached() for KV-cached autoregressive generation
- get_embeddings() now uses tokenizer for text encoding
- reset_cache() to clear KV state between sequences
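As a rough illustration of the pre-computed RoPE tables and the GQA head sharing described in item 1, here is a minimal sketch. It is not the code from this commit: the `RopeTables` struct, its flattened `[pos][i]` table layout, the interleaved pair convention, the `theta` parameter, and the `kv_head_for` helper are all assumptions made for the example. Only the name `apply_rope` comes from the commit message, and its real signature may differ.

```rust
/// Hypothetical container for pre-computed rotary tables.
/// Layout assumed here: entry (pos, i) lives at index pos * half_dim + i.
pub struct RopeTables {
    cos: Vec<f32>,
    sin: Vec<f32>,
    half_dim: usize,
}

impl RopeTables {
    /// Builds cos/sin tables for every position up to `max_seq_len`.
    pub fn new(max_seq_len: usize, head_dim: usize, theta: f32) -> Self {
        assert!(head_dim % 2 == 0, "head_dim must be even");
        let half_dim = head_dim / 2;
        let mut cos = Vec::with_capacity(max_seq_len * half_dim);
        let mut sin = Vec::with_capacity(max_seq_len * half_dim);
        for pos in 0..max_seq_len {
            for i in 0..half_dim {
                // Standard RoPE frequency: theta^(-2i / head_dim).
                let inv_freq = theta.powf(-2.0 * i as f32 / head_dim as f32);
                let angle = pos as f32 * inv_freq;
                cos.push(angle.cos());
                sin.push(angle.sin());
            }
        }
        Self { cos, sin, half_dim }
    }

    /// Rotates each adjacent pair (x[2i], x[2i+1]) of one head vector in place.
    /// Assumption: interleaved pairs; some implementations pair x[i] with
    /// x[i + half_dim] instead.
    pub fn apply_rope(&self, x: &mut [f32], pos: usize) {
        assert_eq!(x.len(), 2 * self.half_dim);
        let base = pos * self.half_dim;
        for i in 0..self.half_dim {
            let (c, s) = (self.cos[base + i], self.sin[base + i]);
            let (a, b) = (x[2 * i], x[2 * i + 1]);
            x[2 * i] = a * c - b * s;
            x[2 * i + 1] = a * s + b * c;
        }
    }
}

/// GQA head sharing (hypothetical helper): consecutive query heads are
/// grouped, and each group reads the same cached K/V head.
pub fn kv_head_for(q_head: usize, num_heads: usize, num_kv_heads: usize) -> usize {
    assert!(num_kv_heads > 0 && num_heads % num_kv_heads == 0);
    q_head / (num_heads / num_kv_heads)
}
```

Pre-computing the tables moves the trigonometry out of the per-token path, which is the part a single-token forward pass with a KV cache runs most often. With 8 query heads and 2 KV heads, query heads 0 to 3 would share KV head 0 and heads 4 to 7 would share KV head 1, so the cache only needs to store 2 heads per layer.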
Tests: 174/174 bitnet tests pass (9 new: RoPE, KV cache, tokenizer roundtrip,
attention weights, byte-level fallback, cache operations)
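The tokenizer roundtrip and byte-level fallback tests mentioned above presumably check a property along the following lines. This is a sketch under stated assumptions, not the tokenizer from the commit: the message only says the fallback has 260 tokens, so the split into 4 special tokens followed by 256 byte tokens, the ID offset, and the names `NUM_SPECIAL`, `encode_bytes`, and `decode_bytes` are all hypothetical.

```rust
/// Assumed layout: IDs 0..4 are special tokens, IDs 4..260 map to bytes 0..=255.
pub const NUM_SPECIAL: u32 = 4;
pub const VOCAB_SIZE: u32 = NUM_SPECIAL + 256;

/// Encodes text as one token per UTF-8 byte, shifted past the special IDs.
pub fn encode_bytes(text: &str) -> Vec<u32> {
    text.bytes().map(|b| b as u32 + NUM_SPECIAL).collect()
}

/// Decodes byte tokens back to text. Special and out-of-range IDs are
/// skipped; invalid UTF-8 sequences are replaced rather than causing a panic.
pub fn decode_bytes(ids: &[u32]) -> String {
    let bytes: Vec<u8> = ids
        .iter()
        .filter(|&&id| id >= NUM_SPECIAL && id < VOCAB_SIZE)
        .map(|&id| (id - NUM_SPECIAL) as u8)
        .collect();
    String::from_utf8_lossy(&bytes).into_owned()
}
```

Under these assumptions, `decode_bytes(&encode_bytes(s))` returns `s` for any valid UTF-8 string, including multi-byte characters, because every byte has its own token.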
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK1
Parent: 828e500
1 file changed
Lines changed: 667 additions & 57 deletions