Commit 4370ddb


feat: Add real attention, KV cache, RoPE, and tokenizer to BitNet backend

Resolves the three blocking gaps that prevented end-to-end inference:

1. **Real attention layer** (was pass-through placeholder):
   - AttentionWeights struct with Q/K/V/O ternary projections
   - GQA (Grouped Query Attention) with configurable num_heads / num_kv_heads
   - Pre-computed RoPE cos/sin tables (apply_rope)
   - Per-layer KV cache for autoregressive generation
   - forward_token() for efficient single-token inference with cache
   - forward_layer_cached() with full attention computation
   - forward_layer_nocache() legacy path for backwards compatibility

2. **Tokenizer integration** (was raw bytes → token IDs):
   - load_tokenizer_from_gguf() extracts vocab + merges from GGUF metadata
   - Byte-level fallback tokenizer (260 tokens) when GGUF has no vocab
   - TokenizerBridge implements crate-level Tokenizer trait
   - tok() accessor for direct tokenizer access

3. **generate() uses tokenizer** (was returning [token_id] strings):
   - Encodes prompt via BPE tokenizer before forward pass
   - Decodes generated tokens back to text
   - generate_cached() for KV-cached autoregressive generation
   - get_embeddings() now uses tokenizer for text encoding
   - reset_cache() to clear KV state between sequences

Tests: 174/174 bitnet tests pass (9 new: RoPE, KV cache, tokenizer roundtrip, attention weights, byte-level fallback, cache operations)

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
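For readers without the diff open, the attention pieces listed in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in backend.rs: `RopeTable`, `KvCache`, and `gqa_attend` are hypothetical names, the half-split pairing of rotated elements is an assumed convention, and dense `f32` math stands in for the ternary projections.

```rust
/// Pre-computed RoPE cos/sin tables (hypothetical layout: [pos][i], i < head_dim / 2).
pub struct RopeTable {
    half_dim: usize,
    cos: Vec<f32>,
    sin: Vec<f32>,
}

impl RopeTable {
    pub fn new(head_dim: usize, max_seq_len: usize, theta: f32) -> Self {
        assert!(head_dim % 2 == 0, "head_dim must be even");
        let half_dim = head_dim / 2;
        let mut cos = Vec::with_capacity(max_seq_len * half_dim);
        let mut sin = Vec::with_capacity(max_seq_len * half_dim);
        for pos in 0..max_seq_len {
            for i in 0..half_dim {
                // Frequency for pair i: theta^(-2i / head_dim).
                let freq = theta.powf(-2.0 * i as f32 / head_dim as f32);
                let angle = pos as f32 * freq;
                cos.push(angle.cos());
                sin.push(angle.sin());
            }
        }
        Self { half_dim, cos, sin }
    }

    /// Rotates one head vector in place, pairing element i with i + half_dim
    /// (assumed convention; an interleaved pairing is also common).
    pub fn apply(&self, x: &mut [f32], pos: usize) {
        assert_eq!(x.len(), 2 * self.half_dim);
        let base = pos * self.half_dim;
        for i in 0..self.half_dim {
            let (c, s) = (self.cos[base + i], self.sin[base + i]);
            let (a, b) = (x[i], x[i + self.half_dim]);
            x[i] = a * c - b * s;
            x[i + self.half_dim] = a * s + b * c;
        }
    }
}

/// Per-layer KV cache: one row of num_kv_heads * head_dim floats per position.
pub struct KvCache {
    kv_dim: usize,
    keys: Vec<f32>,
    values: Vec<f32>,
}

impl KvCache {
    pub fn new(num_kv_heads: usize, head_dim: usize) -> Self {
        Self { kv_dim: num_kv_heads * head_dim, keys: Vec::new(), values: Vec::new() }
    }

    /// Appends the key/value rows for the newest token.
    pub fn push(&mut self, k: &[f32], v: &[f32]) {
        assert_eq!(k.len(), self.kv_dim);
        assert_eq!(v.len(), self.kv_dim);
        self.keys.extend_from_slice(k);
        self.values.extend_from_slice(v);
    }

    /// Number of cached positions.
    pub fn len(&self) -> usize {
        self.keys.len() / self.kv_dim
    }

    /// Clears all cached state between sequences.
    pub fn reset(&mut self) {
        self.keys.clear();
        self.values.clear();
    }
}

/// Single-token grouped-query attention over the cache.
/// Each group of num_heads / num_kv_heads query heads shares one KV head.
pub fn gqa_attend(q: &[f32], cache: &KvCache, num_heads: usize,
                  num_kv_heads: usize, head_dim: usize) -> Vec<f32> {
    assert!(num_heads % num_kv_heads == 0);
    let group = num_heads / num_kv_heads;
    let scale = 1.0 / (head_dim as f32).sqrt();
    let mut out = vec![0.0f32; num_heads * head_dim];
    for h in 0..num_heads {
        let kv_h = h / group; // KV head shared by this query head
        let qh = &q[h * head_dim..(h + 1) * head_dim];
        // Scaled dot product against every cached key for this KV head.
        let mut scores: Vec<f32> = (0..cache.len()).map(|t| {
            let off = t * cache.kv_dim + kv_h * head_dim;
            let k = &cache.keys[off..off + head_dim];
            qh.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale
        }).collect();
        // Numerically stable softmax.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let mut denom = 0.0;
        for s in scores.iter_mut() {
            *s = (*s - max).exp();
            denom += *s;
        }
        // Weighted sum of cached values.
        for (t, s) in scores.iter().enumerate() {
            let off = t * cache.kv_dim + kv_h * head_dim;
            let w = s / denom;
            for d in 0..head_dim {
                out[h * head_dim + d] += w * cache.values[off + d];
            }
        }
    }
    out
}
```

In a cached decode step along these lines, the caller would rotate the new token's query and key with `apply` at the current position, `push` the key and value rows, and then call `gqa_attend`. Because only the newest token's projections are computed, per-token cost scales with the cached length rather than with a full re-run over the prefix.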

1 parent 828e500 commit 4370ddb

1 file changed

crates/ruvllm/src/bitnet
- backend.rs
