Problem / Motivation
Currently, V dequantization processes all positions regardless of their attention weight. For long contexts, most positions have near-zero attention weights after softmax. Skipping dequantization for these positions saves significant compute.
llama.cpp benchmarks show a +22.8% decode speedup at 32K context with a threshold of 1e-6.
Solution
After computing attention scores and softmax:
- Identify positions where `attention_weight > 1e-6`
- Only dequantize V at those positions
- Compute weighted sum only over dequantized positions
Works on both CPU (fused decode path) and GPU (fused kernel).
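A minimal sketch of the thresholded weighted sum on the CPU path, assuming a per-position quantized V layout. The type `QuantizedV`, its `dequantize_position` method, and `attend_v_sparse` are illustrative placeholders, not the actual turboquant API; only the 1e-6 threshold comes from this issue.

```rust
/// Attention weights at or below this threshold are treated as zero (value from this issue).
const ATTN_WEIGHT_THRESHOLD: f32 = 1e-6;

/// Illustrative container for quantized V vectors, one block per cached position.
struct QuantizedV {
    blocks: Vec<Vec<u8>>,
    head_dim: usize,
}

impl QuantizedV {
    /// Placeholder dequantization of a single position's V vector.
    fn dequantize_position(&self, pos: usize) -> Vec<f32> {
        self.blocks[pos].iter().map(|&b| f32::from(b)).collect()
    }
}

/// Weighted sum over V that skips dequantization for near-zero attention weights.
fn attend_v_sparse(attn_weights: &[f32], v: &QuantizedV) -> Vec<f32> {
    let mut out = vec![0.0f32; v.head_dim];
    for (pos, &w) in attn_weights.iter().enumerate() {
        // Step 1: only positions with attention_weight > threshold are considered.
        if w <= ATTN_WEIGHT_THRESHOLD {
            continue;
        }
        // Step 2: dequantize V only at the surviving positions.
        let v_vec = v.dequantize_position(pos);
        // Step 3: accumulate the weighted sum over dequantized positions only.
        for (acc, x) in out.iter_mut().zip(v_vec.iter()) {
            *acc += w * x;
        }
    }
    out
}
```

The same position-level guard would sit inside the fused GPU kernel's inner loop, so skipped positions never touch their quantized V blocks there either.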
Key files
- `turboquant/src/cache/fused_cpu.rs` — CPU path (after Fused CPU decode: tensor wrapper + integration in PqoCache CPU path #19)
- `turboquant/src/cache/cuda/kernels/tq_attention_kernel.cu` — GPU kernel
- `mistralrs-paged-attn/src/cuda/tq_paged_attention.cu` — PA kernel (after Fused PagedAttention kernel for compressed KV cache #27)
Acceptance criteria
- `cargo fmt --check` clean