Problem / Motivation
Currently, V dequantization processes all positions regardless of their attention weight. For long contexts, most positions have near-zero attention weights after softmax. Skipping dequantization for these positions saves significant compute.
llama.cpp benchmarks show a +22.8% decode speedup at 32K context with a threshold of 1e-6.
Solution
After computing attention scores and softmax:
- Identify positions where `attention_weight > 1e-6`
- Only dequantize V at those positions
- Compute weighted sum only over dequantized positions
Works on both CPU (fused decode path) and GPU (fused kernel).
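A minimal sketch of the thresholded weighted sum on the CPU path, assuming a per-position quantized V layout. The type `QuantizedV`, its `dequantize_position` method, and `attend_v_sparse` are illustrative placeholders, not the actual turboquant API; only the 1e-6 threshold comes from this issue.

```rust
/// Attention weights at or below this threshold are treated as zero (value from this issue).
const ATTN_WEIGHT_THRESHOLD: f32 = 1e-6;

/// Illustrative container for quantized V vectors, one block per cached position.
struct QuantizedV {
    blocks: Vec<Vec<u8>>,
    head_dim: usize,
}

impl QuantizedV {
    /// Placeholder dequantization of a single position's V vector.
    fn dequantize_position(&self, pos: usize) -> Vec<f32> {
        self.blocks[pos].iter().map(|&b| f32::from(b)).collect()
    }
}

/// Weighted sum over V that skips dequantization for near-zero attention weights.
fn attend_v_sparse(attn_weights: &[f32], v: &QuantizedV) -> Vec<f32> {
    let mut out = vec![0.0f32; v.head_dim];
    for (pos, &w) in attn_weights.iter().enumerate() {
        // Step 1: only positions with attention_weight > threshold are considered.
        if w <= ATTN_WEIGHT_THRESHOLD {
            continue;
        }
        // Step 2: dequantize V only at the surviving positions.
        let v_vec = v.dequantize_position(pos);
        // Step 3: accumulate the weighted sum over dequantized positions only.
        for (acc, x) in out.iter_mut().zip(v_vec.iter()) {
            *acc += w * x;
        }
    }
    out
}
```

The same position-level guard would sit inside the fused GPU kernel's inner loop, so skipped positions never touch their quantized V blocks there either.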
Key files
- `turboquant/src/cache/fused_cpu.rs` — CPU path (after Fused CPU decode: tensor wrapper + integration in PqoCache CPU path #19)
- `turboquant/src/cache/cuda/kernels/tq_attention_kernel.cu` — GPU kernel
- `mistralrs-paged-attn/src/cuda/tq_paged_attention.cu` — PA kernel (after Fused PagedAttention kernel for compressed KV cache #27)
Acceptance criteria
- `cargo fmt --check` clean