Sparse V dequantization — only dequant where attention weight > threshold #36

Description

@SaschaOnTour

Problem / Motivation

Currently, V dequantization processes all positions regardless of their attention weight. For long contexts, most positions have near-zero attention weights after softmax. Skipping dequantization for these positions saves significant compute.

llama.cpp benchmarks show +22.8% decode speedup at 32K context with threshold 1e-6.
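As a toy illustration of how sharply softmax concentrates (illustrative numbers only, not from the repo): with a 32K-long score vector where a couple of positions dominate, nearly every weight lands below 1e-6.

```rust
fn main() {
    // Illustrative softmax over long-context scores with a few
    // dominant positions; all numbers here are made up.
    let n = 32_768;
    let mut scores = vec![0.0f32; n];
    scores[0] = 20.0;
    scores[1] = 18.0;

    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let below = exps.iter().filter(|&&e| e / sum <= 1e-6).count();
    println!("{below} of {n} weights are <= 1e-6"); // prints 32766 of 32768
}
```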

Solution

After computing attention scores and softmax:

  1. Identify positions where attention_weight > threshold (default 1e-6)
  2. Only dequantize V at those positions
  3. Compute weighted sum only over dequantized positions

Works on both CPU (fused decode path) and GPU (fused kernel).
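A minimal sketch of that decode-path logic in Rust. The names here are assumptions for illustration only: `probs` holds the post-softmax weights per cached position, and `dequant_v_row` stands in for the repo's actual V dequantizer, whatever its real signature.

```rust
/// Sparse V accumulation for a single head at decode time.
/// Positions whose attention weight is at or below `threshold`
/// are never dequantized.
fn sparse_weighted_sum(
    probs: &[f32],
    head_dim: usize,
    threshold: f32,
    dequant_v_row: impl Fn(usize) -> Vec<f32>,
) -> Vec<f32> {
    let mut out = vec![0.0f32; head_dim];
    for (pos, &w) in probs.iter().enumerate() {
        // Step 1: skip near-zero weights before touching quantized V.
        if w <= threshold {
            continue;
        }
        // Steps 2-3: dequantize this row and accumulate its contribution.
        let v = dequant_v_row(pos);
        for (acc, &x) in out.iter_mut().zip(v.iter()) {
            *acc += w * x;
        }
    }
    out
}
```

Each skipped weight is at most `threshold`, so the dropped probability mass is bounded by threshold × context length; in practice the skipped weights are orders of magnitude smaller, which is what the max-diff criterion below verifies.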

Key files

Acceptance criteria

  • Sparse V produces correct output (max diff < 0.001 vs full dequant; see the test sketch after this list)
  • Threshold is configurable (default 1e-6)
  • Benchmark: measurable speedup at context >= 4096
  • Quality test: no regression
  • cargo fmt --check clean
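A hedged sketch of the first two criteria as a test, reusing the hypothetical `sparse_weighted_sum` above; a threshold of 0.0 keeps every nonzero position and so serves as the full-dequant reference.

```rust
// Hypothetical stand-in for a real V dequantizer.
fn fake_dequant(pos: usize) -> Vec<f32> {
    (0..64).map(|d| ((pos * 64 + d) as f32).sin()).collect()
}

#[test]
fn sparse_v_matches_full_dequant() {
    let (head_dim, ctx_len) = (64, 4096);

    // Fake post-softmax weights: two dominant positions, the rest tiny.
    let mut probs = vec![1e-9f32; ctx_len];
    probs[0] = 0.5;
    probs[ctx_len - 1] = 0.5;

    let full = sparse_weighted_sum(&probs, head_dim, 0.0, fake_dequant);
    let sparse = sparse_weighted_sum(&probs, head_dim, 1e-6, fake_dequant);

    let max_diff = full
        .iter()
        .zip(&sparse)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    assert!(max_diff < 1e-3, "max diff {max_diff} exceeds tolerance");
}
```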
