Tom Turney, Independent Researcher (GitHub: @TheTom)
The TurboQuant paper (ICLR 2026) tests KV cache compression with symmetric configurations (the same quantization for both K and V). In practice, symmetric turbo quantization is catastrophic on certain model families when stacked on low-bit weight quantization (Q4_K_M). We show that this failure is entirely K-side: V compression has negligible effect on attention quality when K precision is maintained.
This asymmetry arises from the attention mechanism itself. K determines attention routing via softmax, which amplifies small errors exponentially. V errors scale linearly through the weighted sum. On Qwen2.5-7B Q4_K_M, symmetric turbo3 produces PPL 3,556 (catastrophic), while asymmetric q8_0-K + turbo3-V produces PPL 6.71 (+2.0% vs baseline). Same total compression budget, opposite quality outcomes.
We validate this finding across 7 models (1.5B to 104B), 3 weight quantizations (Q4_K_M, Q6_K, Q8_0), 3 GPU backends (Metal, CUDA, HIP), and 2 Apple Silicon generations (M2 Pro, M5 Max). The practical recommendation is simple: compress V maximally, spend bits on K.
KV cache compression reduces memory consumption during LLM inference, enabling longer context windows and larger models on consumer hardware. Google's TurboQuant (ICLR 2026) achieves 4.6-5.12x compression via Walsh-Hadamard rotation and polar quantization. The paper validates quality using fp16 model weights.
In practice, most users run quantized model weights (Q4_K_M, Q6_K) to fit models in limited VRAM. When TurboQuant KV cache compression is applied on top of already-quantized weights, the errors compound. We refer to this as quantization stacking: weight quantization error + KV cache quantization error.
The question is: does stacking affect K and V equally?
The attention mechanism computes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
K determines which tokens receive attention weight via softmax. V determines what information flows from those tokens. These are fundamentally different roles:
- K errors shift the Q*K dot-product scores. Softmax amplifies small score differences exponentially: a perturbation of $\epsilon$ in logit space produces an $O(e^\epsilon)$ change in the attention distribution. On sensitive models, this can flip which tokens dominate the output.
- V errors scale linearly through the weighted sum. If a V element is off by $\epsilon$, the output error is bounded by $w_i \cdot \epsilon$, where $w_i$ is the attention weight for that position. For non-dominant positions (most of them at long context), $w_i \approx 0$ and the error vanishes.
This predicts that K precision should matter far more than V precision for output quality. Our experiments confirm this.
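A minimal numeric sketch of the two error paths (the logit values, $\epsilon$, and dimensions below are arbitrary illustrative assumptions, not measurements from our experiments):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
scores = rng.normal(size=16)       # attention logits for 16 cached tokens
values = rng.normal(size=(16, 8))  # V cache: 16 tokens x head_dim 8
weights = softmax(scores)
out = weights @ values
eps = 0.5

# K-side error: perturb one logit; softmax reweights every position,
# so the error propagates through the entire weighted sum.
k_scores = scores.copy()
k_scores[scores.argmax()] -= eps
k_err = np.linalg.norm(softmax(k_scores) @ values - out)

# V-side error: perturb one V row; the output error enters only through
# that row's attention weight w_i.
v_values = values.copy()
v_values[3] += eps
v_err = np.linalg.norm(weights @ v_values - out)

print(k_err)                          # routing shift, unbounded in general
print(v_err)                          # exactly w_3 * eps * sqrt(head_dim) here
print(weights[3] * eps * np.sqrt(8))  # the linear bound, matches v_err
```

The V-side error is capped by the attention weight of the perturbed position; the K-side error reroutes probability mass across every position and is not bounded by any single weight.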
All PPL measured on wikitext-2-raw (512 context, 20 chunks) unless noted. All tests on Metal with flash attention, full GPU offload.
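Throughout, PPL is the standard exponentiated mean negative log-likelihood, computed per chunk by llama.cpp's perplexity tool:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\!\left(x_i \mid x_{<i}\right)\right)$$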
Qwen2.5-7B Q4_K_M demonstrates the asymmetry most dramatically.
| K | V | M5 Max PPL | M2 Pro PPL | vs q8_0 | Status |
|---|---|---|---|---|---|
| q8_0 | q8_0 | 6.577 | 6.579 | baseline | healthy |
| q8_0 | turbo4 | 6.644 | 6.603 | +1.0% | healthy |
| q8_0 | turbo3 | 6.707 | 6.715 | +2.0% | healthy |
| q8_0 | turbo2 | 6.911 | — | +5.1% | healthy |
| turbo4 | turbo4 | 217.7 | 227.5 | catastrophic | avoid |
| turbo3 | turbo3 | 3,556 | 3,778 | catastrophic | avoid |
V compression is essentially free: q8_0-K + turbo2-V (2-bit V, 6.4x V compression) gives +5.1% PPL. The model is fully coherent.
K compression is catastrophic: turbo4-K + turbo4-V (same bits on both) gives PPL 218. The model is incoherent.
Roughly the same total bit budget, opposite results: PPL differs by ~500x depending on which cache you compress.
Cross-hardware matched: M2 Pro and M5 Max produce equivalent quality patterns.
Qwen2.5-1.5B Q4_K_M shows the same failure:

| K | V | PPL | vs q8_0 | Status |
|---|---|---|---|---|
| turbo3 | turbo3 | 8,641+ | catastrophic | avoid |
Symmetric turbo is catastrophic on Qwen2.5 at all sizes, not just 7B. The sensitivity is architecture-dependent.
| K | V | M5 Max PPL | vs q8_0 | Status |
|---|---|---|---|---|
| q8_0 | q8_0 | 4.690 | baseline | healthy |
| q8_0 | turbo4 | 4.702 | +0.3% | healthy |
| q8_0 | turbo3 | 4.742 | +1.1% | healthy |
| q8_0 | turbo2 | 4.835 | +3.1% | healthy |
| turbo4 | turbo4 | 4.770 | +1.7% | healthy |
| turbo3 | turbo3 | 4.886 | +4.2% | healthy |
With Q8_0 weights, both symmetric and asymmetric work. The quantization stacking effect only appears when weight quantization is already aggressive (Q4_K_M). V compression gradient is clear: turbo4-V +0.3%, turbo3-V +1.1%, turbo2-V +3.1%. All healthy.
Mistral-24B Q4_K_M:

| K | V | PPL | vs q8_0 | Status |
|---|---|---|---|---|
| turbo3 | turbo3 | 4.987 | — | healthy |
Q4_K_M symmetric turbo is not universally broken. Mistral-24B handles it fine. The sensitivity is model-family-dependent, not purely a function of weight quantization.
Llama-3.1-70B Q4_K_M:

| K | V | PPL | vs q8_0 (3.257) | Status |
|---|---|---|---|---|
| q8_0 | turbo4 | 3.301 | +1.3% | healthy |
| q8_0 | turbo3 | 3.325 | +2.1% | healthy |
| q8_0 | turbo2 | 3.568 | +9.5% | healthy |
| turbo4 | turbo4 | 3.461 | +6.3% | healthy |
| turbo3 | turbo3 | 3.629 | +11.4% | usable |
| turbo2 | turbo2 | 5.161 | +58.5% | degraded |
Asymmetric rescue: q8_0/turbo2 = +9.5% vs symmetric turbo2/turbo2 = +58.5%, a 6x reduction in degradation from the same V format, just by preserving K precision.
Llama-70B tolerates symmetric turbo3 (+11.4%), but asymmetric q8_0/turbo3 is better (+2.1%). Larger models absorb quantization stacking better than smaller ones.
Command-R+ 104B Q4_K_M:

| K | V | PPL | vs q8_0 (6.192) | Status |
|---|---|---|---|---|
| q8_0 | turbo4 | 6.211 | +0.3% | healthy |
| q8_0 | turbo3 | 6.296 | +1.7% | healthy |
| q8_0 | turbo2 | 6.678 | +7.9% | healthy |
| turbo4 | turbo4 | 6.312 | +1.9% | healthy |
| turbo3 | turbo3 | 6.415 | +3.6% | healthy |
| turbo2 | turbo2 | 7.049 | +13.8% | usable |
104B tolerates symmetric turbo even better than 70B. turbo4/turbo4 is +1.9%, nearly lossless. The trend is clear: bigger models have more capacity to absorb quantization stacking.
Qwen2.5-7B Q4_K_M on AMD HIP (RX 9070 XT):

| K | V | PPL | vs q8_0 (7.794) | Status |
|---|---|---|---|---|
| q8_0 | turbo4 | 7.876 | +1.0% | healthy |
| turbo4 | turbo4 | 401.4 | catastrophic | avoid |
| turbo3 | turbo3 | 81,277 | catastrophic | avoid |
Asymmetric q8_0/turbo4 confirmed on AMD. Symmetric Q4_K_M failure is consistent across all three GPU vendors (Metal, CUDA, HIP).
Why do some models survive symmetric turbo on Q4_K_M while others don't?
The hypothesis: K cache quantization error compounds multiplicatively with weight quantization error in the attention logit computation. The Q*K dot product involves both quantized-weight Q projections and quantized-cache K values. When both are lossy, the logit errors can exceed softmax's tolerance threshold.
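Expanding the logit computation makes the compounding explicit. Writing $\epsilon_Q$ for the error that weight quantization induces in the Q projection and $\epsilon_K$ for the K cache quantization error:

$$(Q + \epsilon_Q)(K + \epsilon_K)^\top = QK^\top + Q\,\epsilon_K^\top + \epsilon_Q K^\top + \epsilon_Q\,\epsilon_K^\top$$

The two middle error terms are present when either source is lossy on its own; the cross term $\epsilon_Q \epsilon_K^\top$ appears only when both are, and all of it feeds into softmax's exponential amplification. V errors, by contrast, enter after the softmax and pick up no such cross term.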
The evidence supports this:
| Model | Weights | Symmetric turbo3 PPL | Status |
|---|---|---|---|
| Qwen2.5-1.5B | Q4_K_M | 8,641 | catastrophic |
| Qwen2.5-7B | Q4_K_M | 3,556 | catastrophic |
| Mistral-24B | Q4_K_M | 4.987 | healthy |
| Llama-70B | Q4_K_M | 3.629 | healthy |
| Command-R+ 104B | Q4_K_M | 6.415 | healthy |
The sensitivity is model-family-dependent. Qwen2.5 is sensitive at all sizes. Llama, Mistral, and Cohere are tolerant. Within tolerant families, larger models absorb stacking better (104B: +3.6% vs 70B: +11.4%).
V compression adds negligible error regardless of weight quantization. This is because V errors don't pass through softmax and don't interact multiplicatively with weight quantization in the same way.
```bash
# Safe default: asymmetric, full K precision, compressed V
llama-server -m model.gguf -ctk q8_0 -ctv turbo4 -fa on

# Symmetric: more compression, validated on Llama/Mistral/Cohere 24B+
llama-server -m model.gguf -ctk turbo3 -ctv turbo3 -fa on

# Asymmetric with turbo2-V (Boundary V auto-enabled)
llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa on
```

- Q8_0+ weights? Symmetric turbo works. Use `-ctk turbo4 -ctv turbo4` or `-ctk turbo3 -ctv turbo3`.
- Q4_K_M weights, unknown model? Start asymmetric: `-ctk q8_0 -ctv turbo4`.
- Q4_K_M weights, Qwen2.5? Must use asymmetric. Symmetric is catastrophic.
- Q4_K_M weights, Llama/Mistral/Cohere 24B+? Symmetric likely works, but validate PPL first (a helper encoding these rules is sketched below).
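A hypothetical helper encoding the decision rules above (for illustration only; not part of llama.cpp or the TurboQuant+ fork; the Q6_K branch follows @Madreag's finding that asymmetric is not needed on Q6_K+ weights):

```python
# Hypothetical helper encoding the decision rules above.
SENSITIVE = {"qwen2.5", "qwen3"}            # symmetric turbo fails on Q4_K_M
TOLERANT = {"llama", "mistral", "cohere"}   # validated symmetric-tolerant at 24B+

def recommend_kv_config(weight_quant: str, family: str, params_b: float) -> str:
    family = family.lower()
    if weight_quant.upper() in ("Q6_K", "Q8_0"):
        return "-ctk turbo3 -ctv turbo3"    # symmetric is safe on Q6_K+ weights
    if family in SENSITIVE:
        return "-ctk q8_0 -ctv turbo3"      # K precision is mandatory
    if family in TOLERANT and params_b >= 24:
        return "-ctk turbo3 -ctv turbo3"    # validate PPL before deploying
    return "-ctk q8_0 -ctv turbo4"          # unknown model: safe default

print(recommend_kv_config("Q4_K_M", "Qwen2.5", 7))  # -ctk q8_0 -ctv turbo3
```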
Asymmetric K/V trades some K-side compression for quality safety. The V-side savings still dominate for long context:
| Config | K bpv | V bpv | KV compression vs fp16 | Quality |
|---|---|---|---|---|
| q8_0/q8_0 | 8.5 | 8.5 | 1.9x | baseline |
| q8_0/turbo4 | 8.5 | 4.25 | 2.5x | +0.3-1.3% |
| q8_0/turbo3 | 8.5 | 3.125 | 2.8x | +1.1-2.1% |
| q8_0/turbo2 | 8.5 | 2.125 | 3.0x | +3.1-9.5% |
| turbo4/turbo4 | 4.25 | 4.25 | 3.8x | +1.7-6.3% |
| turbo3/turbo3 | 3.125 | 3.125 | 5.1x | model-dependent |
Even asymmetric q8_0/turbo3 achieves 2.8x total KV compression. At 128K context on a 70B model, that saves ~5GB of KV cache memory.
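As a sanity check on these numbers, a short sketch using Llama-3.1-70B dimensions as assumptions (80 layers, 8 KV heads under GQA, head_dim 128); exact savings depend on the model's KV layout and per-block format overhead:

```python
# KV cache size at given K/V bits-per-value (bpv), Llama-3.1-70B-like dims.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128   # assumed model dimensions
CTX = 128 * 1024                          # 128K context

def kv_gib(k_bpv: float, v_bpv: float) -> float:
    elems = LAYERS * KV_HEADS * HEAD_DIM * CTX    # elements in each of K and V
    return elems * (k_bpv + v_bpv) / 8 / 2**30    # total bytes -> GiB

print(f"{kv_gib(16.0, 16.0):.1f} GiB  fp16")         # ~40.0 GiB
print(f"{kv_gib(8.5, 8.5):.1f} GiB  q8_0/q8_0")      # ~21.2 GiB
print(f"{kv_gib(8.5, 3.125):.1f} GiB  q8_0/turbo3")  # ~14.5 GiB
```

Under these assumptions, q8_0/turbo3 saves several GiB of KV cache versus q8_0/q8_0 at 128K context, in the same ballpark as the figure above.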
These findings have been independently confirmed by multiple researchers:
@sztlink (Felipe Sztutman) — Qwen3-4B, RTX 4090, tonbistudio/turboquant-pytorch (2026-03-31):
- fp16-K + 2bit-V: 1.000 cosine similarity, 100% top-1 match
- All degradation from K compression, V has zero effect
- Boundary layer gap scales with K compression (-0.001 at 4-bit K, -0.010 at 2-bit K)
- Layer 0 K isolation test: protecting boundary K layers (fp16) did NOT reduce the gap — gap grew more negative, meaning boundary layers already compress better than middle layers. Mechanistic explanation: extreme K norms (146.8 at layer 0) produce tightly Gaussian distributions after normalization, which are ideal for Lloyd-Max codebooks. Middle layers (K norm 20–40) have more variable distributions and compress slightly less accurately. Confirms uniform K compression is stable — no per-layer K precision tuning needed
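For reference, the cosine-similarity and top-1 figures in these reports compare per-token logits between a baseline run and a compressed run. A minimal sketch of that comparison (array names and shapes are assumptions, not taken from any cited harness):

```python
import numpy as np

def compare_logits(base: np.ndarray, comp: np.ndarray):
    # base, comp: (tokens, vocab) logits from baseline and compressed runs
    dot = (base * comp).sum(axis=1)
    norms = np.linalg.norm(base, axis=1) * np.linalg.norm(comp, axis=1)
    cosine = float((dot / norms).mean())  # 1.000 = identical direction
    top1 = float((base.argmax(axis=1) == comp.argmax(axis=1)).mean())
    return cosine, top1
```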
@HyperionMS2040 — 10-model CUDA sweep, RTX 3090 (2026-03-30):
- Qwen2.5-7B symmetric turbo3: catastrophic (PPL 3,472). Llama 3.1 8B symmetric turbo3: +6.4%
- q8_0/turbo4 "lossless across all tested architectures" (4 architectures)
- Confirms model-family-dependent sensitivity pattern
AMD HIP validation — RX 9070 XT, gfx1201 (2026-03-29):
- Asymmetric q8_0/turbo4 confirmed: +1.0% PPL
- Symmetric catastrophic: PPL 81,277 (turbo3/turbo3 on Qwen2.5-7B Q4_K_M)
- Consistent with Metal and CUDA findings
- (Author's own testing on Windows AMD hardware)
@jesusmb1995 — Vulkan backend, Mistral-7B Q4_K_S (2026-03-31):
- Mixed K/V Vulkan results support asymmetric thesis: q8_0-K + pq3_0-V = PPL 6.9426 (+0.64%), f16-K + pq4_0-V = PPL 6.8901 (-0.12%)
- V compression nearly free when K precision maintained, consistent across a fourth GPU backend (Vulkan)
- Note: uses `pq`/`tbq` type naming (jesusmb1995's Vulkan implementation), same underlying WHT + PolarQuant algorithm
@stragulus — Vulkan, AMD Radeon 7900 XTX (2026-03-31):
- Independent confirmation on jesusmb1995's Vulkan branch: q8_0-K + pq3_0-V = PPL 6.9226 (+0.37%), f16-K + pq4_0-V = PPL 6.8837 (-0.19%)
- V compression free on fifth hardware/backend combination (Vulkan + AMD discrete)
@sztlink (Felipe Sztutman) — Qwen3-30B-A3B Q4_K_M, RTX 4090, AmesianX v1.2.0 (2026-04-01):
- Production PPL validation (wikitext-2-raw, ctx=512, 20 chunks). First SM89 (Ada) data for this model
- q8_0/tbq3 (asymmetric): PPL 7.5910 (+0.57% vs f16 7.5477) — tightest asymmetric result to date, confirms "V is free" on MoE
- tbq3/tbq3 (symmetric): PPL 9.5221 (+26.16%) — catastrophic. Extends Qwen symmetric sensitivity from Qwen2.5 dense to Qwen3 MoE
- tbq4/tbq4 (symmetric): PPL 7.9950 (+5.93%) — healthy, matching q4_0. The turbo4-K catastrophic failures reported by zekrom-vale were caused by kernel bugs in TheTom's fork (documented in turbo4-resurrection paper) and head_dim misdetection in AmesianX's fork (fixed in v1.3.0), not an inherent turbo4 or IQ4_XS issue. AmesianX independently confirmed turbo4-K works correctly: PPL 6.73 (+7.5%) on Qwen3-30B-A3B and 6.20 (+0.6%) on Qwen3.5-27B distill
- Adds Qwen3 MoE to Section 4 sensitivity table: Qwen family is sensitive across architectures (dense and MoE), not just Qwen2.5
@Madreag — Optimized CUDA fork, 4 GPUs (SM86x2/SM89/SM120), 1,351+ iterations (2026-04-01):
- Full asymmetric K/V PPL matrix (ctx=512, Qwen3.5-27B Q6_K): V type dominates PPL (columns vary more than rows). K=turbo3/V=turbo3 (6.7251) approximately equals K=q8_0/V=turbo3 (6.6885). Independent CUDA confirmation that K precision matters less than V precision for quality, and that asymmetric is not needed on Q6_K+ weights
- Detailed results with KL divergence, NIAH, and speed across 4 GPUs and 3 architecture generations
@mudler (Ettore Di Giacinto) — APEX launch, LocalAI (44.7k stars) (2026-04-01):
- Released APEX (Adaptive Precision for EXpert Models), a MoE-aware mixed-precision weight quantization method for Qwen3.5-35B-A3B
- Explicitly recommends pairing APEX with TurboQuant KV cache compression in the launch announcement. First major open source project to officially integrate TurboQuant into their recommended stack
- APEX + TurboQuant at 8K context: +14% prompt processing speedup across all APEX tiers, zero quality loss, 4.6x KV cache compression
- APEX Mini (12.2 GB) + TurboQuant = 35B MoE at 8K context on a 16GB consumer GPU
- Independently confirms boundary layer sensitivity for weights: "first and last 5 transformer layers are 10x more sensitive than the middle," consistent with our Boundary V finding for KV cache
- **Sensitivity root cause not proven.** We observe that Qwen2.5 is sensitive and Llama/Mistral are not, but we have not identified the specific architectural feature that causes this. Possible factors: attention head count, KV head ratio, weight distribution, or training procedure.
- **PPL is a proxy.** Perplexity on wikitext-2 at 512 context is a noisy single metric. Task-specific benchmarks (reasoning, code, retrieval) may show different sensitivity patterns.
- **Limited model families.** Tested on Qwen2.5, Llama-3.1, Mistral, Command-R+, phi-4, and Qwen3.5. Other architectures (DeepSeek, Gemma, GPT-class) are untested.
- **Metal primary, CUDA/HIP secondary.** Most experiments run on Apple Silicon Metal. CUDA and HIP results confirm the pattern but with fewer model/config combinations tested.
V cache compression is effectively free. K cache precision is the dominant lever for attention quality. This is a direct consequence of the attention mechanism: K errors are amplified exponentially through softmax, while V errors scale linearly through the weighted sum.
The practical implication is simple: use asymmetric K/V configs. Compress V aggressively (turbo2 or turbo3), keep K at q8_0 or higher. This gives meaningful compression (2.5-3.0x total KV) with minimal quality loss (+0.3-2.1% PPL) on all tested models, including those where symmetric turbo fails catastrophically.
For models validated as symmetric-tolerant (Llama, Mistral, Cohere at 24B+), symmetric turbo3/turbo3 offers better compression (5.1x) at acceptable quality (+3.6-11.4% PPL). But asymmetric is always the safe default.
The asymmetric K/V findings were first shared publicly in the llama.cpp TurboQuant discussion thread starting March 26, 2026. Initial recommendations for asymmetric configs and the observation that K precision dominates quality were posted there before being formalized in this paper.
- TurboQuant paper: arXiv:2504.19874 (ICLR 2026)
- TurboQuant+ implementation: github.com/TheTom/turboquant_plus
- llama.cpp fork: github.com/TheTom/llama-cpp-turboquant
- Sparse V dequantization: sparse-v-dequant.md
- Boundary V: layer-aware-v-compression.md
- M5 Max stress test: m5-max-stress-test.md
- Configuration recommendations: turboquant-recommendations.md