Asymmetric K/V Cache Compression: Why V is Free and K is Everything

Tom Turney · Independent Researcher · GitHub: @TheTom


Abstract

The TurboQuant paper (ICLR 2026) tests KV cache compression with symmetric configurations (the same quantization for both K and V). In practice, symmetric turbo quantization is catastrophic on certain model families when combined with low-bit weight quantization (Q4_K_M). We show that this failure is entirely K-side: V compression has a negligible effect on attention quality as long as K precision is maintained.

This asymmetry arises from the attention mechanism itself. K determines attention routing via softmax, which amplifies small errors exponentially; V errors scale linearly through the weighted sum. On Qwen2.5-7B Q4_K_M, symmetric turbo3 produces PPL 3,556 (catastrophic), while asymmetric q8_0-K + turbo3-V produces PPL 6.71 (+2.0% vs baseline): keeping K at q8_0 and compressing only V flips the outcome entirely.

We validate this finding across 7 models (1.5B to 104B), 3 weight quantizations (Q4_K_M, Q6_K, Q8_0), 3 GPU backends (Metal, CUDA, HIP), and 2 Apple Silicon generations (M2 Pro, M5 Max). The practical recommendation is simple: compress V maximally, spend bits on K.


1. Introduction

KV cache compression reduces memory consumption during LLM inference, enabling longer context windows and larger models on consumer hardware. Google's TurboQuant (ICLR 2026) achieves 4.6-5.12x compression via Walsh-Hadamard rotation and polar quantization. The paper validates quality using fp16 model weights.

In practice, most users run quantized model weights (Q4_K_M, Q6_K) to fit models in limited VRAM. When TurboQuant KV cache compression is applied on top of already-quantized weights, the errors compound. We refer to this as quantization stacking: weight quantization error + KV cache quantization error.

The question is: does stacking affect K and V equally?


2. The Asymmetry: Softmax Amplification

The attention mechanism computes:

$$O = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) \cdot V$$

K determines which tokens receive attention weight via softmax. V determines what information flows from those tokens. These are fundamentally different roles:

  • K errors shift the Q*K dot product scores. Softmax amplifies small score differences exponentially: a perturbation of $\epsilon$ in logit space produces an $O(e^\epsilon)$ change in the attention distribution. On sensitive models, this can flip which tokens dominate the output.

  • V errors scale linearly through the weighted sum. If a V element is off by $\epsilon$, the output error is bounded by $w_i \cdot \epsilon$ where $w_i$ is the attention weight for that position. For non-dominant positions (most of them at long context), $w_i \approx 0$ and the error vanishes.
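Concretely, the exponential claim is just the softmax odds ratio: perturbing a single attention logit $s_i$ by $\epsilon$ while the others stay fixed rescales that token's odds against any other token $j$ by exactly $e^\epsilon$:

$$\frac{w_i'}{w_j'} = \frac{e^{s_i + \epsilon}}{e^{s_j}} = e^{\epsilon} \cdot \frac{w_i}{w_j}$$

At $\epsilon \approx 0.7$ this is already a 2x shift in relative attention, enough to change which token dominates when logit gaps are small.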

This predicts that K precision should matter far more than V precision for output quality. Our experiments confirm this.
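A minimal NumPy sketch of the asymmetry (illustrative only: the random Gaussian data, dimensions, and noise level are arbitrary choices, not the paper's experimental setup). Identical noise injected into K frequently flips which token wins the attention argmax, while the same noise injected into V is attenuated by the attention weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 64, 256, 200              # head dim, context length, repeats

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

flips, v_errs = 0, []
for _ in range(trials):
    q = rng.standard_normal(d)
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))
    w = softmax(K @ q / np.sqrt(d))      # clean attention weights

    eps = 0.25                           # simulated cache quantization noise
    # K noise shifts the logits, which can re-route attention entirely
    s_noisy = (K + eps * rng.standard_normal((n, d))) @ q / np.sqrt(d)
    flips += int(np.argmax(s_noisy) != np.argmax(w))

    # V noise passes linearly through the weighted sum; each position's
    # contribution is bounded by w_i * |noise_i|, so diffuse weights damp it
    v_errs.append(np.linalg.norm(w @ (eps * rng.standard_normal((n, d)))))

print(f"top-1 attention flips under K noise: {flips}/{trials}")
print(f"mean V-side output error: {np.mean(v_errs):.3f} "
      f"(per-row noise norm ~{eps * np.sqrt(d):.2f})")
```

In this toy setting, the same $\epsilon$ that barely perturbs the V-side output regularly changes which token receives the most attention.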


3. Results

All PPL measured on wikitext-2-raw (512 context, 20 chunks) unless noted. All tests use flash attention with full GPU offload; Sections 3.1-3.6 run on Metal, Section 3.7 on HIP.

3.1 The Catastrophic Case: Qwen2.5-7B Q4_K_M

This model demonstrates the asymmetry most dramatically.

| K | V | M5 Max PPL | M2 Pro PPL | vs q8_0 | Status |
|---|---|---|---|---|---|
| q8_0 | q8_0 | 6.577 | 6.579 | baseline | healthy |
| q8_0 | turbo4 | 6.644 | 6.603 | +1.0% | healthy |
| q8_0 | turbo3 | 6.707 | 6.715 | +2.0% | healthy |
| q8_0 | turbo2 | 6.911 | — | +5.1% | healthy |
| turbo4 | turbo4 | 217.7 | 227.5 | catastrophic | avoid |
| turbo3 | turbo3 | 3,556 | 3,778 | catastrophic | avoid |

V compression is essentially free: q8_0-K + turbo2-V (2-bit V, 6.4x V compression) gives +5.1% PPL. The model is fully coherent.

K compression is catastrophic: turbo4-K + turbo4-V (same bits on both) gives PPL 218. The model is incoherent.

Similar total bit budgets, opposite results: depending on which cache absorbs the compression, PPL differs by up to ~500x (6.707 for q8_0/turbo3 vs 3,556 for turbo3/turbo3).

Cross-hardware matched: M2 Pro and M5 Max produce equivalent quality patterns.

3.2 Qwen2.5-1.5B Q4_K_M

| K | V | PPL | vs q8_0 | Status |
|---|---|---|---|---|
| turbo3 | turbo3 | 8,641+ | catastrophic | avoid |

Symmetric turbo is catastrophic on Qwen2.5 at both tested sizes, not just 7B. The sensitivity is architecture-dependent.

3.3 phi-4-14B (Q8_0 Weights)

| K | V | M5 Max PPL | vs q8_0 | Status |
|---|---|---|---|---|
| q8_0 | q8_0 | 4.690 | baseline | healthy |
| q8_0 | turbo4 | 4.702 | +0.3% | healthy |
| q8_0 | turbo3 | 4.742 | +1.1% | healthy |
| q8_0 | turbo2 | 4.835 | +3.1% | healthy |
| turbo4 | turbo4 | 4.770 | +1.7% | healthy |
| turbo3 | turbo3 | 4.886 | +4.2% | healthy |

With Q8_0 weights, both symmetric and asymmetric configurations work. The quantization stacking effect only appears when weight quantization is already aggressive (Q4_K_M). The V compression gradient is clear: turbo4-V +0.3%, turbo3-V +1.1%, turbo2-V +3.1%. All healthy.

3.4 Mistral-Small-24B Q4_K_M

| K | V | PPL | vs q8_0 | Status |
|---|---|---|---|---|
| turbo3 | turbo3 | 4.987 | — | healthy |

Q4_K_M symmetric turbo is not universally broken. Mistral-24B handles it fine. The sensitivity is model-family-dependent, not purely a function of weight quantization.

3.5 Llama-3.1-70B Q4_K_M

| K | V | PPL | vs q8_0 (3.257) | Status |
|---|---|---|---|---|
| q8_0 | turbo4 | 3.301 | +1.3% | healthy |
| q8_0 | turbo3 | 3.325 | +2.1% | healthy |
| q8_0 | turbo2 | 3.568 | +9.5% | healthy |
| turbo4 | turbo4 | 3.461 | +6.3% | healthy |
| turbo3 | turbo3 | 3.629 | +11.4% | usable |
| turbo2 | turbo2 | 5.161 | +58.5% | degraded |

Asymmetric rescue: q8_0/turbo2 gives +9.5% vs +58.5% for symmetric turbo2/turbo2, a 6x reduction in quality degradation with the same V format, just by preserving K precision.

Llama-70B tolerates symmetric turbo3 (+11.4%), but asymmetric q8_0/turbo3 is better (+2.1%). Larger models absorb quantization stacking better than smaller ones.

3.6 Command-R+ 104B Q4_K_M

| K | V | PPL | vs q8_0 (6.192) | Status |
|---|---|---|---|---|
| q8_0 | turbo4 | 6.211 | +0.3% | healthy |
| q8_0 | turbo3 | 6.296 | +1.7% | healthy |
| q8_0 | turbo2 | 6.678 | +7.9% | healthy |
| turbo4 | turbo4 | 6.312 | +1.9% | healthy |
| turbo3 | turbo3 | 6.415 | +3.6% | healthy |
| turbo2 | turbo2 | 7.049 | +13.8% | usable |

104B tolerates symmetric turbo even better than 70B. turbo4/turbo4 is +1.9%, nearly lossless. The trend is clear: bigger models have more capacity to absorb quantization stacking.

3.7 AMD HIP Validation (RX 9070 XT)

| K | V | PPL | vs q8_0 (7.794) | Status |
|---|---|---|---|---|
| q8_0 | turbo4 | 7.876 | +1.0% | healthy |
| turbo4 | turbo4 | 401.4 | catastrophic | avoid |
| turbo3 | turbo3 | 81,277 | catastrophic | avoid |

Asymmetric q8_0/turbo4 confirmed on AMD (Qwen2.5-7B Q4_K_M, the same model as Section 3.1). The symmetric Q4_K_M failure is consistent across all three GPU vendors (Metal, CUDA, HIP).


4. The Quantization Stacking Model

Why do some models survive symmetric turbo on Q4_K_M while others don't?

The hypothesis: K cache quantization error compounds multiplicatively with weight quantization error in the attention logit computation. The Q*K dot product involves both quantized-weight Q projections and quantized-cache K values. When both are lossy, the logit errors can exceed softmax's tolerance threshold.
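A first-order expansion makes the hypothesized compounding explicit (illustrative algebra, using $\Delta Q$ for the error induced by weight quantization of the Q projection and $\Delta K$ for the K cache quantization error):

$$(Q + \Delta Q)(K + \Delta K)^T = QK^T + Q\,\Delta K^T + \Delta Q\,K^T + \Delta Q\,\Delta K^T$$

All three error terms land in logit space, where softmax amplifies them, and the cross term $\Delta Q\,\Delta K^T$ exists only when both weights and cache are lossy. V errors, by contrast, enter after the softmax and never pick up such a cross term.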

The evidence supports this:

| Model | Weights | Symmetric turbo3 PPL | Status |
|---|---|---|---|
| Qwen2.5-1.5B | Q4_K_M | 8,641 | catastrophic |
| Qwen2.5-7B | Q4_K_M | 3,556 | catastrophic |
| Mistral-24B | Q4_K_M | 4.987 | healthy |
| Llama-70B | Q4_K_M | 3.629 | healthy |
| Command-R+ 104B | Q4_K_M | 6.415 | healthy |

The sensitivity is model-family-dependent. Qwen2.5 is sensitive at all sizes. Llama, Mistral, and Cohere are tolerant. Within tolerant families, larger models absorb stacking better (104B: +3.6% vs 70B: +11.4%).

V compression adds negligible error regardless of weight quantization. This is because V errors don't pass through softmax and don't interact multiplicatively with weight quantization in the same way.


5. Practical Recommendations

For any model you haven't tested

```
# Safe default: asymmetric, full K precision, compressed V
llama-server -m model.gguf -ctk q8_0 -ctv turbo4 -fa on
```

For models validated as symmetric-tolerant

```
# Symmetric: more compression, validated on Llama/Mistral/Cohere 24B+
llama-server -m model.gguf -ctk turbo3 -ctv turbo3 -fa on
```

For maximum V compression

```
# Asymmetric with turbo2-V (Boundary V auto-enabled)
llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa on
```

Decision tree

  1. Q8_0+ weights? Symmetric turbo works. Use -ctk turbo4 -ctv turbo4 or -ctk turbo3 -ctv turbo3.
  2. Q4_K_M weights, unknown model? Start asymmetric: -ctk q8_0 -ctv turbo4.
  3. Q4_K_M weights, Qwen2.5? Must use asymmetric. Symmetric is catastrophic.
  4. Q4_K_M weights, Llama/Mistral/Cohere 24B+? Symmetric likely works, but validate PPL first.
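The same decision tree as a hedged Python sketch (the helper name and family sets are illustrative summaries of this paper's results, not an exhaustive registry; the returned strings are the -ctk/-ctv values used above):

```python
# Hypothetical helper encoding the decision tree above.
SENSITIVE = {"qwen2.5", "qwen3"}           # symmetric turbo is catastrophic
TOLERANT = {"llama", "mistral", "cohere"}  # symmetric validated at 24B+

def kv_config(weight_quant: str, family: str, params_b: float) -> tuple[str, str]:
    """Return (K type, V type) to pass as -ctk / -ctv."""
    fam = family.lower()
    if weight_quant.upper() in ("Q8_0", "F16", "BF16"):
        return ("turbo3", "turbo3")        # Q8_0+ weights: symmetric works
    if fam in SENSITIVE:
        return ("q8_0", "turbo4")          # Qwen: asymmetric is mandatory
    if fam in TOLERANT and params_b >= 24:
        return ("turbo3", "turbo3")        # likely fine, but validate PPL first
    return ("q8_0", "turbo4")              # unknown model: safe default

print(kv_config("Q4_K_M", "Qwen2.5", 7))   # -> ('q8_0', 'turbo4')
```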

6. Compression Analysis

Asymmetric K/V trades some K-side compression for quality safety. The V-side savings still dominate for long context:

| Config | K bpv | V bpv | KV compression vs fp16 | Quality |
|---|---|---|---|---|
| q8_0/q8_0 | 8.5 | 8.5 | 1.9x | baseline |
| q8_0/turbo4 | 8.5 | 4.25 | 2.5x | +0.3-1.3% |
| q8_0/turbo3 | 8.5 | 3.125 | 2.8x | +1.1-2.1% |
| q8_0/turbo2 | 8.5 | 2.125 | 3.0x | +3.1-9.5% |
| turbo3/turbo3 | 3.125 | 3.125 | 5.1x | model-dependent |
| turbo4/turbo4 | 4.25 | 4.25 | 3.8x | +1.7-6.3% |
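The compression column follows directly from the bits-per-value columns; a quick sketch to reproduce it (fp16 stores 16 bits per cached value for each of K and V, so 32 bits per K/V pair):

```python
# Reproduce the table's compression factors from per-value bit widths.
BPV = {"q8_0": 8.5, "turbo4": 4.25, "turbo3": 3.125, "turbo2": 2.125}

def kv_compression(k: str, v: str) -> float:
    """Total KV cache compression vs fp16 (K and V weighted equally)."""
    return 32.0 / (BPV[k] + BPV[v])

for k, v in [("q8_0", "q8_0"), ("q8_0", "turbo4"), ("q8_0", "turbo3"),
             ("q8_0", "turbo2"), ("turbo3", "turbo3"), ("turbo4", "turbo4")]:
    print(f"{k}/{v}: {kv_compression(k, v):.1f}x")
# -> 1.9x, 2.5x, 2.8x, 3.0x, 5.1x, 3.8x
```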

Even asymmetric q8_0/turbo3 achieves 2.8x total KV compression. At 128K context on a 70B model, that saves ~5GB of KV cache memory.


7. Independent Validation

These findings have been independently confirmed by multiple researchers:

@sztlink (Felipe Sztutman) — Qwen3-4B, RTX 4090, tonbistudio/turboquant-pytorch (2026-03-31):

  • fp16-K + 2bit-V: 1.000 cosine similarity, 100% top-1 match
  • All degradation from K compression, V has zero effect
  • Boundary layer gap scales with K compression (-0.001 at 4-bit K, -0.010 at 2-bit K)
  • Layer 0 K isolation test: protecting boundary K layers (fp16) did NOT reduce the gap — gap grew more negative, meaning boundary layers already compress better than middle layers. Mechanistic explanation: extreme K norms (146.8 at layer 0) produce tightly Gaussian distributions after normalization, which are ideal for Lloyd-Max codebooks. Middle layers (K norm 20–40) have more variable distributions and compress slightly less accurately. Confirms uniform K compression is stable — no per-layer K precision tuning needed

@HyperionMS2040 — 10-model CUDA sweep, RTX 3090 (2026-03-30):

  • Qwen2.5-7B symmetric turbo3: catastrophic (PPL 3,472). Llama 3.1 8B symmetric turbo3: +6.4%
  • q8_0/turbo4 "lossless across all tested architectures" (4 architectures)
  • Confirms model-family-dependent sensitivity pattern

AMD HIP validation — RX 9070 XT, gfx1201 (2026-03-29):

  • Asymmetric q8_0/turbo4 confirmed: +1.0% PPL
  • Symmetric catastrophic: PPL 81,277 (turbo3/turbo3 on Qwen2.5-7B Q4_K_M)
  • Consistent with Metal and CUDA findings
  • (Author's own testing on Windows AMD hardware)

@jesusmb1995 — Vulkan backend, Mistral-7B Q4_K_S (2026-03-31):

  • Mixed K/V Vulkan results support asymmetric thesis: q8_0-K + pq3_0-V = PPL 6.9426 (+0.64%), f16-K + pq4_0-V = PPL 6.8901 (-0.12%)
  • V compression nearly free when K precision maintained, consistent across a fourth GPU backend (Vulkan)
  • Note: uses pq/tbq type naming (jesusmb1995's Vulkan implementation), same underlying WHT + PolarQuant algorithm

@stragulus — Vulkan, AMD Radeon 7900 XTX (2026-03-31):

  • Independent confirmation on jesusmb1995's Vulkan branch: q8_0-K + pq3_0-V = PPL 6.9226 (+0.37%), f16-K + pq4_0-V = PPL 6.8837 (-0.19%)
  • V compression free on fifth hardware/backend combination (Vulkan + AMD discrete)

@sztlink (Felipe Sztutman) — Qwen3-30B-A3B Q4_K_M, RTX 4090, AmesianX v1.2.0 (2026-04-01):

  • Production PPL validation (wikitext-2-raw, ctx=512, 20 chunks). First SM89 (Ada) data for this model
  • q8_0/tbq3 (asymmetric): PPL 7.5910 (+0.57% vs f16 7.5477) — tightest asymmetric result to date, confirms "V is free" on MoE
  • tbq3/tbq3 (symmetric): PPL 9.5221 (+26.16%) — catastrophic. Extends Qwen symmetric sensitivity from Qwen2.5 dense to Qwen3 MoE
  • tbq4/tbq4 (symmetric): PPL 7.9950 (+5.93%) — healthy, matching q4_0. The turbo4-K catastrophic failures reported by zekrom-vale were caused by kernel bugs in TheTom's fork (documented in turbo4-resurrection paper) and head_dim misdetection in AmesianX's fork (fixed in v1.3.0), not an inherent turbo4 or IQ4_XS issue. AmesianX independently confirmed turbo4-K works correctly: PPL 6.73 (+7.5%) on Qwen3-30B-A3B and 6.20 (+0.6%) on Qwen3.5-27B distill
  • Adds Qwen3 MoE to Section 4 sensitivity table: Qwen family is sensitive across architectures (dense and MoE), not just Qwen2.5

@Madreag — Optimized CUDA fork, 4 GPUs (SM86x2/SM89/SM120), 1,351+ iterations (2026-04-01):

  • Full asymmetric K/V PPL matrix (ctx=512, Qwen3.5-27B Q6_K): V type dominates PPL (columns vary more than rows), and K=turbo3/V=turbo3 (6.7251) approximately equals K=q8_0/V=turbo3 (6.6885). Independent CUDA confirmation that with Q6_K weights K precision matters less than V precision for quality, i.e., asymmetric configs are not needed on Q6_K+ weights (consistent with the stacking model: without aggressive weight quantization, K-side errors stay within softmax's tolerance)
  • Detailed results with KL divergence, NIAH, and speed across 4 GPUs and 3 architecture generations

@mudler (Ettore Di Giacinto) — APEX launch, LocalAI (44.7k stars) (2026-04-01):

  • Released APEX (Adaptive Precision for EXpert Models), a MoE-aware mixed-precision weight quantization method for Qwen3.5-35B-A3B
  • Explicitly recommends pairing APEX with TurboQuant KV cache compression in the launch announcement. First major open source project to officially integrate TurboQuant into their recommended stack
  • APEX + TurboQuant at 8K context: +14% prompt processing speedup across all APEX tiers, zero quality loss, 4.6x KV cache compression
  • APEX Mini (12.2 GB) + TurboQuant = 35B MoE at 8K context on a 16GB consumer GPU
  • Independently confirms boundary layer sensitivity for weights: "first and last 5 transformer layers are 10x more sensitive than the middle," consistent with our Boundary V finding for KV cache

8. Limitations

  1. Sensitivity root cause not proven. We observe that Qwen2.5 is sensitive and Llama/Mistral are not, but we have not identified the specific architectural feature that causes this. Possible factors: attention head count, KV head ratio, weight distribution, or training procedure.

  2. PPL is a proxy. Perplexity on wikitext-2 at 512 context is a noisy single metric. Task-specific benchmarks (reasoning, code, retrieval) may show different sensitivity patterns.

  3. Limited model families. Tested on Qwen2.5, Llama-3.1, Mistral, Command-R+, phi-4, and Qwen3.5. Other architectures (DeepSeek, Gemma, GPT-class) are untested.

  4. Metal primary; CUDA, HIP, and Vulkan secondary. Most experiments run on Apple Silicon Metal. CUDA, HIP, and third-party Vulkan results confirm the pattern, but with fewer model/config combinations tested.


9. Conclusion

V cache compression is effectively free. K cache precision is the dominant lever for attention quality. This is a direct consequence of the attention mechanism: K errors are amplified exponentially through softmax, while V errors scale linearly through the weighted sum.

The practical implication is simple: use asymmetric K/V configs. Compress V aggressively (turbo2 or turbo3), keep K at q8_0 or higher. This gives meaningful compression (2.5-3.0x total KV) with minimal quality loss (+0.3-2.1% PPL) on all tested models, including those where symmetric turbo fails catastrophically.

For models validated as symmetric-tolerant (Llama, Mistral, Cohere at 24B+), symmetric turbo3/turbo3 offers better compression (5.1x) at acceptable quality (+3.6-11.4% PPL). But asymmetric is always the safe default.


10. Prior Discussion

The asymmetric K/V findings were first shared publicly in the llama.cpp TurboQuant discussion thread starting March 26, 2026. Initial recommendations for asymmetric configs and the observation that K precision dominates quality were posted there before being formalized in this paper.


References