Tom Turney, Independent Researcher (GitHub: @TheTom)
We stress-test TurboQuant KV cache compression on Meta's Llama-3.1-70B-Instruct and CohereForAI Command-R+ 104B (both Q4_K_M) running on a single Apple M5 Max with 128GB unified memory. These are the largest models tested with TurboQuant to date.
Key findings:
- **104B at full 128K native context on a laptop.** Command-R+ 104B with turbo3/turbo3 achieves PPL 4.024 at 128K context, using ~74GB of 128GB. 5.12x KV compression.
- **Larger models tolerate symmetric turbo better.** 104B turbo3/turbo3 is +3.6% PPL (vs +11.4% on 70B). turbo4/turbo4 is +1.9%, nearly lossless.
- **turbo3 prefill is faster than q8_0 at 32K context.** Consistent on both 70B (80.8 vs 75.2 t/s, +7.4%) and 104B (64.5 vs 62.3 t/s, +3.5%). Smaller KV cache = less memory bandwidth.
- **NIAH retrieval is perfect.** 30/30 on 70B (q8_0 and turbo3). 10/10 on 104B turbo3 at 4K/8K.
- **macOS context wall identified and fixed.** The default `iogpu.wired_limit_mb` caps GPU memory at ~75% of RAM. On 128GB, this causes Metal to stall at ~49K context on 70B+ models. Fix: `sudo sysctl iogpu.wired_limit_mb=122880`. One command, no reboot.
All tests used Metal flash attention with full GPU offload. Block size 128 throughout (5.12x compression for turbo3).
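The 5.12x figure follows directly from the block layout. Below is a minimal sketch of the arithmetic, assuming turbo3 stores 3-bit codes plus a single fp16 scale per 128-element block (the fork's actual block metadata may differ); q8_0's standard 32-element blocks are shown for comparison.

```python
# Bytes-per-block arithmetic behind the quoted KV compression ratios.
# Assumption: one fp16 scale (2 bytes) of per-block metadata for turbo3;
# the fork's real block layout may carry different metadata.

def compression_vs_fp16(block_size: int, bits_per_elem: int, metadata_bytes: int) -> float:
    fp16_bytes = block_size * 2                                  # 2 bytes per fp16 element
    packed_bytes = block_size * bits_per_elem / 8 + metadata_bytes
    return fp16_bytes / packed_bytes

print(compression_vs_fp16(block_size=128, bits_per_elem=3, metadata_bytes=2))  # 5.12  (turbo3)
print(compression_vs_fp16(block_size=32,  bits_per_elem=8, metadata_bytes=2))  # ~1.88 (q8_0)
```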
| Component | Spec |
|---|---|
| SoC | Apple M5 Max |
| Memory | 128GB unified (LPDDR5X) |
| GPU Cores | 40-core |
| Backend | Metal with flash attention |
| OS | macOS |
| Property | Value |
|---|---|
| Model | Meta-Llama-3.1-70B-Instruct |
| Weight quantization | Q4_K_M (GGUF, 40GB) |
| Layers | 80 |
| Attention heads | 64 |
| KV heads | 8 (GQA 8:1) |
| Head dimension | 128 |
| Native context | 128K |
- Branch: `feature/turboquant-kv-cache`
- Block size: `QK_TURBO3=128`, `QK_TURBO2=128`
- Sparse V: enabled (default on M5+)
- Boundary V: not active (test predates auto-enable)
- Full GPU offload: `-ngl 99`
| K | V | PPL | vs q8_0 | Status |
|---|---|---|---|---|
| q8_0 | q8_0 | 3.257 | baseline | healthy |
| q8_0 | turbo4 | 3.301 | +1.3% | healthy |
| q8_0 | turbo3 | 3.325 | +2.1% | healthy |
| turbo4 | turbo4 | 3.461 | +6.3% | healthy |
| turbo3 | turbo3 | 3.629 | +11.4% | usable |
| turbo2 | turbo2 | 5.161 | +58.5% | degraded |
Finding: Llama-70B Q4_K_M tolerates symmetric turbo quantization across all formats. This contrasts with Qwen2.5-7B Q4_K_M, where symmetric turbo3/turbo3 produces catastrophic PPL (3556). The 70B model has sufficient capacity to absorb the quantization stacking that breaks smaller models.
turbo2/turbo2 shows significant degradation (+58.5%) but is not catastrophic — the model remains coherent unlike the Qwen2.5-7B case.
| K | V | Context | Chunks | PPL |
|---|---|---|---|---|
| q8_0 | q8_0 | 8K | 4 | 3.617 |
| turbo4 | turbo4 | 8K | 4 | 3.770 |
| turbo3 | turbo3 | 8K | 4 | 3.937 |
| turbo3 | turbo3 | 32K | 2 | 4.839 |
| q8_0 | q8_0 | 48K | 1 | 3.575 |
| turbo3 | turbo3 | 48K | 1 | 4.019 |
PPL remains healthy at all tested context lengths. The turbo3 PPL at 48K (4.019) is higher than q8_0 at 48K (3.575), consistent with the +11.4% gap observed at 512 context.
| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|---|---|
| q8_0 | q8_0 | 166.8 | 10.9 |
| q8_0 | turbo4 | 174.4 | 10.1 |
| q8_0 | turbo3 | 174.8 | 10.2 |
| turbo4 | turbo4 | 173.8 | 10.3 |
| turbo3 | turbo3 | 165.0 | 9.9 |
| turbo2 | turbo2 | 170.5 | 9.9 |
Speed is flat across all configs at short context. The 40GB model weights dominate memory bandwidth; the KV cache at 512 tokens is negligible.
| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|---|---|
| q8_0 | q8_0 | 139.2 | 11.9 |
| turbo4 | turbo4 | 134.5 | 10.6 |
| turbo3 | turbo3 | 136.2 | 10.1 |
Still flat. KV cache at 8K is ~1.25GB (q8_0) or ~250MB (turbo3), both negligible vs 40GB weights.
| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|---|---|
| q8_0 | q8_0 | 75.2 | 10.4 |
| turbo4 | turbo4 | 72.5 | 10.5 |
| turbo3 | turbo3 | 80.8 | 10.2 |
turbo3 prefill is 7.4% faster than q8_0 (80.8 vs 75.2 t/s). At 32K context, the KV cache is large enough (~5GB for q8_0 vs ~1GB for turbo3) that reduced memory bandwidth during attention outweighs dequantization cost. This crossover point is consistent with the observation on smaller models (Qwen3.5-35B MoE on M1 Max, where turbo2 beats q8_0 at 65K prefill).
Decode remains flat at ~10 t/s across all configs. On a 70B model, decode is dominated by the 40GB weight read per token, not the KV cache.
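A very rough traffic model makes the asymmetry concrete: during decode the full weight read dominates each token, while during prefill the weights are amortized over the batch, so KV-cache traffic becomes the larger term. The sketch below uses the ~40GB weight and ~5GB/~1GB KV figures quoted above plus an assumed ubatch of 512; it ignores compute buffers and the fact that prefill attention only sees the cache written so far.

```python
# Back-of-envelope memory-traffic shares at 32K context (Llama-3.1-70B Q4_K_M).
# Figures: ~40GB weights, ~5GB q8_0 KV, ~1GB turbo3 KV (quoted above).
# UBATCH=512 is an assumption; real prefill attention reads a growing cache.

GB = 1000**3
WEIGHTS = 40 * GB
UBATCH = 512

for name, kv in [("q8_0", 5 * GB), ("turbo3", 1 * GB)]:
    decode_traffic = WEIGHTS + kv            # decode: one full weight pass per generated token
    prefill_traffic = WEIGHTS / UBATCH + kv  # prefill: weights amortized across the ubatch
    print(f"{name}: KV is {kv / decode_traffic:.0%} of decode traffic, "
          f"{kv / prefill_traffic:.0%} of prefill traffic")

# q8_0:   ~11% of decode, ~98% of prefill -> shrinking the KV barely moves decode,
# turbo3: ~2%  of decode, ~93% of prefill    but visibly speeds up long-context prefill.
```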
Note: `llama-bench` with default 5 repetitions hangs on 70B at 32K+ context. All 32K measurements use `-r 1`. Root cause unclear; appears to be Metal resource contention across reps.
Kamradt single-needle methodology. 5 depths (0%, 25%, 50%, 75%, 100%) × 3 context lengths (4K, 8K, 16K) × 2 cache types.
| Depth | 4K | 8K | 16K |
|---|---|---|---|
| 0% | PASS | PASS | PASS |
| 25% | PASS | PASS | PASS |
| 50% | PASS | PASS | PASS |
| 75% | PASS | PASS | PASS |
| 100% | PASS | PASS | PASS |
| Depth | 4K | 8K | 16K |
|---|---|---|---|
| 0% | PASS | PASS | PASS |
| 25% | PASS | PASS | PASS |
| 50% | PASS | PASS | PASS |
| 75% | PASS | PASS | PASS |
| 100% | PASS | PASS | PASS |
30/30 pass. Zero difference between q8_0 and turbo3. TurboQuant preserves retrieval accuracy at 5.12x KV cache compression on a 70B model.
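For reference, here is a minimal sketch of the single-needle construction behind these tables. The harness's exact haystack text, needle wording, and grading criterion are assumptions; only the 5-depth × 3-context grid comes from the methodology above.

```python
# Minimal Kamradt-style single-needle prompt builder.
# Filler text, needle, and pass criterion are placeholders, not the harness's exact strings.

FILLER = "The grass is green. The sky is blue. The sun is bright. "
NEEDLE = "The magic number for this test is 7421. "
QUESTION = "\n\nWhat is the magic number for this test? Answer with the number only."

def build_prompt(target_chars: int, depth: float) -> str:
    """Insert the needle at `depth` (0.0 = start, 1.0 = end) of a filler haystack."""
    haystack = (FILLER * (target_chars // len(FILLER) + 1))[:target_chars]
    cut = int(len(haystack) * depth)
    return haystack[:cut] + NEEDLE + haystack[cut:] + QUESTION

def passed(answer: str) -> bool:
    return "7421" in answer

# 5 depths x 3 context lengths (x 2 cache types) = the 30 runs scored above.
prompts = [build_prompt(ctx_tokens * 4, depth)       # ~4 chars per token
           for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
           for ctx_tokens in (4_096, 8_192, 16_384)]
```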
| Config | KV Cache (MiB) | Model + Context (MiB) | Free (MiB) |
|---|---|---|---|
| q8_0/q8_0 | 8,160 | 48,991 | 61,108 |
| turbo3/turbo3 | ~3,000 | ~44,000 | ~66,000 |
With turbo3, the KV cache at 48K is 3GB instead of 8GB. Both fit comfortably in 128GB with 60+ GB to spare.
Without the fix, both q8_0 and turbo3 hang at 50K+ context:
| K | V | Context | Status |
|---|---|---|---|
| turbo3 | turbo3 | 48K | works (PPL 4.019) |
| q8_0 | q8_0 | 48K | works (PPL 3.575) |
| turbo3 | turbo3 | 50K | HANGS |
| turbo3 | turbo3 | 56K | HANGS |
| q8_0 | q8_0 | 64K | HANGS |
Root cause: macOS sets `recommendedMaxWorkingSetSize` to ~75% of physical RAM by default. On 128GB, this is ~96GB. When total GPU allocations (model + KV + compute buffers) exceed this limit, Metal's buffer allocation blocks indefinitely.
Fix: One command, no reboot required:

```bash
# Recommended: 90% of physical RAM (safe for sustained inference)
# Setting above 90% risks kernel panics under sustained load.

# 128GB Mac
sudo sysctl iogpu.wired_limit_mb=117964

# 96GB Mac
sudo sysctl iogpu.wired_limit_mb=88474

# 64GB Mac
sudo sysctl iogpu.wired_limit_mb=58982
```

With the fix applied, 70B runs at 64K context (PPL 4.135). See Section 8 for 104B results at 128K.
Isolation experiments confirmed `iogpu.wired_limit_mb` is the sole fix needed. `GGML_METAL_NO_RESIDENCY=1` and custom ubatch sizes had no measurable effect.
Note: The original stress tests used 122880 (96%) without issues, but community testing (@treblewoe) reports kernel panics at sustained load above 90%. 90% is the recommended safe default.
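The three values in the fix block above are simply 90% of physical RAM expressed in MiB. A trivial helper for other RAM sizes, assuming the sysctl is denominated in MiB as the example commands imply (expect ±1 MiB of rounding difference):

```python
# Compute a 90% iogpu.wired_limit_mb value for a given unified-memory size.
# Assumption: the sysctl takes MiB, matching the example commands above.

def wired_limit_mb(ram_gib: int, fraction: float = 0.90) -> int:
    return int(ram_gib * 1024 * fraction)

for ram in (64, 96, 128):
    print(f"{ram}GB: sudo sysctl iogpu.wired_limit_mb={wired_limit_mb(ram)}")

# 64GB -> 58982, 96GB -> 88473, 128GB -> 117964 (within 1 MiB of the values above)
```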
Llama-3.1-70B uses GQA 8:1 (8 KV heads for 64 attention heads). This means the KV cache is already 1/8th the size it would be without GQA. TurboQuant's compression ratios are the same (5.12x for turbo3 at block_size=128), but the absolute memory savings are smaller:
| Config | KV at 48K | vs fp16 KV |
|---|---|---|
| fp16 | ~15,360 MiB | 1.0x |
| q8_0 | 8,160 MiB | 1.9x |
| turbo3 | ~3,000 MiB | 5.1x |
On models without GQA (e.g., GPT-class with n_kv_heads = n_heads), TurboQuant's savings scale 8× larger in absolute terms.
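The fp16 row follows from the standard KV-cache size formula, 2 (K and V) × layers × KV heads × head dim × tokens × bytes per element; plugging in the 70B parameters listed above reproduces the table:

```python
# KV cache size from model shape: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes/elem.
MIB = 1024**2

def kv_cache_mib(layers: int, kv_heads: int, head_dim: int, tokens: int,
                 bytes_per_elem: float = 2.0) -> float:
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / MIB

fp16_48k = kv_cache_mib(layers=80, kv_heads=8, head_dim=128, tokens=48 * 1024)
print(f"fp16 KV at 48K: {fp16_48k:,.0f} MiB")           # 15,360 MiB
print(f"q8_0  (/1.88):  {fp16_48k / 1.88:,.0f} MiB")    # ~8,170 MiB
print(f"turbo3 (/5.12): {fp16_48k / 5.12:,.0f} MiB")    # 3,000 MiB

# Without GQA (64 KV heads instead of 8), every row above would be 8x larger.
print(f"no-GQA fp16:    {kv_cache_mib(80, 64, 128, 48 * 1024):,.0f} MiB")  # 122,880 MiB
```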
| Model | Weights | Symmetric turbo3 PPL | Status |
|---|---|---|---|
| Qwen2.5-7B | Q4_K_M | 3,556 | catastrophic |
| Qwen2.5-1.5B | Q4_K_M | 8,641 | catastrophic |
| Mistral-24B | Q4_K_M | 4.987 | healthy |
| Llama-70B | Q4_K_M | 3.629 | healthy |
| Command-R+ 104B | Q4_K_M | 6.415 | healthy |
The Q4_K_M symmetric turbo sensitivity appears to be model-family-dependent, not purely size-dependent. Qwen2.5 is sensitive at all sizes. Mistral, Llama, and Command-R+ tolerate it. Larger models show better tolerance (104B +3.6% vs 70B +11.4%). For sensitive models, asymmetric q8_0-K + turbo-V is the recommended path.
| Property | Value |
|---|---|
| Model | CohereForAI Command-R+ 104B |
| Weight quantization | Q4_K_M (~59GB on disk, 58.4 GiB loaded) |
| Layers | 64 |
| Attention heads | 96 |
| KV heads | 8 (GQA 12:1) |
| Head dimension | 128 |
| Native context | 128K |
Metal configuration: `iogpu.wired_limit_mb=122880` applied before testing.
| K | V | PPL | vs q8_0 | Status |
|---|---|---|---|---|
| q8_0 | q8_0 | 6.192 | baseline | healthy |
| q8_0 | turbo4 | 6.211 | +0.3% | healthy |
| q8_0 | turbo3 | 6.296 | +1.7% | healthy |
| q8_0 | turbo2 | 6.678 | +7.9% | healthy |
| turbo4 | turbo4 | 6.312 | +1.9% | healthy |
| turbo3 | turbo3 | 6.415 | +3.6% | healthy |
| turbo2 | turbo2 | 7.049 | +13.8% | usable |
104B tolerates symmetric turbo even better than 70B. turbo3/turbo3 is +3.6% (vs +11.4% on 70B). turbo4/turbo4 is nearly lossless at +1.9%. Bigger models = more headroom for quantization stacking.
| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|---|---|
| q8_0 | q8_0 | 125.0 | 8.5 |
| q8_0 | turbo4 | 128.5 | 7.6 |
| q8_0 | turbo3 | 129.4 | 7.3 |
| q8_0 | turbo2 | 127.3 | 7.8 |
| turbo4 | turbo4 | 118.0 | 7.9 |
| turbo3 | turbo3 | 125.7 | 6.8 |
| turbo2 | turbo2 | 126.5 | 7.7 |
| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|---|---|
| q8_0 | q8_0 | 101.6 | 7.6 |
| q8_0 | turbo3 | 100.1 | 8.1 |
| turbo3 | turbo3 | 98.4 | 7.6 |
| K | V | Prefill (t/s) | Decode (t/s) |
|---|---|---|---|
| q8_0 | q8_0 | 62.3 | 8.3 |
| turbo3 | turbo3 | 64.5 | 7.8 |
turbo3 prefill faster than q8_0 again at 32K (64.5 vs 62.3 t/s, +3.5%). Same crossover pattern as 70B.
| K | V | Context | PPL | Pass time | Status |
|---|---|---|---|---|---|
| turbo3 | turbo3 | 48K | 3.672 | 931s | works |
| turbo3 | turbo3 | 64K | 4.321 | 1481s | works |
| turbo3 | turbo3 | 96K | 4.170 | 2966s | works |
| turbo3 | turbo3 | 128K | 4.024 | 4996s | works |
104 billion parameters. 128K full native context. On a MacBook. PPL 4.024. turbo3 (5.12x KV compression). Peak memory ~74GB of 128GB. Pass time 83 minutes.
The context wall that blocked 70B at 49K is fully resolved with `iogpu.wired_limit_mb=122880`.
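The ~74GB peak is roughly what the standard KV-cache size formula predicts: 58.4 GiB of weights plus a ~6.3 GiB turbo3 KV cache at 128K, with the remainder going to compute buffers. A back-of-envelope check (compute-buffer overhead was not measured separately, so only weights and KV are modeled):

```python
# Rough 128K memory budget for Command-R+ 104B with turbo3/turbo3.
# Only weights and KV follow from the specs above; compute buffers are not modeled.
GIB = 1024**3

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int, tokens: int,
                 bytes_per_elem: float) -> float:
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / GIB

weights_gib = 58.4                                    # Q4_K_M weights as loaded
fp16_kv = kv_cache_gib(64, 8, 128, 128 * 1024, 2.0)   # 32.0 GiB at full native context
turbo3_kv = fp16_kv / 5.12                            # 6.25 GiB after 5.12x compression

print(f"weights {weights_gib:.1f} GiB + KV {turbo3_kv:.2f} GiB "
      f"= {weights_gib + turbo3_kv:.1f} GiB")
# ~64.6 GiB before compute buffers, consistent with the observed ~74GB peak.
```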
Kamradt single-needle methodology. 104B decode is slow (~8 t/s), so 16K tests timed out.
| Depth | 4K | 8K | 16K |
|---|---|---|---|
| 0% | PASS | PASS | timeout |
| 25% | PASS | PASS | — |
| 50% | PASS | PASS | — |
| 75% | PASS | PASS | — |
| 100% | PASS | PASS | — |
10/10 pass at 4K and 8K. 16K timed out due to slow decode, not a retrieval failure.
- **Two models tested.** Results are for Llama-3.1-70B-Instruct and Command-R+ 104B, both Q4_K_M. Other 70B+ models (Qwen-72B, DeepSeek-67B) may behave differently.
- **Q4_K_M weights only.** Q8_0 weights on 70B would likely show even better turbo PPL, but require ~70GB for weights alone, leaving limited room for KV cache on 128GB.
- **Metal only.** CUDA and HIP backends were not tested at this model size.
- **Single run per config.** PPL measurements are single-run, not averaged across multiple seeds. Error bars are from the wikitext-2 chunk variance, not run-to-run variance.
- **Boundary V not tested.** The auto-enable for turbo2-V was added after this test. Boundary V on 70B turbo2 would likely recover significant quality from the +58.5% degradation.
- **104B NIAH limited.** 16K context timed out due to slow decode (~8 t/s). Not a retrieval failure. Requires a longer query timeout for full coverage.
- **macOS GPU memory fix required for 70B+ at long context.** The `iogpu.wired_limit_mb` sysctl must be set to ~90% of physical RAM (community testing confirmed >90% risks kernel panics under sustained load). Resets on reboot.
The key findings from this stress test have been independently confirmed:
Asymmetric K/V rescue (Sections 2, 8):
- @HyperionMS2040 — RTX 3090, 10-model CUDA sweep (2026-03-30): q8_0/turbo4 "lossless across all tested architectures" (4 architectures validated). Confirmed model-family-dependent sensitivity: Qwen2.5-7B symmetric turbo3 catastrophic (PPL 3,472) vs Llama 3.1 8B symmetric turbo3 +6.4%.
- @sztlink — RTX 4090, Qwen3-4B (2026-03-31): Full asymmetric matrix confirms V compression is completely free (1.000 cosine similarity with fp16-K + 2bit-V). All degradation from K.
- AMD HIP — RX 9070 XT, gfx1201 (2026-03-29): Asymmetric q8_0/turbo4 confirmed at +1.0% PPL. (Author's own testing on Windows AMD hardware.)
turbo3 prefill faster than q8_0 at long context (Sections 3, 8):
- @spiritbuun — RTX 3090 (X collaborator, ~2026-03-28): Reported near-parity prefill speed with dequant-then-MMA path on CUDA.
- @AmesianX — Blackwell DGX Spark (2026-03-30): turbo decode 63.5 t/s, faster than q8_0 (50.1 t/s) at 8K context.
- @dusterbloom — RTX 3090 (2026-03-30): Decode faster on 4/5 models with TBQ3 Flash Attention.
- @HyperionMS2040 — RTX 3090 (2026-03-30): Asymmetric q8_0-K/turbo3-V 14% faster decode than symmetric turbo3/turbo3 (120.9 vs 106 t/s).
macOS GPU memory wall (Section 5):
- @treblewoe (X collaborator, 2026-03-31): Confirmed the wall exists. Reported kernel panics at >90% allocation under sustained load (Minimax M2.5 3-bit starting at 100GB). Recommended 90% as safe ceiling. Docs updated accordingly.
- TurboQuant paper: arXiv 2504.19874 (ICLR 2026)
- TurboQuant+ implementation: github.com/TheTom/turboquant_plus
- llama.cpp fork: github.com/TheTom/llama-cpp-turboquant
- Block size optimization: block-size-experiment.md
- Sparse V dequant: sparse-v-dequant.md
- Configuration recommendations: turboquant-recommendations.md