TurboQuant M5 Max 128GB Stress Test — Large Model Validation

Tom Turney · Independent Researcher · GitHub: @TheTom


Abstract

We stress-test TurboQuant KV cache compression on Meta's Llama-3.1-70B-Instruct and CohereForAI Command-R+ 104B (both Q4_K_M) running on a single Apple M5 Max with 128GB unified memory. These are the largest models tested with TurboQuant to date.

Key findings:

  1. 104B at full 128K native context on a laptop. Command-R+ 104B with turbo3/turbo3 achieves PPL 4.024 at 128K context, using ~74GB of 128GB. 5.12x KV compression.

  2. Larger models tolerate symmetric turbo better. 104B turbo3/turbo3 is +3.6% PPL (vs +11.4% on 70B). turbo4/turbo4 is +1.9%, nearly lossless.

  3. turbo3 prefill is faster than q8_0 at 32K context. Consistent on both 70B (80.8 vs 75.2 t/s, +7.4%) and 104B (64.5 vs 62.3 t/s, +3.5%). Smaller KV cache = less memory bandwidth.

  4. NIAH retrieval is perfect. 30/30 on 70B (q8_0 and turbo3). 10/10 on 104B turbo3 at 4K/8K.

  5. macOS context wall identified and fixed. The default iogpu.wired_limit_mb caps GPU memory at ~75% of RAM. On 128GB, this causes Metal to stall at ~49K context on 70B+ models. Fix: sudo sysctl iogpu.wired_limit_mb=122880 (the value used for these runs; see Section 5.2 for the recommended 90%-of-RAM setting). One command, no reboot.

All tests used Metal flash attention with full GPU offload. Block size 128 throughout (5.12x compression for turbo3).


1. Setup

1.1 Hardware

| Component | Spec |
| --- | --- |
| SoC | Apple M5 Max |
| Memory | 128GB unified (LPDDR5X) |
| GPU Cores | 40-core |
| Backend | Metal with flash attention |
| OS | macOS |

1.2 Model

| Property | Value |
| --- | --- |
| Model | Meta-Llama-3.1-70B-Instruct |
| Weight quantization | Q4_K_M (GGUF, 40GB) |
| Layers | 80 |
| Attention heads | 64 |
| KV heads | 8 (GQA 8:1) |
| Head dimension | 128 |
| Native context | 128K |

1.3 Build

  • Branch: feature/turboquant-kv-cache
  • Block size: QK_TURBO3=128, QK_TURBO2=128 (the compression ratio this implies is sketched after this list)
  • Sparse V: enabled (default on M5+)
  • Boundary V: not active (test predates auto-enable)
  • Full GPU offload: -ngl 99
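
For reference, the 5.12x figure quoted throughout is exactly what a 3-bit-plus-scale block layout would give at block size 128. The layout below is an assumption for illustration, not a dump of the actual turbo3 struct:

```python
# Illustration only: a hypothetical turbo3 block layout (3-bit codes plus one
# fp16 scale per 128-element block) reproduces the quoted 5.12x ratio exactly.
BLOCK = 128                       # QK_TURBO3
turbo3_bits = BLOCK * 3 + 16      # 384 bits of codes + 16-bit scale = 400 bits
fp16_bits   = BLOCK * 16          # 2048 bits

print(f"{turbo3_bits // 8} bytes/block, {fp16_bits / turbo3_bits:.2f}x vs fp16")
# -> 50 bytes/block, 5.12x vs fp16
```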

2. Perplexity

2.1 Short Context (512 tokens, 20 chunks, wikitext-2-raw)

| K | V | PPL | vs q8_0 | Status |
| --- | --- | --- | --- | --- |
| q8_0 | q8_0 | 3.257 | baseline | healthy |
| q8_0 | turbo4 | 3.301 | +1.3% | healthy |
| q8_0 | turbo3 | 3.325 | +2.1% | healthy |
| turbo4 | turbo4 | 3.461 | +6.3% | healthy |
| turbo3 | turbo3 | 3.629 | +11.4% | usable |
| turbo2 | turbo2 | 5.161 | +58.5% | degraded |

Finding: Llama-70B Q4_K_M stays coherent under symmetric turbo quantization in every format tested. This contrasts with Qwen2.5-7B Q4_K_M, where symmetric turbo3/turbo3 produces catastrophic PPL (3556). The 70B model has sufficient capacity to absorb the quantization stacking that breaks smaller models.

turbo2/turbo2 shows significant degradation (+58.5%) but is not catastrophic — the model remains coherent unlike the Qwen2.5-7B case.
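
For the record, the "vs q8_0" column is plain relative perplexity against the q8_0/q8_0 baseline; a few lines of Python reproduce it from the table values:

```python
# Reproduce the "vs q8_0" column of the Section 2.1 table.
baseline = 3.257  # q8_0/q8_0 PPL
symmetric = {"turbo4/turbo4": 3.461, "turbo3/turbo3": 3.629, "turbo2/turbo2": 5.161}

for config, ppl in symmetric.items():
    print(f"{config}: {(ppl / baseline - 1) * 100:+.1f}%")
# turbo4/turbo4: +6.3%   turbo3/turbo3: +11.4%   turbo2/turbo2: +58.5%
```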

2.2 Long Context

| K | V | Context | Chunks | PPL |
| --- | --- | --- | --- | --- |
| q8_0 | q8_0 | 8K | 4 | 3.617 |
| turbo4 | turbo4 | 8K | 4 | 3.770 |
| turbo3 | turbo3 | 8K | 4 | 3.937 |
| turbo3 | turbo3 | 32K | 2 | 4.839 |
| q8_0 | q8_0 | 48K | 1 | 3.575 |
| turbo3 | turbo3 | 48K | 1 | 4.019 |

PPL remains healthy at all tested context lengths. The turbo3 PPL at 48K (4.019) is higher than q8_0 at 48K (3.575), consistent with the +11.4% gap observed at 512 context.


3. Speed

3.1 Short Context (512 tokens)

| K | V | Prefill (t/s) | Decode (t/s) |
| --- | --- | --- | --- |
| q8_0 | q8_0 | 166.8 | 10.9 |
| q8_0 | turbo4 | 174.4 | 10.1 |
| q8_0 | turbo3 | 174.8 | 10.2 |
| turbo4 | turbo4 | 173.8 | 10.3 |
| turbo3 | turbo3 | 165.0 | 9.9 |
| turbo2 | turbo2 | 170.5 | 9.9 |

Speed is flat across all configs at short context. The 40GB model weights dominate memory bandwidth; the KV cache at 512 tokens is negligible.

3.2 8K Context

| K | V | Prefill (t/s) | Decode (t/s) |
| --- | --- | --- | --- |
| q8_0 | q8_0 | 139.2 | 11.9 |
| turbo4 | turbo4 | 134.5 | 10.6 |
| turbo3 | turbo3 | 136.2 | 10.1 |

Still flat. KV cache at 8K is ~1.25GB (q8_0) or ~500MB (turbo3), both negligible vs the 40GB weights.

3.3 32K Context

| K | V | Prefill (t/s) | Decode (t/s) |
| --- | --- | --- | --- |
| q8_0 | q8_0 | 75.2 | 10.4 |
| turbo4 | turbo4 | 72.5 | 10.5 |
| turbo3 | turbo3 | 80.8 | 10.2 |

turbo3 prefill is 7.4% faster than q8_0 (80.8 vs 75.2 t/s). At 32K context, the KV cache is large enough (~5GB for q8_0 vs ~2GB for turbo3) that reduced memory bandwidth during attention outweighs dequantization cost. This crossover point is consistent with the observation on smaller models (Qwen3.5-35B MoE on M1 Max, where turbo2 beats q8_0 at 65K prefill).

Decode remains flat at ~10 t/s across all configs. On a 70B model, decode is dominated by the 40GB weight read per token, not the KV cache.
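
A quick sanity check on the weight-bound decode claim, using only numbers already in this report (it ignores KV traffic and any cache reuse, so it is a floor rather than a bandwidth measurement):

```python
# Implied weight-read traffic during decode: every token streams the full weights.
weights_gb = 40.0    # Q4_K_M 70B weights (Section 1.2)
decode_tps = 10.4    # q8_0/q8_0 decode at 32K (table above)

print(f"~{weights_gb * decode_tps:.0f} GB/s of weight reads")  # ~416 GB/s
# Against that, shrinking a ~5GB KV cache to ~2GB barely moves decode speed.
```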

Note: llama-bench with default 5 repetitions hangs on 70B at 32K+ context. All 32K measurements use -r 1. Root cause unclear; appears to be Metal resource contention across reps.


4. Needle-In-A-Haystack (NIAH)

Kamradt single-needle methodology. 5 depths (0%, 25%, 50%, 75%, 100%) × 3 context lengths (4K, 8K, 16K) × 2 cache types.
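
For readers who want to reproduce the grid, a minimal needle-insertion helper looks like the sketch below. This is an illustrative stand-in, not the exact harness used for these runs; the needle text, filler, and chars-per-token ratio are all placeholders.

```python
# Illustrative single-needle prompt builder (not the exact harness used here).
NEEDLE = "The magic number for the audit is 74219."
QUESTION = "\n\nWhat is the magic number for the audit? Reply with the number only."

def build_prompt(haystack: str, depth_pct: float, approx_tokens: int) -> str:
    """Splice NEEDLE into the haystack at depth_pct%, then append the question."""
    body = haystack[: approx_tokens * 4]          # ~4 chars/token, rough proxy
    cut = int(len(body) * depth_pct / 100)
    return body[:cut] + " " + NEEDLE + " " + body[cut:] + QUESTION

filler = "The quick brown fox jumps over the lazy dog. " * 20000
grid = [(d, n) for d in (0, 25, 50, 75, 100) for n in (4096, 8192, 16384)]
prompts = [build_prompt(filler, d, n) for d, n in grid]  # 15 prompts per cache type
# Each prompt is sent to the model; a run passes if "74219" appears in the reply.
```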

q8_0 (baseline)

| Depth | 4K | 8K | 16K |
| --- | --- | --- | --- |
| 0% | PASS | PASS | PASS |
| 25% | PASS | PASS | PASS |
| 50% | PASS | PASS | PASS |
| 75% | PASS | PASS | PASS |
| 100% | PASS | PASS | PASS |

turbo3 (5.12x compression)

| Depth | 4K | 8K | 16K |
| --- | --- | --- | --- |
| 0% | PASS | PASS | PASS |
| 25% | PASS | PASS | PASS |
| 50% | PASS | PASS | PASS |
| 75% | PASS | PASS | PASS |
| 100% | PASS | PASS | PASS |

30/30 pass. Zero difference between q8_0 and turbo3. TurboQuant preserves retrieval accuracy at 5.12x KV cache compression on a 70B model.


5. Maximum Context

5.1 Memory at 48K Context

| Config | KV Cache (MiB) | Model + Context (MiB) | Free (MiB) |
| --- | --- | --- | --- |
| q8_0/q8_0 | 8,160 | 48,991 | 61,108 |
| turbo3/turbo3 | ~3,000 | ~44,000 | ~66,000 |

With turbo3, the KV cache at 48K is 3GB instead of 8GB. Both fit comfortably in 128GB with 60+ GB to spare.

5.2 Context Wall (Identified and Fixed)

Without the fix, both q8_0 and turbo3 hang at 50K+ context:

| K | V | Context | Status |
| --- | --- | --- | --- |
| turbo3 | turbo3 | 48K | works (PPL 4.019) |
| q8_0 | q8_0 | 48K | works (PPL 3.575) |
| turbo3 | turbo3 | 50K | HANGS |
| turbo3 | turbo3 | 56K | HANGS |
| q8_0 | q8_0 | 64K | HANGS |

Root cause: macOS sets recommendedMaxWorkingSetSize to ~75% of physical RAM by default. On 128GB, this is ~96GB. When total GPU allocations (model + KV + compute buffers) exceed this limit, Metal's buffer allocation blocks indefinitely.

Fix: One command, no reboot required:

# Recommended: 90% of physical RAM (safe for sustained inference)
# Setting above 90% risks kernel panics under sustained load.

# 128GB Mac
sudo sysctl iogpu.wired_limit_mb=117964

# 96GB Mac
sudo sysctl iogpu.wired_limit_mb=88474

# 64GB Mac
sudo sysctl iogpu.wired_limit_mb=58982
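
The three values are simply 90% of physical RAM expressed in MiB (matching how the numbers above were derived); for other machine sizes:

```python
# Recommended iogpu.wired_limit_mb values: 90% of physical RAM, in MiB.
for ram_gb in (64, 96, 128):
    limit = int(ram_gb * 1024 * 0.90)
    print(f"{ram_gb}GB Mac: sudo sysctl iogpu.wired_limit_mb={limit}")
# 64GB -> 58982, 96GB -> 88473 (the block above rounds up to 88474), 128GB -> 117964
```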

With the fix applied, 70B runs at 64K context (PPL 4.135). See Section 8 for 104B results at 128K.

Isolation experiments confirmed iogpu.wired_limit_mb is the sole fix needed. GGML_METAL_NO_RESIDENCY=1 and custom ubatch sizes had no measurable effect.

Note: The original stress tests used 122880 (~94% of physical RAM) without issues, but community testing (@treblewoe) reports kernel panics at sustained load above 90%. 90% is the recommended safe default.


6. GQA Impact on Compression Savings

Llama-3.1-70B uses GQA 8:1 (8 KV heads for 64 attention heads). This means the KV cache is already 1/8th the size it would be without GQA. TurboQuant's compression ratios are the same (5.12x for turbo3 at block_size=128), but the absolute memory savings are smaller:

| Config | KV at 48K | vs fp16 KV |
| --- | --- | --- |
| fp16 | ~15,360 MiB | 1.0x |
| q8_0 | 8,160 MiB | 1.9x |
| turbo3 | ~3,000 MiB | 5.1x |

On models without GQA (e.g., GPT-class with n_kv_heads = n_heads), TurboQuant's savings scale 8× larger in absolute terms.
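
The table reproduces directly from the model shape in Section 1.2. A short check (q8_0 taken at llama.cpp's 34 bytes per 32 elements; turbo3 at the stated 5.12x over fp16):

```python
# Reproduce the 48K KV-cache sizes from the Llama-3.1-70B-Instruct shape.
n_layers, n_kv_heads, head_dim = 80, 8, 128      # Section 1.2
tokens = 48 * 1024

elems  = 2 * n_layers * n_kv_heads * head_dim * tokens   # K and V
fp16   = elems * 2 / 2**20            # 16 bits/element          -> 15,360 MiB
q8_0   = elems * 34 / 32 / 2**20      # 34 bytes per 32 elements ->  8,160 MiB
turbo3 = fp16 / 5.12                  # stated ratio vs fp16     ->  3,000 MiB

print(f"fp16 {fp16:,.0f} MiB | q8_0 {q8_0:,.0f} MiB | turbo3 {turbo3:,.0f} MiB")
# With 64 KV heads instead of 8 (no GQA), every figure would be 8x larger.
```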


7. Comparison with Smaller Models

| Model | Weights | Symmetric turbo3 PPL | Status |
| --- | --- | --- | --- |
| Qwen2.5-7B | Q4_K_M | 3,556 | catastrophic |
| Qwen2.5-1.5B | Q4_K_M | 8,641 | catastrophic |
| Mistral-24B | Q4_K_M | 4.987 | healthy |
| Llama-70B | Q4_K_M | 3.629 | healthy |
| Command-R+ 104B | Q4_K_M | 6.415 | healthy |

The Q4_K_M symmetric turbo sensitivity appears to be model-family-dependent, not purely size-dependent. Qwen2.5 is sensitive at all sizes. Mistral, Llama, and Command-R+ tolerate it. Larger models show better tolerance (104B +3.6% vs 70B +11.4%). For sensitive models, asymmetric q8_0-K + turbo-V is the recommended path.


8. Command-R+ 104B Q4_K_M

8.1 Model

| Property | Value |
| --- | --- |
| Model | CohereForAI Command-R+ 104B |
| Weight quantization | Q4_K_M (~59GB on disk, 58.4 GiB loaded) |
| Layers | 64 |
| Attention heads | 96 |
| KV heads | 8 (GQA 12:1) |
| Head dimension | 128 |
| Native context | 128K |

Metal configuration: iogpu.wired_limit_mb=122880 applied before testing.

8.2 Perplexity (512 tokens, 20 chunks, wikitext-2-raw)

| K | V | PPL | vs q8_0 | Status |
| --- | --- | --- | --- | --- |
| q8_0 | q8_0 | 6.192 | baseline | healthy |
| q8_0 | turbo4 | 6.211 | +0.3% | healthy |
| q8_0 | turbo3 | 6.296 | +1.7% | healthy |
| q8_0 | turbo2 | 6.678 | +7.9% | healthy |
| turbo4 | turbo4 | 6.312 | +1.9% | healthy |
| turbo3 | turbo3 | 6.415 | +3.6% | healthy |
| turbo2 | turbo2 | 7.049 | +13.8% | usable |

104B tolerates symmetric turbo even better than 70B. turbo3/turbo3 is +3.6% (vs +11.4% on 70B). turbo4/turbo4 is nearly lossless at +1.9%. Bigger models = more headroom for quantization stacking.

8.3 Speed

512 Context

| K | V | Prefill (t/s) | Decode (t/s) |
| --- | --- | --- | --- |
| q8_0 | q8_0 | 125.0 | 8.5 |
| q8_0 | turbo4 | 128.5 | 7.6 |
| q8_0 | turbo3 | 129.4 | 7.3 |
| q8_0 | turbo2 | 127.3 | 7.8 |
| turbo4 | turbo4 | 118.0 | 7.9 |
| turbo3 | turbo3 | 125.7 | 6.8 |
| turbo2 | turbo2 | 126.5 | 7.7 |

8K Context

| K | V | Prefill (t/s) | Decode (t/s) |
| --- | --- | --- | --- |
| q8_0 | q8_0 | 101.6 | 7.6 |
| q8_0 | turbo3 | 100.1 | 8.1 |
| turbo3 | turbo3 | 98.4 | 7.6 |

32K Context

| K | V | Prefill (t/s) | Decode (t/s) |
| --- | --- | --- | --- |
| q8_0 | q8_0 | 62.3 | 8.3 |
| turbo3 | turbo3 | 64.5 | 7.8 |

turbo3 prefill faster than q8_0 again at 32K (64.5 vs 62.3 t/s, +3.5%). Same crossover pattern as 70B.

8.4 Context Scaling (the big test)

| K | V | Context | PPL | Pass time | Status |
| --- | --- | --- | --- | --- | --- |
| turbo3 | turbo3 | 48K | 3.672 | 931s | works |
| turbo3 | turbo3 | 64K | 4.321 | 1481s | works |
| turbo3 | turbo3 | 96K | 4.170 | 2966s | works |
| turbo3 | turbo3 | 128K | 4.024 | 4996s | works |

104 billion parameters. 128K full native context. On a MacBook. PPL 4.024. turbo3 (5.12x KV compression). Peak memory ~74GB of 128GB. Pass time 83 minutes.

The context wall that blocked 70B at 49K is fully resolved with iogpu.wired_limit_mb=122880.
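
A back-of-envelope for the ~74GB peak, using the shape from Section 8.1 and the same 5.12x assumption (the gap to the observed peak is Metal compute and scratch buffers, which are not broken out here):

```python
# Rough memory budget for Command-R+ 104B at 128K context (estimate only).
n_layers, n_kv_heads, head_dim = 64, 8, 128     # Section 8.1
tokens = 128 * 1024

elems     = 2 * n_layers * n_kv_heads * head_dim * tokens
kv_fp16   = elems * 2 / 2**30                   # ~32 GiB
kv_turbo3 = kv_fp16 / 5.12                      # ~6.3 GiB
weights   = 58.4                                # GiB loaded (Section 8.1)

print(f"KV fp16 {kv_fp16:.1f} GiB, turbo3 {kv_turbo3:.1f} GiB, "
      f"weights+KV {weights + kv_turbo3:.1f} GiB")
# ~64.7 GiB before compute buffers; observed peak was ~74GB of 128GB.
```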

8.5 NIAH (Needle-In-A-Haystack)

Kamradt single-needle methodology. 104B decode is slow (~8 t/s), so 16K tests timed out.

turbo3 (5.12x compression)

| Depth | 4K | 8K | 16K |
| --- | --- | --- | --- |
| 0% | PASS | PASS | timeout |
| 25% | PASS | PASS | |
| 50% | PASS | PASS | |
| 75% | PASS | PASS | |
| 100% | PASS | PASS | |

10/10 pass at 4K and 8K. 16K timed out due to slow decode, not a retrieval failure.


9. Limitations

  1. Two models tested. Results are for Llama-3.1-70B-Instruct and Command-R+ 104B, both Q4_K_M. Other 70B+ models (Qwen-72B, DeepSeek-67B) may behave differently.

  2. Q4_K_M weights only. Q8_0 weights on 70B would likely show even better turbo PPL, but require ~70GB for weights alone, leaving limited room for KV cache on 128GB.

  3. Metal only. CUDA and HIP backends were not tested at this model size.

  4. Single run per config. PPL measurements are single-run, not averaged across multiple seeds. Error bars are from the wikitext-2 chunk variance, not run-to-run variance.

  5. Boundary V not tested. The auto-enable for turbo2-V was added after this test. Boundary V on 70B turbo2 would likely recover significant quality from the +58.5% degradation.

  6. 104B NIAH limited. 16K context timed out due to slow decode (~8 t/s). Not a retrieval failure. Requires longer query timeout for full coverage.

  7. macOS GPU memory fix required for 70B+ at long context. The iogpu.wired_limit_mb sysctl must be set to ~90% of physical RAM (community testing confirmed >90% risks kernel panics under sustained load). Resets on reboot.


Addendum: Independent Validation (2026-03-31)

The key findings from this stress test have been independently confirmed:

Asymmetric K/V rescue (Sections 2, 8):

  • @HyperionMS2040 — RTX 3090, 10-model CUDA sweep (2026-03-30): q8_0/turbo4 "lossless across all tested architectures" (4 architectures validated). Confirmed model-family-dependent sensitivity: Qwen2.5-7B symmetric turbo3 catastrophic (PPL 3,472) vs Llama 3.1 8B symmetric turbo3 +6.4%.
  • @sztlink — RTX 4090, Qwen3-4B (2026-03-31): Full asymmetric matrix confirms V compression is completely free (1.000 cosine similarity with fp16-K + 2bit-V). All degradation from K.
  • AMD HIP — RX 9070 XT, gfx1201 (2026-03-29): Asymmetric q8_0/turbo4 confirmed at +1.0% PPL. (Author's own testing on Windows AMD hardware.)

turbo3 prefill faster than q8_0 at long context (Sections 3, 8):

  • @spiritbuun — RTX 3090 (X collaborator, ~2026-03-28): Reported near-parity prefill speed with dequant-then-MMA path on CUDA.
  • @AmesianX — Blackwell DGX Spark (2026-03-30): turbo decode 63.5 t/s, faster than q8_0 (50.1 t/s) at 8K context.
  • @dusterbloom — RTX 3090 (2026-03-30): Decode faster on 4/5 models with TBQ3 Flash Attention.
  • @HyperionMS2040 — RTX 3090 (2026-03-30): Asymmetric q8_0-K/turbo3-V 14% faster decode than symmetric turbo3/turbo3 (120.9 vs 106 t/s).

macOS GPU memory wall (Section 5):

  • @treblewoe (X collaborator, 2026-03-31): Confirmed the wall exists. Reported kernel panics at >90% allocation under sustained load (Minimax M2.5 3-bit starting at 100GB). Recommended 90% as safe ceiling. Docs updated accordingly.
