Tom Turney · Independent Researcher · GitHub: @TheTom
TurboQuant's turbo3 format stores quantized KV cache values in 32-element blocks, each with its own 16-bit norm. Since the Walsh-Hadamard rotation and norm correction operate on 128-element groups (matching head_dim), all 4 blocks within a group receive the same corrected norm value. This redundancy wastes 6 bytes per group (3 duplicate norms).
We tested whether increasing the storage block size from 32 to 128 (one block per rotation group) affects perplexity, speed, or correctness. Across 3 model architectures (Phi-3 dense, Qwen2 dense, hybrid MoE), 2 context lengths (512, 8192), and 2 Apple Silicon platforms (M5 Max, M2 Pro), perplexity was identical to 4 decimal places at all block sizes. Speed deltas were within measurement noise. The result is a free compression improvement: turbo3 at block_size=128 achieves 5.12x compression (vs 4.57x at block_size=32) with no measurable quality or speed cost.
All testing used q8_0-K + turbo-V cache configurations on Metal (Apple Silicon). Symmetric turbo-K paths and CUDA backends were not validated in this pass.
TurboQuant's quantization operates at two granularities:
- **Rotation group** (`QK_TURBO3_GROUP = 128`): The Walsh-Hadamard Transform (WHT) operates on 128-element groups matching the model's head dimension. After rotation, coordinates are approximately Gaussian with known variance, enabling optimal scalar quantization. The group-level norm correction (grp_norm / recon_norm) is computed over all 128 elements.
- **Storage block** (`QK_TURBO3 = 32`): Each block of 32 quantized elements is stored with its own 16-bit norm, quantization indices, and sign bits. One rotation group contains 4 storage blocks.
The storage block size affects only how many elements share one stored norm value. It does not affect the rotation, centroid selection, or norm correction math.
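As a concrete illustration of the storage-block layout (a sketch only; the field names are placeholders rather than the actual ggml-common.h definitions), the byte counts used throughout this note follow from one fp16 norm, a 2-bit index per element, and a sign bit per element:

```
#include <stdint.h>

#define QK_TURBO3 32   // storage block size: 32 today, 128 proposed

// Hypothetical turbo3 storage block. One fp16 norm is shared by QK_TURBO3
// elements; each element also stores a 2-bit quantization index and a sign bit.
typedef struct {
    uint16_t norm;                   // 2 B shared norm (fp16 bit pattern)
    uint8_t  qs[QK_TURBO3 / 4];      // 2-bit indices: QK_TURBO3/4 bytes
    uint8_t  signs[QK_TURBO3 / 8];   // sign bits:     QK_TURBO3/8 bytes
} block_turbo3;

// 14 B per 32 elements at the current size; 50 B per 128 at the proposed one.
_Static_assert(sizeof(block_turbo3) == 2 + QK_TURBO3/4 + QK_TURBO3/8, "packed layout");
```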
| Block Size | Block Layout | bits/value | Compression vs fp16 |
|---|---|---|---|
| 32 (current) | norm(2B) + qs(8B) + signs(4B) = 14B per 32 | 3.50 | 4.57x |
| 64 | norm(2B) + qs(16B) + signs(8B) = 26B per 64 | 3.25 | 4.92x |
| 128 | norm(2B) + qs(32B) + signs(16B) = 50B per 128 | 3.125 | 5.12x |
The compression improvement comes from amortizing the 2-byte norm over more elements. At block_size=128, one norm covers the entire rotation group, eliminating all redundancy.
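The figures above are pure arithmetic from the byte layout; a quick check in C that only reproduces the table:

```
#include <stdio.h>

int main(void) {
    const int sizes[] = {32, 64, 128};
    for (int i = 0; i < 3; ++i) {
        const int qk     = sizes[i];
        const int bytes  = 2 + qk / 4 + qk / 8;   // norm + 2-bit indices + sign bits
        const double bpv = 8.0 * bytes / qk;      // bits per stored value
        printf("QK=%3d: %2d B/block  %.3f bits/value  %.2fx vs fp16\n",
               qk, bytes, bpv, 16.0 / bpv);
    }
    return 0;
}
```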
All 4 blocks within a 128-element rotation group receive the same corrected norm. This is because norm correction is computed at the group level:
```
grp_norm       = ||original_group||
recon_norm     = ||reconstructed_centroids||
corrected_norm = grp_norm / recon_norm
```
Every block in the group gets corrected_norm. Changing from 4 blocks to 1 block stores this value once instead of four times, but the quantization and dequantization math is identical.
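A minimal sketch of the group-level correction (illustrative, not the actual ggml routine); the point is that the stored norm is a per-group quantity, so the storage block size only controls how many copies of it are written:

```
#include <math.h>

#define QK_TURBO3_GROUP 128   // rotation group size (head_dim)

// Corrected norm for one rotated 128-element group. Every storage block in
// the group stores this same value: 4 copies at block_size=32, 1 at 128.
static float corrected_group_norm(const float *group, const float *recon) {
    float grp_norm = 0.0f, recon_norm = 0.0f;
    for (int i = 0; i < QK_TURBO3_GROUP; ++i) {
        grp_norm   += group[i] * group[i];   // ||original_group||^2
        recon_norm += recon[i] * recon[i];   // ||reconstructed_centroids||^2
    }
    return sqrtf(grp_norm) / sqrtf(recon_norm);
}
```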
Block size is controlled by QK_TURBO3 and QK_TURBO2 defines in ggml/src/ggml-common.h. We added derived macros for the flash attention template parameters:
```
#define NL_TURBO3     (QK_TURBO3 / 16)
#define NL_TURBO3_VEC (QK_TURBO3 / 4)
```

This replaced ~250 hardcoded `nl` values in Metal flash attention template instantiations, making block size a one-line edit. The dequantization functions, set-rows kernels, and quantization functions all use `QK_TURBO3` symbolically and require no additional changes.
| Model | Architecture | Weights | Layers | head_dim |
|---|---|---|---|---|
| phi-4 | Dense (Phi-3) | Q8_0 | 40 | 128 |
| Qwen2.5-7B-Instruct | Dense (Qwen2) | Q4_K_M | 28 | 128 |
| Qwen3.5-35B-A3B | Hybrid MoE (GDN + attention) | Q8_0 | 40 (10 KV) | 128 |
| Machine | SoC | Memory | Bandwidth |
|---|---|---|---|
| M5 Max | Apple M5 Max | 128 GB | 546 GB/s |
| M2 Pro | Apple M2 Pro | 16 GB | 200 GB/s |
All PPL and speed tests used --cache-type-k q8_0 --cache-type-v turbo3 (asymmetric). This is the recommended production configuration for models where K precision matters. Symmetric turbo-K paths were not tested in this block-size pass.
- PPL: `llama-perplexity` on wikitext-2-raw, 512 and 8192 context, 4-20 chunks
- Speed: `llama-bench` with pp512/pp4096 prefill and tg128 decode (a representative invocation is shown below)
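A representative llama-bench invocation (the model path is a placeholder; the cache-type flags mirror the perplexity configuration and assume the fork exposes the turbo types through llama-bench as well):

```
./build/bin/llama-bench \
    -m your-model.gguf \
    -p 512 -n 128 \
    -ctk q8_0 -ctv turbo3 \
    -fa 1 -ngl 99
```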
| Model (512 context) | Block 32 | Block 64 | Block 128 |
|---|---|---|---|
| phi-4 Q8_0 (M5, 20 chunks) | 6.6105 | 6.6105 | 6.6105 |
| Qwen2.5-7B Q4_K_M (M5, 10 chunks) | 7.4471 | — | 7.4471 |
| Qwen2.5-7B Q4_K_M (M2, 10 chunks) | 7.4727 | — | 7.4727 |
| Qwen3.5-35B-A3B Q8_0 (M5, 10 chunks) | 7.0298 | — | 7.0298 |
| Model | Context | Block 32 | Block 128 |
|---|---|---|---|
| phi-4 Q8_0 (M5, 4 chunks) | 8K | 5.7134 | 5.7134 |
| Qwen2.5-7B Q4_K_M (M5, 4 chunks) | 8K | 5.9547 | 5.9547 |
| Qwen2.5-7B Q4_K_M (M2, 4 chunks) | 8K | 5.9477 | 5.9477 |
| phi-4 Q8_0 (M5, 4 chunks) | 32K | 6.1873 | 6.1873 |
PPL is identical to 4 decimal places across all tested models, context lengths (512, 8K, 32K), and hardware. This confirms the theoretical prediction: block size does not affect quantization math.
| Cache Config | Model | Block 32 | Block 128 |
|---|---|---|---|
| q8_0-K / turbo3-V | phi-4 Q8_0 | 6.6105 | 6.6105 |
| q8_0-K / turbo2-V | phi-4 Q8_0 | 6.7346 | 6.7346 |
| q8_0-K / turbo2-V | Qwen2.5-7B Q4_K_M | 7.7989 | 7.7989 |
| q8_0-K / turbo2-V | Qwen2.5-7B Q4_K_M (M2) | 7.7807 | 7.7807 |
| turbo3-K / turbo3-V | phi-4 Q8_0 | 6.9208 | 6.9208 |
| turbo3-K / turbo3-V | Qwen3.5-35B-A3B Q8_0 | 7.0627 | 7.0627 |
| turbo4-K / turbo4-V (always block128) | phi-4 Q8_0 | 6.6624 | 6.6624 |
PPL is identical across all tested cache paths — both asymmetric (q8_0-K + turbo-V) and symmetric (turbo3-K + turbo3-V). turbo2 parity confirmed on both phi-4 and Qwen2.5-7B (including M2). turbo4 already uses block_size=128 natively.
| Depth | 4K Context |
|---|---|
| 0% | pass |
| 50% | pass |
| 100% | pass |
3/3 pass at block_size=128. This is a block128-only run (not a strict A/B with block32), but given that PPL is byte-identical across all other tests, NIAH parity is expected.
| Model | Hardware | Block 32 | Block 128 |
|---|---|---|---|
| phi-4 Q8_0 (14B, dense) | M5 | 6.7398 | 6.7398 |
| Qwen2.5-7B Q4_K_M (dense) | M5 | 7.7852 | 7.7852 |
| Qwen3.5-35B-A3B Q8_0 (MoE) | M5 | 7.0725 | 7.0725 |
| Qwen2.5-7B Q4_K_M (dense) | M2 | 7.8384 | 7.8384 |
PPL identical across all 3 architectures and both hardware platforms. The layer-adaptive V policy (boundary layers q8_0-V, rest turbo2-V) is block-size-independent.
| Model | Metric | Block 32 | Block 128 | Delta |
|---|---|---|---|---|
| phi-4 14B | pp4096 (t/s) | 701 ± 3 | 708 ± 37 | +1.0% |
| phi-4 14B | tg128 (t/s) | 29.55 ± 0.36 | 29.38 ± 0.88 | -0.6% |
| Qwen2.5-7B | pp512 (t/s) | 1939 ± 283 | 1886 ± 316 | -2.7% |
| Qwen2.5-7B | tg128 (t/s) | 77.84 ± 0.26 | 78.34 ± 0.60 | +0.6% |
| Qwen3.5-35B MoE | pp512 (t/s) | 2775 ± 35 | 2773 ± 23 | -0.1% |
| Qwen3.5-35B MoE | tg128 (t/s) | 77.54 ± 0.64 | 77.75 ± 0.50 | +0.3% |
All deltas are within measurement noise. The Qwen2.5-7B prefill shows high variance (±283 t/s) making the -2.7% unreliable. Decode throughput is flat across all models.
A targeted follow-up tested the same model at multiple context lengths on both M2 Pro and M5 Max to determine whether block_size=128 produces a speed benefit specifically on bandwidth-constrained Apple Silicon.
M2 Pro (16 GB, 200 GB/s bandwidth):
| Context | Metric | Block 32 | Block 128 | Delta |
|---|---|---|---|---|
| 512 | prefill (t/s) | 1458 ± 8 | 1466 ± 4 | +0.6% |
| 512 | decode (t/s) | 64.42 ± 0.14 | 66.32 ± 0.40 | +3.0% |
| 8K | prefill (t/s) | 808 ± 0.3 | 811 ± 0.3 | +0.4% |
| 8K | decode (t/s) | 64.05 ± 0.32 | 68.34 ± 0.97 | +6.7% |
| 16K | prefill (t/s) | 538 ± 0.2 | 539 ± 0.3 | +0.3% |
| 16K | decode (t/s) | 64.03 ± 0.44 | 66.60 ± 0.58 | +4.0% |
M2 Pro shows a consistent decode improvement at block_size=128: +3.0% at 512, +6.7% at 8K, +4.0% at 16K. Prefill is effectively flat.
M5 Max matched comparison (128 GB, 546 GB/s bandwidth, same model/config):
| Context | Metric | Block 32 | Block 128 | Delta |
|---|---|---|---|---|
| 512 | prefill (t/s) | 9521 ± 45 | 9377 ± 49 | -1.5% |
| 512 | decode (t/s) | 161.41 ± 2.85 | 152.47 ± 4.49 | -5.5%* |
| 8K | prefill (t/s) | 4853 ± 49 | 4802 ± 68 | -1.0% |
| 8K | decode (t/s) | 161.93 ± 0.39 | 159.80 ± 1.42 | -1.3% |
| 16K | prefill (t/s) | 2999 ± 42 | 3074 ± 43 | +2.5% |
| 16K | decode (t/s) | 155.13 ± 2.71 | 161.52 ± 0.28 | +4.1% |
*The 512 decode -5.5% has high variance (±4.49 t/s). M5 decode deltas are inconsistent in direction and within noise, consistent with earlier phi-4 14B results on M5.
The M2 decode gain is consistent and reproducible across three context lengths. The matched M5 run does not show a comparable benefit. One plausible interpretation is that M2's lower memory bandwidth (200 GB/s vs 546 GB/s) makes each byte of reduced norm storage proportionally more valuable during V-cache dequant. This is a hypothesis based on the hardware difference, not a proven mechanism.
Caveats: This is a single-model result (Qwen2.5-1.5B) on a single M2 variant (M2 Pro). Larger models crash in llama-bench on M2 16GB. Other M2 variants and M1-family chips have different memory subsystems and may show different behavior.
For phi-4 (40 layers, head_dim=128), V cache memory at 512 context:
| Block Size | V Cache (MiB) | Savings vs Block 32 |
|---|---|---|
| 32 | 43.75 | baseline |
| 64 | 40.62 | 7.2% |
| 128 | 39.06 | 10.7% |
The savings scale linearly with context length. At 128K context, the difference is ~1.2 GB.
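The table is internally consistent with the bits/value figures alone; a quick check that derives the element count from the block-32 row and recomputes the other rows (no model parameters assumed):

```
#include <stdio.h>

int main(void) {
    const double mib      = 1024.0 * 1024.0;
    const double n_values = 43.75 * mib * 8.0 / 3.5;   // elements implied by the block-32 row
    const double bpv[]    = {3.5, 3.25, 3.125};        // block 32 / 64 / 128
    for (int i = 0; i < 3; ++i) {
        printf("%.2f MiB\n", n_values * bpv[i] / 8.0 / mib);   // ~43.75, ~40.6, ~39.06
    }
    return 0;
}
```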
The block size result is mechanically straightforward. The WHT rotation, centroid codebook, and norm correction are all computed at the 128-element group level. The storage block size determines only how many 2-byte norms are written per group. At block_size=32, four identical norm values are stored. At block_size=128, one is stored. The quantization indices and sign bits are identical regardless.
This means the current block_size=32 default has been storing 3 redundant norms per group since the initial implementation. The 4.57x compression figure reported for turbo3 was conservative. The achievable compression for the same quantization math is 5.12x.
Credit to @AmesianX whose CUDA TurboQuant implementation with block_size=256 prompted this investigation. His implementation reports 5.2x compression at +1.7% PPL. We have not inspected the internals of his implementation. One plausible interpretation is that 256 refers to the rotation group size (not just storage block size), which would use a 256x256 rotation matrix for models with head_dim=256 (e.g., Qwen3.5-27B). That would be a fundamentally different experiment from ours (changing rotation scope, not just storage chunking). For models with head_dim=128, a 256-element rotation group would not work. Whether his approach changes the rotation group size is not established here.
- **Backend scope.** Tested on Metal (Apple Silicon) only. CUDA block size changes would require separate kernel updates and validation.
- **M2 speed coverage.** M2 speed testing used Qwen2.5-1.5B (the only model small enough for llama-bench on 16 GB). The +3-7% decode gain was consistent across 3 context lengths but is a single-model result. Larger-model speed behavior on M2 is inferred from PPL parity, not directly measured.
- **NIAH.** Tested at block_size=128 only (phi-4, symmetric turbo3, 3 depths at 4K). 3/3 pass. A strict A/B vs block_size=32 was not performed, but is expected to match given byte-identical PPL.
- **Serialization.** Changing block size changes the on-disk turbo3/turbo2 block struct layout. Existing serialized cache state files would be incompatible. Since this is a pre-release branch, this is acceptable.
- **Symmetric turbo4-K.** Symmetric turbo3-K + turbo3-V was validated (phi-4 dense, Qwen3.5-35B MoE). Symmetric turbo4-K + turbo4-V was not separately tested at block_size=128, though turbo4 already uses block_size=128 natively.
For the tested Metal (Apple Silicon) paths, block_size=128 is a safe default change:
- Zero PPL regression across all tested configurations:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3-K + turbo3-V (phi-4 dense, Qwen3.5-35B MoE)
- Boundary V / LA-V7 (phi-4, Qwen2.5-7B, Qwen3.5-35B MoE, M5 + M2)
- 3 model architectures, 3 context lengths (512, 8K, 32K), 2 hardware platforms
- 12% better compression (5.12x vs 4.57x for turbo3)
- No speed regression on M5 Max (flat across 3 models)
- Consistent +3-7% decode improvement on the tested M2 Pro setup (Qwen2.5-1.5B, 3 context lengths)
- NIAH 3/3 pass at block_size=128 (phi-4, symmetric turbo3)
- Eliminates redundant norm storage
Before shipping as a default change, CUDA backends should be separately validated.
To test block_size=128 locally:
- Edit `ggml/src/ggml-common.h`:

  ```
  #define QK_TURBO3 128  // was 32
  #define QK_TURBO2 128  // was 32
  ```

- Rebuild:

  ```
  cmake --build build -j$(nproc)
  ```

- Run PPL:

  ```
  ./build/bin/llama-perplexity \
      -m your-model.gguf \
      -f wikitext-2-raw/wiki.test.raw \
      --cache-type-k q8_0 --cache-type-v turbo3 \
      -c 512 --chunks 10 -ngl 99
  ```
The derived macros (NL_TURBO3, NL_TURBO3_VEC, NL_TURBO2, NL_TURBO2_VEC) propagate the block size change through all Metal flash attention template instantiations automatically.
The block size 128 finding has been independently confirmed:
- @dusterbloom — RTX 3090 (2026-03-30): Tested TBQ3 Flash Attention across 5 model families with block_size=128 builds. Results consistent with our Metal findings.
- @HyperionMS2040 — RTX 3090 (2026-03-30): CUDA warp-to-block mapping fix (PR #32) enabled correct block_size=128 support on CUDA. PPL validated on multiple models after fix.
- @AmesianX — Blackwell DGX Spark (2026-03-30): Used block size 256 (QK_K) for fused FA kernels. Different tradeoff: 256 gives slightly better compression but our testing shows 128 is better quality-per-bit on smaller models (7B-14B).