Storage Block Size Optimization for TurboQuant KV Cache Compression

Tom Turney
Independent Researcher
GitHub: @TheTom


Abstract

TurboQuant's turbo3 format stores quantized KV cache values in 32-element blocks, each with its own 16-bit norm. Since the Walsh-Hadamard rotation and norm correction operate on 128-element groups (matching head_dim), all 4 blocks within a group receive the same corrected norm value. This redundancy wastes 6 bytes per group (3 duplicate norms).

We tested whether increasing the storage block size from 32 to 128 (one block per rotation group) affects perplexity, speed, or correctness. Across 3 model architectures (dense Phi, dense Qwen, hybrid MoE), context lengths from 512 to 32K, and 2 Apple Silicon platforms (M5 Max, M2 Pro), perplexity was identical to 4 decimal places at all block sizes, and speed deltas were within measurement noise. The result is a free compression improvement: turbo3 at block_size=128 achieves 5.12x compression (vs 4.57x at block_size=32) with no measurable quality or speed cost.

All testing used q8_0-K + turbo-V cache configurations on Metal (Apple Silicon). Symmetric turbo-K paths and CUDA backends were not validated in this pass.


1. Background

1.1 Block Size vs Group Size

TurboQuant's quantization operates at two granularities:

  • Rotation group (QK_TURBO3_GROUP = 128): The Walsh-Hadamard Transform (WHT) operates on 128-element groups matching the model's head dimension. After rotation, coordinates are approximately Gaussian with known variance, enabling optimal scalar quantization. The group-level norm correction (grp_norm / recon_norm) is computed over all 128 elements.

  • Storage block (QK_TURBO3 = 32): Each block of 32 quantized elements is stored with its own 16-bit norm, quantization indices, and sign bits. One rotation group contains 4 storage blocks.

The storage block size affects only how many elements share one stored norm value. It does not affect the rotation, centroid selection, or norm correction math.
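
For concreteness, here is a minimal sketch of what a 32-element turbo3 storage block could look like, assuming the 14-byte budget listed in §1.2. The struct and field names are illustrative only; the real definition lives in ggml/src/ggml-common.h.

```c
// Illustrative sketch (not the actual ggml struct) of one turbo3 storage block
// at QK_TURBO3 = 32: 2 B norm + 8 B indices + 4 B signs = 14 B for 32 elements.
#include <stdint.h>

#define QK_TURBO3 32

typedef struct {
    uint16_t norm;                 // fp16 block norm (duplicated 4x per rotation group)
    uint8_t  qs[QK_TURBO3 / 4];    // 2-bit centroid indices, 4 per byte (8 B)
    uint8_t  signs[QK_TURBO3 / 8]; // 1 sign bit per element (4 B)
} block_turbo3_sketch;             // 14 bytes -> 3.5 bits/value
```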

1.2 Compression at Different Block Sizes

| Block Size | Block Layout | bits/value | Compression vs fp16 |
|---|---|---|---|
| 32 (current) | norm (2 B) + qs (8 B) + signs (4 B) = 14 B per 32 | 3.50 | 4.57x |
| 64 | norm (2 B) + qs (16 B) + signs (8 B) = 26 B per 64 | 3.25 | 4.92x |
| 128 | norm (2 B) + qs (32 B) + signs (16 B) = 50 B per 128 | 3.125 | 5.12x |

The compression improvement comes from amortizing the 2-byte norm over more elements. At block_size=128, one norm covers the entire rotation group, eliminating all redundancy.
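
The table follows from simple per-element accounting: 2 index bits and 1 sign bit per element, plus a 16-bit norm amortized over the block. A quick standalone check (plain C, no project code assumed):

```c
// Back-of-envelope reproduction of the compression table in §1.2.
#include <stdio.h>

int main(void) {
    const int sizes[] = {32, 64, 128};
    for (int i = 0; i < 3; i++) {
        // 2 quantization bits + 1 sign bit per element, 16 norm bits per block
        double bits = 2.0 + 1.0 + 16.0 / sizes[i];
        printf("block %3d: %.3f bits/value, %.2fx vs fp16\n",
               sizes[i], bits, 16.0 / bits);
    }
    return 0;
}
// block  32: 3.500 bits/value, 4.57x vs fp16
// block  64: 3.250 bits/value, 4.92x vs fp16
// block 128: 3.125 bits/value, 5.12x vs fp16
```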

1.3 Why PPL Should Be Identical

All 4 blocks within a 128-element rotation group receive the same corrected norm. This is because norm correction is computed at the group level:

grp_norm = ||original_group||
recon_norm = ||reconstructed_centroids||
corrected_norm = grp_norm / recon_norm

Every block in the group gets corrected_norm. Changing from 4 blocks to 1 block stores this value once instead of four times, but the quantization and dequantization math is identical.
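
A minimal dequantization sketch makes the argument concrete: reconstructing an element only ever consumes the group-level corrected norm, so it cannot matter whether that norm was read from one stored copy or four identical copies. The centroid lookup and sign handling below are illustrative, not the actual Metal kernel code.

```c
// Illustrative per-element dequantization. 'corrected_norm' is the group-level
// grp_norm / recon_norm; block size only changes how many times that same
// value was written to storage, never the value itself.
static inline float turbo3_dequant_one(float        corrected_norm,
                                        const float *centroids,  // codebook (illustrative)
                                        unsigned     q_index,    // 2-bit centroid index
                                        int          sign_bit) { // nonzero = negative
    const float c = centroids[q_index];
    return corrected_norm * (sign_bit ? -c : c);
}
```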


2. Experimental Setup

2.1 Implementation

Block size is controlled by the QK_TURBO3 and QK_TURBO2 defines in ggml/src/ggml-common.h. We added derived macros for the flash attention template parameters:

#define NL_TURBO3     (QK_TURBO3 / 16)
#define NL_TURBO3_VEC (QK_TURBO3 / 4)

This replaced ~250 hardcoded nl values in Metal flash attention template instantiations, making block size a one-line edit. The dequantization functions, set-rows kernels, and quantization functions all use QK_TURBO3 symbolically and require no additional changes.
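
As an illustration (not existing code), the derived values at the three candidate sizes, plus guards one might place next to the defines to reject block sizes the scheme cannot express:

```c
// Hypothetical sanity guards beside the defines in ggml-common.h (suggestion only).
// Derived values: QK_TURBO3 = 32 -> NL 2 / NL_VEC 8; 64 -> 4 / 16; 128 -> 8 / 32.
#include <assert.h>

#define QK_TURBO3     128               // the one-line edit under test
#define NL_TURBO3     (QK_TURBO3 / 16)
#define NL_TURBO3_VEC (QK_TURBO3 / 4)

static_assert(QK_TURBO3 % 16 == 0, "NL_TURBO3 assumes a multiple of 16");
static_assert(128 % QK_TURBO3 == 0, "a 128-element rotation group must hold whole blocks");
```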

2.2 Models

| Model | Architecture | Weights | Layers | head_dim |
|---|---|---|---|---|
| phi-4 | Dense (Phi-3) | Q8_0 | 40 | 128 |
| Qwen2.5-7B-Instruct | Dense (Qwen2) | Q4_K_M | 28 | 128 |
| Qwen3.5-35B-A3B | Hybrid MoE (GDN + attention) | Q8_0 | 40 (10 KV) | 128 |

2.3 Hardware

| Machine | SoC | Memory | Bandwidth |
|---|---|---|---|
| M5 Max | Apple M5 Max | 128 GB | 546 GB/s |
| M2 Pro | Apple M2 Pro | 16 GB | 200 GB/s |

2.4 Cache Configuration

All PPL and speed tests used --cache-type-k q8_0 --cache-type-v turbo3 (asymmetric). This is the recommended production configuration for models where K precision matters. Symmetric turbo-K paths were not tested in this block-size pass.

2.5 Benchmark Tools

  • PPL: llama-perplexity on wikitext-2-raw, 512 and 8192 context, 4-20 chunks
  • Speed: llama-bench with pp512/pp4096 prefill and tg128 decode

3. Results

3.1 Perplexity: Block Size Has Zero Effect

Short Context (512, wikitext-2-raw)

| Model | Block 32 | Block 64 | Block 128 |
|---|---|---|---|
| phi-4 Q8_0 (M5, 20 chunks) | 6.6105 | 6.6105 | 6.6105 |
| Qwen2.5-7B Q4_K_M (M5, 10 chunks) | 7.4471 | — | 7.4471 |
| Qwen2.5-7B Q4_K_M (M2, 10 chunks) | 7.4727 | — | 7.4727 |
| Qwen3.5-35B-A3B Q8_0 (M5, 10 chunks) | 7.0298 | — | 7.0298 |

Long Context (wikitext-2-raw)

| Model | Context | Block 32 | Block 128 |
|---|---|---|---|
| phi-4 Q8_0 (M5, 4 chunks) | 8K | 5.7134 | 5.7134 |
| Qwen2.5-7B Q4_K_M (M5, 4 chunks) | 8K | 5.9547 | 5.9547 |
| Qwen2.5-7B Q4_K_M (M2, 4 chunks) | 8K | 5.9477 | 5.9477 |
| phi-4 Q8_0 (M5, 4 chunks) | 32K | 6.1873 | 6.1873 |

PPL is identical to 4 decimal places across all tested models, context lengths (512, 8K, 32K), and hardware. This confirms the theoretical prediction: block size does not affect quantization math.

Turbo Type and Cache Path Parity (M5, 512 ctx, 10 chunks)

| Cache Config | Model | Block 32 | Block 128 |
|---|---|---|---|
| q8_0-K / turbo3-V | phi-4 Q8_0 | 6.6105 | 6.6105 |
| q8_0-K / turbo2-V | phi-4 Q8_0 | 6.7346 | 6.7346 |
| q8_0-K / turbo2-V | Qwen2.5-7B Q4_K_M | 7.7989 | 7.7989 |
| q8_0-K / turbo2-V | Qwen2.5-7B Q4_K_M (M2) | 7.7807 | 7.7807 |
| turbo3-K / turbo3-V | phi-4 Q8_0 | 6.9208 | 6.9208 |
| turbo3-K / turbo3-V | Qwen3.5-35B-A3B Q8_0 | 7.0627 | 7.0627 |
| turbo4-K / turbo4-V (always block128) | phi-4 Q8_0 | 6.6624 | 6.6624 |

PPL is identical across all tested cache paths — both asymmetric (q8_0-K + turbo-V) and symmetric (turbo3-K + turbo3-V). turbo2 parity confirmed on both phi-4 and Qwen2.5-7B (including M2). turbo4 already uses block_size=128 natively.

NIAH (phi-4 Q8_0, M5, symmetric turbo3, block_size=128)

| Depth | 4K Context |
|---|---|
| 0% | pass |
| 50% | pass |
| 100% | pass |

3/3 pass at block_size=128. This is a block128-only run (not a strict A/B with block32), but given that PPL is byte-identical across all other tests, NIAH parity is expected.

Boundary V (TURBO_LAYER_ADAPTIVE=7, q8_0-K + turbo2-V)

| Model | Hardware | Block 32 | Block 128 |
|---|---|---|---|
| phi-4 Q8_0 (14B, dense) | M5 | 6.7398 | 6.7398 |
| Qwen2.5-7B Q4_K_M (dense) | M5 | 7.7852 | 7.7852 |
| Qwen3.5-35B-A3B Q8_0 (MoE) | M5 | 7.0725 | 7.0725 |
| Qwen2.5-7B Q4_K_M (dense) | M2 | 7.8384 | 7.8384 |

PPL identical across all 3 architectures and both hardware platforms. The layer-adaptive V policy (boundary layers q8_0-V, rest turbo2-V) is block-size-independent.

3.2 Speed: No Measurable Regression

M5 Max

| Model | Metric | Block 32 | Block 128 | Delta |
|---|---|---|---|---|
| phi-4 14B | pp4096 (t/s) | 701 ± 3 | 708 ± 37 | +1.0% |
| phi-4 14B | tg128 (t/s) | 29.55 ± 0.36 | 29.38 ± 0.88 | -0.6% |
| Qwen2.5-7B | pp512 (t/s) | 1939 ± 283 | 1886 ± 316 | -2.7% |
| Qwen2.5-7B | tg128 (t/s) | 77.84 ± 0.26 | 78.34 ± 0.60 | +0.6% |
| Qwen3.5-35B MoE | pp512 (t/s) | 2775 ± 35 | 2773 ± 23 | -0.1% |
| Qwen3.5-35B MoE | tg128 (t/s) | 77.54 ± 0.64 | 77.75 ± 0.50 | +0.3% |

All deltas are within measurement noise. The Qwen2.5-7B prefill shows high variance (±283 t/s), making the -2.7% delta unreliable. Decode throughput is flat across all models.

M2 Pro — Matched Performance Follow-up (Qwen2.5-1.5B Q4_K_M)

A targeted follow-up tested the same model at multiple context lengths on both M2 Pro and M5 Max to determine whether block_size=128 produces a speed benefit specifically on bandwidth-constrained Apple Silicon.

M2 Pro (16 GB, 200 GB/s bandwidth):

| Context | Metric | Block 32 | Block 128 | Delta |
|---|---|---|---|---|
| 512 | prefill (t/s) | 1458 ± 8 | 1466 ± 4 | +0.6% |
| 512 | decode (t/s) | 64.42 ± 0.14 | 66.32 ± 0.40 | +3.0% |
| 8K | prefill (t/s) | 808 ± 0.3 | 811 ± 0.3 | +0.4% |
| 8K | decode (t/s) | 64.05 ± 0.32 | 68.34 ± 0.97 | +6.7% |
| 16K | prefill (t/s) | 538 ± 0.2 | 539 ± 0.3 | +0.3% |
| 16K | decode (t/s) | 64.03 ± 0.44 | 66.60 ± 0.58 | +4.0% |

M2 Pro shows a consistent decode improvement at block_size=128: +3.0% at 512, +6.7% at 8K, +4.0% at 16K. Prefill is effectively flat.

M5 Max matched comparison (128 GB, 546 GB/s bandwidth, same model/config):

| Context | Metric | Block 32 | Block 128 | Delta |
|---|---|---|---|---|
| 512 | prefill (t/s) | 9521 ± 45 | 9377 ± 49 | -1.5% |
| 512 | decode (t/s) | 161.41 ± 2.85 | 152.47 ± 4.49 | -5.5%* |
| 8K | prefill (t/s) | 4853 ± 49 | 4802 ± 68 | -1.0% |
| 8K | decode (t/s) | 161.93 ± 0.39 | 159.80 ± 1.42 | -1.3% |
| 16K | prefill (t/s) | 2999 ± 42 | 3074 ± 43 | +2.5% |
| 16K | decode (t/s) | 155.13 ± 2.71 | 161.52 ± 0.28 | +4.1% |

*The 512 decode -5.5% has high variance (±4.49 t/s). M5 decode deltas are inconsistent in direction and within noise, consistent with earlier phi-4 14B results on M5.

The M2 decode gain is consistent and reproducible across three context lengths. The matched M5 run does not show a comparable benefit. One plausible interpretation is that M2's lower memory bandwidth (200 GB/s vs 546 GB/s) makes each byte of reduced norm storage proportionally more valuable during V-cache dequant. This is a hypothesis based on the hardware difference, not a proven mechanism.

Caveats: This is a single-model result (Qwen2.5-1.5B) on a single M2 variant (M2 Pro). Larger models crash in llama-bench on M2 16GB. Other M2 variants and M1-family chips have different memory subsystems and may show different behavior.

3.3 Memory Savings

For phi-4 (40 layers, head_dim=128), V cache memory at 512 context:

| Block Size | V Cache (MiB) | Savings vs Block 32 |
|---|---|---|
| 32 | 43.75 | baseline |
| 64 | 40.62 | 7.2% |
| 128 | 39.06 | 10.7% |

The savings scale linearly with context length. At 128K context, the difference is ~1.2 GB.
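
Because only the per-element bit cost changes, the table can be sanity-checked by scaling the measured block-32 size by the bits/value ratio from §1.2 (the 43.75 MiB baseline is taken from the table above, not recomputed from model shapes):

```c
// Scale the block-32 V-cache size by the bits/value ratio to recover the
// block-64 and block-128 rows above.
#include <stdio.h>

int main(void) {
    const double base_mib = 43.75;               // block 32, phi-4, 512 ctx
    const double bits[]   = {3.5, 3.25, 3.125};  // from the table in §1.2
    const int    sizes[]  = {32, 64, 128};
    for (int i = 0; i < 3; i++) {
        printf("block %3d: %.4f MiB\n", sizes[i], base_mib * bits[i] / bits[0]);
    }
    return 0;
}
// block  32: 43.7500 MiB
// block  64: 40.6250 MiB
// block 128: 39.0625 MiB
```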


4. Interpretation

The block size result is mechanically straightforward. The WHT rotation, centroid codebook, and norm correction are all computed at the 128-element group level. The storage block size determines only how many 2-byte norms are written per group. At block_size=32, four identical norm values are stored. At block_size=128, one is stored. The quantization indices and sign bits are identical regardless.

This means the current block_size=32 default has been storing 3 redundant norms per group since the initial implementation. The 4.57x compression figure reported for turbo3 was conservative. The achievable compression for the same quantization math is 5.12x.


5. Relation to External Implementations

Credit to @AmesianX, whose CUDA TurboQuant implementation with block_size=256 prompted this investigation. His implementation reports 5.2x compression at +1.7% PPL. We have not inspected its internals. One plausible interpretation is that 256 refers to the rotation group size (not just the storage block size), which would use a 256x256 rotation matrix for models with head_dim=256 (e.g., Qwen3.5-27B). That would be a fundamentally different experiment from ours (changing rotation scope, not just storage chunking). For models with head_dim=128, a 256-element rotation group would not work. Whether his approach changes the rotation group size is not established here.


6. Limitations

  1. Backend scope. Tested on Metal (Apple Silicon) only. CUDA block size changes would require separate kernel updates and validation.

  2. M2 speed coverage. M2 speed testing used Qwen2.5-1.5B (the only model small enough for llama-bench on 16 GB). The +3-7% decode gain was consistent across 3 context lengths but is a single-model result. Larger-model speed behavior on M2 is inferred from PPL parity, not directly measured.

  3. NIAH. Tested at block_size=128 only (phi-4, symmetric turbo3, 3 depths at 4K). 3/3 pass. Strict A/B vs block32 was not performed, but is expected to match given byte-identical PPL.

  4. Serialization. Changing block size changes the on-disk turbo3/turbo2 block struct layout. Existing serialized cache state files would be incompatible. Since this is a pre-release branch, this is acceptable.

  5. Symmetric turbo4-K. Symmetric turbo3-K + turbo3-V was validated (phi-4 dense, Qwen3.5-35B MoE). Symmetric turbo4-K + turbo4-V was not separately tested at block_size=128, though turbo4 already uses block_size=128 natively.


7. Recommendation

For the tested Metal (Apple Silicon) paths, block_size=128 is a safe default change:

  • Zero PPL regression across all tested configurations:
    • Asymmetric q8_0-K + turbo{2,3}-V
    • Symmetric turbo3-K + turbo3-V (phi-4 dense, Qwen3.5-35B MoE)
    • Boundary V / LA-V7 (phi-4, Qwen2.5-7B, Qwen3.5-35B MoE, M5 + M2)
    • 3 model architectures, 3 context lengths (512, 8K, 32K), 2 hardware platforms
  • 12% better compression (5.12x vs 4.57x for turbo3)
  • No speed regression on M5 Max (flat across 3 models)
  • Consistent +3-7% decode improvement on the tested M2 Pro setup (Qwen2.5-1.5B, 3 context lengths)
  • NIAH 3/3 pass at block_size=128 (phi-4, symmetric turbo3)
  • Eliminates redundant norm storage

Before shipping as a default change, CUDA backends should be separately validated.


8. Reproduction

To test block_size=128 locally:

  1. Edit ggml/src/ggml-common.h:

    #define QK_TURBO3 128  // was 32
    #define QK_TURBO2 128  // was 32
  2. Rebuild: cmake --build build -j$(nproc)

  3. Run PPL:

    ./build/bin/llama-perplexity \
      -m your-model.gguf \
      -f wikitext-2-raw/wiki.test.raw \
      --cache-type-k q8_0 --cache-type-v turbo3 \
      -c 512 --chunks 10 -ngl 99

The derived macros (NL_TURBO3, NL_TURBO3_VEC, NL_TURBO2, NL_TURBO2_VEC) propagate the block size change through all Metal flash attention template instantiations automatically.


Addendum: Independent Validation (2026-03-31)

The block size 128 finding has been independently confirmed:

  • @dusterbloom — RTX 3090 (2026-03-30): Tested TBQ3 Flash Attention across 5 model families with block_size=128 builds. Results consistent with our Metal findings.
  • @HyperionMS2040 — RTX 3090 (2026-03-30): CUDA warp-to-block mapping fix (PR #32) enabled correct block_size=128 support on CUDA. PPL validated on multiple models after fix.
  • @AmesianX — Blackwell DGX Spark (2026-03-30): Used block size 256 (QK_K) for fused FA kernels. Different tradeoff: 256 gives slightly better compression but our testing shows 128 is better quality-per-bit on smaller models (7B-14B).