Skip to content

Latest commit

 

History

History
146 lines (106 loc) · 5.98 KB

File metadata and controls

146 lines (106 loc) · 5.98 KB

TurboQuant Benchmark Results

Comprehensive benchmarks of the PQO3 (PolarQuant Outlier, 3-bit) compressed KV-cache integrated into mistral.rs via the CompressedKVCache trait.

Test date: 2026-04-08 Hardware: NVIDIA GeForce RTX 3090 (24 GB VRAM) Methodology: 3 iterations per measurement, median reported Prompt: "The capital of France is" (quality check: output must contain "Paris")

Quality

All models produce correct text output with PQO3 compression — no quality degradation compared to Normal (uncompressed) KV-cache.

Model Architecture Layers Normal GPU/CPU PQO3 GPU/CPU PQO3-L2 GPU/CPU
Qwen3-0.6B qwen3 28 PASS / PASS PASS / PASS PASS / PASS
Llama-3.2-1B llama 16 PASS / PASS PASS / PASS PASS / PASS
Falcon3-1B llama 18 PASS / PASS PASS / PASS PASS / PASS

PQO3-L2 uses L2-norm normalization (Paper Algorithm 1) instead of MaxNorm (llama.cpp approach). Both produce identical quality.

GPU Performance + VRAM

PQO3 achieves equal or faster inference time compared to Normal, while dramatically reducing VRAM usage. The VRAM savings depend on the number of model layers — more layers mean a larger KV-cache, which benefits more from compression.

Qwen3-0.6B (28 layers, 8 KV-heads, head_dim=128)

Mode 1K ctx 4K ctx 16K ctx 32K ctx
Normal 5s / 1796 MiB 5s / 2500 MiB 8s / 5380 MiB 15s / 9124 MiB
PQO3 5s / 1572 MiB 6s / 1860 MiB 8s / 2948 MiB 15s / 4649 MiB
PQO3-L2 5s / 1572 MiB 5s / 1860 MiB 8s / 2948 MiB 14s / 4388 MiB
VRAM Savings 12% 26% 45% 49-52%

At 32K context, PQO3 uses less than half the VRAM with identical inference time. This is the primary use case: enabling longer contexts on limited VRAM.

Llama-3.2-1B (16 layers, 8 KV-heads, head_dim=128)

Mode 1K ctx 4K ctx 16K ctx 32K ctx
Normal 5s / 2884 MiB 6s / 3332 MiB 8s / 4932 MiB 12s / 7140 MiB
PQO3 5s / 2852 MiB 6s / 3268 MiB 8s / 4676 MiB 13s / 6596 MiB
VRAM Savings 1% 2% 5% 8%

Llama-3.2-1B has fewer layers (16 vs 28), so the KV-cache is a smaller fraction of total VRAM. The savings increase with context length but are modest for this model.

Falcon3-1B (18 layers, 8 KV-heads, head_dim=64)

Mode 1K ctx 4K ctx
Normal 5s / 3716 MiB 5s / 4292 MiB
PQO3 5s / 3716 MiB 6s / 4068 MiB
VRAM Savings 0% 5%

Note: Falcon3-1B has max_position_embeddings=8192. Results beyond 4K context are omitted as the model truncates longer prompts silently.

Key Insight: VRAM Savings Scale with Model Depth

The KV-cache size is proportional to num_layers x num_kv_heads x seq_len x head_dim. Models with more layers benefit significantly more from compression:

KV-Cache VRAM = num_layers x num_kv_heads x seq_len x head_dim x 2 (K+V) x dtype_bytes

For production models (7B+ with 32+ layers), the KV-cache dominates VRAM at long contexts, making PQO3 compression increasingly valuable.

CPU Performance

On CPU, PQO3 adds overhead due to quantization/dequantization without CUDA kernel acceleration. The overhead varies by model (more layers = more quant/dequant work).

Qwen3-0.6B (CPU, 28 layers)

Mode 128 ctx 512 ctx 1K ctx 2K ctx 4K ctx
Normal 16s 23s 32s 64s 182s
PQO3 21s 34s 48s 90s 231s
PQO3-L2 20s 33s 46s 90s 230s
Overhead +31% +48% +50% +41% +27%

Llama-3.2-1B (CPU, 16 layers)

Mode 128 ctx 512 ctx 1K ctx 2K ctx 4K ctx
Normal 24s 31s 41s 68s 158s
PQO3 25s 33s 44s 77s 172s
Overhead +4% +6% +7% +13% +9%

Falcon3-1B (CPU, 18 layers)

Mode 128 ctx 512 ctx 1K ctx 2K ctx 4K ctx
Normal 25s 33s 43s 73s 158s
PQO3 25s 36s 50s 84s 188s
Overhead 0% +9% +16% +15% +19%

CPU Summary

  • CPU overhead is model-dependent: 0-50% depending on layer count
  • More layers = more quantize/dequantize operations per step
  • At longer contexts, the overhead stabilizes (prefill dominates)
  • CPU mode is functional but not the recommended deployment target — GPU with fused kernel is the intended production path

MaxNorm vs L2Norm

Both normalization modes produce equivalent quality and performance:

  • MaxNorm (default): llama.cpp approach, max-abs normalization
  • L2Norm: Paper Algorithm 1, L2-norm to unit sphere

No measurable difference in quality, speed, or VRAM. MaxNorm is recommended as default for compatibility with llama.cpp codebooks.

Limitations

  • head_dim must be divisible by 32: Models with head_dim=80 (e.g., Phi-2) or other non-32-aligned dimensions are not supported. Most modern models (Llama, Qwen, Mistral, Gemma, Falcon, DeepSeek) use head_dim=128.
  • TQ3/TQ4 (QJL correction) quality: The QJL bias correction is mathematically unbiased but increases variance, which harms softmax ranking in attention. This confirms the llama.cpp finding. TQ3/TQ4 are implemented but produce degraded text quality. PQO3 is recommended.
  • Small models: VRAM savings are modest for models with few layers (<20). The compression benefit increases with model size.

Recommended Configuration

# GPU (recommended): PQO3 with MaxNorm — zero performance overhead
mistralrs run --pa-cache-type pqo3 -m <model> --device-layers "0:999"

# CPU (functional): works but with 10-50% overhead
mistralrs run --pa-cache-type pqo3 -m <model> --cpu