Comprehensive benchmarks of the PQO3 (PolarQuant Outlier, 3-bit) compressed KV-cache
integrated into mistral.rs via the
CompressedKVCache trait.
- Test date: 2026-04-08
- Hardware: NVIDIA GeForce RTX 3090 (24 GB VRAM)
- Methodology: 3 iterations per measurement, median reported
- Prompt: "The capital of France is" (quality check: output must contain "Paris")
All models produce correct text output with PQO3 compression; there is no quality degradation compared to the Normal (uncompressed) KV-cache.
| Model | Architecture | Layers | Normal GPU/CPU | PQO3 GPU/CPU | PQO3-L2 GPU/CPU |
|---|---|---|---|---|---|
| Qwen3-0.6B | qwen3 | 28 | PASS / PASS | PASS / PASS | PASS / PASS |
| Llama-3.2-1B | llama | 16 | PASS / PASS | PASS / PASS | PASS / PASS |
| Falcon3-1B | llama | 18 | PASS / PASS | PASS / PASS | PASS / PASS |
PQO3-L2 uses L2-norm normalization (Paper Algorithm 1) instead of MaxNorm (llama.cpp approach). Both produce identical quality.
PQO3 achieves inference time equal to or faster than Normal while dramatically reducing VRAM usage. The VRAM savings depend on the number of model layers: more layers mean a larger KV-cache, which benefits more from compression.
Qwen3-0.6B (28 layers), GPU. Cells show inference time / VRAM.

| Mode | 1K ctx | 4K ctx | 16K ctx | 32K ctx |
|---|---|---|---|---|
| Normal | 5s / 1796 MiB | 5s / 2500 MiB | 8s / 5380 MiB | 15s / 9124 MiB |
| PQO3 | 5s / 1572 MiB | 6s / 1860 MiB | 8s / 2948 MiB | 15s / 4649 MiB |
| PQO3-L2 | 5s / 1572 MiB | 5s / 1860 MiB | 8s / 2948 MiB | 14s / 4388 MiB |
| VRAM Savings | 12% | 26% | 45% | 49-52% |
At 32K context, PQO3 uses less than half the VRAM with identical inference time. This is the primary use case: enabling longer contexts on limited VRAM.
Llama-3.2-1B (16 layers), GPU. Cells show inference time / VRAM.

| Mode | 1K ctx | 4K ctx | 16K ctx | 32K ctx |
|---|---|---|---|---|
| Normal | 5s / 2884 MiB | 6s / 3332 MiB | 8s / 4932 MiB | 12s / 7140 MiB |
| PQO3 | 5s / 2852 MiB | 6s / 3268 MiB | 8s / 4676 MiB | 13s / 6596 MiB |
| VRAM Savings | 1% | 2% | 5% | 8% |
Llama-3.2-1B has fewer layers (16 vs 28), so the KV-cache is a smaller fraction of total VRAM. The savings increase with context length but are modest for this model.
Falcon3-1B (18 layers), GPU. Cells show inference time / VRAM.

| Mode | 1K ctx | 4K ctx |
|---|---|---|
| Normal | 5s / 3716 MiB | 5s / 4292 MiB |
| PQO3 | 5s / 3716 MiB | 6s / 4068 MiB |
| VRAM Savings | 0% | 5% |
Note: Falcon3-1B has max_position_embeddings=8192. Results beyond 4K context are omitted as the model truncates longer prompts silently.
The KV-cache size scales linearly with the number of layers, so models with more layers benefit significantly more from compression:

KV-Cache VRAM = num_layers x num_kv_heads x seq_len x head_dim x 2 (K+V) x dtype_bytes
For production models (7B+ with 32+ layers), the KV-cache dominates VRAM at long contexts, making PQO3 compression increasingly valuable.
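To make the formula concrete, here is a minimal Rust sketch. The shape used in `main` (32 layers, 8 KV heads, head_dim 128, f16) is an assumed 7B-class configuration for illustration, not one of the benchmarked models:

```rust
/// KV-cache size from the formula above:
/// num_layers x num_kv_heads x seq_len x head_dim x 2 (K+V) x dtype_bytes.
fn kv_cache_bytes(num_layers: u64, num_kv_heads: u64, seq_len: u64, head_dim: u64, dtype_bytes: u64) -> u64 {
    num_layers * num_kv_heads * seq_len * head_dim * 2 * dtype_bytes
}

fn main() {
    const GIB: f64 = (1u64 << 30) as f64;
    // Assumed 7B-class shape (not one of the benchmarked models):
    // 32 layers, 8 KV heads (GQA), head_dim 128, f16 (2 bytes).
    let f16_bytes = kv_cache_bytes(32, 8, 32_768, 128, 2);
    println!("f16 KV-cache @ 32K ctx:  {:.2} GiB", f16_bytes as f64 / GIB);
    // A 3-bit payload is ~3/16 of the f16 payload; per-block scale
    // metadata (ignored here) adds a little on top.
    println!("3-bit payload @ 32K ctx: {:.2} GiB", f16_bytes as f64 * 3.0 / 16.0 / GIB);
}
```

At this assumed shape, 4 GiB of f16 cache shrinks to about 0.75 GiB of 3-bit payload, which matches the point above: deep models at long contexts gain the most.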
On CPU, PQO3 adds overhead due to quantization/dequantization without CUDA kernel acceleration. The overhead varies by model (more layers = more quant/dequant work).
Qwen3-0.6B (28 layers), CPU. Cells show inference time.

| Mode | 128 ctx | 512 ctx | 1K ctx | 2K ctx | 4K ctx |
|---|---|---|---|---|---|
| Normal | 16s | 23s | 32s | 64s | 182s |
| PQO3 | 21s | 34s | 48s | 90s | 231s |
| PQO3-L2 | 20s | 33s | 46s | 90s | 230s |
| Overhead | +31% | +48% | +50% | +41% | +27% |
Llama-3.2-1B (16 layers), CPU. Cells show inference time.

| Mode | 128 ctx | 512 ctx | 1K ctx | 2K ctx | 4K ctx |
|---|---|---|---|---|---|
| Normal | 24s | 31s | 41s | 68s | 158s |
| PQO3 | 25s | 33s | 44s | 77s | 172s |
| Overhead | +4% | +6% | +7% | +13% | +9% |
Falcon3-1B (18 layers), CPU. Cells show inference time.

| Mode | 128 ctx | 512 ctx | 1K ctx | 2K ctx | 4K ctx |
|---|---|---|---|---|---|
| Normal | 25s | 33s | 43s | 73s | 158s |
| PQO3 | 25s | 36s | 50s | 84s | 188s |
| Overhead | 0% | +9% | +16% | +15% | +19% |
- CPU overhead is model-dependent, ranging from 0% to 50% with layer count
- More layers = more quantize/dequantize operations per step (see the sketch after this list)
- At longer contexts, the overhead stabilizes (prefill dominates)
- CPU mode is functional but not the recommended deployment target — GPU with fused kernel is the intended production path
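For intuition about where that CPU time goes, below is a minimal scalar sketch of a max-abs 3-bit quantize/dequantize round trip. It illustrates the kind of per-block work the CPU path repeats for every layer and step; it is not the actual PQO3 codebook kernel:

```rust
/// Illustrative 3-bit max-abs round trip over one block of values.
/// This is NOT the PQO3 codebook kernel, just the shape of the per-block
/// quantize/dequantize work the CPU path repeats for every layer and step.
fn quantize_3bit(block: &[f32]) -> (f32, Vec<u8>) {
    // MaxNorm-style scale: map values onto the symmetric levels -3..=3
    // (7 of the 8 representable 3-bit codes).
    let max_abs = block.iter().fold(0.0_f32, |m, &v| m.max(v.abs()));
    let scale = if max_abs > 0.0 { max_abs / 3.0 } else { 1.0 };
    let codes = block
        .iter()
        .map(|&v| ((v / scale).round().clamp(-3.0, 3.0) as i8 + 3) as u8) // 0..=6
        .collect();
    (scale, codes)
}

fn dequantize_3bit(scale: f32, codes: &[u8]) -> Vec<f32> {
    codes.iter().map(|&c| f32::from(c as i8 - 3) * scale).collect()
}

fn main() {
    let block = [0.9_f32, -0.1, 0.4, -0.7];
    let (scale, codes) = quantize_3bit(&block);
    println!("codes = {:?}", codes);
    println!("restored = {:?}", dequantize_3bit(scale, &codes)); // lossy: [0.9, 0.0, 0.3, -0.6]
}
```

The GPU path shows no such overhead in the tables above, consistent with this work being fused into the CUDA attention kernel.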
Both normalization modes produce equivalent quality and performance:
- MaxNorm (default): llama.cpp approach, max-abs normalization
- L2Norm: Paper Algorithm 1, L2-norm to unit sphere
No measurable difference in quality, speed, or VRAM. MaxNorm is recommended as default for compatibility with llama.cpp codebooks.
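The two modes differ only in the scale computed before quantization. Here is a minimal sketch of both normalizations on a single vector (illustrative; not the mistral.rs implementation):

```rust
/// MaxNorm (llama.cpp approach): divide by the max absolute value,
/// so every component lands in [-1, 1]. Assumes a nonzero vector.
fn max_norm(v: &[f32]) -> (f32, Vec<f32>) {
    let scale = v.iter().fold(0.0_f32, |m, &x| m.max(x.abs()));
    (scale, v.iter().map(|&x| x / scale).collect())
}

/// L2Norm (Paper Algorithm 1): divide by the Euclidean norm,
/// projecting the vector onto the unit sphere. Assumes a nonzero vector.
fn l2_norm(v: &[f32]) -> (f32, Vec<f32>) {
    let scale = v.iter().map(|&x| x * x).sum::<f32>().sqrt();
    (scale, v.iter().map(|&x| x / scale).collect())
}

fn main() {
    let v = [3.0_f32, -4.0, 0.0];
    println!("MaxNorm: {:?}", max_norm(&v)); // scale 4.0, [0.75, -1.0, 0.0]
    println!("L2Norm:  {:?}", l2_norm(&v)); // scale 5.0, [0.6, -0.8, 0.0]
}
```

In both cases a single scalar scale accompanies the quantized codes, which is consistent with the identical VRAM and speed numbers above.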
- head_dim must be divisible by 32: Models with head_dim=80 (e.g., Phi-2) or other non-32-aligned dimensions are not supported. Most modern models (Llama, Qwen, Mistral, Gemma, Falcon, DeepSeek) use head_dim=128.
- TQ3/TQ4 (QJL correction) quality: The QJL bias correction is mathematically unbiased but increases variance, which harms softmax ranking in attention (see the sketch after this list). This confirms the llama.cpp finding. TQ3/TQ4 are implemented but produce degraded text quality. PQO3 is recommended.
- Small models: VRAM savings are modest for models with few layers (<20). The compression benefit increases with model size.
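To see why an unbiased estimator can still hurt attention, here is a small Monte Carlo sketch: zero-mean noise on two nearby logits flips their ranking a substantial fraction of the time. The logit values and noise level are made up for illustration, not measured from TQ3/TQ4:

```rust
/// Tiny xorshift64 RNG so the sketch stays dependency-free.
fn xorshift(state: &mut u64) -> f64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state >> 11) as f64 / (1u64 << 53) as f64 // uniform in [0, 1)
}

/// Zero-mean, unit-variance noise (sum of 12 uniforms: approximately Gaussian).
fn noise(state: &mut u64) -> f64 {
    (0..12).map(|_| xorshift(state)).sum::<f64>() - 6.0
}

fn main() {
    // Two close "true" logits, as neighboring attention scores often are,
    // and an illustrative (made-up) estimator standard deviation.
    let (a, b, sigma) = (2.0_f64, 1.8_f64, 0.5_f64);
    let mut state = 0x9e3779b97f4a7c15_u64;
    let (mut flips, n) = (0u32, 100_000);
    for _ in 0..n {
        // Unbiased estimates: E[a_hat] = a and E[b_hat] = b, but noisy.
        let a_hat = a + sigma * noise(&mut state);
        let b_hat = b + sigma * noise(&mut state);
        if b_hat > a_hat {
            flips += 1; // the ranking the softmax sees flipped anyway
        }
    }
    println!("ranking flipped in {:.1}% of trials", 100.0 * f64::from(flips) / n as f64);
}
```

With these numbers the ranking flips in roughly 39% of trials even though both estimates are unbiased, which is exactly the failure mode described in the bullet above.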
Usage:

```bash
# GPU (recommended): PQO3 with MaxNorm, zero performance overhead
mistralrs run --pa-cache-type pqo3 -m <model> --device-layers "0:999"

# CPU (functional): works but with 10-50% overhead
mistralrs run --pa-cache-type pqo3 -m <model> --cpu
```