MSE measured over 10,000 random vectors at d=128, matching paper values exactly.
## mistral.rs Integration
turboquant integrates into [mistral.rs](https://github.com/EricLBuehler/mistral.rs) via the `CompressedKVCache` trait. All models with `head_dim` divisible by 32 are supported (Llama, Qwen, Mistral, Falcon, Gemma, DeepSeek, and more).
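As a rough sketch of what such a trait-based backend looks like, the snippet below shows one plausible shape; the method names and signatures are illustrative assumptions, not the actual mistral.rs API:

```rust
// Illustrative sketch only: the real `CompressedKVCache` trait in mistral.rs
// may differ. Method names and signatures here are assumptions.
pub trait CompressedKVCache {
    /// Quantize and append newly computed K/V rows for one layer.
    /// `head_dim` must be divisible by 32 (per the constraint above).
    fn append(&mut self, layer: usize, keys: &[f32], values: &[f32], head_dim: usize);

    /// Dequantize the cached K/V for one layer into a caller-provided buffer.
    fn dequantize_into(&self, layer: usize, out: &mut Vec<f32>);

    /// Number of cached tokens.
    fn seq_len(&self) -> usize;
}
```

A backend behind a trait like this is what lets every model that routes its KV cache through it pick up compression without per-model changes.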
```bash
# Run any model with TurboQuant TQ3 KV-cache compression
mistralrs run --pa-cache-type tq3 -m Qwen/Qwen3-0.6B
mistralrs run --pa-cache-type tq4 -m mistralai/Mistral-7B-Instruct-v0.3
```

3. **Fused CUDA kernel**: Our decode path reads directly from the compressed cache in GPU shared memory — no full-dequantization tensor needed. This eliminates the O(seq_len) memory overhead that makes other approaches slow at long contexts. The result: **zero performance overhead** compared to uncompressed KV-cache on GPU. (A toy sketch of the block-wise access pattern follows below.)
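To see why this matters, here is a toy CPU analogue of the block-wise access pattern. It uses an int8-per-block codec for simplicity, not the project's actual quantization format or its CUDA kernel: scores are computed by dequantizing one fixed-size block at a time, so scratch memory is independent of context length.

```rust
// Toy CPU analogue of block-wise dequantization (not the project's CUDA kernel).
// Keys are stored as i8 codes with one f32 scale per 32-element block; attention
// scores are computed by dequantizing one block at a time into a stack buffer,
// so the extra memory is O(BLOCK), not O(seq_len * head_dim).

const BLOCK: usize = 32;

struct QuantKeys {
    codes: Vec<i8>,   // seq_len * head_dim codes
    scales: Vec<f32>, // one scale per BLOCK codes
    head_dim: usize,
}

fn attention_scores(cache: &QuantKeys, query: &[f32]) -> Vec<f32> {
    assert_eq!(cache.head_dim % BLOCK, 0);
    let seq_len = cache.codes.len() / cache.head_dim;
    let mut scores = Vec::with_capacity(seq_len);
    let mut scratch = [0.0f32; BLOCK]; // fixed-size scratch, reused per block
    for t in 0..seq_len {
        let mut dot = 0.0f32;
        for b in 0..cache.head_dim / BLOCK {
            let base = t * cache.head_dim + b * BLOCK;
            let scale = cache.scales[base / BLOCK];
            for i in 0..BLOCK {
                scratch[i] = cache.codes[base + i] as f32 * scale;
            }
            for i in 0..BLOCK {
                dot += scratch[i] * query[b * BLOCK + i];
            }
        }
        scores.push(dot);
    }
    scores
}

fn main() {
    let cache = QuantKeys {
        codes: vec![1i8; 2 * 64], // 2 tokens, head_dim = 64
        scales: vec![0.5; 2 * 64 / BLOCK],
        head_dim: 64,
    };
    let q = vec![1.0f32; 64];
    println!("{:?}", attention_scores(&cache, &q)); // [32.0, 32.0]
}
```

The actual CUDA kernel applies the same idea per thread block in shared memory; the point is that the scratch size stays constant as seq_len grows.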
### Results compared to llama.cpp
llama.cpp's TQ3_0 implementation is CPU-only and uses a mixed codebook strategy. Our GPU-accelerated PQO3 achieves:

- **49% VRAM savings** at 32K context (Qwen3-0.6B, 28 layers)
- **Zero inference time overhead** on GPU (fused CUDA kernel)
- **Perfect text quality** across all tested models (Qwen3, Llama-3.2, Falcon3)
- **All models supported** via trait-based architecture (no per-model code changes)
### References

- TurboQuant paper: [Zandieh et al., ICLR 2026](https://arxiv.org/pdf/2504.19874)
## Technical Comparison with llama.cpp TurboQuant (tq3_0)
This implementation differs from the [llama.cpp tq3_0 branch](https://github.com/Aaryan-Kapoor/llama.cpp/tree/turboquant-tq3_0) in several important ways:
### 1. QJL Bias Correction (implemented, but PQO recommended)
llama.cpp tq3_0 implements **only PolarQuant** (Stage 1) and omits QJL entirely. Without QJL, inner product estimates carry a systematic multiplicative bias of `2/pi` that accumulates across all keys in the softmax during attention. This bias is not visible in short-context benchmarks but **degrades quality at long contexts** (8k+ tokens), which is the primary use case for KV-cache compression.
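For intuition on why a constant multiplicative bias on logits is harmful, note that scaling every attention score by a factor c acts like a temperature change: softmax(c·s) ≠ softmax(s), so the attention weights shift even though the ranking of keys is unchanged. A toy numeric illustration (ours, not project code):

```rust
// Toy illustration (not project code): a constant multiplicative bias c on
// attention logits acts like a temperature change, so softmax(c * s) != softmax(s).
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let true_scores = [1.0, 2.0, 4.0];
    let c = 2.0 / std::f64::consts::PI; // the uncorrected bias factor
    let biased: Vec<f64> = true_scores.iter().map(|s| c * s).collect();
    println!("true   : {:?}", softmax(&true_scores));
    println!("biased : {:?}", softmax(&biased)); // noticeably flatter distribution
}
```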
Our implementation includes the full TURBOQUANTprod algorithm (Algorithm 2 from the paper) with QJL bias correction, guaranteeing `E[<y,x>_est] = <y,x>` (mathematically unbiased).
**However**: empirical testing confirms the [llama.cpp finding](https://github.com/ggml-org/llama.cpp/discussions/20969) that QJL increases variance, which harms softmax Top-K ranking in attention. The TQ3/TQ4 modes (with QJL) currently produce degraded text quality. **PQO3 (PolarQuant Outlier, without QJL) is the recommended mode** — it provides excellent compression with zero quality loss.
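The variance effect is easy to reproduce in a toy setting: even a perfectly unbiased estimator with zero-mean noise scrambles which keys land in the softmax Top-K as the noise grows. The sketch below is our illustration, not project code; the noise model and constants are arbitrary.

```rust
// Toy illustration (not project code): unbiased but noisy score estimates
// scramble Top-K selection. Uses a tiny xorshift PRNG to avoid dependencies.
fn xorshift(state: &mut u64) -> f64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state >> 11) as f64 / (1u64 << 53) as f64 // uniform in [0, 1)
}

fn top_k(scores: &[f64], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());
    idx.truncate(k);
    idx
}

fn main() {
    let mut rng = 0x9e3779b97f4a7c15u64;
    let (n, k) = (1000, 8);
    // True scores: mostly small, with 8 larger "relevant" keys.
    let scores: Vec<f64> = (0..n)
        .map(|i| if i % 125 == 0 { 1.5 } else { xorshift(&mut rng) })
        .collect();
    let truth = top_k(&scores, k);
    for sigma in [0.1, 0.5, 1.0] {
        // Zero-mean uniform noise of width 2*sigma: unbiased, but variance grows.
        let noisy: Vec<f64> = scores
            .iter()
            .map(|s| s + sigma * (2.0 * xorshift(&mut rng) - 1.0))
            .collect();
        let est = top_k(&noisy, k);
        let overlap = est.iter().filter(|&&i| truth.contains(&i)).count();
        println!("sigma={sigma}: Top-{k} overlap {overlap}/{k}");
    }
}
```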