Commit 9c84b2e
Merge pull request #2 from SaschaOnTour/feat/rustqual-cleanup-v0.3.0
Complete implementation of the TurboQuant KV-cache compression library based on https://arxiv.org/abs/2502.02631. Compresses KV-cache to 3-4 bits per
value with minimal quality loss, enabling longer contexts and lower VRAM usage.
Compression Methods
Three modes implemented, all using block-level PolarQuant (block_size=32) with WHT rotation:
- PQ (PolarQuant): Standard codebook quantization. Simplest mode, good baseline.
- PQO (PolarQuant Outlier): All blocks use higher-bit outlier codebook. Best quality, recommended for production. Uses CUDA fused attention kernel for decode.
- TQ (TurboQuant): Standard codebook + QJL (Quantized Johnson-Lindenstrauss) bias correction. Mathematically unbiased inner-product estimates per paper Algorithm 2.
Each mode is available in a 3-bit and a 4-bit variant (PQ3, PQ4, PQO3, PQO4, TQ3, TQ4).
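To make the shared block pipeline concrete, here is a minimal CPU-side sketch of rotating one 32-value block with an orthonormal Walsh-Hadamard transform, storing its norm, and mapping each normalized coordinate to the nearest codebook entry. Everything in it is an assumption for illustration: the function names, the per-block norm storage, and especially `uniform_codebook`, which is an evenly spaced placeholder rather than the codebooks this library ships. It does not cover the outlier codebook or the QJL correction.

```rust
/// Block size quoted in the commit message (block_size=32).
const BLOCK_SIZE: usize = 32;

/// In-place orthonormal fast Walsh-Hadamard transform.
/// With the 1/sqrt(n) scaling the transform is its own inverse.
fn wht_in_place(block: &mut [f32; BLOCK_SIZE]) {
    let mut h = 1;
    while h < BLOCK_SIZE {
        for start in (0..BLOCK_SIZE).step_by(2 * h) {
            for i in start..start + h {
                let (a, b) = (block[i], block[i + h]);
                block[i] = a + b;
                block[i + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (BLOCK_SIZE as f32).sqrt();
    for v in block.iter_mut() {
        *v *= scale;
    }
}

/// Placeholder codebook (assumption): 2^bits evenly spaced levels in [-0.45, 0.45].
fn uniform_codebook(bits: u32) -> Vec<f32> {
    let levels = 1usize << bits;
    (0..levels)
        .map(|i| -0.45 + 0.9 * i as f32 / (levels - 1) as f32)
        .collect()
}

/// Index of the codebook entry closest to `x`.
fn nearest_index(codebook: &[f32], x: f32) -> u8 {
    let mut best = 0usize;
    let mut best_dist = f32::INFINITY;
    for (i, &c) in codebook.iter().enumerate() {
        let d = (x - c).abs();
        if d < best_dist {
            best = i;
            best_dist = d;
        }
    }
    best as u8
}

/// Rotate the block, keep its norm, and quantize the normalized coordinates.
fn quantize_block(input: &[f32; BLOCK_SIZE], codebook: &[f32]) -> (f32, [u8; BLOCK_SIZE]) {
    let mut rotated = *input;
    wht_in_place(&mut rotated);
    let norm = rotated.iter().map(|v| v * v).sum::<f32>().sqrt();
    let mut indices = [0u8; BLOCK_SIZE];
    if norm > 0.0 {
        for (idx, v) in indices.iter_mut().zip(rotated.iter()) {
            *idx = nearest_index(codebook, v / norm);
        }
    }
    (norm, indices)
}

/// Look up the codebook values, rescale by the norm, and undo the rotation.
fn dequantize_block(norm: f32, indices: &[u8; BLOCK_SIZE], codebook: &[f32]) -> [f32; BLOCK_SIZE] {
    let mut out = [0.0f32; BLOCK_SIZE];
    for (o, &idx) in out.iter_mut().zip(indices.iter()) {
        *o = codebook[idx as usize] * norm;
    }
    wht_in_place(&mut out);
    out
}
```

Calling `quantize_block(&block, &uniform_codebook(3))` yields one `f32` norm plus 32 indices that each fit in 3 bits; packing those indices densely is a separate step not shown here.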
Architecture
- CompressedKVCache trait (in the separate mistralrs-kv-cache crate): Clean interface between inference engines and compression backends. It exposes prefill() and decode(); the implementation decides internally between the fused kernel and dequantization.
- CacheConfig: Single configuration struct for all cache types. outlier_blocks and the derived qjl_enabled() determine the mode.
- CUDA kernels: Fused dequant+WHT+attention kernel for PQO decode (no full dequantization needed). Separate quantize and dequantize kernels for the compression pipeline.
- Trait-based module split: PqoCache and TqCache share common helpers (dequantize_full_impl, flatten_kv, quantize_kv_pair) via common.rs.
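The idea of a prefill/decode interface that hides the backend's strategy can be sketched as follows. This is not the CompressedKVCache trait itself, whose signatures are not shown in this commit message: `KvCacheSketch`, `PlainCache`, and the use of plain `Vec<f32>` rows instead of tensors are all hypothetical, and the toy backend stores values uncompressed so the example stays self-contained.

```rust
/// Hypothetical interface for illustration; the real trait's signatures may differ.
pub trait KvCacheSketch {
    /// Store the keys and values for a whole prompt.
    fn prefill(&mut self, keys: Vec<Vec<f32>>, values: Vec<Vec<f32>>);
    /// Append one token's key/value, then return the attention output for `query`.
    fn decode(&mut self, query: &[f32], key: Vec<f32>, value: Vec<f32>) -> Vec<f32>;
}

/// Toy backend (assumption): keeps everything in f32 without compression.
#[derive(Default)]
pub struct PlainCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl KvCacheSketch for PlainCache {
    fn prefill(&mut self, keys: Vec<Vec<f32>>, values: Vec<Vec<f32>>) {
        self.keys.extend(keys);
        self.values.extend(values);
    }

    fn decode(&mut self, query: &[f32], key: Vec<f32>, value: Vec<f32>) -> Vec<f32> {
        self.keys.push(key);
        self.values.push(value);
        // A compressed backend could branch here between a fused kernel and
        // dequantize-then-attend; the caller would not see the difference.
        let scale = 1.0 / (query.len() as f32).sqrt();
        let scores: Vec<f32> = self
            .keys
            .iter()
            .map(|k| k.iter().zip(query).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        // Numerically stable softmax over the scores.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let weights: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let total: f32 = weights.iter().sum();
        // Weighted sum of the cached values.
        let mut out = vec![0.0f32; self.values[0].len()];
        for (w, v) in weights.iter().zip(&self.values) {
            for (o, x) in out.iter_mut().zip(v) {
                *o += (w / total) * x;
            }
        }
        out
    }
}
```

An inference engine written against such a trait only calls `prefill` once per prompt and `decode` once per generated token, so swapping a plain cache for a compressed one would not change the calling code.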
Code Quality
- Rustqual: 100.0% (0 findings, 438 functions analyzed)
- 369 tests including paper verification, MSE validation, roundtrip tests, and CUDA integration tests
- Module splits for SRP: codebook/tables.rs, packed/indices.rs, precomputed/{rotation,codebooks}.rs, cache/cuda/quantize.rs
- Named constants for all magic numbers, proper error handling (no unwraps)

40 files changed
Lines changed: 7510 additions & 750 deletions
File tree
- docs
- src
- cache
- cuda
- kernels
- precomputed
- codebook
- packed
- tests