Commit 9c84b2e

Merge pull request #2 from SaschaOnTour/feat/rustqual-cleanup-v0.3.0
Complete implementation of the TurboQuant KV-cache compression library based on https://arxiv.org/abs/2502.02631. Compresses KV-cache to 3-4 bits per value with minimal quality loss, enabling longer contexts and lower VRAM usage.

Compression Methods

Three modes implemented, all using block-level PolarQuant (block_size=32) with WHT rotation:

- PQ (PolarQuant): Standard codebook quantization. Simplest mode, good baseline.
- PQO (PolarQuant Outlier): All blocks use higher-bit outlier codebook. Best quality, recommended for production. Uses CUDA fused attention kernel for decode.
- TQ (TurboQuant): Standard codebook + QJL (Quantized Johnson-Lindenstrauss) bias correction. Mathematically unbiased inner-product estimates per paper Algorithm 2.

Each available as 3-bit or 4-bit variant (PQ3, PQ4, PQO3, PQO4, TQ3, TQ4).

Architecture

- CompressedKVCache trait (in separate mistralrs-kv-cache crate): Clean interface between inference engines and compression backends. Exposes prefill() + decode(); the implementation decides internally between fused kernel and dequantization.
- CacheConfig: Single configuration struct for all cache types. outlier_blocks and derived qjl_enabled() determine the mode.
- CUDA kernels: Fused dequant+WHT+attention kernel for PQO decode (no full dequantization needed). Separate quantize and dequantize kernels for the compression pipeline.
- Trait-based module split: PqoCache and TqCache share common helpers (dequantize_full_impl, flatten_kv, quantize_kv_pair) via common.rs.

Code Quality

- Rustqual: 100.0% (0 findings, 438 functions analyzed)
- 369 tests including paper verification, MSE validation, roundtrip tests, and CUDA integration tests
- Module splits for SRP: codebook/tables.rs, packed/indices.rs, precomputed/{rotation,codebooks}.rs, cache/cuda/quantize.rs
- Named constants for all magic numbers, proper error handling (no unwraps)
2 parents 96eb972 + 1894cfe commit 9c84b2e

40 files changed

Lines changed: 7510 additions & 750 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 /target
+Cargo.lock
 *.swp
 *.swo
 .idea/
