All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Breaking: per-layer locking architecture. `PqoCache` and `TqCache` now use `Vec<parking_lot::Mutex<LayerStorage>>` internally, so calls for different layers no longer serialize on a global mutex. This enables concurrent forward passes (e.g. speculative decoding draft + target) to run without lock contention.
- Breaking: `mistralrs-kv-cache` trait bumped to 0.3. All mutating trait methods now take `&self` instead of `&mut self`. Inference engines can now hold a plain `Arc<dyn CompressedKVCache>` instead of `Arc<Mutex<dyn CompressedKVCache>>`. See the `mistralrs-kv-cache` CHANGELOG 0.3.0 for the migration guide.
- `CompressedStorage` split: the public API pivots to `StorageMetadata` + `LayerStorage` + `LayerBuffers<'_>`. `CompressedStorage` is removed. `LayerStorage::buffers()` replaces the four individual `k_indices`/`v_indices`/`k_scales`/`v_scales` accessors.
- Lazy `GpuPrecomputed` init now uses `std::sync::OnceLock` with a helper `ensure_gpu_precomputed()`, replacing the previous `&mut self` `ensure_precomputed` method on each cache.
- Shared test-utility module: `turboquant::test_utils` is now `#[doc(hidden)] pub` so integration tests, benches, and examples can import the LCG helpers and the `make_kv`/`pseudo_random_vec` generators without each redefining them. The module is publicly reachable (and therefore part of the SemVer surface) but hidden from rustdoc; it is intended only for cross-file test/bench/example code.
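The `OnceLock` change above can be sketched as follows. This is a minimal illustration of the pattern, not the crate's actual code; the `Precomputed` struct and field are hypothetical stand-ins for the real GPU tables.

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for the real GpuPrecomputed tables.
struct Precomputed {
    signs: Vec<i8>,
}

struct Cache {
    head_dim: usize,
    precomputed: OnceLock<Precomputed>,
}

impl Cache {
    // Takes &self: the first caller initializes, later callers get the
    // cached value. No &mut self is needed, so shared references work.
    fn ensure_precomputed(&self) -> &Precomputed {
        self.precomputed.get_or_init(|| Precomputed {
            signs: vec![1; self.head_dim],
        })
    }
}

fn main() {
    let cache = Cache { head_dim: 4, precomputed: OnceLock::new() };
    let first = cache.ensure_precomputed().signs.len();
    let second = cache.ensure_precomputed().signs.len(); // reuses the same init
    assert_eq!((first, second), (4, 4));
}
```

Because `get_or_init` synchronizes internally, this is also what makes the `&self` trait methods above safe to call from multiple threads.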
- New concurrency tests (`tests/cache_concurrency_tests.rs`):
  - `parallel_decode_different_layers` — verifies two threads can decode into layer 0 and layer 1 simultaneously.
  - `parallel_prefill_no_corruption` — compares parallel vs serial prefill.
  - `concurrent_reset_decode` — stress-tests the reset/decode race.
  - `layer_independence_under_contention` — 8 threads × 30 decodes, all layers independent.
- `LayerStorage::validate()` — cross-field invariant check, called from `append` via `debug_assert!` to catch state inconsistencies.
- Upstream rustqual bug reports — filed for three rustqual false positives encountered during the refactor.
- IOSP violation in `TqCache::reset` — switched to iterator-chain form so rustqual no longer counts it as a logic+call violation.
- Uncontended single-stream decode is unchanged (`parking_lot::Mutex` is roughly 2× faster than `std::sync::Mutex` when uncontended).
- Multi-stream / multi-layer concurrent decode is now truly parallel — previously all layers serialized on one mutex per cache.
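The per-layer locking shape can be sketched with a dependency-free example. The crate itself uses `parking_lot::Mutex` (whose `lock()` returns the guard directly, with no `unwrap()`); this sketch substitutes `std::sync::Mutex` so it runs without external crates, and the struct and method names are illustrative, not the real API.

```rust
use std::sync::Mutex;
use std::thread;

// Sketch only: one mutex per layer instead of one global mutex per cache.
struct LayerStorage {
    tokens: Vec<u32>,
}

struct Cache {
    layers: Vec<Mutex<LayerStorage>>,
}

impl Cache {
    fn new(n_layers: usize) -> Self {
        Cache {
            layers: (0..n_layers)
                .map(|_| Mutex::new(LayerStorage { tokens: Vec::new() }))
                .collect(),
        }
    }

    // &self, not &mut self: only the target layer's mutex is held,
    // so decodes into different layers proceed in parallel.
    fn append(&self, layer: usize, token: u32) {
        self.layers[layer].lock().unwrap().tokens.push(token);
    }
}

fn main() {
    let cache = Cache::new(2);
    thread::scope(|s| {
        s.spawn(|| for t in 0..100 { cache.append(0, t) });
        s.spawn(|| for t in 0..100 { cache.append(1, t) });
    });
    assert_eq!(cache.layers[0].lock().unwrap().tokens.len(), 100);
    assert_eq!(cache.layers[1].lock().unwrap().tokens.len(), 100);
}
```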
See [0.2.0] for the prior documented release.
- CI hardening: all GitHub Actions pinned to immutable commit SHAs, explicit `permissions: contents: read`, and a `cargo audit` step added.
- Dependabot: added `.github/dependabot.yml` for weekly Cargo and GitHub Actions updates.
- Public API safety: `PqoCache::new()`, `TqCache::new()`, and `compute_qjl_signs()` now return `Result` instead of panicking on invalid input.
- head_dim guard: `GpuPrecomputed::new()` returns an error if `head_dim > 1024` (prevents launching CUDA kernels with too many threads per block).
- CUDA pack helpers: extracted `tq_pack_2bit`, `tq_pack_3bit`, and `tq_pack_4bit` into `tq_common.h` — eliminated 3× copy-pasted packing logic in `tq_quant_kernel.cu`.
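The pack helpers themselves are CUDA C; the underlying idea can be rendered in Rust for illustration. Function names and byte order below are assumptions, not the actual `tq_common.h` layout — the sketch only shows the nibble arithmetic for the 4-bit case.

```rust
// Illustrative 4-bit packing: two indices per byte, low nibble first.
// (The real helpers live in CUDA C; this layout is an assumption.)
fn pack_4bit(indices: &[u8]) -> Vec<u8> {
    indices
        .chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0x0F;
            let hi = pair.get(1).copied().unwrap_or(0) & 0x0F;
            lo | (hi << 4)
        })
        .collect()
}

fn unpack_4bit(packed: &[u8], n: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(n);
    for &byte in packed {
        out.push(byte & 0x0F);
        if out.len() < n {
            out.push(byte >> 4);
        }
    }
    out.truncate(n);
    out
}

fn main() {
    let idx = [1u8, 15, 7, 2, 9];
    let packed = pack_4bit(&idx);
    assert_eq!(packed.len(), 3); // 5 indices -> ceil(5/2) = 3 bytes
    assert_eq!(unpack_4bit(&packed, idx.len()), idx);
}
```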
- mistral.rs integration: transparent `KvCache::TurboQuant` variant — all 55+ models supported without model-code changes. CLI: `--pa-cache-type tq3`.
- Lazy quantization: prefill stores raw vectors; quantization is deferred to the first decode step — 0% prefill overhead.
- Delta dequantization: only dequantize new tokens per step — O(1) instead of O(N) per decode step.
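The delta-dequantization idea can be sketched with a cursor over the already-dequantized prefix. The struct and method names are hypothetical, and the per-token "scale" dequantization is a placeholder for the real TQ decode; only the cursor logic is the point.

```rust
// Sketch: remember how many tokens are already dequantized and only
// process the tail each step -> O(new tokens), not O(N), per decode.
struct DeltaDequant {
    dequantized: Vec<f32>, // one value per token, for simplicity
    cursor: usize,         // tokens already dequantized
}

impl DeltaDequant {
    fn new() -> Self {
        DeltaDequant { dequantized: Vec::new(), cursor: 0 }
    }

    // `quantized` is the full history; only [cursor..] is touched.
    // Returns how many tokens were dequantized this step.
    fn step(&mut self, quantized: &[u8], scale: f32) -> usize {
        let new = &quantized[self.cursor..];
        self.dequantized.extend(new.iter().map(|&q| q as f32 * scale));
        let done = new.len();
        self.cursor = quantized.len();
        done
    }
}

fn main() {
    let mut d = DeltaDequant::new();
    assert_eq!(d.step(&[1, 2, 3], 0.5), 3); // prefill: 3 tokens
    assert_eq!(d.step(&[1, 2, 3, 4], 0.5), 1); // decode: only 1 new token
    assert_eq!(d.dequantized, vec![0.5, 1.0, 1.5, 2.0]);
}
```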
- Batch quantize: `push_batch()` with shared codebook/sign-pattern setup and `QjlBatchResources` for efficient multi-vector quantization during flush.
- Range dequantization: `dequantize_keys_range()`/`dequantize_values_range()` for selective block dequantization without touching the full history.
- Parallel head processing: Rayon-based parallel quantize/dequantize across KV-heads.
- Pre-allocated GPU tensor buffer: `Tensor::slice_set` + `narrow` instead of `Tensor::cat` — eliminates O(N) copy overhead.
- Zero-copy tensor extraction: direct Candle storage access for tensor→Vec conversion without an extra allocation.
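The pre-allocated buffer change can be illustrated without Candle using a plain `Vec`. This is a sketch of the idea only — the struct and methods are hypothetical — but the analogy holds: write new rows in place rather than concatenating, which copies the whole history every step.

```rust
// Candle-free sketch: reserve the full buffer up front, then write each
// new row in place and view only the used prefix.
struct KvBuffer {
    data: Vec<f32>,
    head_dim: usize,
    len: usize, // rows currently in use
}

impl KvBuffer {
    fn with_capacity(max_tokens: usize, head_dim: usize) -> Self {
        KvBuffer { data: vec![0.0; max_tokens * head_dim], head_dim, len: 0 }
    }

    // Analogous to Tensor::slice_set: O(head_dim) per new row,
    // versus Tensor::cat's O(N) re-copy of the whole cache.
    fn push_row(&mut self, row: &[f32]) {
        let start = self.len * self.head_dim;
        self.data[start..start + self.head_dim].copy_from_slice(row);
        self.len += 1;
    }

    // Analogous to Tensor::narrow: a view of the used prefix, no copy.
    fn used(&self) -> &[f32] {
        &self.data[..self.len * self.head_dim]
    }
}

fn main() {
    let mut buf = KvBuffer::with_capacity(4, 2);
    buf.push_row(&[1.0, 2.0]);
    buf.push_row(&[3.0, 4.0]);
    assert_eq!(buf.used(), &[1.0, 2.0, 3.0, 4.0]);
}
```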
- Reusable Vec buffers: `k_data_buf`/`v_data_buf` avoid per-step allocation in the decode hot path.
- `DequantScratch`: reusable scratch buffer for dequantize operations, avoiding repeated allocation.
- `EstimationContext`: caches QJL sign patterns for efficient repeated inner-product estimation.
- Benchmark script (`scripts/benchmark-b0.sh`): automated 4-variant benchmark (CPU/GPU × Normal/TQ3) with wall-clock timing and peak VRAM measurement.
- Setup script (`scripts/setup-env.sh`): full environment setup for new machines (Rust, cargo tools, system deps, Node.js, rustqual from crates.io).
- Developer guide (`docs/developer-guide.md`): complete project documentation with architecture diagram, setup instructions, and key-files reference.
- Approach B roadmap (`docs/approach-b-roadmap.md`): GPU kernel roadmap with memory-bandwidth analysis showing TQ3 could be ~1.4× faster than FP16.
- PackedBlock unification: replaced the separate `BlockTQ2`/`BlockTQ3`/`BlockTQ4` with a single `PackedBlock` struct using a generic `pack_indices_chunked` helper.
- Config fields: `TurboQuantConfig` fields changed to `pub(crate)` with a constructor pattern.
- QjlBlock fields: changed to `pub(crate)` with accessor methods.
- Rotation API: unified `rotate()`/`inverse_rotate()` into a single `rotate(data, order: RotationOrder)` function.
- `entry_count()`: renamed from `len()` on `QuantizedKVCache` to avoid rustqual false positives.
- Codebook module split: `codebook.rs` → `codebook/mod.rs` + `codebook/gen.rs` for SRP compliance.
- Test count: 317 → 327 tests.
- Bit-budget bug: TQ3 was using 3-bit PolarQuant + 1-bit QJL = 4 bits total. Fixed to use (b-1)-bit PolarQuant + 1-bit QJL = b bits total, matching the paper's Algorithm 2.
- Compression ratio: now correctly achieves ~4.9× for TQ3 (was ~3.8× before the bit-budget fix).
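A back-of-envelope check shows how those two figures fall out of the bit budget. The block size and per-block overhead below are assumptions for illustration (64 values with one f16 scale), not the crate's actual storage layout.

```rust
// Hypothetical layout: blocks of 64 FP16 values, one f16 scale per block.
fn compression_ratio(block: u32, bits_per_value: u32, overhead_bits: u32) -> f64 {
    let raw = (block * 16) as f64; // FP16 input bits
    let packed = (block * bits_per_value + overhead_bits) as f64;
    raw / packed
}

fn main() {
    // Post-fix TQ3: 3 bits/value ((b-1)-bit PolarQuant + 1-bit QJL).
    let fixed = compression_ratio(64, 3, 16);
    // Pre-fix TQ3: 4 bits/value (3-bit PolarQuant + 1-bit QJL).
    let buggy = compression_ratio(64, 4, 16);
    println!("{:.2} {:.2}", fixed, buggy); // ~4.9x vs ~3.8x
    assert!(fixed > 4.9 && buggy < 3.8);
}
```

Under these assumed numbers the ratios come out near the ~4.9× and ~3.8× reported above, which is only meant to show the arithmetic is self-consistent.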
- Cache management: fixed `clone_in`/`clone_out`/`set_none` for the TurboQuant variant — previously crashed with "requested capacity above max seq len (0)".
- Pipeline activation: fixed three blockers preventing TQ activation (the `supports_paged_attention` check, the CPU device check, and the `init_cache_config` early return).
- PolarQuant (Stage 1): Walsh-Hadamard rotation + optimal Lloyd-Max scalar quantization with dimension-specific Beta-distribution codebooks for 2/3/4-bit quantization.
- QJL bias correction (Stage 2): 1-bit Quantized Johnson-Lindenstrauss projection ensuring mathematically unbiased inner-product estimates (`E[<y,x>_est] = <y,x>`).
- `QuantizedKVCache`: high-level API for storing quantized key-value pairs and computing attention scores with the full TURBOQUANTprod algorithm (Algorithm 2 from the paper).
- Pre-computed codebooks: optimal Lloyd-Max quantizers for the exact Beta distribution at practical LLM head dimensions (64, 128, 256) for 2/3/4-bit widths.
- Bit-packing: TQ3 layout identical to llama.cpp `tq3_0` (8 indices into 3 bytes, same byte order) for potential interoperability.
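The 8-indices-into-3-bytes arithmetic (8 × 3 bits = 24 bits) can be sketched as below. The actual `tq3_0` byte order is whatever llama.cpp defines; the little-endian bit order here is an assumption used only to show a lossless round trip.

```rust
// Sketch of a 3-bit layout: 8 indices x 3 bits = 24 bits = 3 bytes.
// Bit order is assumed little-endian; the real tq3_0 order may differ.
fn pack_3bit(indices: [u8; 8]) -> [u8; 3] {
    let mut bits: u32 = 0;
    for (i, &idx) in indices.iter().enumerate() {
        bits |= ((idx & 0b111) as u32) << (3 * i);
    }
    [bits as u8, (bits >> 8) as u8, (bits >> 16) as u8]
}

fn unpack_3bit(packed: [u8; 3]) -> [u8; 8] {
    let bits = packed[0] as u32 | (packed[1] as u32) << 8 | (packed[2] as u32) << 16;
    let mut out = [0u8; 8];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = ((bits >> (3 * i)) & 0b111) as u8;
    }
    out
}

fn main() {
    let idx = [0, 1, 2, 3, 4, 5, 6, 7];
    assert_eq!(unpack_3bit(pack_3bit(idx)), idx); // lossless round trip
}
```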
- Hash-based Rademacher: deterministic sign generation for the QJL projection matrix; no cryptographic RNG needed.
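A minimal sketch of the idea: a cheap integer mix maps `(seed, row, col)` deterministically to ±1, so the projection matrix never needs to be stored. The splitmix64-style mix and the function names here are illustrative, not the crate's actual hash.

```rust
// splitmix64-style finalizer (illustrative choice, not the crate's hash).
fn mix(mut x: u64) -> u64 {
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

// Deterministic Rademacher entry: same (seed, row, col) -> same sign.
fn rademacher_sign(seed: u64, row: u64, col: u64) -> i8 {
    let h = mix(seed ^ mix(row).wrapping_add(mix(col).rotate_left(17)));
    if h & 1 == 0 { 1 } else { -1 }
}

fn main() {
    // Deterministic across calls.
    assert_eq!(rademacher_sign(42, 3, 5), rademacher_sign(42, 3, 5));
    // Signs are roughly balanced over a 100 x 100 matrix.
    let pos: i32 = (0..10_000)
        .map(|i| (rademacher_sign(42, i / 100, i % 100) == 1) as i32)
        .sum();
    assert!((4_000..6_000).contains(&pos));
}
```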
- 317 tests, 330 functions, 100.0% quality score (rustqual).
- Examples: `basic_quantize` and `kv_cache_demo`.
- CI workflow: test, clippy, fmt, doc.
- Dual license: MIT OR Apache-2.0.