2 | 2 |
3 | 3 | Rust implementation of Google's TurboQuant algorithm (Zandieh et al., ICLR 2026) for extreme KV-cache compression in LLM inference. |
4 | 4 |
5 | | -[](https://github.com/nicosql/turboquant/actions/workflows/ci.yml) |
| 5 | +[](https://github.com/SaschaOnTour/turboquant/actions/workflows/ci.yml) |
6 | 6 | [](https://docs.rs/turboquant) |
7 | 7 | [](https://crates.io/crates/turboquant) |
8 | 8 |
@@ -82,15 +82,22 @@ mistralrs run --pa-cache-type tq3 -m Qwen/Qwen3-0.6B |
82 | 82 | mistralrs run --pa-cache-type tq4 -m mistralai/Mistral-7B-Instruct-v0.3 |
83 | 83 | ``` |
84 | 84 |
85 | | -### Integration Benchmarks (CPU-only, Qwen3-0.6B, 512 prompt + 128 decode) |
| 85 | +### Integration Benchmarks (CPU-only, Qwen3-0.6B, 128 decode tokens) |
86 | 86 |
87 | | -| | Normal | TQ3 | Overhead | |
88 | | -|---|---|---|---| |
89 | | -| Prefill | 129.7 tok/s | 130.0 tok/s | **0% overhead** | |
90 | | -| Decode | 10.5 tok/s | 8.4 tok/s | **~20% (amortized, includes one-time flush)** | |
91 | | -| KV-Cache Memory | 1x | **~4.9x compression** | | |
| 87 | +| Context | Variant | Total Time | Prefill tok/s | Decode tok/s | Wall-Clock Overhead | |
| 88 | +|---------|---------|-----------|---------------|-------------|---------------------| |
| 89 | +| 512 | Normal | 58.4s | 148.1 | 11.8 | — | |
| 90 | +| 512 | TQ3 | 64.7s | 141.5 | 9.5 | +11% | |
| 91 | +| 2048 | Normal | 2:38 | 58.4 | 11.5 | — | |
| 92 | +| 2048 | TQ3 | 2:55 | 59.5 | 7.7 | +10% | |
| 93 | +| 4096 | Normal | 7:50 | 32.2 | 10.9 | — | |
| 94 | +| 4096 | TQ3 | 8:16 | 31.6 | 6.5 | +6% | |
| 95 | +| 16384 | Normal | 1:47:42 | 7.7 | 8.0 | — | |
| 96 | +| 16384 | TQ3 | 1:49:00 | 7.6 | 2.9 | +1.2% | |
92 | 97 |
93 | | -The decode overhead is amortized over 128 tokens and includes a one-time lazy quantization flush. A future GPU kernel implementation (Approach B) would eliminate this overhead entirely. See [Approach B Roadmap](../docs/approach-b-roadmap.md). |
| 98 | +**Key takeaway**: TQ3 overhead **decreases with context length** (11% → 10% → 6% → 1.2%) because prefill dominates at longer contexts and runs at nearly the same speed under TQ3. The decode throughput difference (dequantization cost) matters less as sequences grow, which is exactly the regime where KV-cache compression is needed most. |
| 99 | + |
| 100 | +A future GPU kernel implementation (Approach B) would reduce the decode overhead further. See [Approach B Roadmap](../docs/approach-b-roadmap.md). |
94 | 101 |
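As a sanity check on the overhead column, the sketch below recomputes it from the table's total times (transcribed to seconds). This is purely illustrative, not part of the crate or the benchmark harness; the run data is copied from the table above.

```rust
// Recompute the Wall-Clock Overhead column from total wall-clock times.
// Times transcribed from the table above (e.g. 1:47:42 -> 6462 s).
fn overhead_pct(normal_s: f64, tq3_s: f64) -> f64 {
    (tq3_s - normal_s) / normal_s * 100.0
}

fn main() {
    // (context length, Normal total [s], TQ3 total [s])
    let runs = [
        (512_u32, 58.4, 64.7),
        (2048, 158.0, 175.0),    // 2:38 vs 2:55
        (4096, 470.0, 496.0),    // 7:50 vs 8:16
        (16384, 6462.0, 6540.0), // 1:47:42 vs 1:49:00
    ];
    for (ctx, normal, tq3) in runs {
        // Prints ~10.8%, ~10.8%, ~5.5%, ~1.2%, in line with the
        // rounded percentages in the table.
        println!("ctx {ctx:>5}: +{:.1}%", overhead_pct(normal, tq3));
    }
}
```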
95 | 102 | ### Optimizations |
96 | 103 |
@@ -141,7 +148,7 @@ The 3-bit packing layout is **identical** to llama.cpp tq3_0 (8 indices into 3 b |
141 | 148 |
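To make the "8 indices into 3 bytes" arithmetic above concrete (8 × 3 bits = 24 bits = 3 bytes), here is a minimal round-trip sketch. Only the 8-into-3-byte grouping is taken from the text; the bit order within each 24-bit group (index 0 in the lowest bits) is an assumption for illustration, not a claim about llama.cpp's actual tq3_0 bit order.

```rust
/// Pack 8 three-bit indices into 3 bytes (8 x 3 bits = 24 bits).
/// ASSUMPTION: index 0 occupies the lowest bits; the real tq3_0
/// ordering may differ. This only illustrates the grouping.
fn pack8(indices: [u8; 8]) -> [u8; 3] {
    let mut bits: u32 = 0;
    for (i, &idx) in indices.iter().enumerate() {
        debug_assert!(idx < 8, "each index must fit in 3 bits");
        bits |= u32::from(idx & 0b111) << (3 * i);
    }
    [bits as u8, (bits >> 8) as u8, (bits >> 16) as u8]
}

/// Inverse of `pack8`: recover the 8 indices from the 3 packed bytes.
fn unpack8(packed: [u8; 3]) -> [u8; 8] {
    let bits = u32::from(packed[0])
        | u32::from(packed[1]) << 8
        | u32::from(packed[2]) << 16;
    std::array::from_fn(|i| ((bits >> (3 * i)) & 0b111) as u8)
}

fn main() {
    let idx = [7, 0, 3, 5, 1, 6, 2, 4];
    assert_eq!(unpack8(pack8(idx)), idx); // lossless round trip
}
```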
142 | 149 | ```toml |
143 | 150 | [dependencies] |
144 | | -turboquant = { git = "https://github.com/nicosql/turboquant.git" } |
| 151 | +turboquant = { git = "https://github.com/SaschaOnTour/turboquant.git" } |
145 | 152 | ``` |
146 | 153 |
147 | 154 | ## Building with Native CPU Optimizations |