
Commit fef7881

Fix: add full performance benchmark results
1 parent b088f1a commit fef7881

1 file changed: README.md (16 additions & 9 deletions)
@@ -2,7 +2,7 @@
 
 Rust implementation of Google's TurboQuant algorithm (Zandieh et al., ICLR 2026) for extreme KV-cache compression in LLM inference.
 
-[![CI](https://github.com/nicosql/turboquant/actions/workflows/ci.yml/badge.svg)](https://github.com/nicosql/turboquant/actions/workflows/ci.yml)
+[![CI](https://github.com/SaschaOnTour/turboquant/actions/workflows/ci.yml/badge.svg)](https://github.com/SaschaOnTour/turboquant/actions/workflows/ci.yml)
 [![docs.rs](https://docs.rs/turboquant/badge.svg)](https://docs.rs/turboquant)
 [![crates.io](https://img.shields.io/crates/v/turboquant.svg)](https://crates.io/crates/turboquant)
 
@@ -82,15 +82,22 @@ mistralrs run --pa-cache-type tq3 -m Qwen/Qwen3-0.6B
 mistralrs run --pa-cache-type tq4 -m mistralai/Mistral-7B-Instruct-v0.3
 ```
 
-### Integration Benchmarks (CPU-only, Qwen3-0.6B, 512 prompt + 128 decode)
+### Integration Benchmarks (CPU-only, Qwen3-0.6B, 128 decode tokens)
 
-| | Normal | TQ3 | Overhead |
-|---|---|---|---|
-| Prefill | 129.7 tok/s | 130.0 tok/s | **0% overhead** |
-| Decode | 10.5 tok/s | 8.4 tok/s | **~20% (amortized, includes one-time flush)** |
-| KV-Cache Memory | 1x | **~4.9x compression** | |
+| Context | Variant | Total Time | Prefill tok/s | Decode tok/s | Wall-Clock Overhead |
+|---------|---------|------------|---------------|--------------|---------------------|
+| 512 | Normal | 58.4s | 148.1 | 11.8 | |
+| 512 | TQ3 | 64.7s | 141.5 | 9.5 | +11% |
+| 2048 | Normal | 2:38 | 58.4 | 11.5 | |
+| 2048 | TQ3 | 2:55 | 59.5 | 7.7 | +10% |
+| 4096 | Normal | 7:50 | 32.2 | 10.9 | |
+| 4096 | TQ3 | 8:16 | 31.6 | 6.5 | +6% |
+| 16384 | Normal | 1:47:42 | 7.7 | 8.0 | |
+| 16384 | TQ3 | 1:49:00 | 7.6 | 2.9 | +1.2% |
 
-The decode overhead is amortized over 128 tokens and includes a one-time lazy quantization flush. A future GPU kernel implementation (Approach B) would eliminate this overhead entirely. See [Approach B Roadmap](../docs/approach-b-roadmap.md).
+**Key takeaway**: TQ3 overhead **decreases with context length** (11% → 10% → 6% → 1.2%) because prefill dominates at longer contexts and runs at the same speed. The decode throughput difference (dequantization cost) matters less as sequences grow — exactly the regime where KV-cache compression is needed most.
+
+A future GPU kernel implementation (Approach B) would reduce the decode overhead further. See [Approach B Roadmap](../docs/approach-b-roadmap.md).
 
 ### Optimizations
 
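The new table's wall-clock overhead column follows directly from the total times. A minimal sketch of that arithmetic in Rust (the helper name is illustrative and not part of the turboquant crate; the h:mm:ss entries are converted to seconds by hand):

```rust
// Reproduce the "Wall-Clock Overhead" column of the new benchmark table:
// overhead = TQ3 total time / Normal total time - 1.
// Illustrative helper only, not an API of the turboquant crate.
fn overhead_pct(normal_secs: f64, tq3_secs: f64) -> f64 {
    (tq3_secs / normal_secs - 1.0) * 100.0
}

fn main() {
    // 512-token context: 58.4 s (Normal) vs. 64.7 s (TQ3)
    println!("{:.1}%", overhead_pct(58.4, 64.7)); // prints "10.8%", reported as +11%
    // 16384-token context: 1:47:42 = 6462 s vs. 1:49:00 = 6540 s
    println!("{:.1}%", overhead_pct(6462.0, 6540.0)); // prints "1.2%"
}
```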
@@ -141,7 +148,7 @@ The 3-bit packing layout is **identical** to llama.cpp tq3_0 (8 indices into 3 bytes)
 
 ```toml
 [dependencies]
-turboquant = { git = "https://github.com/nicosql/turboquant.git" }
+turboquant = { git = "https://github.com/SaschaOnTour/turboquant.git" }
 ```
 
 ## Building with Native CPU Optimizations
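The hunk context above notes that the 3-bit packing layout matches llama.cpp tq3_0: 8 indices into 3 bytes. A minimal sketch of such a packing, assuming a little-endian bit order (the exact bit order used by tq3_0 may differ):

```rust
/// Pack 8 three-bit indices (values 0..=7) into 3 bytes (24 bits).
/// Illustrative only: assumes little-endian bit order, which may not
/// match the exact tq3_0 layout in llama.cpp.
fn pack3(idx: [u8; 8]) -> [u8; 3] {
    let mut bits: u32 = 0;
    for (i, &v) in idx.iter().enumerate() {
        bits |= ((v & 0b111) as u32) << (3 * i);
    }
    [bits as u8, (bits >> 8) as u8, (bits >> 16) as u8]
}

/// Inverse of `pack3`: recover the 8 indices from 3 packed bytes.
fn unpack3(packed: [u8; 3]) -> [u8; 8] {
    let bits = packed[0] as u32 | (packed[1] as u32) << 8 | (packed[2] as u32) << 16;
    let mut idx = [0u8; 8];
    for (i, slot) in idx.iter_mut().enumerate() {
        *slot = ((bits >> (3 * i)) & 0b111) as u8;
    }
    idx
}

fn main() {
    let idx = [0, 1, 2, 3, 4, 5, 6, 7];
    let packed = pack3(idx);
    assert_eq!(unpack3(packed), idx); // round-trips losslessly
    println!("{:?} -> {:?}", idx, packed);
}
```

At 3 bits per index versus 16 bits for an f16 value, the raw ratio is 16/3 ≈ 5.3x; per-block scale metadata presumably accounts for the gap down to the ~4.9x compression figure the old table cited.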

0 commit comments
