Commit 4ceec40

feat(turbovec): add 3-bit width (ADR-194 D3) — fills the 2↔4-bit recall cliff
Adds BitWidth::Three (8-level Max-1960 optimal N(0,1) reconstruction levels). pack/unpack, calibration, scoring, and IdMap are width-generic, so only the centroid table and the enum arms change.

Measured (cargo run --release -p ruvector-turbovec, n=5000 uniform-random, dim=256, k=10, no rerank, vs exact L2):

  3-bit: recall@10 0.767, 112 B/vec, 9.8x compression, bias -0.0000

This lands squarely between 2-bit (0.561) and 4-bit (0.879): a useful memory/recall midpoint, ~22% smaller than 4-bit at a cost of ~0.11 recall@10.

Also refresh ADR-194: add the 3-bit Validation row, mark D3 done, widen T2 to {2,3,4}, correct the test count to 16, and scope the provenance note so that the measured recall/compression/bias figures are labeled as measured while the FAISS-competitive claims stay attributed targets.

16 unit + 1 doc-test pass; clippy clean; new code is rustfmt-clean.
1 parent 42b9f78 commit 4ceec40

5 files changed

Lines changed: 49 additions & 23 deletions

crates/ruvector-turbovec/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ rust-version.workspace = true
 license.workspace = true
 authors.workspace = true
 repository.workspace = true
-description = "TurboVec: multi-bit TurboQuant FastScan-style ANN index (2/4-bit Lloyd-Max scalar quantization + TQ+ per-coordinate calibration + length-renormalized unbiased scoring). Implements ADR-194."
+description = "TurboVec: multi-bit TurboQuant FastScan-style ANN index (2/3/4-bit Lloyd-Max scalar quantization + TQ+ per-coordinate calibration + length-renormalized unbiased scoring). Implements ADR-194."

 [[bin]]
 name = "turbovec-demo"

crates/ruvector-turbovec/src/index.rs

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 //! 1. `norm = ‖x‖`, `û = x / norm` — strip & store length (§T1)
 //! 2. `r = P · û` (randomized Hadamard) — reuse `ruvector_rabitq` (§T1)
 //! 3. `z_i = (r_i − shift_i)/scale_i` — TQ+ calibration (§T3)
-//! 4. `q_i = argmin |z_i − centroid|` — Lloyd–Max 2/4-bit SQ (§T2)
+//! 4. `q_i = argmin |z_i − centroid|` — Lloyd–Max 2/3/4-bit SQ (§T2)
 //! 5. `c_x = ⟨r, r̂⟩ / ⟨r̂, r̂⟩` — per-vector unbiased scale (§T4)
 //!
 //! Scoring a query `q` (orthogonal `P` preserves inner products, so
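
The encode steps listed in the doc comment above can be sketched roughly as follows. This is a minimal illustration, not the crate's code: it skips step 2 by treating the rotation `P` as the identity, assumes `r̂` is the dequantized reconstruction `centroid[q_i] * scale_i + shift_i`, omits guards for zero-length input, and uses hypothetical names (`Encoded`, `nearest`, `encode`, `CENTROIDS_3BIT`). Only the centroid values come from this commit.

```rust
/// 3-bit reconstruction levels added in this commit (8-level Max table for N(0,1)).
const CENTROIDS_3BIT: [f32; 8] = [
    -2.152_0, -1.344_0, -0.756_0, -0.245_1, 0.245_1, 0.756_0, 1.344_0, 2.152_0,
];

/// Hypothetical container for one encoded vector (not the crate's type).
struct Encoded {
    norm: f32,      // step 1: stored length
    codes: Vec<u8>, // step 4: one unpacked 3-bit code per coordinate
    c_x: f32,       // step 5: per-vector scale
}

/// Step 4: index of the centroid nearest to `z` (linear scan, first match wins ties).
fn nearest(z: f32, centroids: &[f32]) -> u8 {
    let mut best = 0usize;
    for (i, &c) in centroids.iter().enumerate() {
        if (z - c).abs() < (z - centroids[best]).abs() {
            best = i;
        }
    }
    best as u8
}

fn encode(x: &[f32], shift: &[f32], scale: &[f32]) -> Encoded {
    // Step 1: strip and store the length.
    let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    // Step 2 is skipped: r = û, i.e. P is assumed to be the identity here.
    let r: Vec<f32> = x.iter().map(|v| v / norm).collect();
    // Steps 3-4: calibrate each coordinate, then snap to the nearest centroid.
    let codes: Vec<u8> = r
        .iter()
        .zip(shift.iter().zip(scale))
        .map(|(&ri, (&s, &sc))| nearest((ri - s) / sc, &CENTROIDS_3BIT))
        .collect();
    // Assumed reconstruction r̂: undo the calibration on the chosen centroid.
    let r_hat: Vec<f32> = codes
        .iter()
        .zip(shift.iter().zip(scale))
        .map(|(&q, (&s, &sc))| CENTROIDS_3BIT[q as usize] * sc + s)
        .collect();
    // Step 5: least-squares scale c_x = <r, r̂> / <r̂, r̂>.
    let num: f32 = r.iter().zip(&r_hat).map(|(a, b)| a * b).sum();
    let den: f32 = r_hat.iter().map(|b| b * b).sum();
    Encoded { norm, codes, c_x: num / den }
}
```

With `shift = 0` and `scale = 1/sqrt(d)` for every coordinate, the calibration step simply rescales a unit vector's coordinates to roughly unit variance before the N(0,1) table is applied.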

crates/ruvector-turbovec/src/main.rs

Lines changed: 6 additions & 1 deletion
@@ -78,7 +78,12 @@ fn main() {

     println!("\n=== TurboVec (ADR-194) proof — n={n}, dim={dim}, k={k} ===\n");
     println!("[1] Compression + recall vs exact brute-force L2 + estimator bias");
-    for bw in [BitWidth::One, BitWidth::Two, BitWidth::Four] {
+    for bw in [
+        BitWidth::One,
+        BitWidth::Two,
+        BitWidth::Three,
+        BitWidth::Four,
+    ] {
         recall_and_bias(bw, &data, &queries, dim, k);
     }
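
The body of `recall_and_bias` is not part of this diff. As a rough sketch of what "recall vs exact brute-force L2" usually means, the following computes the exact top-k by squared L2 distance and the fraction of those ids an approximate result recovers. The function names `exact_top_k` and `recall_at_k` are hypothetical and are not taken from the crate.

```rust
use std::collections::HashSet;

/// Exact top-k neighbour ids of `query` by squared L2 distance (brute force).
fn exact_top_k(data: &[Vec<f32>], query: &[f32], k: usize) -> Vec<usize> {
    let mut scored: Vec<(f32, usize)> = data
        .iter()
        .enumerate()
        .map(|(id, v)| {
            let d2: f32 = v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
            (d2, id)
        })
        .collect();
    // Ascending distance; total_cmp gives a total order over f32.
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}

/// recall@k = |approx ∩ exact| / |exact|.
fn recall_at_k(approx: &[usize], exact: &[usize]) -> f32 {
    let truth: HashSet<usize> = exact.iter().copied().collect();
    let hits = approx.iter().filter(|id| truth.contains(*id)).count();
    hits as f32 / exact.len() as f32
}
```

Averaging `recall_at_k` over all queries would give a single figure comparable in kind to the recall@10 values reported in the commit message.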

crates/ruvector-turbovec/src/quantize.rs

Lines changed: 22 additions & 4 deletions
@@ -13,14 +13,17 @@
 //! coordinate.

 /// Supported quantization widths. `One` is included as a correctness/recall
-/// baseline against `ruvector-rabitq`'s 1-bit path; `Two`/`Four` are the
-/// production targets of ADR-194.
+/// baseline against `ruvector-rabitq`'s 1-bit path; `Two`/`Three`/`Four` are
+/// the production targets of ADR-194 (`Three` fills the 2↔4-bit recall gap).
 #[derive(Clone, Copy, Debug, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
 pub enum BitWidth {
     /// 1 bit / coord (2 levels).
     One,
     /// 2 bits / coord (4 levels).
     Two,
+    /// 3 bits / coord (8 levels). Fills the recall gap between 2- and 4-bit;
+    /// near the paper's ~2.5–3.5 bpc quality-neutral sweet spot (ADR-194 §D3).
+    Three,
     /// 4 bits / coord (16 levels).
     Four,
 }
@@ -32,6 +35,7 @@ impl BitWidth {
         match self {
             BitWidth::One => 1,
             BitWidth::Two => 2,
+            BitWidth::Three => 3,
             BitWidth::Four => 4,
         }
     }
@@ -50,6 +54,10 @@ impl BitWidth {
             // ±sqrt(2/π)
             BitWidth::One => &[-0.797_884_6, 0.797_884_6],
             BitWidth::Two => &[-1.510_4, -0.452_8, 0.452_8, 1.510_4],
+            // 8-level Max (1960) optimal N(0,1) reconstruction levels.
+            BitWidth::Three => &[
+                -2.152_0, -1.344_0, -0.756_0, -0.245_1, 0.245_1, 0.756_0, 1.344_0, 2.152_0,
+            ],
             BitWidth::Four => &[
                 -2.732_6, -2.069_0, -1.618_0, -1.256_2, -0.942_4, -0.656_8, -0.388_1, -0.128_4,
                 0.128_4, 0.388_1, 0.656_8, 0.942_4, 1.256_2, 1.618_0, 2.069_0, 2.732_6,
@@ -147,7 +155,12 @@ mod tests {

     #[test]
     fn centroids_are_sorted_and_sized() {
-        for bw in [BitWidth::One, BitWidth::Two, BitWidth::Four] {
+        for bw in [
+            BitWidth::One,
+            BitWidth::Two,
+            BitWidth::Three,
+            BitWidth::Four,
+        ] {
             let c = bw.centroids();
             assert_eq!(c.len(), bw.levels());
             assert!(c.windows(2).all(|w| w[0] < w[1]), "{bw:?} not ascending");
@@ -156,7 +169,12 @@ mod tests {

     #[test]
     fn pack_unpack_roundtrip_all_widths() {
-        for bw in [BitWidth::One, BitWidth::Two, BitWidth::Four] {
+        for bw in [
+            BitWidth::One,
+            BitWidth::Two,
+            BitWidth::Three,
+            BitWidth::Four,
+        ] {
             let dim = 37; // deliberately not byte-aligned
             let codes: Vec<u8> = (0..dim).map(|i| (i % bw.levels()) as u8).collect();
             let packed = pack(&codes, bw);
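
The round-trip test above exercises 3-bit codes that straddle byte boundaries. The crate's `pack`/`unpack` bodies and bit layout are not shown in this diff, so the sketch below assumes a simple LSB-first bit stream and uses the hypothetical names `pack_bits` and `unpack_bits`. It is meant only to illustrate why a width-generic packer needs no special case for 3 bits.

```rust
/// Pack `codes` (each assumed < 2^bits, with bits <= 8) into an LSB-first bit stream.
fn pack_bits(codes: &[u8], bits: usize) -> Vec<u8> {
    // Round the total bit count up to whole bytes.
    let mut out = vec![0u8; (codes.len() * bits + 7) / 8];
    for (i, &code) in codes.iter().enumerate() {
        for b in 0..bits {
            if (code >> b) & 1 == 1 {
                let pos = i * bits + b; // absolute bit position in the stream
                out[pos / 8] |= 1 << (pos % 8);
            }
        }
    }
    out
}

/// Inverse of `pack_bits`: read back `n` codes of `bits` bits each.
fn unpack_bits(packed: &[u8], bits: usize, n: usize) -> Vec<u8> {
    (0..n)
        .map(|i| {
            let mut code = 0u8;
            for b in 0..bits {
                let pos = i * bits + b;
                code |= ((packed[pos / 8] >> (pos % 8)) & 1) << b;
            }
            code
        })
        .collect()
}
```

Under this layout, 37 three-bit codes occupy 14 bytes and 256 occupy 96 bytes. The reported 112 B/vec at dim=256 is consistent with 96 bytes of codes plus 16 bytes of per-vector overhead, although that split is an inference from the reported figures rather than something the diff states.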

docs/adr/ADR-194-ruvector-turbovec-fastscan-index.md

Lines changed: 19 additions & 16 deletions
@@ -1,6 +1,6 @@
 ---
 adr: 194
-title: "ruvector-turbovec — Multi-bit TurboQuant FastScan ANN Index (2/4-bit SQ + TQ+ calibration + nibble-LUT SIMD)"
+title: "ruvector-turbovec — Multi-bit TurboQuant FastScan ANN Index (2/3/4-bit SQ + TQ+ calibration + nibble-LUT SIMD)"
 status: accepted
 date: 2026-05-29
 authors: [oshaal, claude-flow]
@@ -17,16 +17,18 @@ tags: [quantization, ann, vector-search, turboquant, fastscan, simd, lloyd-max,
 > ruvector crate that reuses our existing primitives. The TurboQuant *algorithm*
 > is already partially present in this repo (see §"What already exists"); the
 > contribution here is a **multi-bit scalar-quantized ANN search index** with a
-> FastScan SIMD kernel, which we do **not** currently have. Benchmark claims
-> below are **targets to be validated**, not measured results.
+> FastScan SIMD kernel, which we do **not** currently have. The *recall /
+> compression / bias* figures in "Validation" are **measured** (reproducible via
+> the demo); the *competitive* claims vs FAISS/Milvus remain **targets to be
+> validated** and are attributed to the upstream reference project where cited.

 ## Status

 **Accepted (M1 implemented).** The scalar reference milestone (M1) is
-implemented as `crates/ruvector-turbovec`: rotation reuse + Lloyd–Max 2/4-bit SQ
+implemented as `crates/ruvector-turbovec`: rotation reuse + Lloyd–Max 2/3/4-bit SQ
 + TQ+ calibration + length-renormalized unbiased scoring + `IdMapIndex`
 (O(1) delete, filtered search). Build is green
-(`cargo build --release -p ruvector-turbovec`); 12 unit tests + 1 doc-test pass;
+(`cargo build --release -p ruvector-turbovec`); 16 unit tests + 1 doc-test pass;
 clippy clean. M2–M4 (FastScan SIMD kernel, AVX-512, dispatcher registration)
 remain future work. Measured proof below.

@@ -40,6 +42,7 @@ exact brute-force L2:
 |-------|-----------|----------------------|-------------|------------------|
 | 1-bit | 0.308 | 48 | 25.6× | +0.0005 |
 | 2-bit | 0.561 | 80 | 14.2× | +0.0001 |
+| 3-bit | 0.767 | 112 | 9.8× | −0.0000 |
 | **4-bit** | **0.879** | **144** | **7.5×** | **−0.0000** |

 - **Recall rises monotonically with bit-width** — exactly the 2–4-bit regime the
@@ -87,7 +90,7 @@ ahead of FAISS `IndexPQ` at 4-bit, and FastScan-class scan throughput on ARM —
 all with online ingest and no training phase. **Those are the external project's
 numbers, not this crate's.** This crate's own *measured* results are the
 uniform-random worst-case table under "Validation" above (recall@10 of
-0.308 / 0.561 / 0.879 at 1/2/4-bit); broader competitive benchmarks are listed
+0.308 / 0.561 / 0.767 / 0.879 at 1/2/3/4-bit); broader competitive benchmarks are listed
 as targets-to-validate in "Acceptance criteria" and the SIMD-kernel milestones.

 [RyanCodrai/turbovec]: https://github.com/RyanCodrai/turbovec
@@ -131,12 +134,12 @@ coordinate is ~Beta-distributed → N(0, 1/d), making **per-coordinate scalar
 quantization optimal without a codebook**. We import this type rather than
 reimplement the FWHT.

-### T2 — Lloyd–Max scalar quantization (2-bit / 4-bit)
-Precompute MSE-optimal bucket boundaries for the canonical N(0,1/d) marginal at
-`bit_width ∈ {2, 4}` (4 and 16 buckets). Coordinates become 2-bit (0–3) or
-4-bit (0–15) integers. Boundaries are **constants of the distribution**, not of
-the data → zero training. (ruvllm's codec already has the MSE-quantizer math to
-borrow from.)
+### T2 — Lloyd–Max scalar quantization (2-bit / 3-bit / 4-bit)
+Precompute MSE-optimal bucket boundaries for the canonical N(0,1) marginal at
+`bit_width ∈ {2, 3, 4}` (4, 8, and 16 buckets). Coordinates become 2-bit (0–3),
+3-bit (0–7), or 4-bit (0–15) integers. Boundaries are **constants of the
+distribution**, not of the data → zero training. (ruvllm's codec already has the
+MSE-quantizer math to borrow from.)

 ### T3 — Per-coordinate calibration (TQ+)
 During the *first* `add()` batch, fit two scalars per coordinate
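
For the T2 text in the hunk above: under squared error, a nearest-centroid quantizer's bucket boundaries are the midpoints between adjacent reconstruction levels, so a boundary table can be derived from the centroid table rather than stored separately. The sketch below shows that derivation for any sorted centroid slice. `boundaries` and `bucket` are hypothetical helper names, and whether the crate stores boundaries or scans centroids directly is not visible in this diff.

```rust
/// Decision boundaries of a nearest-centroid quantizer: midpoints of adjacent levels.
/// `centroids` must be sorted ascending.
fn boundaries(centroids: &[f32]) -> Vec<f32> {
    centroids.windows(2).map(|w| 0.5 * (w[0] + w[1])).collect()
}

/// Bucket index of `z`: the number of boundaries at or below it (binary search).
fn bucket(z: f32, bounds: &[f32]) -> u8 {
    bounds.partition_point(|&b| b <= z) as u8
}
```

For the 3-bit table this yields 7 boundaries for 8 buckets, and the lookup costs a 3-step binary search instead of an 8-way distance comparison.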
@@ -185,7 +188,7 @@ determinism contract are.
 | `AnnIndex` trait | `ruvector-rabitq::index` | **implement** |
 | `VectorKernel` / `KernelCaps` | `ruvector-rabitq::kernel` | **implement** |
 | MSE/Lloyd–Max quantizer math | `ruvllm::quantize` | **borrow/extract** |
-| Lloyd–Max boundary tables (2/4-bit) | TurboQuant constants | **build (new)** |
+| Lloyd–Max boundary tables (2/3/4-bit) | TurboQuant constants | **build (new)** |
 | TQ+ per-coordinate calibration || **build (new)** |
 | FastScan nibble-LUT SIMD kernel || **build (new)** |
 | 32-block SoA layout + filtered scan || **build (new)** |
@@ -260,7 +263,7 @@ rather than ad hoc. None of these are bugs in M1 — they are scope boundaries.
 |---|--------------|-----------------|------|
 | D1 | **Provably-unbiased** inner product via a **two-stage** estimator: MSE quantizer + **1-bit QJL on the residual** `r = x − x̂_mse`, score `⟨y, x̂_mse + x̂_qjl⟩`, unbiased by construction with a variance bound. | A single per-vector scalar `c_x = ⟨r,r̂⟩/⟨r̂,r̂⟩` (least-squares magnitude match). *Empirically* near-unbiased (mean cos-bias ≈ 0 on uniform data); **no theoretical guarantee**. Cheaper (no extra residual bits). | **M5 (new):** add the optional QJL-residual stage as a recall/accuracy upgrade path when `c_x` proves insufficient on clustered data. |
 | D2 | Per-coordinate quantizer is **Max-Lloyd-optimal for the exact Beta marginal** `f(x) ∝ (1−x²)^((d−3)/2)`, with tables precomputed **per (bit-width, dimension)**. | Hardcoded Lloyd–Max tables for the **N(0,1) limit** of that Beta + an empirical per-coordinate `shift/scale` (TQ+) patch. Exact only as `d → ∞`; approximate at low/medium `d`. (TQ+ itself is *not* in the paper.) | **M6 (new):** generate d-aware Beta-optimal codebooks offline; keep the N(0,1)+calibration path as the default fast option. |
-| D3 | Highlights **~2.5 and ~3.5 bits/channel** as the quality-neutral operating points. | Ships **1 / 2 / 4-bit** only; a visible recall cliff sits between 2-bit (0.56) and 4-bit (0.88). | **M2 stretch:** add a **3-bit** width (one centroid table) to fill the cliff. |
+| D3 | Highlights **~2.5 and ~3.5 bits/channel** as the quality-neutral operating points. | **Now ships 1 / 2 / 3 / 4-bit.** The added 3-bit width fills the old 2↔4-bit cliff: recall@10 **0.767** at **9.8×** compression (112 B/vec), measured. | Done in M1. Non-integer effective bit-widths (2.5/3.5 bpc) remain future work, achievable via D1's QJL residual or mixed-width coding. |
 | D4 | Closed-form distortion bounds: `D_mse ≤ (√3·π/2)·4^(−b)` (≈2.7× the info-theoretic floor) and `D_prod ≤ (√3·π²·‖y‖²/d)·4^(−b)`. | Tests assert only `recall > 0.5`. | **Test upgrade:** assert measured MSE/IP distortion stays **under the paper's bound** — a theory-grounded oracle stronger than a recall threshold. |
 | D5 | Bounds estimator **variance** (useful for ranking confidence / early termination). | Not surfaced. | Defer; revisit if IVF/rerank composition (ADR-193) needs confidence intervals. |
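
Row D4 proposes asserting measured distortion against the paper's bound. A minimal sketch of such an oracle for the MSE bound is below. It assumes the bound can be compared against the per-coordinate MSE of a unit-variance Gaussian (the N(0,1) limit the hardcoded tables target), and it evaluates that MSE by numerical integration rather than from index data. The function names are hypothetical, and this is not the test the crate ships.

```rust
use std::f64::consts::PI;

/// Standard normal density.
fn phi(x: f64) -> f64 {
    (-0.5 * x * x).exp() / (2.0 * PI).sqrt()
}

/// MSE of nearest-centroid quantization of N(0,1), via midpoint-rule
/// integration of (x - q(x))^2 * phi(x) over [-10, 10].
fn gaussian_mse(centroids: &[f64]) -> f64 {
    let (lo, hi, steps) = (-10.0_f64, 10.0_f64, 200_000usize);
    let h = (hi - lo) / steps as f64;
    (0..steps)
        .map(|i| {
            let x = lo + (i as f64 + 0.5) * h;
            // Distance to the nearest reconstruction level.
            let err = centroids
                .iter()
                .map(|c| (x - c).abs())
                .fold(f64::INFINITY, f64::min);
            err * err * phi(x) * h
        })
        .sum()
}

/// Bound quoted in D4: D_mse <= (sqrt(3) * pi / 2) * 4^(-b).
fn mse_bound(bits: i32) -> f64 {
    3.0_f64.sqrt() * PI / 2.0 * 4.0_f64.powi(-bits)
}
```

A test would then assert `gaussian_mse(table) < mse_bound(b)` for each width. For the 3-bit table in this commit the integral comes out at roughly 0.035 against a bound of about 0.0425, so the check has real but limited headroom.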

@@ -290,12 +293,12 @@ kernel (M2–M4) is a FAISS-lineage engineering layer, *not* part of the paper.
 Implement on branch `claude/ruvector-turbovec-optimization-FhaDh` (this ADR),
 crate work in a follow-up PR. Milestones:

-1. **M1 — Scalar reference (no SIMD).** Rotation reuse + Lloyd–Max 2/4-bit +
+1. **M1 — Scalar reference (no SIMD).** Rotation reuse + Lloyd–Max 2/3/4-bit +
    TQ+ + length-renormalized scoring + `AnnIndex`. Recall + memory parity test
    vs a brute f32 baseline on SIFT1M / a synthetic OpenAI-d1536 set. ✅ *done.*
 2. **M2 — FastScan SIMD kernel.** AVX2 + NEON nibble-LUT, fuzzed bit-identical
    to M1's scalar scorer; `VectorKernel` impl; criterion bench in
-   `benches/turbovec_bench.rs`. *Stretch:* add a **3-bit** width (D3).
+   `benches/turbovec_bench.rs`. (3-bit width already shipped in M1, see D3.)
 3. **M3 — IdMap + filtered search + persistence.** O(1) delete, block-level
    allowlist, `.tv` save/load round-trip test.
 4. **M4 — AVX-512BW kernel + rulake dispatcher registration.**
