feat(turbovec): add 3-bit width (ADR-194 D3) — fills the 2↔4-bit recall cliff

shaal · shaal · commit 4ceec40c6ab0 · 2026-05-29T23:52:13.000-04:00
Adds BitWidth::Three, backed by the 8-level optimal N(0,1) reconstruction
levels from Max (1960). pack/unpack, calibration, scoring, and IdMap are
width-generic, so only the centroid table and the enum arms change.
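
To illustrate why a new width needs only a centroid table, here is a minimal sketch of nearest-centroid scalar quantization using the 3-bit levels from this commit. The helper name `nearest_code` and the linear scan are assumptions for illustration; the crate's actual quantizer may be organized differently.

```rust
/// 8-level N(0,1) reconstruction levels, copied from the 3-bit table in this commit.
const CENTROIDS_3BIT: [f32; 8] = [
    -2.152_0, -1.344_0, -0.756_0, -0.245_1, 0.245_1, 0.756_0, 1.344_0, 2.152_0,
];

/// Hypothetical helper (not the crate's API): index of the centroid nearest
/// to the calibrated coordinate `z`, found by a linear scan. On an exact tie
/// the lower index wins.
fn nearest_code(z: f32, centroids: &[f32]) -> u8 {
    let mut best = 0usize;
    let mut best_dist = f32::INFINITY;
    for (i, &c) in centroids.iter().enumerate() {
        let d = (z - c).abs();
        if d < best_dist {
            best_dist = d;
            best = i;
        }
    }
    best as u8
}

fn main() {
    // Values outside the table's range clamp to the outermost levels.
    let z = [-3.0_f32, -1.0, 0.3, 1.0, 5.0];
    let codes: Vec<u8> = z
        .iter()
        .map(|&v| nearest_code(v, &CENTROIDS_3BIT))
        .collect();
    println!("{codes:?}");
}
```

Because the scan is written over a slice, the same function would serve a 4- or 16-level table unchanged, which matches the width-generic claim above.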

Measured (cargo run --release -p ruvector-turbovec, n=5000 uniform-random,
dim=256, k=10, no rerank, vs exact L2):
  3-bit: recall@10 0.767, 112 B/vec, 9.8x compression, bias -0.0000
landing squarely between 2-bit (0.561) and 4-bit (0.879): a useful
memory/recall midpoint (~22% smaller than 4-bit at a cost of ~0.11 recall@10).
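
The per-vector sizes can be sanity-checked with a short sketch. It assumes a fixed 16-byte per-vector overhead, inferred from the fact that each reported size at dim=256 (48/80/112/144 B) sits 16 B above the packed-code size; the function name `bytes_per_vec` and the overhead constant are hypothetical and not taken from the crate.

```rust
/// Assumed fixed per-vector overhead in bytes. Inferred from the reported
/// sizes, not read from the crate's layout.
const OVERHEAD_BYTES: usize = 16;

/// Hypothetical helper: packed code bytes (rounded up to a whole byte)
/// plus the assumed per-vector overhead.
fn bytes_per_vec(dim: usize, bits: usize) -> usize {
    (dim * bits + 7) / 8 + OVERHEAD_BYTES
}

fn main() {
    let dim = 256;
    for bits in [1, 2, 3, 4] {
        println!("{bits}-bit: {} B/vec", bytes_per_vec(dim, bits));
    }
    // Relative saving of 3-bit over 4-bit storage.
    let saving = 1.0 - bytes_per_vec(dim, 3) as f64 / bytes_per_vec(dim, 4) as f64;
    println!("3-bit vs 4-bit saving: {:.1}%", saving * 100.0);
}
```

Under that assumption the 3-bit figure of 112 B/vec and the ~22% saving over 4-bit both follow directly from the arithmetic.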

Also refresh ADR-194: add the 3-bit Validation row, mark D3 done, widen
T2 to {2,3,4}, correct the test count to 16, and scope the provenance
note so the measured recall/compression/bias figures are called measured
while the FAISS-competitive claims stay attributed targets.

16 unit + 1 doc-test pass; clippy clean; new code is rustfmt-clean.
diff --git a/crates/ruvector-turbovec/Cargo.toml b/crates/ruvector-turbovec/Cargo.toml
@@ -6,7 +6,7 @@ rust-version.workspace = true
 license.workspace = true
 authors.workspace = true
 repository.workspace = true
-description = "TurboVec: multi-bit TurboQuant FastScan-style ANN index (2/4-bit Lloyd-Max scalar quantization + TQ+ per-coordinate calibration + length-renormalized unbiased scoring). Implements ADR-194."
+description = "TurboVec: multi-bit TurboQuant FastScan-style ANN index (2/3/4-bit Lloyd-Max scalar quantization + TQ+ per-coordinate calibration + length-renormalized unbiased scoring). Implements ADR-194."
 
 [[bin]]
 name = "turbovec-demo"
diff --git a/crates/ruvector-turbovec/src/index.rs b/crates/ruvector-turbovec/src/index.rs
@@ -5,7 +5,7 @@
 //! 1. `norm = ‖x‖`, `û = x / norm`         — strip & store length (§T1)
 //! 2. `r = P · û`  (randomized Hadamard)    — reuse `ruvector_rabitq` (§T1)
 //! 3. `z_i = (r_i − shift_i)/scale_i`        — TQ+ calibration (§T3)
-//! 4. `q_i = argmin |z_i − centroid|`        — Lloyd–Max 2/4-bit SQ (§T2)
+//! 4. `q_i = argmin |z_i − centroid|`        — Lloyd–Max 2/3/4-bit SQ (§T2)
 //! 5. `c_x = ⟨r, r̂⟩ / ⟨r̂, r̂⟩`              — per-vector unbiased scale (§T4)
 //!
 //! Scoring a query `q` (orthogonal `P` preserves inner products, so
diff --git a/crates/ruvector-turbovec/src/main.rs b/crates/ruvector-turbovec/src/main.rs
@@ -78,7 +78,12 @@ fn main() {
 
     println!("\n=== TurboVec (ADR-194) proof — n={n}, dim={dim}, k={k} ===\n");
     println!("[1] Compression + recall vs exact brute-force L2 + estimator bias");
-    for bw in [BitWidth::One, BitWidth::Two, BitWidth::Four] {
+    for bw in [
+        BitWidth::One,
+        BitWidth::Two,
+        BitWidth::Three,
+        BitWidth::Four,
+    ] {
         recall_and_bias(bw, &data, &queries, dim, k);
     }
 
diff --git a/crates/ruvector-turbovec/src/quantize.rs b/crates/ruvector-turbovec/src/quantize.rs
@@ -13,14 +13,17 @@
 //! coordinate.
 
 /// Supported quantization widths. `One` is included as a correctness/recall
-/// baseline against `ruvector-rabitq`'s 1-bit path; `Two`/`Four` are the
-/// production targets of ADR-194.
+/// baseline against `ruvector-rabitq`'s 1-bit path; `Two`/`Three`/`Four` are
+/// the production targets of ADR-194 (`Three` fills the 2↔4-bit recall gap).
 #[derive(Clone, Copy, Debug, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
 pub enum BitWidth {
     /// 1 bit / coord (2 levels).
     One,
     /// 2 bits / coord (4 levels).
     Two,
+    /// 3 bits / coord (8 levels). Fills the recall gap between 2- and 4-bit;
+    /// near the paper's ~2.5–3.5 bpc quality-neutral sweet spot (ADR-194 §D3).
+    Three,
     /// 4 bits / coord (16 levels).
     Four,
 }
@@ -32,6 +35,7 @@ impl BitWidth {
         match self {
             BitWidth::One => 1,
             BitWidth::Two => 2,
+            BitWidth::Three => 3,
             BitWidth::Four => 4,
         }
     }
@@ -50,6 +54,10 @@ impl BitWidth {
             // ±sqrt(2/π)
             BitWidth::One => &[-0.797_884_6, 0.797_884_6],
             BitWidth::Two => &[-1.510_4, -0.452_8, 0.452_8, 1.510_4],
+            // 8-level Max (1960) optimal N(0,1) reconstruction levels.
+            BitWidth::Three => &[
+                -2.152_0, -1.344_0, -0.756_0, -0.245_1, 0.245_1, 0.756_0, 1.344_0, 2.152_0,
+            ],
             BitWidth::Four => &[
                 -2.732_6, -2.069_0, -1.618_0, -1.256_2, -0.942_4, -0.656_8, -0.388_1, -0.128_4,
                 0.128_4, 0.388_1, 0.656_8, 0.942_4, 1.256_2, 1.618_0, 2.069_0, 2.732_6,
@@ -147,7 +155,12 @@ mod tests {
 
     #[test]
     fn centroids_are_sorted_and_sized() {
-        for bw in [BitWidth::One, BitWidth::Two, BitWidth::Four] {
+        for bw in [
+            BitWidth::One,
+            BitWidth::Two,
+            BitWidth::Three,
+            BitWidth::Four,
+        ] {
             let c = bw.centroids();
             assert_eq!(c.len(), bw.levels());
             assert!(c.windows(2).all(|w| w[0] < w[1]), "{bw:?} not ascending");
@@ -156,7 +169,12 @@ mod tests {
 
     #[test]
     fn pack_unpack_roundtrip_all_widths() {
-        for bw in [BitWidth::One, BitWidth::Two, BitWidth::Four] {
+        for bw in [
+            BitWidth::One,
+            BitWidth::Two,
+            BitWidth::Three,
+            BitWidth::Four,
+        ] {
             let dim = 37; // deliberately not byte-aligned
             let codes: Vec<u8> = (0..dim).map(|i| (i % bw.levels()) as u8).collect();
             let packed = pack(&codes, bw);
diff --git a/docs/adr/ADR-194-ruvector-turbovec-fastscan-index.md b/docs/adr/ADR-194-ruvector-turbovec-fastscan-index.md
@@ -1,6 +1,6 @@
 ---
 adr: 194
-title: "ruvector-turbovec — Multi-bit TurboQuant FastScan ANN Index (2/4-bit SQ + TQ+ calibration + nibble-LUT SIMD)"
+title: "ruvector-turbovec — Multi-bit TurboQuant FastScan ANN Index (2/3/4-bit SQ + TQ+ calibration + nibble-LUT SIMD)"
 status: accepted
 date: 2026-05-29
 authors: [oshaal, claude-flow]
@@ -17,16 +17,18 @@ tags: [quantization, ann, vector-search, turboquant, fastscan, simd, lloyd-max,
 > ruvector crate that reuses our existing primitives. The TurboQuant *algorithm*
 > is already partially present in this repo (see §"What already exists"); the
 > contribution here is a **multi-bit scalar-quantized ANN search index** with a
-> FastScan SIMD kernel, which we do **not** currently have. Benchmark claims
-> below are **targets to be validated**, not measured results.
+> FastScan SIMD kernel, which we do **not** currently have. The *recall /
+> compression / bias* figures in "Validation" are **measured** (reproducible via
+> the demo); the *competitive* claims vs FAISS/Milvus remain **targets to be
+> validated** and are attributed to the upstream reference project where cited.
 
 ## Status
 
 **Accepted (M1 implemented).** The scalar reference milestone (M1) is
-implemented as `crates/ruvector-turbovec`: rotation reuse + Lloyd–Max 2/4-bit SQ
+implemented as `crates/ruvector-turbovec`: rotation reuse + Lloyd–Max 2/3/4-bit SQ
 + TQ+ calibration + length-renormalized unbiased scoring + `IdMapIndex`
 (O(1) delete, filtered search). Build is green
-(`cargo build --release -p ruvector-turbovec`); 12 unit tests + 1 doc-test pass;
+(`cargo build --release -p ruvector-turbovec`); 16 unit tests + 1 doc-test pass;
 clippy clean. M2–M4 (FastScan SIMD kernel, AVX-512, dispatcher registration)
 remain future work. Measured proof below.
 
@@ -40,6 +42,7 @@ exact brute-force L2:
 |-------|-----------|----------------------|-------------|------------------|
 | 1-bit | 0.308 | 48 | 25.6× | +0.0005 |
 | 2-bit | 0.561 | 80 | 14.2× | +0.0001 |
+| 3-bit | 0.767 | 112 | 9.8× | −0.0000 |
 | **4-bit** | **0.879** | **144** | **7.5×** | **−0.0000** |
 
 - **Recall rises monotonically with bit-width** — exactly the 2–4-bit regime the
@@ -87,7 +90,7 @@ ahead of FAISS `IndexPQ` at 4-bit, and FastScan-class scan throughput on ARM —
 all with online ingest and no training phase. **Those are the external project's
 numbers, not this crate's.** This crate's own *measured* results are the
 uniform-random worst-case table under "Validation" above (recall@10 of
-0.308 / 0.561 / 0.879 at 1/2/4-bit); broader competitive benchmarks are listed
+0.308 / 0.561 / 0.767 / 0.879 at 1/2/3/4-bit); broader competitive benchmarks are listed
 as targets-to-validate in "Acceptance criteria" and the SIMD-kernel milestones.
 
 [RyanCodrai/turbovec]: https://github.com/RyanCodrai/turbovec
@@ -131,12 +134,12 @@ coordinate is ~Beta-distributed → N(0, 1/d), making **per-coordinate scalar
 quantization optimal without a codebook**. We import this type rather than
 reimplement the FWHT.
 
-### T2 — Lloyd–Max scalar quantization (2-bit / 4-bit)
-Precompute MSE-optimal bucket boundaries for the canonical N(0,1/d) marginal at
-`bit_width ∈ {2, 4}` (4 and 16 buckets). Coordinates become 2-bit (0–3) or
-4-bit (0–15) integers. Boundaries are **constants of the distribution**, not of
-the data → zero training. (ruvllm's codec already has the MSE-quantizer math to
-borrow from.)
+### T2 — Lloyd–Max scalar quantization (2-bit / 3-bit / 4-bit)
+Precompute MSE-optimal bucket boundaries for the canonical N(0,1) marginal at
+`bit_width ∈ {2, 3, 4}` (4, 8, and 16 buckets). Coordinates become 2-bit (0–3),
+3-bit (0–7), or 4-bit (0–15) integers. Boundaries are **constants of the
+distribution**, not of the data → zero training. (ruvllm's codec already has the
+MSE-quantizer math to borrow from.)
 
 ### T3 — Per-coordinate calibration (TQ+)
 During the *first* `add()` batch, fit two scalars per coordinate
@@ -185,7 +188,7 @@ determinism contract are.
 | `AnnIndex` trait | `ruvector-rabitq::index` | **implement** |
 | `VectorKernel` / `KernelCaps` | `ruvector-rabitq::kernel` | **implement** |
 | MSE/Lloyd–Max quantizer math | `ruvllm::quantize` | **borrow/extract** |
-| Lloyd–Max boundary tables (2/4-bit) | TurboQuant constants | **build (new)** |
+| Lloyd–Max boundary tables (2/3/4-bit) | TurboQuant constants | **build (new)** |
 | TQ+ per-coordinate calibration | — | **build (new)** |
 | FastScan nibble-LUT SIMD kernel | — | **build (new)** |
 | 32-block SoA layout + filtered scan | — | **build (new)** |
@@ -260,7 +263,7 @@ rather than ad hoc. None of these are bugs in M1 — they are scope boundaries.
 |---|--------------|-----------------|------|
 | D1 | **Provably-unbiased** inner product via a **two-stage** estimator: MSE quantizer + **1-bit QJL on the residual** `r = x − x̂_mse`, score `⟨y, x̂_mse + x̂_qjl⟩`, unbiased by construction with a variance bound. | A single per-vector scalar `c_x = ⟨r,r̂⟩/⟨r̂,r̂⟩` (least-squares magnitude match). *Empirically* near-unbiased (mean cos-bias ≈ 0 on uniform data); **no theoretical guarantee**. Cheaper (no extra residual bits). | **M5 (new):** add the optional QJL-residual stage as a recall/accuracy upgrade path when `c_x` proves insufficient on clustered data. |
 | D2 | Per-coordinate quantizer is **Max-Lloyd-optimal for the exact Beta marginal** `f(x) ∝ (1−x²)^((d−3)/2)`, with tables precomputed **per (bit-width, dimension)**. | Hardcoded Lloyd–Max tables for the **N(0,1) limit** of that Beta + an empirical per-coordinate `shift/scale` (TQ+) patch. Exact only as `d → ∞`; approximate at low/medium `d`. (TQ+ itself is *not* in the paper.) | **M6 (new):** generate d-aware Beta-optimal codebooks offline; keep the N(0,1)+calibration path as the default fast option. |
-| D3 | Highlights **~2.5 and ~3.5 bits/channel** as the quality-neutral operating points. | Ships **1 / 2 / 4-bit** only; a visible recall cliff sits between 2-bit (0.56) and 4-bit (0.88). | **M2 stretch:** add a **3-bit** width (one centroid table) to fill the cliff. |
+| D3 | Highlights **~2.5 and ~3.5 bits/channel** as the quality-neutral operating points. | ✅ **Now ships 1 / 2 / 3 / 4-bit.** The added 3-bit width fills the old 2↔4-bit cliff: recall@10 **0.767** at **9.8×** compression (112 B/vec), measured. | Done in M1. Non-integer effective bit-widths (2.5/3.5 bpc) remain future work, achievable via D1's QJL residual or mixed-width coding. |
 | D4 | Closed-form distortion bounds: `D_mse ≤ (√3·π/2)·4^(−b)` (≈2.7× the info-theoretic floor) and `D_prod ≤ (√3·π²·‖y‖²/d)·4^(−b)`. | Tests assert only `recall > 0.5`. | **Test upgrade:** assert measured MSE/IP distortion stays **under the paper's bound** — a theory-grounded oracle stronger than a recall threshold. |
 | D5 | Bounds estimator **variance** (useful for ranking confidence / early termination). | Not surfaced. | Defer; revisit if IVF/rerank composition (ADR-193) needs confidence intervals. |
 
@@ -290,12 +293,12 @@ kernel (M2–M4) is a FAISS-lineage engineering layer, *not* part of the paper.
 Implement on branch `claude/ruvector-turbovec-optimization-FhaDh` (this ADR),
 crate work in a follow-up PR. Milestones:
 
-1. **M1 — Scalar reference (no SIMD).** Rotation reuse + Lloyd–Max 2/4-bit +
+1. **M1 — Scalar reference (no SIMD).** Rotation reuse + Lloyd–Max 2/3/4-bit +
    TQ+ + length-renormalized scoring + `AnnIndex`. Recall + memory parity test
    vs a brute f32 baseline on SIFT1M / a synthetic OpenAI-d1536 set. ✅ *done.*
 2. **M2 — FastScan SIMD kernel.** AVX2 + NEON nibble-LUT, fuzzed bit-identical
    to M1's scalar scorer; `VectorKernel` impl; criterion bench in
-   `benches/turbovec_bench.rs`. *Stretch:* add a **3-bit** width (D3).
+   `benches/turbovec_bench.rs`. (3-bit width already shipped in M1, see D3.)
 3. **M3 — IdMap + filtered search + persistence.** O(1) delete, block-level
    allowlist, `.tv` save/load round-trip test.
 4. **M4 — AVX-512BW kernel + rulake dispatcher registration.**