Commit c39d57e

test(turbovec): distortion-bound oracle (ADR-194 D4)
Add quantizer_mse_within_paper_bound: draw 400k N(0,1) samples (Box–Muller, no new deps), quantize each via the real quantize_coord path, and assert for every bit-width that the per-coordinate MSE (a) stays under TurboQuant's distortion bound D_mse ≤ (√3·π/2)·4^(−b) (arXiv:2504.19874) and (b) is at most 5% above the Max (1960) Lloyd–Max optimum. A corrupted centroid level trips this oracle far more readily than the existing recall > 0.5 threshold. Marks D4 done in ADR-194 and updates the test count to 17. The full-pipeline inner-product bound D_prod remains future work (tracked with D5).
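To see how the two assertions relate, the bound can be evaluated per bit-width and set beside the optimum values hard-coded in the test. This is only a worked-arithmetic sketch: the "optimum" column is copied from the test's `optimal` closure, and the rounding to four decimals is mine.

```latex
\[
D_{\mathrm{mse}}(b) \;\le\; \frac{\sqrt{3}\,\pi}{2}\,4^{-b} \;\approx\; 2.7207 \cdot 4^{-b}
\]
\[
\begin{array}{c|c|c|c}
b & \text{paper bound} & \text{optimum} & 1.05 \times \text{optimum} \\ \hline
1 & 0.6802 & 0.3634 & 0.3816 \\
2 & 0.1700 & 0.1175 & 0.1234 \\
3 & 0.0425 & 0.0345 & 0.0362 \\
4 & 0.0106 & 0.0095 & 0.0100 \\
\end{array}
\]
```

For every width the 5% margin sits below the paper bound, so on these numbers the tightness assertion is the binding one and the paper bound acts as a looser theoretical backstop.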
1 parent 2935381 commit c39d57e

2 files changed

Lines changed: 68 additions & 5 deletions

crates/ruvector-turbovec/src/quantize.rs

Lines changed: 62 additions & 0 deletions
@@ -195,4 +195,66 @@ mod tests {
         assert_eq!(quantize_coord(c, 99.0) as usize, c.len() - 1);
         assert_eq!(quantize_coord(c, -99.0) as usize, 0);
     }
+
+    /// Standard normal via Box–Muller — keeps the test dependency-free
+    /// (no `rand_distr`) while exercising the real `quantize_coord` path.
+    fn std_normal(rng: &mut impl rand::Rng) -> f32 {
+        let u1: f32 = rng.gen::<f32>().max(1e-9);
+        let u2: f32 = rng.gen::<f32>();
+        (-2.0 * u1.ln()).sqrt() * (std::f32::consts::TAU * u2).cos()
+    }
+
+    /// D4 (ADR-194): the per-coordinate quantization MSE on the canonical
+    /// `N(0,1)` marginal must stay under TurboQuant's distortion bound
+    /// `D_mse ≤ (√3·π/2)·4^(−b)` (arXiv:2504.19874), and within a small
+    /// margin of the *known* Max-1960 Lloyd–Max optimum. This is a
+    /// theory-grounded oracle: it pins the centroid tables far more tightly
+    /// than a recall threshold — corrupt a single level and it trips.
+    #[test]
+    fn quantizer_mse_within_paper_bound() {
+        use rand::SeedableRng;
+        let mut rng = rand::rngs::StdRng::seed_from_u64(20_260_530);
+        let n = 400_000usize;
+
+        // Max (1960) optimal MSE for the unit-variance Gaussian, per bit-width.
+        let optimal = |bw: BitWidth| -> f64 {
+            match bw {
+                BitWidth::One => 0.363_4,
+                BitWidth::Two => 0.117_5,
+                BitWidth::Three => 0.034_5,
+                BitWidth::Four => 0.009_5,
+            }
+        };
+
+        for bw in [
+            BitWidth::One,
+            BitWidth::Two,
+            BitWidth::Three,
+            BitWidth::Four,
+        ] {
+            let c = bw.centroids();
+            let mut sse = 0.0f64;
+            for _ in 0..n {
+                let z = std_normal(&mut rng);
+                let code = quantize_coord(c, z) as usize;
+                let d = (z - c[code]) as f64;
+                sse += d * d;
+            }
+            let mse = sse / n as f64;
+
+            let bound = (3f64.sqrt() * std::f64::consts::PI / 2.0) * 4f64.powi(-(bw.bits() as i32));
+            assert!(
+                mse <= bound,
+                "{bw:?}: MSE {mse:.4} exceeds paper bound {bound:.4}"
+            );
+            // Tightness: must be within 5% of the Lloyd–Max optimum (sampling
+            // noise at n=400k is far below this), catching table corruption
+            // that might still slip under the loose paper bound.
+            assert!(
+                mse <= optimal(bw) * 1.05,
+                "{bw:?}: MSE {mse:.4} not near optimal {:.4}",
+                optimal(bw)
+            );
+        }
+    }
 }
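The comment in the test notes that a corrupted table can slip under the loose paper bound while still failing the tightness check. A minimal std-only sketch of that effect follows. It is not the crate's code: `nearest`, `gaussian_mse`, and `paper_bound` are hypothetical helpers, the 1-bit levels ±√(2/π) are the textbook Lloyd–Max values rather than the crate's actual table, and the expectation is computed by midpoint-rule integration instead of the Monte Carlo sampling the real test uses.

```rust
use std::f64::consts::PI;

/// Index of the nearest centroid (hypothetical stand-in for `quantize_coord`).
fn nearest(centroids: &[f64], z: f64) -> usize {
    let mut best = 0;
    for (i, &c) in centroids.iter().enumerate() {
        if (z - c).abs() < (z - centroids[best]).abs() {
            best = i;
        }
    }
    best
}

/// E[(z - Q(z))^2] for z ~ N(0,1), via midpoint-rule integration on [-8, 8].
/// Deterministic substitute for the sampled estimate in the real test.
fn gaussian_mse(centroids: &[f64]) -> f64 {
    let (lo, hi, steps) = (-8.0_f64, 8.0_f64, 160_000_usize);
    let h = (hi - lo) / steps as f64;
    let mut acc = 0.0;
    for i in 0..steps {
        let z = lo + (i as f64 + 0.5) * h;
        let pdf = (-0.5 * z * z).exp() / (2.0 * PI).sqrt();
        let d = z - centroids[nearest(centroids, z)];
        acc += d * d * pdf * h;
    }
    acc
}

/// TurboQuant's D_mse bound, (sqrt(3) * pi / 2) * 4^(-bits).
fn paper_bound(bits: i32) -> f64 {
    (3f64.sqrt() * PI / 2.0) * 4f64.powi(-bits)
}

/// Returns (intact MSE, corrupted MSE) for an assumed 1-bit table.
fn one_bit_demo() -> (f64, f64) {
    let a = (2.0 / PI).sqrt(); // textbook 1-bit Lloyd-Max level (assumption)
    let intact = gaussian_mse(&[-a, a]);
    let corrupted = gaussian_mse(&[-a, 1.5]); // one level deliberately corrupted
    (intact, corrupted)
}
```

Calling `one_bit_demo()` from a `main` and comparing both values against `paper_bound(1)` and `0.3634 * 1.05` shows the pattern: under these assumptions the corrupted table stays below the paper bound but exceeds the 5% margin. The margin also limits sensitivity: in this model, moving the same level only to 1.0 raises the 1-bit MSE by less than 5%, so a corruption that small would pass.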

docs/adr/ADR-194-ruvector-turbovec-fastscan-index.md

Lines changed: 6 additions & 5 deletions
@@ -28,7 +28,7 @@ tags: [quantization, ann, vector-search, turboquant, fastscan, simd, lloyd-max,
 implemented as `crates/ruvector-turbovec`: rotation reuse + Lloyd–Max 2/3/4-bit SQ
 + TQ+ calibration + length-renormalized unbiased scoring + `IdMapIndex`
 (O(1) delete, filtered search). Build is green
-(`cargo build --release -p ruvector-turbovec`); 16 unit tests + 1 doc-test pass;
+(`cargo build --release -p ruvector-turbovec`); 17 unit tests + 1 doc-test pass;
 clippy clean. M2–M4 (FastScan SIMD kernel, AVX-512, dispatcher registration)
 remain future work. Measured proof below.

@@ -264,7 +264,7 @@ rather than ad hoc. None of these are bugs in M1 — they are scope boundaries.
 | D1 | **Provably-unbiased** inner product via a **two-stage** estimator: MSE quantizer + **1-bit QJL on the residual** `r = x − x̂_mse`, score `⟨y, x̂_mse + x̂_qjl⟩`, unbiased by construction with a variance bound. | A single per-vector scalar `c_x = ⟨r,r̂⟩/⟨r̂,r̂⟩` (least-squares magnitude match). *Empirically* near-unbiased (mean cos-bias ≈ 0 on uniform data); **no theoretical guarantee**. Cheaper (no extra residual bits). | **M5 (new):** add the optional QJL-residual stage as a recall/accuracy upgrade path when `c_x` proves insufficient on clustered data. |
 | D2 | Per-coordinate quantizer is **Max-Lloyd-optimal for the exact Beta marginal** `f(x) ∝ (1−x²)^((d−3)/2)`, with tables precomputed **per (bit-width, dimension)**. | Hardcoded Lloyd–Max tables for the **N(0,1) limit** of that Beta + an empirical per-coordinate `shift/scale` (TQ+) patch. Exact only as `d → ∞`; approximate at low/medium `d`. (TQ+ itself is *not* in the paper.) | **M6 (new):** generate d-aware Beta-optimal codebooks offline; keep the N(0,1)+calibration path as the default fast option. |
 | D3 | Highlights **~2.5 and ~3.5 bits/channel** as the quality-neutral operating points. |**Now ships 1 / 2 / 3 / 4-bit.** The added 3-bit width fills the old 2↔4-bit cliff: recall@10 **0.767** at **9.8×** compression (112 B/vec), measured. | Done in M1. Non-integer effective bit-widths (2.5/3.5 bpc) remain future work, achievable via D1's QJL residual or mixed-width coding. |
-| D4 | Closed-form distortion bounds: `D_mse ≤ (√3·π/2)·4^(−b)` (≈2.7× the info-theoretic floor) and `D_prod ≤ (√3·π²·‖y‖²/d)·4^(−b)`. | Tests assert only `recall > 0.5`. | **Test upgrade:** assert measured MSE/IP distortion stays **under the paper's bound** — a theory-grounded oracle stronger than a recall threshold. |
+| D4 | Closed-form distortion bounds: `D_mse ≤ (√3·π/2)·4^(−b)` (≈2.7× the info-theoretic floor) and `D_prod ≤ (√3·π²·‖y‖²/d)·4^(−b)`. | **Done.** `quantizer_mse_within_paper_bound` measures per-coordinate MSE on the `N(0,1)` marginal for every width and asserts it stays under the `D_mse` bound *and* within 5% of the Max-1960 optimum — a theory-grounded oracle that pins the centroid tables. | `D_prod` (full-pipeline IP) follow-up; deferred with D5. |
 | D5 | Bounds estimator **variance** (useful for ranking confidence / early termination). | Not surfaced. | Defer; revisit if IVF/rerank composition (ADR-193) needs confidence intervals. |

 **Not divergences (M1 already matches the paper):** L2-norm stored in f32 and
@@ -318,9 +318,10 @@ crate work in a follow-up PR. Milestones:
   oracle), enforced in CI.
 - `cargo build --release -p ruvector-turbovec` green; all unit + property tests
   pass; no `clippy` regressions.
-- **Measured MSE / inner-product distortion within the paper's bounds (D4):**
-  `D_mse ≤ (√3·π/2)·4^(−b)` and `D_prod ≤ (√3·π²·‖y‖²/d)·4^(−b)` — a
-  theory-grounded test oracle stronger than the current `recall > 0.5` threshold.
+- **Measured MSE distortion within the paper's bound (D4):** ✅ enforced by
+  `quantizer_mse_within_paper_bound` — per-coordinate MSE on `N(0,1)` under
+  `D_mse ≤ (√3·π/2)·4^(−b)` and within 5% of the Max-1960 optimum, for every
+  width. (`D_prod ≤ (√3·π²·‖y‖²/d)·4^(−b)` full-pipeline check still to come.)

 ## References
