Add DSP (Dynamic Superblock Pruning) sparse index with AVX-512 SIMD optimizations #1471
lyang24 wants to merge 9 commits into zilliztech:sparse_dsp_dev
lyang24
commented
Feb 23, 2026
- DSP index with u8/u16 integer pruning and a two-level block hierarchy (shrinks the forward index to 1/4 of its size at a slight cost in accuracy)
- AVX-512 SIMD kernels for gather/scatter inner-product (IP) accumulation, block upper-bound (UB) scan, and seek (the original paper uses AVX2)
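For readers unfamiliar with the gather/scatter accumulation pattern named above, here is a minimal scalar reference of what such a kernel computes. It is only a sketch: the `Posting` layout and the `AccumulateIp` name are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One posting of a query term: the document it occurs in and its weight.
// Hypothetical layout for illustration; the PR's actual layout may differ.
struct Posting {
    std::uint32_t doc_id;
    float value;
};

// Scalar reference: acc[doc_id] += query_weight * value for every posting.
// A vectorized version would handle a batch of postings per step by
// gathering the accumulators at the batch's doc ids, applying a multiply-add,
// and scattering the results back. Doc ids inside one posting list are
// assumed to be unique, so lanes in a batch never touch the same slot.
inline void AccumulateIp(const std::vector<Posting>& postings, float query_weight,
                         std::vector<float>& acc) {
    for (const Posting& p : postings) {
        acc[p.doc_id] += query_weight * p.value;
    }
}
```

Calling `AccumulateIp` once per query term leaves the partial inner product of every touched document in `acc`.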
Since the index format has changed, I think we need a new index type called |
3a2d516 to edc6413
Signed-off-by: lyang24 <lanqingy93@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fixed
src/index/sparse/sparse_dsp_index.h
Outdated
    for (int h = 0; h < 4; ++h) {
        if (!kth_heaps[h].empty()) {
            float kth_f = kth_heaps[h].top();
            bm.kth[h] = static_cast<uint8_t>(std::min(255.0f, std::floor(kth_f * inv_max_score)));
When a filter is used, the documents that get filtered out may well be the top-scoring ones, so the initial threshold can end up higher than the actual k-th largest score after filtering.
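The concern can be illustrated with a small sketch. It assumes a hypothetical helper `KthLargestScore` and a filter expressed as a `std::vector<bool>`; neither is from the PR. It only shows why a k-th score computed over all documents is not a safe seed once a filter removes some of them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not from the PR): k-th largest score among the
// documents that pass the filter, or 0.0f when fewer than k survive.
// filtered_out[i] == true means document i is excluded.
inline float KthLargestScore(const std::vector<float>& scores,
                             const std::vector<bool>& filtered_out,
                             std::size_t k) {
    std::vector<float> kept;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (!filtered_out[i]) {
            kept.push_back(scores[i]);
        }
    }
    if (k == 0 || kept.size() < k) {
        return 0.0f;  // no safe non-zero seed exists
    }
    // Partition so that position k-1 holds the k-th largest kept score.
    const auto kth = kept.begin() + static_cast<std::ptrdiff_t>(k - 1);
    std::nth_element(kept.begin(), kth, kept.end(), std::greater<float>());
    return *kth;
}
```

With scores `{9, 8, 7, 3, 2, 1}` and `k = 2`, the unfiltered seed is 8. If the two best documents are filtered out, the true second-best score is 3, so a seed of 8 would prune the real top-2 results.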
    // ========================================================================
    // Forward index (flat layout for cache-friendly scoring)
    // ========================================================================
These vectors should support mmap.
More specifically, their contents could be mapped directly from the serialized index file.
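One way to read this suggestion is sketched below, assuming a POSIX system. The `MappedFile` class and its `ViewAt` method are hypothetical names for illustration, not knowhere APIs. The idea is that a flat array stored in the index file can be read through a pointer into the mapping instead of being copied into a `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical read-only mapping of a whole file (illustration only).
class MappedFile {
 public:
    explicit MappedFile(const char* path) {
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) {
            return;
        }
        struct stat st {};
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            void* p = ::mmap(nullptr, static_cast<std::size_t>(st.st_size),
                             PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                base_ = static_cast<const std::uint8_t*>(p);
                size_ = static_cast<std::size_t>(st.st_size);
            }
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MappedFile() {
        if (base_ != nullptr) {
            ::munmap(const_cast<std::uint8_t*>(base_), size_);
        }
    }
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    std::size_t size() const { return size_; }

    // Pointer to `count` elements of T at `byte_offset`, or nullptr when the
    // range is out of bounds or the address is misaligned for T.
    template <typename T>
    const T* ViewAt(std::size_t byte_offset, std::size_t count) const {
        if (base_ == nullptr || byte_offset > size_ ||
            count > (size_ - byte_offset) / sizeof(T)) {
            return nullptr;
        }
        const std::uint8_t* p = base_ + byte_offset;
        if (reinterpret_cast<std::uintptr_t>(p) % alignof(T) != 0) {
            return nullptr;
        }
        return reinterpret_cast<const T*>(p);
    }

 private:
    const std::uint8_t* base_ = nullptr;
    std::size_t size_ = 0;
};
```

A forward-index member could then hold either an owned vector (after a build) or such a view (after a load), provided the serialized section offsets are aligned for the element type.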
        }
        default:
            // skip unknown sections
            RETURN_IF_ERROR(ReadCustomSection(reader, section_header));
It would be better to add new, DSP-specific sections.
    #endif

    // Write custom section data (e.g., DSP metadata)
    WriteCustomSections(writer);
Should SPARSE_DSP depend on SPARSE_INVERTED_INDEX?
Maybe we can skip serializing the inverted index section?
    KNOWHERE_SIMPLE_REGISTER_SPARSE_FLOAT_GLOBAL(SPARSE_WAND_CC_DEPRECATED, SparseInvertedIndexNodeCC,
                                                 knowhere::feature::MMAP,
                                                 /*use_wand=*/true)
    KNOWHERE_SIMPLE_REGISTER_SPARSE_FLOAT_GLOBAL(SPARSE_DSP, SparseDspIndexNode, knowhere::feature::MMAP)
Cardinal does not support this index type. To avoid compatibility issues, remove it for now.
                                                 /*use_wand=*/false)
    KNOWHERE_SIMPLE_REGISTER_SPARSE_FLOAT_GLOBAL(SPARSE_WAND_CC, SparseInvertedIndexNodeCC, knowhere::feature::MMAP,
                                                 /*use_wand=*/true)
    KNOWHERE_SIMPLE_REGISTER_SPARSE_FLOAT_GLOBAL(SPARSE_DSP, SparseDspIndexNode, knowhere::feature::MMAP)
SPARSE_DSP_CC should also be supported for concurrent scenarios.
        }
        if (block_buf.empty())
            continue;
        std::sort(block_buf.begin(), block_buf.end(), [](const BlockEntry& a, const BlockEntry& b) {
Does this duplicate the sort in Pass 1?
src/index/sparse/sparse_dsp_index.h
Outdated
            }
        }
        // Add top-2 non-surviving superblocks as safety net
        for (int i = 0; i < 2; ++i) {
- Remove unused dim_max_score_ratio from DSP config
- Apply dsp_eta at subblock BoundSum level (paper Section 3)
- Add dsp_gamma configurable top-γ superblock safety net
- Legacy top-2 fallback preserved when gamma=0
- Add benchmark_sparse_dsp with param sweep, latency percentiles, coverage metrics (failed queries, avg fill)
- Strengthen DSP unit tests with comparative recall assertions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Wire print_failed_diag after baseline and DSP safe runs
- Print diagnostics for any config with failed queries
- Finer eta sweep: 0.98, 0.95, 0.92, 0.90, 0.88, 0.85, 0.82, 0.80
- Broader gamma sweep under mu=0.3: 50, 100, 250, 500, 1000

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 11 persistent DSP failures (even at mu=1, eta=1) are likely caused by kth-score initialization setting a nonzero threshold before any documents are scored, which prunes superblocks and blocks prematurely.

Add a dsp_kth_init config (default true) so this heuristic can be disabled for diagnosis. When false, the threshold starts at 0 and only rises after the heap fills (truly safe mode).

Also clean up unused test lambdas from the previous refactor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After Fix 2 moved subblock pruning to u16_block_threshold, u16_threshold was only written but never read. Superblock pruning uses float_threshold directly via mu_threshold/eta_threshold.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- dsp_kth_alpha: scale factor for kth threshold seed (0.0-1.0). Allows separating DSP-T (threshold seeding) from DSP-H (hierarchy)
- Alpha sweep: 0.25, 0.50, 0.75 to find calibration sweet spot
- Rename "DSP safe" -> "DSP default" (kth-init ON is not truly safe)
- "DSP-H exact" for hierarchy-only mode (kth-init OFF)
- Trimmed param sweep for faster turnaround

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
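How the two seeding knobs described in these commits might combine is sketched below. `SeedThreshold` is a hypothetical function written for illustration, and clamping alpha to the 0.0-1.0 range is an assumption based on the range stated in the commit message.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sketch (not the PR's code): initial pruning threshold.
// kth_estimate is the precomputed k-th score estimate, kth_init mirrors
// dsp_kth_init, and kth_alpha mirrors dsp_kth_alpha.
inline float SeedThreshold(float kth_estimate, bool kth_init, float kth_alpha) {
    if (!kth_init) {
        return 0.0f;  // threshold only rises once the result heap fills
    }
    // Assumption: alpha is clamped to the documented 0.0-1.0 range.
    const float alpha = std::min(1.0f, std::max(0.0f, kth_alpha));
    return alpha * kth_estimate;
}
```

With seeding disabled the search starts from a threshold of 0, which matches the hierarchy-only mode the commit calls "DSP-H exact". A smaller alpha lowers the seed and therefore prunes less at the start of the search.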
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mode-driven superblock selection replaces the previous mixed logic:
- legacy (0): dual-threshold + top-2/gamma backstop (unchanged)
- dsp (1): ub>theta/mu || asc>theta/eta, no backstop
- lsp0 (2): top-gamma from ub>=theta, no mu/asc gate
- lsp1 (3): lsp0 safe set + mu gate (ub>theta/mu)
- lsp2 (4): lsp1 + asc gate (ub>theta/mu || asc>theta/eta)

LSP modes with gamma<=0 fall back to legacy (documented in config). Legacy gamma backstop preserves strict ub>0 inequality. kth-init and kth-alpha remain orthogonal to mode selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
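Two of the modes listed in this commit message are sketched below as a rough illustration. The `Superblock` struct and both function names are hypothetical. Ranking the lsp0 candidates by `ub` is an assumption, since the message only says "top-gamma". The field names `ub` and `asc` follow the commit message, and `mu` and `eta` are assumed to be positive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-superblock summary; field names follow the commit message.
struct Superblock {
    float ub;   // score upper bound of the superblock
    float asc;  // the "asc" statistic named in the commit message
};

// dsp (1): keep the superblock when ub > theta/mu || asc > theta/eta.
inline bool PassesDspGate(const Superblock& sb, float theta, float mu, float eta) {
    return sb.ub > theta / mu || sb.asc > theta / eta;
}

// lsp0 (2): from the superblocks with ub >= theta, keep at most gamma.
// Assumption: "top" means largest ub. Returns indices into sbs.
inline std::vector<std::size_t> SelectLsp0(const std::vector<Superblock>& sbs,
                                           float theta, std::size_t gamma) {
    std::vector<std::size_t> picked;
    for (std::size_t i = 0; i < sbs.size(); ++i) {
        if (sbs[i].ub >= theta) {
            picked.push_back(i);
        }
    }
    if (picked.size() > gamma) {
        const auto mid = picked.begin() + static_cast<std::ptrdiff_t>(gamma);
        std::partial_sort(picked.begin(), mid, picked.end(),
                          [&sbs](std::size_t a, std::size_t b) { return sbs[a].ub > sbs[b].ub; });
        picked.resize(gamma);
    }
    return picked;
}
```

Because `theta / mu` grows as `mu` shrinks, a smaller `mu` makes the dsp gate stricter and prunes more superblocks.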
- Replace packed-u4 block-max with chunk-compressed variable bit-width (0..4) encoding: 256-entry chunks with per-chunk minimal bit width selection
- Fix AppendCustomSections size overcount for empty dimensions
- Add stride-specific AVX512 kernels for n=32/64 (fully unrolled, no loop counter)
- Add focused unit tests for bit-packing round-trips and chunk compression

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
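The chunk-compressed encoding mentioned in the first bullet can be pictured with the sketch below. It is a guess at the general technique, not the PR's on-disk format: the function names, the little-endian bit order, and the assumption that every input value is below 16 are all illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Smallest width in [0, 4] such that every value in the chunk fits.
// Inputs are assumed to be 4-bit values (< 16); width 0 means "all zeros".
inline unsigned MinBitWidth(const std::vector<std::uint8_t>& chunk) {
    std::uint8_t max_value = 0;
    for (std::uint8_t v : chunk) {
        max_value = std::max(max_value, v);
    }
    unsigned width = 0;
    while (width < 4 && (1u << width) <= max_value) {
        ++width;
    }
    return width;
}

// Pack each value into `width` bits, least significant bit first.
inline std::vector<std::uint8_t> PackChunk(const std::vector<std::uint8_t>& chunk,
                                           unsigned width) {
    std::vector<std::uint8_t> out((chunk.size() * width + 7) / 8, 0);
    std::size_t bit = 0;
    for (std::uint8_t v : chunk) {
        for (unsigned b = 0; b < width; ++b, ++bit) {
            if ((v >> b) & 1u) {
                out[bit / 8] |= static_cast<std::uint8_t>(1u << (bit % 8));
            }
        }
    }
    return out;
}

// Read back the i-th value of a chunk packed with PackChunk.
inline std::uint8_t UnpackAt(const std::vector<std::uint8_t>& packed, unsigned width,
                             std::size_t i) {
    std::uint8_t v = 0;
    std::size_t bit = i * width;
    for (unsigned b = 0; b < width; ++b, ++bit) {
        v |= static_cast<std::uint8_t>(((packed[bit / 8] >> (bit % 8)) & 1u) << b);
    }
    return v;
}
```

A 256-entry chunk whose values never exceed 3 packs into 64 bytes at width 2 instead of 128 bytes at a fixed 4 bits, and an all-zero chunk needs no payload at all.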