Enhance heuristic pruning to handle duplicate clusters by DIlkhush00 · Pull Request #282 · intel/ScalableVectorSearch

DIlkhush00 · 2026-03-04T21:12:04Z

This PR fixes #80. It adds a post-pruning step to both:

IterativePruneStrategy
ProgressivePruneStrategy

Approach: If a duplicate cluster is detected, the last (worst) slot in the result is replaced with the closest candidate from the pool that does not have the same distance.
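A minimal sketch of the post-pruning step described above, not the code in the PR. It assumes that a "duplicate cluster" means every selected neighbor sits at exactly the same distance, and that the candidate pool is sorted by ascending distance. `Candidate` and `fix_duplicate_cluster` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: a neighbor id and its distance to the node being pruned.
struct Candidate {
    std::uint32_t id;
    float distance;
};

// Assumes `pool` is sorted by ascending distance.
// Returns true if the last slot of `result` was replaced.
inline bool fix_duplicate_cluster(
    const std::vector<Candidate>& pool, std::vector<Candidate>& result
) {
    if (result.size() < 2) {
        return false;
    }
    // Assumption: a duplicate cluster is detected when all selected
    // neighbors share exactly the same distance (exact float comparison).
    const float anchor_dist = result.front().distance;
    for (const auto& r : result) {
        if (r.distance != anchor_dist) {
            return false;
        }
    }
    // The pool is sorted, so the first candidate with a different distance
    // is the closest one outside the cluster.
    for (const auto& c : pool) {
        if (c.distance != anchor_dist) {
            result.back() = c;
            return true;
        }
    }
    return false; // every candidate is part of the same cluster
}
```

The sketch leaves the result untouched when the pool contains nothing but equidistant candidates. The PR may handle that case differently.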

Signed-off-by: Dilkhush Purohit <dilkhushpurohit01@gmail.com>

ibhati

Thanks @DIlkhush00 for working on this issue. The solution looks good overall. I left one comment below, and I was also wondering if we could add a test that specifically triggers this case.
Additionally, could you generate a synthetic dataset with a larger number of vectors (e.g., 1M) and measure the build-time impact? It would also be helpful to benchmark GIST-1M with and without this change to confirm that:

Datasets without duplicates show no recall or build-time regression.
In scenarios where duplicates do occur, this logic does not significantly impact build time, and recall remains consistent.

Thanks again for your contribution!
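One possible way to build the synthetic dataset requested above is sketched here. It is an illustration only: `make_dataset_with_duplicates` is a hypothetical helper, the layout is assumed to be row-major `float`, and the group size is a free parameter rather than something specified in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Returns n vectors of dimension dim in row-major order.
// Vectors are emitted in consecutive groups of `copies` exact duplicates,
// so copies == 1 yields a dataset with no deliberate duplicates.
inline std::vector<float> make_dataset_with_duplicates(
    std::size_t n, std::size_t dim, std::size_t copies, unsigned seed = 42
) {
    assert(copies >= 1);
    std::vector<float> data(n * dim);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    for (std::size_t i = 0; i < n; ++i) {
        float* row = data.data() + i * dim;
        if (i % copies == 0) {
            // Start of a new group: draw a fresh random vector.
            for (std::size_t d = 0; d < dim; ++d) {
                row[d] = dist(rng);
            }
        } else {
            // Inside a group: copy the previous row verbatim.
            std::copy_n(row - dim, dim, row);
        }
    }
    return data;
}
```

Building once with `copies = 1` and once with a larger group size, on the same `n` and `dim`, would give the no-duplicate and duplicate-heavy cases to compare.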

ibhati · 2026-03-04T22:15:32Z

include/svs/index/vamana/prune.h

+                    in_result = true;
+                    break;
+                }
+            }


I’m confused why cid could already be in result here. We just checked above that this candidate's distance differs from anchor_dist, so I wouldn’t expect it to be present (as all the results have the same distance). Should this be an assert instead (i.e., this candidate must not already be in result)? Am I missing a scenario?

Ah, right... by logic any remaining candidate shouldn't already be in result. I'll change this to an assert and update both strategies accordingly.
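For illustration, the change agreed on here could look roughly like the sketch below. It assumes `result` is a plain vector of neighbor ids. `contains` and `replace_last_slot` are hypothetical names, not the identifiers used in prune.h.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: linear membership check over the selected ids.
inline bool contains(const std::vector<std::uint32_t>& result, std::uint32_t cid) {
    return std::find(result.begin(), result.end(), cid) != result.end();
}

// Replaces the last (worst) slot with `cid`. The membership check is an
// invariant rather than a runtime branch: a candidate whose distance differs
// from anchor_dist cannot already be in a result made of equidistant entries.
inline void replace_last_slot(std::vector<std::uint32_t>& result, std::uint32_t cid) {
    assert(!result.empty());
    assert(!contains(result, cid) && "replacement candidate must not already be in result");
    result.back() = cid;
}
```

Because `assert` compiles away when `NDEBUG` is defined, the check costs nothing in release builds.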

DIlkhush00 · 2026-03-07T21:23:21Z

Sounds good. I'll add a test that triggers this case and profile the build to ensure my changes don't introduce any significant regressions.

enhance heuristic pruning to handle duplicate clusters

d429b1d

Signed-off-by: Dilkhush Purohit <dilkhushpurohit01@gmail.com>

DIlkhush00 requested review from ahuber21 and ibhati as code owners March 4, 2026 21:12

ibhati requested changes Mar 4, 2026

Merge branch 'intel:main' into dilkhush00/duplicate-cluster-fix

a06937f

DIlkhush00 wants to merge 2 commits into intel:main from DIlkhush00:dilkhush00/duplicate-cluster-fix
