Enhance heuristic pruning to handle duplicate clusters #282
DIlkhush00 wants to merge 1 commit into intel:main
Signed-off-by: Dilkhush Purohit <dilkhushpurohit01@gmail.com>
ibhati left a comment
Thanks @DIlkhush00 for working on this issue. The solution looks good overall. I left one comment below, and I was also wondering if we could add a test that specifically triggers this case.
Additionally, could you generate a synthetic dataset with a larger number of vectors (e.g., 1M) and measure the build-time impact? It would also be helpful to benchmark GIST-1M with and without this change to confirm that:
- Datasets without duplicates show no recall or build-time regression.
- In scenarios where duplicates do occur, this logic does not significantly impact build time, and recall remains consistent.
Thanks again for your contribution!
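For the synthetic benchmark requested above, a minimal sketch of a generator with a controllable share of exact duplicates might look like the following. The function name `make_dataset_with_duplicates`, its parameters, and the row-major `std::vector<float>` layout are all assumptions made for illustration; none of them come from this PR or from the library's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not part of the PR): builds a row-major dataset of
// `num_vectors` x `dim` floats in which roughly `duplicate_fraction` of the
// vectors are exact copies of an earlier vector.
std::vector<float> make_dataset_with_duplicates(
    std::size_t num_vectors, std::size_t dim, double duplicate_fraction, std::uint64_t seed
) {
    assert(duplicate_fraction >= 0.0 && duplicate_fraction <= 1.0);
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<float> value(-1.0f, 1.0f);
    std::bernoulli_distribution make_duplicate(duplicate_fraction);

    std::vector<float> data(num_vectors * dim);
    for (std::size_t i = 0; i < num_vectors; ++i) {
        if (i > 0 && make_duplicate(rng)) {
            // Copy a uniformly chosen earlier vector to create an exact duplicate.
            std::uniform_int_distribution<std::size_t> pick(0, i - 1);
            const auto src = static_cast<std::ptrdiff_t>(pick(rng) * dim);
            const auto dst = static_cast<std::ptrdiff_t>(i * dim);
            std::copy_n(data.begin() + src, dim, data.begin() + dst);
        } else {
            // Otherwise fill the vector with fresh random values.
            for (std::size_t d = 0; d < dim; ++d) {
                data[i * dim + d] = value(rng);
            }
        }
    }
    return data;
}
```

Calling it with `num_vectors = 1'000'000` once with `duplicate_fraction = 0.0` and once with a nonzero fraction would give the two build-time scenarios described above. The fixed seed keeps the runs with and without the change comparable.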
            in_result = true;
            break;
        }
    }
I’m confused about why cid could already be in result here. We just checked above that this candidate's distance differs from anchor_dist, so I wouldn’t expect it to be present (all the results have the same distance). Should this be an assert instead (i.e., this candidate must not already be in result)? Am I missing a scenario?
This PR fixes #80. It adds a post-pruning step to both:
- IterativePruneStrategy
- ProgressivePruneStrategy

Approach: If a duplicate cluster is detected, the last (worst) slot in the result is replaced with the closest candidate from the pool that does not have the same distance.
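The approach can be sketched roughly as below. This is an illustration under stated assumptions, not the code in this PR: the `Neighbor` struct and the function `replace_worst_if_duplicate_cluster` are hypothetical, a "duplicate cluster" is taken to mean that every entry in the result has the same distance, and the pool is assumed to be sorted by ascending distance.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical neighbor record; the library's real types differ.
struct Neighbor {
    std::uint32_t id;
    float distance;
};

// Post-pruning step (sketch). Returns true if `result` was modified.
// Assumption: `pool` is sorted by ascending distance.
bool replace_worst_if_duplicate_cluster(
    std::vector<Neighbor>& result, const std::vector<Neighbor>& pool
) {
    assert(std::is_sorted(pool.begin(), pool.end(), [](const Neighbor& a, const Neighbor& b) {
        return a.distance < b.distance;
    }));
    // Assumption: a single neighbor is not treated as a cluster.
    if (result.size() < 2) {
        return false;
    }
    const float anchor_dist = result.front().distance;
    // Duplicate cluster: every selected neighbor sits at the anchor distance.
    for (const auto& n : result) {
        if (n.distance != anchor_dist) {
            return false;
        }
    }
    // The pool is sorted, so the first candidate at a different distance is
    // the closest one. It cannot already be in `result`, because every entry
    // in `result` has distance == anchor_dist.
    for (const auto& candidate : pool) {
        if (candidate.distance != anchor_dist) {
            result.back() = candidate;  // overwrite the last (worst) slot
            return true;
        }
    }
    return false;  // the pool contains only duplicates; nothing to swap in
}
```

In this sketch no membership check against `result` is needed before the replacement, which matches the reasoning in the review comment above. Whether the same holds for the actual implementation depends on details not shown here.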