Enhance heuristic pruning to handle duplicate clusters #282

Open
DIlkhush00 wants to merge 1 commit into intel:main from DIlkhush00:dilkhush00/duplicate-cluster-fix

@DIlkhush00
This PR fixes #80. It adds a post-pruning step to both:

  • IterativePruneStrategy
  • ProgressivePruneStrategy

Approach: if a duplicate cluster is detected, the last (worst) slot in the result is replaced with the closest candidate from the pool whose distance differs from that of the duplicates.

Signed-off-by: Dilkhush Purohit <dilkhushpurohit01@gmail.com>
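
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Neighbor` struct and the `diversify_duplicate_cluster` function are hypothetical names, the pool is assumed to be sorted by ascending distance, and a "duplicate cluster" is assumed to mean that every entry in the result shares the distance of the first entry.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical neighbor record used only for this sketch.
struct Neighbor {
    std::uint32_t id;
    float distance;
};

// Post-pruning step (sketch). Assumes `pool` is sorted by ascending distance.
// Returns true if the last slot of `result` was replaced.
bool diversify_duplicate_cluster(const std::vector<Neighbor>& pool,
                                 std::vector<Neighbor>& result) {
    // Assumption: fewer than two entries is not treated as a cluster here.
    if (result.size() < 2) {
        return false;
    }
    const float anchor_dist = result.front().distance;

    // Duplicate cluster: every selected neighbor sits at the same distance.
    const bool all_same = std::all_of(
        result.begin(), result.end(),
        [anchor_dist](const Neighbor& n) { return n.distance == anchor_dist; });
    if (!all_same) {
        return false;
    }

    // Because the pool is sorted, the first candidate with a different
    // distance is also the closest such candidate.
    const auto it = std::find_if(
        pool.begin(), pool.end(),
        [anchor_dist](const Neighbor& c) { return c.distance != anchor_dist; });
    if (it == pool.end()) {
        return false;  // Nothing but duplicates in the pool; leave result as is.
    }

    result.back() = *it;  // Overwrite the last (worst) slot.
    return true;
}
```

In this sketch a result that already contains at least two distinct distances is left untouched, so the step only does work in the duplicate case.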
@ibhati (Member) left a comment

Thanks @DIlkhush00 for working on this issue. The solution looks good overall. I left one comment below, and I was also wondering whether we could add a test that specifically triggers this case.
Additionally, could you generate a synthetic dataset with a larger number of vectors (e.g., 1M) and measure the build-time impact? It would also be helpful to benchmark GIST-1M with and without this change to confirm that:

  • Datasets without duplicates show no recall or build-time regression.
  • In scenarios where duplicates do occur, this logic does not significantly impact build time, and recall remains consistent.

Thanks again for your contribution!
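
One possible way to produce such a synthetic dataset is sketched below. The helper name `make_dataset_with_duplicates`, the row-major float layout, the value range, and the default seed are all assumptions made for illustration. Nothing here comes from the repository or its benchmarking tools.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: builds a row-major n x dim matrix of random floats
// in which the first `num_duplicates` rows are exact copies of row 0.
std::vector<float> make_dataset_with_duplicates(std::size_t n,
                                                std::size_t dim,
                                                std::size_t num_duplicates,
                                                unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    // Fill every row with random values first.
    std::vector<float> data(n * dim);
    for (float& x : data) {
        x = dist(rng);
    }

    // Overwrite rows 1 .. dup-1 with row 0 to form one duplicate cluster.
    const std::size_t dup = std::min(num_duplicates, n);
    for (std::size_t r = 1; r < dup; ++r) {
        std::copy_n(data.begin(), dim,
                    data.begin() + static_cast<std::ptrdiff_t>(r * dim));
    }
    return data;
}
```

Calling it with `num_duplicates` set to 0 or 1 yields a dataset without a forced duplicate cluster, which could serve as the baseline when comparing build time with and without the change.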

in_result = true;
break;
}
}
I’m confused about why cid could already be in result here. We just checked above that this candidate's distance differs from anchor_dist, so I wouldn’t expect it to be present (since all the results have the same distance). Should this be an assert instead (i.e., this candidate must not already be in result)? Am I missing a scenario?
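
The assert-based alternative raised in this comment might look something like the sketch below. It is an assumption-laden illustration rather than the PR's code: `Neighbor`, `contains_id`, and `replace_worst_slot` are hypothetical names, and the reasoning only holds if every entry in the result has distance `anchor_dist` and each id appears with a single distance.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical neighbor record used only for this sketch.
struct Neighbor {
    std::uint32_t id;
    float distance;
};

// Linear membership scan over the current result.
inline bool contains_id(const std::vector<Neighbor>& result, std::uint32_t cid) {
    return std::any_of(result.begin(), result.end(),
                       [cid](const Neighbor& n) { return n.id == cid; });
}

// Replaces the last (worst) slot. The membership test is expressed as an
// assertion instead of a runtime branch: the candidate's distance differs
// from anchor_dist while every result entry is assumed to have anchor_dist,
// so the candidate should not be able to appear in the result already.
inline void replace_worst_slot(std::vector<Neighbor>& result,
                               const Neighbor& candidate,
                               float anchor_dist) {
    assert(!result.empty());
    assert(candidate.distance != anchor_dist);
    assert(!contains_id(result, candidate.id));
    result.back() = candidate;
}
```

With the standard `assert` macro, these checks are compiled out when `NDEBUG` is defined, so the membership scan would add no cost to release builds.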

Development

Successfully merging this pull request may close these issues.

Corner case when entry point has duplicates

2 participants