Description
In scenarios where multiple neighbors have the same distance to a query point (ties), the current implementation deterministically returns the same k neighbors based on the tree traversal or data ordering. This is problematic for datasets where ties are common, such as those with duplicate points or rounded values.
Problem:
When ties occur and the number of tied neighbors exceeds k, the library always selects the same k neighbors. This can lead to biased results and limits diversity in downstream applications like stochastic modeling or simulations.
For example:
- A query point with 1000 equidistant neighbors (distance 0) but k = 10 will always return the same 10 neighbors.
- Users cannot randomize neighbor selection among tied points, which reduces flexibility and fairness.
Proposed Solution:
Introduce an option to handle ties during k-NN queries:
- Detect tied distances among neighbors.
- Randomly select k neighbors from the tied group when ties occur.
- Add an optional parameter (e.g., resolve_ties = TRUE) to enable or disable this behavior.
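To make the request concrete, here is a minimal sketch of the tie handling I have in mind, written as plain Python post-processing rather than as libnabo code. Everything in it is hypothetical: `knn_with_random_ties`, `tol`, and `rng` are names made up for illustration, and it assumes the distances from the query to all candidate points are already available. An actual implementation inside the tree search would also have to collect every point at the cutoff distance instead of stopping at the first k.

```python
import random

def knn_with_random_ties(distances, k, rng=None, tol=0.0):
    """Return indices of the k nearest entries of `distances`.

    Hypothetical helper, not part of libnabo. Points strictly closer than
    the k-th smallest distance are always kept; the remaining slots are
    filled by sampling uniformly from the points tied at that distance.
    """
    rng = rng or random.Random()
    if k <= 0:
        return []
    # Candidate indices sorted by increasing distance.
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    if k >= len(order):
        return order
    # Distance of the k-th nearest neighbor; ties are detected against it.
    cutoff = distances[order[k - 1]]
    closer = [i for i in order if distances[i] < cutoff - tol]
    tied = [i for i in order if abs(distances[i] - cutoff) <= tol]
    # Randomly pick the remaining neighbors from the tied group.
    chosen = rng.sample(tied, k - len(closer))
    return closer + chosen

# 1000 duplicates of the query point (distance 0) plus a few farther points.
distances = [0.0] * 1000 + [1.0] * 5
print(knn_with_random_ties(distances, 10))                        # varies per call
print(knn_with_random_ties(distances, 10, rng=random.Random(42))) # reproducible
```

Passing a seeded generator keeps results reproducible for users who need that, and leaving the option off would preserve the current deterministic behavior.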
Why This Matters:
Tied distances are common in real-world datasets due to:
- Exact matches or duplicate points.
- Rounded or discretized data.
Without tie handling, deterministic selection introduces bias and limits the applicability of libnabo for use cases requiring diverse or randomized neighbor selection.
Request:
Would it be possible to add support for resolving ties as an optional feature in libnabo? This would improve the library’s flexibility and usability for datasets where ties frequently occur.
Thank you for your work on this library!