
[BUG] kneighbors_graph include_self does not work as expected #6544

Open
@aamijar


When building a KNN graph, we can set include_self=False to exclude the edge between each point and itself (whose distance is always 0).
The current filter implements this by dropping the first column of the neighbor index and distance arrays, which assumes that each point is always returned as its own first neighbor.

# drop first column if using training data as X
# this will need to be moved to the C++ layer (cuml issue #2562)
if use_training_data:
    if out_type in {'cupy', 'numpy', 'numba'}:
        I_ndarr = I_ndarr[:, 1:]
        D_ndarr = D_ndarr[:, 1:]
    elif out_type == "cuml":
        I_ndarr = CumlArray.from_input(I_ndarr[:, 1:], force_contiguous=True)
        D_ndarr = CumlArray.from_input(D_ndarr[:, 1:], force_contiguous=True)
    else:
        I_ndarr.drop(I_ndarr.columns[0], axis=1)
        D_ndarr.drop(D_ndarr.columns[0], axis=1)

However, this fails when distances are tied: with duplicate points there are multiple zero-distance neighbors, so the first column is not guaranteed to be the point itself.
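
To illustrate the mechanism without a GPU, here is a minimal NumPy sketch. It assumes a brute-force search with a stable sort as a stand-in for the real kNN backend; the actual tie order in cuML may differ, so the affected row can differ too.

```python
import numpy as np

X = np.array([[1, 5], [1, 5], [7, 3], [9, 6], [10, 1]], dtype=float)
k = 2

# Brute-force pairwise squared distances (stand-in for the real kNN search).
diff = X[:, None, :] - X[None, :, :]
dist = (diff ** 2).sum(axis=-1)

# Query k + 1 neighbors, since each point is expected to find itself.
# A stable sort breaks ties by lowest index; other backends may order ties differently.
idx = np.argsort(dist, axis=1, kind="stable")[:, : k + 1]

# Naive filter: assume column 0 is always the point itself.
naive = idx[:, 1:]

# Rows whose remaining neighbors still contain their own index.
own_index = np.arange(len(X))[:, None]
self_rows = np.flatnonzero((naive == own_index).any(axis=1))
print(self_rows)
```

In this sketch the duplicate points 0 and 1 both have distance 0 to each other and to themselves, so whichever one sorts second keeps its self edge after the first column is dropped.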

import numpy as np
from sklearn.neighbors import kneighbors_graph
from cuml.neighbors import kneighbors_graph as cuKNN

X = np.array([
    [1, 5],
    [1, 5],
    [7, 3],
    [9, 6],
    [10, 1]
])

knn_graph = kneighbors_graph(X, 2, mode='connectivity', include_self=False)
print(knn_graph.toarray())
print("")
knn_graph = cuKNN(X, 2, mode='connectivity', include_self=False)
print(knn_graph.toarray())

# sklearn output
[[0. 1. 1. 0. 0.]
 [1. 0. 1. 0. 0.]
 [0. 0. 0. 1. 1.]
 [0. 0. 1. 0. 1.]
 [0. 0. 1. 1. 0.]]

# cuml output is different with a 1 on the top left
[[1. 0. 1. 0. 0.]
 [1. 0. 1. 0. 0.]
 [0. 0. 0. 1. 1.]
 [0. 0. 1. 0. 1.]
 [0. 0. 1. 1. 0.]]
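
One possible direction, sketched in NumPy, is to filter by index instead of by column position. `drop_self` is a hypothetical helper, not part of cuML, and the index and distance arrays below are hand-written to mimic the tie order implied by the cuML output above (squared distances, k + 1 = 3 neighbors per row).

```python
import numpy as np

def drop_self(indices, distances):
    """Remove each row's own index from (n, k + 1) kNN results.

    Hypothetical helper. If a row does not contain its own index
    (ties can push it out), the last column is dropped instead.
    """
    n, width = indices.shape
    is_self = indices == np.arange(n)[:, None]
    # Rows that never returned themselves: drop the farthest neighbor.
    missing = ~is_self.any(axis=1)
    is_self[missing, -1] = True
    keep = ~is_self  # exactly one column is removed per row
    return (indices[keep].reshape(n, width - 1),
            distances[keep].reshape(n, width - 1))

# Assumed kNN output: rows 0 and 1 both list point 1 first because of the tie.
idx = np.array([[1, 0, 2], [1, 0, 2], [2, 3, 4], [3, 2, 4], [4, 2, 3]])
dist = np.array([[0, 0, 40], [0, 0, 40], [0, 13, 13], [0, 13, 26], [0, 13, 26]], dtype=float)

nbr_idx, nbr_dist = drop_self(idx, dist)

# Build a dense connectivity matrix from the filtered neighbors.
graph = np.zeros((5, 5))
graph[np.arange(5)[:, None], nbr_idx] = 1.0
print(graph)
```

With these assumed inputs the diagonal stays empty and the connectivity matrix matches the sklearn output above. The same masking would need to be applied in whichever array library holds the results (CuPy, cuDF, and so on), which this sketch does not cover.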
