Description
Describe the bug
An invalid memory access can occur during UMAP transform when some points in the dataset have more duplicates (neighbors of zero distance) than n_neighbors, and local connectivity is set to 1.
During transform in this line we set adjusted_local_connectivity = max(0.0, params->local_connectivity - 1.0);
, which resolves to 0 when local_connectivity=1
(the default).
Subsequently, in smooth_knn_dist_kernel, start_nonzero=-1
tracks the first non-zero distance neighbor. However, if all the inspected nearest neighbors have zero distance, start_nonzero
remains -1, and when we index into rhos[row] = knn_dists[i + start_nonzero + (index - 1)]
(this line) when i=0
and index=adjusted_local_connectivity=0
, we erroneously access knn_dists[i + (-1) + 0]
, 4 bytes before the allocation.
Steps/Code to reproduce bug
A minimal repro using a sparse dataset where each row has many duplicates to trigger the bug.
n_rows = 3000
n_cols = 50
nnz = 12
indices = []
indptr = [0]
data = []
# Each row has exactly nnz non-zero elements at positions [(i+j)%n_cols for j in range(nnz)].
# This means that every row will have distance 0 to 3000/50=60 other rows.
for i in range(n_rows):
row_indices = [(i + j) % n_cols for j in range(nnz)]
row_indices.sort()
indices.extend(row_indices)
data.extend([1.0] * nnz)
indptr.append(indptr[-1] + nnz)
input_data = csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))
Running with compute sanitizer and local connectivity = 1:
umap = UMAP(n_neighbors=15, metric="jaccard", random_state=42, local_connectivity=1)
umap_model = umap.fit(input_data)
embedding = umap_model.transform(input_data)
print(f"Success: {embedding.shape}")
========= Invalid __global__ read of size 4 bytes
========= at void UMAPAlgo::FuzzySimplSet::Naive::smooth_knn_dist_kernel<float, int, (int)256>(const T1 *, int, float, T1 *, T1 *, int, float, int, float)+0x820
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x7598f09ffffc is out of bounds
========= and is 4 bytes before the nearest allocation at 0x7598f0a00000 of size 180,000 bytes
Running with local connectivity = 2:
umap = UMAP(n_neighbors=15, metric="jaccard", random_state=42, local_connectivity=2)
umap_model = umap.fit(input_data)
embedding = umap_model.transform(input_data)
print(f"Success: {embedding.shape}")
Success: (3000, 2)