Skip to content

[BUG] UMAP OOB with all-zero neighboring distances and local_connectivity = 1 #6490

Open
@rishic3

Description

@rishic3

Describe the bug

An invalid memory access can occur during UMAP transform when some points in the dataset have more duplicates (neighbors of zero distance) than n_neighbors, and local connectivity is set to 1.

During transform in this line we set adjusted_local_connectivity = max(0.0, params->local_connectivity - 1.0);, which resolves to 0 when local_connectivity=1 (the default).

Subsequently, in smooth_knn_dist_kernel, start_nonzero=-1 tracks the first non-zero distance neighbor. However, if all the inspected nearest neighbors have zero distance, start_nonzero remains -1, and when we index into rhos[row] = knn_dists[i + start_nonzero + (index - 1)] (this line) when i=0 and index=adjusted_local_connectivity=0, we erroneously access knn_dists[i + (-1) + 0], 4 bytes before the allocation.

Steps/Code to reproduce bug
A minimal repro using a sparse dataset where each row has many duplicates to trigger the bug.

n_rows = 3000
n_cols = 50
nnz = 12

indices = []
indptr = [0]
data = []

# Each row has exactly nnz non-zero elements at positions [(i+j)%n_cols for j in range(nnz)].
# This means that every row will have distance 0 to 3000/50=60 other rows.
for i in range(n_rows):
    row_indices = [(i + j) % n_cols for j in range(nnz)]
    row_indices.sort()
    indices.extend(row_indices)
    data.extend([1.0] * nnz)
    indptr.append(indptr[-1] + nnz)

input_data = csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))

Running with compute sanitizer and local connectivity = 1:

umap = UMAP(n_neighbors=15, metric="jaccard", random_state=42, local_connectivity=1)
umap_model = umap.fit(input_data)
embedding = umap_model.transform(input_data)
print(f"Success: {embedding.shape}")
========= Invalid __global__ read of size 4 bytes
=========     at void UMAPAlgo::FuzzySimplSet::Naive::smooth_knn_dist_kernel<float, int, (int)256>(const T1 *, int, float, T1 *, T1 *, int, float, int, float)+0x820
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7598f09ffffc is out of bounds
=========     and is 4 bytes before the nearest allocation at 0x7598f0a00000 of size 180,000 bytes

Running with local connectivity = 2:

umap = UMAP(n_neighbors=15, metric="jaccard", random_state=42, local_connectivity=2)
umap_model = umap.fit(input_data)
embedding = umap_model.transform(input_data)
print(f"Success: {embedding.shape}")
Success: (3000, 2)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions