cuml.UMAP embeddings result in outliers #6454

Open
Description

@jcrist

This came out of running some analysis on hyperparameters in our nn-descent UMAP implementation, but seems to affect both build algos (indicating it's an issue in the steps after building the knn graph).

On sufficiently wide datasets, the embedding_ produced by cuml.UMAP includes some significant outliers that aren't found in the umap.UMAP implementation.

Reproducer (this takes a few minutes to run due to the CPU implementation):

import time

import umap
import cuml
from sklearn.datasets import make_blobs


def fit_and_evaluate(label, model, X, cpu=None):
    # `model` rather than `umap` to avoid shadowing the umap module
    start = time.time()

    model.fit(X)

    duration = time.time() - start
    mins = int(duration // 60)
    secs = duration % 60

    print(f"{label}:")
    print(f"- time: {mins:02d}:{secs:05.2f}")
    print(f"- min: {model.embedding_.min(axis=0).tolist()}")
    print(f"- max: {model.embedding_.max(axis=0).tolist()}")
    if cpu is not None:
        # semi-arbitrarily define outliers as points outside 2 stds of cpu bounds
        cpu_min = cpu.embedding_.min(axis=0)
        cpu_max = cpu.embedding_.max(axis=0)
        cpu_std = cpu.embedding_.std(axis=0)
        lower_bound = cpu_min - (2 * cpu_std)
        upper_bound = cpu_max + (2 * cpu_std)
        n_outliers = (
            ((model.embedding_ < lower_bound) | (model.embedding_ > upper_bound))
            .any(axis=1)
            .sum()
        )
        print(f"- n_outliers: {n_outliers}")


# Fit 3 different UMAPs
common = {"n_neighbors": 15, "init": "random"}
umap_nnd = cuml.UMAP(build_algo="nn_descent", **common)
umap_bf = cuml.UMAP(build_algo="brute_force_knn", **common)
umap_cpu = umap.UMAP(**common)

X, _ = make_blobs(centers=10, n_features=512, n_samples=500_000)

fit_and_evaluate("CPU", umap_cpu, X)
fit_and_evaluate("NN-Descent", umap_nnd, X, cpu=umap_cpu)
fit_and_evaluate("Brute Force KNN", umap_bf, X, cpu=umap_cpu)

Output:

CPU:
- time: 04:41.46
- min: [-7.943415641784668, -8.329208374023438]
- max: [17.919870376586914, 15.174100875854492]
NN-Descent:
- time: 00:05.88
- min: [-646.9306030273438, -815.4732055664062]
- max: [782.5948486328125, 438.62982177734375]
- n_outliers: 72
Brute Force KNN:
- time: 00:35.99
- min: [-263.966796875, -427.2183837890625]
- max: [260.0986328125, 11.626998901367188]
- n_outliers: 33

NN-Descent does worse than Brute Force KNN in all runs I've done, but the difference is run dependent; in some runs brute force doesn't look as bad. I think this means the input KNN graph has some influence here, but the actual bug is probably downstream of the KNN graph construction.

The actual plots generated here appear nearly identical across all 3 versions once the GPU versions are trimmed to exclude points outside the 0.005/0.995 quantiles (a stricter definition of "outlier" than used above). Without this trimming the plots have much larger bounds, since the min/max on the GPU versions are sometimes orders of magnitude larger than the CPU version.
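For reference, the trimming step can be sketched as below (a minimal NumPy sketch, not the actual plotting code; the `trim_to_quantiles` name and the injected-outlier example are illustrative):

```python
import numpy as np


def trim_to_quantiles(embedding, lo=0.005, hi=0.995):
    """Keep only points whose coordinates all fall within the
    per-axis [lo, hi] quantile bounds of the embedding."""
    lower = np.quantile(embedding, lo, axis=0)
    upper = np.quantile(embedding, hi, axis=0)
    mask = ((embedding >= lower) & (embedding <= upper)).all(axis=1)
    return embedding[mask]


# Example: a well-behaved 2D embedding with one injected extreme outlier,
# standing in for the occasional far-flung points in the GPU embeddings.
rng = np.random.default_rng(42)
embedding = rng.normal(size=(1000, 2))
embedding[0] = [500.0, -500.0]  # injected outlier

trimmed = trim_to_quantiles(embedding)
print(embedding.shape, trimmed.shape)  # the outlier row is dropped
```

Trimming like this only hides the outliers for visualization; the bounds reported above are computed on the untrimmed embeddings.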
