Description
This came out of some analysis of hyperparameters in our nn-descent UMAP implementation, but it seems to affect both build algos (indicating the issue is in the steps after the knn graph is built). On sufficiently wide datasets, the embedding_ produced by cuml.UMAP includes some significant outliers that aren't present in the umap.UMAP implementation's output.
Reproducer (this takes a few minutes to run due to the CPU implementation):
import time
import umap
import cuml
from sklearn.datasets import make_blobs
def fit_and_evaluate(label, model, X, cpu=None):
    start = time.time()
    model.fit(X)
    duration = time.time() - start
    mins = int(duration // 60)
    secs = duration % 60
    print(f"{label}:")
    print(f"- time: {mins:02d}:{secs:05.2f}")
    print(f"- min: {model.embedding_.min(axis=0).tolist()}")
    print(f"- max: {model.embedding_.max(axis=0).tolist()}")
    if cpu is not None:
        # semi-arbitrarily define outliers as points outside 2 stds of the cpu bounds
        cpu_min = cpu.embedding_.min(axis=0)
        cpu_max = cpu.embedding_.max(axis=0)
        cpu_std = cpu.embedding_.std(axis=0)
        lower_bound = cpu_min - (2 * cpu_std)
        upper_bound = cpu_max + (2 * cpu_std)
        n_outliers = (
            ((model.embedding_ < lower_bound) | (model.embedding_ > upper_bound))
            .any(axis=1)
            .sum()
        )
        print(f"- n_outliers: {n_outliers}")
# Fit 3 different UMAPs
common = {"n_neighbors": 15, "init": "random"}
umap_nnd = cuml.UMAP(build_algo="nn_descent", **common)
umap_bf = cuml.UMAP(build_algo="brute_force_knn", **common)
umap_cpu = umap.UMAP(**common)
X, _ = make_blobs(centers=10, n_features=512, n_samples=500_000)
fit_and_evaluate("CPU", umap_cpu, X)
fit_and_evaluate("NN-Descent", umap_nnd, X, cpu=umap_cpu)
fit_and_evaluate("Brute Force KNN", umap_bf, X, cpu=umap_cpu)
Output:
CPU:
- time: 04:41.46
- min: [-7.943415641784668, -8.329208374023438]
- max: [17.919870376586914, 15.174100875854492]
NN-Descent:
- time: 00:05.88
- min: [-646.9306030273438, -815.4732055664062]
- max: [782.5948486328125, 438.62982177734375]
- n_outliers: 72
Brute Force KNN:
- time: 00:35.99
- min: [-263.966796875, -427.2183837890625]
- max: [260.0986328125, 11.626998901367188]
- n_outliers: 33
NN-Descent does worse than brute force KNN in all the runs I've done, but the difference is run dependent; in some runs brute force doesn't look as bad. I think this means the input knn graph has some influence here, but that the actual bug is probably downstream of the knn graph construction.
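One way to test that would be to build a single exact knn graph up front and feed it to both GPU configurations; if the outliers persist with an identical input graph, the bug has to be downstream. A rough sketch (untested; I believe cuml.UMAP.fit accepts a precomputed knn_graph, but that may depend on the cuml version):

from cuml.neighbors import NearestNeighbors

# Build one exact knn graph and reuse it for every fit, taking graph
# construction out of the equation entirely.
knn_graph = NearestNeighbors(n_neighbors=15).fit(X).kneighbors_graph(X)

umap_fixed = cuml.UMAP(**common)
umap_fixed.fit(X, knn_graph=knn_graph)
print(umap_fixed.embedding_.min(axis=0), umap_fixed.embedding_.max(axis=0))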
The actual plots generated here look essentially identical across all 3 versions once the GPU versions are trimmed to exclude points outside the 0.005/0.995 quantiles (a stricter definition of "outlier" than the one used above). Without this trimming the plots have much larger bounds, since the min/max on the GPU versions are sometimes orders of magnitude larger than on the CPU version.
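For reference, the trimming for those plots was along these lines (a minimal sketch, not part of the reproducer above; it assumes the embedding is already a NumPy array, so cuml results may need converting to host memory first):

import numpy as np
import matplotlib.pyplot as plt

def plot_trimmed(embedding, lower_q=0.005, upper_q=0.995):
    # Drop points outside the per-dimension quantile bounds before plotting,
    # so a handful of extreme outliers don't blow up the axis limits.
    lo = np.quantile(embedding, lower_q, axis=0)
    hi = np.quantile(embedding, upper_q, axis=0)
    mask = ((embedding >= lo) & (embedding <= hi)).all(axis=1)
    plt.scatter(embedding[mask, 0], embedding[mask, 1], s=0.1)
    plt.show()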