Description
This came out of some analysis of hyperparameters in our nn-descent UMAP implementation, but it seems to affect both build algos (indicating the issue is in the steps after the knn graph is built). On sufficiently wide datasets, the embedding_ produced by cuml.UMAP includes some significant outliers that aren't present in the umap.UMAP implementation's output.
Reproducer (this takes a few minutes to run due to the CPU implementation):
import time
import umap
import cuml
from sklearn.datasets import make_blobs
def fit_and_evaluate(label, model, X, cpu=None):
    start = time.time()
    model.fit(X)
    duration = time.time() - start
    mins = int(duration // 60)
    secs = duration % 60
    print(f"{label}:")
    print(f"- time: {mins:02d}:{secs:05.2f}")
    print(f"- min: {model.embedding_.min(axis=0).tolist()}")
    print(f"- max: {model.embedding_.max(axis=0).tolist()}")
    if cpu is not None:
        # semi-arbitrarily define outliers as points outside 2 stds of the cpu bounds
        cpu_min = cpu.embedding_.min(axis=0)
        cpu_max = cpu.embedding_.max(axis=0)
        cpu_std = cpu.embedding_.std(axis=0)
        lower_bound = cpu_min - (2 * cpu_std)
        upper_bound = cpu_max + (2 * cpu_std)
        n_outliers = (
            ((model.embedding_ < lower_bound) | (model.embedding_ > upper_bound))
            .any(axis=1)
            .sum()
        )
        print(f"- n_outliers: {n_outliers}")
# Fit 3 different UMAPs
common = {"n_neighbors": 15, "init": "random"}
umap_nnd = cuml.UMAP(build_algo="nn_descent", **common)
umap_bf = cuml.UMAP(build_algo="brute_force_knn", **common)
umap_cpu = umap.UMAP(**common)
X, _ = make_blobs(centers=10, n_features=512, n_samples=500_000)
fit_and_evaluate("CPU", umap_cpu, X)
fit_and_evaluate("NN-Descent", umap_nnd, X, cpu=umap_cpu)
fit_and_evaluate("Brute Force KNN", umap_bf, X, cpu=umap_cpu)
Output:
CPU:
- time: 04:41.46
- min: [-7.943415641784668, -8.329208374023438]
- max: [17.919870376586914, 15.174100875854492]
NN-Descent:
- time: 00:05.88
- min: [-646.9306030273438, -815.4732055664062]
- max: [782.5948486328125, 438.62982177734375]
- n_outliers: 72
Brute Force KNN:
- time: 00:35.99
- min: [-263.966796875, -427.2183837890625]
- max: [260.0986328125, 11.626998901367188]
- n_outliers: 33
NN-Descent does worse than brute force KNN in all the runs I've done, but the difference is run dependent; in some runs brute force doesn't look as bad. I think this means the input knn graph has some influence here, but that the actual bug is probably downstream of the knn graph construction.
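One way to test that would be to build a single exact knn graph up front and feed it to both GPU configurations; if the outliers persist with an identical input graph, the bug has to be downstream. A rough sketch (untested; I believe cuml.UMAP.fit accepts a precomputed knn_graph, but that may depend on the cuml version):

from cuml.neighbors import NearestNeighbors

# Build one exact knn graph and reuse it for every fit, taking graph
# construction out of the equation entirely.
knn_graph = NearestNeighbors(n_neighbors=15).fit(X).kneighbors_graph(X)

umap_fixed = cuml.UMAP(**common)
umap_fixed.fit(X, knn_graph=knn_graph)
print(umap_fixed.embedding_.min(axis=0), umap_fixed.embedding_.max(axis=0))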
The actual plots generated here look essentially identical across all 3 versions once the GPU versions are trimmed to exclude points outside the 0.005/0.995 quantiles (a stricter definition of "outlier" than the one used above). Without this trimming the plots have much larger bounds, since the min/max on the GPU versions are sometimes orders of magnitude larger than on the CPU version.
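For reference, the trimming for those plots was along these lines (a minimal sketch, not part of the reproducer above; it assumes the embedding is already a NumPy array, so cuml results may need converting to host memory first):

import numpy as np
import matplotlib.pyplot as plt

def plot_trimmed(embedding, lower_q=0.005, upper_q=0.995):
    # Drop points outside the per-dimension quantile bounds before plotting,
    # so a handful of extreme outliers don't blow up the axis limits.
    lo = np.quantile(embedding, lower_q, axis=0)
    hi = np.quantile(embedding, upper_q, axis=0)
    mask = ((embedding >= lo) & (embedding <= hi)).all(axis=1)
    plt.scatter(embedding[mask, 0], embedding[mask, 1], s=0.1)
    plt.show()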