Differences between `umap.UMAP` and `cuml.UMAP` in embeddings logic

While bug hunting through our `UMAP` implementation, I found two notable (and related) differences between our logic and that in `umap-learn`. Raising them here.

When generating the embeddings, `umap-learn` does the following:

- Stores the unmodified fuzzy simplicial set (fss) graph as a `graph_` attribute
- Makes a copy of the input graph ([link](https://github.com/lmcinnes/umap/blob/63903695b996e2b68c67e00fdda6eedb7a233189/umap/umap_.py#L1068))
- Thresholds values in the _copy_ of the graph, setting low values to 0 ([link](https://github.com/lmcinnes/umap/blob/63903695b996e2b68c67e00fdda6eedb7a233189/umap/umap_.py#L1088-L1091))
- Drops zero values in the _copy_ of the graph ([link](https://github.com/lmcinnes/umap/blob/63903695b996e2b68c67e00fdda6eedb7a233189/umap/umap_.py#L1093))
- Runs initialization (spectral, random, ...)
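For reference, the copy-then-threshold order can be sketched with `scipy.sparse`. This is a rough illustration, not the `umap-learn` source: the helper name `threshold_copy` and the toy graph are made up here, and the cutoff (largest weight divided by `n_epochs`) is my reading of the linked lines.

```python
import numpy as np
import scipy.sparse as sp


def threshold_copy(graph, n_epochs):
    """Return a thresholded, zero-free COO copy; `graph` itself is untouched.

    Hypothetical helper for illustration only.
    """
    pruned = graph.tocoo(copy=True)       # work on a copy of the input graph
    pruned.sum_duplicates()
    # Assumed cutoff: largest edge weight divided by the number of epochs.
    cutoff = pruned.data.max() / float(n_epochs)
    pruned.data[pruned.data < cutoff] = 0.0
    pruned.eliminate_zeros()              # drop the entries that were zeroed
    return pruned


# Toy symmetric graph with two very weak edges (weight 0.001).
graph = sp.csr_matrix(
    np.array(
        [
            [0.0, 1.0, 0.001],
            [1.0, 0.0, 0.5],
            [0.001, 0.5, 0.0],
        ]
    )
)
pruned = threshold_copy(graph, n_epochs=200)
# Initialization would run here, on `pruned`; `graph` is what gets stored.
```

After this runs, `graph` still holds all six stored weights, while `pruned` has lost the two weak edges.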

In contrast, our implementation:

- Runs initialization (spectral, random, ...)
- Thresholds values in the input graph, setting low values to 0. This happens _in place_, mutating the input graph. ([link](https://github.com/rapidsai/cuml/blob/9559f85270fc61575f76f979db7c582eee3fcd73/cpp/src/umap/simpl_set_embed/algo.cuh#L322-L332))
- Allocates a new COO and copies the non-zero elements of the above graph to it ([link](https://github.com/rapidsai/cuml/blob/9559f85270fc61575f76f979db7c582eee3fcd73/cpp/src/umap/simpl_set_embed/algo.cuh#L334-L335))
- Stores the input graph (thresholded, but with zero elements _not_ removed) as `graph_` ([link](https://github.com/rapidsai/cuml/blob/9559f85270fc61575f76f979db7c582eee3fcd73/python/cuml/cuml/manifold/umap.pyx#L689)).
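The effect of the in-place ordering can be emulated on the CPU with `scipy.sparse`. This is only a sketch of the behaviour described in the list, not a port of the CUDA code: `threshold_in_place` is a hypothetical name, and the cutoff (largest weight divided by `n_epochs`) is an assumption.

```python
import numpy as np
import scipy.sparse as sp


def threshold_in_place(graph, n_epochs):
    """Zero small weights in `graph` itself, then return a compacted copy.

    Hypothetical CPU emulation for illustration only.
    """
    # Assumed cutoff: largest edge weight divided by the number of epochs.
    cutoff = graph.data.max() / float(n_epochs)
    graph.data[graph.data < cutoff] = 0.0   # mutates the caller's graph
    keep = graph.data != 0.0
    # New COO holding only the surviving (non-zero) entries.
    return sp.coo_matrix(
        (graph.data[keep], (graph.row[keep], graph.col[keep])),
        shape=graph.shape,
    )


graph = sp.coo_matrix(
    np.array(
        [
            [0.0, 1.0, 0.001],
            [1.0, 0.0, 0.5],
            [0.001, 0.5, 0.0],
        ]
    )
)
compact = threshold_in_place(graph, n_epochs=200)
# `graph` (what would end up as `graph_`) now carries explicit zeros.
```

Here `graph` keeps six stored entries, two of which are now explicit zeros, and `compact` holds the four surviving edges.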

This results in two functional differences:

- In our implementation, init is run before thresholding. In `umap-learn`, init is run after thresholding, so the two implementations initialize from different graphs. This may be partially responsible for the differences we see in our embeddings vs those generated by `umap-learn`.
- In our implementation, `graph_` has had thresholding applied to it (but zero values not removed). In `umap-learn`, `graph_` is the raw graph with no thresholding applied.
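One quick way to check the second point on a fitted model is to count explicitly stored zeros in `graph_`. The helper below is a hypothetical sketch; it assumes `graph_` has already been converted to a host-side `scipy.sparse` matrix, and the example matrix is made up.

```python
import numpy as np
import scipy.sparse as sp


def explicit_zero_count(graph):
    """Count entries that are stored in the sparse structure but equal 0."""
    coo = sp.coo_matrix(graph)   # conversion keeps explicitly stored zeros
    return int(np.sum(coo.data == 0))


# Made-up stand-in for a thresholded `graph_` with one zeroed edge.
example = sp.coo_matrix(
    (np.array([1.0, 0.0, 0.5]), (np.array([0, 0, 1]), np.array([1, 2, 2]))),
    shape=(3, 3),
)
n_zeros = explicit_zero_count(example)
```

Under the behaviour described above, I'd expect a non-zero count whenever any edge fell below the threshold in our `graph_`, and zero for the `umap-learn` one.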

Differences between `umap.UMAP` and `cuml.UMAP` in embeddings logic #6539

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Differences between umap.UMAP and cuml.UMAP in embeddings logic #6539

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Differences between `umap.UMAP` and `cuml.UMAP` in embeddings logic #6539