Differences between umap.UMAP and cuml.UMAP in embeddings logic #6539

@jcrist

While bug hunting through our UMAP implementation, I found two notable (and related) differences between our logic and that in umap-learn. Raising them here.

When generating the embeddings, umap-learn does the following:

  • Stores the unmodified fuzzy simplicial set (fss) graph as a graph_ attribute
  • Makes a copy of the input graph (link)
  • Thresholds values in the copy of the graph, setting low values to 0 (link)
  • Drops zero values in the copy of the graph (link)
  • Runs initialization (spectral, random, ...)
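For illustration, here is a minimal SciPy sketch of the copy-then-threshold ordering described above. It is not umap-learn's actual code: the helper name `prune_copy`, the toy graph, and the `n_epochs=200` value are assumptions made for the example. The cutoff rule (largest weight divided by the epoch count) mirrors what umap-learn's embedding step does.

```python
import numpy as np
import scipy.sparse as sp


def prune_copy(graph, n_epochs):
    """Return a thresholded, zero-free COO copy; the input graph is untouched.

    Hypothetical helper, written only to illustrate the ordering.
    """
    pruned = graph.tocoo(copy=True)      # work on a copy, never on graph_
    pruned.sum_duplicates()
    cutoff = pruned.data.max() / float(n_epochs)
    pruned.data[pruned.data < cutoff] = 0.0   # threshold low weights to 0
    pruned.eliminate_zeros()                  # drop the zeroed entries
    return pruned


# Toy 3-vertex graph (made-up weights).
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 2, 2, 0])
vals = np.array([1.0, 0.001, 0.5, 0.0005])
graph_ = sp.coo_matrix((vals, (rows, cols)), shape=(3, 3))

pruned = prune_copy(graph_, n_epochs=200)
# Initialization (spectral, random, ...) would run here, on `pruned`.
```

In this sketch `graph_` keeps all four original weights, while `pruned` holds only the two entries at or above the cutoff.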

In contrast, our implementation:

  • Runs initialization (spectral, random, ...)
  • Thresholds values in the input graph, setting low values to 0. This happens in place, mutating the input graph. (link)
  • Allocates a new COO and copies the non-zero elements of the above graph to it (link)
  • Then stores the input (thresholded, but with zero elements not removed) graph as graph_ (link).
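As a rough analogy only, the ordering above can be mimicked in SciPy like this. The real implementation is not Python; the helper name `threshold_in_place_then_compact`, the toy graph, and the assumption that the cutoff is the largest weight divided by `n_epochs` are all illustrative.

```python
import numpy as np
import scipy.sparse as sp


def threshold_in_place_then_compact(graph, n_epochs):
    """Threshold `graph` in place, then return a compacted COO copy.

    Hypothetical helper: the caller's `graph` is mutated and keeps its
    zeroed entries as explicitly stored zeros.
    """
    cutoff = graph.data.max() / float(n_epochs)   # assumed cutoff rule
    graph.data[graph.data < cutoff] = 0.0         # mutates the input graph
    keep = graph.data != 0.0
    # New COO containing only the non-zero entries of the mutated graph.
    return sp.coo_matrix(
        (graph.data[keep], (graph.row[keep], graph.col[keep])),
        shape=graph.shape,
    )


# Toy 3-vertex graph (made-up weights).
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 2, 2, 0])
vals = np.array([1.0, 0.001, 0.5, 0.0005])
stored = sp.coo_matrix((vals.copy(), (rows, cols)), shape=(3, 3))

# Initialization would already have run on the unthresholded graph here.
compact = threshold_in_place_then_compact(stored, n_epochs=200)
```

After the call, `stored` (the analogue of graph_) still has four stored entries, but two of them are now explicit zeros.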

This results in two functional differences:

  • In our implementation, init is run before thresholding. In umap-learn, init is run after thresholding, so the two initializations see different graphs. This may be partially responsible for the differences we see in our embeddings vs those generated by umap-learn.
  • In our implementation, graph_ has had thresholding applied to it (but zero values not removed). In umap-learn, graph_ is the raw graph with no thresholding applied.
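One way to tell the two kinds of graph_ apart is to count explicitly stored zeros and weights below the cutoff. The sketch below assumes the graph is available as a SciPy sparse matrix; `graph_summary` is a hypothetical helper, and the toy graphs and `n_epochs=200` are made up for the example.

```python
import numpy as np
import scipy.sparse as sp


def graph_summary(graph, n_epochs):
    """Summarize stored entries of a sparse graph (hypothetical helper)."""
    data = graph.tocoo().data
    nonzero = data[data != 0.0]
    cutoff = data.max() / float(n_epochs)   # assumed cutoff rule
    return {
        "stored": int(data.size),
        "explicit_zeros": int(data.size - nonzero.size),
        "below_cutoff": int(np.count_nonzero(nonzero < cutoff)),
    }


rows = np.array([0, 0, 1, 2])
cols = np.array([1, 2, 2, 0])

# Raw graph: nothing thresholded, small weights still present.
raw = sp.coo_matrix(
    (np.array([1.0, 0.001, 0.5, 0.0005]), (rows, cols)), shape=(3, 3)
)
# Thresholded in place: small weights replaced by stored zeros.
thresholded = sp.coo_matrix(
    (np.array([1.0, 0.0, 0.5, 0.0]), (rows, cols)), shape=(3, 3)
)

raw_summary = graph_summary(raw, n_epochs=200)
thresholded_summary = graph_summary(thresholded, n_epochs=200)
```

A raw graph shows weights below the cutoff and no explicit zeros; a graph thresholded in place shows the opposite pattern.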
