Description
While bug hunting through our UMAP implementation, I found two notable (and related) differences between our logic and that in `umap-learn`. Raising them here.
When generating the embeddings, `umap-learn` does the following:
- Stores the unmodified fuzzy simplicial set (fss) graph as a `graph_` attribute
- Makes a copy of the input graph (link)
- Thresholds values in the copy of the graph, setting low values to 0 (link)
- Drops zero values in the copy of the graph (link)
- Runs initialization (spectral, random, ...)
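For concreteness, that order of operations can be sketched with scipy sparse matrices. This is a hedged illustration, not `umap-learn`'s actual internals: the graph values and the cutoff rule are made up, and the variable names are mine.

```python
import numpy as np
from scipy.sparse import coo_matrix

# A small stand-in for the fuzzy simplicial set graph (values illustrative).
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 2, 2, 0])
vals = np.array([0.9, 0.05, 0.5, 0.01])
graph = coo_matrix((vals, (rows, cols)), shape=(3, 3)).tocsr()

# 1. Store the unmodified graph as the `graph_` attribute.
graph_ = graph.copy()

# 2. Make a working copy of the input graph.
work = graph.copy()

# 3. Threshold the copy: set low values to 0 (cutoff rule is illustrative).
cutoff = work.max() / 10.0
work.data[work.data < cutoff] = 0.0

# 4. Drop the explicit zeros from the copy's sparse structure.
work.eliminate_zeros()

# 5. Initialization (spectral, random, ...) then runs on `work`,
#    while `graph_` still holds all four original entries.
```

Note that after `eliminate_zeros()` the working copy has fewer stored entries than `graph_`, which is untouched.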
In contrast, our implementation:
- Runs initialization (spectral, random, ...)
- Thresholds values in the input graph, setting low values to 0. This happens in place, mutating the input graph. (link)
- Allocates a new COO and copies the non-zero elements of the above graph into it (link)
- Then stores the input graph (thresholded, but with zero elements not removed) as `graph_` (link).
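By contrast, our sequence looks roughly like the sketch below. Again, values, cutoff rule, and names are illustrative, not the actual code:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Same stand-in input graph, as a COO (values illustrative).
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 2, 2, 0])
vals = np.array([0.9, 0.05, 0.5, 0.01])
graph = coo_matrix((vals, (rows, cols)), shape=(3, 3))

# 1. Initialization (spectral, random, ...) runs first, on the raw graph.

# 2. Threshold in place, mutating the input graph's data array
#    (cutoff rule is illustrative).
cutoff = graph.data.max() / 10.0
graph.data[graph.data < cutoff] = 0.0

# 3. Allocate a new COO holding only the non-zero elements.
mask = graph.data != 0.0
compact = coo_matrix(
    (graph.data[mask], (graph.row[mask], graph.col[mask])),
    shape=graph.shape,
)

# 4. The mutated input becomes `graph_`: thresholded, but with the
#    explicit zero entries still stored in its data array.
graph_ = graph
```

The key point is that `graph_` here keeps its explicit zeros (a COO's `nnz` counts stored values, including explicit zeros), whereas only the freshly allocated `compact` has them removed.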
This results in two functional differences:
- In our implementation, init is run before thresholding. In `umap-learn`, init is run after thresholding. This may be partially responsible for the differences we see in our embeddings vs. those generated by `umap-learn`.
- In our implementation, `graph_` has had thresholding applied to it (but zero values not removed). In `umap-learn`, `graph_` is the raw graph with no thresholding applied.