Fix UMAP graph thresholding #6595
base: branch-25.06
Conversation
Thank you for this PR, Victor! I have a question before I move on with further reviews. :)
cpp/src/umap/runner.cuh (Outdated)

```cpp
value_t threshold = get_threshold(handle, fss_graph, inputs.n, params->n_epochs);
perform_thresholding(handle, fss_graph, threshold);

raft::sparse::op::coo_remove_zeros<value_t>(&fss_graph, graph, stream);
}
```
`graph` is the final output exposed at the Python level. It looks like umap-learn runs the fuzzy simplicial set operation (link, corresponding to our `FuzzySimplSetImpl::symmetrize`), then eliminates zeros, and that becomes the final output.

I think if we do `perform_thresholding` here, we end up storing the graph after thresholding as the final graph, which seems different from what umap-learn has?

I understand that we need to do the thresholding before the embedding init, but I think by doing this we end up with a different output graph (which should correspond to umap-learn's `self.graph_`). Please correct me if I'm wrong!
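To make the comparison concrete, here is a minimal sketch of the umap-learn order of operations being described: symmetrize the fuzzy simplicial set, eliminate zeros, and store that result as the final graph, with no thresholding involved. The names here are illustrative, not cuML's actual code.

```python
import numpy as np
import scipy.sparse as sp

def build_final_graph(knn_weights: sp.coo_matrix) -> sp.coo_matrix:
    # Symmetrize via the fuzzy set union: A + A^T - A o A^T (Hadamard product),
    # which is what FuzzySimplSetImpl::symmetrize corresponds to.
    a = knn_weights.tocsr()
    graph = a + a.T - a.multiply(a.T)
    # Eliminating zeros, not thresholding, is the last step before graph_ is stored.
    graph.eliminate_zeros()
    return graph.tocoo()

rows = np.array([0, 1, 2])
cols = np.array([1, 2, 0])
vals = np.array([0.9, 0.5, 0.0])  # one explicit zero to be eliminated
g = build_final_graph(sp.coo_matrix((vals, (rows, cols)), shape=(3, 3)))
```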
My thinking was that once the graph reaches the Python layer, we can no longer make assumptions about how the user intends to use it, so we should avoid modifying or potentially corrupting it. Since our goal is to trim the graph, that would require creating a copy, which could significantly increase VRAM usage. I acknowledge this approach doesn't exactly align with how umap-learn handles it, but I assumed that most users would primarily be interested in a fuzzy simplicial set graph that retains only the most important connections. I had even considered performing the trimming before the symmetrization step, since the Thrust operations there appear to cause a memory spike, if I've understood correctly.

However, I just realized that because the trimming threshold depends on the number of training epochs, trimming before storing the graph might work well with the UMAP estimator, but not as effectively with `fuzzy_simplicial_set`, which lacks access to the number of epochs. In that case, creating a copy might indeed be the only viable option, unless the graph is stored as a SciPy array on the Cython side?
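The epoch dependence mentioned above can be sketched as follows. This assumes umap-learn-style thresholding, where weights below `max(weight) / n_epochs` would receive zero epochs of optimization and can therefore be dropped before the embedding; the function name and the copy-then-trim arrangement are illustrative, not cuML's implementation.

```python
import numpy as np
import scipy.sparse as sp

def threshold_for_embedding(graph: sp.coo_matrix, n_epochs: int) -> sp.csr_matrix:
    # Work on a copy so the user-facing stored graph stays un-thresholded.
    trimmed = graph.tocsr(copy=True)
    # The cutoff depends on n_epochs, which fuzzy_simplicial_set does not know.
    cutoff = trimmed.data.max() / float(n_epochs)
    trimmed.data[trimmed.data < cutoff] = 0.0
    trimmed.eliminate_zeros()
    return trimmed

g = sp.coo_matrix((np.array([1.0, 0.004, 0.5]),
                   (np.array([0, 1, 2]), np.array([1, 2, 0]))), shape=(3, 3))
t = threshold_for_embedding(g, n_epochs=200)  # cutoff = 1.0 / 200 = 0.005
```

With 200 epochs the 0.004 entry falls below the cutoff and is dropped from the trimmed copy, while the original graph keeps all three entries.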
> In that case, creating a copy might indeed be the only viable option, unless the graph is stored as a SciPy array on the Cython side?
When Corey and I discussed this a month or two ago, IIRC we came to an agreement that the `graph_` attribute could (and probably should) be returned to the user on host. It's mostly there for introspection or for umap-learn compatibility. We do want to make sure the graph is being treated roughly the same as umap-learn treats it, so that zero-code-change methods referencing it (e.g. merging of estimators) work roughly the same. Since we never reference it ourselves, having it on host would be fine and would decrease the device memory pressure at this point.
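A minimal sketch of the host-side `graph_` idea: once the COO component arrays have been copied from device to host (the device-to-host copy step is assumed here, e.g. via a CuPy `.get()`), the graph can live as a SciPy matrix and the device buffers can be freed. The helper name is hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def graph_to_host(rows: np.ndarray, cols: np.ndarray,
                  vals: np.ndarray, n_samples: int) -> sp.csr_matrix:
    # rows/cols/vals are host numpy arrays after the device->host copy;
    # storing graph_ like this keeps it off the GPU entirely.
    return sp.coo_matrix((vals, (rows, cols)),
                         shape=(n_samples, n_samples)).tocsr()

host_graph = graph_to_host(np.array([0, 1]), np.array([1, 0]),
                           np.array([0.9, 0.9]), n_samples=2)
```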
Answers #6539

This PR:
- keeps the stored output graph un-thresholded, whether it is produced by the `fit` method or within the `fuzzy_simplicial_set` function.
- performs the thresholding inside the `simplicial_set_embedding` function, using a copied graph only when necessary to avoid unnecessary overhead.