Fix UMAP graph thresholding #6595
base: branch-25.06
Conversation
Thank you for this PR, Victor! I have a question before I move on with further reviews. :)
cpp/src/umap/runner.cuh (outdated)
```cpp
value_t threshold = get_threshold(handle, fss_graph, inputs.n, params->n_epochs);
perform_thresholding(handle, fss_graph, threshold);

raft::sparse::op::coo_remove_zeros<value_t>(&fss_graph, graph, stream);
}
```
`graph` is the final output exposed at the Python level. It looks like umap-learn runs the fuzzy simplicial set operation (corresponding to our `FuzzySimplSetImpl::symmetrize`), then eliminates zeros, and that becomes the final output.

I think if we do `perform_thresholding` here, we end up storing the graph after thresholding as the final graph, which seems different from what umap-learn has.

I understand that we need to do the thresholding before the embedding init, but I think by doing this we end up with a different output graph (which should correspond to umap-learn's `self.graph_`). Please correct me if I'm wrong!
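For concreteness, here is a minimal sketch in Python of the umap-learn order of operations described above, assuming the default `set_op_mix_ratio=1.0` (`finalize_fuzzy_graph` is a hypothetical name, not the actual API): symmetrize, eliminate zeros, and only then does the epoch-based trimming happen elsewhere.

```python
import scipy.sparse as sp

def finalize_fuzzy_graph(g: sp.csr_matrix) -> sp.csr_matrix:
    # Fuzzy set union of the directed membership strengths:
    # g + g.T - (g elementwise-times g.T), i.e. the symmetrize step.
    transpose = g.T.tocsr()
    prod = g.multiply(transpose)
    graph = (g + transpose - prod).tocsr()
    # umap-learn then drops explicit zeros; this matrix is what it
    # stores as `self.graph_`, before any epoch-based thresholding.
    graph.eliminate_zeros()
    return graph
```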
My thinking was that once the graph reaches the Python layer, we can no longer make assumptions about how the user intends to use it, so we should avoid modifying or potentially corrupting it. Since our goal is to trim the graph, that would require creating a copy, which could significantly increase VRAM usage. I acknowledge this approach doesn't exactly align with how umap-learn handles it, but I assumed that most users would primarily be interested in a fuzzy simplicial set graph that retains only the most important connections. I had also considered performing the trimming before the symmetrization step, since the thrust operations there appear to cause a memory spike, if I've understood correctly.

However, I just realized that because the trimming threshold depends on the number of training epochs, trimming before storing the graph might work well with the UMAP estimator, but not as effectively with `fuzzy_simplicial_set`, which lacks access to the number of epochs. In that case, creating a copy might indeed be the only viable option, unless the graph is stored as a SciPy array on the Cython side?
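To make the epoch dependence concrete, this is roughly the cutoff rule umap-learn applies inside `simplicial_set_embedding` (a sketch only; `trim_for_epochs` is a hypothetical name, and the `copy()` is the extra VRAM cost mentioned above):

```python
import scipy.sparse as sp

def trim_for_epochs(graph: sp.csr_matrix, n_epochs: int) -> sp.csr_matrix:
    # Edges too weak to be sampled even once over n_epochs optimization
    # epochs are zeroed, so the cutoff is a function of n_epochs.
    trimmed = graph.copy()  # keep the user-facing graph untouched
    trimmed.data[trimmed.data < trimmed.data.max() / float(n_epochs)] = 0.0
    trimmed.eliminate_zeros()
    return trimmed
```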
> In that case, creating a copy might indeed be the only viable option, unless the graph is stored as a SciPy array on the Cython side?

When Corey and I discussed this a month or two ago, IIRC we came to an agreement that the `graph_` attribute could (and probably should) be returned to the user on host. It's mostly there for introspection or for umap-learn compatibility. We do want to make sure the graph is being treated roughly the same as umap-learn treats it, so that zero-code-change methods that reference it (e.g. merging of estimators) work roughly the same. Since we never reference it ourselves, having it on host would be fine and would decrease the device memory pressure at this point.
Forward-merge branch-25.06 into branch-25.08
Finishes up the shellcheck work tracked in rapidsai/build-planning#135 Authors: - Gil Forsyth (https://github.com/gforsyth) Approvers: - Simon Adorf (https://github.com/csadorf) - Divye Gala (https://github.com/divyegala) - James Lamb (https://github.com/jameslamb) URL: rapidsai#6901
…or (rapidsai#6902) Moving the import to the local context was causing the following issue:
```python
(rapids) coder ➜ ~/cuml/cuml-accel-bench $ bash bench.sh
Traceback (most recent call last):
  File "/home/coder/cuml/python/cuml/cuml/benchmark/run_benchmarks.py", line 262, in <module>
    algo = algorithms.algorithm_by_name(name)
  File "/home/coder/cuml/python/cuml/cuml/benchmark/algorithms.py", line 837, in algorithm_by_name
    algos = all_algorithms()
  File "/home/coder/cuml/python/cuml/cuml/benchmark/algorithms.py", line 227, in all_algorithms
    cuml.cluster.KMeans,
    ^^^^
UnboundLocalError: cannot access local variable 'cuml' where it is not associated with a value
```
where bench.sh was just a call to use the benchmark code:
```bash
export NROWS=1000000
python /opt/conda/lib/python3.12/site-packages/cuml/benchmark/run_benchmarks.py LinearRegression ElasticNet Lasso --num-rows $NROWS --dataset regression --input-type numpy --csv LinearRegressionCPU$NROWS.csv
```
This PR fixes that by using importlib. Authors: - Dante Gama Dessavre (https://github.com/dantegd) Approvers: - Simon Adorf (https://github.com/csadorf) URL: rapidsai#6902
) Mark test_warm_start_oob[RandomForestClassifier] as flaky. Authors: - Simon Adorf (https://github.com/csadorf) Approvers: - Jim Crist-Harif (https://github.com/jcrist) URL: rapidsai#6906
Closes rapidsai#6652
* Define our own `DistanceType` and other helper classes and enums. Define `to_cuvs()` method for easy casting.
* Use pimpl pattern to hide `cuvs::neighbors::ivf_flat::index` and `cuvs::neighbors::ivf_pq::index` members. These members are implementation details and are not part of the public API of approximate KNN.

With this PR, none of the `libcuml` headers will include cuVS headers. Authors: - Philip Hyunsu Cho (https://github.com/hcho3) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Divye Gala (https://github.com/divyegala) URL: rapidsai#6883
This is pretty much a full rewrite of the python side of our `HDBSCAN` wrapper. There were a few goals here:
- Fix rapidsai#6541
- Significantly simplify and reduce the state stored on an instance of `HDBSCAN` to ease rapidsai#6711
- Isolate the different code paths for initializing state (e.g. from fit, from sklearn, unpickling, ...) to better ensure we always have a valid `HDBSCAN` instance (and that resources are being properly tracked and freed)
- Remove several cases of poor cython practice (e.g. allocating something via `new` then passing it around via a python `int` before later freeing)

The changes I've made here accomplish all these goals. There are still further simplifications that could be made, but those will require some API changes to the C++ layer. For now I've punted on those; once this is in I'll open some follow-up issues to discuss further improvements. Note that for the `cuml.accel` layer we're not in a fully correct state _yet_. We're better off than we were before (and all tests continue to pass), but for `sklearn <> cuml` interop to be fully correct we'll want to finish up rapidsai#6711. I've left that work out of this PR to minimize the already large diff. Fixes rapidsai#6541. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Divye Gala (https://github.com/divyegala) URL: rapidsai#6913
Closes rapidsai#6879 Authors: - Philip Hyunsu Cho (https://github.com/hcho3) - Simon Adorf (https://github.com/csadorf) Approvers: - Divye Gala (https://github.com/divyegala) URL: rapidsai#6899
This:
- Moves `KernelRidge` to subclass from `InteropMixin` and `Base` instead of `UniversalBase`
- Improves the `as_sklearn`/`from_sklearn` methods for `KernelRidge` and adds an associated test
- Adds a proxy for `KernelRidge` in `cuml.accel`
- Adds docs on `KernelRidge` to the `cuml.accel` docs

Fixes rapidsai#6787 Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Tim Head (https://github.com/betatim) URL: rapidsai#6917
This ports our `HDBSCAN` to be based on `InteropMixin` instead of `UniversalBase`. It adds a test for sklearn interop to ensure we're converting correctly. It also moves the `cuml.accel` wrapper to be based on `ProxyBase` instead of `ProxyMixin`. Since this was the last model using the old `ProxyMixin` base class, we also delete `ProxyMixin` and the respective tests. Fixes rapidsai#6711. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Tim Head (https://github.com/betatim) URL: rapidsai#6916
This removes `UniversalBase` and all remaining infrastructure surrounding it. This is possible now that we've fully transitioned to using `InteropMixin` for managing sklearn <> cuml conversion. This is a continuation of the work ripping out dead code leftover from `cuml-cpu`. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Divye Gala (https://github.com/divyegala) URL: rapidsai#6919
I stumbled upon this while refactoring some other code. There were several `.pyx` files that had no reason to be in cython, so I've moved them to be `.py` files instead. Pure python modules don't require any compilation, are easier to debug, and can rely on the copious standard tooling for formatting/linting/editor help. For reviewers: you might notice the diff isn't just a straight file rename. This is because `black` only runs on `.py` files, hence the better formatted code. There are no code changes (beyond reformatting and file name changes) in this PR. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Divye Gala (https://github.com/divyegala) URL: rapidsai#6920
~For unknown reasons our pytest config isn't being uniformly applied in the dask tests module. This is leading to warnings that should be filtered not being ignored and causing errors during test runs.~

~One such warning was recently introduced with the release of scipy 1.16. This deprecates a few parameters in a `scipy.optimize` routine that are used by sklearn's `LogisticRegression`. Due to how the deprecation warning is raised this warning shows up under sklearn's module instead of scipy (the actual source).~

~Barring figuring out why our config isn't being uniformly applied, I've opted to explicitly ignore these warnings in the few places they pop up for now.~

*(edited - I have a better understanding of the issues now, and the PR has expanded a bit in scope)*

This makes 4 fixes to CI:
- Explicitly ignores a `DeprecationWarning` raised by scipy 1.16 within sklearn. Due to how the `rapids-dask-dependency` patches modules when imported, this causes the warning to be raised from an incorrect module, making our existing filter not effective. This seems like an upstream bug in the patching mechanism, but I've been unable to reproduce the issue locally so far.
- Skips tests requiring `statsmodels` if it fails to import. The recent scipy 1.16 causes parts of statsmodels to fail to import. This will require an upstream fix to fully work, but using `importorskip` to skip if statsmodels fails to import (or isn't installed) seems like a sane way to handle this.
- Removes two calls to `__align__` that were causing compilation errors on CUDA 12.9.1 on conda. From internal discussions it sounds like these alignment declarations weren't necessary, and removing them gets the build to pass.
- Pins `numpydoc<1.9` to allow the sklearn upstream tests to pass. The new release of `numpydoc` broke the released test suite for sklearn.

Fixes rapidsai#6927. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Simon Adorf (https://github.com/csadorf) - Divye Gala (https://github.com/divyegala) - Gil Forsyth (https://github.com/gforsyth) URL: rapidsai#6928
The test assumed a parameter would always be a cupy array, when that isn't true. Closes rapidsai#6932 Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Simon Adorf (https://github.com/csadorf) - Gil Forsyth (https://github.com/gforsyth) URL: rapidsai#6931
This PR adds an auto-formatter for xfail lists to ensure consistent formatting and sorting. It includes a pre-commit hook to automatically format xfail lists on commit. Closes rapidsai#6893 Authors: - Simon Adorf (https://github.com/csadorf) Approvers: - James Lamb (https://github.com/jameslamb) - Jim Crist-Harif (https://github.com/jcrist) URL: rapidsai#6892
rapidsai#6926) We'd already deprecated `penalty='none'` in favor of `penalty=None` for other estimators, just not these. Switching to `penalty=None` everywhere is more uniform and also improves compatibility with sklearn. Also:
- Un-xfails some MBSGDClassifier tests that don't need to be xfailed (and haven't for a while)
- Fixes the LogisticRegression docstring on `penalty` to match the valid options

Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Tim Head (https://github.com/betatim) - Simon Adorf (https://github.com/csadorf) URL: rapidsai#6926
In the old FIL, `load()` accepted the optional `threshold` parameter for binary classifiers. The new FIL moves the parameter to `predict()`. To avoid breaking existing user code, we should continue accepting `threshold` in `load()`, until 25.10. Authors: - Philip Hyunsu Cho (https://github.com/hcho3) Approvers: - Simon Adorf (https://github.com/csadorf) URL: rapidsai#6922
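A rough sketch of that kind of deprecation shim (all names here are hypothetical; this is not the actual FIL implementation):

```python
import warnings

class _Forest:
    """Hypothetical stand-in for a loaded FIL model."""
    default_threshold = 0.5

def load(path: str, threshold: float | None = None) -> _Forest:
    model = _Forest()  # a real loader would deserialize `path` here
    if threshold is not None:
        # Keep accepting the old load-time parameter, but warn and
        # stash it as the default for later predict() calls.
        warnings.warn(
            "`threshold` in load() is deprecated; pass it to predict() instead.",
            FutureWarning,
        )
        model.default_threshold = threshold
    return model
```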
When maintaining xfail lists for different test suites, it's easy for the lists to become out of sync with the actual tests. This can happen when tests are renamed, removed, or only exist for specific dependency versions. Currently, these mismatches go unnoticed, leading to stale xfail entries that might mask real issues. This PR adds a warning system that detects and fails on unmatched xfail entries, helping maintain the accuracy of our test configurations.
1. Added warning system for unmatched xfail tests:
   - New `UnmatchedXfailTests` warning class
   - Enhanced pytest collection to track and compare test IDs
   - Helps identify missing, renamed, or version-specific tests
2. Updated the scikit-learn test runner to fail on unmatched xfail tests by adding:
```bash
-W error::cuml.accel.pytest_plugin.UnmatchedXfailTests
```
Authors: - Simon Adorf (https://github.com/csadorf) Approvers: - Jim Crist-Harif (https://github.com/jcrist) URL: rapidsai#6907
…apidsai#6905) This partially reverts commit 9379c88, removing the xfail markers from tests in test_simpl_set.py that were related to Python 3.13.4 compatibility issues. Closes rapidsai#6867 Authors: - Simon Adorf (https://github.com/csadorf) Approvers: - Jim Crist-Harif (https://github.com/jcrist) URL: rapidsai#6905
This PR decreases the nightly CI check window from 14 days to 7 days. Authors: - Simon Adorf (https://github.com/csadorf) Approvers: - James Lamb (https://github.com/jameslamb) URL: rapidsai#6934
Found this while doing the recent `HDBSCAN` refactor. This parameter doesn't do anything and hasn't for a long time. It's undocumented and unused, we should just rip it out. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Simon Adorf (https://github.com/csadorf) URL: rapidsai#6936
Now that we have dropped support for CUDA 11 we no longer require the nvidia channel. With the changes in rapidsai/rapids-dask-dependency#85, RAPIDS now only uses released versions of dask, so we no longer need the dask channel either. Contributes to rapidsai/build-planning#184 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - James Lamb (https://github.com/jameslamb) URL: rapidsai#6935
Closes rapidsai#4246 Authors: - Philip Hyunsu Cho (https://github.com/hcho3) - Simon Adorf (https://github.com/csadorf) Approvers: - Divye Gala (https://github.com/divyegala) - Simon Adorf (https://github.com/csadorf) URL: rapidsai#6924
Fixes rapidsai#6909 This exposes `n_iter_` on `LogisticRegression` by taking the value from the solver. To be compatible with scikit-learn's `LogisticRegression` the attribute should be an array. This makes it all a bit more complex. Authors: - Tim Head (https://github.com/betatim) Approvers: - Jim Crist-Harif (https://github.com/jcrist) URL: rapidsai#6911
…pidsai#6937) `Base.__getattr__` existed to forward any missing attributes on a model to the underlying `solver_model` attribute (if any). Beyond being implicit, this led to nicer user error messages raised in properties on subclasses being swallowed up as a generic `AttributeError`. This PR removes the magic `__getattr__` in favor of explicitly surfacing attributes in the parent model. The only place this was needed was the `MBSGD*` estimators. The `__getstate__`/`__setstate__` implementations on `Base` were only needed due to the `__getattr__` implementation, since without them `model.__getstate__` would return `model.solver_model.__getstate__`. Removing these as well in favor of relying on python's standard behavior (which matched the previous implementations). Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Simon Adorf (https://github.com/csadorf) URL: rapidsai#6937
…apidsai#6943) This finishes up a TODO in the codebase to remove these from the `cuml.cluster` namespace. We just did a similar deprecation removing the `cuml.cluster.hdbscan.prediction` namespace. The canonical import path for all of these is now `cuml.cluster.hdbscan`. Also fixes a typo (and adds a test) in the recent deprecation of the `cuml.cluster.hdbscan.prediction` namespace. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Divye Gala (https://github.com/divyegala) URL: rapidsai#6943
Answers #6539
The UMAP implementation and the RAFT utility functions it uses would have to transition to the new COO API before we can merge this PR.
This PR: