Skip to content

Conversation

betatim
Copy link
Contributor

@betatim betatim commented Aug 29, 2025

What does this PR do?

This picks up #2347, which modifies _reduce_dimensionality to use fit_transform when possible. This makes it possible to use nearest neighbor descent with cuML UMAP.

Fixes #2335

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
  • Did you write any new necessary tests?

jemather and others added 10 commits April 28, 2025 16:22
Modifying _reduce_dimensionality to use fit_transform, per MaartenGr#2335
trying to diagnose what the other potential error is besides a TypeError that needs to be caught
trying another condition
ensuring fit_transform is available
Remove testing for whether topic_representations_ has been modified - not clear why this is needed here, and creates an unhandled case when using partial_fit
fixing lint whitespace issue
Addressing comments raised by @MaartenGr
Copy link
Owner

@MaartenGr MaartenGr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for picking this up and my apologies for the delay! I added some feedback on a portion that wasn't yet resolved in the previous PR (see here)

Furthermore, the error you get in the tests is likely to be a result of

def fit(self, X: np.ndarray = None):
not having the y parameter. It's alright to add that there since it's only for API compatibility.

Comment on lines 3783 to 3795
if partial_fit:
if hasattr(self.umap_model, "partial_fit"):
self.umap_model = self.umap_model.partial_fit(embeddings)
umap_embeddings = self.umap_model.transform(embeddings)
elif self.topic_representations_ is None:
self.umap_model.fit(embeddings)
umap_embeddings = self.umap_model.transform(embeddings)
else:
if hasattr(self.umap_model, "fit_transform"):
umap_embeddings = self.umap_model.fit_transform(embeddings)
else:
self.umap_model.fit(embeddings)
umap_embeddings = self.umap_model.transform(embeddings)
Copy link
Owner

@MaartenGr MaartenGr Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code was this:

        # Partial fit
        if partial_fit:
            if hasattr(self.umap_model, "partial_fit"):
                self.umap_model = self.umap_model.partial_fit(embeddings)
            elif self.topic_representations_ is None:
                self.umap_model.fit(embeddings)

which means that if the latter two if and elif statements are not satisfied, then it will only run self.umap_model.transform(embeddings).

This means that in your code, the else statement that you added should only run umap_embeddings = self.umap_model.transform(embeddings).

In other words:

        # Partial fit
        if partial_fit:
            if hasattr(self.umap_model, "partial_fit"):
                self.umap_model = self.umap_model.partial_fit(embeddings)
            elif self.topic_representations_ is None:
                if hasattr(self.umap_model, "fit_transform"):
                    umap_embeddings = self.umap_model.fit_transform(embeddings)
                else:
                    self.umap_model.fit(embeddings)
                    umap_embeddings = self.umap_model.transform(embeddings)
            else:
                umap_embeddings = self.umap_model.transform(embeddings)

This retains the original behavior, namely that:

  • If the model has the partial_fit function, use that
  • If it does not and has no representations, fit a model (using your suggested fit_transform for speedup if it exists)
  • If it has no partial_fit and already has topic representations, then UMAP was already fit once, and we should only run transform

@betatim
Copy link
Contributor Author

betatim commented Sep 18, 2025

Thanks for the code suggestion, I applied it. Also added y as argument.

This time it was my turn to be slow to reply :D

@MaartenGr
Copy link
Owner

@betatim Thanks for the help on this! I'll do something about those duplicate umap_embeddings = topic_model.transform(embeddings) in a different PR but for now this can be merged.

@MaartenGr MaartenGr merged commit 1b9c79c into MaartenGr:master Sep 25, 2025
6 checks passed
@betatim betatim deleted the reduce-dim-fit-transform branch September 25, 2025 13:45
@betatim
Copy link
Contributor Author

betatim commented Sep 25, 2025

Thanks a lot for the help and patience!

I agree the duplicates are not nice, but I don't have a great idea right now. Feels like it needs abstracting away somehow :-/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adding support for cuML's nn-descent with UMAP

3 participants