Skip to content

Conversation

jemather
Copy link

@jemather jemather commented Apr 28, 2025

Modifying _reduce_dimensionality to use fit_transform in single step, supporting the use of nearest neighbor descent when using cuML UMAP.

Fixes #2335

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
  • Adding support for cuML's nn-descent with UMAP
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?

Modifying _reduce_dimensionality to use fit_transform, per MaartenGr#2335
@jemather jemather changed the title Update _bertopic.py Modify reduce_dimensionality to use fit_transform Apr 28, 2025
@MaartenGr
Copy link
Owner

Thanks for the PR! I'm seeing some errors popping up here and there. Could you take a look?

@betatim
Copy link
Contributor

betatim commented May 13, 2025

@jemather do you want to look at the failures or would it be Ok for me to "take over" this PR to address the test failures?

@jemather
Copy link
Author

@betatim Sorry, started working on the changes and did not commit them. Let me see if I can get this done here in the next hour or so and if not happy to hand it over. Thank you for the nudge!

jemather added 6 commits May 13, 2025 12:03
trying to diagnose what the other potential error is besides a TypeError that needs to be caught
trying another condition
ensuring fit_transform is available
Remove testing for whether topic_representations_ has been modified - not clear why this is needed here, and creates an unhandled case when using partial_fit
fixing lint whitespace issue
@jemather
Copy link
Author

@betatim took me a few tries, but think it should be set now.

@betatim
Copy link
Contributor

betatim commented May 14, 2025

Cool beans! Happy that all it took was a nudge :D :D

@betatim
Copy link
Contributor

betatim commented May 15, 2025

I don't have direct experience with BERTopic, coming to this as someone who works on cuml. Hence motivated to help move this PR along. @MaartenGr the diff looks reasonable, but I don't know much about the code base. Is there something I can help with regarding this PR or does it "just" need time from one of the maintainers to take a look at it and think about (I say "just" because maintainer time and brain power is one of the rarest resources on planet earth ;) )

Copy link
Owner

@MaartenGr MaartenGr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. I left a couple of small comments. Can you take a look?

Comment on lines 3786 to 3792
elif self.topic_representations_ is None:
self.umap_model.fit(embeddings)
umap_embeddings = self.umap_model.transform(embeddings)
else:
if hasattr(self.umap_model, "fit_transform"):
umap_embeddings = self.umap_model.fit_transform(embeddings, y=y)
else:
self.umap_model.fit(embeddings)
umap_embeddings = self.umap_model.transform(embeddings)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the elif statement should remain here instead of using the else statement. It has to do with some specifics regarding partial_fit not needing to be fitted again.

Copy link
Author

@jemather jemather May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be fixed in the latest commit. I ran into an issue with this initially because I believe a condition exists within the test suite where partial_fit was True, the umap_model did not have a partial_fit method, and there were apparently topic representations, because umap_embeddings would not be created and switching to an else there fixed this.

I was wondering if it might be better to throw an exception or otherwise give a warning if a partial_fit is requested and the umap_model doesn't support it. Right now that is being obscured, and I've retained that behavior in my commit as I don't really know how it affects things outside of here.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was is that the else statement should only transform the data, not fit again. Especially with a partial_fit function, there might be instances where we do not want to retrain the model but instead fit it once (especially if it is a UMAP model).

Copy link
Author

@jemather jemather Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm a bit confused as to what the desired behavior is. I was trying to emulate that of the original code, which to my reading runs partial_fit() if partial_fit == True AND the model has a partial_fit method, otherwise it runs fit. To my reading, there are no conditions under which either fit or partial_fit won't be run when _reduce_dimensionality() is called.

        # Partial fit
        if partial_fit:
            if hasattr(self.umap_model, "partial_fit"):
                self.umap_model = self.umap_model.partial_fit(embeddings)
            elif self.topic_representations_ is None:
                self.umap_model.fit(embeddings)


        # Regular fit
        else:
            try:
                # cuml umap needs y to be an numpy array
                y = np.array(y) if y is not None else None
                self.umap_model.fit(embeddings, y=y)
            except TypeError:
                self.umap_model.fit(embeddings)


        umap_embeddings = self.umap_model.transform(embeddings)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code was this:

        # Partial fit
        if partial_fit:
            if hasattr(self.umap_model, "partial_fit"):
                self.umap_model = self.umap_model.partial_fit(embeddings)
            elif self.topic_representations_ is None:
                self.umap_model.fit(embeddings)

which means that if the latter two if and elif statements are not satisfied, then it will only run self.umap_model.transform(embeddings).

This means that in your code, the else statement that you added should only run umap_embeddings = self.umap_model.transform(embeddings)

In other words:

        # Partial fit
        if partial_fit:
            if hasattr(self.umap_model, "partial_fit"):
                self.umap_model = self.umap_model.partial_fit(embeddings)
            elif self.topic_representations_ is None:
                if hasattr(self.umap_model, "fit_transform"):
                    umap_embeddings = self.umap_model.fit_transform(embeddings)
                else:
                    self.umap_model.fit(embeddings)
                    umap_embeddings = self.umap_model.transform(embeddings)
            else:
                umap_embeddings = self.umap_model.transform(embeddings)

This retains the original behavior, namely that:

  • If you the model has partial_fit, use that
  • If it does not and has no representations, fit a model (using your suggested fit_transform for speedup if it exists)
  • If it has no partial_fit and already has topic representations, then UMAP was already fit once and we should only run transform

if hasattr(self.umap_model, "fit_transform"):
umap_embeddings = self.umap_model.fit_transform(embeddings, y=y)
else:
self.umap_model.fit(embeddings)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing the y=y here in the .fit. I can see it's there in the .fit_transform though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw you moved the y=y to transform step but shouldn't it be used in the .fit instead?

Copy link
Author

@jemather jemather Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should. Sorry about that, I'll fix it.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your latest commit, you moved the y parameter to the wrong line. It shouldn't be used in .transform but .fit instead.

Addressing comments raised by @MaartenGr
@csadorf
Copy link

csadorf commented Aug 27, 2025

Looks like this might be ready to merge? Is there any way we could help with this PR?

@MaartenGr
Copy link
Owner

It's almost there. My comments that I made on June 6th (see above) weren't addressed just yet. If those are addressed and cover the mentioned use cases, then I can go ahead and merge.

@betatim
Copy link
Contributor

betatim commented Aug 29, 2025

@jemather I picked up this PR to implement the outstanding review comments in #2416. I hope this is fine with you. I kept all the original commits from this PR, so that you get credit for your work.

@MaartenGr
Copy link
Owner

Closing this as it was picked up and merged in #2416

@MaartenGr MaartenGr closed this Sep 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adding support for cuML's nn-descent with UMAP

4 participants