-
Notifications
You must be signed in to change notification settings - Fork 859
Modify reduce_dimensionality to use fit_transform #2347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Modifying _reduce_dimensionality to use fit_transform, per MaartenGr#2335
Thanks for the PR! I'm seeing some errors popping up here and there. Could you take a look? |
@jemather do you want to look at the failures or would it be Ok for me to "take over" this PR to address the test failures? |
@betatim Sorry, started working on the changes and did not commit them. Let me see if I can get this done here in the next hour or so and if not happy to hand it over. Thank you for the nudge! |
trying to diagnose what the other potential error is besides a TypeError that needs to be caught
trying another condition
ensuring fit_transform is available
Remove testing for whether topic_representations_ has been modified - not clear why this is needed here, and creates an unhandled case when using partial_fit
fixing lint whitespace issue
@betatim took me a few tries, but think it should be set now. |
Cool beans! Happy that all it took was a nudge :D :D |
I don't have direct experience with BERTopic, coming to this as someone who works on cuml. Hence motivated to help move this PR along. @MaartenGr the diff looks reasonable, but I don't know much about the code base. Is there something I can help with regarding this PR or does it "just" need time from one of the maintainers to take a look at it and think about (I say "just" because maintainer time and brain power is one of the rarest resources on planet earth ;) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes. I left a couple of small comments. Can you take a look?
elif self.topic_representations_ is None: | ||
self.umap_model.fit(embeddings) | ||
umap_embeddings = self.umap_model.transform(embeddings) | ||
else: | ||
if hasattr(self.umap_model, "fit_transform"): | ||
umap_embeddings = self.umap_model.fit_transform(embeddings, y=y) | ||
else: | ||
self.umap_model.fit(embeddings) | ||
umap_embeddings = self.umap_model.transform(embeddings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the elif
statement should remain here instead of using the else
statement. It has to do with some specifics regarding partial_fit
not needing to be fitted again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be fixed in the latest commit. I ran into an issue with this initially because I believe a condition exists within the test suite where partial_fit was True, the umap_model did not have a partial_fit method, and there were apparently topic representations, because umap_embeddings would not be created and switching to an else there fixed this.
I was wondering if it might be better to throw an exception or otherwise give a warning if a partial_fit is requested and the umap_model doesn't support it. Right now that is being obscured, and I've retained that behavior in my commit as I don't really know how it affects things outside of here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant was is that the else
statement should only transform the data, not fit again. Especially with a partial_fit
function, there might be instances where we do not want to retrain the model but instead fit it once (especially if it is a UMAP model).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm a bit confused as to what the desired behavior is. I was trying to emulate that of the original code, which to my reading runs partial_fit()
if partial_fit == True
AND the model has a partial_fit method, otherwise it runs fit
. To my reading, there are no conditions under which either fit
or partial_fit
won't be run when _reduce_dimensionality()
is called.
# Partial fit
if partial_fit:
if hasattr(self.umap_model, "partial_fit"):
self.umap_model = self.umap_model.partial_fit(embeddings)
elif self.topic_representations_ is None:
self.umap_model.fit(embeddings)
# Regular fit
else:
try:
# cuml umap needs y to be an numpy array
y = np.array(y) if y is not None else None
self.umap_model.fit(embeddings, y=y)
except TypeError:
self.umap_model.fit(embeddings)
umap_embeddings = self.umap_model.transform(embeddings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original code was this:
# Partial fit
if partial_fit:
if hasattr(self.umap_model, "partial_fit"):
self.umap_model = self.umap_model.partial_fit(embeddings)
elif self.topic_representations_ is None:
self.umap_model.fit(embeddings)
which means that if the latter two if
and elif
statements are not satisfied, then it will only run self.umap_model.transform(embeddings)
.
This means that in your code, the else
statement that you added should only run umap_embeddings = self.umap_model.transform(embeddings)
In other words:
# Partial fit
if partial_fit:
if hasattr(self.umap_model, "partial_fit"):
self.umap_model = self.umap_model.partial_fit(embeddings)
elif self.topic_representations_ is None:
if hasattr(self.umap_model, "fit_transform"):
umap_embeddings = self.umap_model.fit_transform(embeddings)
else:
self.umap_model.fit(embeddings)
umap_embeddings = self.umap_model.transform(embeddings)
else:
umap_embeddings = self.umap_model.transform(embeddings)
This retains the original behavior, namely that:
- If you the model has
partial_fit
, use that - If it does not and has no representations, fit a model (using your suggested
fit_transform
for speedup if it exists) - If it has no
partial_fit
and already has topic representations, then UMAP was already fit once and we should only runtransform
if hasattr(self.umap_model, "fit_transform"): | ||
umap_embeddings = self.umap_model.fit_transform(embeddings, y=y) | ||
else: | ||
self.umap_model.fit(embeddings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm missing the y=y
here in the .fit
. I can see it's there in the .fit_transform
though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw you moved the y=y
to transform
step but shouldn't it be used in the .fit
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it should. Sorry about that, I'll fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your latest commit, you moved the y
parameter to the wrong line. It shouldn't be used in .transform
but .fit
instead.
Addressing comments raised by @MaartenGr
Looks like this might be ready to merge? Is there any way we could help with this PR? |
It's almost there. My comments that I made on June 6th (see above) weren't addressed just yet. If those are addressed and cover the mentioned use cases, then I can go ahead and merge. |
Closing this as it was picked up and merged in #2416 |
Modifying _reduce_dimensionality to use fit_transform in single step, supporting the use of nearest neighbor descent when using cuML UMAP.
Fixes #2335
Before submitting