Modify _reduce_dimensionality to use fit_transform #2416
Modifying _reduce_dimensionality to use fit_transform, per MaartenGr#2335
trying to diagnose what the other potential error is besides a TypeError that needs to be caught
trying another condition
ensuring fit_transform is available
Remove testing for whether topic_representations_ has been modified - not clear why this is needed here, and creates an unhandled case when using partial_fit
fixing lint whitespace issue
Addressing comments raised by @MaartenGr
Thank you for picking this up and my apologies for the delay! I added some feedback on a portion that wasn't yet resolved in the previous PR (see here)
Furthermore, the error you get in the tests is likely a result of the following line, which has no y parameter:
BERTopic/bertopic/dimensionality/_base.py, line 22 in 144ab7b:
    def fit(self, X: np.ndarray = None):
It's alright to add y there since it's only for API compatibility.
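As a rough illustration of the y suggestion above, here is a minimal sketch of a no-op reducer whose fit accepts y purely for API compatibility. The class name PassthroughReducer and the sample data are hypothetical and are not taken from the BERTopic code base; only the idea of an ignored y argument comes from the comment.

```python
class PassthroughReducer:
    """Hypothetical no-op reducer, used only to illustrate the signature."""

    def fit(self, X=None, y=None):
        # y is accepted for API compatibility and deliberately ignored.
        return self

    def transform(self, X):
        # No reduction is performed; the input is returned unchanged.
        return X


embeddings = [[0.1, 0.2], [0.3, 0.4]]
labels = [0, 1]

# A caller that passes y as a keyword no longer raises TypeError.
reducer = PassthroughReducer().fit(embeddings, y=labels)
reduced = reducer.transform(embeddings)
```

With a fit(self, X) signature, the fit(embeddings, y=labels) call would raise a TypeError, which is the kind of failure the comment points at.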
    if partial_fit:
        if hasattr(self.umap_model, "partial_fit"):
            self.umap_model = self.umap_model.partial_fit(embeddings)
            umap_embeddings = self.umap_model.transform(embeddings)
        elif self.topic_representations_ is None:
            self.umap_model.fit(embeddings)
            umap_embeddings = self.umap_model.transform(embeddings)
        else:
            if hasattr(self.umap_model, "fit_transform"):
                umap_embeddings = self.umap_model.fit_transform(embeddings)
            else:
                self.umap_model.fit(embeddings)
                umap_embeddings = self.umap_model.transform(embeddings)
The original code was this:
    # Partial fit
    if partial_fit:
        if hasattr(self.umap_model, "partial_fit"):
            self.umap_model = self.umap_model.partial_fit(embeddings)
        elif self.topic_representations_ is None:
            self.umap_model.fit(embeddings)
which means that if neither the if nor the elif condition is satisfied, it only runs self.umap_model.transform(embeddings).
This means that in your code, the else statement that you added should only run umap_embeddings = self.umap_model.transform(embeddings).
In other words:
    # Partial fit
    if partial_fit:
        if hasattr(self.umap_model, "partial_fit"):
            self.umap_model = self.umap_model.partial_fit(embeddings)
        elif self.topic_representations_ is None:
            if hasattr(self.umap_model, "fit_transform"):
                umap_embeddings = self.umap_model.fit_transform(embeddings)
            else:
                self.umap_model.fit(embeddings)
                umap_embeddings = self.umap_model.transform(embeddings)
        else:
            umap_embeddings = self.umap_model.transform(embeddings)
This retains the original behavior, namely that:
- If the model has the partial_fit function, use that
- If it does not and has no representations, fit a model (using your suggested fit_transform for speedup if it exists)
- If it has no partial_fit and already has topic representations, then UMAP was already fit once, and we should only run transform
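The three cases above can be sketched as a small standalone function. This is only an illustration under stated assumptions: the reducer classes, the reduce_partial helper and its has_topic_representations flag are hypothetical stand-ins and not BERTopic's actual implementation. The transform call after partial_fit follows the diff hunk quoted earlier in this thread.

```python
class FitOnlyReducer:
    """Hypothetical reducer exposing only fit and transform."""

    def __init__(self):
        self.calls = []  # records which methods were invoked, in order

    def fit(self, X):
        self.calls.append("fit")
        return self

    def transform(self, X):
        self.calls.append("transform")
        return [row[:2] for row in X]  # toy "reduction": keep two columns


class FitTransformReducer(FitOnlyReducer):
    """Hypothetical reducer that also offers a one-step fit_transform."""

    def fit_transform(self, X):
        self.calls.append("fit_transform")
        return [row[:2] for row in X]


class OnlineReducer(FitOnlyReducer):
    """Hypothetical reducer that supports incremental fitting."""

    def partial_fit(self, X):
        self.calls.append("partial_fit")
        return self


def reduce_partial(model, embeddings, has_topic_representations):
    # Case 1: the model supports incremental fitting, so use it.
    if hasattr(model, "partial_fit"):
        model = model.partial_fit(embeddings)
        return model.transform(embeddings)
    # Case 2: nothing was fitted yet, so fit now, in one step when possible.
    if not has_topic_representations:
        if hasattr(model, "fit_transform"):
            return model.fit_transform(embeddings)
        model.fit(embeddings)
        return model.transform(embeddings)
    # Case 3: the model was fitted earlier, so only transform.
    return model.transform(embeddings)


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
first_batch = reduce_partial(FitTransformReducer(), X, has_topic_representations=False)
```

Checking with hasattr keeps the function usable with reducers that only implement fit and transform, while letting reducers with a fused fit_transform skip the separate transform pass.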
Co-authored-by: Maarten Grootendorst <[email protected]>
Thanks for the code suggestion, I applied it. Also added This time it was my turn to be slow to reply :D
@betatim Thanks for the help on this! I'll do something about those duplicate
Thanks a lot for the help and patience! I agree the duplicates are not nice, but I don't have a great idea right now. Feels like it needs abstracting away somehow :-/
What does this PR do?
This picks up #2347, which modifies _reduce_dimensionality to use fit_transform when possible. This makes it possible to use nearest neighbor descent with cuML UMAP.
Fixes #2335