Add delete topics #2322

Conversation
Thank you for this PR! I left a few questions to get a bit more insight into this PR. Other than that, it does seem like it would do the trick here for the most part. Also, I believe `.custom_labels_` needs to be updated if it exists, right?
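To make the `.custom_labels_` point concrete, here is a toy sketch (not the PR's actual code) of what remapping a custom-labels list on deletion could look like; the index convention (label for topic `t` at index `t + 1`, with the outlier topic -1 first) is an assumption for illustration:

```python
# Toy illustration only: remap a custom_labels_-style list after deleting topics.
# Assumption: labels are ordered [-1, 0, 1, ...], so topic t sits at index t + 1.
custom_labels = ["outliers", "networks", "speech", "users", "vision"]
deleted = {2}  # delete topic 2 ("users" in this toy list)

new_labels = [label for i, label in enumerate(custom_labels) if (i - 1) not in deleted]
print(new_labels)  # ['outliers', 'networks', 'speech', 'vision']
```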
@MaartenGr, I added an adjustment for `.custom_labels_` and also refactored based on your other comments. Could you please review again and let me know if more edits are needed? Thank you!
@MaartenGr, thanks for finding the mapping issues. I found out that was due to the difference between `final_mapping = self.topic_mapper_.get_mappings(original_topics=False)` and `final_mapping = self.topic_mapper_.get_mappings(original_topics=True)`. I also added a few more adjustments to make sure all attributes are updated. I tested with your example, and `topic_model.get_topic_info()` looks normal now after deletion. Could you please re-test and confirm?
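For readers following along, a toy illustration of the `original_topics` distinction mentioned above; the mapping history here is made up and is not the library's `TopicMapper` internals:

```python
# Hypothetical three-stage id history per topic: [original, intermediate, current].
rows = [
    [0, 0, 0],
    [1, 1, 1],
    [2, 1, 1],  # merged into topic 1 in the intermediate step
    [3, 2, 2],  # renumbered after the merge
]

# original_topics=True: map each *original* id to the current id.
mapping_from_original = {row[0]: row[-1] for row in rows}
# original_topics=False: map the most recent *previous* ids to the current ids.
mapping_from_previous = {row[-2]: row[-1] for row in rows}

print(mapping_from_original)  # {0: 0, 1: 1, 2: 1, 3: 2}
print(mapping_from_previous)  # {0: 0, 1: 1, 2: 2} -- a different table entirely
```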
Thank you for making the changes. I noticed something odd. When I run the following before deleting a topic: `topic_model.transform(["a document about networks deep neural network"])`, I get a different prediction than if I run it afterwards. Perhaps there is something going on with the embeddings?
@MaartenGr, I could not replicate the issue you mentioned. I ran your example, and the prediction was the same before and after deleting topic 3. Only after I deleted topic 0 ("0_networks_deep_neural_network") did it classify your example as -1. See the screenshots below:

- Before deleting [screenshot omitted]
- After deleting topic 3, "3_user_recommendation_items_users" [screenshot omitted]
- After deleting topic 0, "0_networks_deep_neural_network" [screenshot omitted]
Could it be that your package version is still the old one? Code I ran with your test case:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
abstracts = dataset["abstract"][:20_000]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
keybert_model = KeyBERTInspired()
representation_model = {
    "KeyBERT": keybert_model,  # collapsed in the original comment; minimally reconstructed
}
topic_model = BERTopic(
    embedding_model=embedding_model,  # collapsed in the original comment; minimally reconstructed
    representation_model=representation_model,
)

# loaded_embeddings: pre-computed document embeddings loaded earlier in the session
topics, probs = topic_model.fit_transform(abstracts[:20_000], loaded_embeddings[:20_000])
print(topic_model.transform(["a document about networks deep neural network"]))

topic_model.delete_topics([3])
print(topic_model.transform(["a document about networks deep neural network"]))

topic_model.delete_topics([0])
print(topic_model.transform(["a document about networks deep neural network"]))
```
Odd, I can't seem to reproduce that problem in another environment. I will check in the old environment to see if I did anything special there. Note that you actually showcase something similar: your scores before and after deleting a topic are different, which shouldn't be the case if the topic embeddings remain the same. You got 0.76048 before and 0.75848 after. This change does indicate something is happening there.

Another thing I tested was zero-shot topic modeling. When I run the following:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from hdbscan import HDBSCAN
from umap import UMAP
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

docs = load_dataset("CShorten/ML-ArXiv-Papers")["train"]["abstract"][:20_000]

# Pre-calculate embeddings
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Use sub-models
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

# Representation models
keybert_model = KeyBERTInspired()
mmr_model = MaximalMarginalRelevance(diversity=0.3)
representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
}

# BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    zeroshot_topic_list=["topic modeling", "large language models"],
    verbose=True,
).fit(docs, embeddings)
```

I get these results: [screenshot omitted] The zero-shot topic modeling doesn't order the counts as intended (it should be outliers -> zero-shot -> sorted topics; that last part isn't happening, which is a bug of zero-shot topic modeling, so we can safely ignore that). However, when I now delete a topic: [screenshot omitted] The order of the topics is changed but, more importantly, the name of the zero-shot topic is removed.
Hi @MaartenGr, thanks for the comments. A few findings from me:

1. Regarding "your scores before and after deleting a topic are different, which shouldn't be the case if the topic embeddings remain the same": a static probability score is actually not the expected behavior, given your explanations in this other thread: #2124. I have tested that, without running `delete_topics` at all, if you run the same `topic_model.transform(["a document about networks deep neural network"])` line multiple times after fitting the model, you will get different probability scores, and I've seen variations larger than 0.05.
2. There are some lines in the `add_mappings` method that cause the zero-shot topic to be missed. In my latest commit, I added some draft edits that fix the missing zero-shot topic error you pointed out, specifically for `delete_topics`, as well as a notebook showing the tests I ran: https://github.com/MaartenGr/BERTopic/blob/1bc6593ded1acb3fd6b046a08aedfcff6ca912f1/delete_topics_test_sc.ipynb.

I don't know the background behind the adjustment in `add_mappings` for zero-shot, and it is not related to my use cases. If possible, could you review whether the changes I suggested for `delete_topics` would be sufficient? Maybe you can address the issues with the zero-shot adjustments/`merge_topics` in a separate PR? Thank you, and I look forward to hearing your thoughts.
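A minimal sketch of the repeatability check described in point 1, assuming `topic_model` is the fitted model from the scripts above (the printed values will differ from run to run):

```python
# Repeat the same transform call on an unchanged, fitted model and compare the
# returned probabilities; per the discussion, they can drift by more than 0.05,
# so small score differences alone don't prove the topic embeddings changed.
doc = ["a document about networks deep neural network"]
for _ in range(5):
    topics, probs = topic_model.transform(doc)
    print(topics, probs)
```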
Apologies, I had initially thought…

Which lines specifically in this code block? Do you mean the…

Is that related to a different issue? We are not using…
I'm not sure if I understand the notebook and commit correctly. The fix that you mention is not in the latest commit; I'm only seeing print statements there. Where in the notebook can I find the fix?
That said, if the issue is only with zero-shot topic modeling, then it makes sense to me to raise a warning if somebody tries this functionality with a zero-shot model. Otherwise, this repo will certainly get a bunch of new issues from people trying to use it in this way. Remember that any given feature in BERTopic should also consider a wide range of use cases.
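A sketch of the kind of guard suggested here; `zeroshot_topic_list` is the real constructor parameter (used in the script above), while the placement inside a `delete_topics` method is hypothetical:

```python
import warnings

# Hypothetical guard at the top of delete_topics: warn when the model was fitted
# with zero-shot topics, since deletion interacts with the zero-shot bookkeeping
# discussed in this thread.
def delete_topics(self, topics_to_delete):
    if getattr(self, "zeroshot_topic_list", None) is not None:
        warnings.warn(
            "delete_topics with zero-shot topic modeling is not fully supported; "
            "zero-shot topic names may be affected.",
            UserWarning,
        )
    ...  # actual deletion logic
```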
@MaartenGr Thanks for the detailed reply. Let me point you to the lines of code, my fixes, and the notebook first. I think it will become clearer what I meant if you run through those (maybe with a debugger). The issue with zero-shot topics going missing is caused by lines 4713 to 4755 in the `_bertopic.py` file: BERTopic/bertopic/_bertopic.py, line 4713 at 4042e74
I updated how the delete method calls `add_mappings` to: `self.topic_mapper_.add_mappings(mapping, topic_model=deepcopy(self))`. I am now passing a deepcopy of the trained model to `self.topic_mapper_.add_mappings`, so the line `topic_model._topic_id_to_zeroshot_topic_idx = new_topic_id_to_zeroshot_topic_idx` does not modify the actual model (`self`) and hence avoids the issue it caused. This is the most "localized" way I can think of to resolve this only for `delete_topics` without changing `add_mappings` at all. (I can remove all those print statements once you confirm this is okay; they are just there so you can see what is causing the missing zero-shot topic issue.) The notebook basically just shows that if you install the version with my last commit, the zero-shot topics are still there after running delete topics; the only topic that was deleted was the one we specified. You are right about `merge_topics` being out of scope; I only mentioned it because the same `add_mappings` if-block causes a similar issue there, where zero-shot topics also go missing in the test example we are using. Hopefully that helps. Let me know if you need more details! Thank you!
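The effect of the `deepcopy` can be shown in isolation with a toy stand-in for the model; apart from the attribute name the thread mentions, the names here are illustrative, not BERTopic's internals:

```python
from copy import deepcopy

class ToyModel:
    def __init__(self):
        # Stand-in for the zero-shot bookkeeping discussed in this thread.
        self._topic_id_to_zeroshot_topic_idx = {0: 0, 1: 1}

def add_mappings(mapping, topic_model):
    # Inside the library, an assignment like this overwrites the model's
    # zero-shot mapping; when topic_model is a deepcopy, only the copy changes.
    topic_model._topic_id_to_zeroshot_topic_idx = {}

model = ToyModel()
add_mappings({}, topic_model=deepcopy(model))
print(model._topic_id_to_zeroshot_topic_idx)  # {0: 0, 1: 1} -- preserved
```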
Apologies for the delay. Lately, finding the time to work on this has become more difficult. That said, if it's just a deepcopy of itself, then it is likely related to some shallow reference (class attributes) that gets overwritten at some point. Just to be sure I understand correctly: with your suggested change, it should now work correctly for zero-shot, right?
@MaartenGr No worries. Thanks for getting back to me! Yes, my suggested change means that `delete_topics` will not remove zero-shot topics (unless they are specified to be removed), and we can see that based on your test case. Basically, using a deepcopy of the model when calling `add_mappings` keeps the steps in `delete_topics` from being affected by lines 4798 to 4840 of the main branch's `add_mappings` implementation (which I think is causing the issues): BERTopic/bertopic/_bertopic.py, line 4798 at 6faf0de
All my `delete_topics` method needs are the lines between 4791 and 4796 in the `add_mappings` method. Let me know if that is okay, and I will push a new commit to remove the print statements. Hopefully we can get it merged soon. Thank you!
@shuanglovesdata I'm going to keep saying this, but sorry for the delay. I'll do my best to find a bit more time to get this merged! In that case, yes, let's go ahead with this solution then!
Hi @MaartenGr, I just pushed a commit to remove all print/debug statements, so the branch is now clean (meaning it only contains the essential new logic that makes `delete_topics` work, along with additional unit tests). It should all work as we last discussed. Please merge if you are okay with the PR. Thank you.
@shuanglovesdata Thank you for your patience on this and your amazing work! It's a wonderful feature and I'm happy that it's finally merged 😄
@MaartenGr Thank you very much! I am really happy to contribute to this awesome package and add a feature that we definitely use in our workflow! Hope to connect with you through https://www.linkedin.com/in/shuanglovesdata/ and https://shuangbuildsai.medium.com/.
What does this PR do?
Fixes discussion #2172
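A minimal usage sketch of the feature this PR adds, based on the calls shown in the thread (`delete_topics`, `get_topic_info`); the dataset choice here is illustrative:

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic(verbose=True).fit(docs)
print(topic_model.get_topic_info().head())

# Delete topic 3; per this PR, topic ids are remapped and related attributes
# (topic labels, custom_labels_ if set, the topic mapper, ...) are updated.
topic_model.delete_topics([3])
print(topic_model.get_topic_info().head())
```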