
Conversation

shuanglovesdata
Contributor

What does this PR do?

  • Added a delete_topics method to move topics to a -1 outlier cluster.
  • Added test_delete.py to test the new method.

Fixes discussion #2172
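
A minimal usage sketch of the new method (illustrative only; docs stands in for your own documents and the topic ID is arbitrary):

from bertopic import BERTopic

topic_model = BERTopic(min_topic_size=50, verbose=True)
topics, probs = topic_model.fit_transform(docs)

# Move all documents of the listed topics into the -1 outlier cluster
topic_model.delete_topics([3])

# The deleted topic no longer shows up in the overview
topic_model.get_topic_info()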


Owner

@MaartenGr MaartenGr left a comment


Thank you for this PR! I left a few questions to get a bit more insight into it. Other than that, it does seem like it would do the trick here for the most part.

Also, I believe .custom_labels_ needs to be updated if it exists, right?
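
Roughly something like this is what I have in mind, as a sketch rather than the exact implementation (I'm assuming custom_labels_ stays a list aligned with the sorted topic IDs in topic_labels_, and topics_to_delete stands in for whatever the argument ends up being called):

# Sketch only: keep custom_labels_ aligned with the topics that remain
if self.custom_labels_ is not None:
    sorted_topics = sorted(self.topic_labels_.keys())
    label_per_topic = dict(zip(sorted_topics, self.custom_labels_))
    self.custom_labels_ = [
        label_per_topic[topic]
        for topic in sorted_topics
        if topic not in topics_to_delete
    ]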

@shuanglovesdata
Contributor Author

@MaartenGr, I added an adjustment for .custom_labels_ and also refactored based on your other comments. Could you please review again and let me know if more edits are needed? Thank you!

@MaartenGr
Owner

I tested it with the following and it doesn't seem to work for me.

Training

Data

from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on and corresponding titles
abstracts = dataset["abstract"]
titles = dataset["title"]

Embeddings

from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

Representation models

from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# KeyBERT
keybert_model = KeyBERTInspired()

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# All representation models
representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
}

Fit

from bertopic import BERTopic

topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  representation_model=representation_model,

  # Hyperparameters
  min_topic_size=50,
  verbose=True
)

# Train model
topics, probs = topic_model.fit_transform(abstracts[:20_000], embeddings[:20_000])

Issue

Inspecting the topics using topic_model.get_topic_info() then yields me:

[screenshot: topic_model.get_topic_info() output]

However, when I run:

topic_model.delete_topics([3])
topic_model.get_topic_info()

I get these topics:

[screenshot: topics after topic_model.delete_topics([3])]

Although the topic is removed, all other topics are mixed up. For instance, topic 0 was initially 0_networks_deep_neural_network and has now become 0_bayesian_structure_network_variables, which shouldn't be possible considering the number of documents hasn't changed.

Note how the representative documents also haven't changed. So it seems that something is mixed up with the labels, although I'm not sure why.

@shuanglovesdata
Contributor Author

@MaartenGr, thanks for finding the mapping issues. I found out that was due to the difference between final_mapping = self.topic_mapper_.get_mappings(original_topics=False) and final_mapping = self.topic_mapper_.get_mappings(original_topics=True). I also added a few more adjustments to make sure all attributes are updated. I tested with your example, and the topic_model.get_topic_info() output looks normal now after deletion. Could you please re-test and confirm?
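
To illustrate the difference with a toy example (plain Python, not the actual BERTopic internals; the dictionaries are made up):

# topics_ holds the original cluster ID per document
topics_ = [0, 1, 2, 3, 2, 0]

# mapping expressed relative to the original IDs (what original_topics=True is meant to give)
mapping_from_original = {0: 0, 1: 1, 2: 2, 3: -1}

# mapping expressed relative to an intermediate renumbering (original_topics=False)
mapping_from_previous = {0: 1, 1: 0, 2: 2, 3: -1}

# Applying the wrong mapping to the original assignments swaps which documents
# end up under which label, which is exactly the "mixed up" behavior you saw:
print([mapping_from_original[t] for t in topics_])  # [0, 1, 2, -1, 2, 0] -> correct
print([mapping_from_previous[t] for t in topics_])  # [1, 0, 2, -1, 2, 1] -> labels mixed up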

@MaartenGr
Owner

Thank you for making the changes. I noticed something odd. When I run the following before deleting a topic:

topic_model.transform(["a document about networks deep neural network"])

I would get a different prediction than if I ran it afterwards. Perhaps there is something going on with the embeddings?

@shuanglovesdata
Contributor Author

shuanglovesdata commented May 9, 2025

@MaartenGr, I could not replicate the issue you mentioned. I ran your example, and the prediction was the same before and after deleting topic 3. Only after deleting topic 0 ("0_networks_deep_neural_network") did it classify your example as -1. See the screenshots below.

Before deleting:

[screenshot: prediction before deleting]

After deleting topic 3 - "3_user_recommendation_items_users":

[screenshot: prediction after deleting topic 3]

After deleting topic 0 - "0_networks_deep_neural_network":

[screenshot: prediction after deleting topic 0]

@shuanglovesdata
Contributor Author

shuanglovesdata commented May 9, 2025

Could it be that your package version is still the old one?

Code I ran with your test case:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from bertopic import BERTopic

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

abstracts = dataset["abstract"][:20_000]
titles = dataset["title"][:20_000]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

keybert_model = KeyBERTInspired()
mmr_model = MaximalMarginalRelevance(diversity=0.3)

representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
}

topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=representation_model,
    min_topic_size=50,
    verbose=True,
)

topics, probs = topic_model.fit_transform(abstracts, embeddings)

print(topic_model.transform(["a document about networks deep neural network"]))

topic_model.delete_topics([3])
topic_model.get_topic_info()[:5]

print(topic_model.transform(["a document about networks deep neural network"]))

topic_model.delete_topics([0])
topic_model.get_topic_info()[:5]

print(topic_model.transform(["a document about networks deep neural network"]))

@MaartenGr
Owner

Odd, I can't seem to reproduce that problem in another environment. I will check in the old environment to see if I did anything special there. Note that you actually showcase something similar. Your scores before and after deleting a topic are different, which shouldn't be the case if the topic embeddings remain the same. You got 0.76048 before and 0.75848 after. This change does indicate something is happening there.

Another thing I tested was using zero-shot topic modeling. When I run the following:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from hdbscan import HDBSCAN
from umap import UMAP

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

docs = load_dataset("CShorten/ML-ArXiv-Papers")["train"]["abstract"][:20_000]

# Pre-calculate embeddings
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Use sub-models
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

# Representation models
keybert_model = KeyBERTInspired()
mmr_model = MaximalMarginalRelevance(diversity=0.3)
representation_model = {
  "KeyBERT": keybert_model,
  "MMR": mmr_model,
}

# BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    zeroshot_topic_list=["topic modeling", "large language models"],
    verbose=True,
).fit(docs, embeddings)

I get these results:

[screenshot: get_topic_info() output for the zero-shot model]

The zero-shot topic modeling doesn't order the counts as intended (it should be outliers -> zero-shot -> sorted topics). That last part isn't happening, which is a bug in zero-shot topic modeling, so we can safely ignore that.

However, when I now delete a topic, the name of the zero-shot topic is removed:

[screenshot: get_topic_info() output after deletion, zero-shot topic name missing]

The order of the topics is changed, but more importantly, the name of the zero-shot topic is removed.

@shuanglovesdata
Contributor Author

shuanglovesdata commented May 27, 2025

Hi @MaartenGr, thanks for the comments. A few findings from me:

1: Regarding "your scores before and after deleting a topic are different, which shouldn't be the case if the topic embeddings remain the same", a static probability score is actually not the expected behavior given your explanations in this other thread: #2124. I have tested that, without running delete_topics at all, if you run the same topic_model.transform(["a document about networks deep neural network"]) line multiple times after fitting the model, you will get different probability scores, and I've seen variations larger than 0.05.
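
Roughly what I ran to check this (same fitted model, no deletion in between):

# Same document, same fitted model, no delete_topics call in between
doc = ["a document about networks deep neural network"]
for _ in range(3):
    topics, probs = topic_model.transform(doc)
    print(topics, probs)  # the probability differs slightly from run to run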

2: There are some lines in the add_mappings method that cause the zero-shot topics to go missing:
https://github.com/MaartenGr/BERTopic/blob/4042e744a66c174da15979206376f377f906a08c/bertopic/_bertopic.py#L4713C1-L4755C93
I believe the same logic actually causes merge_topics to also have issues with missing zero-shot topics after a merge.

In my latest commit, I added some draft edits that fix the missing zero-shot topic error you pointed out, specifically for delete_topics, as well as a notebook showing the tests I ran: https://github.com/MaartenGr/BERTopic/blob/1bc6593ded1acb3fd6b046a08aedfcff6ca912f1/delete_topics_test_sc.ipynb.

I don't know the background behind the adjustment in add_mappings for zero-shot and it is not related to my use cases. If possible, could you review if the changes I suggested for delete_topics would be sufficient? Maybe you can work on addressing issues with the zero-shot adjustments/merge_topics in a separate PR?

Thank you and I look forward to hearing your thoughts.

@MaartenGr
Owner

1: Regarding "your scores before and after deleting a topic are different, which shouldn't be the case if the topic embeddings remain the same", a static probability score is actually not the expected behavior given your explanations in this other thread: #2124. I have tested that, without running delete_topics at all, if you run the same topic_model.transform(["a document about networks deep neural network"]) line multiple times after fitting the model, you will get different probability scores, and I've seen variations larger than 0.05.

Apologies, I had initially thought transform was done purely with embeddings and not the underlying cluster model, so this makes sense to me.

2: There are some lines in the add_mappings method that cause the zero-shot topics to go missing:
https://github.com/MaartenGr/BERTopic/blob/4042e744a66c174da15979206376f377f906a08c/bertopic/_bertopic.py#L4713C1-L4755C93

Which lines specifically in this code block? Do you mean the if-statement?

I believe the same logic actually causes merge_topics to also have issues with missing zero-shot topics after a merge.

Is that related to a different issue? We are not using merge_topics here, right? Either way, the way zero-shot topics have been modelled since a couple of updates ago isn't necessarily an improvement, I think. I might redo that at some point, but that is out of scope for this issue.

In my latest commit, I added some draft edits that fix the missing zero-shot topic error you pointed out, specifically for delete_topics, as well as a notebook showing the tests I ran: https://github.com/MaartenGr/BERTopic/blob/1bc6593ded1acb3fd6b046a08aedfcff6ca912f1/delete_topics_test_sc.ipynb.

I'm not sure if I understand the notebook and commit correctly. The fix that you mention is not in the latest commit; I'm only seeing print statements there. Where in the notebook can I find the fix?

I don't know the background behind the adjustment in add_mappings for zero-shot and it is not related to my use cases. If possible, could you review if the changes I suggested for delete_topics would be sufficient? Maybe you can work on addressing issues with the zero-shot adjustments/merge_topics in a separate PR?

add_mappings is very important for this functionality (or at least maintaining topic state) as it relates to how any potential subsequent operations are run. For instance, you can't further reduce the topics after delete_topics if the state of topic merging/deletion isn't properly maintained.

That said, if the issue is only with zero-shot topic modeling, then it makes sense to me to raise a warning if somebody tries this functionality with a zero-shot model. Otherwise, this repo will certainly get a bunch of new issues from people trying to use it in this way. Remember that any given feature in BERTopic should also consider a wide range of use cases.
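
For instance, a guard at the start of delete_topics along these lines would probably do (just a sketch; I'm assuming the zero-shot topic list passed to __init__ is available on the model as zeroshot_topic_list):

import warnings

# Sketch: warn when delete_topics is used on a zero-shot topic model
if getattr(self, "zeroshot_topic_list", None):
    warnings.warn(
        "delete_topics may not preserve zero-shot topic names; "
        "use with care on zero-shot topic models.",
        UserWarning,
    )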

@shuanglovesdata
Contributor Author

@MaartenGr Thanks for the detailed reply. Let me point you to the lines of code, my fixes and the notebook first. I think it will become clearer what I meant if you can run through those (maybe with a debugger).

The issue with zero-shot topics going missing is caused by lines 4713 to 4755 in the _bertopic.py file:

# When zero-shot topic(s) are present in the topics to merge,
Specifically, new_topic_id_to_zeroshot_topic_idx = {} and topic_model._topic_id_to_zeroshot_topic_idx = new_topic_id_to_zeroshot_topic_idx just remove the zero-shot topic mappings in the trained model, which is why the topic was gone. I don't have enough background on why this whole "if" block exists and whether removing/adjusting the zero-shot topics here is justified. The comment in the code ("# If any of the original topics are zero-shot, take the best fitting zero-shot label if the cosine similarity with the new topic exceeds the zero-shot threshold") doesn't really make sense to me.

My fix for delete_topics:
[screenshot: diff of the delete_topics fix]

In the delete method, I updated how I call add_mappings to: self.topic_mapper_.add_mappings(mapping, topic_model=deepcopy(self))

1bc6593

I am now passing a deepcopy of the trained model to self.topic_mapper_.add_mappings, so the line topic_model._topic_id_to_zeroshot_topic_idx = new_topic_id_to_zeroshot_topic_idx does not modify the actual model (self) and hence avoids the issue it caused. This is the most "localized" way I can think of to resolve this only for delete_topics without changing add_mappings at all (I can remove all those print statements once you confirm this is okay; they are just there so you can see what is causing the missing zero-shot topic issue).
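
As a tiny standalone illustration of why the deepcopy avoids the side effect (plain Python that only mimics the relevant line, not BERTopic code):

from copy import deepcopy

class Model:
    def __init__(self):
        # zero-shot mapping we want to keep on the trained model
        self._topic_id_to_zeroshot_topic_idx = {5: 0}

def add_mappings_like(topic_model):
    # mimics the line in add_mappings that clears the mapping
    topic_model._topic_id_to_zeroshot_topic_idx = {}

model = Model()
add_mappings_like(deepcopy(model))            # the copy is modified, the real model is untouched
print(model._topic_id_to_zeroshot_topic_idx)  # {5: 0}

add_mappings_like(model)                      # passing the model itself would wipe the mapping
print(model._topic_id_to_zeroshot_topic_idx)  # {}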

The notebook basically just shows that if you install the version from my last commit, then after running delete_topics the zero-shot topics are still there. The only topic that was deleted was the one we specified.

You are right about merge_topics being out of scope; I just mentioned it because the same add_mappings if block causes a similar issue there as well, where zero-shot topics also go missing in the test example we are using.

Hopefully that helps. Let me know if you need more details!

Thank you!

@MaartenGr
Owner

Apologies for the delay. Lately, finding the time to work on this has become more difficult.

That said, if it's just a deepcopy of itself, then it is likely related to some shallow references (class attributes) that get overwritten at some point.

Just to be sure I understand correctly: with your suggested change, it should now work correctly for zero-shot, right?

@shuanglovesdata
Contributor Author

@MaartenGr No worries. Thanks for getting back to me!

Yes, my suggested change means that delete_topics will not remove zero-shot topics (unless specified to be removed), and we can see that based on your test case.

Basically, using a deepcopy of the model when calling add_mappings just allows the steps in delete_topics to NOT be affected by lines 4798 to 4840 in the main branch's add_mappings implementation (which I think is causing the issues):

# When zero-shot topic(s) are present in the topics to merge,

All my delete_topics method needs are lines 4791 to 4796 in the add_mappings method.

Let me know if that is okay and I will push a new commit to remove the print statements. Hopefully we can get it merged soon.

Thank you!

@MaartenGr
Owner

@shuanglovesdata I'm going to keep saying this, but sorry for the delay. I'll do my best to find a bit more time to get this merged!

In that case, yes, let's go ahead with this solution then!

@shuanglovesdata
Contributor Author

Hi @MaartenGr, I just pushed a commit to remove all print/debug statements, so the branch is now clean (meaning it only contains the essential new logic that makes delete_topics work, along with additional unit tests). It should all work fine as we last discussed. Please merge if you are okay with the PR. Thank you.

@MaartenGr MaartenGr merged commit 32b2ddd into MaartenGr:master Aug 5, 2025
6 checks passed
@MaartenGr
Owner

@shuanglovesdata Thank you for your patience on this and your amazing work! It's a wonderful feature and I'm happy that it's finally merged 😄

@shuanglovesdata
Contributor Author

@MaartenGr Thank you very much! I am really happy to contribute to this awesome package and add a feature that we definitely use for our workflow! Hope to connect with you through https://www.linkedin.com/in/shuanglovesdata/ and https://shuangbuildsai.medium.com/.
