
Conversation

shuanglovesdata
Contributor

What does this PR do?

  • Added a delete_topics method to move topics to a -1 outlier cluster.
  • Added test_delete.py to test the new method.

Fixes discussion #2172
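
A minimal usage sketch of the new method (illustrative only; docs stands in for your own documents and the topic ID is arbitrary):

from bertopic import BERTopic

topic_model = BERTopic(min_topic_size=50, verbose=True)
topics, probs = topic_model.fit_transform(docs)

# Move all documents of the listed topics into the -1 outlier cluster
topic_model.delete_topics([3])

# The deleted topic no longer shows up in the overview
topic_model.get_topic_info()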


Owner

@MaartenGr MaartenGr left a comment


Thank you for this PR! I left a few questions to get a bit more insight into it. Other than that, it does seem like it would do the trick here for the most part.

Also, I believe .custom_labels_ needs to be updated if it exists, right?
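
Roughly something like this is what I have in mind, as a sketch rather than the exact implementation (I'm assuming custom_labels_ stays a list aligned with the sorted topic IDs in topic_labels_, and topics_to_delete stands in for whatever the argument ends up being called):

# Sketch only: keep custom_labels_ aligned with the topics that remain
if self.custom_labels_ is not None:
    sorted_topics = sorted(self.topic_labels_.keys())
    label_per_topic = dict(zip(sorted_topics, self.custom_labels_))
    self.custom_labels_ = [
        label_per_topic[topic]
        for topic in sorted_topics
        if topic not in topics_to_delete
    ]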

@shuanglovesdata
Contributor Author

@MaartenGr, I added an adjustment for .custom_labels_ and also refactored based on your other comments. Could you please review again and let me know if more edits are needed? Thank you!

@MaartenGr
Owner

I tested it with the following and it doesn't seem to work for me.

Training

Data

from datasets import load_dataset

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

# Extract abstracts to train on and corresponding titles
abstracts = dataset["abstract"]
titles = dataset["title"]

Embeddings

from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

Representation models

from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# KeyBERT
keybert_model = KeyBERTInspired()

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# All representation models
representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
}

Fit

from bertopic import BERTopic

topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  representation_model=representation_model,

  # Hyperparameters
  min_topic_size=50,
  verbose=True
)

# Train model
topics, probs = topic_model.fit_transform(abstracts[:20_000], embeddings[:20_000])

Issue

Inspecting the topics using topic_model.get_topic_info() then yields me:

[screenshot: topic_model.get_topic_info() output]

However, when I run:

topic_model.delete_topics([3])
topic_model.get_topic_info()

I get these topics:

[screenshot: topics after topic_model.delete_topics([3])]

Although the topic is removed, all other topics are mixed up. For instance, topic 0 was initially 0_networks_deep_neural_network and has now become 0_bayesian_structure_network_variables, which shouldn't be possible considering the number of documents hasn't changed.

Note how the representative documents also haven't changed. So it seems that something is mixed up with the labels, although I'm not sure why.

@shuanglovesdata
Contributor Author

@MaartenGr, thanks for finding the mapping issues. I found out that was due to the difference between final_mapping = self.topic_mapper_.get_mappings(original_topics=False) and final_mapping = self.topic_mapper_.get_mappings(original_topics=True). I also added a few more adjustments to make sure all attributes are updated. I tested with your example, and the topic_model.get_topic_info() output looks normal now after deletion. Could you please re-test and confirm?
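
To illustrate the difference with a toy example (plain Python, not the actual BERTopic internals; the dictionaries are made up):

# topics_ holds the original cluster ID per document
topics_ = [0, 1, 2, 3, 2, 0]

# mapping expressed relative to the original IDs (what original_topics=True is meant to give)
mapping_from_original = {0: 0, 1: 1, 2: 2, 3: -1}

# mapping expressed relative to an intermediate renumbering (original_topics=False)
mapping_from_previous = {0: 1, 1: 0, 2: 2, 3: -1}

# Applying the wrong mapping to the original assignments swaps which documents
# end up under which label, which is exactly the "mixed up" behavior you saw:
print([mapping_from_original[t] for t in topics_])  # [0, 1, 2, -1, 2, 0] -> correct
print([mapping_from_previous[t] for t in topics_])  # [1, 0, 2, -1, 2, 1] -> labels mixed up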

@MaartenGr
Owner

Thank you for making the changes. I noticed something odd. When I run the following before deleting a topic:

topic_model.transform(["a document about networks deep neural network"])

I would get a different prediction than if I ran it afterwards. Perhaps there is something going on with the embeddings?

@shuanglovesdata
Contributor Author

shuanglovesdata commented May 9, 2025

@MaartenGr, I could not replicate the issue you mentioned. I ran your example, and the prediction was the same before and after deleting topic 3. Only after deleting topic 0 ("0_networks_deep_neural_network") did it classify your example as -1. See the screenshots below.

Before deleting:

[screenshot: prediction before deleting]

After deleting topic 3 - "3_user_recommendation_items_users":

[screenshot: prediction after deleting topic 3]

After deleting topic 0 - "0_networks_deep_neural_network":

[screenshot: prediction after deleting topic 0]

@shuanglovesdata
Contributor Author

shuanglovesdata commented May 9, 2025

Could it be that your package version is still the old one?

Code I ran with your test case:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from bertopic import BERTopic

dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]

abstracts = dataset["abstract"][:20_000]
titles = dataset["title"][:20_000]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)

keybert_model = KeyBERTInspired()
mmr_model = MaximalMarginalRelevance(diversity=0.3)

representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
}

topic_model = BERTopic(
    embedding_model=embedding_model,
    representation_model=representation_model,
    min_topic_size=50,
    verbose=True,
)

topics, probs = topic_model.fit_transform(abstracts, embeddings)

print(topic_model.transform(["a document about networks deep neural network"]))

topic_model.delete_topics([3])
topic_model.get_topic_info()[:5]

print(topic_model.transform(["a document about networks deep neural network"]))

topic_model.delete_topics([0])
topic_model.get_topic_info()[:5]

print(topic_model.transform(["a document about networks deep neural network"]))

@MaartenGr
Owner

Odd, I can't seem to reproduce that problem in another environment. I will check in the old environment to see if I did anything special there. Note that you actually showcase something similar. Your scores before and after deleting a topic are different, which shouldn't be the case if the topic embeddings remain the same. You got 0.76048 before and 0.75848 after. This change does indicate something is happening there.

Another thing I tested was using zero-shot topic modeling. When I run the following:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from hdbscan import HDBSCAN
from umap import UMAP

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

docs = load_dataset("CShorten/ML-ArXiv-Papers")["train"]["abstract"][:20_000]

# Pre-calculate embeddings
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Use sub-models
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)
hdbscan_model = HDBSCAN(min_samples=5, gen_min_span_tree=True, prediction_data=True)

# Representation models
keybert_model = KeyBERTInspired()
mmr_model = MaximalMarginalRelevance(diversity=0.3)
representation_model = {
  "KeyBERT": keybert_model,
  "MMR": mmr_model,
}

# BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    zeroshot_topic_list=["topic modeling", "large language models"],
    verbose=True,
).fit(docs, embeddings)

I get these results:

[screenshot: get_topic_info() output for the zero-shot model]

The zero-shot topic modeling doesn't order the counts as intended (it should be outliers -> zero-shot -> sorted topics). That last part isn't happening, which is a bug in zero-shot topic modeling, so we can safely ignore that.

However, when I now delete a topic, the name of the zero-shot topic is removed:

[screenshot: get_topic_info() output after deletion, zero-shot topic name missing]

The order of the topics is changed, but more importantly, the name of the zero-shot topic is removed.

@shuanglovesdata
Contributor Author

shuanglovesdata commented May 27, 2025

Hi @MaartenGr, thanks for the comments. A few findings from me:

1: Regarding "your scores before and after deleting a topic are different, which shouldn't be the case if the topic embeddings remain the same", a static probability score is actually not the expected behavior given your explanations in this other thread: #2124. I have tested that, without running delete_topics at all, if you run the same topic_model.transform(["a document about networks deep neural network"]) line multiple times after fitting the model, you will get different probability scores, and I've seen variations larger than 0.05.
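
Roughly what I ran to check this (same fitted model, no deletion in between):

# Same document, same fitted model, no delete_topics call in between
doc = ["a document about networks deep neural network"]
for _ in range(3):
    topics, probs = topic_model.transform(doc)
    print(topics, probs)  # the probability differs slightly from run to run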

2: There are some lines in the add_mappings method that cause the zero-shot topics to go missing:
https://github.com/MaartenGr/BERTopic/blob/4042e744a66c174da15979206376f377f906a08c/bertopic/_bertopic.py#L4713C1-L4755C93
I believe the same logic actually causes merge_topics to also have issues with missing zero-shot topics after a merge.

In my latest commit, I added some draft edits that fix the missing zero-shot topic error you pointed out, specifically for delete_topics, as well as a notebook showing the tests I ran: https://github.com/MaartenGr/BERTopic/blob/1bc6593ded1acb3fd6b046a08aedfcff6ca912f1/delete_topics_test_sc.ipynb.

I don't know the background behind the adjustment in add_mappings for zero-shot and it is not related to my use cases. If possible, could you review if the changes I suggested for delete_topics would be sufficient? Maybe you can work on addressing issues with the zero-shot adjustments/merge_topics in a separate PR?

Thank you and I look forward to hearing your thoughts.

@MaartenGr
Owner

1: Regarding "your scores before and after deleting a topic are different, which shouldn't be the case if the topic embeddings remain the same", a static probability score is actually not the expected behavior given your explanations in this other thread: #2124. I have tested that, without running delete_topics at all, if you run the same topic_model.transform(["a document about networks deep neural network"]) line multiple times after fitting the model, you will get different probability scores, and I've seen variations larger than 0.05.

Apologies, I had initially thought transform was done purely with embeddings and not the underlying cluster model, so this makes sense to me.

2: There are some lines in the add_mappings method that cause the zero-shot topics to go missing:
https://github.com/MaartenGr/BERTopic/blob/4042e744a66c174da15979206376f377f906a08c/bertopic/_bertopic.py#L4713C1-L4755C93

Which lines specifically in this code block? Do you mean the if-statement?

I believe the same logic actually causes merge_topics to also have issues with missing zero-shot topics after a merge.

Is that related to a different issue? We are not using merge_topics here, right? Either way, the way zero-shot topics have been modelled since a couple of updates ago isn't necessarily an improvement, I think. I might redo that at some point, but that is out of scope for this issue.

In my latest commit, I added some draft edits that fix the missing zero-shot topic error you pointed out, specifically for delete_topics, as well as a notebook showing the tests I ran: https://github.com/MaartenGr/BERTopic/blob/1bc6593ded1acb3fd6b046a08aedfcff6ca912f1/delete_topics_test_sc.ipynb.

I'm not sure if I understand the notebook and commit correctly. The fix that you mention is not in the latest commit; I'm only seeing print statements there. Where in the notebook can I find the fix?

I don't know the background behind the adjustment in add_mappings for zero-shot and it is not related to my use cases. If possible, could you review if the changes I suggested for delete_topics would be sufficient? Maybe you can work on addressing issues with the zero-shot adjustments/merge_topics in a separate PR?

add_mappings is very important for this functionality (or at least maintaining topic state) as it relates to how any potential subsequent operations are run. For instance, you can't further reduce the topics after delete_topics if the state of topic merging/deletion isn't properly maintained.

That said, if the issue is only with zero-shot topic modeling, then it makes sense to me to raise a warning if somebody tries this functionality with a zero-shot model. Otherwise, this repo will certainly get a bunch of new issues from people trying to use it in this way. Remember that any given feature in BERTopic should also consider a wide range of use cases.
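
For instance, a guard at the start of delete_topics along these lines would probably do (just a sketch; I'm assuming the zero-shot topic list passed to __init__ is available on the model as zeroshot_topic_list):

import warnings

# Sketch: warn when delete_topics is used on a zero-shot topic model
if getattr(self, "zeroshot_topic_list", None):
    warnings.warn(
        "delete_topics may not preserve zero-shot topic names; "
        "use with care on zero-shot topic models.",
        UserWarning,
    )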

@shuanglovesdata
Contributor Author

@MaartenGr Thanks for the detailed reply. Let me point you to the lines of code, my fixes and the notebook first. I think it will become clearer what I meant if you can run through those (maybe with a debugger).

The issue with zero-shot topics going missing is caused by lines 4713 to 4755 in the _bertopic.py file:

# When zero-shot topic(s) are present in the topics to merge,
Specifically, new_topic_id_to_zeroshot_topic_idx = {} and topic_model._topic_id_to_zeroshot_topic_idx = new_topic_id_to_zeroshot_topic_idx just remove the zero-shot topic mappings in the trained model, which is why the topic was gone. I don't have enough background on why this whole "if" block exists and whether removing/adjusting the zero-shot topics here is justified. The comment in the code ("# If any of the original topics are zero-shot, take the best fitting zero-shot label if the cosine similarity with the new topic exceeds the zero-shot threshold") doesn't really make sense to me.

My fix for delete_topics:
[screenshot: diff of the delete_topics fix]

In the delete method, I updated how I call add_mappings to: self.topic_mapper_.add_mappings(mapping, topic_model=deepcopy(self))

1bc6593

I am now passing a deepcopy of the trained model to self.topic_mapper_.add_mappings, so the line topic_model._topic_id_to_zeroshot_topic_idx = new_topic_id_to_zeroshot_topic_idx does not modify the actual model (self) and hence avoids the issue it caused. This is the most "localized" way I can think of to resolve this only for delete_topics without changing add_mappings at all (I can remove all those print statements once you confirm this is okay; they are just there so you can see what is causing the missing zero-shot topic issue).
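
As a tiny standalone illustration of why the deepcopy avoids the side effect (plain Python that only mimics the relevant line, not BERTopic code):

from copy import deepcopy

class Model:
    def __init__(self):
        # zero-shot mapping we want to keep on the trained model
        self._topic_id_to_zeroshot_topic_idx = {5: 0}

def add_mappings_like(topic_model):
    # mimics the line in add_mappings that clears the mapping
    topic_model._topic_id_to_zeroshot_topic_idx = {}

model = Model()
add_mappings_like(deepcopy(model))            # the copy is modified, the real model is untouched
print(model._topic_id_to_zeroshot_topic_idx)  # {5: 0}

add_mappings_like(model)                      # passing the model itself would wipe the mapping
print(model._topic_id_to_zeroshot_topic_idx)  # {}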

The notebook basically just shows that if you install the version from my last commit, then after running delete_topics the zero-shot topics are still there. The only topic that was deleted was the one we specified.

You are right about merge_topics being out of scope; I just mentioned it because the same add_mappings if block causes a similar issue there as well, where zero-shot topics also go missing in the test example we are using.

Hopefully that helps. Let me know if you need more details!

Thank you!

@MaartenGr
Owner

Apologies for the delay. Lately, finding the time to work on this has become more difficult.

That said, if it's just a deepcopy of itself, then it is likely related to some shallow references (class attributes) that get overwritten at some point.

Just to be sure I understand correctly: with your suggested change, it should now work correctly for zero-shot, right?

@shuanglovesdata
Contributor Author

@MaartenGr No worries. Thanks for getting back to me!

Yes, my suggested change means that delete_topics will not remove zero-shot topics (unless specified to be removed), and we can see that based on your test case.

Basically, using a deepcopy of the model when calling add_mappings just allows the steps in delete_topics to NOT be affected by lines 4798 to 4840 in the main branch's add_mappings implementation (which I think is causing the issues):

# When zero-shot topic(s) are present in the topics to merge,

All my delete_topics method needs are lines 4791 to 4796 in the add_mappings method.

Let me know if that is okay and I will push a new commit to remove the print statements. Hopefully we can get it merged soon.

Thank you!

@MaartenGr
Owner

@shuanglovesdata I'm going to keep saying this, but sorry for the delay. I'll do my best to find a bit more time to get this merged!

In that case, yes, let's go ahead with this solution then!

@shuanglovesdata
Contributor Author

Hi @MaartenGr, I just pushed a commit to remove all print/debug statements, so the branch is now clean (meaning it only contains the essential new logic that makes delete_topics work, along with additional unit tests). It should all work fine as we last discussed. Please merge if you are okay with the PR. Thank you.

@MaartenGr MaartenGr merged commit 32b2ddd into MaartenGr:master Aug 5, 2025
6 checks passed
@MaartenGr
Owner

@shuanglovesdata Thank you for your patience on this and your amazing work! It's a wonderful feature and I'm happy that it's finally merged 😄

@shuanglovesdata
Contributor Author

@MaartenGr Thank you very much! I am really happy to contribute to this awesome package and add a feature that we definitely use for our workflow! Hope to connect with you through https://www.linkedin.com/in/shuanglovesdata/ and https://shuangbuildsai.medium.com/.
