Description
Feature request
Improve BERTopic.merge_models so that when two clusters from different models are merged, the resulting topic metadata (embeddings, representation, etc.) is updated to better reflect the new cluster.
Motivation
I’ve been experimenting with BERTopic.merge_models and noticed a potential improvement regarding how cluster merges are handled.
Currently, when two clusters from different models are considered similar enough and are merged into a single topic in the resulting model, the merged topic keeps the embedding, topic representation, topic label, and topic aspects from the baseline model only. The contribution from the second model is not reflected in the merged topic.
This can lead to a merged topic that does not fully represent all documents from both models, especially if the second model has a substantial number of documents in that cluster.
Your contribution
I suggest:
- Updating topic_embeddings_ by averaging or weighted averaging embeddings from all merged clusters.
- Combining topic_representations_ (e.g., recomputed c-TF-IDF).
- Combining topic_labels_ and topic_aspects_ (e.g., recalculate topic labels based on the new topic representation; recompute topic aspects).
This would make merge_models produce topics that more accurately represent the union of all merged clusters.
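To make the first suggestion concrete, here is a minimal sketch of the weighted-averaging idea. It is not BERTopic code: the helper name merge_topic_embeddings, the choice of document counts as weights, and the example vectors are all assumptions made for illustration.

```python
import numpy as np


def merge_topic_embeddings(embeddings, sizes):
    """Size-weighted mean of the topic embeddings of the clusters being merged.

    Hypothetical helper, not part of BERTopic.
    embeddings: sequence of 1-D vectors, one per merged cluster
    sizes: number of documents in each cluster (assumed weighting scheme)
    """
    embeddings = np.asarray(embeddings, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    if embeddings.ndim != 2 or sizes.ndim != 1 or embeddings.shape[0] != sizes.shape[0]:
        raise ValueError("expected one embedding row per cluster size")
    if sizes.sum() <= 0:
        raise ValueError("cluster sizes must sum to a positive number")
    # Each cluster contributes in proportion to its document count.
    return np.average(embeddings, axis=0, weights=sizes)


# Made-up values: the baseline cluster has 30 documents,
# the matching cluster from the second model has 90.
baseline_embedding = np.array([1.0, 0.0])
other_embedding = np.array([0.0, 1.0])
merged = merge_topic_embeddings([baseline_embedding, other_embedding], sizes=[30, 90])
print(merged)  # [0.25 0.75]
```

With these weights the merged embedding sits closer to the larger cluster, which is the case described above where the second model contributes a substantial number of documents. Passing equal sizes reduces this to a plain average.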