
Update topic embeddings and representations when merging clusters in merge_models #2431

@inesmcm26

Feature request

Improve BERTopic.merge_models so that when two clusters from different models are merged, the resulting topic metadata (embeddings, representation, etc.) is updated to better reflect the new cluster.

Motivation

I’ve been experimenting with BERTopic.merge_models and noticed a potential improvement regarding how cluster merges are handled.

Currently, when two clusters from different models are considered similar enough and are merged into a single topic in the resulting model, the merged topic keeps the embedding, topic representation, topic label, and topic aspects from the baseline model only. The contribution from the second model is not reflected in the merged topic.

This can lead to a merged topic that does not fully represent all documents from both models, especially if the second model has a substantial number of documents in that cluster.

Your contribution

I suggest:

  • Updating topic_embeddings_ by taking a plain or size-weighted average of the embeddings of all merged clusters.
  • Combining topic_representations_ (e.g., by recomputing c-TF-IDF).
  • Combining topic_labels_ and topic_aspects_ (e.g., recalculating topic labels based on the new topic representation and recomputing topic aspects).
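
To make the first suggestion concrete, here is a minimal sketch of a size-weighted average of topic embeddings. It is not BERTopic code: `merge_topic_embeddings` is a hypothetical helper, and it assumes the embedding of each cluster and its document count are available at merge time.

```python
import numpy as np


def merge_topic_embeddings(embeddings, sizes):
    """Hypothetical helper: size-weighted mean of the embeddings of merged clusters.

    embeddings: one vector per cluster being merged, shape (n_clusters, dim)
    sizes: number of documents in each cluster, shape (n_clusters,)
    """
    embeddings = np.asarray(embeddings, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    if embeddings.ndim != 2 or embeddings.shape[0] != sizes.shape[0]:
        raise ValueError("expected one size per embedding row")
    if np.any(sizes < 0) or sizes.sum() == 0:
        raise ValueError("sizes must be non-negative and not all zero")
    # Each cluster contributes in proportion to its document count.
    return np.average(embeddings, axis=0, weights=sizes)


# Toy example: the baseline cluster has 30 documents, the second model's has 10.
baseline_embedding = np.array([1.0, 0.0])
second_embedding = np.array([0.0, 1.0])
merged = merge_topic_embeddings(
    [baseline_embedding, second_embedding], sizes=[30, 10]
)
print(merged)
```

With equal sizes this reduces to a plain average; with unequal sizes the larger cluster dominates, which matches the concern that a substantial cluster from the second model is currently ignored.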

This would make merge_models produce topics that more accurately represent the union of all merged clusters.
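
For the second suggestion, a rough sketch of what recomputing c-TF-IDF after a merge could look like. It assumes per-topic term counts over a shared vocabulary are available for both models, which may not hold inside merge_models. The weighting follows the c-TF-IDF formula as described in the BERTopic documentation (L1-normalised term frequency times log(1 + A / f), with A the average number of words per topic and f the total frequency of a term across topics); `merge_topic_rows` and `c_tf_idf` are hypothetical helpers, not library functions.

```python
import numpy as np


def merge_topic_rows(term_counts, keep, absorb):
    """Hypothetical helper: add the term counts of topic `absorb` into topic `keep`."""
    counts = np.asarray(term_counts, dtype=float).copy()
    counts[keep] += counts[absorb]
    # Drop the absorbed topic so each remaining row is one topic.
    return np.delete(counts, absorb, axis=0)


def c_tf_idf(term_counts):
    """Hypothetical helper: c-TF-IDF scores from a (n_topics, n_terms) count matrix."""
    counts = np.asarray(term_counts, dtype=float)
    words_per_topic = counts.sum(axis=1, keepdims=True)
    if np.any(words_per_topic == 0):
        raise ValueError("every topic needs at least one term occurrence")
    tf = counts / words_per_topic                  # L1-normalised term frequency
    avg_words = counts.sum() / counts.shape[0]     # A: average words per topic
    term_freq = counts.sum(axis=0)                 # f: term frequency across topics
    # Terms that never occur have tf == 0, so any finite divisor is safe for them.
    safe_freq = np.where(term_freq > 0, term_freq, 1.0)
    idf = np.log(1.0 + avg_words / safe_freq)
    return tf * idf


# Toy counts: rows are topics, columns are terms in a shared vocabulary.
counts = [[4, 0, 1],
          [2, 1, 0],
          [0, 3, 3]]
merged_counts = merge_topic_rows(counts, keep=0, absorb=1)
scores = c_tf_idf(merged_counts)
print(merged_counts)
print(scores.round(3))
```

Because the inverse-frequency term depends on all topics, a merge changes the scores of every topic, not only the merged one. Labels and aspects derived from the representation would then need to be regenerated from the new scores.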
