Hi all, I have trained a BERTopic model on around 3.5 million short texts (the longest is no more than 200 words). The model trained successfully, but the 16 resulting topics are very similar to each other and share very similar representative words. The embedding points also sit very close together in the 2-dimensional space. Has anyone run into a similar issue and can suggest any solutions? Thank you very much in advance!

Here are the UMAP and HDBSCAN parameters:

```python
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

umap_model = UMAP(
    n_neighbors=100,  # I also tested with n_neighbors=10, getting similar results
    n_components=2,
    min_dist=0.0,
    metric="cosine",
    low_memory=True,
    random_state=cfg.SEED,
)

hdbscan_model = HDBSCAN(
    min_cluster_size=8000,
    min_samples=50,
    cluster_selection_epsilon=0.01,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)

topic_model = BERTopic(
    embedding_model=cfg.EMBEDDING_MODEL,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    language=language,
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True,
    verbose=True,
    low_memory=True,
)
```
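As a quick diagnostic (not part of the original post), one way to quantify how similar the topics actually are is to compare their c-TF-IDF representations pairwise. This is a minimal sketch assuming a fitted `topic_model` and a recent BERTopic release that exposes the `c_tf_idf_` attribute:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between the topics' c-TF-IDF vectors;
# off-diagonal values close to 1.0 confirm near-duplicate topics.
sim_matrix = cosine_similarity(topic_model.c_tf_idf_)
print(sim_matrix.round(2))

# BERTopic also ships a built-in heatmap for the same comparison:
fig = topic_model.visualize_heatmap()
fig.show()
```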
The number of components for your UMAP model is rather low. I would definitely increase that to at least 5; you are losing a lot of information when reducing it to (almost) the bare minimum.

With respect to HDBSCAN, the `min_cluster_size` is quite high, and HDBSCAN then tends to create rather abstract and broad clusters. I would advise lowering that value and potentially merging topics later on if needed. Doing the latter would also show what it takes to end up with only a few topics.

Lastly, the `cfg.EMBEDDING_MODEL` might also be related, but that depends on its contents.
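For concreteness, here is a minimal sketch of those two suggestions: raising `n_components` and lowering `min_cluster_size`, then merging near-duplicate topics after fitting. The specific values (`n_components=5`, `min_cluster_size=500`) and the `docs` variable are illustrative assumptions, not tested recommendations, and `merge_topics` / `reduce_topics` are shown with the signatures used in recent BERTopic releases.

```python
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Keep more structure for clustering: 5 components instead of 2
# (2-D is best reserved for visualization only).
umap_model = UMAP(
    n_neighbors=100,
    n_components=5,  # suggested minimum; illustrative value
    min_dist=0.0,
    metric="cosine",
    low_memory=True,
    random_state=cfg.SEED,
)

# A much smaller min_cluster_size lets HDBSCAN find finer-grained clusters;
# 500 is an illustrative starting point for ~3.5M documents, not a recommendation.
hdbscan_model = HDBSCAN(
    min_cluster_size=500,
    min_samples=50,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True,
)
topics, probs = topic_model.fit_transform(docs)  # `docs` = your 3.5M texts

# If this now yields too many fine-grained topics, merge the near-duplicates,
# e.g. combine topics 1, 2, and 7 into one (hypothetical topic ids):
topic_model.merge_topics(docs, topics_to_merge=[[1, 2, 7]])

# Or reduce to a target number of topics automatically:
topic_model.reduce_topics(docs, nr_topics=30)
```

Lowering `min_cluster_size` first and merging afterwards gives you control over the granularity, rather than letting HDBSCAN collapse everything into a handful of broad clusters up front.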