Hello, I am training BERTopic on around 500,000 short descriptions, and I plan to train it on around 1,000,000 later on. Training currently takes around 50 minutes, and I am using HDBSCAN to find clusters. Based on the data, I would like to obtain four or five clusters, but it is very difficult to find the correct minimum topic size when training takes such a long time. Is there a way of testing the results of different minimum topic size values without having to train the model every time? Thank you in advance!
What you are looking for is quite tricky considering the nature of HDBSCAN. When you run it using a small subset of documents, you are unlikely to generalize those findings to a larger amount of data. This is partly because your data will differ, partly because UMAP will learn different representations, and partly because HDBSCAN just "finds" any given number of clusters as long as it adheres to its parameters. So it would indeed be difficult to know the `min_topic_size`/`min_cluster_size` beforehand that yields a specific (or rough) number of clusters.

Instead, I would actually advise using a different clustering algorithm altogether. In my experience, HDBSCAN really excels at finding many clusters with all kinds of different structures and distributions. However, with 1,000,000 documents and only 5 topics, structure might be less of an issue and you could get similar (or even better) results by using something like k-Means instead. It would allow you to predefine the number of clusters to create.

Another trick would be to use cuML to speed up training, but I'm not familiar with your setup or whether you have a GPU to work with. If you do, that would significantly decrease the training time and allow you to iterate quickly. Then, you could indeed start testing different values for `min_topic_size`.
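For reference, a minimal sketch of the k-Means route. This assumes the standard BERTopic API, where any scikit-learn-style clustering model can be passed via the `hdbscan_model` parameter; `docs` is a placeholder for your list of descriptions:

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans

# k-Means lets you fix the number of clusters directly
# instead of tuning `min_topic_size`.
cluster_model = KMeans(n_clusters=5, random_state=42)
topic_model = BERTopic(hdbscan_model=cluster_model)

# `docs` is assumed to be your list of short descriptions.
topics, _ = topic_model.fit_transform(docs)
```

And, assuming you have a CUDA-enabled GPU with the `cuml` package installed, cuML's GPU implementations of UMAP and HDBSCAN can be swapped in the same way (the parameter values below are only illustrative):

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated sub-models in place of the CPU defaults.
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_cluster_size=100, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```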