Hello, I am training BERTopic on around 500,000 short descriptions, and I plan to train it on around 1,000,000 later on. Training currently takes around 50 minutes, and I am using HDBSCAN to find clusters. Based on the data, I would like to obtain four or five clusters, but it is very difficult to find the correct minimum topic size when training takes such a long time. Is there a way of testing the results of different minimum topic size values without having to train the model every time? Thank you in advance!
What you are looking for is quite tricky considering the nature of HDBSCAN. When you run it using a small subset of documents, you are unlikely to generalize those findings to a larger amount of data. This is partly because your data will differ, partly because UMAP will learn different representations, and partly because HDBSCAN just "finds" any given number of clusters as long as it adheres to its parameters. So it would indeed be difficult to know the `min_topic_size`/`min_cluster_size` beforehand that yields a specific (or rough) number of clusters.

Instead, I would actually advise using a different clustering algorithm altogether. In my experience, HDBSCAN really excels at finding many clusters with all kinds of different structures and distributions. However, with 1,000,000 documents and only 5 topics, structure might be less of an issue and you could get similar (or even better) results by using something like k-Means instead. It would allow you to predefine the number of clusters to create.

Another trick would be to use cuML to speed up training, but I'm not familiar with your setup or whether you have a GPU to work with. If you do, that would significantly decrease the training time and allow you to iterate quickly. Then, you could indeed start testing different values for `min_topic_size`.
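For reference, a minimal sketch of the k-Means route. This assumes the standard BERTopic API, where any scikit-learn-style clustering model can be passed via the `hdbscan_model` parameter; `docs` is a placeholder for your list of descriptions:

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans

# k-Means lets you fix the number of clusters directly
# instead of tuning `min_topic_size`.
cluster_model = KMeans(n_clusters=5, random_state=42)
topic_model = BERTopic(hdbscan_model=cluster_model)

# `docs` is assumed to be your list of short descriptions.
topics, _ = topic_model.fit_transform(docs)
```

And, assuming you have a CUDA-enabled GPU with the `cuml` package installed, cuML's GPU implementations of UMAP and HDBSCAN can be swapped in the same way (the parameter values below are only illustrative):

```python
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# GPU-accelerated sub-models in place of the CPU defaults.
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_cluster_size=100, prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```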