What you are looking for is quite tricky given the nature of HDBSCAN. Results obtained on a small subset of documents are unlikely to generalize to a larger dataset. This is partly because the data itself will differ, partly because UMAP will learn different representations, and partly because HDBSCAN does not target a fixed number of clusters: it returns however many clusters satisfy its parameters.

So it would indeed be difficult to know beforehand which min_topic_size / min_cluster_size yields a specific (or even approximate) number of clusters.

Instead, I would actually advise using a different clustering algorithm altogether. In my experience, HDBSCAN really excels at finding many clu…
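To illustrate the general idea of swapping in a clustering algorithm that takes the number of clusters as an input, here is a minimal sketch. The reply above is cut off, so this does not claim to be the algorithm it goes on to recommend; k-means from scikit-learn is my own assumption, chosen only because it accepts `n_clusters` directly. The helper name `cluster_with_fixed_k` and the random arrays standing in for reduced document embeddings are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_with_fixed_k(embeddings: np.ndarray, n_clusters: int, seed: int = 42) -> np.ndarray:
    """Assign each row of `embeddings` to one of exactly `n_clusters` clusters.

    Hypothetical helper: k-means is an assumed stand-in for "a different
    clustering algorithm", not necessarily the one the answer recommends.
    """
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(embeddings)


# Random data standing in for (reduced) document embeddings of two corpus sizes.
rng = np.random.default_rng(0)
small_subset = rng.normal(size=(200, 5))
full_corpus = rng.normal(size=(2000, 5))

labels_small = cluster_with_fixed_k(small_subset, n_clusters=8)
labels_large = cluster_with_fixed_k(full_corpus, n_clusters=8)

# The number of clusters is set by the parameter, not by the amount of data.
print(len(set(labels_small.tolist())), len(set(labels_large.tolist())))
```

In BERTopic, a scikit-learn clustering model like this can be passed through the `hdbscan_model` argument, for example `BERTopic(hdbscan_model=KMeans(n_clusters=8))`. One trade-off to keep in mind is that k-means assigns every document to a cluster, so there is no outlier topic as there is with HDBSCAN.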

@carobs9
@MaartenGr
Answer selected by carobs9