Hi all, I have trained a BERTopic model on around 3.5 million short texts (the longest is no more than 200 words). The model trained successfully, but the 16 resulting topics are very similar to each other and share very similar representative words. The embedding points also sit very close together in the 2-dimensional space. Has anyone run into a similar issue and can suggest any solutions? Thank you very much in advance!

Here are the UMAP and HDBSCAN parameters:

```python
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

umap_model = UMAP(
    n_neighbors=100,  # I also tested with n_neighbors=10, getting similar results
    n_components=2,
    min_dist=0.0,
    metric="cosine",
    low_memory=True,
    random_state=cfg.SEED,
)

hdbscan_model = HDBSCAN(
    min_cluster_size=8000,
    min_samples=50,
    cluster_selection_epsilon=0.01,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)

topic_model = BERTopic(
    embedding_model=cfg.EMBEDDING_MODEL,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    language=language,
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True,
    verbose=True,
    low_memory=True,
)
```
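As a quick diagnostic (not part of the original post), one way to quantify how similar the topics actually are is to compare their c-TF-IDF representations pairwise. This is a minimal sketch assuming a fitted `topic_model` and a recent BERTopic release that exposes the `c_tf_idf_` attribute:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between the topics' c-TF-IDF vectors;
# off-diagonal values close to 1.0 confirm near-duplicate topics.
sim_matrix = cosine_similarity(topic_model.c_tf_idf_)
print(sim_matrix.round(2))

# BERTopic also ships a built-in heatmap for the same comparison:
fig = topic_model.visualize_heatmap()
fig.show()
```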
The number of components for your UMAP model is rather low. I would definitely increase that to at least 5; you are losing a lot of information when reducing it to (almost) the bare minimum.

With respect to HDBSCAN, the `min_cluster_size` is quite high, and HDBSCAN then tends to create rather abstract and broad clusters. I would advise lowering that value and potentially merging topics later on if needed. Doing the latter would also show what it takes to end up with only a few topics.

Lastly, the `cfg.EMBEDDING_MODEL` might also be related, but that depends on its contents.
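For concreteness, here is a minimal sketch of those two suggestions: raising `n_components` and lowering `min_cluster_size`, then merging near-duplicate topics after fitting. The specific values (`n_components=5`, `min_cluster_size=500`) and the `docs` variable are illustrative assumptions, not tested recommendations, and `merge_topics` / `reduce_topics` are shown with the signatures used in recent BERTopic releases.

```python
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Keep more structure for clustering: 5 components instead of 2
# (2-D is best reserved for visualization only).
umap_model = UMAP(
    n_neighbors=100,
    n_components=5,  # suggested minimum; illustrative value
    min_dist=0.0,
    metric="cosine",
    low_memory=True,
    random_state=cfg.SEED,
)

# A much smaller min_cluster_size lets HDBSCAN find finer-grained clusters;
# 500 is an illustrative starting point for ~3.5M documents, not a recommendation.
hdbscan_model = HDBSCAN(
    min_cluster_size=500,
    min_samples=50,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True,
)
topics, probs = topic_model.fit_transform(docs)  # `docs` = your 3.5M texts

# If this now yields too many fine-grained topics, merge the near-duplicates,
# e.g. combine topics 1, 2, and 7 into one (hypothetical topic ids):
topic_model.merge_topics(docs, topics_to_merge=[[1, 2, 7]])

# Or reduce to a target number of topics automatically:
topic_model.reduce_topics(docs, nr_topics=30)
```

Lowering `min_cluster_size` first and merging afterwards gives you control over the granularity, rather than letting HDBSCAN collapse everything into a handful of broad clusters up front.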