-
An approach that I often see to get this "stability" is indeed to run UMAP several times with different random states and/or other parameters. The resulting reduced embeddings are then typically averaged to get a more "stable" representation. Those averaged embeddings are passed to BERTopic, where you should then skip the dimensionality reduction step.
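Something along these lines (a rough sketch only; the embedding model, seed list, and UMAP parameters are placeholders, not recommendations):

```python
# Sketch of the averaging idea: reduce with several UMAP seeds, average the
# reduced embeddings, then let BERTopic skip its own dimensionality reduction.
import numpy as np
from umap import UMAP
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Pre-compute the document embeddings once
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Run UMAP with several random states and average the reduced embeddings
seeds = [0, 1, 2, 3, 4]
reduced_runs = [
    UMAP(n_components=5, random_state=seed).fit_transform(embeddings)
    for seed in seeds
]
averaged_embeddings = np.mean(reduced_runs, axis=0)

# Pass the averaged embeddings directly to BERTopic and replace its internal
# UMAP step with an identity (BaseDimensionalityReduction) model
topic_model = BERTopic(umap_model=BaseDimensionalityReduction())
topics, probs = topic_model.fit_transform(docs, embeddings=averaged_embeddings)
```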
-
Hi Maarten,
I plan to use BERTopic for my research paper and would like to hear your thoughts on its reproducibility. Currently, I’m using UMAP + KMeans, since it allows me to control the number of topics. One of the sections in my paper discusses reproducibility, and I came across a [paper] that defines it as: “The term ‘stability’ refers to a topic model’s ability to produce solutions that are partially or completely identical across multiple runs with different random initializations.”
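For reference, here is roughly the setup I have in mind; the parameter values below are placeholders rather than the exact ones from my study:

```python
# Sketch of the UMAP + KMeans configuration (placeholder parameter values).
from umap import UMAP
from sklearn.cluster import KMeans
from bertopic import BERTopic

umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0, random_state=42)

# A sklearn-style clusterer can be passed via hdbscan_model;
# KMeans is what lets me fix the number of topics.
cluster_model = KMeans(n_clusters=20, random_state=42)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=cluster_model)
# topics, _ = topic_model.fit_transform(docs)  # docs = my document collection
```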
To investigate this, I fixed all parameters and only changed the random seed for UMAP. I then generated the topic keywords and computed topic similarity scores (Jaccard, Dice, and Spearman, as that paper suggests). However, all the scores were relatively low (below 0.5). The topics also did not come out in the same order across runs: for example, topic_1 from model A did not correspond to topic_1 in model B, and the matched topics still varied in several keywords.
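In case it helps, this is roughly how I compare the keyword sets of two runs; the best-match pairing (rather than comparing topics by index) is my own workaround for the ordering problem:

```python
# Sketch of the Jaccard comparison between two runs; the best-match pairing
# is my own addition so that topic order does not affect the score.
def jaccard(words_a, words_b):
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def mean_best_match_jaccard(run_a, run_b):
    """run_a, run_b: lists of keyword lists, one list of top words per topic."""
    return sum(
        max(jaccard(topic_a, topic_b) for topic_b in run_b) for topic_a in run_a
    ) / len(run_a)

# Keyword lists can be pulled from a fitted model, e.g.:
# run_a = [[word for word, _ in model_a.get_topic(t)] for t in range(n_topics)]
```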
I’d like to get your thoughts on the best approach for conducting a reproducibility study with BERTopic. For example, should I fix the UMAP random_state and instead bootstrap the data 100 times? I just want to make sure I’m not overcomplicating something that should be simple.
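A sketch of what I mean by the bootstrap option (fixed random_state, documents resampled with replacement; the 100 resamples and parameter values are placeholders):

```python
# Sketch of the bootstrap idea: keep random_state fixed, resample the documents
# with replacement, refit, and compare the topic keywords across the fits.
import numpy as np
from umap import UMAP
from sklearn.cluster import KMeans
from bertopic import BERTopic

rng = np.random.default_rng(0)

def fit_bootstrap_model(docs, n_topics=20):
    idx = rng.integers(0, len(docs), size=len(docs))  # sample with replacement
    sample = [docs[i] for i in idx]
    model = BERTopic(
        umap_model=UMAP(n_components=5, random_state=42),
        hdbscan_model=KMeans(n_clusters=n_topics, random_state=42),
    )
    model.fit(sample)
    return model

# models = [fit_bootstrap_model(docs) for _ in range(100)]
# ...then compute the keyword-overlap scores between the models, as above.
```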
Thank you so much for helping.