Replies: 2 comments 3 replies
-
Can you share the full code for initializing BERTopic? That might already help a bit. With the information you provided it might relate to the data itself and how your embedding model embeds the documents. Have you checked which documents are in topic 0? Inspecting the topics themselves through the documents is likely to reveal why these documents are pushed towards that cluster.
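That inspection can be sketched as follows (a minimal sketch; `docs` and `topics` are hypothetical stand-ins for your corpus and the topic assignments returned by `fit_transform`):

```python
from collections import Counter

# Hypothetical corpus and per-document topic ids, as returned by
# topics, probs = topic_model.fit_transform(docs)
docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
topics = [0, -1, 0, 1, 0]

# Count documents per topic to spot a dominant cluster
print(Counter(topics))

# Pull the raw documents assigned to topic 0 for manual inspection
topic0_docs = [d for d, t in zip(docs, topics) if t == 0]
print(topic0_docs)
```

Reading a sample of the topic 0 documents side by side with its top keywords usually makes it clear whether the cluster is a genuine theme or a catch-all for generic language.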
-
Hi @MaartenGr, thank you so much for maintaining the documentation and keeping the discussions live. I have found so many answers and helpful tips in the discussions! I would like to follow up on a similar issue.

My data consists of YouTube video transcripts (at the testing stage I have around 2,000 transcripts, which end up as over 24,000 shorter chunks), so the data is naturally very noisy. I end up with a huge number of outliers (around 40-50% of the data), regardless of whether I chunk the long transcripts into 500-word documents. Unfortunately, the outlier reduction methods do not seem to help much; whilst the number of outliers is reduced, they all move to topic 0. This is a bit different for the embeddings-based outlier reduction strategy, but most of the outliers are still moved into topic 0.

I have been wondering whether I could tune the HDBSCAN parameters to reduce the number of outliers. I have also been testing KMeans, and in my case the results look promising, but I am at a loss with the current approach.

Here is my BERTopic initialisation code, including the first HDBSCAN setting that did not yield an improvement in the results:

```python
# Set up BERTopic
topic_model = BERTopic(

# Fit model and get topics
topics, probs = topic_model.fit_transform(all_chunks)
```
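On the KMeans observation: unlike HDBSCAN, KMeans partitions every point, so it cannot produce a -1 outlier label by construction. A minimal sketch on synthetic data (the array shapes and cluster count here are illustrative, and the random embeddings merely stand in for reduced document embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for reduced document embeddings (e.g. UMAP output)
embeddings = rng.normal(size=(200, 5))

# KMeans assigns every point to some cluster: no -1 label can appear
labels = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(embeddings)

print(-1 in labels)      # outlier label never occurs
print(len(set(labels)))  # number of clusters found
```

This is also why swapping KMeans into BERTopic makes the outlier count drop to zero: the "outliers" are not removed, they are absorbed into the nearest centroid, so the per-topic keyword quality is worth checking afterwards.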
-
Hello,
I'm using BERTopic on a large dataset (~3.8 million documents), and I'm encountering an issue where a very large number of documents are assigned to topic -1 (outliers) or topic 0. In particular, around 2.9 million documents go into topic -1 and around 4.7K into topic 0. These two topics seem to contain very common words in the corpus. Topic 1 drops to 3.2K documents and contains more specific keywords.
I've tried analyzing this behavior by inspecting and plotting the document-topic probability matrix, where each row is a document and each column is a topic. I mainly inspected the distribution of the maximum topic probability per document. I attach the resulting graph below, plus the distributions of the max probabilities for topics -1, 0, 1, 2, 3, 10, 20 and 1000.
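For reference, that inspection can be sketched on a synthetic matrix (the real one would come from `fit_transform` with `calculate_probabilities=True`; the sizes and the Dirichlet sampling here are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical document-topic probability matrix: 1000 docs x 50 topics,
# each row a probability distribution over topics
probs = rng.dirichlet(np.ones(50), size=1000)

# Max topic probability per document: a pile-up of low values suggests
# that many documents are only weakly attached to any topic
max_probs = probs.max(axis=1)
print(max_probs.mean())
print((max_probs < 0.1).mean())  # fraction of weakly assigned documents
```

Plotting a histogram of `max_probs` then shows directly whether the mass sits near zero (weak assignments, likely outlier candidates) or near one (confident assignments).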
However, I'm having trouble understanding why so many documents are pushed to -1 or 0, and what the central cluster around topic 0 really represents. It seems like topic 0 should be divided into subtopics, since it contains general information.
What I’ve Tried:
The latter method seems to improve the results slightly: topic -1 disappears, and topic 0 contains around 4.8K documents, followed by 29K in topic 1. Yet the results are not satisfactory, since this topic 0 cluster still contains a lot of documents and hides information. I suspect this may be due to:
A large, central cluster in the embedding space. The embeddings seem to form one big central cluster, and I suspect that HDBSCAN cannot find clusters in these low-quality embeddings. I wanted to inspect the probabilities to understand the root of the problem.
Weak separation between topics
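One way to test the weak-separation hypothesis directly is a silhouette score on the (reduced) embeddings. A sketch with synthetic data, where `make_blobs` merely stands in for the UMAP output and all sizes are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated synthetic clusters vs. a single diffuse central blob
X_sep, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)
X_diffuse = np.random.default_rng(0).normal(size=(300, 2))

scores = []
for X in (X_sep, X_diffuse):
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# High scores indicate tight, well-separated clusters;
# one central blob forced into k clusters scores low
print([round(s, 2) for s in scores])
```

If the real embeddings score close to the diffuse case, the problem is upstream of HDBSCAN: the embedding or dimensionality-reduction step is not producing separable structure for it to find.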
My Setup:
~3.8 million documents
~4,300 topics
How can I reduce the overwhelming assignment to topics 0 and -1? Does the problem arise during the UMAP step, or later, during HDBSCAN? Is there a best practice for interpreting the document-topic matrix in this kind of situation?
Any help or guidance would be much appreciated!
I also attach the visualize_topics() output, which does not contain the outliers:
