Replies: 2 comments 3 replies
-
Can you share the full code for initializing BERTopic? That might already help a bit. With the information you provided it might relate to the data itself and how your embedding model embeds the documents. Have you checked which documents are in topic 0? Inspecting the topics themselves through the documents is likely to reveal why these documents are pushed towards that cluster.
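That inspection can be sketched as follows (a minimal sketch; `docs` and `topics` are hypothetical stand-ins for your corpus and the topic assignments returned by `fit_transform`):

```python
from collections import Counter

# Hypothetical corpus and per-document topic ids, as returned by
# topics, probs = topic_model.fit_transform(docs)
docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
topics = [0, -1, 0, 1, 0]

# Count documents per topic to spot a dominant cluster
print(Counter(topics))

# Pull the raw documents assigned to topic 0 for manual inspection
topic0_docs = [d for d, t in zip(docs, topics) if t == 0]
print(topic0_docs)
```

Reading a sample of the topic 0 documents side by side with its top keywords usually makes it clear whether the cluster is a genuine theme or a catch-all for generic language.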
-
Hi @MaartenGr, thank you so much for maintaining the documentation and keeping the discussions live. I have found so many answers and helpful tips in the discussions! I would like to follow up on a similar issue.

My data consists of YouTube video transcripts (at the testing stage I have around 2,000 transcripts, which end up as over 24,000 shorter chunks), so the data is naturally very noisy. I end up with a huge number of outliers (around 40-50% of the data), regardless of whether I chunk the long transcripts into 500-word documents. Unfortunately, the outlier reduction methods do not seem to help much; whilst the number of outliers is reduced, they all move to topic 0. This is a bit different for the embeddings-based outlier reduction strategy, but most of the outliers are still moved into topic 0.

I have been wondering whether I could tune the HDBSCAN parameters to reduce the number of outliers. I have also been testing KMeans, and in my case the results look promising, but I am at a loss with the current approach.

Here is my BERTopic initialisation code, including the first HDBSCAN setting that did not yield an improvement in the results:

```python
# Set up BERTopic
topic_model = BERTopic(

# Fit model and get topics
topics, probs = topic_model.fit_transform(all_chunks)
```
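On the KMeans observation: unlike HDBSCAN, KMeans partitions every point, so it cannot produce a -1 outlier label by construction. A minimal sketch on synthetic data (the array shapes and cluster count here are illustrative, and the random embeddings merely stand in for reduced document embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for reduced document embeddings (e.g. UMAP output)
embeddings = rng.normal(size=(200, 5))

# KMeans assigns every point to some cluster: no -1 label can appear
labels = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(embeddings)

print(-1 in labels)      # outlier label never occurs
print(len(set(labels)))  # number of clusters found
```

This is also why swapping KMeans into BERTopic makes the outlier count drop to zero: the "outliers" are not removed, they are absorbed into the nearest centroid, so the per-topic keyword quality is worth checking afterwards.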
-
Hello,
I'm using BERTopic on a large dataset (~3.8 million documents), and I'm encountering an issue where a very large number of documents are assigned to topic -1 (outliers) or topic 0. In particular, around 2.9 million documents go into topic -1 and around 4.7K into topic 0. These two topics seem to contain very common words in the corpus. Topic 1 drops to 3.2K documents and contains more specific keywords.
I've tried analyzing this behavior by inspecting and plotting the document-topic probability matrix, where each row is a document and each column is a topic. I mainly inspected the distribution of the maximum topic probability per document. I attach the resulting graph below, plus the distributions of the max probabilities for topics -1, 0, 1, 2, 3, 10, 20 and 1000.
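For reference, that inspection can be sketched on a synthetic matrix (the real one would come from `fit_transform` with `calculate_probabilities=True`; the sizes and the Dirichlet sampling here are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical document-topic probability matrix: 1000 docs x 50 topics,
# each row a probability distribution over topics
probs = rng.dirichlet(np.ones(50), size=1000)

# Max topic probability per document: a pile-up of low values suggests
# that many documents are only weakly attached to any topic
max_probs = probs.max(axis=1)
print(max_probs.mean())
print((max_probs < 0.1).mean())  # fraction of weakly assigned documents
```

Plotting a histogram of `max_probs` then shows directly whether the mass sits near zero (weak assignments, likely outlier candidates) or near one (confident assignments).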
However, I'm having trouble understanding why so many documents are pushed to -1 or 0, and what the central cluster around topic 0 really represents. It seems like topic 0 should be divided into subtopics, since it contains general information.
What I’ve Tried:
The latter method seems to improve the results slightly: topic -1 disappears, and topic 0 contains around 4.8K documents, followed by 29K in topic 1. Yet the results are not satisfactory, since this topic 0 cluster still contains a lot of documents and hides information. I suspect this may be due to:
A large, central cluster in the embedding space. The embeddings seem to form one big central cluster, and I suspect that HDBSCAN cannot find clusters in these low-quality embeddings. I wanted to inspect the probabilities to understand the root of the problem.
Weak separation between topics
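One way to test the weak-separation hypothesis directly is a silhouette score on the (reduced) embeddings. A sketch with synthetic data, where `make_blobs` merely stands in for the UMAP output and all sizes are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated synthetic clusters vs. a single diffuse central blob
X_sep, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)
X_diffuse = np.random.default_rng(0).normal(size=(300, 2))

scores = []
for X in (X_sep, X_diffuse):
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# High scores indicate tight, well-separated clusters;
# one central blob forced into k clusters scores low
print([round(s, 2) for s in scores])
```

If the real embeddings score close to the diffuse case, the problem is upstream of HDBSCAN: the embedding or dimensionality-reduction step is not producing separable structure for it to find.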
My Setup:
~3.8 million documents
~4,300 topics
How can I reduce the overwhelming assignment to topics 0 and -1? Does the problem arise during the UMAP step, or later, during HDBSCAN? Is there a best practice for interpreting the document-topic matrix in this kind of situation?
Any help or guidance would be much appreciated!
I also attach the visualize_topics() output, which does not contain the outliers:
