Consistency between initial clustering and approximate_distribution() similarity calculations #2404

Eric-Fithian · 2025-08-04T19:09:47Z

Eric-Fithian
Aug 4, 2025

When using approximate_distribution(), should the similarity calculation method match the approach used in initial clustering for consistency?

Specifically:

Initial BERTopic clustering uses embeddings → clustering → c-TF-IDF topic representations
approximate_distribution() can use either c-TF-IDF (through Bag-o-Words?) or embeddings (use_embedding_model parameter)

If documents were originally clustered based on semantic similarity in embedding space, could using c-TF-IDF without embeddings for distribution calculation lead to inconsistent results? Or does the method automatically handle this appropriately?

Answered by MaartenGr

MaartenGr · 2025-08-05T09:41:20Z

MaartenGr
Aug 5, 2025
Maintainer

> When using approximate_distribution(), should the similarity calculation method match the approach used in initial clustering for consistency?

No, it is a different technique and appropriately named "approximate..." to indicate that it is not 1:1 with the initial clustering.

> Initial BERTopic clustering uses embeddings → clustering → c-TF-IDF topic representations

Note that it is actually embeddings -> dimensionality reduction -> clustering -> Bag-of-Words -> c-TF-IDF
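To make the last two steps of that pipeline concrete, here is a minimal pure-Python sketch of Bag-of-Words followed by c-TF-IDF on documents that have already been clustered. It is only an illustration, not BERTopic's implementation: the function name `c_tf_idf`, the toy documents, and the labels are made up, and the weighting assumes the commonly cited form (term frequency in the cluster times log(1 + average words per cluster / total term frequency)) using raw counts without the normalization a library would apply.

```python
import math
from collections import Counter

def c_tf_idf(docs, labels):
    """Toy Bag-of-Words -> c-TF-IDF over already-clustered documents."""
    # Bag-of-Words per cluster: pool each cluster's documents and count terms.
    bow = {}
    for doc, label in zip(docs, labels):
        bow.setdefault(label, Counter()).update(doc.lower().split())

    # Term frequency across all clusters and the average word count per cluster.
    total = Counter()
    for counts in bow.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(bow)

    # Weight = count in cluster * log(1 + avg_words / total count of the term).
    return {
        label: {t: tf * math.log(1 + avg_words / total[t]) for t, tf in counts.items()}
        for label, counts in bow.items()
    }

docs = ["cats purr and cats nap", "dogs bark and dogs run", "cats chase mice"]
labels = [0, 1, 0]  # hypothetical stand-in for the output of the clustering step
weights = c_tf_idf(docs, labels)
top_terms = {label: max(w, key=w.get) for label, w in weights.items()}
print(top_terms)
```

Terms shared by both clusters, such as "and", receive a lower weight than terms concentrated in one cluster, which is what makes the representation describe a topic rather than a single document.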

> If documents were originally clustered based on semantic similarity in embedding space, could using c-TF-IDF without embeddings for distribution calculation lead to inconsistent results?

Yes and no. You are, to a certain extent, comparing apples to oranges. Specifically, the initial clustering of BERTopic is done at the document level, whereas approximate_distribution assigns topics as a distribution within each document. The two tasks are related but not exactly the same.

You would indeed expect the most common topic in approximate_distribution to match that of the initial clustering, but that will not always be the case, since this method is not primarily guided by the initial clusters.
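One way to see how often the two actually agree is to compare the highest-scoring topic in each document's distribution with the topic from the initial clustering. The sketch below is a hypothetical helper, not part of BERTopic; it assumes that column i of the distribution corresponds to topic i and that outliers are labeled -1. The toy lists stand in for `topic_model.topics_` and the first return value of `topic_model.approximate_distribution(docs)`.

```python
def dominant_topic_agreement(topics, topic_distr):
    """Share of documents whose highest-scoring topic in the distribution
    equals the topic assigned by the initial clustering."""
    matches, compared = 0, 0
    for assigned, row in zip(topics, topic_distr):
        # Skip outliers (-1) and documents where no topic scored above zero.
        if assigned == -1 or max(row) <= 0:
            continue
        compared += 1
        dominant = max(range(len(row)), key=row.__getitem__)
        matches += dominant == assigned
    return matches / compared if compared else float("nan")

# Hypothetical values; with a fitted model these would come from BERTopic.
topics = [0, 1, 1, -1, 2]
topic_distr = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],  # dominant topic is 0, but the document was clustered into 1
    [0.0, 0.0, 0.0],
    [0.2, 0.2, 0.6],
]
agreement = dominant_topic_agreement(topics, topic_distr)
print(agreement)  # 0.75
```

A value below 1.0 is not necessarily a problem; it mostly shows how far the token-level approximation drifts from the document-level assignment on a given corpus.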

2 replies

Eric-Fithian Aug 5, 2025
Author

Thank you for the clarification. That makes sense. Does this mean that when the use_embedding_model flag is set, each token set is compared 1:1 against each topic cluster?

MaartenGr Aug 17, 2025
Maintainer

It uses cosine similarity to compare the embedding of each token set with each topic embedding.
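As a rough illustration of that comparison, the pure-Python sketch below slides a window over a document's tokens, embeds each token set, and scores it against every topic embedding with cosine similarity. Everything here is an assumption made for the example: the helper names, the toy two-dimensional word vectors, the mean-of-vectors "embedding", and the way scores are summed and normalized. It shows the idea only and is not BERTopic's actual code.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def token_sets(tokens, window=4, stride=1):
    """Sliding windows of consecutive tokens."""
    last_start = max(len(tokens) - window, 0)
    return [tokens[i:i + window] for i in range(0, last_start + 1, stride)]

def embed(token_set, word_vectors):
    """Toy embedding: mean of per-word vectors (a real embedding model would go here)."""
    vecs = [word_vectors[t] for t in token_set]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def approximate(tokens, word_vectors, topic_embeddings, window=2, min_similarity=0.1):
    """Sum each token set's similarity to every topic, then normalize to a distribution."""
    scores = [0.0] * len(topic_embeddings)
    for ts in token_sets(tokens, window):
        vec = embed(ts, word_vectors)
        for k, topic_vec in enumerate(topic_embeddings):
            sim = cosine(vec, topic_vec)
            if sim >= min_similarity:  # drop weak matches
                scores[k] += sim
    total = sum(scores)
    return [s / total for s in scores] if total else scores

# Hypothetical 2-D vectors: axis 0 is "cat-like", axis 1 is "dog-like".
word_vectors = {"cats": [1.0, 0.0], "purr": [0.9, 0.1], "dogs": [0.0, 1.0], "bark": [0.1, 0.9]}
topic_embeddings = [[1.0, 0.0], [0.0, 1.0]]
dist = approximate(["cats", "purr", "cats", "bark"], word_vectors, topic_embeddings)
print(dist)
```

Because every token set is scored against every topic, a single document can carry weight for several topics, which is the difference from the one-topic-per-document result of the initial clustering.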
