Consistency between initial clustering and approximate_distribution() similarity calculations #2404
When using Specifically:
If documents were originally clustered based on semantic similarity in embedding space, could using c-TF-IDF without embeddings for the distribution calculation lead to inconsistent results? Or does the method automatically handle this appropriately?
No, it is a different technique and appropriately named "approximate..." to indicate that it is not 1:1 with the initial clustering.
Note that it is actually embeddings -> dimensionality reduction -> clustering -> Bag-of-Words -> c-TF-IDF
Yes and no. You are, to a certain extent, comparing apples to oranges. Specifically, the initial clustering in BERTopic is done at the document level, whereas approximate_distribution assigns topics as a distribution within each document. The two tasks are related but not exactly the same. You indeed would want to expect that the most common topic in…
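To make the difference between document-level clustering and a within-document distribution more concrete, here is a rough, dependency-free sketch of the general idea: slide a token window over a document, compare each window's bag-of-words to per-topic term weights, then sum and normalize the similarities. This is not BERTopic's actual implementation. The `TOPIC_WEIGHTS` values, the helper names, and the whitespace tokenizer are all made up for illustration, and the window and stride defaults are just assumptions for the toy.

```python
import math
from collections import Counter

# Hypothetical per-topic term weights standing in for c-TF-IDF vectors.
# These numbers are invented for illustration only.
TOPIC_WEIGHTS = {
    0: {"cat": 2.0, "dog": 1.5, "pet": 1.0},
    1: {"stock": 2.0, "market": 1.5, "price": 1.0},
}


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight mappings."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def token_windows(tokens, window=4, stride=1):
    """Split a token list into overlapping windows."""
    if len(tokens) <= window:
        return [tokens]
    return [tokens[i:i + window] for i in range(0, len(tokens) - window + 1, stride)]


def approximate_topic_distribution(doc, topic_weights, window=4, stride=1):
    """Sum window-to-topic similarities and normalize them to sum to 1."""
    tokens = doc.lower().split()  # naive whitespace tokenizer (assumption)
    totals = {topic: 0.0 for topic in topic_weights}
    for chunk in token_windows(tokens, window, stride):
        bow = Counter(chunk)
        for topic, weights in topic_weights.items():
            totals[topic] += cosine(bow, weights)
    norm = sum(totals.values())
    return {topic: (score / norm if norm else 0.0) for topic, score in totals.items()}


doc = "my cat chased the dog while the stock market fell"
distribution = approximate_topic_distribution(doc, TOPIC_WEIGHTS)
dominant_topic = max(distribution, key=distribution.get)
```

A document-level clustering step would give this document a single label, while the sketch spreads weight over both toy topics because different windows match different vocabularies. That is the sense in which the two outputs are related but not 1:1.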