When using approximate_distribution(), should the similarity calculation method match the approach used in initial clustering for consistency?

No. It is a different technique, and the name "approximate..." signals that it is not 1:1 with the initial clustering.
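
To make the difference concrete, here is a minimal sketch of the sliding-window idea behind this kind of approximation, not BERTopic's actual implementation. It assumes the document is split into overlapping token windows, each window is compared to every topic's term-weight vector with cosine similarity, and the similarities are summed and normalized. `TOPIC_WEIGHTS`, `cosine`, and `approximate_topic_distribution` are hypothetical names, the topic weights are made-up toy values, and raw token counts stand in for the fitted c-TF-IDF transform the library would apply.

```python
import math
from collections import Counter

# Hypothetical toy topic-term weights standing in for fitted c-TF-IDF vectors.
TOPIC_WEIGHTS = {
    0: {"cat": 0.9, "dog": 0.8, "pet": 0.5},
    1: {"stock": 0.9, "market": 0.8, "price": 0.6},
}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def approximate_topic_distribution(document, topic_weights,
                                   window=4, stride=1, min_similarity=0.1):
    """Approximate a topic distribution from term overlap only (no embeddings)."""
    tokens = document.lower().split()
    # Overlapping token windows; a short document yields a single window.
    last_start = max(len(tokens) - window, 0)
    windows = [tokens[i:i + window] for i in range(0, last_start + 1, stride)]

    scores = {topic: 0.0 for topic in topic_weights}
    for chunk in windows:
        counts = Counter(chunk)  # assumption: raw counts instead of c-TF-IDF
        for topic, weights in topic_weights.items():
            similarity = cosine(counts, weights)
            if similarity >= min_similarity:
                scores[topic] += similarity

    # Normalize the summed similarities so they add up to 1 (or stay all zero).
    total = sum(scores.values())
    return {t: (s / total if total else 0.0) for t, s in scores.items()}


dist = approximate_topic_distribution(
    "my cat and dog watched the stock market", TOPIC_WEIGHTS
)
```

Nothing in this sketch touches the embedding space, which is why the result can differ from the cluster a document was originally assigned to. BERTopic's documentation also describes a `use_embedding_model` option on `approximate_distribution()` for comparing windows through embeddings instead.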

Initial BERTopic clustering uses embeddings → clustering → c-TF-IDF topic representations

Note that the full pipeline is actually embeddings → dimensionality reduction → clustering → Bag-of-Words → c-TF-IDF.
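
The last two stages of that pipeline can be sketched in plain Python. This is an illustration under assumptions rather than the library's code: cluster labels are taken as given (the embedding, dimensionality reduction, and clustering stages are not reproduced), tokenization is a simple whitespace split, and `class_tfidf` is a hypothetical helper name. The weighting follows the commonly documented c-TF-IDF form, term frequency within a cluster multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters.

```python
import math
from collections import Counter


def class_tfidf(documents, labels):
    """Return {cluster: {term: weight}} using a c-TF-IDF style weighting."""
    # Bag-of-Words step: merge all documents of a cluster into one term-count bag.
    bags = {}
    for doc, label in zip(documents, labels):
        bags.setdefault(label, Counter()).update(doc.lower().split())

    # Frequency of each term across all clusters, and average words per cluster.
    total_freq = Counter()
    for bag in bags.values():
        total_freq.update(bag)
    avg_words = sum(total_freq.values()) / len(bags)

    # c-TF-IDF step: normalized in-cluster frequency times log(1 + A / f).
    weights = {}
    for label, bag in bags.items():
        size = sum(bag.values())
        weights[label] = {
            term: (count / size) * math.log(1 + avg_words / total_freq[term])
            for term, count in bag.items()
        }
    return weights


docs = ["cat dog", "dog cat pet", "stock market", "market price stock"]
labels = [0, 0, 1, 1]  # assumed output of the clustering stage
topic_terms = class_tfidf(docs, labels)
```

The point of the sketch is that topic representations are built from word counts per cluster, after the clusters already exist, so they describe the clusters without having produced them.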

If documents were originally clustered based on semantic similarity in embedding space, could using c-TF-IDF without embeddings for distribution calculation lead to inconsistent results?

Yes and no. You are, to a certai…

@Eric-Fithian
@MaartenGr
Answer selected by Eric-Fithian