Replies: 1 comment 2 replies
-
If you have a pre-clustered set of documents, I would personally advise using Manual BERTopic instead. You can pass your clusters there and use any of the existing representation models that are included within BERTopic. That also means you could use several representation models at the same time. Note that plain KeyBERT is different from how it was implemented within BERTopic, since the implementation in BERTopic works on a small but representative subset of the documents in each topic rather than all of them. More specifically, running KeyBERT on all your clustered documents would create a bit of a mess, since the input is either too long (when concatenating all documents) or you would get hundreds of keywords (when running KeyBERT on each document separately).
-
hello,
I am working on a text clustering and topic extraction project. I use a Single-Pass algorithm for incremental clustering of streaming text, which successfully groups documents into clusters. My current challenge is to extract high-quality topic words (Top N Words) for each of these pre-existing clusters.
For this topic word extraction step, I am evaluating two primary methods:
1. Using KeyBERT directly on all documents within a cluster.
2. Using BERTopic's representation models (such as KeyBERTInspired) by fitting BERTopic on a single cluster at a time (forcing n_clusters=1).
I am seeking guidance on which method is more suitable in terms of effectiveness and efficiency, or if there is a better integrated approach within BERTopic.
Current Approach and Code Snippet
My current implementation uses the second method. Since I already have the documents clustered, I attempt to use BERTopic to generate topic words for a single cluster by setting the clustering model to KMeans(n_clusters=1).
Here is my code:
I would greatly appreciate your insights on the following:
For extracting topic words from a pre-clustered set of documents, which method generally yields better results in terms of accuracy and interpretability: using KeyBERT directly or using BERTopic (with n_clusters=1) and its representation_model? What are the theoretical or practical reasons for the difference?
If there is anything more that I can provide, please let me know, and thanks!