Replies: 1 comment 2 replies
-
If you have a pre-clustered set of documents, I would personally advise using Manual BERTopic instead. You can pass your clusters there and use any of the existing representation models that are included within BERTopic. That also means you could use several representation models at the same time. Note that plain KeyBERT is different from how it was implemented within BERTopic, since the implementation in BERTopic works on a small but representative subset of the documents in each topic rather than all of them. More specifically, running KeyBERT on all your clustered documents would create a bit of a mess, since the input is either too long (when concatenating all documents) or you would get hundreds of keywords (when running KeyBERT on each document separately).
-
hello,
I am working on a text clustering and topic extraction project. I use a Single-Pass algorithm for incremental clustering of streaming text, which successfully groups documents into clusters. My current challenge is to extract high-quality topic words (Top N Words) for each of these pre-existing clusters.
For this topic word extraction step, I am evaluating two primary methods:
1. Using KeyBERT directly on all documents within a cluster.
2. Using BERTopic's representation models (such as KeyBERTInspired) by fitting BERTopic on a single cluster at a time (forcing n_clusters=1).
I am seeking guidance on which method is more suitable in terms of effectiveness and efficiency, or if there is a better integrated approach within BERTopic.
Current Approach and Code Snippet
My current implementation uses the second method. Since I already have the documents clustered, I attempt to use BERTopic to generate topic words for a single cluster by setting the clustering model to KMeans(n_clusters=1).
Here is my code:
I would greatly appreciate your insights on the following:
For extracting topic words from a pre-clustered set of documents, which method generally yields better results in terms of accuracy and interpretability: using KeyBERT directly or using BERTopic (with n_clusters=1) and its representation_model? What are the theoretical or practical reasons for the difference?
If there is anything more that I can provide, please let me know, and thanks!