Description
Have you searched existing issues? 🔎
- I have searched and found no existing issues
Describe the bug
When using zero-shot topic modeling with the nr_topics parameter set, and all (or most) documents are assigned to zero-shot topics, the topic_sizes_ attribute remains empty and get_topic_info() returns an empty DataFrame.
Root cause:
The issue occurs in the fit_transform() method when all documents are assigned to zero-shot topics. The code has this conditional:
if len(documents) > 0:
    # Cluster reduced embeddings
    documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
    if self._is_zeroshot() and len(assigned_documents) > 0:
        documents, embeddings = self._combine_zeroshot_topics(
            documents, embeddings, assigned_documents, assigned_embeddings
        )
else:
    # All documents matches zero-shot topics
    documents = assigned_documents
    embeddings = assigned_embeddings
When all documents are assigned to zero-shot topics, len(documents) == 0, so the else branch is taken. However, this branch never calls _update_topic_size(), which is normally called within _cluster_embeddings() or _combine_zeroshot_topics().
Additionally, when nr_topics is specified, the fallback call to _sort_mappings_by_frequency() (which would call _update_topic_size()) is skipped because the condition if not self.nr_topics: evaluates to False.
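To make the two skipped paths easier to see, here is a minimal sketch of the control flow described above. ToyModel and its fit() method are hypothetical stand-ins written for this issue, not BERTopic code; only the branching mirrors the description.

```python
from collections import Counter


class ToyModel:
    """Hypothetical stand-in for the control flow described above; not BERTopic code."""

    def __init__(self, nr_topics=None):
        self.nr_topics = nr_topics
        self.topic_sizes_ = {}

    def _update_topic_size(self, topics):
        # Count how many documents landed in each topic.
        self.topic_sizes_ = Counter(topics)

    def fit(self, unassigned_topics, zeroshot_topics):
        if len(unassigned_topics) > 0:
            # Clustering path: sizes are refreshed here.
            topics = unassigned_topics + zeroshot_topics
            self._update_topic_size(topics)
        else:
            # Every document matched a zero-shot topic: no size update happens.
            topics = zeroshot_topics
        if not self.nr_topics:
            # Fallback path: refreshes sizes, but only when nr_topics is unset.
            self._update_topic_size(topics)
        return topics


with_nr = ToyModel(nr_topics=4)
with_nr.fit([], [0, 1, 2, 0])

without_nr = ToyModel()
without_nr.fit([], [0, 1, 2, 0])

print(with_nr.topic_sizes_)     # {}
print(without_nr.topic_sizes_)  # Counter({0: 2, 1: 1, 2: 1})
```

With nr_topics set and nothing left to cluster, neither branch reaches _update_topic_size(), which matches the empty topic_sizes_ reported here.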
Expected behavior:
- topic_sizes_ should contain the count of documents per topic
- get_topic_info() should return a populated DataFrame with topic information
Actual behavior:
- topic_sizes_ is an empty dictionary {}
- get_topic_info() returns an empty DataFrame
- topic_representations_ works correctly (populated as expected)
Note: The issue does NOT occur when nr_topics is not specified, because _sort_mappings_by_frequency() gets called, which internally calls _update_topic_size().
Reproduction
from bertopic import BERTopic

# Sample documents and zero-shot topics
docs = ["I need help with my voucher", "Gift card not working", "Customer service was poor"] * 50
zeroshot_topics = ["Voucher inquiries", "Gift card issues", "Customer service feedback"]

# BUG: Setting nr_topics causes the issue
model = BERTopic(
    zeroshot_topic_list=zeroshot_topics,
    zeroshot_min_similarity=-1,  # Force all documents to zero-shot assignment
    nr_topics=4,  # This triggers the bug
)
topics, _ = model.fit_transform(docs)

# Demonstrates the bug
print("topic_sizes_:", model.topic_sizes_)  # Expected: Counter with topic counts, Actual: {}
print("topic_representations_ works:", bool(model.topic_representations_)) # This works: True
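
As a stopgap, the per-topic counts can be recovered outside the model from the topics list that fit_transform() returns. This is only a sketch: topic_sizes_from_assignments is a hypothetical helper, not part of BERTopic, the example list stands in for the real topics output, and it does not repair get_topic_info().

```python
from collections import Counter


def topic_sizes_from_assignments(topics):
    """Count documents per topic ID from a list of per-document topic assignments."""
    return Counter(int(t) for t in topics)


# Stand-in for the `topics` list from the reproduction (values are illustrative).
example_topics = [0, 1, 2] * 50
sizes = topic_sizes_from_assignments(example_topics)
print(sizes)
```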
BERTopic Version
0.17.0