Zero-shot topic modeling with nr_topics parameter results in empty topic_sizes_ and get_topic_info() #2384

@GiannisKav

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

When using zero-shot topic modeling with the nr_topics parameter set, and all (or most) documents are assigned to zero-shot topics, the topic_sizes_ attribute remains empty and get_topic_info() returns an empty DataFrame.

Root cause:
The issue occurs in the fit_transform() method when all documents are assigned to zero-shot topics. The code has this conditional:

if len(documents) > 0:
    # Cluster reduced embeddings
    documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
    if self._is_zeroshot() and len(assigned_documents) > 0:
        documents, embeddings = self._combine_zeroshot_topics(
            documents, embeddings, assigned_documents, assigned_embeddings
        )
else:
    # All documents matches zero-shot topics
    documents = assigned_documents
    embeddings = assigned_embeddings

When all documents are assigned to zero-shot topics, len(documents) == 0, so the else branch is taken. However, this branch never calls _update_topic_size(), which is normally called within _cluster_embeddings() or _combine_zeroshot_topics().

Additionally, when nr_topics is specified, the fallback call to _sort_mappings_by_frequency() (which would call _update_topic_size()) is skipped because if not self.nr_topics: evaluates to False.
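To make the control flow above easier to follow, here is a minimal toy sketch of the branching. It is not BERTopic's code: `ToyTopicModel`, `_cluster`, and `fit` are hypothetical stand-ins, and only the shape of the two conditionals is taken from the description above.

```python
from collections import Counter


class ToyTopicModel:
    """Hypothetical stand-in that mirrors only the branching described above."""

    def __init__(self, nr_topics=None):
        self.nr_topics = nr_topics
        self.topic_sizes_ = {}

    def _update_topic_size(self, topics):
        # Count documents per topic id.
        self.topic_sizes_ = Counter(topics)

    def _cluster(self, topics):
        # Stand-in for the clustering path, which updates sizes as a side effect.
        self._update_topic_size(topics)
        return topics

    def fit(self, unassigned_topics, assigned_topics):
        if len(unassigned_topics) > 0:
            topics = self._cluster(unassigned_topics + assigned_topics)
        else:
            # Every document matched a zero-shot topic: sizes are not updated here.
            topics = assigned_topics
        if not self.nr_topics:
            # Stand-in for the frequency-sorting fallback, which also updates sizes.
            self._update_topic_size(topics)
        return topics


# All 150 documents are "assigned"; nothing is left to cluster.
with_nr = ToyTopicModel(nr_topics=4)
with_nr.fit([], [0, 1, 2] * 50)
print(with_nr.topic_sizes_)  # {}

without_nr = ToyTopicModel()
without_nr.fit([], [0, 1, 2] * 50)
print(without_nr.topic_sizes_)  # Counter({0: 50, 1: 50, 2: 50})
```

In this toy model, sizes stay empty only when both conditions hold: the else branch is taken and nr_topics is set.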

Expected behavior:

  • topic_sizes_ should contain the count of documents per topic
  • get_topic_info() should return a populated DataFrame with topic information

Actual behavior:

  • topic_sizes_ is an empty dictionary {}
  • get_topic_info() returns an empty DataFrame
  • topic_representations_ works correctly (populated as expected)

Note: The issue does NOT occur when nr_topics is not specified, because _sort_mappings_by_frequency() gets called, which internally calls _update_topic_size().
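For reference, the expected per-topic counts can be derived directly from the per-document topic assignments. The sketch below is a standalone helper, not part of BERTopic; `count_topic_sizes` is a hypothetical name, and the sample assignments are illustrative values rather than real model output.

```python
from collections import Counter


def count_topic_sizes(topics):
    """Count documents per topic id from a list of per-document assignments."""
    return Counter(int(t) for t in topics)


# Illustrative assignments for 150 documents spread evenly over three topics.
topics = [0, 1, 2] * 50
sizes = count_topic_sizes(topics)
print(sizes)

# Possible workaround (an untested assumption, not a confirmed fix):
# model.topic_sizes_ = count_topic_sizes(model.topics_)
```

Whether assigning the result back to the model is enough to make get_topic_info() return a populated DataFrame has not been verified here.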

Reproduction

from bertopic import BERTopic

# Sample documents and zero-shot topics
docs = ["I need help with my voucher", "Gift card not working", "Customer service was poor"] * 50
zeroshot_topics = ["Voucher inquiries", "Gift card issues", "Customer service feedback"]

# BUG: Setting nr_topics causes the issue
model = BERTopic(
    zeroshot_topic_list=zeroshot_topics,
    zeroshot_min_similarity=-1,  # Force all documents to zero-shot assignment
    nr_topics=4,  # This triggers the bug
)

topics, _ = model.fit_transform(docs)

# Demonstrates the bug
print("topic_sizes_:", model.topic_sizes_)  # Expected: Counter with topic counts, Actual: {}
print("topic_representations_ works:", bool(model.topic_representations_))  # This works: True

BERTopic Version

0.17.0

Labels

bug