What's the best approach to conduct a stability and reproducibility test for BERTopic? #2376

NokeYuan · 2025-06-09T14:01:07Z

NokeYuan
Jun 9, 2025

Hi Maarten,

I plan to use BERTopic for my research paper and would like to hear your thoughts on its reproducibility. Currently, I’m using UMAP + KMeans, since it allows me to control the number of topics. One of the sections in my paper discusses reproducibility, and I came across a [paper] that defines it as: “The term ‘stability’ refers to a topic model’s ability to produce solutions that are partially or completely identical across multiple runs with different random initializations.”

To investigate this, I fixed all parameters and only changed the random seed for UMAP. I then generated the topic keywords and computed topic similarity scores (in this case: Jaccard, Dice, and Spearman, as the paper suggested). However, I found that all the scores were relatively low (below 0.5). Some generated topics did not match in the same order — for example, what I originally assumed would be topic_1 from model A would differ from topic_1 in model B, showing variation in several key words.

I’d like to get your thoughts on the best approach for conducting a reproducibility study using BERTopic. For example, should I fix the UMAP random_state and instead bootstrap the data 100 times? I just want to make sure I’m not overcomplicating something that shouldn’t be.

from typing import List, Dict
import numpy as np
from scipy.stats import spearmanr
from scipy.optimize import linear_sum_assignment
from rbo import RankingSimilarity
import warnings


def evaluate_model_stability_max_match(
    model_a,
    model_b,
    top_n_words: int = 10,
    mode: str = "pairwise",  # Options: "pairwise", "recurrent", "flatten"
) -> Dict[str, float]:
    """
    Compare BERTopic models using appearance-based metrics with different matching strategies.
    Supports weighted averaging for pairwise mode using topic counts.
    """
    print(f"[INFO] Comparing models using mode='{mode}'...")

    def jaccard(set1, set2):
        return len(set1 & set2) / len(set1 | set2) if set1 | set2 else 0

    def dice(set1, set2):
        return 2 * len(set1 & set2) / (len(set1) + len(set2)) if set1 or set2 else 0

    def spearman(list1, list2):
        word_to_rank = {word: i for i, word in enumerate(list1)}
        ranks = [word_to_rank.get(word, len(list1)) for word in list2]
        ref = list(range(len(ranks)))

        if len(set(ranks)) <= 1:
            return 0.0, False

        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            corr, _ = spearmanr(ref, ranks)

        if np.isnan(corr):
            return 0.0, False

        return corr, True

    # Extract top-N keywords
    topics_a = [
        [word for word, _ in model_a.get_topic(i)[:top_n_words]]
        for i in range(len(model_a.get_topics()) - 1)
    ]
    topics_b = [
        [word for word, _ in model_b.get_topic(i)[:top_n_words]]
        for i in range(len(model_b.get_topics()) - 1)
    ]

    if mode == "flatten":
        flat_a = set(word for topic in topics_a for word in topic)
        flat_b = set(word for topic in topics_b for word in topic)
        return {
            "jaccard_all_words": jaccard(flat_a, flat_b),
            "dice_all_words": dice(flat_a, flat_b),
            "num_topics": len(topics_a),
            "matching": "flattened vocabulary comparison"
        }

    elif mode == "recurrent":
        jaccard_scores, dice_scores, spearman_scores, rbo_scores = [], [], [], []

        for i, ta in enumerate(topics_a):
            set_a = set(ta)
            scores = []
            for j, tb in enumerate(topics_b):
                sp_score, _ = spearman(ta, tb)
                scores.append((
                    jaccard(set_a, set(tb)),
                    dice(set_a, set(tb)),
                    sp_score,
                    RankingSimilarity(ta, tb).rbo(p=0.9),
                    j  # Save B index
                ))

            best = max(scores, key=lambda x: x[0])
            best_jaccard, best_dice, best_spearman, best_rbo, best_j = best

            print(f"[MATCHED RECURRENT] A#{i} ↔ B#{best_j} | Jaccard = {best_jaccard:.3f} | Spearman = {best_spearman:.3f}")

            jaccard_scores.append(best_jaccard)
            dice_scores.append(best_dice)
            spearman_scores.append(best_spearman)
            rbo_scores.append(best_rbo)

        return {
            "avg_jaccard": np.mean(jaccard_scores),
            "avg_dice": np.mean(dice_scores),
            "avg_spearman": np.mean(spearman_scores),
            "avg_rbo": np.mean(rbo_scores),
            "num_topics": len(topics_a),
            "matching": "recurrent (1-to-many best match)"
        }

    elif mode == "pairwise":
        num_topics = len(topics_a)
        jaccard_matrix = np.zeros((num_topics, num_topics))
        dice_matrix = np.zeros((num_topics, num_topics))
        spearman_matrix = np.zeros((num_topics, num_topics))
        rbo_matrix = np.zeros((num_topics, num_topics))

        for i, ta in enumerate(topics_a):
            for j, tb in enumerate(topics_b):
                jaccard_matrix[i, j] = jaccard(set(ta), set(tb))
                dice_matrix[i, j] = dice(set(ta), set(tb))
                sp_score, _ = spearman(ta, tb)
                spearman_matrix[i, j] = sp_score
                rbo_matrix[i, j] = RankingSimilarity(ta, tb).rbo(p=0.9)

        row_ind, col_ind = linear_sum_assignment(-jaccard_matrix)

        topic_info = model_a.get_topic_info()
        topic_count_map = dict(zip(topic_info.Topic, topic_info.Count))
        topic_weights = np.array([topic_count_map.get(i, 1) for i in row_ind])

        for i, j in zip(row_ind, col_ind):
            print(f"[MATCHED JACCARD] A#{i} ↔ B#{j} | Jaccard = {jaccard_matrix[i,j]:.3f} | Spearman = {spearman_matrix[i,j]:.3f}")

        return {
            "avg_jaccard": np.average([jaccard_matrix[i, j] for i, j in zip(row_ind, col_ind)], weights=topic_weights),
            "avg_dice": np.average([dice_matrix[i, j] for i, j in zip(row_ind, col_ind)], weights=topic_weights),
            "avg_spearman": np.average([spearman_matrix[i, j] for i, j in zip(row_ind, col_ind)], weights=topic_weights),
            "avg_rbo": np.average([rbo_matrix[i, j] for i, j in zip(row_ind, col_ind)], weights=topic_weights),
            "num_topics": num_topics,
            "matching": "pairwise (Hungarian 1-to-1, weighted)"
        }

    else:
        raise ValueError("Mode must be one of: 'pairwise', 'recurrent', or 'flatten'")

Thank you so much for helping.

MaartenGr · 2025-06-18T13:22:02Z

MaartenGr
Jun 18, 2025
Maintainer

An approach that I often see to get this "stability", is to indeed run UMAP several times with varying degrees of random states and/or other parameters. Then, the resulting reduced embeddings are typically averaged to get a more "stable" representation. Those averaged embeddings are then passed to BERTopic where you should then skip the dimensionality reduction step.

2 replies

NokeYuan Jun 18, 2025
Author

Hi Maarten, thanks for the reply. I think it makes sense in theory, and I will try it. For this approach, after I obtain the model with the averaged UMAP embeddings, do you think it is still necessary to conduct the reproducibility tests (e.g., using the Jaccard scores, Dice, and Spearman, etc.), or should I just carry on with the analysis? I can see this approach ensures "stability," but how can this be reproduced since it does not have a single random seed? (I guess it can be reproduced by looping over the same random seeds?)

MaartenGr Jun 18, 2025
Maintainer

Reproducibility would indeed be looping over the same random seeds here, I think. I'm not that familiar with those reproducibility tests but what you could do is test how many seeds you need before it "converges".

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

What's the best approach to conduct a stability and reproducibility test for BERTopic? #2376

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

What's the best approach to conduct a stability and reproducibility test for BERTopic? #2376

Uh oh!

NokeYuan Jun 9, 2025

Replies: 1 comment · 2 replies

Uh oh!

MaartenGr Jun 18, 2025 Maintainer

Uh oh!

NokeYuan Jun 18, 2025 Author

Uh oh!

MaartenGr Jun 18, 2025 Maintainer

NokeYuan
Jun 9, 2025

Replies: 1 comment 2 replies

MaartenGr
Jun 18, 2025
Maintainer

NokeYuan Jun 18, 2025
Author

MaartenGr Jun 18, 2025
Maintainer