.merge_models() alters HDBSCAN clustering #2415

@vnguye65

Description

Have you searched existing issues? 🔎

  • I have searched and found no existing issues

Describe the bug

I have two BERTopic models, each with a similarly configured HDBSCAN, trained on two different subsets of data. However, when these models are merged, the resulting model's hdbscan_model defaults to BaseCluster, so calling .transform() bypasses the clustering step.

Reproduction

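from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# data1/data2 are two subsets of documents; embeddings1/embeddings2 are
# their precomputed embeddings (not shown here).

# First model, trained on the first subset
umap_model1 = UMAP(n_components=25, metric='cosine', random_state=42)
vectorizer_model1 = CountVectorizer(stop_words="english")
model1 = BERTopic(umap_model=umap_model1,
                  vectorizer_model=vectorizer_model1,
                  calculate_probabilities=True,
                  verbose=True)
model1.fit(data1, embeddings=embeddings1)

# Second model, configured identically, trained on the second subset
umap_model2 = UMAP(n_components=25, metric='cosine', random_state=42)
vectorizer_model2 = CountVectorizer(stop_words="english")
model2 = BERTopic(umap_model=umap_model2,
                  vectorizer_model=vectorizer_model2,
                  calculate_probabilities=True,
                  verbose=True)
model2.fit(data2, embeddings=embeddings2)

# Merge the two fitted models
merged_model = BERTopic.merge_models([model1, model2],
                                     min_similarity=0.7)

# Inspect the clustering component of the merged model
merged_model.hdbscan_model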
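
To make the difference easier to see, here is a minimal sketch of a helper that reports which classes back the clustering and dimensionality-reduction steps of a model. The name describe_components is hypothetical and not part of BERTopic; it only reads the hdbscan_model and umap_model attributes used in the reproduction above, and reports NoneType if an attribute is missing.

```python
def describe_components(model):
    """Return the class names of a model's clustering and reduction components.

    Hypothetical helper for this report, not a BERTopic API. It only inspects
    the `hdbscan_model` and `umap_model` attributes of the object it is given.
    """
    return {
        "hdbscan_model": type(getattr(model, "hdbscan_model", None)).__name__,
        "umap_model": type(getattr(model, "umap_model", None)).__name__,
    }

# Intended usage with the models from the reproduction (not run here):
# for name, m in [("model1", model1), ("model2", model2), ("merged", merged_model)]:
#     print(name, describe_components(m))
```

Running this on model1, model2, and merged_model should show side by side whether the merged model still carries an HDBSCAN instance or has fallen back to BaseCluster.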

BERTopic Version

0.17.3

Labels: bug
