MLflow Autologging Integration for BERTopic #2429

@NJAHNAVI2907

Feature request

I am currently implementing an autologging integration between BERTopic and MLflow for experiment tracking. It would automatically log BERTopic's training parameters (e.g., embedding model, UMAP/HDBSCAN settings), metrics, artifacts, and the fitted model (via MLflow's PyFunc flavor) during fit_transform calls. The goal is to simplify experiment tracking in BERTopic workflows by removing the need for manual logging.

Motivation

Approach: Monkey-patch BERTopic.fit_transform using MLflow's safe_patch utility.
What Gets Logged:
Parameters: Embedding model name, UMAP (n_neighbors, n_components, ...), HDBSCAN (min_cluster_size, ...), vectorizer type.
Metrics: n_documents, avg_doc_length, n_topics, n_outliers, avg/max/min_topic_size, vocab_size, embedding_dim, diversity, coherence (c_v, c_npmi, u_mass via gensim), per-topic coherence.
Artifacts: topic_info.csv, metrics.json, per_topic_coherence.csv, embeddings.npy, and the full model as PyFunc.
Flavor Support: Registered as an MLflow flavor (bertopic) with @autologging_integration, enabling mlflow.autolog().
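
To make the patching idea concrete, here is a minimal sketch of the approach, not the actual implementation. It uses a plain wrapper as a stand-in for MLflow's safe_patch so that it runs without MLflow or BERTopic installed. The names patch_fit_transform, summarize_topics, and FakeTopicModel are hypothetical and exist only for illustration. It assumes fit_transform returns a (topics, probabilities) tuple in which topic id -1 marks an outlier, as BERTopic does.

```python
import functools
from collections import Counter


def summarize_topics(topics):
    """Compute simple size metrics from per-document topic ids (-1 = outlier)."""
    sizes = Counter(t for t in topics if t != -1)
    metrics = {
        "n_documents": len(topics),
        "n_topics": len(sizes),
        "n_outliers": sum(1 for t in topics if t == -1),
    }
    if sizes:
        metrics["avg_topic_size"] = sum(sizes.values()) / len(sizes)
        metrics["max_topic_size"] = max(sizes.values())
        metrics["min_topic_size"] = min(sizes.values())
    return metrics


def patch_fit_transform(cls, log_metrics):
    """Replace cls.fit_transform with a wrapper that reports metrics after each call.

    Hypothetical helper: a simplified stand-in for what a safe_patch-based
    integration would do.
    """
    original = cls.fit_transform

    @functools.wraps(original)
    def patched(self, documents, *args, **kwargs):
        result = original(self, documents, *args, **kwargs)
        try:
            topics = result[0]  # assumed: first element holds per-document topic ids
            log_metrics(summarize_topics(list(topics)))
        except Exception:
            # A logging failure should never break the user's training call.
            pass
        return result

    cls.fit_transform = patched
    return original  # returned so the caller can undo the patch


class FakeTopicModel:
    """Stand-in for BERTopic so the sketch runs without extra dependencies."""

    def fit_transform(self, documents):
        topics = [i % 3 - 1 for i in range(len(documents))]  # ids cycle -1, 0, 1
        return topics, None


logged = []
patch_fit_transform(FakeTopicModel, logged.append)
topics, _ = FakeTopicModel().fit_transform(["doc"] * 6)
```

In a real integration the log_metrics callback could be mlflow.log_metrics, which accepts a dict of metric names to numeric values, and the wrapper would be installed through safe_patch rather than by direct assignment.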

MLflow issue: mlflow/mlflow#16792 (comment)

Your contribution

Seeking Feedback:

Is this something you would be interested in merging into BERTopic's core as an optional MLflow submodule, or would it be better as an external package like mlflow-scikit-learn or mlflow-txtai?

Should I pursue this as a separate repo, with a lazy import added to MLflow's __init__.py via a PR, or integrate it directly into BERTopic? What are the pros and cons from your perspective?

I would like to hear your thoughts on this. Thank you!
