Description
Feature request
I am currently implementing an autologging integration for BERTopic with MLflow for managing ML experiments. This would automatically log BERTopic's training parameters (e.g., embedding model, UMAP/HDBSCAN settings), metrics, artifacts, and the fitted model via MLflow's PyFunc flavor during fit_transform calls. The goal is to simplify experiment tracking in BERTopic workflows without manual logging.
Motivation
Approach: Monkey-patching BERTopic.fit_transform using MLflow's safe_patch for safe integration.
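To make the approach concrete, here is a minimal sketch of the wrapping logic. A plain decorator stands in for MLflow's safe_patch (which additionally guards the run so logging errors never break training), and DummyTopicModel stands in for BERTopic; the logged names (n_documents, n_topics) follow the metrics listed below, but everything else is illustrative:

```python
import functools

# Stand-in for an MLflow tracking client: collects what would be logged.
logged = {"params": {}, "metrics": {}}

def autolog_fit_transform(cls):
    """Patch cls.fit_transform so params/metrics are logged around the
    original call (simplified stand-in for mlflow's safe_patch)."""
    original = cls.fit_transform

    @functools.wraps(original)
    def patched(self, documents, *args, **kwargs):
        # Log parameters before training.
        logged["params"]["n_documents"] = len(documents)
        topics, probs = original(self, documents, *args, **kwargs)
        # Log metrics derived from the fitted result (-1 is BERTopic's
        # outlier topic).
        logged["metrics"]["n_topics"] = len(set(topics) - {-1})
        return topics, probs

    cls.fit_transform = patched
    return cls

# Dummy model standing in for BERTopic, so the sketch is self-contained.
class DummyTopicModel:
    def fit_transform(self, documents):
        # Pretend every document lands in topic 0 except one outlier.
        topics = [0] * (len(documents) - 1) + [-1]
        return topics, [1.0] * len(documents)

autolog_fit_transform(DummyTopicModel)
topics, probs = DummyTopicModel().fit_transform(["doc a", "doc b", "doc c"])
```

The real integration would apply the same pattern via safe_patch so that users who never call mlflow.autolog() see unchanged behavior.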
What Gets Logged:
Parameters: Embedding model name, UMAP (n_neighbors, n_components, ...), HDBSCAN (min_cluster_size, ...), vectorizer type.
Metrics: n_documents, avg_doc_length, n_topics, n_outliers, avg/max/min_topic_size, vocab_size, embedding_dim, diversity, coherence (c_v, c_npmi, u_mass via gensim), per-topic coherence.
Artifacts: topic_info.csv, metrics.json, per_topic_coherence.csv, embeddings.npy and the full model as PyFunc.
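For the scalar metrics above, most can be derived directly from fit_transform's output. A small illustrative helper (not part of BERTopic's API; -1 denotes BERTopic's outlier topic, and avg_doc_length is measured in whitespace tokens here):

```python
from collections import Counter
import statistics

def topic_metrics(topics, documents):
    """Compute the simple dataset/topic-size metrics from a list of
    per-document topic assignments (illustrative helper)."""
    sizes = Counter(t for t in topics if t != -1)  # exclude outlier topic
    return {
        "n_documents": len(documents),
        "avg_doc_length": statistics.mean(len(d.split()) for d in documents),
        "n_topics": len(sizes),
        "n_outliers": sum(1 for t in topics if t == -1),
        "avg_topic_size": statistics.mean(sizes.values()) if sizes else 0,
        "max_topic_size": max(sizes.values(), default=0),
        "min_topic_size": min(sizes.values(), default=0),
    }

docs = ["one two", "three four", "five six", "outlier doc"]
metrics = topic_metrics([0, 0, 1, -1], docs)
```

Coherence and diversity would come from gensim's CoherenceModel and the topic representations rather than from the assignments alone.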
Flavor Support: Registered as an MLflow flavor (bertopic) with @autologging_integration, enabling mlflow.autolog().
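The registration pattern can be sketched as follows; this is a simplified stand-in for MLflow's @autologging_integration decorator and registry, with the flavor name "bertopic" taken from the proposal and everything else illustrative:

```python
# Simplified stand-in for MLflow's autologging-integration registry:
# decorating autolog() registers it under a flavor name so that a single
# top-level call (conceptually mlflow.autolog()) can enable every
# registered integration at once.
AUTOLOGGING_INTEGRATIONS = {}

def autologging_integration(name):
    def decorator(autolog_fn):
        AUTOLOGGING_INTEGRATIONS[name] = autolog_fn
        return autolog_fn
    return decorator

@autologging_integration("bertopic")
def autolog(disable=False):
    # In the real flavor this would patch BERTopic.fit_transform;
    # here we just record whether autologging is enabled.
    autolog.enabled = not disable

def enable_all_autologging():
    # What a global autolog() call conceptually does across flavors.
    for fn in AUTOLOGGING_INTEGRATIONS.values():
        fn()

enable_all_autologging()
```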
MLflow issue: mlflow/mlflow#16792 (comment)
Your contribution
Seeking Feedback:
Is this something you would be interested in merging into BERTopic's core as an optional MLflow submodule, or would it be better as an external package (like mlflow-scikit-learn or mlflow-txtai)?
Should I pursue this as a separate repo with a lazy import in MLflow's __init__.py (via PR to MLflow), or integrate it directly into BERTopic? Pros/cons from your perspective?
I would like to hear your thoughts on this. Thank you!