|
| 1 | +# MLflow Integration (Optional) |
| 2 | + |
| 3 | +Training Hub supports [MLflow](https://mlflow.org/) for experiment tracking. When MLflow is enabled on your RHOAI cluster, training metrics such as loss and learning rate are logged to MLflow experiments automatically. The only code change required is setting the experiment name. |
| 4 | + |
| 5 | +> [!NOTE] |
| 6 | +> MLflow integration is available for **interactive (single node)** notebooks only. Distributed training jobs do not currently support MLflow tracking. |
| 7 | +
|
| 8 | +## Enabling MLflow |
| 9 | + |
| 10 | +Each interactive notebook already includes a cell that sets the MLflow experiment name: |
| 11 | + |
| 12 | +```python |
| 13 | +os.environ["MLFLOW_EXPERIMENT_NAME"] = "<your-experiment-name>" |
| 14 | +``` |
| 15 | + |
| 16 | +For this to work, MLflow must be enabled as a component in your RHOAI installation. If MLflow is not enabled, the environment variable is simply ignored and training proceeds normally. |
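
To make that fallback behaviour visible in a notebook, a small helper along these lines may be useful. It is a minimal sketch, not part of Training Hub: the function name `configure_mlflow_experiment` is hypothetical, and it assumes the tracking server location is exposed through MLflow's standard `MLFLOW_TRACKING_URI` environment variable, which your cluster may or may not set.

```python
import os


def configure_mlflow_experiment(name: str) -> bool:
    """Set the experiment name and report whether a tracking URI is present.

    Assumption: the tracking server is advertised via the standard
    MLFLOW_TRACKING_URI environment variable. A cluster may wire this
    differently, so treat the return value as a hint only.
    """
    os.environ["MLFLOW_EXPERIMENT_NAME"] = name
    return bool(os.environ.get("MLFLOW_TRACKING_URI"))


if __name__ == "__main__":
    # "my-experiment" is a placeholder name for illustration.
    if configure_mlflow_experiment("my-experiment"):
        print("MLFLOW_TRACKING_URI is set; metrics should reach the tracking server.")
    else:
        print("MLFLOW_TRACKING_URI is not set; MLflow may not be enabled on this cluster.")
```

Run the helper before starting training so the experiment name is in place when the training process reads its environment.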
| 17 | + |
| 18 | +**To enable MLflow on your cluster:** |
| 19 | + |
| 20 | +1. Enable the MLflow Operator component in your `DataScienceCluster` CR: |
| 21 | + |
| 22 | + ```bash |
| 23 | + oc patch datasciencecluster default-dsc \ |
| 24 | + --type=merge \ |
| 25 | + -p '{"spec":{"components":{"mlflowoperator":{"managementState":"Managed"}}}}' |
| 26 | + ``` |
| 27 | + |
| 28 | +2. Create an `MLflow` CR to deploy the tracking server. The example below uses SQLite as the backend store and a persistent volume (PV) for storage: |
| 29 | + |
| 30 | + ```bash |
| 31 | + oc apply -f - <<EOF |
| 32 | + apiVersion: mlflow.opendatahub.io/v1 |
| 33 | + kind: MLflow |
| 34 | + metadata: |
| 35 | +   name: mlflow |
| 36 | + spec: |
| 37 | +   backendStoreUri: "sqlite:////mlflow/mlflow.db" |
| 38 | +   defaultArtifactRoot: "file:///mlflow/artifacts" |
| 39 | +   serveArtifacts: true |
| 40 | +   storage: |
| 41 | +     accessModes: |
| 42 | +       - ReadWriteOnce |
| 43 | +     resources: |
| 44 | +       requests: |
| 45 | +         storage: 10Gi |
| 46 | + EOF |
| 47 | + ``` |
| 48 | +
|
| 49 | +For full details, see the [Configuring MLflow in OpenShift AI](https://access.redhat.com/articles/7136121) Knowledgebase article (requires Red Hat Customer Portal login). |
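
If you script cluster setup in Python rather than running `oc` by hand, the merge patch from step 1 can be built from a dictionary so the JSON quoting stays correct. This is only a sketch: `build_patch_command` is a hypothetical helper, and it prints the command instead of running it so nothing on the cluster changes by accident.

```python
import json
import shlex


def build_patch_command(state: str = "Managed") -> list[str]:
    """Return the `oc patch` argument list from step 1 for a given managementState."""
    patch = {"spec": {"components": {"mlflowoperator": {"managementState": state}}}}
    return [
        "oc", "patch", "datasciencecluster", "default-dsc",
        "--type=merge",
        "-p", json.dumps(patch),
    ]


if __name__ == "__main__":
    # Print a shell-quoted command for review; pass the list to
    # subprocess.run(..., check=True) once you are ready to apply it.
    print(shlex.join(build_patch_command()))
```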
| 50 | +
|
| 51 | +## Viewing MLflow Experiments |
| 52 | +
|
| 53 | +Once training completes with MLflow enabled, you can browse your experiment runs: |
| 54 | +
|
| 55 | +1. In the OpenShift AI dashboard, navigate to **Develop & train → Experiments** from the left sidebar menu. |
| 56 | +2. Select the experiment name to view all runs. |
| 57 | +3. Each run contains logged metrics (training loss, learning rate), parameters, and artifacts. |
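
The same runs can also be read from a notebook with the MLflow Python client. The sketch below assumes the `mlflow` package is installed and that the notebook is already configured to reach the tracking server; `summarize_run` and `print_experiment_runs` are hypothetical helper names, and the metric keys that appear depend on what the training job logged.

```python
import os


def summarize_run(run_id: str, metrics: dict[str, float]) -> str:
    """Format one run's latest metric values as a single line."""
    parts = ", ".join(f"{key}={value:.4g}" for key, value in sorted(metrics.items()))
    return f"{run_id}: {parts}" if parts else f"{run_id}: (no metrics)"


def print_experiment_runs(experiment_name: str) -> None:
    """Print a one-line summary for every run in the named experiment."""
    # Imported lazily so summarize_run works without MLflow installed.
    import mlflow

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        output_format="list",  # return Run objects instead of a DataFrame
    )
    for run in runs:
        print(summarize_run(run.info.run_id, run.data.metrics))


if __name__ == "__main__":
    name = os.environ.get("MLFLOW_EXPERIMENT_NAME")
    if name:
        print_experiment_runs(name)
```

`run.data.metrics` holds only the most recent value of each metric, so this gives a quick end-of-training summary rather than the full curves shown in the UI.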
| 58 | +
|
| 59 | +You can also launch the full MLflow UI by clicking the **"Launch MLflow"** link in the top right of the Experiments page. |
| 60 | +
|
| 61 | + |
| 62 | +
|
| 63 | +In the MLflow UI, the metrics logged for each run include training loss, learning rate, and samples per second, among others. |
| 64 | +
|
| 65 | + |