Commit 63f5546

docs: add MLflow integration documentation to fine-tuning examples
Add an MLflow Integration section to the READMEs for the lora, osft, and sft fine-tuning examples. Document how to enable MLflow on the cluster via the DataScienceCluster CR and MLflow CR, what the notebook cells do, and how to navigate to the Experiments page in the RHOAI dashboard. MLflow tracking is available for interactive (single-node) notebooks only; distributed training does not currently support it.

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent fb4edf5 commit 63f5546

5 files changed

Lines changed: 198 additions & 0 deletions

examples/fine-tuning/lora/README.md

Lines changed: 66 additions & 0 deletions
@@ -172,3 +172,69 @@ to seamlessly run fine-tuning jobs.
> You can skip the token if switching to non-gated models.

You can now proceed with the instructions from the notebook. Enjoy!

## MLflow Integration (Optional)

Training Hub supports [MLflow](https://mlflow.org/) for experiment tracking. When MLflow is enabled on your RHOAI cluster, training metrics (loss, learning rate, etc.) are logged to MLflow experiments automatically; the only code change required is setting the experiment name.

> [!NOTE]
> MLflow integration is available for **interactive (single-node)** notebooks only. Distributed training jobs do not currently support MLflow tracking.

### Enabling MLflow

The interactive notebook already includes a cell that sets the MLflow experiment name:

```python
os.environ["MLFLOW_EXPERIMENT_NAME"] = "lora-training"
```

For this to take effect, MLflow must be enabled as a component in your RHOAI installation. If MLflow is not enabled, the environment variable is ignored and training proceeds normally.
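
As a rough sketch of what the notebook cell amounts to, the helper below sets the experiment name and reports whether a tracking server address is present in the environment. `MLFLOW_EXPERIMENT_NAME` comes from the notebook and `MLFLOW_TRACKING_URI` is MLflow's standard variable for the server address, but whether RHOAI injects the latter into the workbench is an assumption here, and `configure_mlflow_experiment` is a hypothetical name that is not part of Training Hub.

```python
import os


def configure_mlflow_experiment(name: str) -> bool:
    """Set the MLflow experiment name; return True if a tracking URI is set.

    Hypothetical helper for illustration only. It assumes the workbench
    exposes the tracking server through MLFLOW_TRACKING_URI.
    """
    os.environ["MLFLOW_EXPERIMENT_NAME"] = name
    # A missing or blank URI means no server address is configured here.
    return bool(os.environ.get("MLFLOW_TRACKING_URI", "").strip())


if configure_mlflow_experiment("lora-training"):
    print("Tracking URI found; runs should be grouped under 'lora-training'.")
else:
    print("No tracking URI found in the environment.")
```

Running a check like this before training can make it obvious up front whether metrics are likely to reach a tracking server, rather than discovering an empty Experiments page afterwards.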

**To enable MLflow on your cluster:**

1. Enable the MLflow Operator component in your `DataScienceCluster` CR:

   ```bash
   oc patch datasciencecluster default-dsc \
     --type=merge \
     -p '{"spec":{"components":{"mlflowoperator":{"managementState":"Managed"}}}}'
   ```

2. Create an `MLflow` CR to deploy the tracking server (this example uses SQLite as the backend store and a persistent volume for storage):

   ```bash
   oc apply -f - <<EOF
   apiVersion: mlflow.opendatahub.io/v1
   kind: MLflow
   metadata:
     name: mlflow
   spec:
     backendStoreUri: "sqlite:////mlflow/mlflow.db"
     defaultArtifactRoot: "file:///mlflow/artifacts"
     serveArtifacts: true
     storage:
       accessModes:
         - ReadWriteOnce
       resources:
         requests:
           storage: 10Gi
   EOF
   ```

For full details, see the [Configuring MLflow in OpenShift AI](https://access.redhat.com/articles/7136121) Knowledgebase article.

### Viewing MLflow Experiments

Once training completes with MLflow enabled, you can browse your experiment runs:

1. In the OpenShift AI dashboard, navigate to **Develop & train → Experiments** in the left sidebar menu.
2. Select the experiment name (e.g., `lora-training`) to view all runs.
3. Open a run to see its logged metrics (training loss, learning rate), parameters, and artifacts.

You can also launch the full MLflow UI by clicking the **"Launch MLflow"** link in the top right of the Experiments page:

![MLflow Experiments page](../images/mlflow-experiments.png)

Each run logs metrics including training loss, learning rate, samples per second, and more:

![MLflow run metrics](../images/mlflow-run-metrics.png)

examples/fine-tuning/osft/README.md

Lines changed: 66 additions & 0 deletions
@@ -207,3 +207,69 @@ These images serve both as training runtime and jupyter notebook images and come
> You can skip the token if switching to non-gated models.

You can now proceed with the instructions from the notebook. Enjoy!

## MLflow Integration (Optional)

Training Hub supports [MLflow](https://mlflow.org/) for experiment tracking. When MLflow is enabled on your RHOAI cluster, training metrics (loss, learning rate, etc.) are logged to MLflow experiments automatically; the only code change required is setting the experiment name.

> [!NOTE]
> MLflow integration is available for **interactive (single-node)** notebooks only. Distributed training jobs do not currently support MLflow tracking.

### Enabling MLflow

The interactive notebook already includes a cell that sets the MLflow experiment name:

```python
os.environ["MLFLOW_EXPERIMENT_NAME"] = "osft-training"
```

For this to take effect, MLflow must be enabled as a component in your RHOAI installation. If MLflow is not enabled, the environment variable is ignored and training proceeds normally.

**To enable MLflow on your cluster:**

1. Enable the MLflow Operator component in your `DataScienceCluster` CR:

   ```bash
   oc patch datasciencecluster default-dsc \
     --type=merge \
     -p '{"spec":{"components":{"mlflowoperator":{"managementState":"Managed"}}}}'
   ```

2. Create an `MLflow` CR to deploy the tracking server (this example uses SQLite as the backend store and a persistent volume for storage):

   ```bash
   oc apply -f - <<EOF
   apiVersion: mlflow.opendatahub.io/v1
   kind: MLflow
   metadata:
     name: mlflow
   spec:
     backendStoreUri: "sqlite:////mlflow/mlflow.db"
     defaultArtifactRoot: "file:///mlflow/artifacts"
     serveArtifacts: true
     storage:
       accessModes:
         - ReadWriteOnce
       resources:
         requests:
           storage: 10Gi
   EOF
   ```

For full details, see the [Configuring MLflow in OpenShift AI](https://access.redhat.com/articles/7136121) Knowledgebase article.
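
If you script cluster setup, the merge patch from step 1 can be generated instead of hand-written, which avoids quoting mistakes in the inline JSON. The following is a minimal sketch, assuming the same `default-dsc` resource name and `mlflowoperator` component key shown above; `dsc_patch_command` is a hypothetical helper name used only for illustration.

```python
import json


def dsc_patch_command(state: str = "Managed", dsc_name: str = "default-dsc") -> list:
    """Build the argv for the `oc patch` call that sets the MLflow Operator state."""
    patch = {"spec": {"components": {"mlflowoperator": {"managementState": state}}}}
    return [
        "oc", "patch", "datasciencecluster", dsc_name,
        "--type=merge",
        # Compact separators keep the JSON identical to the hand-written patch.
        "-p", json.dumps(patch, separators=(",", ":")),
    ]


# Print the command for review; nothing is executed against the cluster here.
print(" ".join(dsc_patch_command()))
```

To apply it, the returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where `oc` is already logged in to the cluster.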

### Viewing MLflow Experiments

Once training completes with MLflow enabled, you can browse your experiment runs:

1. In the OpenShift AI dashboard, navigate to **Develop & train → Experiments** in the left sidebar menu.
2. Select the experiment name (e.g., `osft-training`) to view all runs.
3. Open a run to see its logged metrics (training loss, learning rate), parameters, and artifacts.

You can also launch the full MLflow UI by clicking the **"Launch MLflow"** link in the top right of the Experiments page:

![MLflow Experiments page](../images/mlflow-experiments.png)

Each run logs metrics including training loss, learning rate, samples per second, and more:

![MLflow run metrics](../images/mlflow-run-metrics.png)

examples/fine-tuning/sft/README.md

Lines changed: 66 additions & 0 deletions
@@ -154,3 +154,69 @@ to seamlessly run fine-tuning jobs.
> You can skip the token if switching to non-gated models.

You can now proceed with the instructions from the notebook. Enjoy!

## MLflow Integration (Optional)

Training Hub supports [MLflow](https://mlflow.org/) for experiment tracking. When MLflow is enabled on your RHOAI cluster, training metrics (loss, learning rate, etc.) are logged to MLflow experiments automatically; the only code change required is setting the experiment name.

> [!NOTE]
> MLflow integration is available for **interactive (single-node)** notebooks only. Distributed training jobs do not currently support MLflow tracking.

### Enabling MLflow

The interactive notebook already includes a cell that sets the MLflow experiment name:

```python
os.environ["MLFLOW_EXPERIMENT_NAME"] = "sft-training"
```

For this to take effect, MLflow must be enabled as a component in your RHOAI installation. If MLflow is not enabled, the environment variable is ignored and training proceeds normally.

**To enable MLflow on your cluster:**

1. Enable the MLflow Operator component in your `DataScienceCluster` CR:

   ```bash
   oc patch datasciencecluster default-dsc \
     --type=merge \
     -p '{"spec":{"components":{"mlflowoperator":{"managementState":"Managed"}}}}'
   ```

2. Create an `MLflow` CR to deploy the tracking server (this example uses SQLite as the backend store and a persistent volume for storage):

   ```bash
   oc apply -f - <<EOF
   apiVersion: mlflow.opendatahub.io/v1
   kind: MLflow
   metadata:
     name: mlflow
   spec:
     backendStoreUri: "sqlite:////mlflow/mlflow.db"
     defaultArtifactRoot: "file:///mlflow/artifacts"
     serveArtifacts: true
     storage:
       accessModes:
         - ReadWriteOnce
       resources:
         requests:
           storage: 10Gi
   EOF
   ```

For full details, see the [Configuring MLflow in OpenShift AI](https://access.redhat.com/articles/7136121) Knowledgebase article.

### Viewing MLflow Experiments

Once training completes with MLflow enabled, you can browse your experiment runs:

1. In the OpenShift AI dashboard, navigate to **Develop & train → Experiments** in the left sidebar menu.
2. Select the experiment name (e.g., `sft-training`) to view all runs.
3. Open a run to see its logged metrics (training loss, learning rate), parameters, and artifacts.
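
The same runs can in principle be read from a notebook cell as well as from the dashboard. The sketch below uses MLflow's `mlflow.search_runs` call, which returns a pandas DataFrame whose metric columns are prefixed with `metrics.`. It assumes the `mlflow` package is installed in the workbench and that a tracking URI is already configured there; the helper names are hypothetical, and no particular metric names are assumed because the README does not list the exact keys Training Hub logs.

```python
def metric_columns(columns):
    """Return the column names that hold logged metrics."""
    return [c for c in columns if c.startswith("metrics.")]


def latest_run_metrics(experiment_name):
    """Return the metrics of the most recent run in an experiment as a dict."""
    # Imported lazily so metric_columns() works even without MLflow installed.
    import mlflow

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=["attributes.start_time DESC"],
        max_results=1,
    )
    if runs.empty:
        # No runs yet, or the experiment does not exist on this server.
        return {}
    row = runs.iloc[0]
    return {c: row[c] for c in metric_columns(runs.columns)}
```

Calling `latest_run_metrics("sft-training")` after a training run would return the final logged values for that run, or an empty dictionary if nothing was recorded.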

You can also launch the full MLflow UI by clicking the **"Launch MLflow"** link in the top right of the Experiments page:

![MLflow Experiments page](../images/mlflow-experiments.png)

Each run logs metrics including training loss, learning rate, samples per second, and more:

![MLflow run metrics](../images/mlflow-run-metrics.png)
