Conversation

@dubin555
Some cloud providers (e.g. Azure ML, Databricks) automatically set the MLFLOW_RUN_ID environment variable.
Previously, verl always called mlflow.set_experiment(...) followed by mlflow.start_run(experiment_id=..., run_name=...), which can conflict with the provider-managed run context and cause MLflow to raise an exception.

This change detects MLFLOW_RUN_ID and attaches to the existing run instead of creating a new one, preventing duplicate runs and enabling seamless integration with managed MLflow environments.

What does this PR do?

  • Detect MLFLOW_RUN_ID during Tracking(..., default_backend="mlflow") initialization.
  • If MLFLOW_RUN_ID is present, call mlflow.start_run(run_id=...) to attach to the provider-managed run.
  • Otherwise, keep the previous behavior: mlflow.set_experiment(project_name) and start a new run under that experiment.
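The conditional path described above can be sketched as follows. The helper and function names here are illustrative, not verl's actual internals; only the MLFLOW_RUN_ID detection and the two mlflow calls come from the PR description.

```python
import os


def detect_managed_run(environ=os.environ):
    """Return the provider-managed MLflow run ID, or None (hypothetical helper)."""
    return environ.get("MLFLOW_RUN_ID") or None


def start_or_attach_run(project_name, experiment_name):
    """Sketch of the conditional path added in verl/utils/tracking.py."""
    import mlflow  # imported lazily so the pure helper above is testable without mlflow

    run_id = detect_managed_run()
    if run_id:
        # Attach to the run that Azure ML / Databricks already created.
        return mlflow.start_run(run_id=run_id)
    # Previous behavior: create/use the experiment, then start a new run.
    mlflow.set_experiment(project_name)
    return mlflow.start_run(run_name=experiment_name)
```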

Why is this needed?

On Azure ML, MLflow run context is automatically created and MLFLOW_RUN_ID is injected.
Starting a new run under a different experiment ID triggers a mismatch error like:

MlflowException: Cannot start run with ID <...> experiment ID does not match environment run ID.
Make sure --experiment-name or --experiment-id matches experiment set with set_experiment()...

Because the cloud provider-generated MLFLOW_RUN_ID is not predictable and is set outside the user config, this cannot be reliably worked around via configuration alone.

Test

  • ✅ Local smoke test (manual): verified the MLflow backend initializes correctly when:
    • MLFLOW_RUN_ID is set → attaches to existing run
    • MLFLOW_RUN_ID is not set → creates/uses experiment and starts a new run
  • CI: (existing unit tests)
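The two smoke-test cases above can also be exercised as a unit test by stubbing the mlflow module, so no tracking server is needed. This is an illustrative sketch, not verl's actual CI tests; start_or_attach_run is a hypothetical stand-in for the changed code path.

```python
import os
import sys
import types
from unittest import mock

# Stub mlflow so both branches can be exercised without a real tracking server.
fake_mlflow = types.ModuleType("mlflow")
fake_mlflow.calls = []
fake_mlflow.set_experiment = lambda name: fake_mlflow.calls.append(("set_experiment", name))
fake_mlflow.start_run = lambda **kw: fake_mlflow.calls.append(("start_run", kw))
sys.modules["mlflow"] = fake_mlflow


def start_or_attach_run(project_name, experiment_name):
    """Hypothetical stand-in for the changed init path in verl/utils/tracking.py."""
    import mlflow

    run_id = os.environ.get("MLFLOW_RUN_ID")
    if run_id:
        return mlflow.start_run(run_id=run_id)
    mlflow.set_experiment(project_name)
    return mlflow.start_run(run_name=experiment_name)


# MLFLOW_RUN_ID set -> attaches to the existing run.
with mock.patch.dict(os.environ, {"MLFLOW_RUN_ID": "abc123"}):
    start_or_attach_run("proj", "exp")
assert fake_mlflow.calls[-1] == ("start_run", {"run_id": "abc123"})

# MLFLOW_RUN_ID absent -> creates/uses the experiment and starts a new run.
fake_mlflow.calls.clear()
with mock.patch.dict(os.environ, {}, clear=True):
    start_or_attach_run("proj", "exp")
assert fake_mlflow.calls == [("set_experiment", "proj"), ("start_run", {"run_name": "exp"})]
```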

(If you have an Azure ML job run link/log snippet, you can paste it here as additional evidence.)

API and Usage Example

No API changes. Behavior is automatically enabled when MLFLOW_RUN_ID is set by the environment.

Design & Code Changes

  • verl/utils/tracking.py
    • Add a conditional path to honor MLFLOW_RUN_ID and attach to an existing MLflow run.
    • Preserve the previous experiment-based run creation when MLFLOW_RUN_ID is absent.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a fix that allows verl to attach to an existing MLflow run when the MLFLOW_RUN_ID environment variable is set, a common scenario in managed ML environments such as Azure ML or Databricks. The change correctly detects the presence of MLFLOW_RUN_ID and uses mlflow.start_run(run_id=...) to attach to the existing run, avoiding the conflicts caused by the previous behavior of always creating a new run. The implementation is clean and directly addresses the described issue; I found no high- or critical-severity issues in my review.
