Conversation

@dubin555
Some cloud providers (e.g. Azure ML, Databricks) automatically set the MLFLOW_RUN_ID environment variable.
Previously, verl always called mlflow.set_experiment(...) followed by mlflow.start_run(experiment_id=..., run_name=...), which can conflict with the provider-managed run context and cause MLflow to raise an exception.

This change detects MLFLOW_RUN_ID and attaches to the existing run instead of creating a new one, preventing duplicate runs and enabling seamless integration with managed MLflow environments.

What does this PR do?

  • Detect MLFLOW_RUN_ID during Tracking(..., default_backend="mlflow") initialization.
  • If MLFLOW_RUN_ID is present, call mlflow.start_run(run_id=...) to attach to the provider-managed run.
  • Otherwise, keep the previous behavior: mlflow.set_experiment(project_name) and start a new run under that experiment.
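The conditional path described above can be sketched as follows. The helper and function names here are illustrative, not verl's actual internals; only the MLFLOW_RUN_ID detection and the two mlflow calls come from the PR description.

```python
import os


def detect_managed_run(environ=os.environ):
    """Return the provider-managed MLflow run ID, or None (hypothetical helper)."""
    return environ.get("MLFLOW_RUN_ID") or None


def start_or_attach_run(project_name, experiment_name):
    """Sketch of the conditional path added in verl/utils/tracking.py."""
    import mlflow  # imported lazily so the pure helper above is testable without mlflow

    run_id = detect_managed_run()
    if run_id:
        # Attach to the run that Azure ML / Databricks already created.
        return mlflow.start_run(run_id=run_id)
    # Previous behavior: create/use the experiment, then start a new run.
    mlflow.set_experiment(project_name)
    return mlflow.start_run(run_name=experiment_name)
```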

Why is this needed?

On Azure ML, MLflow run context is automatically created and MLFLOW_RUN_ID is injected.
Starting a new run under a different experiment ID triggers a mismatch error like:

MlflowException: Cannot start run with ID <...> experiment ID does not match environment run ID.
Make sure --experiment-name or --experiment-id matches experiment set with set_experiment()...

Because the cloud provider-generated MLFLOW_RUN_ID is not predictable and is set outside the user config, this cannot be reliably worked around via configuration alone.

Test

  • ✅ Local smoke test (manual): verified the MLflow backend initializes correctly when:
    • MLFLOW_RUN_ID is set → attaches to existing run
    • MLFLOW_RUN_ID is not set → creates/uses experiment and starts a new run
  • CI: (existing unit tests)
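The two smoke-test cases above can also be exercised as a unit test by stubbing the mlflow module, so no tracking server is needed. This is an illustrative sketch, not verl's actual CI tests; start_or_attach_run is a hypothetical stand-in for the changed code path.

```python
import os
import sys
import types
from unittest import mock

# Stub mlflow so both branches can be exercised without a real tracking server.
fake_mlflow = types.ModuleType("mlflow")
fake_mlflow.calls = []
fake_mlflow.set_experiment = lambda name: fake_mlflow.calls.append(("set_experiment", name))
fake_mlflow.start_run = lambda **kw: fake_mlflow.calls.append(("start_run", kw))
sys.modules["mlflow"] = fake_mlflow


def start_or_attach_run(project_name, experiment_name):
    """Hypothetical stand-in for the changed init path in verl/utils/tracking.py."""
    import mlflow

    run_id = os.environ.get("MLFLOW_RUN_ID")
    if run_id:
        return mlflow.start_run(run_id=run_id)
    mlflow.set_experiment(project_name)
    return mlflow.start_run(run_name=experiment_name)


# MLFLOW_RUN_ID set -> attaches to the existing run.
with mock.patch.dict(os.environ, {"MLFLOW_RUN_ID": "abc123"}):
    start_or_attach_run("proj", "exp")
assert fake_mlflow.calls[-1] == ("start_run", {"run_id": "abc123"})

# MLFLOW_RUN_ID absent -> creates/uses the experiment and starts a new run.
fake_mlflow.calls.clear()
with mock.patch.dict(os.environ, {}, clear=True):
    start_or_attach_run("proj", "exp")
assert fake_mlflow.calls == [("set_experiment", "proj"), ("start_run", {"run_name": "exp"})]
```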

(If you have an Azure ML job run link/log snippet, you can paste it here as additional evidence.)

API and Usage Example

No API changes. Behavior is automatically enabled when MLFLOW_RUN_ID is set by the environment.

Design & Code Changes

  • verl/utils/tracking.py
    • Add a conditional path to honor MLFLOW_RUN_ID and attach to an existing MLflow run.
    • Preserve the previous experiment-based run creation when MLFLOW_RUN_ID is absent.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a fix that allows verl to attach to an existing MLflow run when the MLFLOW_RUN_ID environment variable is set, a common scenario in managed ML environments such as Azure ML or Databricks. The change correctly detects the presence of MLFLOW_RUN_ID and uses mlflow.start_run(run_id=...) to attach to the existing run, avoiding the conflicts caused by the previous behavior of always creating a new run. The implementation is clean and directly addresses the described issue; I found no high- or critical-severity issues in my review.
