[tool] fix: attach to existing MLflow run when MLFLOW_RUN_ID is set #4740
+11 −4
Some cloud providers (e.g. Azure ML, Databricks) automatically set the `MLFLOW_RUN_ID` environment variable. Previously, verl always called `mlflow.set_experiment(...); mlflow.start_run(experiment_id=..., run_name=...)`. This can conflict with the provider-managed run context and cause MLflow to raise an exception. This change detects `MLFLOW_RUN_ID` and attaches to the existing run instead of creating a new one. That prevents duplicate runs and lets verl work seamlessly in managed MLflow environments.

What does this PR do?
- Detect `MLFLOW_RUN_ID` during `Tracking(..., default_backend="mlflow")` initialization.
- If `MLFLOW_RUN_ID` is present, call `mlflow.start_run(run_id=...)` to attach to the provider-managed run.
- Otherwise, call `mlflow.set_experiment(project_name)` and start a new run under that experiment.

Why is this needed?
On Azure ML, the MLflow run context is created automatically and `MLFLOW_RUN_ID` is injected. Starting a new run under a different experiment ID then triggers a mismatch error like:
Because the provider-generated `MLFLOW_RUN_ID` is not predictable and is set outside the user config, this cannot be reliably worked around through configuration alone.

Test
- `MLFLOW_RUN_ID` is set → attaches to the existing run.
- `MLFLOW_RUN_ID` is not set → creates/uses the experiment and starts a new run.

(If you have an Azure ML job run link/log snippet, you can paste it here as additional evidence.)
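The two cases above can be checked without an Azure ML workspace by replacing `mlflow` with a mock and controlling the environment. This is a minimal sketch, not the test code in this PR: `init_mlflow` is a hypothetical stand-in for the MLflow branch of `Tracking.__init__`, and the experiment ID and run ID values are made up for illustration.

```python
import os
import unittest
from unittest import mock


def init_mlflow(mlflow, project_name, experiment_name):
    # Hypothetical stand-in for the MLflow branch of Tracking.__init__.
    run_id = os.environ.get("MLFLOW_RUN_ID")
    if run_id:
        # Provider-managed run: attach instead of creating a new one.
        return mlflow.start_run(run_id=run_id)
    experiment = mlflow.set_experiment(project_name)
    return mlflow.start_run(experiment_id=experiment.experiment_id, run_name=experiment_name)


class MlflowAttachTest(unittest.TestCase):
    def test_attaches_when_run_id_set(self):
        fake_mlflow = mock.MagicMock()
        with mock.patch.dict(os.environ, {"MLFLOW_RUN_ID": "run-123"}):
            init_mlflow(fake_mlflow, "proj", "exp")
        fake_mlflow.start_run.assert_called_once_with(run_id="run-123")
        # The experiment is owned by the provider, so it must not be changed.
        fake_mlflow.set_experiment.assert_not_called()

    def test_starts_new_run_when_run_id_absent(self):
        fake_mlflow = mock.MagicMock()
        fake_mlflow.set_experiment.return_value.experiment_id = "7"
        # Copy the environment without MLFLOW_RUN_ID; restored after the block.
        env = {k: v for k, v in os.environ.items() if k != "MLFLOW_RUN_ID"}
        with mock.patch.dict(os.environ, env, clear=True):
            init_mlflow(fake_mlflow, "proj", "exp")
        fake_mlflow.set_experiment.assert_called_once_with("proj")
        fake_mlflow.start_run.assert_called_once_with(experiment_id="7", run_name="exp")
```

Saved as a `test_*.py` file, this can be run with `python -m unittest` or `pytest`, both of which collect `unittest.TestCase` classes.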
API and Usage Example
No API changes. The behavior is enabled automatically when `MLFLOW_RUN_ID` is set by the environment.

Design & Code Changes
- `verl/utils/tracking.py`: detect `MLFLOW_RUN_ID` and attach to an existing MLflow run; keep the previous behavior (set the experiment and start a new run) when `MLFLOW_RUN_ID` is absent.
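The branching in `verl/utils/tracking.py` can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR. The helper names `resolve_mlflow_run_id` and `start_or_attach_mlflow_run` are hypothetical. `mlflow` is passed in as an argument (normally the real `mlflow` module) so the logic can be exercised without a tracking server. The sketch relies only on `mlflow.set_experiment`, which returns an `Experiment`, and on `mlflow.start_run(run_id=..., experiment_id=..., run_name=...)`.

```python
import os
from typing import Any, Mapping, Optional


def resolve_mlflow_run_id(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the provider-injected MLFLOW_RUN_ID, or None if unset or empty."""
    env = os.environ if env is None else env
    return env.get("MLFLOW_RUN_ID") or None


def start_or_attach_mlflow_run(
    mlflow: Any,
    project_name: str,
    experiment_name: str,
    env: Optional[Mapping[str, str]] = None,
) -> Any:
    """Hypothetical helper: attach to a managed run if one exists, else start one."""
    run_id = resolve_mlflow_run_id(env)
    if run_id is not None:
        # The platform (e.g. Azure ML) already created this run and chose its
        # experiment; selecting another experiment first could cause a mismatch.
        return mlflow.start_run(run_id=run_id)
    # No managed run: previous behavior, one run per verl experiment name.
    experiment = mlflow.set_experiment(project_name)
    return mlflow.start_run(experiment_id=experiment.experiment_id, run_name=experiment_name)
```

The design choice is that attaching never touches the experiment. Inside a managed job the provider owns both the run and its experiment, so verl only continues logging into that run.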