Description
Contact Details [Optional]
No response
System Information
ZenML Version: 0.92.0
Python Version: 3.10.11
Operating System: Linux (Ubuntu 22.04, Azure VM)
Azure ML SDK: azure-ai-ml 1.23.1
azure-core: 1.36.0
azure-identity: 1.25.1
What happened?
Expected Behavior:
When configuring different compute targets for different pipeline steps using step-level AzureMLOrchestratorSettings, each step should run on its designated compute cluster.
According to ZenML's configuration hierarchy documentation:
"Configurations at the step level override those made at the pipeline level"
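The hierarchy quoted above can be modeled as a simple key lookup. This is an illustrative sketch only, not ZenML's actual settings-resolution code: a step-level value, when present, should win over the pipeline-level value for the same key.

```python
# Illustrative model of the configuration hierarchy -- NOT ZenML's
# real implementation. A step-level setting overrides the
# pipeline-level value for the same key; otherwise the pipeline
# default applies.
def effective_setting(key, step_settings, pipeline_settings):
    if key in step_settings:
        return step_settings[key]
    return pipeline_settings.get(key)

pipeline_settings = {"compute_name": "default-cluster"}
step_settings = {"compute_name": "gpu-cluster"}

# The step-level value wins:
print(effective_setting("compute_name", step_settings, pipeline_settings))
```

This is exactly the behavior the bug report expects from the orchestrator and does not observe.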
Actual Behavior:
All steps run on the pipeline-level default compute (or serverless), ignoring step-level orchestrator settings. The AzureML orchestrator does not read or apply per-step compute configurations.
Discovery Credit: Issue discovered by @andras-diffusedrive during pipeline optimization.
Possible Root Cause
The AzureMLOrchestrator.submit_pipeline() method (lines 296-322 in azureml_orchestrator.py) only:
- Reads orchestrator settings at pipeline level
- Applies compute to the entire pipeline
- Never checks individual steps for their own orchestrator settings
- Never sets component_job.compute for individual component jobs
Proposed Solution
File: zenml/integrations/azure/orchestrators/azureml_orchestrator.py
Method: AzureMLOrchestrator.submit_pipeline()
Location: Lines 318-340 (inside the nested azureml_pipeline() function)
Add logic in the azureml_pipeline() nested function to:
- Check each step for orchestrator settings
- Get compute target from step settings
- Apply it to the individual component_job.compute attribute
Patch code:

```python
@pipeline(force_rerun=True, **pipeline_args)
def azureml_pipeline() -> None:
    component_outputs: Dict[str, Any] = {}

    for component_name, component in components.items():
        component_inputs = {}
        if component.inputs:
            component_inputs.update(
                {i: component_outputs[i] for i in component.inputs}
            )
        component_job = component(**component_inputs)

        # PATCH: Apply per-step compute settings if available
        step_config = snapshot.step_configurations[component_name]
        if step_config.config.settings:
            from zenml.utils import settings_utils

            orchestrator_key = settings_utils.get_stack_component_setting_key(self)
            step_settings = step_config.config.settings.get(orchestrator_key)
            if step_settings:
                try:
                    step_compute = create_or_get_compute(
                        ml_client,
                        step_settings,
                        default_compute_name=f"zenml_{self.id}_{component_name}",
                    )
                    if step_compute:
                        component_job.compute = step_compute
                        logger.info(
                            f"Step '{component_name}' will run on compute: {step_compute}"
                        )
                except (AttributeError, TypeError) as e:
                    logger.debug(
                        f"Skipping per-step compute for '{component_name}': {e}"
                    )
        # PATCH end

        # Outputs
        if component_job.outputs:
            component_outputs[component_name] = (
                component_job.outputs.completed
            )
```

Alternative Workarounds
- Use AzureML Step Operators - Requires additional stack components and more complex configuration
- Split into separate pipelines - Loses pipeline atomicity and lineage tracking
- Manual resource allocation - Not automated, requires custom orchestration
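As a minimal sketch of the first workaround, a GPU step operator can be registered and added to the active stack. Names and flag values below are placeholders, and the exact configuration flags depend on the AzureML step operator flavor's settings; treat this as an assumption-laden outline, not a verified recipe:

```shell
# Register an AzureML step operator bound to the GPU cluster and add
# it to the active stack (names and IDs are placeholders).
zenml step-operator register azureml_gpu \
    --flavor=azureml \
    --subscription_id=<SUBSCRIPTION_ID> \
    --resource_group=<RESOURCE_GROUP> \
    --workspace_name=<WORKSPACE_NAME> \
    --compute_target_name=gpu-cluster
zenml stack update -s azureml_gpu
```

Steps needing the GPU cluster would then be annotated with @step(step_operator="azureml_gpu"), at the cost of an extra stack component per compute target, which is exactly the added complexity the workaround list calls out.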
Additional Context
- Azure ML Python SDK v2 supports per-job compute via the component_job.compute property
- The fix is minimal (15 lines)
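The per-node override that the patch relies on can be pictured with a small stand-in class. This is only an illustration of the pattern (each node carries its own compute attribute that overrides the pipeline default); it is not the real Azure ML SDK v2 node class:

```python
# Stand-in for an Azure ML SDK v2 pipeline node -- NOT the real class.
# It only illustrates the pattern the patch uses: each component job
# carries its own `compute` attribute, and setting it overrides the
# pipeline-level default for that job alone.
class ComponentJob:
    def __init__(self, name):
        self.name = name
        self.compute = None  # None -> pipeline default / serverless

def assign_step_compute(jobs, per_step_compute):
    """Bind each job to its step-level compute target, if one is set."""
    for job in jobs:
        target = per_step_compute.get(job.name)
        if target is not None:
            job.compute = target
    return jobs

jobs = assign_step_compute(
    [ComponentJob("prepare_data"), ComponentJob("train_model")],
    {"train_model": "gpu-cluster"},
)
print({j.name: j.compute for j in jobs})
# {'prepare_data': None, 'train_model': 'gpu-cluster'}
```

Jobs without a step-level entry keep compute = None, which is what currently sends every step to the pipeline default or serverless compute.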
References
- ZenML Configuration Hierarchy: https://docs.zenml.io/concepts/steps_and_pipelines/configuration#configuration-hierarchy
- ZenML Resource Settings (similar pattern): https://docs.zenml.io/concepts/steps_and_pipelines/configuration#resource-settings
- Azure ML SDK v2 documentation: https://learn.microsoft.com/en-us/python/api/azure-ai-ml/
- ZenML AzureML Orchestrator documentation: https://docs.zenml.io/stack-components/orchestrators/azureml
- Working implementation: See attached patch above
Note: A working patch has been tested and verified to solve this issue without breaking existing functionality.
Reproduction steps
- Create a pipeline with multiple steps
- Configure different compute targets per step:
```python
from zenml import pipeline, step
from zenml.integrations.azure.flavors import AzureMLOrchestratorSettings
from zenml.config import DockerSettings

# Docker settings
docker_settings = DockerSettings(
    parent_image="pytorch/pytorch:2.9.0-cuda12.8-cudnn9-devel",
    requirements=["zenml", "torch"],
)

# CPU compute for data preparation
cpu_settings = AzureMLOrchestratorSettings(
    mode="compute-cluster",
    compute_name="cpu-cluster",
    synchronous=False,
)

# GPU compute for model training
gpu_settings = AzureMLOrchestratorSettings(
    mode="compute-cluster",
    compute_name="gpu-cluster",
    synchronous=False,
)

@step(settings={"orchestrator": cpu_settings})
def prepare_data():
    # Data preparation on CPU cluster
    pass

@step(settings={"orchestrator": gpu_settings})
def train_model():
    # Training on GPU cluster
    pass

@pipeline(settings={"docker": docker_settings})
def my_pipeline():
    prepare_data()
    train_model()
```

- Run the pipeline
- Observe that both steps run on the same compute (or serverless)
Relevant log output
Code of Conduct
- I agree to follow this project's Code of Conduct