
AzureML Orchestrator Does Not Support Per-Step Compute Configuration #4338

@dani-diffusedrive

Description



System Information

ZenML Version: 0.92.0
Python Version: 3.10.11
Operating System: Linux (Ubuntu 22.04, Azure VM)
Azure ML SDK: azure-ai-ml 1.23.1
azure-core: 1.36.0
azure-identity: 1.25.1

What happened?

Expected Behavior:
When configuring different compute targets for different pipeline steps using step-level AzureMLOrchestratorSettings, each step should run on its designated compute cluster.

According to ZenML's configuration hierarchy documentation:

"Configurations at the step level override those made at the pipeline level"

Actual Behavior:
All steps run on the pipeline-level default compute (or serverless), ignoring step-level orchestrator settings. The AzureML orchestrator does not read or apply per-step compute configurations.

Discovery Credit: Issue discovered by @andras-diffusedrive during pipeline optimization.

Possible Root Cause

The AzureMLOrchestrator.submit_pipeline() method (lines 296-322 in azureml_orchestrator.py) only:

  1. Reads orchestrator settings at pipeline level
  2. Applies compute to the entire pipeline
  3. Never checks individual steps for their own orchestrator settings
  4. Never sets component_job.compute for individual component jobs
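The fallback behavior the fix needs can be sketched in plain Python (dicts stand in for ZenML's settings objects here; `resolve_compute` and the field names are illustrative, not the actual ZenML API):

```python
# Illustrative sketch of the intended per-step compute resolution.
# Plain dicts stand in for ZenML settings objects; the real fix would
# read AzureMLOrchestratorSettings instead.

def resolve_compute(pipeline_settings, step_settings):
    """Return the compute a step should run on: the step-level setting
    wins, otherwise fall back to the pipeline-level default."""
    for settings in (step_settings, pipeline_settings):
        if settings and settings.get("compute_name"):
            return settings["compute_name"]
    return None  # nothing configured -> serverless

pipeline_level = {"mode": "compute-cluster", "compute_name": "cpu-cluster"}
per_step = {
    "prepare_data": None,  # no step-level override
    "train_model": {"mode": "compute-cluster", "compute_name": "gpu-cluster"},
}

for step_name, overrides in per_step.items():
    print(step_name, "->", resolve_compute(pipeline_level, overrides))
# prepare_data -> cpu-cluster
# train_model -> gpu-cluster
```

Today the orchestrator effectively performs only the pipeline-level half of this lookup, so the step-level branch is never taken.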

Proposed Solution

File: zenml/integrations/azure/orchestrators/azureml_orchestrator.py
Method: AzureMLOrchestrator.submit_pipeline()
Location: Lines 318-340 (inside the nested azureml_pipeline() function)

Add logic in the azureml_pipeline() nested function to:

  1. Check each step for orchestrator settings
  2. Get compute target from step settings
  3. Apply to individual component_job.compute

Patch code:

@pipeline(force_rerun=True, **pipeline_args)
def azureml_pipeline() -> None:
    component_outputs: Dict[str, Any] = {}
    for component_name, component in components.items():
        component_inputs = {}
        if component.inputs:
            component_inputs.update(
                {i: component_outputs[i] for i in component.inputs}
            )

        component_job = component(**component_inputs)

        # PATCH: Apply per-step compute settings if available.
        # `snapshot`, `ml_client`, and `create_or_get_compute` come from
        # the enclosing submit_pipeline() scope.
        step_config = snapshot.step_configurations[component_name]
        if step_config.config.settings:
            from zenml.utils import settings_utils

            orchestrator_key = settings_utils.get_stack_component_setting_key(
                self
            )
            step_settings = step_config.config.settings.get(orchestrator_key)

            if step_settings:
                try:
                    step_compute = create_or_get_compute(
                        ml_client,
                        step_settings,
                        default_compute_name=f"zenml_{self.id}_{component_name}",
                    )
                    if step_compute:
                        component_job.compute = step_compute
                        logger.info(
                            f"Step '{component_name}' will run on compute: "
                            f"{step_compute}"
                        )
                except (AttributeError, TypeError) as e:
                    logger.debug(
                        f"Skipping per-step compute for '{component_name}': {e}"
                    )
        # PATCH end

        # Wire up outputs for downstream components.
        if component_job.outputs:
            component_outputs[component_name] = (
                component_job.outputs.completed
            )
Alternative Workarounds

  1. Use AzureML Step Operators - Requires additional stack components and more complex configuration
  2. Split into separate pipelines - Loses pipeline atomicity and lineage tracking
  3. Manual resource allocation - Not automated, requires custom orchestration

Additional Context

  • Azure ML Python SDK v2 supports per-job compute via component_job.compute property
  • The fix is minimal (15 lines)
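Schematically, the SDK v2 model is that each component job carries its own compute attribute, and a pipeline-level default applies only where a job leaves it unset. The stand-in classes below illustrate that contract (they are not the real azure-ai-ml classes):

```python
# Stand-in classes illustrating the Azure ML SDK v2 contract: each
# component job has its own `compute`, and the pipeline default applies
# only when a job leaves it unset. NOT the real azure-ai-ml classes.

class ComponentJob:
    def __init__(self, name):
        self.name = name
        self.compute = None  # per-job compute, settable individually

class PipelineJob:
    def __init__(self, default_compute, jobs):
        self.default_compute = default_compute
        self.jobs = jobs

    def effective_compute(self, job):
        return job.compute or self.default_compute

prep = ComponentJob("prepare_data")
train = ComponentJob("train_model")
train.compute = "gpu-cluster"  # per-step override, as in the patch

pipeline_job = PipelineJob("cpu-cluster", [prep, train])
for job in pipeline_job.jobs:
    print(job.name, "->", pipeline_job.effective_compute(job))
# prepare_data -> cpu-cluster
# train_model -> gpu-cluster
```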


Note: A working patch has been tested and verified to solve this issue without breaking existing functionality.

Reproduction steps

  1. Create a pipeline with multiple steps
  2. Configure different compute targets per step:
from zenml import pipeline, step
from zenml.integrations.azure.flavors import AzureMLOrchestratorSettings
from zenml.config import DockerSettings

# Docker settings
docker_settings = DockerSettings(
    parent_image="pytorch/pytorch:2.9.0-cuda12.8-cudnn9-devel",
    requirements=["zenml", "torch"],
)

# CPU compute for data preparation
cpu_settings = AzureMLOrchestratorSettings(
    mode="compute-cluster",
    compute_name="cpu-cluster",
    synchronous=False,
)

# GPU compute for model training
gpu_settings = AzureMLOrchestratorSettings(
    mode="compute-cluster",
    compute_name="gpu-cluster",
    synchronous=False,
)

@step(settings={"orchestrator": cpu_settings})
def prepare_data():
    # Data preparation on CPU cluster
    pass

@step(settings={"orchestrator": gpu_settings})
def train_model():
    # Training on GPU cluster
    pass

@pipeline(settings={"docker": docker_settings})
def my_pipeline():
    prepare_data()
    train_model()
  3. Run the pipeline
  4. Observe that both steps run on the same compute (or serverless)


Labels: core-team (Issues that are being handled by the core team), planned (Planned for the short term)

Type

Projects

Status

No status

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions