Skip to content

Conversation

SebastianScherer88
Copy link

@SebastianScherer88 SebastianScherer88 commented Sep 14, 2025

Overview

As per this original issue in the official zenml github repository

Includes the following components for the proposed AWS Batch step operator:

  • step operator class and config
  • step operator flavor and settings
  • basic unit test coverage

The proposed implementation hardcodes ECS as an AWS orchestration backend - as opposed to EKS - as its a less complex cloud infrastructure stack to setup than EKS.

The AWS Bath Step operator supports both EC2 and FARGATE AWS Batch platform capabilities. Both can be used by the same zenml stack - the registration of the AWS Batch step operator component merely requires a default job queue (which has to be either FARGATE or EC2), but an arbitrary job queue can be passed to the step as part of its step settings. This means that an AWS stack with multiple job queues (i.e. one FARGATE type one for smaller workloads, and an EC2 type one to support GPU hardware) can be used without having to register separate zenml stacks.

Limitiations

Note that this feature in its current implementation does not support AWS Batch's multinode job type, as it was deemed surplus to requirements. AWS Batch offers a host of large, multi-GPU nodes for distributed and accelerated workloads than can be leverage in a single container type job supported by this feature.

Tests

To run the new tests, run

pytest tests/integration/integrations/aws/step_operators

Summary by CodeRabbit

  • New Features

    • Introduced AWS Batch step operator with a new flavor in the AWS integration.
    • Supports single- and multi-node execution, configurable instance types, environment variables, node count, and timeouts.
    • Provides runtime context access to AWS Batch job details.
    • Compatible with service connectors for AWS resources.
  • Tests

    • Added integration tests for environment/resource mapping and settings initialization.
  • Chores

    • Updated AWS integration dependencies to include boto3 (>=1.40.30).
    • Exposed the new AWS Batch flavor and step operator through the public API.

@CLAassistant
Copy link

CLAassistant commented Sep 14, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

coderabbitai bot commented Sep 14, 2025

📝 Walkthrough

Walkthrough

Introduces an official AWS Batch step operator, its flavor, config/settings, and entrypoint configuration; updates AWS integration exports and requirements; and adds integration tests validating environment/resource mapping and settings initialization.

Changes

Cohort / File(s) Summary of Changes
AWS integration exports & deps
src/zenml/integrations/aws/__init__.py
Adds AWS_BATCH_STEP_OPERATOR_FLAVOR constant; includes boto3>=1.40.30 in REQUIREMENTS; exports AWSBatchStepOperatorFlavor; includes it in AWSIntegration.flavors().
Flavors package
src/zenml/integrations/aws/flavors/__init__.py, src/zenml/integrations/aws/flavors/aws_batch_step_operator_flavor.py
Adds AWSBatchStepOperatorFlavor plus settings/config; re-exports flavor and config; defines service connector requirements, metadata, and lazy implementation import.
Step operators exports
src/zenml/integrations/aws/step_operators/__init__.py
Re-exports AWSBatchStepOperator and get_context; updates all.
AWS Batch step operator implementation
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py, src/zenml/integrations/aws/step_operators/aws_batch_step_operator_entrypoint_config.py
Implements AWSBatchStepOperator, runtime context, job definition models, environment/resource mapping, job definition generation, boto3-based launch flow with polling; adds AWSBatchEntrypointConfiguration class.
Tests
tests/integration/integrations/aws/step_operators/__init__.py, tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator.py, tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator_flavor.py
Adds license header; tests context derivation, env/resource mapping, and settings initialization.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Dev as User Code
    participant Z as ZenML Runtime
    participant Op as AWSBatchStepOperator
    participant B as boto3 (AWS Batch)
    participant AWS as AWS Batch Service

    Dev->>Z: Run pipeline step
    Z->>Op: launch(info, entrypoint, env)
    Note over Op: Resolve Docker image, entrypoint, settings
    Op->>B: register_job_definition(jobDef)
    B->>AWS: Create/Update JobDefinition
    AWS-->>B: ARN/Rev
    Op->>B: submit_job(name, jobQueue, jobDefinition, overrides)
    B->>AWS: Submit job
    AWS-->>B: jobId
    loop Poll until terminal state
        Op->>B: describe_jobs(jobId)
        B->>AWS: Fetch job status
        AWS-->>B: RUNNING | SUCCEEDED | FAILED
    end
    alt SUCCEEDED
        Op-->>Z: Report success
    else FAILED
        Op-->>Z: Raise failure
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Suggested labels

enhancement

Suggested reviewers

  • bcdurak
  • strickvl
  • avishniakov

Poem

I thump my paws with batchy cheer,
New queues to hop, the clouds draw near.
Env mapped neat, resources tight,
I launch a job and watch the night.
Poll, poll, poll—success in sight!
Carrot logs: SUCCEEDED bright. 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description does not follow the repository’s required template, as it lacks the “## Describe changes”, “## Pre-requisites”, and “## Types of changes” sections and their associated content such as the implementation summary, contribution checklist, and change classification. This omission means the description is incomplete relative to the prescribed structure. Please update the description to include the “## Describe changes” section summarizing what was implemented and why, the “## Pre-requisites” checklist confirming compliance with contribution guidelines and test coverage, and the “## Types of changes” section with the appropriate checkbox selected.
✅ Passed checks (2 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 92.00% which is sufficient. The required threshold is 80.00%.
Title Check ✅ Passed The title succinctly describes the primary change—the addition of the AWS Batch step operator flavor and implementation. It directly reflects the main feature introduced by the pull request without extraneous details or vague terms. It is concise and clear for teammates reviewing the project history.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@htahir1 htahir1 linked an issue Sep 15, 2025 that may be closed by this pull request
1 task
@htahir1
Copy link
Contributor

htahir1 commented Sep 16, 2025

@SebastianScherer88 thanks for this - we glanced at this internally and it looks quite decent as a first draft. youll get reviews in due course... and we'll test it on our AWS too

Did you manage to set up your AWS to test it?

@htahir1
Copy link
Contributor

htahir1 commented Sep 16, 2025

@coderabbitai review

@htahir1
Copy link
Contributor

htahir1 commented Sep 16, 2025

@codex review

Copy link
Contributor

coderabbitai bot commented Sep 16, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

To use Codex here, create a Codex account and connect to github.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Nitpick comments (10)
src/zenml/integrations/aws/flavors/aws_batch_step_operator_flavor.py (3)

14-14: Fix stale module docstring (SageMaker → AWS Batch).

Apply:

-"""Amazon SageMaker step operator flavor."""
+"""AWS Batch step operator flavor."""

35-37: Fix class docstring (SageMaker → AWS Batch).

-class AWSBatchStepOperatorSettings(BaseSettings):
-    """Settings for the Sagemaker step operator."""
+class AWSBatchStepOperatorSettings(BaseSettings):
+    """Settings for the AWS Batch step operator."""

38-42: Tighten field descriptions and fix grammar.

Current strings have broken spacing and minor grammar issues.

-    instance_type: Union[str,List[str]] = Field(
-        default='optimal',
-        description="The instance type for AWS Batch to use for the step" \
-        " execution. Example: 'm5.xlarge'",
-    )
+    instance_type: Union[str, List[str]] = Field(
+        default="optimal",
+        description=(
+            "Instance type(s) for the job. Example: 'm5.xlarge' or "
+            "['c6i.4xlarge', 'm6i.4xlarge']."
+        ),
+    )
@@
-    node_count: PositiveInt = Field(
-        default=1,
-        description="The number of AWS Batch nodes to run the step on. If > 1," \
-        "an AWS Batch multinode job will be run, with the network connectivity" \
-        "between the nodes provided by AWS Batch. See https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html" \
-        "for details."
-    )
+    node_count: PositiveInt = Field(
+        default=1,
+        description=(
+            "Number of nodes. If > 1, run as a multi‑node parallel job with "
+            "AWS‑managed networking. See: "
+            "https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html"
+        ),
+    )
@@
-    execution_role: str = Field(
-        "",
-        description="The ECS execution role required to execute the AWS Batch" \
-        " jobs as an ECS tasks."
-    )
+    execution_role: str = Field(
+        "",
+        description="ECS execution role assumed by tasks that run AWS Batch jobs.",
+    )
-    job_role: str = Field(
-        "",
-        description="The ECS job role required by the container runtime inside" \
-        "the ECS task."
-    )
+    job_role: str = Field(
+        "",
+        description="Task role (job role) assumed by the container at runtime.",
+    )

Also applies to: 48-54, 73-82

tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator_flavor.py (1)

3-8: Add assertions; current test only instantiates the settings.

Validate fields to make the test meaningful.

-def test_aws_batch_step_operator_settings():
-    AWSBatchStepOperatorSettings(
-        instance_type="g4dn.xlarge",
-        environment={"key_1":"value_1","key_2":"value_2"},
-        timeout_seconds=60
-    )
+def test_aws_batch_step_operator_settings():
+    """Ensure settings initialize and retain provided values."""
+    s = AWSBatchStepOperatorSettings(
+        instance_type="g4dn.xlarge",
+        environment={"key_1": "value_1", "key_2": "value_2"},
+        timeout_seconds=60,
+    )
+    assert s.instance_type == "g4dn.xlarge"
+    assert s.environment == {"key_1": "value_1", "key_2": "value_2"}
+    assert s.timeout_seconds == 60
+    # verify defaults
+    assert s.node_count == 1
src/zenml/integrations/aws/step_operators/aws_batch_step_operator_entrypoint_config.py (1)

14-14: Clarify module and class docstrings (explicit “AWS Batch”).

-"""Entrypoint configuration for ZenML Batch step operator."""
+"""Entrypoint configuration for the ZenML AWS Batch step operator."""
@@
-class AWSBatchEntrypointConfiguration(StepOperatorEntrypointConfiguration):
-    """Entrypoint configuration for ZenML Batch step operator."""
+class AWSBatchEntrypointConfiguration(StepOperatorEntrypointConfiguration):
+    """Entrypoint configuration for the ZenML AWS Batch step operator."""

Also applies to: 21-22

src/zenml/integrations/aws/step_operators/__init__.py (1)

14-24: Clean up docstring, remove unused noqa, and sort __all__.

Aligns with Ruff hints (RUF100, RUF022) and clarifies package purpose.

-"""Initialization of the Sagemaker Step Operator."""
+"""AWS step operators package (SageMaker and AWS Batch)."""
@@
-from zenml.integrations.aws.step_operators.sagemaker_step_operator import (  # noqa: F401
+from zenml.integrations.aws.step_operators.sagemaker_step_operator import (
     SagemakerStepOperator,
 )
-from zenml.integrations.aws.step_operators.aws_batch_step_operator import (  # noqa: F401
+from zenml.integrations.aws.step_operators.aws_batch_step_operator import (
     AWSBatchStepOperator,
     get_context
 )
-__all__ = ["SagemakerStepOperator","AWSBatchStepOperator","get_context"]
+__all__ = ["AWSBatchStepOperator", "SagemakerStepOperator", "get_context"]
tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator.py (1)

24-27: Pass strings to monkeypatch.setenv; env vars must be strings.

Prevents potential type issues on some platforms/Python versions.

-    monkeypatch.setenv('AWS_BATCH_JOB_MAIN_NODE_INDEX',0)
-    monkeypatch.setenv('AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS','test-address')
-    monkeypatch.setenv('AWS_BATCH_JOB_NODE_INDEX',1)
-    monkeypatch.setenv('AWS_BATCH_JOB_NUM_NODES',2)
+    monkeypatch.setenv("AWS_BATCH_JOB_MAIN_NODE_INDEX", "0")
+    monkeypatch.setenv("AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS", "test-address")
+    monkeypatch.setenv("AWS_BATCH_JOB_NODE_INDEX", "1")
+    monkeypatch.setenv("AWS_BATCH_JOB_NUM_NODES", "2")
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py (3)

14-14: Fix the module docstring to reference AWS Batch instead of SageMaker.

The docstring incorrectly references "Sagemaker Step Operator" instead of AWS Batch Step Operator.

Apply this diff to fix the docstring:

-"""Implementation of the Sagemaker Step Operator."""
+"""Implementation of the AWS Batch Step Operator."""

283-283: Fix improper string formatting in log message.

The log message uses an f-string but doesn't format the log properly through the logger.

Apply this diff to fix the logging:

-                    logger.info(f"AWS Batch only accepts int type cpu resource requirements. Converted {resource_settings.cpu_count} to {cpu_count_int}")
+                    logger.info("AWS Batch only accepts int type cpu resource requirements. Converted %s to %s", resource_settings.cpu_count, cpu_count_int)

361-361: Simplify node range generation using string formatting.

The targetNodes string generation can be simplified.

Apply this diff to simplify:

-                            targetNodes=','.join([str(node_index) for node_index in range(node_count)]),
+                            targetNodes=f"0:{node_count-1}",
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 89ac517 and 5466799.

📒 Files selected for processing (9)
  • src/zenml/integrations/aws/__init__.py (3 hunks)
  • src/zenml/integrations/aws/flavors/__init__.py (2 hunks)
  • src/zenml/integrations/aws/flavors/aws_batch_step_operator_flavor.py (1 hunks)
  • src/zenml/integrations/aws/step_operators/__init__.py (1 hunks)
  • src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py (1 hunks)
  • src/zenml/integrations/aws/step_operators/aws_batch_step_operator_entrypoint_config.py (1 hunks)
  • tests/integration/integrations/aws/step_operators/__init__.py (1 hunks)
  • tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator.py (1 hunks)
  • tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator_flavor.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
src/zenml/**/*.py

⚙️ CodeRabbit configuration file

Review the Python code for conformity with Python best practices.

Files:

  • src/zenml/integrations/aws/step_operators/aws_batch_step_operator_entrypoint_config.py
  • src/zenml/integrations/aws/flavors/__init__.py
  • src/zenml/integrations/aws/step_operators/__init__.py
  • src/zenml/integrations/aws/__init__.py
  • src/zenml/integrations/aws/flavors/aws_batch_step_operator_flavor.py
  • src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
tests/**/*.py

⚙️ CodeRabbit configuration file

tests/**/*.py: "Assess the unit test code employing the PyTest testing framework. Confirm that:

  • The tests adhere to PyTest's established best practices.
  • Test descriptions are sufficiently detailed to clarify the purpose of each test."

Files:

  • tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator_flavor.py
  • tests/integration/integrations/aws/step_operators/__init__.py
  • tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator.py
🪛 Ruff (0.12.2)
src/zenml/integrations/aws/step_operators/__init__.py

16-16: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


19-19: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


23-23: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

🔇 Additional comments (2)
tests/integration/integrations/aws/step_operators/__init__.py (1)

1-14: LGTM: license header only.

No actionable issues.

src/zenml/integrations/aws/flavors/__init__.py (1)

32-35: LGTM: public re-exports for AWS Batch flavor.

Consistent with integration wiring.

Also applies to: 46-49

@SebastianScherer88
Copy link
Author

@SebastianScherer88 thanks for this - we glanced at this internally and it looks quite decent as a first draft. youll get reviews in due course... and we'll test it on our AWS too

Did you manage to set up your AWS to test it?

no, i haven't done any functional testing with an actual aws backend yet. might have some time this coming weekend 🤔

…ch container type) step after hacking + hardcoding some remote rds connection credentials into custom parent image. also needed parent image bc the new flavor module is obvs not installed in official distribution which is in the default docker image used by zenml
@SebastianScherer88
Copy link
Author

i managed to set up a test stack in aws and run some e2e tests. the current version works for single node aws batch jobs (i.e. type container) but the multinode approach doesnt work (yet). not sure if i can get that to work here or if that's a more fundamental design issue here - atm i'm generating identical job descriptions for all batch nodes, which probably means that these containers are all trying to identify as the same zenml step with the store 🤔

…ec2 instances sizes, and instead support both fargate and ec2 (for gpu) backends on ecs
@SebastianScherer88
Copy link
Author

i've decided to drop support for multinode type aws batch jobs as

  • i dont think i can get it to work and
  • its not really needed given that container type, ec2 platform capability jobs can choose extremely large, multigpu instances, so distributed training is covered in the reduced feature offering, too

@bcdurak
Copy link
Contributor

bcdurak commented Sep 25, 2025

Hey @SebastianScherer88, thanks a lot for your contribution. I will take over the reviewer role and test this as soon as possible. In the meanwhile, if you change it from a draft PR to "Ready for review", the CI can help us solve the formatting and linting issues as a start.

@bcdurak bcdurak self-requested a review September 25, 2025 15:29
@SebastianScherer88 SebastianScherer88 marked this pull request as ready for review September 25, 2025 21:34
@SebastianScherer88
Copy link
Author

SebastianScherer88 commented Sep 25, 2025

Hey @SebastianScherer88, thanks a lot for your contribution. I will take over the reviewer role and test this as soon as possible. In the meanwhile, if you change it from a draft PR to "Ready for review", the CI can help us solve the formatting and linting issues as a start.

ready for 👀 😄 @bcdurak

@schustmi schustmi changed the title Feature: AWS step operator AWS Batch step operator Oct 2, 2025
return mapped_resource_settings

@staticmethod
def sanitize_name(name: str) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this works, I think we can improve this a little. See for example here

def sanitize_label(label: str) -> str:
"""Sanitize a label for a Kubernetes resource.
Args:
label: The label to sanitize.
Returns:
The sanitized label.
"""
# https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#rfc-1035-label-names
label = re.sub(r"[^a-z0-9-]", "-", label.lower())
label = re.sub(r"^[-]+", "", label)
label = re.sub(r"[-]+", "-", label)
label = label[:63]
# Remove trailing dashes after truncation to make sure we end with an
# alphanumeric character
label = re.sub(r"[-]+$", "", label)
return label
for a similar method, which also removes leading/trailing/consecutive -.

I'm thinking something like this

job_name = f"{pipeline_name}-{step_name}"
job_name = sanitize_name(job_name, max_length=115)
job_name += random_str(6)

Also, docstring still missing for this method which will probably cause an issue in the CI

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we have now works fine imo, so i'll leave any additional formatting to you

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A random string "would also work", and both the random string as well as your solution here won't get my approval as it's not just about "it somehow works".

Copy link
Author

@SebastianScherer88 SebastianScherer88 Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree to disagree. If you wanted a very specific and non negotiable implementation of this, you should have formulated those requirements up front, rather than insisting on arbitrary personal cosmetic preferences after the fact. This is bad review practice imo, since it unnecessarily imposes on the time of unpaid volunteer contributors.

I have already spent a significant deal of my personal free time to provide a non trivial piece of functionality to the zenml framework for free, including provisioning my own cloud infrastructure for e2e testing, and will now leave this PR as is. You are of course free to use the provided work in its current state as you wish - implement the changes you deem a must-have yourself now or later, or reject the work in its current form. 👍

@schustmi
Copy link
Contributor

schustmi commented Oct 6, 2025

@SebastianScherer88 There are also some formatting issues, running bash scripts/format.sh should solve them I think.

@SebastianScherer88
Copy link
Author

@SebastianScherer88 There are also some formatting issues, running bash scripts/format.sh should solve them I think.

Got a link for the failing CI? If formatting is a prerequisite for passing the CI, you guys should look at pre commits

@schustmi
Copy link
Contributor

schustmi commented Oct 6, 2025

@SebastianScherer88 There are also some formatting issues, running bash scripts/format.sh should solve them I think.

Got a link for the failing CI? If formatting is a prerequisite for passing the CI, you guys should look at pre commits

So are linting, docstrings, unit tests, integration tests and many other things. And we do not run these either as part of a pre-commit hook, as they require extra dependencies and time.

It's all written down here: https://github.com/zenml-io/zenml/blob/main/CONTRIBUTING.md

Failing CI:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Official AWS Batch step operator

5 participants