-
Notifications
You must be signed in to change notification settings - Fork 548
AWS Batch step operator #3954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
AWS Batch step operator #3954
Conversation
Co-authored-by: ZenML GmbH <[email protected]> (cherry picked from commit af53077)
…ws batch job definition conversion
📝 WalkthroughWalkthroughIntroduces an official AWS Batch step operator, its flavor, config/settings, and entrypoint configuration; updates AWS integration exports and requirements; and adds integration tests validating environment/resource mapping and settings initialization. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Dev as User Code
participant Z as ZenML Runtime
participant Op as AWSBatchStepOperator
participant B as boto3 (AWS Batch)
participant AWS as AWS Batch Service
Dev->>Z: Run pipeline step
Z->>Op: launch(info, entrypoint, env)
Note over Op: Resolve Docker image, entrypoint, settings
Op->>B: register_job_definition(jobDef)
B->>AWS: Create/Update JobDefinition
AWS-->>B: ARN/Rev
Op->>B: submit_job(name, jobQueue, jobDefinition, overrides)
B->>AWS: Submit job
AWS-->>B: jobId
loop Poll until terminal state
Op->>B: describe_jobs(jobId)
B->>AWS: Fetch job status
AWS-->>B: RUNNING | SUCCEEDED | FAILED
end
alt SUCCEEDED
Op-->>Z: Report success
else FAILED
Op-->>Z: Raise failure
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…ws batch job definition conversion
a80f266
to
5466799
Compare
@SebastianScherer88 thanks for this - we glanced at this internally and it looks quite decent as a first draft. youll get reviews in due course... and we'll test it on our AWS too Did you manage to set up your AWS to test it? |
@coderabbitai review |
@codex review |
✅ Actions performedReview triggered.
|
To use Codex here, create a Codex account and connect to github. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
🧹 Nitpick comments (10)
src/zenml/integrations/aws/flavors/aws_batch_step_operator_flavor.py (3)
14-14
: Fix stale module docstring (SageMaker → AWS Batch).Apply:
-"""Amazon SageMaker step operator flavor.""" +"""AWS Batch step operator flavor."""
35-37
: Fix class docstring (SageMaker → AWS Batch).-class AWSBatchStepOperatorSettings(BaseSettings): - """Settings for the Sagemaker step operator.""" +class AWSBatchStepOperatorSettings(BaseSettings): + """Settings for the AWS Batch step operator."""
38-42
: Tighten field descriptions and fix grammar.Current strings have broken spacing and minor grammar issues.
- instance_type: Union[str,List[str]] = Field( - default='optimal', - description="The instance type for AWS Batch to use for the step" \ - " execution. Example: 'm5.xlarge'", - ) + instance_type: Union[str, List[str]] = Field( + default="optimal", + description=( + "Instance type(s) for the job. Example: 'm5.xlarge' or " + "['c6i.4xlarge', 'm6i.4xlarge']." + ), + ) @@ - node_count: PositiveInt = Field( - default=1, - description="The number of AWS Batch nodes to run the step on. If > 1," \ - "an AWS Batch multinode job will be run, with the network connectivity" \ - "between the nodes provided by AWS Batch. See https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html" \ - "for details." - ) + node_count: PositiveInt = Field( + default=1, + description=( + "Number of nodes. If > 1, run as a multi‑node parallel job with " + "AWS‑managed networking. See: " + "https://docs.aws.amazon.com/batch/latest/userguide/multi-node-parallel-jobs.html" + ), + ) @@ - execution_role: str = Field( - "", - description="The ECS execution role required to execute the AWS Batch" \ - " jobs as an ECS tasks." - ) + execution_role: str = Field( + "", + description="ECS execution role assumed by tasks that run AWS Batch jobs.", + ) - job_role: str = Field( - "", - description="The ECS job role required by the container runtime inside" \ - "the ECS task." - ) + job_role: str = Field( + "", + description="Task role (job role) assumed by the container at runtime.", + )Also applies to: 48-54, 73-82
tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator_flavor.py (1)
3-8
: Add assertions; current test only instantiates the settings.Validate fields to make the test meaningful.
-def test_aws_batch_step_operator_settings(): - AWSBatchStepOperatorSettings( - instance_type="g4dn.xlarge", - environment={"key_1":"value_1","key_2":"value_2"}, - timeout_seconds=60 - ) +def test_aws_batch_step_operator_settings(): + """Ensure settings initialize and retain provided values.""" + s = AWSBatchStepOperatorSettings( + instance_type="g4dn.xlarge", + environment={"key_1": "value_1", "key_2": "value_2"}, + timeout_seconds=60, + ) + assert s.instance_type == "g4dn.xlarge" + assert s.environment == {"key_1": "value_1", "key_2": "value_2"} + assert s.timeout_seconds == 60 + # verify defaults + assert s.node_count == 1src/zenml/integrations/aws/step_operators/aws_batch_step_operator_entrypoint_config.py (1)
14-14
: Clarify module and class docstrings (explicit “AWS Batch”).-"""Entrypoint configuration for ZenML Batch step operator.""" +"""Entrypoint configuration for the ZenML AWS Batch step operator.""" @@ -class AWSBatchEntrypointConfiguration(StepOperatorEntrypointConfiguration): - """Entrypoint configuration for ZenML Batch step operator.""" +class AWSBatchEntrypointConfiguration(StepOperatorEntrypointConfiguration): + """Entrypoint configuration for the ZenML AWS Batch step operator."""Also applies to: 21-22
src/zenml/integrations/aws/step_operators/__init__.py (1)
14-24
: Clean up docstring, remove unusednoqa
, and sort__all__
.Aligns with Ruff hints (RUF100, RUF022) and clarifies package purpose.
-"""Initialization of the Sagemaker Step Operator.""" +"""AWS step operators package (SageMaker and AWS Batch).""" @@ -from zenml.integrations.aws.step_operators.sagemaker_step_operator import ( # noqa: F401 +from zenml.integrations.aws.step_operators.sagemaker_step_operator import ( SagemakerStepOperator, ) -from zenml.integrations.aws.step_operators.aws_batch_step_operator import ( # noqa: F401 +from zenml.integrations.aws.step_operators.aws_batch_step_operator import ( AWSBatchStepOperator, get_context ) -__all__ = ["SagemakerStepOperator","AWSBatchStepOperator","get_context"] +__all__ = ["AWSBatchStepOperator", "SagemakerStepOperator", "get_context"]tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator.py (1)
24-27
: Pass strings tomonkeypatch.setenv
; env vars must be strings.Prevents potential type issues on some platforms/Python versions.
- monkeypatch.setenv('AWS_BATCH_JOB_MAIN_NODE_INDEX',0) - monkeypatch.setenv('AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS','test-address') - monkeypatch.setenv('AWS_BATCH_JOB_NODE_INDEX',1) - monkeypatch.setenv('AWS_BATCH_JOB_NUM_NODES',2) + monkeypatch.setenv("AWS_BATCH_JOB_MAIN_NODE_INDEX", "0") + monkeypatch.setenv("AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS", "test-address") + monkeypatch.setenv("AWS_BATCH_JOB_NODE_INDEX", "1") + monkeypatch.setenv("AWS_BATCH_JOB_NUM_NODES", "2")src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py (3)
14-14
: Fix the module docstring to reference AWS Batch instead of SageMaker.The docstring incorrectly references "Sagemaker Step Operator" instead of AWS Batch Step Operator.
Apply this diff to fix the docstring:
-"""Implementation of the Sagemaker Step Operator.""" +"""Implementation of the AWS Batch Step Operator."""
283-283
: Fix improper string formatting in log message.The log message uses an f-string but doesn't format the log properly through the logger.
Apply this diff to fix the logging:
- logger.info(f"AWS Batch only accepts int type cpu resource requirements. Converted {resource_settings.cpu_count} to {cpu_count_int}") + logger.info("AWS Batch only accepts int type cpu resource requirements. Converted %s to %s", resource_settings.cpu_count, cpu_count_int)
361-361
: Simplify node range generation using string formatting.The targetNodes string generation can be simplified.
Apply this diff to simplify:
- targetNodes=','.join([str(node_index) for node_index in range(node_count)]), + targetNodes=f"0:{node_count-1}",
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
src/zenml/integrations/aws/__init__.py
(3 hunks)src/zenml/integrations/aws/flavors/__init__.py
(2 hunks)src/zenml/integrations/aws/flavors/aws_batch_step_operator_flavor.py
(1 hunks)src/zenml/integrations/aws/step_operators/__init__.py
(1 hunks)src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
(1 hunks)src/zenml/integrations/aws/step_operators/aws_batch_step_operator_entrypoint_config.py
(1 hunks)tests/integration/integrations/aws/step_operators/__init__.py
(1 hunks)tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator.py
(1 hunks)tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator_flavor.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
src/zenml/**/*.py
⚙️ CodeRabbit configuration file
Review the Python code for conformity with Python best practices.
Files:
src/zenml/integrations/aws/step_operators/aws_batch_step_operator_entrypoint_config.py
src/zenml/integrations/aws/flavors/__init__.py
src/zenml/integrations/aws/step_operators/__init__.py
src/zenml/integrations/aws/__init__.py
src/zenml/integrations/aws/flavors/aws_batch_step_operator_flavor.py
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
tests/**/*.py
⚙️ CodeRabbit configuration file
tests/**/*.py
: "Assess the unit test code employing the PyTest testing framework. Confirm that:
- The tests adhere to PyTest's established best practices.
- Test descriptions are sufficiently detailed to clarify the purpose of each test."
Files:
tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator_flavor.py
tests/integration/integrations/aws/step_operators/__init__.py
tests/integration/integrations/aws/step_operators/test_aws_batch_step_operator.py
🪛 Ruff (0.12.2)
src/zenml/integrations/aws/step_operators/__init__.py
16-16: Unused noqa
directive (non-enabled: F401
)
Remove unused noqa
directive
(RUF100)
19-19: Unused noqa
directive (non-enabled: F401
)
Remove unused noqa
directive
(RUF100)
23-23: __all__
is not sorted
Apply an isort-style sorting to __all__
(RUF022)
🔇 Additional comments (2)
tests/integration/integrations/aws/step_operators/__init__.py (1)
1-14
: LGTM: license header only.No actionable issues.
src/zenml/integrations/aws/flavors/__init__.py (1)
32-35
: LGTM: public re-exports for AWS Batch flavor.Consistent with integration wiring.
Also applies to: 46-49
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
Outdated
Show resolved
Hide resolved
…ation is not configured
no, i haven't done any functional testing with an actual aws backend yet. might have some time this coming weekend 🤔 |
…ch container type) step after hacking + hardcoding some remote rds connection credentials into custom parent image. also needed parent image bc the new flavor module is obvs not installed in official distribution which is in the default docker image used by zenml
i managed to set up a test stack in aws and run some e2e tests. the current version works for single node aws batch jobs (i.e. type |
…ec2 instances sizes, and instead support both fargate and ec2 (for gpu) backends on ecs
i've decided to drop support for multinode type aws batch jobs as
|
Hey @SebastianScherer88, thanks a lot for your contribution. I will take over the reviewer role and test this as soon as possible. In the meanwhile, if you change it from a draft PR to "Ready for review", the CI can help us solve the formatting and linting issues as a start. |
…etworking config attibute and was silently ignoring the kwargs passed in the FARGATE path
ready for 👀 😄 @bcdurak |
src/zenml/integrations/aws/step_operators/aws_batch_step_operator_entrypoint_config.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/aws/flavors/aws_batch_step_operator_flavor.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
Outdated
Show resolved
Hide resolved
src/zenml/integrations/aws/step_operators/aws_batch_step_operator.py
Outdated
Show resolved
Hide resolved
… description name@
…ons for invalid characters
…urce mapping method. updated unit test coverage
return mapped_resource_settings | ||
|
||
@staticmethod | ||
def sanitize_name(name: str) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this works, I think we can improve this a little. See for example here
zenml/src/zenml/integrations/kubernetes/orchestrators/kube_utils.py
Lines 144 to 162 in e025293
def sanitize_label(label: str) -> str: | |
"""Sanitize a label for a Kubernetes resource. | |
Args: | |
label: The label to sanitize. | |
Returns: | |
The sanitized label. | |
""" | |
# https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#rfc-1035-label-names | |
label = re.sub(r"[^a-z0-9-]", "-", label.lower()) | |
label = re.sub(r"^[-]+", "", label) | |
label = re.sub(r"[-]+", "-", label) | |
label = label[:63] | |
# Remove trailing dashes after truncation to make sure we end with an | |
# alphanumeric character | |
label = re.sub(r"[-]+$", "", label) | |
return label |
-
.
I'm thinking something like this
job_name = f"{pipeline_name}-{step_name}"
job_name = sanitize_name(job_name, max_length=115)
job_name += random_str(6)
Also, docstring still missing for this method which will probably cause an issue in the CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we have now works fine imo, so i'll leave any additional formatting to you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A random string "would also work", and both the random string as well as your solution here won't get my approval as it's not just about "it somehow works".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree to disagree. If you wanted a very specific and non negotiable implementation of this, you should have formulated those requirements up front, rather than insisting on arbitrary personal cosmetic preferences after the fact. This is bad review practice imo, since it unnecessarily imposes on the time of unpaid volunteer contributors.
I have already spent a significant deal of my personal free time to provide a non trivial piece of functionality to the zenml framework for free, including provisioning my own cloud infrastructure for e2e testing, and will now leave this PR as is. You are of course free to use the provided work in its current state as you wish - implement the changes you deem a must-have yourself now or later, or reject the work in its current form. 👍
@SebastianScherer88 There are also some formatting issues, running |
Got a link for the failing CI? If formatting is a prerequisite for passing the CI, you guys should look at pre commits |
So are linting, docstrings, unit tests, integration tests and many other things. And we do not run these either as part of a pre-commit hook, as they require extra dependencies and time. It's all written down here: https://github.com/zenml-io/zenml/blob/main/CONTRIBUTING.md Failing CI: |
Overview
As per this original issue in the official zenml github repository
Includes the following components for the proposed AWS Batch step operator:
The proposed implementation hardcodes
ECS
as an AWS orchestration backend - as opposed toEKS
- as its a less complex cloud infrastructure stack to setup thanEKS
.The AWS Bath Step operator supports both EC2 and FARGATE AWS Batch platform capabilities. Both can be used by the same zenml stack - the registration of the AWS Batch step operator component merely requires a default job queue (which has to be either FARGATE or EC2), but an arbitrary job queue can be passed to the step as part of its step settings. This means that an AWS stack with multiple job queues (i.e. one FARGATE type one for smaller workloads, and an EC2 type one to support GPU hardware) can be used without having to register separate zenml stacks.
Limitiations
Note that this feature in its current implementation does not support AWS Batch's
multinode
job type, as it was deemed surplus to requirements. AWS Batch offers a host of large, multi-GPU nodes for distributed and accelerated workloads than can be leverage in a singlecontainer
type job supported by this feature.Tests
To run the new tests, run
Summary by CodeRabbit
New Features
Tests
Chores