Skip to content

Conversation

strickvl
Copy link
Contributor

@strickvl strickvl commented Oct 3, 2025

#3733 is taking a while to get merged, so this takes the Step Operator improvements and builds on them.

Notably, our Modal integration depedency is now set to modal>=1. (Our Modal Deployer will also have this as its base, so I split these changes out into its own PR in the hope it can be merged sooner).

Otherwise this PR does some cleanup and adds some unit tests (the bulk of the code additions).

(It also addresses some fixes to the CI that are unrelated to this PR).

Inline GPU configuration logic directly into the step operator instead of
using a utility function. The logic is simple enough (10 lines of string
formatting) that a shared utility adds unnecessary indirection.

Changes:
- Inline GPU value construction in modal_step_operator.py
- Remove get_gpu_values utility function from utils.py
- Remove test file that only tested trivial string formatting

This addresses the TODO comment to refactor and remove the get_gpu_values
helper function.
Aligns type hint with build_modal_image signature for consistency.
Reduces verbosity by combining assignment and conditional check.
Aligns with existing pattern used throughout Modal integration.
Improves maintainability by defining the value in a single location.
Previously, setting gpu_count > 0 without specifying a GPU type would
silently run on CPU with no warning. This adds comprehensive validation
that raises clear configuration errors with actionable guidance.

Changes:
- Add _compute_modal_gpu_arg() helper to validate and format GPU settings
- Raise StackComponentInterfaceError when gpu_count > 0 but gpu is None
- Warn and default to 1 GPU when gpu is set but gpu_count is 0
- Validate gpu_count is non-negative integer
- Normalize whitespace in GPU type strings
- Add missing List import to utils.py
Previously, setup_modal_client always used the component-level
modal_environment config, preventing per-step overrides via settings.
Now prioritizes settings.modal_environment when provided, falling back
to component config for backward compatibility.
Improves type safety by explicitly typing the resource_settings parameter
as ResourceSettings in _compute_modal_gpu_arg method.
Adds Pydantic field constraints to enforce the timeout range (1-86400 seconds)
that was previously only documented. Uses ge=1 and le=DEFAULT_TIMEOUT_SECONDS
to validate input at instantiation time.
Changes command execution from shell-based to direct varargs invocation,
eliminating shell metacharacter issues and removing the implicit bash
dependency. Adds proper argument quoting via shlex.join() as a fallback
for backward compatibility with older Modal client versions that don't
support varargs command passing.
Improves error handling during the creation of a Modal sandbox by adding a fallback mechanism for command execution. If the direct varargs invocation fails due to a TypeError, the code now falls back to a shell-quoted invocation using shlex.join() to ensure compatibility with older Modal client versions. This change enhances robustness and maintains backward compatibility while preserving argument safety.
@strickvl strickvl added enhancement New feature or request internal To filter out internal PRs and issues dependencies Pull requests that update a dependency file labels Oct 3, 2025
@strickvl strickvl requested a review from Copilot October 3, 2025 16:48
Copy link
Contributor

github-actions bot commented Oct 3, 2025

Documentation Link Check Results

Absolute links check failed
There are broken absolute links in the documentation. See workflow logs for details
Relative links check passed
Last checked: 2025-10-10 07:40:29 UTC

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the Modal step operator integration to use Modal>=1 and introduces significant improvements including proper utility functions, enhanced GPU argument handling, and comprehensive unit tests.

Key changes:

  • Updates Modal dependency from modal>=0.64.49,<1 to modal>=1
  • Refactors shared functionality into a new utils.py module with authentication setup, image building, and stack validation
  • Replaces the simplistic get_gpu_values function with robust _compute_modal_gpu_arg method that validates GPU configurations and prevents silent CPU fallbacks

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/zenml/integrations/modal/__init__.py Updates Modal dependency requirement to version >=1
src/zenml/integrations/modal/utils.py New utility module with authentication setup, image building, and stack validation functions
src/zenml/integrations/modal/step_operators/modal_step_operator.py Major refactor using new utilities, improved GPU handling, and better error reporting
src/zenml/integrations/modal/flavors/modal_step_operator_flavor.py Enhanced configuration with detailed field descriptions, validation, and new authentication fields
docs/book/component-guide/step-operators/modal.md Updated documentation with clearer examples and GPU configuration guidance
tests/integration/integrations/modal/step_operators/ New comprehensive test suite covering utilities, GPU configurations, and flavor validation
Comments suppressed due to low confidence (2)

docs/book/component-guide/step-operators/modal.md:1

  • The scarf tracking image has been removed from the documentation. Verify this removal is intentional and aligns with documentation standards.
---

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@strickvl strickvl requested a review from schustmi October 3, 2025 16:50
@strickvl strickvl removed the request for review from schustmi October 6, 2025 07:55
strickvl and others added 4 commits October 6, 2025 10:07
Add missing parameter documentation to helper functions and exception
documentation to the launch method to satisfy darglint requirements.
@strickvl strickvl requested a review from schustmi October 6, 2025 11:41
Rename zenml_image to modal_image to accurately reflect that the function
returns a Modal-specific image object, not a ZenML construct.

Update error message to describe actual failure scenarios: Modal's inability
to access the container registry or extend the Docker image, rather than
suggesting the image might not exist (which is incorrect since ZenML just
built and pushed it successfully).
Move the memory_int calculation from inside the async with app.run()
context to outside, grouping it with other resource preparations like
gpu_values. This is a pure data transformation that doesn't require
Modal's runtime context, improving code organization and reducing
unnecessary nesting.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file enhancement New feature or request internal To filter out internal PRs and issues run-slow-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants