95 commits
a599f69
Add Modal orchestrator with step operator and orchestrator flavors
htahir1 Jun 12, 2025
b122c63
Add Modal Orchestrator Integration
htahir1 Jun 12, 2025
cd6b59f
Add pipeline-wide resource settings for hardware resources
htahir1 Jun 12, 2025
58d72c5
Add Modal step operator orchestrator to run pipelines on Modal
htahir1 Jun 12, 2025
9a0f2a5
Add log streaming for async execution in Modal Orchestrator
htahir1 Jun 12, 2025
7cf8fd5
Add app warming window hours for container reuse
htahir1 Jun 12, 2025
60d227a
Remove unnecessary exception handling in ModalOrchestrator and ModalS…
htahir1 Jun 12, 2025
45ed008
Refactor return statement to assign and return validator
htahir1 Jun 12, 2025
0b6b9f9
Add Modal integration utils module for mypy discoverability
htahir1 Jun 12, 2025
d8318b1
Update Modal integration to require Modal version 1
htahir1 Jun 12, 2025
b7400b6
Introduce ModalExecutionMode enum for execution modes
htahir1 Jun 12, 2025
ee15117
Refactor Modal authentication setup and deployment
htahir1 Jun 12, 2025
85aeef4
Update platform references in log messages
htahir1 Jun 12, 2025
01f7ef2
Include pipeline run ID for isolation and conflict prevention
htahir1 Jun 13, 2025
6deba69
Merge remote-tracking branch 'origin/develop' into feature/modal-orch…
htahir1 Jun 13, 2025
481972b
Refactor Modal log streaming and improve resource selection
htahir1 Jun 13, 2025
12ac7c4
Clean up Modal orchestrator imports and code organization
htahir1 Jun 13, 2025
788ae84
Improve Modal integration import consistency and requirements
htahir1 Jun 13, 2025
8cdb3c2
Comprehensive Modal orchestrator documentation improvements
htahir1 Jun 13, 2025
f2218b1
Refactor error message formatting and GPU count handling
htahir1 Jun 13, 2025
e9e2574
Add base image requirements and GPU configuration details
htahir1 Jun 13, 2025
74bf8ca
Merge branch 'develop' into feature/modal-orchestrator
htahir1 Jun 16, 2025
331a164
Update src/zenml/integrations/modal/utils.py
safoinme Jun 17, 2025
2770359
Merge remote-tracking branch 'origin/develop' into feature/modal-orch…
htahir1 Jun 18, 2025
f3c0159
Merge remote-tracking branch 'origin/develop' into feature/modal-orch…
htahir1 Jun 23, 2025
3173f6a
Apply suggestions from code review
htahir1 Jun 23, 2025
37d4a57
Update environment variable names to be consistent.- Rename `environm…
htahir1 Jun 23, 2025
e8b6b42
Refactor ModalOrchestrator app naming for image builds
htahir1 Jun 23, 2025
32a56df
Use orchestrator run ID for complete isolation during testing
htahir1 Jun 23, 2025
7616eef
Refactor modal orchestrator for new app-function architecture
htahir1 Jun 23, 2025
3e7fbc5
Refactor Modal orchestrator for better efficiency
htahir1 Jun 23, 2025
5f22bb0
Update environment variable naming convention in Modal step operator
htahir1 Jun 24, 2025
376d7e4
Update Modal orchestrator to use sandbox architecture
htahir1 Jun 24, 2025
04d0204
Update Modal orchestrator for per-step sandboxes
htahir1 Jun 24, 2025
34f717a
Add generate_sandbox_tags function and set tags in sandboxes
htahir1 Jun 24, 2025
bc1f488
Refactor Modal orchestrator and image building functions
htahir1 Jun 24, 2025
32b5571
Added Modal step operator setup guide and examples
htahir1 Jun 24, 2025
31b2c13
Refactor configuration options inheritance logic
htahir1 Jun 24, 2025
4a90c3d
Update modal environment flag format in step operators
htahir1 Jun 24, 2025
312d490
Update modal orchestrator and step operator flavors
htahir1 Jun 24, 2025
1392ec0
Merge branch 'develop' into feature/modal-orchestrator
htahir1 Jun 24, 2025
f9c0818
Merge remote-tracking branch 'origin/develop' into feature/modal-orch…
htahir1 Jun 24, 2025
11d63b3
Update entrypoint configuration description
htahir1 Jun 24, 2025
75e0615
Update configuration section heading to "Configuration Examples".
htahir1 Jun 25, 2025
4fa278f
Create deep copy of settings for Modal orchestrator
htahir1 Jun 25, 2025
79a449a
Add ModalOrchestrator for running pipelines on Modal platform
htahir1 Jun 26, 2025
a499755
Merge remote-tracking branch 'origin/develop' into feature/modal-orch…
htahir1 Jun 26, 2025
2ae1e8b
Merge branch 'develop' into feature/modal-orchestrator
htahir1 Jun 26, 2025
72a9f72
Merge branch 'develop' into feature/modal-orchestrator
htahir1 Jun 27, 2025
9dc7bfd
Merge remote-tracking branch 'origin/develop' into feature/modal-orch…
htahir1 Jul 11, 2025
5be346d
Refactor building Modal image with deployment caching
htahir1 Jul 11, 2025
c12a37f
Refactor environment variable secrets handling
htahir1 Jul 11, 2025
9a6daaf
Add environment variables as command prefix in Modal orchestrator
htahir1 Jul 11, 2025
0b5311a
Implement Modal sandbox executor for ZenML orchestration
htahir1 Jul 11, 2025
5f4c207
Refactor execution methods for better clarity
htahir1 Jul 11, 2025
623c5fb
Refactor entrypoint command creation for sandbox executor
htahir1 Jul 11, 2025
27af8a3
Remove redundant logging message in orchestrator class
htahir1 Jul 11, 2025
0b37850
Refactor ModalSandboxExecutor for step-specific settings
htahir1 Jul 11, 2025
b5935f6
Refactor image caching logic for pipeline builds
htahir1 Jul 11, 2025
4be1fc7
Add shared image cache for pipeline step execution
htahir1 Jul 12, 2025
be58a00
Ensure deployment build exists before image cache operations
htahir1 Jul 12, 2025
1b2cfbe
Refactor resource configuration methods for robustness
htahir1 Jul 12, 2025
c739f88
Refactor resource settings conversion and GPU type handling
htahir1 Jul 12, 2025
5d7a44d
Refactor resource settings extraction for clarity
htahir1 Jul 12, 2025
7685c46
Refactor GPU configuration handling, handle missing type
htahir1 Jul 12, 2025
d9fbeb2
Use sane defaults for pipeline resource settings
htahir1 Jul 12, 2025
4fb95b2
Refactor function docstrings for better clarity
htahir1 Jul 12, 2025
c410179
Refactor Modal image building logic for clarity
htahir1 Jul 13, 2025
6387d42
Update README.md links for LLM-Complete Guide project
htahir1 Jul 13, 2025
083320f
Add Modal orchestrator integration tests
htahir1 Jul 13, 2025
15e27a8
Update logging messages to be more descriptive
htahir1 Jul 13, 2025
4f4ab4a
Merge branch 'develop' into feature/modal-orchestrator
htahir1 Jul 14, 2025
5f4c69a
Update src/zenml/integrations/modal/step_operators/modal_step_operato…
htahir1 Jul 14, 2025
bbbe9df
Update src/zenml/integrations/modal/orchestrators/modal_orchestrator.py
htahir1 Jul 14, 2025
0208989
Update execution mode setting to "mode" in Modal orchestrator
htahir1 Jul 14, 2025
83d1bbc
Refactor Modal orchestrator code for better readability
htahir1 Jul 14, 2025
b7d8cc2
Merge branch 'feature/modal-orchestrator' of github.com:zenml-io/zenm…
htahir1 Jul 14, 2025
bb9790b
Add Docker configuration validation for Modal orchestrator
htahir1 Jul 14, 2025
5b09c58
Refactor variable names for execution mode in ModalOrchestrator
htahir1 Jul 14, 2025
2e618d7
Refactor log statements for better readability
htahir1 Jul 15, 2025
44121d1
Update ModalOrchestrator to submit pipelines to Modal
htahir1 Jul 15, 2025
099264e
Refactor method signature in ModalOrchestrator class
htahir1 Jul 15, 2025
8d6bfec
Add custom app name option for Modal orchestrator
htahir1 Jul 15, 2025
8ec1c08
Add pydantic Field descriptions to ModalStepOperatorSettings
htahir1 Jul 15, 2025
d272580
Start some cleanup
schustmi Jul 29, 2025
0160c16
Merge branch 'develop' into feature/modal-orchestrator
schustmi Jul 29, 2025
1490d39
Next round of cleanup
schustmi Jul 29, 2025
6f23076
More cleanup
schustmi Jul 29, 2025
0533723
Use correct key
schustmi Jul 29, 2025
a638328
Add todo
schustmi Jul 29, 2025
092ed52
Merge remote-tracking branch 'origin/develop' into feature/modal-orch…
htahir1 Aug 7, 2025
f15ce9c
Merge branch 'develop' into feature/modal-orchestrator
htahir1 Sep 12, 2025
9053fc1
Merge branch 'develop' into feature/modal-orchestrator
htahir1 Oct 2, 2025
e131b22
Merge branch 'feature/modal-orchestrator' of github.com:zenml-io/zenm…
htahir1 Oct 2, 2025
322f1e2
Merge remote-tracking branch 'origin/develop' into feature/modal-orch…
htahir1 Oct 15, 2025
375 changes: 375 additions & 0 deletions docs/book/component-guide/orchestrators/modal.md
@@ -0,0 +1,375 @@
---
description: Orchestrating your pipelines to run on Modal's serverless cloud platform.
---

# Modal Orchestrator

Using the ZenML `modal` integration, you can orchestrate and scale your ML pipelines on [Modal's](https://modal.com/) serverless cloud platform with minimal setup and maximum efficiency.

The Modal orchestrator is designed for speed and cost-effectiveness, running entire pipelines in single serverless functions to minimize cold starts and optimize resource utilization.

Comment on lines 7 to 10
Contributor: Maybe some representative screenshot of the Modal UI in here to make the docs a bit friendlier?
Contributor Author: I think it's fine without.
{% hint style="warning" %}
This component is only meant to be used within the context of a [remote ZenML deployment scenario](https://docs.zenml.io/getting-started/deploying-zenml/). Usage with a local ZenML deployment may lead to unexpected behavior!
{% endhint %}

## When to use it

You should use the Modal orchestrator if:

* you want a serverless solution that scales to zero when not in use.
* you're looking for fast pipeline execution with minimal cold start overhead.
* you want cost-effective ML pipeline orchestration without managing infrastructure.
* you need easy access to GPUs and high-performance computing resources.
* you prefer a simple setup process without complex Kubernetes configurations.

## How to deploy it

The Modal orchestrator runs on Modal's cloud infrastructure, so you don't need to deploy or manage any servers. You just need:

1. A [Modal account](https://modal.com/) (free tier available)
2. Modal CLI installed and authenticated
3. A [remote ZenML deployment](https://docs.zenml.io/getting-started/deploying-zenml/) for production use

## How to use it

To use the Modal orchestrator, we need:

* The ZenML `modal` integration installed. If you haven't done so, run:
```shell
zenml integration install modal
```
* [Docker](https://www.docker.com) installed and running.
* A [remote artifact store](../artifact-stores/README.md) as part of your stack.
* A [remote container registry](../container-registries/README.md) as part of your stack.
* Modal CLI installed and authenticated:
```shell
pip install modal
modal setup
```

Comment on `pip install modal`, Contributor: do you need the pip install if you have the integration already installed?

### Setting up the orchestrator

You can register the orchestrator with or without explicit Modal credentials:

**Option 1: Using Modal CLI authentication (recommended for development)**

```shell
# Register the orchestrator (uses Modal CLI credentials)
zenml orchestrator register <ORCHESTRATOR_NAME> \
--flavor=modal \
--synchronous=true

# Register and activate a stack with the new orchestrator
zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set
```

**Option 2: Using Modal API token (recommended for production)**

```shell
# Register the orchestrator with explicit credentials
zenml orchestrator register <ORCHESTRATOR_NAME> \
--flavor=modal \
--token=<MODAL_TOKEN> \
--workspace=<MODAL_WORKSPACE> \
--synchronous=true

# Register and activate a stack with the new orchestrator
zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set
```

Comment on lines 100 to 106, Contributor: I think this should use `--token-id` and `--token-secret` separately, as per the code?

You can get your Modal token from the [Modal dashboard](https://modal.com/settings/tokens).

{% hint style="info" %}
ZenML will build a Docker image called `<CONTAINER_REGISTRY_URI>/zenml:<PIPELINE_NAME>` which includes your code, and uses it to run your pipeline steps in Modal functions. Check out [this page](https://docs.zenml.io/how-to/customize-docker-builds/) if you want to learn more about how ZenML builds these images and how you can customize them.
{% endhint %}

You can now run any ZenML pipeline using the Modal orchestrator:

```shell
python file_that_runs_a_zenml_pipeline.py
```

### Modal UI

Modal provides an excellent web interface where you can monitor your pipeline runs in real-time, view logs, and track resource usage.

You can access the Modal dashboard at [modal.com/apps](https://modal.com/apps) to see your running and completed functions.

### Configuration overview

The Modal orchestrator uses two types of settings following ZenML's standard pattern:

1. **`ResourceSettings`** (standard ZenML) - for hardware resource quantities:
- `cpu_count` - Number of CPU cores
- `memory` - Memory allocation (e.g., "16GB")
- `gpu_count` - Number of GPUs to allocate

2. **`ModalOrchestratorSettings`** (Modal-specific) - for Modal platform configuration:
- `gpu` - GPU type specification (e.g., "T4", "A100", "H100")
- `region` - Cloud region preference
- `cloud` - Cloud provider selection
- `execution_mode` - How to run the pipeline
- `timeout`, `min_containers`, `max_containers` - Performance settings

{% hint style="info" %}
**GPU Configuration**: Use `ResourceSettings.gpu_count` to specify how many GPUs you need, and `ModalOrchestratorSettings.gpu` to specify what type of GPU. Modal will combine these automatically (e.g., `gpu_count=2` + `gpu="A100"` becomes `"A100:2"`).
{% endhint %}
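
As an illustration, the combination described above can be sketched roughly like this (a hypothetical helper for explanation only, not the actual integration code):

```python
from typing import Optional

def combine_gpu_spec(gpu_type: Optional[str], gpu_count: Optional[int]) -> Optional[str]:
    """Sketch: merge a GPU type and count into a Modal-style GPU string."""
    if not gpu_type:
        return None  # no GPU type requested -> CPU-only
    if gpu_count and gpu_count > 1:
        return f"{gpu_type}:{gpu_count}"  # e.g. "A100:2"
    return gpu_type  # single GPU, e.g. "A100"

print(combine_gpu_spec("A100", 2))  # → A100:2
```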

### Additional configuration

Here's how to configure both types of settings:

```python
from zenml.integrations.modal.flavors.modal_orchestrator_flavor import (
ModalOrchestratorSettings
)
from zenml.config import ResourceSettings

# Configure Modal-specific settings
modal_settings = ModalOrchestratorSettings(
gpu="A100", # GPU type (optional)
region="us-east-1", # Preferred region
cloud="aws", # Cloud provider
execution_mode="pipeline", # or "per_step"
timeout=3600, # 1 hour timeout
min_containers=1, # Keep warm containers
max_containers=10, # Scale up to 10 containers
)

# Configure hardware resources (quantities)
resource_settings = ResourceSettings(
cpu_count=16, # Number of CPU cores
memory="32GB", # 32GB RAM
gpu_count=1 # Number of GPUs (combined with gpu type below)
)

@pipeline(
settings={
"orchestrator": modal_settings,
"resources": resource_settings
}
)
def my_modal_pipeline():
# Your pipeline steps here
...
```

### Resource configuration

{% hint style="info" %}
**Pipeline-Level Resources**: The Modal orchestrator uses pipeline-level resource settings to configure the Modal function for the entire pipeline. All steps share the same Modal function resources. Configure resources at the `@pipeline` level for best results.
{% endhint %}

You can configure pipeline-wide resource requirements using `ResourceSettings` for hardware resources and `ModalOrchestratorSettings` for Modal-specific configurations:

```python
from zenml.config import ResourceSettings
from zenml.integrations.modal.flavors.modal_orchestrator_flavor import (
ModalOrchestratorSettings
)

# Configure resources at the pipeline level (recommended)
@pipeline(
settings={
"resources": ResourceSettings(
cpu_count=16,
memory="32GB",
gpu_count=1 # These resources apply to the entire pipeline
),
"orchestrator": ModalOrchestratorSettings(
gpu="A100", # GPU type for the entire pipeline
region="us-west-2"
)
}
)
def my_pipeline():
first_step() # Runs with pipeline resources: 16 CPU, 32GB RAM, 1x A100
second_step() # Runs with same resources: 16 CPU, 32GB RAM, 1x A100
...

@step
def first_step():
# Uses pipeline-level resource configuration
...

@step
def second_step():
# Uses same pipeline-level resource configuration
...
```

### Execution modes

The Modal orchestrator supports two execution modes:

1. **`pipeline` (default)**: Runs the entire pipeline sequentially in a single Modal function, avoiding per-step cold-start overhead and minimizing cost

Contributor: Not sure I understand why this `pipeline` option is max speed. Isn't it running everything sequentially in the same container? Wouldn't running things in parallel in separate Modal function calls run faster?

2. **`per_step`**: Runs each step in a separate Modal function call for granular control and debugging

{% hint style="info" %}
**Resource Sharing**: Both execution modes use the same Modal function with the same resource configuration (from pipeline-level settings). The difference is whether steps run sequentially in one function call (`pipeline`) or as separate function calls (`per_step`).
{% endhint %}

```python
# Fast execution (default) - entire pipeline in one function
modal_settings = ModalOrchestratorSettings(
execution_mode="pipeline"
)

# Granular execution - each step separate (useful for debugging)
modal_settings = ModalOrchestratorSettings(
execution_mode="per_step"
)
```

### Using GPUs

Modal makes it easy to use GPUs for your ML workloads. Use `ResourceSettings` to specify the number of GPUs and `ModalOrchestratorSettings` to specify the GPU type:

```python
from zenml.config import ResourceSettings
from zenml.integrations.modal.flavors.modal_orchestrator_flavor import (
ModalOrchestratorSettings
)

@step(
settings={
"resources": ResourceSettings(
gpu_count=1 # Number of GPUs to allocate
),
"orchestrator": ModalOrchestratorSettings(
gpu="A100", # GPU type: "T4", "A10G", "A100", "H100"
region="us-east-1"
)
}
)
def train_model():
# Your GPU-accelerated training code
# Modal will provision 1x A100 GPU (gpu_count=1 + gpu="A100")
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
...
```

Available GPU types include:
- `T4` - Cost-effective for inference and light training
- `A10G` - Balanced performance for training and inference
- `A100` - High-performance for large model training
- `H100` - Latest generation for maximum performance

**Examples of GPU configurations (applied to entire pipeline):**

```python
# Pipeline with GPU - configure on first step or pipeline level
@pipeline(
settings={
"resources": ResourceSettings(gpu_count=1),
"orchestrator": ModalOrchestratorSettings(gpu="A100")
}
)
def gpu_pipeline():
# All steps in this pipeline will have access to 1x A100 GPU
step_one()
step_two()

# Multiple GPUs - configure at pipeline level
@pipeline(
settings={
"resources": ResourceSettings(gpu_count=4),
"orchestrator": ModalOrchestratorSettings(gpu="A100")
}
)
def multi_gpu_pipeline():
# All steps in this pipeline will have access to 4x A100 GPUs
training_step()
evaluation_step()
```

### Synchronous vs Asynchronous execution

You can choose whether to wait for pipeline completion or run asynchronously:

```python
# Wait for completion (default)
modal_settings = ModalOrchestratorSettings(
synchronous=True
)

# Fire-and-forget execution
modal_settings = ModalOrchestratorSettings(
synchronous=False
)
```

### Authentication with different environments

For production deployments, you can specify different Modal environments:
Comment on lines 519 to 533, Contributor: Maybe add a little info box in this section (or above, linking down here) saying that you might want two different stacks, each associated with a different Modal environment, one for prod and the other for development.

```python
modal_settings = ModalOrchestratorSettings(
environment="production", # or "staging", "dev", etc.
workspace="my-company"
)
```

### Warm containers for faster execution

Modal orchestrator uses persistent apps with warm containers to minimize cold starts:

```python
modal_settings = ModalOrchestratorSettings(
min_containers=2, # Keep 2 containers warm
max_containers=20, # Scale up to 20 containers
)

@pipeline(
settings={
"orchestrator": modal_settings
}
)
def my_pipeline():
...
```

This lets your pipelines start executing right away, without waiting for new containers to initialize.

## Best practices

1. **Use pipeline mode for production**: The default `pipeline` execution mode runs your entire pipeline in one function, minimizing overhead and cost.

2. **Separate resource and orchestrator settings**: Use `ResourceSettings` for hardware (CPU, memory, GPU count) and `ModalOrchestratorSettings` for Modal-specific configurations (GPU type, region, etc.).

3. **Configure appropriate timeouts**: Set realistic timeouts for your workloads:
```python
modal_settings = ModalOrchestratorSettings(
timeout=7200 # 2 hours
)
```

4. **Choose the right region**: Select regions close to your data sources to minimize transfer costs and latency.

5. **Use appropriate GPU types**: Match GPU types to your workload requirements - don't use A100s for simple inference tasks.

6. **Monitor resource usage**: Use Modal's dashboard to track your resource consumption and optimize accordingly.

## Troubleshooting

### Common issues

1. **Authentication errors**: Ensure your Modal token is correctly configured and has the necessary permissions.

2. **Image build failures**: Check that your Docker registry credentials are properly configured in your ZenML stack.

3. **Resource limits**: If you hit resource limits, consider breaking large steps into smaller ones or requesting quota increases from Modal.

4. **Network timeouts**: For long-running steps, ensure your timeout settings are appropriate.

### Getting help

- Check the [Modal documentation](https://modal.com/docs) for platform-specific issues
- Monitor your functions in the [Modal dashboard](https://modal.com/apps)
- Use `zenml logs` to view detailed pipeline execution logs

For more information and a full list of configurable attributes of the Modal orchestrator, check out the [SDK Docs](https://sdkdocs.zenml.io/latest/integration_code_docs/integrations-modal.html#zenml.integrations.modal.orchestrators).

<figure><img src="https://static.scarf.sh/a.png?x-pxid=f0b4f458-0a54-4fcd-aa95-d5ee424815bc" alt="ZenML Scarf"><figcaption></figcaption></figure>
Contributor (suggested change): delete the `<figure>` Scarf pixel line above.
1 change: 1 addition & 0 deletions docs/book/component-guide/toc.md
@@ -19,6 +19,7 @@
* [Skypilot VM Orchestrator](orchestrators/skypilot-vm.md)
* [HyperAI Orchestrator](orchestrators/hyperai.md)
* [Lightning AI Orchestrator](orchestrators/lightning.md)
* [Modal Orchestrator](orchestrators/modal.md)
* [Develop a custom orchestrator](orchestrators/custom.md)
* [Artifact Stores](artifact-stores/README.md)
* [Local Artifact Store](artifact-stores/local.md)