Skip to content

[Modal] Add Modal (modal.com) as a cloud provider#9952

Open
jqueguiner wants to merge 5 commits into
skypilot-org:masterfrom
jqueguiner:master
Open

[Modal] Add Modal (modal.com) as a cloud provider#9952
jqueguiner wants to merge 5 commits into
skypilot-org:masterfrom
jqueguiner:master

Conversation

@jqueguiner

@jqueguiner jqueguiner commented Jun 25, 2026

Copy link
Copy Markdown

Adds Modal (modal.com) as a first-class SkyPilot cloud provider, so tasks, jobs, and services can run on Modal's serverless GPU/CPU infrastructure through the standard sky interface. Modeled after the existing RunPod integration.

  • Cloud + authsky/clouds/modal.py + sky/adaptors/modal.py (LazyImport), registered in sky/clouds/__init__.py and sky/skylet/constants.py. Token-pair auth (~/.modal.toml or MODAL_TOKEN_ID/MODAL_TOKEN_SECRET) via sky check modal. Unsupported features (STOP, MULTI_NODE, AUTOSTOP, …) rejected with clear NotSupportedError messages.
  • Catalogsky/catalog/modal_catalog.py seeds a static catalog at import (T4, L4, A10, A100-40GB/80GB, L40S, H100, H200, B200) with hourly pricing; the optimizer enumerates/prices/ranks Modal with no network call on the optimizer path.
  • Provisionersky/provision/modal/ launches a modal.Sandbox with sshd over an unencrypted tunnel, exposes the tunnel host:port to SkyPilot's SSH bootstrap, re-queries the tunnel live each call, and terminates the Sandbox on sky down. 24h Sandbox-lifetime warning emitted at launch.
  • Backend wiringsky/templates/modal-ray.yml.j2, a _get_cluster_config_template entry, and the configure_ssh_info auth dispatch. Additive only — no change to existing clouds or the client/server API.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh — clean (YAPF/isort/mypy/pylint); import sky shows zero Modal SDK import cost.
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Test details

Unit (49 tests, all pass):

pytest tests/unit_tests/test_modal.py tests/unit_tests/test_modal_warning.py \
       tests/unit_tests/test_modal_catalog.py tests/unit_tests/test_modal_provision.py

Covers cloud class, catalog schema/pricing (offline), credential check, provisioner (mocked SDK), and 13 per-feature NotSupportedError assertions.

Smoke: credential-gated tests/smoke_tests/test_cluster_job.py::test_modal_launch — collected and skipped cleanly without credentials.

Live end-to-end on real Modal infra (T4):

  • sky launch -c modal-try --gpus T4 provisioned the Sandbox, installed Ray+Skylet over the tunnel, and ran the task to completion — nvidia-smi reported Tesla T4, 15360 MiB.
  • sky exec modal-try -- nvidia-smi -L succeeded twice (live tunnel re-query, not stale cache).
  • sky down modal-try terminated cleanly with zero lingering sandboxes (modal.Sandbox.list(tags={'skypilot-cluster':'modal-try'}) == 0).

Backward compatibility is not separately run — the change is additive (a new cloud + new files) and does not modify the client/server protocol or any existing provider.

jqueguiner and others added 2 commits June 25, 2026 19:11
Adds Modal as a first-class SkyPilot cloud so tasks, jobs, and services can
run on Modal's serverless GPU/CPU infrastructure through the standard `sky`
interface, modeled after the existing RunPod integration.

- Cloud class + LazyImport adaptor (sky/clouds/modal.py, sky/adaptors/modal.py),
  registered in sky/clouds/__init__.py and sky/skylet/constants.py; token-pair
  auth (~/.modal.toml / MODAL_TOKEN_ID+MODAL_TOKEN_SECRET) surfaced via
  `sky check modal`; unsupported features (STOP, MULTI_NODE, AUTOSTOP, ...)
  rejected with clear NotSupportedError messages.
- Static catalog seeded at import (sky/catalog/modal_catalog.py) covering
  T4, L4, A10, A100-40GB/80GB, L40S, H100, H200, B200 with hourly pricing;
  optimizer enumerates/prices/ranks Modal offline (no network on the
  optimizer path).
- Provisioner (sky/provision/modal/) that launches a modal.Sandbox with sshd
  over an unencrypted tunnel, exposes the tunnel host:port to SkyPilot's SSH
  bootstrap, and tears the Sandbox down on `sky down`. 24h Sandbox-lifetime
  warning emitted at launch.
- Backend wiring: modal-ray.yml.j2 template + _get_cluster_config_template
  entry + configure_ssh_info dispatch.

Tested:
- 49 unit tests (cloud class, catalog schema, credential check, provisioner
  mocks, 13 per-feature NotSupportedError assertions): all pass.
- Credential-gated smoke test (tests/smoke_tests/test_cluster_job.py::
  test_modal_launch); collected and skipped cleanly without credentials.
- `bash format.sh` clean; `import sky` shows zero Modal SDK import cost.
- Live end-to-end on real Modal infra (T4): `sky launch` provisions the
  Sandbox, installs Ray+Skylet over the tunnel, runs the task to completion
  (nvidia-smi reports "Tesla T4, 15360 MiB"); two `sky exec` calls succeed
  with live tunnel re-query; `sky down` terminates with zero lingering
  sandboxes.
[Modal] Add Modal (modal.com) as a cloud provider

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Modal as a new cloud provider in SkyPilot, implementing the Modal cloud adaptor, local catalog seeding, and sandbox-based instance provisioning. The feedback focuses on optimizing performance by skipping slow network tunnel queries when they are not needed (such as during status checks, launch, and termination), improving robustness by handling OSError during credential checks and using atomic file writes for catalog seeding to avoid race conditions, ensuring bulk termination does not abort early on a single failure, and fixing a template bug where the SSH cluster key mount was incorrectly placed inside a loop.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread sky/provision/modal/instance.py Outdated
Comment thread sky/catalog/modal_catalog.py Outdated
Comment thread sky/clouds/modal.py
Comment thread sky/provision/modal/instance.py Outdated
Comment thread sky/provision/modal/instance.py
Comment thread sky/provision/modal/instance.py Outdated
Comment thread sky/provision/modal/instance.py Outdated
Comment thread sky/provision/modal/utils.py Outdated
Comment thread sky/templates/modal-ray.yml.j2
CI fixes:
- mypy: widen make_deploy_resources_variables return type to Dict[str, Any]
  (cpu/memory are Optional[float|int], not Union[str, bool, None]).
- pylint (pylint_quotes C4001): store the bundled catalog as a single
  triple-quoted string literal instead of ~450 double-quoted fragments.
- optimizer dryruns (KeyError 'AvailabilityZone'): Modal has no zones and its
  catalog omits the AvailabilityZone column; validate_region_zone now rejects a
  passed zone with a clean ValueError so cloud inference skips Modal instead of
  crashing in _filter_region_zone.
- unit (test_24h_warning_is_emitted): capture the warning even when the module
  logger has propagate=False by attaching caplog's handler to sky.clouds.modal
  directly.

Review feedback:
- terminate_instances now attempts every Sandbox and aggregates failures before
  raising, so one failure cannot orphan the remaining billable Sandboxes.
- _seed_catalog_if_missing writes to a temp file and atomically os.replace()s it
  to avoid races between concurrent imports; an existing catalog is never
  overwritten.
- query_instances (status only) skips the slow per-Sandbox SSH-tunnel lookup via
  a new query_tunnels flag on list_instances; get_cluster_info still re-queries
  the live tunnel each call (D-03).
@jqueguiner jqueguiner marked this pull request as ready for review June 25, 2026 22:22
…rnings

- Wrap three lines exceeding the 80-column limit (C0301).
- Chain the SSH-tunnel timeout RuntimeError with `from exc` (W0707).
- `del` the interface-required-but-unused arguments in launch(),
  query_instances(), and query_ports() (W0613).
…ery, ssh-key mount

- _check_modal_credentials: catch OSError (permission denied / file removed
  between the existence check and open) and fail gracefully instead of crashing.
- get_cluster_info: pass query_tunnels=True explicitly, since the live SSH
  tunnel endpoint is required to build the cluster's InstanceInfo.
- modal-ray.yml.j2: move the ~/.ssh/sky-cluster-key file mount out of the
  credentials loop so it is mounted exactly once (was duplicated per credential
  and omitted entirely when there were no credentials).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant