[Modal] Add Modal (modal.com) as a cloud provider#9952
Conversation
Adds Modal as a first-class SkyPilot cloud so tasks, jobs, and services can run on Modal's serverless GPU/CPU infrastructure through the standard `sky` interface, modeled after the existing RunPod integration. - Cloud class + LazyImport adaptor (sky/clouds/modal.py, sky/adaptors/modal.py), registered in sky/clouds/__init__.py and sky/skylet/constants.py; token-pair auth (~/.modal.toml / MODAL_TOKEN_ID+MODAL_TOKEN_SECRET) surfaced via `sky check modal`; unsupported features (STOP, MULTI_NODE, AUTOSTOP, ...) rejected with clear NotSupportedError messages. - Static catalog seeded at import (sky/catalog/modal_catalog.py) covering T4, L4, A10, A100-40GB/80GB, L40S, H100, H200, B200 with hourly pricing; optimizer enumerates/prices/ranks Modal offline (no network on the optimizer path). - Provisioner (sky/provision/modal/) that launches a modal.Sandbox with sshd over an unencrypted tunnel, exposes the tunnel host:port to SkyPilot's SSH bootstrap, and tears the Sandbox down on `sky down`. 24h Sandbox-lifetime warning emitted at launch. - Backend wiring: modal-ray.yml.j2 template + _get_cluster_config_template entry + configure_ssh_info dispatch. Tested: - 49 unit tests (cloud class, catalog schema, credential check, provisioner mocks, 13 per-feature NotSupportedError assertions): all pass. - Credential-gated smoke test (tests/smoke_tests/test_cluster_job.py:: test_modal_launch); collected and skipped cleanly without credentials. - `bash format.sh` clean; `import sky` shows zero Modal SDK import cost. - Live end-to-end on real Modal infra (T4): `sky launch` provisions the Sandbox, installs Ray+Skylet over the tunnel, runs the task to completion (nvidia-smi reports "Tesla T4, 15360 MiB"); two `sky exec` calls succeed with live tunnel re-query; `sky down` terminates with zero lingering sandboxes.
[Modal] Add Modal (modal.com) as a cloud provider
There was a problem hiding this comment.
Code Review
This pull request introduces support for Modal as a new cloud provider in SkyPilot, implementing the Modal cloud adaptor, local catalog seeding, and sandbox-based instance provisioning. The feedback focuses on optimizing performance by skipping slow network tunnel queries when they are not needed (such as during status checks, launch, and termination), improving robustness by handling OSError during credential checks and using atomic file writes for catalog seeding to avoid race conditions, ensuring bulk termination does not abort early on a single failure, and fixing a template bug where the SSH cluster key mount was incorrectly placed inside a loop.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
CI fixes: - mypy: widen make_deploy_resources_variables return type to Dict[str, Any] (cpu/memory are Optional[float|int], not Union[str, bool, None]). - pylint (pylint_quotes C4001): store the bundled catalog as a single triple-quoted string literal instead of ~450 double-quoted fragments. - optimizer dryruns (KeyError 'AvailabilityZone'): Modal has no zones and its catalog omits the AvailabilityZone column; validate_region_zone now rejects a passed zone with a clean ValueError so cloud inference skips Modal instead of crashing in _filter_region_zone. - unit (test_24h_warning_is_emitted): capture the warning even when the module logger has propagate=False by attaching caplog's handler to sky.clouds.modal directly. Review feedback: - terminate_instances now attempts every Sandbox and aggregates failures before raising, so one failure cannot orphan the remaining billable Sandboxes. - _seed_catalog_if_missing writes to a temp file and atomically os.replace()s it to avoid races between concurrent imports; an existing catalog is never overwritten. - query_instances (status only) skips the slow per-Sandbox SSH-tunnel lookup via a new query_tunnels flag on list_instances; get_cluster_info still re-queries the live tunnel each call (D-03).
…rnings - Wrap three lines exceeding the 80-column limit (C0301). - Chain the SSH-tunnel timeout RuntimeError with `from exc` (W0707). - `del` the interface-required-but-unused arguments in launch(), query_instances(), and query_ports() (W0613).
…ery, ssh-key mount - _check_modal_credentials: catch OSError (permission denied / file removed between the existence check and open) and fail gracefully instead of crashing. - get_cluster_info: pass query_tunnels=True explicitly, since the live SSH tunnel endpoint is required to build the cluster's InstanceInfo. - modal-ray.yml.j2: move the ~/.ssh/sky-cluster-key file mount out of the credentials loop so it is mounted exactly once (was duplicated per credential and omitted entirely when there were no credentials).
Adds Modal (modal.com) as a first-class SkyPilot cloud provider, so tasks, jobs, and services can run on Modal's serverless GPU/CPU infrastructure through the standard
skyinterface. Modeled after the existing RunPod integration.sky/clouds/modal.py+sky/adaptors/modal.py(LazyImport), registered insky/clouds/__init__.pyandsky/skylet/constants.py. Token-pair auth (~/.modal.tomlorMODAL_TOKEN_ID/MODAL_TOKEN_SECRET) viasky check modal. Unsupported features (STOP, MULTI_NODE, AUTOSTOP, …) rejected with clearNotSupportedErrormessages.sky/catalog/modal_catalog.pyseeds a static catalog at import (T4, L4, A10, A100-40GB/80GB, L40S, H100, H200, B200) with hourly pricing; the optimizer enumerates/prices/ranks Modal with no network call on the optimizer path.sky/provision/modal/launches amodal.Sandboxwithsshdover an unencrypted tunnel, exposes the tunnel host:port to SkyPilot's SSH bootstrap, re-queries the tunnel live each call, and terminates the Sandbox onsky down. 24h Sandbox-lifetime warning emitted at launch.sky/templates/modal-ray.yml.j2, a_get_cluster_config_templateentry, and theconfigure_ssh_infoauth dispatch. Additive only — no change to existing clouds or the client/server API.Tested (run the relevant ones):
bash format.sh— clean (YAPF/isort/mypy/pylint);import skyshows zero Modal SDK import cost./smoke-test(CI) orpytest tests/test_smoke.py(local)/smoke-test -k test_name(CI) orpytest tests/test_smoke.py::test_name(local)/quicktest-core(CI) orpytest tests/smoke_tests/test_backward_compat.py(local)Test details
Unit (49 tests, all pass):
Covers cloud class, catalog schema/pricing (offline), credential check, provisioner (mocked SDK), and 13 per-feature
NotSupportedErrorassertions.Smoke: credential-gated
tests/smoke_tests/test_cluster_job.py::test_modal_launch— collected and skipped cleanly without credentials.Live end-to-end on real Modal infra (T4):
sky launch -c modal-try --gpus T4provisioned the Sandbox, installed Ray+Skylet over the tunnel, and ran the task to completion —nvidia-smireportedTesla T4, 15360 MiB.sky exec modal-try -- nvidia-smi -Lsucceeded twice (live tunnel re-query, not stale cache).sky down modal-tryterminated cleanly with zero lingering sandboxes (modal.Sandbox.list(tags={'skypilot-cluster':'modal-try'})== 0).Backward compatibility is not separately run — the change is additive (a new cloud + new files) and does not modify the client/server protocol or any existing provider.