SkyPilot v0.11.2
SkyPilot v0.11.2: Slurm Support, JobGroups, Enhanced Pools, External Links, Autostop Hooks, 7x data mount speed up and More
SkyPilot v0.11.2 delivers Slurm support in Beta, JobGroups for heterogeneous parallel workloads, and significantly enhanced Pools with autoscaling, multi-job scheduling, and heterogeneous GPU support. This release also brings Autostop Hooks, 7x faster MOUNT_CACHED uploads, automatic EFA setup on EKS, and numerous admin, security, and performance improvements.
Get it now with:

    uv pip install "skypilot>=0.11.2"

Or, upgrade your team SkyPilot API server:

    NAMESPACE=skypilot
    RELEASE_NAME=skypilot
    VERSION=0.11.2
    helm repo update skypilot
    helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
      --set apiService.image=berkeleyskypilot/skypilot:$VERSION \
      --version $VERSION --devel --reuse-values

Breaking Change: Python 3.9+ required — Python 3.7 and 3.8 are no longer supported (#8489). Please upgrade before installing this release.
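Since Python 3.7 and 3.8 are no longer supported, it can help to verify the interpreter version before upgrading. The following is a minimal sketch; `meets_minimum` and `MIN_PYTHON` are hypothetical names used for illustration and are not part of SkyPilot.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the breaking-change note


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if a (major, minor, ...) tuple satisfies the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not meets_minimum():
        raise SystemExit(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    print("Python version OK")
```

Running the script with the interpreter that will host SkyPilot exits with an error message on 3.8 or older.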
Highlights
[Beta] Slurm Support
SkyPilot now supports Slurm as a new infrastructure backend, enabling users to orchestrate workloads on HPC clusters alongside cloud VMs and Kubernetes — all through the same unified interface (docs, #5491, #8138).
This release brings comprehensive Slurm capabilities:
- Multi-node distributed workloads — run distributed training jobs across multiple Slurm nodes with proper environment variable propagation and per-node logging (#8219)
- Containerized execution via NVIDIA pyxis/enroot — specify Docker images with --image-id for reproducible, GPU-accelerated workloads (#8604, #8609)
- Resource-scoped SSH sessions — ssh <cluster> drops you inside the Slurm job allocation, so nvidia-smi correctly reflects only your allocated GPUs (#8268)
- Interactive SSH authentication (2FA, password prompts) for clusters requiring keyboard-interactive auth (#8317)
- Partition support — Slurm partitions are mapped to SkyPilot zones, enabling partition-aware scheduling (#8198)
- Multi-cluster support — configure and use multiple Slurm clusters simultaneously
- Dashboard integration — Slurm clusters appear in the infrastructure page with status and GPU utilization
See the Slurm documentation for setup instructions.
JobGroups: Heterogeneous Parallel Workloads
JobGroups enable running multiple jobs with different resource requirements together as a managed group (blog, #8456). Define multi-task pipelines in a single multi-document YAML file:

    # Header: job group metadata
    name: rl-training
    execution: parallel
    primary_tasks: [trainer]
    termination_delay: 30s
    ---
    name: trainer
    resources:
      accelerators: A100:8
    run: python train.py
    ---
    name: reward-server
    resources:
      accelerators: A100:1
    run: python reward_server.py

Key capabilities:
- Parallel cluster launch and monitoring — all jobs launched and tracked concurrently
- Inter-job networking — hostname discovery based on task names (.) for job-to-job communication
- Dashboard UI — expandable rows for multi-task groups, task-specific log filtering
- CLI support — sky jobs logs --task-name <name> for viewing specific task logs
Example applications included: RL post-training (RLHF) pipeline and parallel train-eval pipeline.
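To make the roles of primary_tasks and termination_delay in the JobGroup header more concrete, here is a conceptual sketch in plain asyncio. It assumes that auxiliary tasks are stopped a fixed delay after all primary tasks finish. That reading of the two fields is an assumption, and `run_job_group`, `trainer`, `reward_server`, and `demo` are hypothetical stand-ins rather than SkyPilot code.

```python
import asyncio


async def run_job_group(tasks, primary_tasks, termination_delay):
    """Run all tasks concurrently; stop the rest once primary tasks finish.

    tasks: dict mapping task name -> zero-argument coroutine function.
    Returns a dict: task name -> "succeeded" | "failed" | "terminated".
    """
    running = {name: asyncio.create_task(fn()) for name, fn in tasks.items()}

    # Wait for every primary task; exceptions are collected, not raised.
    await asyncio.gather(*(running[n] for n in primary_tasks),
                         return_exceptions=True)
    # Grace period before the remaining (auxiliary) tasks are stopped.
    await asyncio.sleep(termination_delay)

    status = {}
    for name, task in running.items():
        if not task.done():
            task.cancel()
            status[name] = "terminated"
        elif task.cancelled():
            status[name] = "terminated"
        elif task.exception() is not None:
            status[name] = "failed"
        else:
            status[name] = "succeeded"
    # Let cancellations propagate before returning.
    await asyncio.gather(*running.values(), return_exceptions=True)
    return status


async def trainer():
    await asyncio.sleep(0.05)  # stands in for `python train.py`


async def reward_server():
    while True:  # a long-running service that never exits on its own
        await asyncio.sleep(0.01)


def demo():
    return asyncio.run(run_job_group(
        {"trainer": trainer, "reward-server": reward_server},
        primary_tasks=["trainer"],
        termination_delay=0.05,
    ))
```

In this sketch, `demo()` reports the trainer as succeeded and the reward server as terminated, mirroring an RL setup where the serving side only needs to live as long as training does.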
Enhanced Pools: Multi-Job Scheduling and Heterogeneous GPUs
SkyPilot Pools receive significant upgrades in this release:
- Multiple jobs per worker — the scheduler now performs resource-aware bin-packing, tracking CPU, memory, and accelerator usage to fit multiple jobs on a single worker (#8192, #8279)
- Autoscaling — pools can now automatically scale workers up and down (including to zero) based on queue length. Specify min_workers, max_workers, and target queue length; a QueueLengthAutoscaler handles the rest while protecting running jobs from cancellation (#8483)
- Heterogeneous GPU support — specify any_of resource configurations and the scheduler dynamically resolves to available hardware (#8315):

    resources:
      any_of:
        - accelerators: T4:1
        - accelerators: A100:1

- 9x faster concurrent job launch — launching 100 concurrent jobs reduced from 4.5 minutes to 30 seconds using the new --num-jobs argument (#7891)
- Fractional GPU scheduling fix — fractional GPU jobs now correctly schedule across all workers (#8509)
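The two scheduling ideas in this list, resource-aware bin-packing and queue-length autoscaling, can be sketched in a few lines. This is an illustration under stated assumptions, not SkyPilot's scheduler: the first-fit policy, the ceiling-based scaling formula, and the names `Worker`, `place`, and `desired_workers` are all hypothetical, and the sketch does not model protecting running jobs during scale-down.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    free: dict  # remaining capacity, e.g. {"cpus": 8, "memory_gb": 32, "gpus": 1}
    jobs: list = field(default_factory=list)


def fits(free, request):
    """True if every requested resource is available in sufficient quantity."""
    return all(free.get(key, 0) >= amount for key, amount in request.items())


def place(job_name, request, workers):
    """First-fit: put the job on the first worker with enough free resources."""
    for worker in workers:
        if fits(worker.free, request):
            for key, amount in request.items():
                worker.free[key] -= amount
            worker.jobs.append(job_name)
            return worker.name
    return None  # nothing fits; the job stays queued


def desired_workers(queue_length, target_queue_length, min_workers, max_workers):
    """Aim for roughly target_queue_length queued jobs per worker, within bounds."""
    wanted = math.ceil(queue_length / target_queue_length)
    return max(min_workers, min(max_workers, wanted))


workers = [
    Worker("w0", {"cpus": 8, "memory_gb": 32, "gpus": 1}),
    Worker("w1", {"cpus": 8, "memory_gb": 32, "gpus": 1}),
]
half_gpu_job = {"cpus": 4, "gpus": 0.5}
placements = [place(f"job-{i}", half_gpu_job, workers) for i in range(5)]
```

With two workers and half-GPU jobs, the first four jobs are packed two per worker and the fifth stays queued; `desired_workers(0, 2, 0, 4)` returns 0, which corresponds to scaling to zero on an empty queue.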
External Links
The SkyPilot dashboard now automatically detects W&B links generated by your AI workloads, so you no longer need to dig through job logs to find your training panels (#8405).
Autostop Hooks
An autostop hook mechanism allows running custom scripts before a cluster is automatically stopped, for example to save checkpoints, sync W&B runs, or send Slack notifications (#8412):

    resources:
      autostop:
        idle_minutes: 10
        hook: |
          wandb sync
          curl -X POST $SLACK_WEBHOOK -d '{"text": "Cluster shutting down"}'
        hook_timeout: 300

- New sky logs --autostop command to view hook execution logs
- sky exec is rejected on AUTOSTOPPING clusters; sky launch waits for autostop to complete before restarting
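The interaction between a hook script and hook_timeout can be pictured as a shell command run under a deadline. The sketch below uses only the Python standard library; `run_hook` is a hypothetical helper, and how SkyPilot itself executes hooks or reacts to a failed or timed-out hook is not specified here.

```python
import subprocess


def run_hook(script, timeout_seconds):
    """Run a shell hook; return "succeeded", "failed", or "timed_out".

    On timeout, subprocess.run kills the shell process it started;
    processes spawned by the script may need separate cleanup.
    """
    try:
        result = subprocess.run(["sh", "-c", script], timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return "timed_out"
    return "succeeded" if result.returncode == 0 else "failed"
```

For example, `run_hook("wandb sync", 300)` would mirror the configuration above, assuming `wandb` is installed on the machine running the hook.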
Automatic EFA Setup on Amazon EKS
SkyPilot now automatically configures Elastic Fabric Adapter (EFA) on EKS with a single flag (#8557):

    resources:
      network_tier: best

This automates what was previously a complex manual setup, delivering ~78.8 GB/s inter-node bandwidth (vs ~4.1 GB/s without EFA), critical for distributed training performance. EFA interfaces are allocated proportionally to the requested GPU count.
7x MOUNT_CACHED Uploads Speed Up
Parallel uploads are now the default for MOUNT_CACHED file mounts, delivering a 7x speedup — flush time dropped from 151s to 21s for a ~14.6 GB test workload (#8455). A new data.mount_cached.sequential_upload config option allows reverting to sequential uploads if needed.
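The reported numbers work out to 151 s / 21 s ≈ 7.2x. Why parallelism helps for I/O-bound uploads can be seen in a small simulation. This is a generic sketch, not SkyPilot's upload path: `upload` is a hypothetical stand-in that just sleeps, and the thread-pool size is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def upload(path):
    """Stand-in for an I/O-bound upload of one file (hypothetical)."""
    time.sleep(0.05)
    return path


def upload_sequential(paths):
    return [upload(p) for p in paths]


def upload_parallel(paths, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload, paths))  # map preserves input order


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Timing both functions on the same list of paths shows the parallel variant finishing in roughly the time of a single simulated upload, while the sequential variant takes the sum of all of them.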
Exit Code-Based Job Recovery
Users can now specify exit codes that trigger automatic job recovery in managed jobs (#8324):

    resources:
      job_recovery:
        recover_on_exit_codes: [29]

When a job exits with a specified code, SkyPilot automatically recovers it — useful for transient failures with known error codes.
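Conceptually, exit-code-based recovery is a loop that re-runs a command whenever it exits with one of the listed codes. The sketch below shows that logic locally with subprocess; `run_with_recovery` and its `max_restarts` cap are hypothetical and only illustrate the idea, not how managed jobs implement recovery.

```python
import subprocess
import sys


def run_with_recovery(cmd, recover_on_exit_codes, max_restarts=3):
    """Re-run cmd while it exits with a recoverable code.

    Returns (final_exit_code, number_of_restarts).
    """
    restarts = 0
    while True:
        code = subprocess.run(cmd).returncode
        if code in recover_on_exit_codes and restarts < max_restarts:
            restarts += 1
            continue  # recoverable failure: launch again
        return code, restarts


def exits_with(code):
    """Build a command that exits immediately with the given code."""
    return [sys.executable, "-c", f"import sys; sys.exit({code})"]
```

A command that always exits with 29 is retried until the cap is reached, while exit code 1 is returned immediately because it is not in the recovery list.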
Windows WSL Support
SkyPilot now automatically detects when it is running in WSL and seamlessly sets up VSCode Remote-SSH for Windows users (#8669).
Admin Deployment Improvements
- External authentication proxy support — deploy behind AWS ALB with Cognito, Azure Front Door, or custom SSO proxies; supports both plaintext and JWT header formats (#8751)
- Sidecar container support — Istio, Datadog, and other sidecar injection no longer breaks K8s provisioning; SkyPilot explicitly targets the ray-node container (#8353, #8444)
What’s New
Kubernetes
- Improved Kueue integration — 24-hour provisioning timeout, workspace-level queue configuration, controller pod exclusion, and pod annotations (#8484)
- GPU detection fix — L40S no longer misidentified as L4; added Blackwell, newer Hopper, and Ada Lovelace GPUs (#8593)
- kubernetes.set_pod_resource_limits — set pod CPU/memory limits relative to requests for pod resource limit enforcement (#8644)
- Volume NOT_READY status — actionable error messages for PVC issues; background refresh daemon; --refresh flag (#8524)
- Unified ingress resource — share a single ingress across services (#8532)
- Kubernetes python client race condition fix (#8705)
- Service resource leak fix for pod eviction/Kueue preemption (#8745), GPU misconfiguration hints (#8629), per-task remote_identity override (#8659), not-ready node exclusion (#8172), GKE autoscaler compatibility (#8326)
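One common way a GPU such as the L40S ends up reported as an L4 is substring matching on node labels. Whether that was the actual cause behind #8593 is an assumption; the sketch below only illustrates the general pitfall, and `KNOWN_GPUS`, `naive_match`, and `token_match` are hypothetical names.

```python
KNOWN_GPUS = ["L4", "L40", "L40S", "H100", "H200", "B200"]


def naive_match(label):
    """Buggy: returns the first known name that is a substring of the label."""
    for name in KNOWN_GPUS:
        if name in label:
            return name
    return None


def token_match(label):
    """Safer: compare whole dash-separated tokens, so "L40S" never matches "L4"."""
    tokens = label.upper().replace("_", "-").split("-")
    for name in sorted(KNOWN_GPUS, key=len, reverse=True):
        if name in tokens:
            return name
    return None


print(naive_match("NVIDIA-L40S"))  # → L4
print(token_match("NVIDIA-L40S"))  # → L40S
```

Matching on whole tokens keeps shorter model names from shadowing longer ones that share a prefix.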
API Server & Security
- Concurrent request context isolation — fixed workspace isolation violations from context leaking between users (#8354)
- Zip Slip vulnerability fix — patched path traversal in /upload endpoint (#8723)
- Polling-based sky api login — works around Chrome Private Network Access restrictions (#8590)
- Memory leak fix — AWS session cache leak in status refresh daemon (#8098)
- Kubeconfig no longer uploaded when using SERVICE_ACCOUNT (#8386)
- Docker password redaction in logs (#8080), Bearer token + Basic Auth fix (#8503), disable basic auth option (#8694)
- API server plugin system (#7993, #8272, #8700, #8410)
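For readers unfamiliar with Zip Slip: the attack relies on archive entries such as ../evil.txt that resolve outside the extraction directory. A typical defence is to resolve each member path and reject anything that escapes the destination. The sketch below shows that general technique with the standard library; it is not the patch from #8723, and `safe_extract` and `demo` are hypothetical names.

```python
import io
import os
import tempfile
import zipfile


def safe_extract(zf, dest_dir):
    """Extract zf into dest_dir, rejecting members that would escape it."""
    dest_root = os.path.realpath(dest_dir)
    for member in zf.infolist():
        target = os.path.realpath(os.path.join(dest_root, member.filename))
        if os.path.commonpath([dest_root, target]) != dest_root:
            raise ValueError(f"unsafe path in archive: {member.filename!r}")
    # Only reached when every member resolves inside dest_root.
    zf.extractall(dest_root)


def make_zip(name, data):
    """Build an in-memory archive with a single member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, data)
    buf.seek(0)
    return zipfile.ZipFile(buf)


def demo():
    """Return (benign archive extracted, malicious archive rejected)."""
    with tempfile.TemporaryDirectory() as dest:
        safe_extract(make_zip("ok/data.txt", "fine"), dest)
        extracted = os.path.isfile(os.path.join(dest, "ok", "data.txt"))
        try:
            safe_extract(make_zip("../evil.txt", "bad"), dest)
            rejected = False
        except ValueError:
            rejected = True
    return extracted, rejected
```

Validating all members before extracting anything means a malicious archive leaves no partial output behind.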
Deployment
- Managed jobs logs are persisted when using RWX persistent storage with RollingUpdate (#8537)
- fullnameOverride (#8528)
- CoreWeave, DigitalOcean credentials support (#8200, #7931)
- SSH node pool config (#8249)
- Helm RollingUpdate SQLite fix (#8607)
- Scheduling constraints (#8134)
Managed Jobs & Dashboard
- GPU metrics for managed jobs — utilization, memory, temperature, and power in the dashboard (#8718)
- 6x infra page speedup and server-side pagination (#8523, #8611, #8651)
- Label filtering for clusters and infrastructure (#8507)
- Log download compression ~96% (#8626), GPU temperature panel (#8472), Grafana link (#8599)
- Job cancel reliability fix (#8203), multi-user cluster fix (#8233), SSH key permission fix (#8316)
Cloud Integrations
- AWS multi-VPC failover — specify multiple VPC names with automatic failover across regions (#8722)
- AWS p5e.48xlarge H200 and Melbourne region support (#8465, #8055)
- Choose instance type based on AWS local disk (#8661)
- S3 mounting with non-static credentials — fixed for AWS SSO, IAM roles, Pod Identity/IRSA (#8358)
- Together AI InfiniBand (#8581)
- GCP Queued Resources for TPUs — TPU VM provisioning requests are queued until capacity becomes available (#8481)
- Vast.ai create_instance_kwargs (#8536), Vast.ai SSH fix (#8614)
- Azure blobfuse2 Debian 13 fix (#8730)
- Nebius UFW security hardening (#8627)
Core, Backend & UX
- --secret-file CLI flag — avoid shell history exposure (#8646)
- provision.install_conda config — skip Miniconda for faster launches (#8662)
- Pending cluster state (#8262), custom .sky location (#8153), restore autostop on start (#8022), multiple skylets per host (#8156)
- SSH access during bad Ray state (#8649), volume fail-fast (#8739), greenlet import fix (#8653), Resources.copy() fix (#8648), pandas compatibility (#8643), Docker MOTD fix (#8632)
- SSH Node Pools: deployment refactored (#8154, #8173, #8226), heterogeneous node ban (#8230), lstrip('ssh-') bug fix (#8417), package distribution fix (#8508)
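The lstrip('ssh-') item refers to a well-known Python pitfall: str.lstrip treats its argument as a set of characters rather than a prefix. The sketch below demonstrates the behaviour in isolation. The example names are made up, and the use of str.removeprefix (available from Python 3.9) is shown as one possible remedy, not necessarily what #8417 does.

```python
def strip_prefix_buggy(name):
    # lstrip removes any leading run of the characters 's', 'h', and '-'.
    return name.lstrip("ssh-")


def strip_prefix(name):
    # removeprefix removes the exact prefix "ssh-" at most once.
    return name.removeprefix("ssh-")


print(strip_prefix_buggy("ssh-shared"))  # → ared
print(strip_prefix("ssh-shared"))        # → shared
```

Names that happen to start with s, h, or a dash after the prefix are the ones silently truncated by the buggy variant; other names, such as ssh-gpu, come out correct by coincidence.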
Documentation & Examples
- Air-gapped environment docs (#7312), NVIDIA Dynamo serving (#7333), EKS IAM roles guide (#8358), Slurm migration guide (#8515)
- Prefect integration (#8506), SAM3 video segmentation (#8384), VeRL search (#8241), Marimo notebooks (#8123)
Contributors
Thank you to all contributors who made this release possible!
@aflah02, @alex000kim, @andylizf, @aylei, @Bokki-Ryu, @brianstrauch, @cblmemo, @cg505, @concretevitamin, @cwhitak3r, @dan-blanchard, @DanielZhangQD, @Elden123, @funkypenguin, @hentt30, @isagi-y22, @JiangJiaWei1103, @kevinmingtarja, @koaning, @kyuds, @laimis9133, @liuwb, @lloyd-brown, @lucamanolache, @m-braganca, @Maknee, @Michaelvll, @mk0walsk, @mmcclean-aws, @mt5225, @nakinnubis, @oelachqar, @otutukingsley, @Philmod, @php-workx, @qicz, @rohansonecha, @romilbhardwaj, @sachdva, @SalikovAlex, @seahyinghang8, @SeungjinYang, @YashIIT0909, @yurekami, @atoniolo76, @zpoint
New Contributors:
- @aflah02 made their first contribution in #8093
- @atoniolo76 made their first contribution
- @Bokki-Ryu made their first contribution in #8134
- @dan-blanchard made their first contribution in #8037
- @hentt30 made their first contribution
- @isagi-y22 made their first contribution
- @laimis9133 made their first contribution
- @liuwb made their first contribution
- @m-braganca made their first contribution
- @mk0walsk made their first contribution in #8237
- @mt5225 made their first contribution
- @nakinnubis made their first contribution
- @oelachqar made their first contribution
- @otutukingsley made their first contribution
- @Philmod made their first contribution
- @php-workx made their first contribution in #8444
- @qicz made their first contribution in #7890
- @sachdva made their first contribution
- @seahyinghang8 made their first contribution
- @YashIIT0909 made their first contribution
- @yurekami made their first contribution
Special thanks to the community for bug reports, feature requests, and pull requests that helped improve SkyPilot!
Full Changelog
For a complete list of changes, see the commit history.


