SkyPilot v0.11.2

@Michaelvll Michaelvll released this 03 Mar 23:46
· 705 commits to master since this release
b94dad7

SkyPilot v0.11.2: Slurm Support, JobGroups, Enhanced Pools, External Links, Autostop Hooks, 7x data mount speed up and More

SkyPilot v0.11.2 delivers Slurm support in Beta, JobGroups for heterogeneous parallel workloads, and significantly enhanced Pools with autoscaling, multi-job scheduling, and heterogeneous GPU support. This release also brings Autostop Hooks, a 7x speedup for MOUNT_CACHED uploads, automatic EFA setup on EKS, and numerous admin, security, and performance improvements.

Get it now with:

uv pip install "skypilot>=0.11.2"

Or, upgrade your team SkyPilot API server:

NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.11.2

helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
  --set apiService.image=berkeleyskypilot/skypilot:$VERSION \
  --version $VERSION --devel --reuse-values

Breaking Change: Python 3.9+ required — Python 3.7 and 3.8 are no longer supported (#8489). Please upgrade before installing this release.


Highlights

[Beta] Slurm Support

SkyPilot now supports Slurm as a new infrastructure backend, enabling users to orchestrate workloads on HPC clusters alongside cloud VMs and Kubernetes — all through the same unified interface (docs, #5491, #8138).

This release brings comprehensive Slurm capabilities:

  • Multi-node distributed workloads — run distributed training jobs across multiple Slurm nodes with proper environment variable propagation and per-node logging (#8219)
  • Containerized execution via NVIDIA pyxis/enroot — specify Docker images with --image-id for reproducible, GPU-accelerated workloads (#8604, #8609)
  • Resource-scoped SSH sessions: ssh <cluster> drops you inside the Slurm job allocation, so nvidia-smi correctly reflects only your allocated GPUs (#8268)
  • Interactive SSH authentication (2FA, password prompts) for clusters requiring keyboard-interactive auth (#8317)
  • Partition support — Slurm partitions are mapped to SkyPilot zones, enabling partition-aware scheduling (#8198)
  • Multi-cluster support — configure and use multiple Slurm clusters simultaneously
  • Dashboard integration — Slurm clusters appear in the infrastructure page with status and GPU utilization

See the Slurm documentation for setup instructions.

JobGroups: Heterogeneous Parallel Workloads

JobGroups enable running multiple jobs with different resource requirements together as a managed group (blog, #8456). Define multi-task pipelines in a single multi-document YAML file:

# Header: job group metadata
name: rl-training
execution: parallel
primary_tasks: [trainer]
termination_delay: 30s

---
name: trainer
resources:
  accelerators: A100:8
run: python train.py
---
name: reward-server
resources:
  accelerators: A100:1
run: python reward_server.py

Key capabilities:

  • Parallel cluster launch and monitoring — all jobs launched and tracked concurrently
  • Inter-job networking: task-name-based hostname discovery (.) for job-to-job communication
  • Dashboard UI — expandable rows for multi-task groups, task-specific log filtering
  • CLI support: sky jobs logs --task-name <name> for viewing specific task logs

Example applications included: RL post-training (RLHF) pipeline and parallel train-eval pipeline.

Enhanced Pools: Multi-Job Scheduling and Heterogeneous GPUs

SkyPilot Pools receive significant upgrades in this release:

  • Multiple jobs per worker — the scheduler now performs resource-aware bin-packing, tracking CPU, memory, and accelerator usage to fit multiple jobs on a single worker (#8192, #8279)
  • Autoscaling — pools can now automatically scale workers up and down (including to zero) based on queue length. Specify min_workers, max_workers, and target queue length; a QueueLengthAutoscaler handles the rest while protecting running jobs from cancellation (#8483)
  • Heterogeneous GPU support — specify any_of resource configurations and the scheduler dynamically resolves to available hardware (#8315):
    resources:
      any_of:
        - accelerators: T4:1
        - accelerators: A100:1
  • 9x faster concurrent job launch — launching 100 concurrent jobs reduced from 4.5 minutes to 30 seconds using the new --num-jobs argument (#7891)
  • Fractional GPU scheduling fix — fractional GPU jobs now correctly schedule across all workers (#8509)
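
The multi-job scheduling and queue-length autoscaling described above can be illustrated with a minimal sketch. This is not SkyPilot's implementation: the `Worker` class, `place_job`, and `desired_workers` are hypothetical names, first-fit is just one simple bin-packing strategy, and the scaling rule (queued jobs divided by the target queue length, clamped to the worker bounds) is an assumption about how such an autoscaler could behave.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Free capacity remaining on one pool worker (hypothetical model)."""
    name: str
    cpus: float
    memory_gb: float
    gpus: float
    jobs: list = field(default_factory=list)

    def fits(self, cpus, memory_gb, gpus):
        return (self.cpus >= cpus and self.memory_gb >= memory_gb
                and self.gpus >= gpus)


def place_job(workers, job_name, cpus, memory_gb, gpus):
    """First-fit placement: use the first worker with enough free capacity."""
    for worker in workers:
        if worker.fits(cpus, memory_gb, gpus):
            worker.cpus -= cpus
            worker.memory_gb -= memory_gb
            worker.gpus -= gpus
            worker.jobs.append(job_name)
            return worker.name
    return None  # No worker has room, so the job stays queued.


def desired_workers(queue_length, target_queue_length,
                    min_workers, max_workers, busy_workers=0):
    """Assumed rule: one worker per target_queue_length queued jobs."""
    wanted = math.ceil(queue_length / target_queue_length)
    wanted = min(max(wanted, min_workers), max_workers)
    # Never drop below the workers that still run jobs.
    return max(wanted, busy_workers)
```

With two workers that each expose one GPU, three half-GPU jobs land on the first worker twice and then on the second. With `min_workers=0` and an empty queue, `desired_workers` returns 0 unless some workers are still busy.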

External Links

The SkyPilot dashboard now automatically detects W&B links generated by your AI workloads, so there is no need to dig through job logs to find your training panels (#8405).

Autostop Hooks

An autostop hook mechanism allows running custom scripts before a cluster is automatically stopped, for example to save checkpoints, sync W&B runs, or send Slack notifications (#8412):

resources:
  autostop:
    idle_minutes: 10
    hook: |
      wandb sync
      curl -X POST $SLACK_WEBHOOK -d '{"text": "Cluster shutting down"}'
    hook_timeout: 300
  • New sky logs --autostop command to view hook execution logs
  • sky exec is rejected on AUTOSTOPPING clusters; sky launch waits for autostop to complete before restarting

Automatic EFA Setup on Amazon EKS

SkyPilot now automatically configures Elastic Fabric Adapter (EFA) on EKS with a single flag (#8557):

resources:
  network_tier: best

This automates what was previously a complex manual setup, delivering ~78.8 GB/s inter-node bandwidth (vs ~4.1 GB/s without EFA), critical for distributed training performance. EFA interfaces are allocated proportionally to the requested GPU count.

7x MOUNT_CACHED Uploads Speed Up

Parallel uploads are now the default for MOUNT_CACHED file mounts, delivering a 7x speedup — flush time dropped from 151s to 21s for a ~14.6 GB test workload (#8455). A new data.mount_cached.sequential_upload config option allows reverting to sequential uploads if needed.
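
As a quick sanity check on these figures, the arithmetic below derives the speedup and the effective flush throughput from the quoted numbers. The throughput values are derived here for illustration only and treat the approximate 14.6 GB size as exact.

```python
# Figures quoted in the release notes for the MOUNT_CACHED test workload.
DATA_GB = 14.6
SEQUENTIAL_SECONDS = 151
PARALLEL_SECONDS = 21

# Speedup is the ratio of the two flush times.
speedup = SEQUENTIAL_SECONDS / PARALLEL_SECONDS

# Effective throughput is data size divided by flush time.
sequential_gbps = DATA_GB / SEQUENTIAL_SECONDS
parallel_gbps = DATA_GB / PARALLEL_SECONDS

print(f"speedup: {speedup:.1f}x")                 # speedup: 7.2x
print(f"sequential: {sequential_gbps:.2f} GB/s")  # sequential: 0.10 GB/s
print(f"parallel: {parallel_gbps:.2f} GB/s")      # parallel: 0.70 GB/s
```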

Exit Code-Based Job Recovery

Users can now specify exit codes that trigger automatic job recovery in managed jobs (#8324):

resources:
  job_recovery:
    recover_on_exit_codes: [29]

When a job exits with a specified code, SkyPilot automatically recovers it — useful for transient failures with known error codes.
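
On the job side, the script has to exit with one of the listed codes for recovery to trigger. The sketch below shows one way a Python entrypoint might do that, assuming the value 29 from the example above. `TransientError`, `exit_code_for`, and `main` are hypothetical names for illustration, not part of SkyPilot.

```python
import sys

# Must match an entry in recover_on_exit_codes in the task YAML.
RECOVERABLE_EXIT_CODE = 29


class TransientError(RuntimeError):
    """Hypothetical marker for failures worth retrying after recovery."""


def exit_code_for(error):
    """Map an exception from the training loop to a process exit code."""
    if isinstance(error, TransientError):
        return RECOVERABLE_EXIT_CODE
    return 1  # Not in the recovery list in this example.


def main(train):
    """Run train() and return the exit code the process should use."""
    try:
        train()
    except Exception as error:
        print(f"training failed: {error}", file=sys.stderr)
        return exit_code_for(error)
    return 0
```

A script would end with `sys.exit(main(train))`, where `train` raises `TransientError` for failures it considers recoverable.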

Windows WSL Support

SkyPilot now automatically detects when it is running in WSL and seamlessly sets up VSCode Remote-SSH for Windows users (#8669).

Admin Deployment Improvements

  • External authentication proxy support — deploy behind AWS ALB with Cognito, Azure Front Door, or custom SSO proxies; supports both plaintext and JWT header formats (#8751)
  • Sidecar container support — Istio, Datadog, and other sidecar injection no longer breaks K8s provisioning; SkyPilot explicitly targets the ray-node container (#8353, #8444)

What’s New

Kubernetes

  • Improved Kueue integration — 24-hour provisioning timeout, workspace-level queue configuration, controller pod exclusion, and pod annotations (#8484)
  • GPU detection fix — L40S no longer misidentified as L4; added Blackwell, newer Hopper, and Ada Lovelace GPUs (#8593)
  • kubernetes.set_pod_resource_limits — set pod CPU/memory limits relative to requests for pod resource limit enforcement (#8644)
  • Volume NOT_READY status — actionable error messages for PVC issues; background refresh daemon; --refresh flag (#8524)
  • Unified ingress resource — share a single ingress across services (#8532)
  • Kubernetes python client race condition fix (#8705)
  • Service resource leak fix for pod eviction/Kueue preemption (#8745), GPU misconfiguration hints (#8629), per-task remote_identity override (#8659), not-ready node exclusion (#8172), GKE autoscaler compatibility (#8326)

API Server & Security

  • Concurrent request context isolation — fixed workspace isolation violations from context leaking between users (#8354)
  • Zip Slip vulnerability fix — patched path traversal in /upload endpoint (#8723)
  • Polling-based sky api login — works around Chrome Private Network Access restrictions (#8590)
  • Memory leak fix — AWS session cache leak in status refresh daemon (#8098)
  • Kubeconfig no longer uploaded when using SERVICE_ACCOUNT (#8386)
  • Docker password redaction in logs (#8080), Bearer token + Basic Auth fix (#8503), disable basic auth option (#8694)
  • API server plugin system (#7993, #8272, #8700, #8410)
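
For readers unfamiliar with Zip Slip, the sketch below shows the general defense: resolve each archive entry against the destination directory and reject any entry that would land outside it. This illustrates the technique only and is not the patch from #8723; `safe_extract` is a hypothetical helper.

```python
import io
import os
import tempfile
import zipfile


def safe_extract(archive, dest_dir):
    """Extract archive into dest_dir, rejecting entries that escape it."""
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(archive) as zf:
        for member in zf.infolist():
            target = os.path.realpath(
                os.path.join(dest_root, member.filename))
            # An entry such as "../evil.txt" resolves outside dest_root.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(
                    f"unsafe path in archive: {member.filename}")
        zf.extractall(dest_root)


# Usage: an archive containing "../evil.txt" is rejected before extraction.
payload = io.BytesIO()
with zipfile.ZipFile(payload, "w") as zf:
    zf.writestr("../evil.txt", "escaped")
payload.seek(0)

with tempfile.TemporaryDirectory() as tmp:
    try:
        safe_extract(payload, tmp)
        rejected = False
    except ValueError:
        rejected = True

print(rejected)  # True
```

Validating every entry before extracting anything means a malicious archive leaves no partial output behind.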

Deployment

  • Persist managed job logs on RWX persistent storage when using RollingUpdate (#8537)
  • fullnameOverride (#8528)
  • CoreWeave, DigitalOcean credentials support (#8200, #7931)
  • SSH node pool config (#8249)
  • Helm RollingUpdate SQLite fix (#8607)
  • Scheduling constraints (#8134)

Managed Jobs & Dashboard

  • GPU metrics for managed jobs — utilization, memory, temperature, and power in the dashboard (#8718)
  • 6x infra page speedup and server-side pagination (#8523, #8611, #8651)
  • Label filtering for clusters and infrastructure (#8507)
  • Log download compression ~96% (#8626), GPU temperature panel (#8472), Grafana link (#8599)
  • Job cancel reliability fix (#8203), multi-user cluster fix (#8233), SSH key permission fix (#8316)

Cloud Integrations

  • AWS multi-VPC failover — specify multiple VPC names with automatic failover across regions (#8722)
  • AWS p5e.48xlarge H200 and Melbourne region support (#8465, #8055)
  • Choose instance type based on AWS local disk (#8661)
  • S3 mounting with non-static credentials — fixed for AWS SSO, IAM roles, Pod Identity/IRSA (#8358)
  • Together AI InfiniBand (#8581)
  • GCP Queued Resources for TPUs — TPU VM provisioning requests are queued until capacity becomes available (#8481)
  • Vast.ai create_instance_kwargs (#8536), Vast.ai SSH fix (#8614)
  • Azure blobfuse2 Debian 13 fix (#8730)
  • Nebius UFW security hardening (#8627)

Core, Backend & UX

  • --secret-file CLI flag — avoid shell history exposure (#8646)
  • provision.install_conda config — skip Miniconda for faster launches (#8662)
  • Pending cluster state (#8262), custom .sky location (#8153), restore autostop on start (#8022), multiple skylets per host (#8156)
  • SSH access during bad Ray state (#8649), volume fail-fast (#8739), greenlet import fix (#8653), Resources.copy() fix (#8648), pandas compatibility (#8643), Docker MOTD fix (#8632)
  • SSH Node Pools: deployment refactored (#8154, #8173, #8226), heterogeneous node ban (#8230), lstrip('ssh-') bug fix (#8417), package distribution fix (#8508)
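
The lstrip('ssh-') item refers to a common Python pitfall: `str.lstrip` treats its argument as a set of characters, not as a prefix. The sketch below demonstrates the behavior with a made-up name and shows `str.removeprefix` (available since Python 3.9) as one prefix-safe alternative. How SkyPilot fixed it in #8417 is not stated here, and `strip_prefix` is a hypothetical helper.

```python
def strip_prefix(name, prefix="ssh-"):
    """Remove prefix once if present; leave the rest of the name intact."""
    return name.removeprefix(prefix)


name = "ssh-hosts"

# lstrip removes any leading run of the characters 's', 'h', and '-',
# so it also eats the "h" that begins the real name.
print(name.lstrip("ssh-"))  # osts

# removeprefix removes exactly the string "ssh-" and nothing more.
print(strip_prefix(name))   # hosts
```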

Documentation & Examples

  • Air-gapped environment docs (#7312), NVIDIA Dynamo serving (#7333), EKS IAM roles guide (#8358), Slurm migration guide (#8515)
  • Prefect integration (#8506), SAM3 video segmentation (#8384), VeRL search (#8241), Marimo notebooks (#8123)

Contributors

Thank you to all contributors who made this release possible!

@aflah02, @alex000kim, @andylizf, @aylei, @Bokki-Ryu, @brianstrauch, @cblmemo, @cg505, @concretevitamin, @cwhitak3r, @dan-blanchard, @DanielZhangQD, @Elden123, @funkypenguin, @hentt30, @isagi-y22, @JiangJiaWei1103, @kevinmingtarja, @koaning, @kyuds, @laimis9133, @liuwb, @lloyd-brown, @lucamanolache, @m-braganca, @Maknee, @Michaelvll, @mk0walsk, @mmcclean-aws, @mt5225, @nakinnubis, @oelachqar, @otutukingsley, @Philmod, @php-workx, @qicz, @rohansonecha, @romilbhardwaj, @sachdva, @SalikovAlex, @seahyinghang8, @SeungjinYang, @YashIIT0909, @yurekami, @atoniolo76, @zpoint

New Contributors:

Special thanks to the community for bug reports, feature requests, and pull requests that helped improve SkyPilot!

Full Changelog

For a complete list of changes, see the commit history.