Releases: skypilot-org/skypilot

SkyPilot v0.3.2

01 Jul 05:21
3b3e633

This is a patch release to ship bug fixes faster to our users! It includes many feature updates and bug fixes, including fixes for the pydantic dependency issue, disk cloning, file mounts, and cloud-specific improvements.

Detailed changelog coming up in v0.4!

SkyPilot v0.3.1

04 Jun 17:29

This is a patch release to ship several important enhancements and bug fixes:

Enhancements

  • On-demand H100 GPU from Lambda is supported! sky launch --gpus h100
    • To use it, remove any previous Lambda catalog: rm -rf ~/.sky/catalogs/v5/lambda
  • Managed spot: make job cancellation during failover more robust to mitigate a rare FAILED_SETUP error (#1998)
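The Lambda catalog cleanup mentioned in the enhancement above can also be scripted. The following is a minimal sketch, not part of SkyPilot: the helper name remove_catalog is hypothetical, and only the directory path comes from the release note.

```python
import shutil
from pathlib import Path

# Path taken from the release note; adjust it if your catalog lives elsewhere.
LAMBDA_CATALOG = Path.home() / ".sky" / "catalogs" / "v5" / "lambda"


def remove_catalog(path: Path) -> bool:
    """Delete a cached catalog directory (hypothetical helper).

    Returns True if a directory was removed, False if nothing was there.
    """
    if not path.exists():
        return False
    shutil.rmtree(path)
    return True
```

Calling remove_catalog(LAMBDA_CATALOG) has the same effect as the rm -rf command above; afterwards, sky launch --gpus h100 should pick up a fresh catalog.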

Fixes

  • Provisioner / Backend
    • Fix provision failover encountering FileNotFoundError (#2005)
    • Fix user-level ray cluster causing SkyPilot cluster to be in INIT state (#2020)
  • Logging
    • Fix certain logs of multi-node jobs not being streamed due to Ray 2.4 log dedup (#2026)
    • Fix logs being created under the current working directory ($PWD/~/sky_logs) in some cases (#2009)
  • Managed spot
    • Fix sky spot launch --retry-until-up to make it actually retry until up (#2004)
  • Storage
    • Fix a rare storage cloud check error if sky check has never been called (#2017)
  • On-prem
    • Fix detecting A5000 and A6000 GPUs (#2023)

Full Changelog: v0.3.0...v0.3.1

SkyPilot v0.3.0

30 May 17:29

SkyPilot v0.3.0: LLM Support, New Clouds, Enhanced Production-Readiness

We are excited to release SkyPilot v0.3, the most significant release thus far in the project's history.

v0.3 focuses on:

  • LLM support (Vicuna, LLaMA)
  • New clouds (Lambda Cloud; IBM; Cloudflare R2)
  • Enhanced production readiness

See the release blog post for a deep-dive into highlights.

Release notes below are as compared to v0.2 (full changelog).

Release Highlights

  • LLM support
    • Vicuna LLM chatbot trained using SkyPilot for $300 on spot instances!
    • Serve your own LLaMA LLM chatbot on any cloud: full example, blog, repo
    • Significantly expanded GPU availability by leveraging the widest selection of clouds (see below)
  • More clouds, more choices: delivering the highest GPU availability & cost savings
    • Lambda Cloud is now supported!
      • This brings high-end GPUs at lower costs to SkyPilot. (#1557, #1838)
      • Simply run sky check to set it up. Docs here.
    • IBM Cloud is now supported!
      • This brings the first hyperscaler cloud after AWS/GCP/Azure to SkyPilot. (#1598)
    • Cloudflare R2 object store is now supported!
  • Managed Spot is made significantly more robust via a host of fixes/enhancements.
  • Cluster leakage prevention and detection are significantly improved.
  • CLI/API & Backend shipped many new features:
    • sky cost-report; fine-grained optimizer; user identity; AWS SSO; private IP-only VPCs; Ray runtime is decoupled from user's Ray clusters; ...

CLI/API

New Features

  • New CLI sky cost-report: show the estimated cost of launched clusters (#1301, #1621, #1780, #1680, #1788)
    • Experimental: Costs for clusters with auto{stop,down} scheduled may not be accurate.
  • New resource filtering support in sky launch / YAML resources: field
    • Add vCPUs --cpus support #1622
    • Add memory --memory support #1746
    • Add disk performance tier --disk-tier support #1812
  • Add --detach-setup and --detach-run to sky launch #1379
  • Add --retry-until-up, --region, --zone, and --idle-minutes-to-autostop for interactive nodes #1297
  • Add autodown (#1217, #1254)
  • Support calling sky status/sky.status() on specific clusters #1568
  • Support --region in sky show-gpus #1187
  • Support passing AMIs for different regions in image_id field under resources #1384

Enhancements

  • Improvements to sky show-gpus
    • Show DEVICE_MEMORY for AWS & Lambda #1825
    • Support querying specific number/type of devices sky show-gpus <gpu>:<num> (same syntax as sky launch --gpus) #1924
  • Check that the image exists and that its size fits in the OS disk #1508
  • Make sky down -p bypass identity mismatch errors. #1892

Fixes

  • Make repeated sky {cpu,gpu,tpu}node commands correctly reuse existing cluster if possible #1787
  • Fix errors from empty 'resources' field in YAML. #1816
  • Make autostop more robust for AWS custom images that by default export 2 credential env vars (#1880, #1894, #1946)

Managed spot

New Features

  • Latest in-progress spot jobs are shown in sky status (#1270, #1467, #1691)
  • Detailed reasons for failed spot jobs are exposed in sky spot queue -a (#1655)

Enhancements

  • Make sky spot launch default -r/--retry-until-up to True. #1781
  • Make job termination/cancellation significantly more robust (#1433, #1745)
  • Catch "pre-launch" errors early (e.g., invalid cluster names, no cloud access) to avoid unnecessary retries (#1714)
  • sky start on the spot controller resets the default autostop #1453
  • sky spot queue displays job states with colors (#1473)
  • sky spot queue no longer shows a cached (and possibly stale) version of the jobs (#1742)
  • Disallow sky down on spot controller when in-progress spot jobs exist #1667
  • New state FAILED_SETUP for spot jobs that fail during setup (#1479)
  • New state CANCELLING for spot jobs that are being cancelled (#1785)
  • Keep env var SKYPILOT_JOB_ID the same for all recoveries of the same job #1400

Fixes

TPU

Robustness is enhanced for TPUs in various modes: VMs, pods, spot (#1500, #1279, #1359, #1483, #1562, ...).

Provisioner

Enhancements

  • Cluster leakage prevention is significantly improved!
    • Skip Ray's launch hash check, which caused many leaks (#1671)
    • Launch existing cluster in the same zone to avoid leakage (#1700)
    • Existing cluster's cluster YAML will keep certain fields unchanged across re-launch (#1235, #1251)
    • Fix leakage of an existing cluster when it fails to start #1497
  • Disable unattended-upgrade (nondeterministic APT lock) on cluster start
    • Previously, apt install ... in setup could fail nondeterministically because the APT lock was held by background unattended upgrades
    • Now: on AWS, cloud-init ensures unattended-upgrade is disabled at boot (#1949, #1954); on other clouds, we kill the processes (#1347)
  • Generate valid cluster names when username has invalid characters #1526
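To illustrate the last item above (valid cluster names for usernames with invalid characters), here is a minimal sketch of one possible approach. It is not SkyPilot's implementation: the function names, the allowed character set (lowercase letters, digits, hyphens), and the fallback value are all assumptions made for this example.

```python
import re


def sanitize_username(username: str, fallback: str = "user") -> str:
    """Reduce a username to lowercase letters, digits, and hyphens.

    The allowed character set is an assumption for this sketch, not
    SkyPilot's actual rule.
    """
    # Replace every disallowed character with a hyphen.
    cleaned = re.sub(r"[^a-z0-9-]", "-", username.lower())
    # Collapse runs of hyphens and trim them from both ends.
    cleaned = re.sub(r"-+", "-", cleaned).strip("-")
    return cleaned or fallback


def make_cluster_name(prefix: str, username: str) -> str:
    """Build a cluster name from a prefix and a sanitized username."""
    return f"{prefix}-{sanitize_username(username)}"


print(make_cluster_name("sky", "Jane.Doe@example"))  # sky-jane-doe-example
```

The fallback covers usernames that consist only of disallowed characters, so the generated name is never left with an empty suffix.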

Fixes

Storage

New Features

  • Cloudflare R2 is now supported! #1736
    • R2 is an S3-compatible object store with zero egress fee.
    • To use it, see setup docs and usage docs.
  • Support multiple paths in the source of a storage mount, e.g., source: [~/mydir/myfile.txt, ~/datasets] #1311 #1677

Enhancements

Fixes

  • Fix sky storage delete for externally deleted buckets #1875
  • Disallow single files for upload to Storage #1231
  • Fix rsync for paths with spaces #1190

Backend

New Features

  • New feature: Fine-grained optimizer
    • Optimizing & provisioning retries at the granularity of regions/zones #975
    • In other words, SkyPilot now automatically recognizes and optimizes across the cost differences between zones (e.g., AWS zones have different prices for the same spot instance type) or regions
  • New feature: User identity is associated with each cluster (#1513, #1550, #1809)
    • Identities are e.g., different AWS profiles / GCP projects
    • With this, users are free to switch across identities, and SkyPilot will properly protect each cluster
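The idea behind the fine-grained optimizer described above (ranking placements by price at zone granularity and failing over in that order) can be sketched as follows. This is an illustrative toy, not the optimizer's actual code: the Candidate class, both functions, and all prices are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Candidate:
    """One (region, zone) placement with an hourly price (illustrative values)."""
    region: str
    zone: str
    hourly_price: float


def rank_candidates(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Order placements from cheapest to most expensive; ties broken by name."""
    return sorted(candidates, key=lambda c: (c.hourly_price, c.region, c.zone))


def provision_with_failover(
    candidates: Iterable[Candidate],
    try_provision: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Try each zone in price order; return the first that succeeds, else None."""
    for candidate in rank_candidates(candidates):
        if try_provision(candidate):
            return candidate
    return None


# Hypothetical spot prices for the same instance type in different zones.
CANDIDATES = [
    Candidate("us-east-1", "us-east-1a", 0.95),
    Candidate("us-east-1", "us-east-1b", 0.80),
    Candidate("us-west-2", "us-west-2a", 0.88),
]

# Pretend the cheapest zone is out of capacity, so failover picks the next one.
unavailable = {"us-east-1b"}
chosen = provision_with_failover(CANDIDATES, lambda c: c.zone not in unavailable)
print(chosen)
```

In this toy run the cheapest zone (us-east-1b) is unavailable, so the next-cheapest zone in a different region (us-west-2a) is selected instead of a pricier zone in the original region.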

Enhancements

  • Ray runtime on SkyPilot clusters is upgraded to v2.4.0 (#1734)
    • Each existing cluster is automatically upgraded on its next sky launch/start
    • Local client's ray requirement is updated to ray[default]>=2.2.0,<=2.4.0 to fix some dependency conflicts with click/grpcio/protobuf
  • Ray cluster used by the SkyPilot runt...

SkyPilot v0.2.5

20 Mar 19:19

Another patch release to ship bug fixes faster to our users! This release includes many fixes, including those for managed spot and cloud-specific improvements.

Detailed changelog coming up in v0.3!

SkyPilot v0.2.4

06 Feb 06:43

This patch release brings more bug fixes, including fixes for cloud-specific networking and VPC configuration and managed spot.

Detailed changelog coming up in v0.3!

SkyPilot v0.2.3

27 Jan 01:06

What's Changed

This is a patch release with lots of bug fixes across the board, including many cloud-specific networking and VPC fixes.

Stay tuned for a detailed changelog coming up in v0.3!

SkyPilot v0.2.2

09 Jan 07:15
Pre-release

What's Changed

This is a patch release with several bug fixes for TPU, Spot, On-prem, and Storage.

Detailed announcements will be made in 0.3.0.

SkyPilot v0.2.0

11 Oct 15:24

We are excited to release SkyPilot 0.2.0, which brings a host of new features along with many enhancements and fixes.

Highlights

  • Managed Spot is made much more robust and easier to use.
    • Try using sky spot launch on your existing yamls!
    • We've seen users running 1000s of spot jobs in a recurring schedule.
  • TPU Pods are now supported.
    • To use a TPU Pod, simply modify e.g., accelerators: tpu-v2-8 to accelerators: tpu-v2-32.
  • Benchmark: use sky bench to easily measure the performance and cost of different cloud resources for your task.
  • Provisioning is sped up by ~1 minute.
  • Catalog is updated to V3 with 100s of resource changes and 1000s of price changes.
    • A100-80GB is now available on 3 clouds. Check out sky show-gpus -a for GPU prices.
    • No action needed as this will be automatically downloaded.

CLI & Task interface

New Features

  • Add zone support in YAML #1014
  • Add shell completion support for CLI #1162
  • Add --no-setup option to sky launch to allow for remounting of files without running setup commands again #1184
  • Add sky start --all to start all clusters #1065
  • Add glob support for sky storage delete #1117
  • Add --no-follow option to sky logs and sky spot logs (print logs so far and exit)

Enhancements

  • Show vCPUs in optimizer/benchmark messages #1076
  • Make entrypoint optional: for quick VM launching, no more sky launch <flags> '', simply do sky launch <flags> #1191
  • Make sky check automatically enable necessary GCP APIs (#1197, #1209); make it more robust for AWS checks (#1194)

Managed spot

New Features

  • sky spot launch now automatically translates file_mounts in a YAML to use cloud storage. #1081 #1215
    • This means the same YAML for on-demand resources launched by sky launch can now be launched by sky spot launch.
  • Add --retry-until-up for sky spot launch; improve the responsiveness for sky spot cancel #1098
  • Expose a $SKYPILOT_RUN_ID environment variable shared by all recoveries of the same spot job (useful for identifying it in Weights & Biases) #1196
    • See the last Note block in docs.
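As a sketch of how a training script might consume the $SKYPILOT_RUN_ID variable described above: the helper below reads it and falls back to a locally generated ID when the script runs outside a managed spot job. The function name stable_run_id and the "local-" fallback format are assumptions for this example; only the variable name comes from the release note.

```python
import os
import uuid
from typing import Mapping, Optional


def stable_run_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Return SKYPILOT_RUN_ID if set, otherwise a fresh local ID.

    Because the variable is shared by all recoveries of the same spot job,
    the returned value stays the same across preemptions.
    """
    env = os.environ if env is None else env
    run_id = env.get("SKYPILOT_RUN_ID")
    if run_id:
        return run_id
    # Hypothetical fallback for runs outside SkyPilot managed spot.
    return f"local-{uuid.uuid4().hex[:8]}"


run_id = stable_run_id()
print(f"Using run ID: {run_id}")
```

The resulting ID can then be handed to an experiment tracker so that a recovered job resumes the same run, for example wandb.init(id=run_id, resume="allow") in Weights & Biases.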

Enhancements

  • Distinguish spot controller names for different users #1101
    • This may leak an old stopped controller if you have used spot launch with <= 0.1.2.
  • Add retry for spot cluster termination #1139
  • Enable purge for spot controller #1107
  • Show FAILED_CONTROLLER when the controller exits abnormally #1143
  • Make get_job_timestamp fetching more robust #1148
  • Fail early when the spot cluster name is too long on GCP #1183

Fixes

  • Fix the retry logic for spot cluster launching #1150
  • Fix non-persistent storage deletion for spot #974
  • Fix spot recovery without cloud specified #1077
  • Fix spot job duration #1104
  • Fix sky spot status -a for resources and region information #1135

TPU support

Provisioner

Enhancements

Fixes

  • Fix GCP VM leak issue #1102
  • Fix GCP A100 launch error #1166
  • Fix K80 gpunode by correcting GCP image version #1090

On-prem

Enhancements

  • Simplified on-prem deployment
    • sky admin deploy now automatically installs skypilot, ray (and python3 and pip3) on the local cluster under admin user #1116
  • Add cluster config schema check #1044
  • Modify Sky Admin's Setup on Docs #1085
  • Align Python Versions #1086

Fixes

  • Fix Sky Status Logging #1041

Backend

Enhancements

  • Catalog is updated to V3 with 100s of resource changes and 1000s of price changes #1204
  • Canonicalize accelerator names in Resources #1075
  • Reduce the frequency of job status update and remove parallel query #1096
  • Increase thread limit and fix nofile limit #1128

Fixes

  • [Storage] Fix public bucket source check in SkyPilot Storage #1087
  • Fix Ray dashboard hanging problem (#1088) #1109
  • Fix placement group not scheduled issue (issue #1130) #1134

Misc. enhancements

  • New example: Stable Diffusion #1149
  • pip install skypilot now installs skypilot[aws] by default #1055
  • Improve error messages for cloud import errors #1156
  • Change ~/.ssh/config permissions #1174
  • Relative cluster yaml #1176
  • UX: remove DURATION, move HOURLY_PRICE in status table (-a) #1129

Thanks to all Contributors!

New contributors

Many thanks to everyone who contributed to this release!

@Michaelvll, @concretevitamin, @infwinston, @michaelzhiluo, @WoosukKwon, @romilbhardwaj, @sumanthgenz, @ewzeng, @iojw, @franklsf95

SkyPilot v0.1.1

09 Aug 22:46
92ed4c4

Highlights

This is our first release for SkyPilot -- a framework for easily running machine learning workloads on any cloud through a unified interface. No knowledge of cloud offerings is required or expected – you simply define the workload and its resource requirements, and SkyPilot will automatically execute it on AWS, Google Cloud Platform or Microsoft Azure.

Key features

  • Run existing projects on the cloud with zero code changes
  • Easily provision VMs across multiple cloud platforms (AWS, Azure or GCP)
  • Easily manage multiple clusters to handle different projects
  • Quick access to cloud instances for development
  • Store datasets on the cloud and access them like you would on a local file system
  • No cloud lock-in – seamlessly run your code across cloud providers

Thanks

Many thanks to all those who contributed to this release!
@concretevitamin @romilbhardwaj @Michaelvll @infwinston @michaelzhiluo @WoosukKwon @suquark @mraheja @gmittal @iojw @lhqing @franklsf95

Full Changelog: https://github.com/skypilot-org/skypilot/commits/v0.1.1