Releases: skypilot-org/skypilot
SkyPilot v0.3.2
This is a patch release to ship bug fixes faster to our users! This release includes many feature updates and bug fixes, including the pydantic dependency issue, disk cloning, file mounts, and cloud-specific improvements.
Detailed changelog coming up in v0.4!
SkyPilot v0.3.1
This is a patch release to ship several important enhancements and bug fixes:
Enhancements
- On-demand H100 GPU from Lambda is supported: `sky launch --gpus h100`
  - To use it, remove any previous Lambda catalog: `rm -rf ~/.sky/catalogs/v5/lambda`
- Managed spot: make job cancellation during failover more robust to mitigate a rare `FAILED_SETUP` error (#1998)
Fixes
- Provisioner / Backend
- Logging
- Managed spot
  - Fix `sky spot launch --retry-until-up` to make it actually retry until up (#2004)
- Storage
  - Fix a rare storage cloud check error if `sky check` has never been called (#2017)
- On-prem
  - Fix detecting A5000 and A6000 GPUs (#2023)
Full Changelog: v0.3.0...v0.3.1
SkyPilot v0.3.0
SkyPilot v0.3.0: LLM Support, New Clouds, Enhanced Production-Readiness
We are excited to release SkyPilot v0.3, the most significant release thus far in the project's history.
v0.3 focuses on:
- LLM support (Vicuna, LLaMA)
- New clouds (Lambda Cloud; IBM; Cloudflare R2)
- Enhanced production readiness
See the release blog post for a deep-dive into highlights.
Release notes below are as compared to v0.2 (full changelog).
Release Highlights
- LLM support
  - Vicuna LLM chatbot trained using SkyPilot for $300 on spot instances!
    - Full finetuning & serving YAMLs released here to build off of!
  - Serve your own LLaMA LLM chatbot on any cloud: full example, blog, repo
  - Significantly expanded GPU availability by leveraging the widest selection of clouds (see below)
- More clouds, more choices: delivering the highest GPU availability & cost savings
  - Lambda Cloud is now supported!
  - IBM Cloud is now supported!
    - This brings the first hyperscaler cloud after AWS/GCP/Azure to SkyPilot. (#1598)
  - Cloudflare R2 object store is now supported!
    - This brings zero-egress cost object storage to SkyPilot. (#1736)
    - To use it, see setup docs and usage docs.
- Managed Spot is made significantly more robust via a host of fixes/enhancements.
- Cluster leakage prevention and detection are significantly improved.
- CLI/API & Backend shipped many new features: `sky cost-report`; fine-grained optimizer; user identity; AWS SSO; private IP-only VPCs; Ray runtime is decoupled from user's Ray clusters; ...
CLI/API
New Features
- New CLI `sky cost-report`: show the estimated cost of launched clusters (#1301, #1621, #1780, #1680, #1788)
  - Experimental: Costs for clusters with auto{stop,down} scheduled may not be accurate.
- New resource filtering support in `sky launch` / YAML `resources:` field
- Add `--detach-setup` and `--detach-run` to `sky launch` #1379
- Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes #1297
- Add autodown (#1217, #1254)
- Support calling `sky status`/`sky.status()` on specific clusters #1568
- Support `--region` in `sky show-gpus` #1187
- Support passing AMIs for different regions in `image_id` field under `resources` #1384
Enhancements
- Improvements to `sky show-gpus`
- Check that the image exists and that its size fits in the OS disk #1508
- Make `sky down -p` bypass identity mismatch errors. #1892
Fixes
- Make repeated `sky {cpu,gpu,tpu}node` commands correctly reuse existing cluster if possible #1787
- Fix errors from empty 'resources' field in YAML. #1816
- Make autostop more robust for AWS custom images that by default export 2 credential env vars (#1880, #1894, #1946)
Managed spot
New Features
- Latest in-progress spot jobs are shown in `sky status` (#1270, #1467, #1691)
- Detailed reasons for failed spot jobs are exposed in `sky spot queue -a` (#1655)
Enhancements
- Make `sky spot launch` default `-r/--retry-until-up` to True. #1781
- Make job termination/cancellation significantly more robust (#1433, #1745)
- Catch "pre-launch" errors early (e.g., invalid cluster names, no cloud access) to avoid unnecessary retries (#1714)
- `sky start` on the spot controller resets the default autostop #1453
- `sky spot queue` displays job states with colors (#1473)
- `sky spot queue` no longer shows a cached (and possibly stale) version of the jobs (#1742)
- Disallow `sky down` on spot controller when in-progress spot jobs exist #1667
- New state `FAILED_SETUP` for spot jobs that fail during `setup` (#1479)
- New state `CANCELLING` for spot jobs that are being cancelled (#1785)
- Keep env var `SKYPILOT_JOB_ID` the same for all recoveries of the same job #1400
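The retry behaviour behind `--retry-until-up` can be pictured as a loop that keeps cycling through candidate locations until one launch succeeds. The following is a minimal sketch under that assumption only, not SkyPilot's implementation: `launch_until_up`, `try_launch`, `backoff_seconds`, and `max_rounds` are hypothetical names introduced for illustration.

```python
import time


def launch_until_up(candidates, try_launch, retry_until_up=True,
                    backoff_seconds=0.0, max_rounds=None):
    """Return the first candidate for which try_launch succeeds, else None.

    All names here are hypothetical; this is not SkyPilot's API.
    """
    if not candidates:
        return None  # Nothing to try; avoid looping forever.
    round_number = 0
    while True:
        round_number += 1
        for candidate in candidates:
            if try_launch(candidate):
                return candidate
        # Every candidate failed in this round.
        if not retry_until_up:
            return None
        if max_rounds is not None and round_number >= max_rounds:
            return None
        time.sleep(backoff_seconds)


# Usage: a fake launcher that only succeeds on its third attempt.
attempts = []


def fake_launch(region):
    attempts.append(region)
    return len(attempts) >= 3


chosen = launch_until_up(["us-east-1", "us-west-2"], fake_launch)
```

With `retry_until_up=False` the loop gives up after a single pass over the candidates, which roughly corresponds to the behaviour before retrying became the default.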
Fixes
- Robustness fixes (#851, #1329, #1411, #1545, #1738, #1757, #1798, #1951, ...)
- Fixes for spot TPUs (#1249, #1470, #1500, #1555, #1717)
- Fix spot jobs with the same name (`-n`) possibly overwriting each other #1782
- Make spot job failover only use the regions in `ssh_proxy_command` if specified #1792
- Fix failing to launch spot jobs when spot controller is created with AWS SSO #1817
TPU
Robustness is enhanced for TPUs in various modes: VMs, pods, spot (#1500, #1279, #1359, #1483, #1562, ...).
Provisioner
Enhancements
- Cluster leakage prevention is significantly improved!
- Disable unattended-upgrade (nondeterministic APT lock) on cluster start
- Generate valid cluster names when username has invalid characters #1526
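To illustrate the idea of deriving a valid cluster name from a username with invalid characters, here is a minimal sketch. It assumes, purely for illustration, that a valid name uses only lowercase letters, digits, and hyphens and starts with a letter; the function `sanitize_for_cluster_name` and the `fallback` prefix are hypothetical and may differ from what SkyPilot actually does.

```python
import re


def sanitize_for_cluster_name(username, fallback="user"):
    """Map an arbitrary username to a name-safe string (hypothetical rule set)."""
    # Replace every character outside [a-z0-9-] with a hyphen.
    cleaned = re.sub(r"[^a-z0-9-]", "-", username.lower())
    # Collapse runs of hyphens and trim them from both ends.
    cleaned = re.sub(r"-+", "-", cleaned).strip("-")
    # Ensure the result is non-empty and starts with a letter.
    if not cleaned or not cleaned[0].isalpha():
        cleaned = f"{fallback}-{cleaned}".rstrip("-")
    return cleaned


example = sanitize_for_cluster_name("John.Doe@corp")
```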
Fixes
- GCP/provisioner: Handle the occasional RESOURCE_NOT_FOUND error. #1842
- Robustness fixes (#1236, #1287, #1619, #1969)
Storage
New Features
- Cloudflare R2 is now supported! #1736
  - R2 is an S3-compatible object store with zero egress fee.
  - To use it, see setup docs and usage docs.
- Support multiple paths in the `source` of a storage mount, e.g., `source: [~/mydir/myfile.txt, ~/datasets]` #1311 #1677
Enhancements
- Exclude uploading `.git` folder for cloud storage mounts #1494
- If a `file_mounts` destination path is a relative path, it is treated as being under workdir #1315
- Upgrade GCSFuse version to 0.42.3 #1829
- Mounting options improvements (#1312, #1296, #1320)
- API improvements (#1223, #1239)
- UX/logging improvements (#1200, #1285, #1457, #1833, #1857, #1908, #1858)
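The rule that a relative `file_mounts` destination is treated as being under workdir can be sketched as a small path-resolution helper. This is an illustration only: `resolve_mount_destination` is a hypothetical function, and the default remote workdir `~/sky_workdir` is an assumption made for the example.

```python
import posixpath


def resolve_mount_destination(dst, workdir="~/sky_workdir"):
    """Resolve a file_mounts destination (hypothetical helper).

    Absolute paths and home-relative paths are kept as they are;
    any other path is interpreted relative to the remote workdir.
    """
    if dst.startswith("~") or posixpath.isabs(dst):
        return dst
    return posixpath.normpath(posixpath.join(workdir, dst))


relative = resolve_mount_destination("data/train")
absolute = resolve_mount_destination("/data")
```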
Fixes
- Fix `sky storage delete` for externally deleted buckets #1875
- Disallow single files for upload to Storage #1231
- Fix rsync for paths with spaces #1190
Backend
New Features
- New feature: Fine-grained optimizer
  - Optimizing & provisioning retries at the granularity of regions/zones #975
  - In other words, SkyPilot now automatically recognizes and optimizes across the cost differences between zones (e.g., AWS zones have different prices for the same spot instance type) or regions
- New feature: User identity is associated with each cluster (#1513, #1550, #1809)
  - Identities are e.g., different AWS profiles / GCP projects
  - With this, users are free to switch across identities, and SkyPilot will properly protect each cluster
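Optimizing at the granularity of zones can be pictured as ranking every (region, zone) pair by price and trying them in that order. The sketch below is a simplified illustration, not the optimizer's actual logic: `rank_locations` is a hypothetical function and the prices are made-up numbers, not real quotes.

```python
def rank_locations(hourly_prices):
    """Order (region, zone) pairs from cheapest to most expensive.

    hourly_prices maps (region, zone) -> hourly price. Ties are broken
    by name so that the order is deterministic.
    """
    return sorted(hourly_prices, key=lambda loc: (hourly_prices[loc], loc))


# Made-up spot prices for one instance type (illustrative only).
example_prices = {
    ("us-east-1", "us-east-1a"): 0.92,
    ("us-east-1", "us-east-1b"): 0.88,
    ("us-west-2", "us-west-2a"): 0.95,
}

# Cheapest zone first; later entries act as failover candidates.
failover_order = rank_locations(example_prices)
```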
Enhancements
- Ray runtime on SkyPilot clusters is upgraded to v2.4.0 (#1734)
  - All existing clusters are automatically upgraded on their next `sky launch/start`
  - Local client's ray requirement is updated to `ray[default]>=2.2.0,<=2.4.0` to fix some dependency conflicts with click/grpcio/protobuf
- Ray cluster used by the SkyPilot runt...
SkyPilot v0.2.5
Another patch release to ship bug fixes faster to our users! This release includes many fixes, including those for managed spot and cloud specific improvements.
Detailed changelog coming up in v0.3!
SkyPilot v0.2.4
This patch release brings more bug fixes, including fixes for cloud-specific networking and VPC configuration and managed spot.
Detailed changelog coming up in v0.3!
SkyPilot v0.2.3
What's Changed
This is a patch release with lots of bug fixes across the board, including many cloud-specific networking and VPC fixes.
Stay tuned for a detailed changelog coming up in v0.3!
SkyPilot v0.2.2
What's Changed
This is a patch release with several bug fixes for TPU, Spot, Onprem and Storage.
Detailed announcements will be made in 0.3.0.
SkyPilot v0.2.0
We are excited to release SkyPilot 0.2.0, which receives a host of new features, with many enhancements and fixes.
Highlights
- Managed Spot is made much more robust and easier to use.
  - Try using `sky spot launch` on your existing yamls!
  - We've seen users running 1000s of spot jobs in a recurring schedule.
- TPU Pods are now supported.
  - To use a TPU Pod, simply modify e.g., `accelerators: tpu-v2-8` to `accelerators: tpu-v2-32`.
- Benchmark: use `sky bench` to easily measure the performance and cost of different cloud resources for your task.
- Provisioning is sped up by ~1 minute.
- Catalog is updated to V3 with 100s of resource changes and 1000s of price changes.
  - `A100-80GB` is now available on 3 clouds. Check out `sky show-gpus -a` for GPU prices.
  - No action needed as this will be automatically downloaded.
CLI & Task interface
New Features
- Add zone support in YAML #1014
- Add shell completion support for CLI #1162
- Add `--no-setup` option to `sky launch` to allow for remounting of files without running setup commands again #1184
- Add `sky start --all` to start all clusters #1065
- Add glob support for `sky storage delete` #1117
- Add `--no-follow` option to `sky logs` and `sky spot logs` (print logs so far and exit)
Enhancements
- Show vCPUs in optimizer/benchmark messages #1076
- Make entrypoint optional: for quick VM launching, no more `sky launch <flags> ''`, simply do `sky launch <flags>` #1191
- Make `sky check` automatically enable necessary GCP APIs (#1197, #1209); make it more robust for AWS checks (#1194)
Managed spot
New Features
- `sky spot launch` now automatically translates file_mounts in a YAML to use cloud storage. #1081 #1215
  - This means the same YAML for on-demand resources launched by `sky launch` can now be launched by `sky spot launch`.
- Add `--retry-until-up` for `sky spot launch`; improve the responsiveness for `sky spot cancel` #1098
- Expose a `$SKYPILOT_RUN_ID` environment variable shared by all recoveries of the same spot job (useful for identifying it in Weights & Biases) #1196
  - See the last Note block in docs.
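Because `$SKYPILOT_RUN_ID` stays the same across recoveries of one spot job, a training script can use it as a stable identifier, for example as an experiment-tracking run name. The sketch below only reads the environment variable named in these notes; the helper `stable_run_name` and its `default` fallback are hypothetical and chosen for illustration.

```python
import os


def stable_run_name(env=None, default="local-run"):
    """Return an identifier that survives spot recoveries.

    Reads SKYPILOT_RUN_ID from the environment and falls back to a
    default when the script runs outside a managed spot job.
    """
    env = os.environ if env is None else env
    return env.get("SKYPILOT_RUN_ID", default)


# Usage: pass the result as the run name/ID of your tracking tool so
# that a recovered job continues logging to the same run.
run_name = stable_run_name({"SKYPILOT_RUN_ID": "example-run-id"})
```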
Enhancements
- Distinguish spot controller names for different users #1101
  - This may leak an old stopped controller if you have used `spot launch` with <= 0.1.2.
- Add retry for spot cluster termination #1139
- Enable purge for spot controller #1107
- Show FAILED_CONTROLLER when the controller exits abnormally #1143
- Make get_job_timestamp fetching more robust #1148
- Fail early when the spot cluster name is too long on GCP #1183
Fixes
- Fix the retry logic for spot cluster launching #1150
- Fix non-persistent storage deletion for spot #974
- Fix spot recovery without cloud specified #1077
- Fix spot job duration #1104
- Fix `sky spot status -a` for resources and region information #1135
TPU support
- Support TPU Pod #1001
Provisioner
Enhancements
- Improving provision speed by ~1 minute (#1092, #1103, #1108, #1111, #1126)
- Add host VM - GPU compatibility checks for GCP #989
Fixes
- Fix GCP VM leak issue #1102
- Fix GCP A100 launch error #1166
- Fix K80 gpunode by correcting GCP image version #1090
On-prem
Enhancements
- Simplified on-prem deployment
  - `sky admin deploy` now automatically installs `skypilot`, `ray` (and `python3` and `pip3`) on the local cluster under admin user #1116
- Add cluster config schema check #1044
- Modify Sky Admin's Setup on Docs #1085
- Align Python Versions #1086
Fixes
- Fix Sky Status Logging #1041
Backend
Enhancements
- Catalog is updated to V3 with 100s of resource changes and 1000s of price changes #1204
- Canonicalize accelerator names in Resources #1075
- Reduce the frequency of job status update and remove parallel query #1096
- Increase thread limit and fix nofile limit #1128
Fixes
- [Storage] Fix public bucket source check in SkyPilot Storage #1087
- Fix Ray dashboard hanging problem (#1088) #1109
- Fix placement group not scheduled issue (issue #1130) #1134
Misc. enhancements
- New example: Stable Diffusion #1149
- `pip install skypilot` now installs `skypilot[aws]` by default #1055
- Improve error messages for cloud import errors #1156
- Change `~/.ssh/config` permissions #1174
- Relative cluster yaml #1176
- UX: remove DURATION, move HOURLY_PRICE in status table (-a) #1129
Thanks to all Contributors!
New contributors
- @sumanthgenz made their first contribution in #1065
- @ewzeng made their first contribution in #1174
Many thanks to all contributors who contributed to this release!
@Michaelvll, @concretevitamin, @infwinston, @michaelzhiluo, @WoosukKwon, @romilbhardwaj, @sumanthgenz, @ewzeng, @iojw, @franklsf95
SkyPilot v0.1.1
Highlights
This is our first release for SkyPilot -- a framework for easily running machine learning workloads on any cloud through a unified interface. No knowledge of cloud offerings is required or expected – you simply define the workload and its resource requirements, and SkyPilot will automatically execute it on AWS, Google Cloud Platform or Microsoft Azure.
Key features
- Run existing projects on the cloud with zero code changes
- Easily provision VMs across multiple cloud platforms (AWS, Azure or GCP)
- Easily manage multiple clusters to handle different projects
- Quick access to cloud instances for development
- Store datasets on the cloud and access them like you would on a local file system
- No cloud lock-in – seamlessly run your code across cloud providers
Thanks
Many thanks to all those who contributed to this release!
@concretevitamin @romilbhardwaj @Michaelvll @infwinston @michaelzhiluo @WoosukKwon @suquark @mraheja @gmittal @iojw @lhqing @franklsf95
Full Changelog: https://github.com/skypilot-org/skypilot/commits/v0.1.1