Releases: GoogleCloudPlatform/cluster-toolkit
v1.57.2: automate nvidia-bug-report collection on GCE COS VM
v1.57.1: Add Kueue-0.12.2 and make it the default
Release v1.57.0
Highlights
- CHS integrations to GKE blueprints A3 Mega, A3 Ultra, and A4 by @ishitachail in #4293 #4321 #4323 #4328 #4330
What's Changed
Breaking changes 🚨
As part of #4275, the install_cloud_rdma_drivers.sh startup script is removed from H4D blueprints. Users should update to this version of Cluster Toolkit, because the latest HPC VM/Slurm images have compatible versions of the RDMA packages pre-installed.
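For locally maintained copies of a blueprint, a quick text scan can show whether the removed script is still referenced. The following is a minimal sketch, not part of Cluster Toolkit: the helper names (`find_script_references`, `scan_blueprint`) and the sample blueprint fragment are hypothetical, and only the script name comes from the note above.

```python
from pathlib import Path

# Script name taken from the breaking-change note for v1.57.0.
REMOVED_SCRIPT = "install_cloud_rdma_drivers.sh"

# Hypothetical blueprint fragment, for illustration only; a real H4D
# blueprint may reference the script differently.
SAMPLE = "\n".join(
    [
        "deployment_groups:",
        "- group: primary",
        "  modules:",
        "  - id: startup",
        "    settings:",
        "      runners:",
        "      - type: shell",
        "        destination: install_cloud_rdma_drivers.sh",
    ]
)


def find_script_references(blueprint_text: str, needle: str = REMOVED_SCRIPT) -> list[int]:
    """Return the 1-based line numbers of blueprint_text that mention needle."""
    return [
        lineno
        for lineno, line in enumerate(blueprint_text.splitlines(), start=1)
        if needle in line
    ]


def scan_blueprint(path: str) -> list[int]:
    """Read a blueprint file from a caller-supplied path and scan its text."""
    return find_script_references(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Prints the line numbers in SAMPLE that still mention the script.
    print(find_script_references(SAMPLE))
```

An empty result for a blueprint suggests no leftover reference to the script; any reported line numbers are places to review when updating.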
Key New Features 🎉
- CHS Integration for A3 Mega by @ishitachail in #4293
- CHS Integration for A3 Ultra by @ishitachail in #4321
- CHS Integration for A4 by @ishitachail in #4323
- CHS for A3 Ultra by @ishitachail in #4328
- CHS for a4 by @ishitachail in #4330
- [update] new weight request form URL by @fschuerm in #4320
New Modules 🧱
Improvements 🛠
- update nccl to 1.0.6 by @cboneti in #4303
- enable wait_for_rollout for kubectl dependencies by @ighosh98 in #4305
Deprecations 💤
- Revamp install_cloud_rdma_drivers startup script by @abbas1902 in #4275
Bug fixes 🐞
- enable tas for kueue v0.11.4 by @ighosh98 in #4304
- Fix TAS Flag in v0.11.4 by @ighosh98 in #4319
- Remove Kueue topology annotation as DWS does not work with TAS (yet) by @SwarnaBharathiMantena in #4336
Full Changelog: v1.56.0...v1.57.0
Release v1.56.0
What's Changed
Breaking changes 🚨
v1.56.0 introduces a schema change for load_bq.py.
- Fix job row insertion on load_bq.py by @abbas1902 in #4257
Improvements 🛠
- SlurmGCP Resume Improvements by @alyssa-sm in #4276
Version Updates ⏫
- Bump urllib3 from 2.3.0 to 2.5.0 in /community/front-end/ofe by @dependabot in #4296
- Bump protobuf from 5.29.3 to 5.29.5 in /community/front-end/ofe by @dependabot in #4286
- Bump requests from 2.32.3 to 2.32.4 in /community/front-end/ofe by @dependabot in #4285
Bug fixes 🐞
- Fix job row insertion on load_bq.py by @abbas1902 in #4257
Full Changelog: v1.55.1...v1.56.0
v1.55.1 Hotfix: Reduce the severity of missed metadata fetches
This hotfix reduces the severity of missed metadata fetches for newly supported metadata fields in Slurm-GCP.
What's Changed
- Reduce log pollution of failed Metadata fetch by @abbas1902 in #4290
Full Changelog: v1.55.0...v1.55.1
Release v1.55.0
Highlights
- New blueprint example that lets you create a high-throughput execution environment for Google DeepMind's AlphaFold 3
- Updated A3-Ultra GCSFuse example blueprint to align with best practices
What's Changed
Key New Features 🎉
Improvements 🛠
- Removing MGLRU dependency from Google cloud cluster toolkit by @shubpal07 in #4255
- Modify reservation variable to accommodate different reservation options by @SwarnaBharathiMantena in #4253
- kubernetes provider module implementation by @ighosh98 in #4247
- Align GCSFuse configurations with best practices by @samskillman in #4263
- Information on DWS Calendar consumption option in GKE blueprint by @SwarnaBharathiMantena in #4259
Version Updates ⏫
- Bump django from 5.1.9 to 5.1.10 in /community/front-end/ofe by @dependabot in #4248
Bug fixes 🐞
- Kueue Config Integration Tests incorporating different Accelerator types for different machines by @ishitachail in #4252
Full Changelog: v1.54.0...v1.55.0
Release v1.54.0
Highlights
- Managed Lustre now supports non-default ports (for GKE compatibility), GKE cluster deployment is faster, and the A3 High network blocking script is now available as a startup-script feature.
What's Changed
Module Improvements 🔨
- Add Managed Lustre support for non-default ports (GKE compatibility) by @tpdownes in #4210
- Implement A3 High network blocking script as startup-script feature by @tpdownes in #4233
Improvements 🛠
- Speed up deployment of GKE clusters by @ighosh98 in #4215
- Update to datacenter-gpu-manager-4 package in A-series blueprints by @RUEI4341 in #4228
- Remove Docker configuration warning by @tpdownes in #4229
Full Changelog: v1.53.0...v1.54.0
Release v1.53.0
Highlights
- The A3Mega Slurm solution now standardizes on Ubuntu: the Debian-based custom Slurm image has been deprecated and replaced with a custom Ubuntu Slurm image. Correspondingly, the A3M Slurm Ubuntu solutions have been refactored into a single, consolidated blueprint.
What's Changed
Key New Features 🎉
- Add Ubuntu 24.04 Ansible installation and test coverage by @tpdownes in #4140
- Added GKE a4x blueprint and related configs by @parulbajaj01 in #4199
- Add jobset for a4x by @ighosh98 in #4206
Module Improvements 🔨
- Allow deploying cluster without live reservation by @vikramvs-gg in #4057
- Remove redundant http provider from kubectl apply by @vikramvs-gg in #4190
Improvements 🛠
- Add kueue dependency for nvidia dra driver by @parulbajaj01 in #4147
- Update condition for workload policy by @parulbajaj01 in #4172
- Updating the checkpoint PV for A3U and A4 to the recommended mount options by @raushan2016 in #4180
- Refactor A3M slurm ubuntu solutions into 1 blueprint by @RachaelSTamakloe in #4216
Deprecations 💤
- Deprecate old GKE A3 mega blueprint by @ighosh98 in #4134
- Drop CentOS 7 support for Ansible installation by @tpdownes in #4138
- Drop Ubuntu 20.04 support for Ansible installation by @tpdownes in #4167
- Drop PBS Pro modules and tests by @tpdownes in #4165
- A3M Slurm Solution- deprecate debian slurm image and add ubuntu slurm image by @RachaelSTamakloe in #4170
- Deprecate imex, gpu driver and dra driver manifests by @ighosh98 in #4198
- Adding deprecation warnings to DDN-Exascaler module (and references) by @cdunbar13 in #4189
Version Updates ⏫
- Upgrade Ansible to maximum allowed version on oldest supported OS distributions by @tpdownes in #4139
Bug fixes 🐞
- Add recurse to condor spool directory by @aneo-ssam in #4178
- Fixed the parser error in test-gke-a2-highgpu-kueue by @ishitachail in #4204
- Fixing the missing comma between the mount_options config for gcs A3U and A4 by @raushan2016 in #4207
- Cleanup GCS Fuse configurations and add required permissions for fio-job-template by @ighosh98 in #4214
New Contributors
- @vikramvs-gg made their first contribution in #4142
- @ishitachail made their first contribution in #4148
- @shubpal07 made their first contribution in #4149
- @RUEI4341 made their first contribution in #4184
Full Changelog: v1.52.0...v1.53.0
Release v1.52.0
What's Changed
Breaking Changes 🚨
- CloudSQL improvements: database flags, query insights, bump default version by @wiktorn in #4115
NOTE: This only affects users of the slurm-cloudsql-federation module.
Improvements 🛠
- Add support for multiple task prolog and epilog scripts. Closes #4100 by @gkcalat in #4105
- remove gvnic-1 from pods manifests by @liuyuan10 in #4088
- Move core logic of helm install into independent helm module by @ighosh98 in #4127
Bug fixes 🐞
- Fix multiple bugs in notifying jobs during failed resume by @mr0re1 in #4107
- Relax retry_exception test by @mr0re1 in #4065
- Fixes to nodeset dynamic by @cboneti in #4125
- Fix missing region by @wiktorn in #4113
- Block broken release of nvidia-container-toolkit by @tpdownes in #4152
New Contributors
- @liuyuan10 made their first contribution in #4088
- @pure-jliao made their first contribution in #4124
Full Changelog: v1.51.1...v1.52.0