Releases: GoogleCloudPlatform/cluster-toolkit

Release v1.45.0

15 Jan 23:54
79299a1

Highlights:

  • A3 Ultra GKE blueprints now use Kueue 0.10.0 and JobSet 0.7.2, both of which are now supported.
  • Module improvements add support for GKE cluster deletion protection, default node pools with shielded instances, the latest GKE version in the Rapid channel for A3 Ultra clusters, configurable upgrade settings for node pools, and managed Hyperdisk.
  • Example for running NVIDIA NeMo on a3-ultragpu-8g Slurm clusters.

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Bug fixes 🐞

New Contributors

Full Changelog: v1.44.2...v1.45.0

v1.44.2: Fix for Slurm autoscaler support for future reservations

09 Jan 00:21
484da6e

What's Changed

Bug fixes 🐞

  • Hotfix: Slurm autoscaler support for future reservations by @tpdownes in #3508

Full Changelog: v1.44.1...v1.44.2

Release v1.44.1: Support for a3-ultragpu-8g VMs and GKE, Slurm clusters

30 Dec 23:36
346d015

Release notes v1.44.1

This release announces Toolkit support for the new A3 Ultra machine type from Google Cloud. This machine type includes 8 NVIDIA H200 GPUs, each with dedicated CX-7 networking with RDMA support via RoCE.

The release includes 4 blueprints that maximize performance for the machine type:

  1. A simple Slurm blueprint provisioning A3 Ultra compute nodes with a shared Filestore /home
  2. A GKE blueprint that provisions an A3 Ultra compute node pool
  3. An advanced Slurm blueprint that additionally mounts a GCS bucket with performance-optimized caching settings for I/O and checkpointing
  4. A blueprint that provisions A3 Ultra compute nodes as VM instances (no scheduler) with RDMA networking

Example solutions using NCCL are provided for blueprints running under a scheduler.

v1.44.0: Future Reservations in Slurm, Topology Aware GKE, Expanded GPU RDMA Support

19 Dec 22:55
6a19416

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Version Updates ⏫

Bug fixes 🐞

Full Changelog: v1.43.1...v1.44.0

v1.43.1: Patch version bump in OFE

12 Dec 20:02
0a8385b

What's Changed

Version Updates ⏫

  • Bump django from 4.2.16 to 4.2.17 in /community/front-end/ofe by @dependabot in #3358

Full Changelog: v1.43.0...v1.43.1

v1.43.0: GKE and networking enhancements

05 Dec 06:57
7ca11fc

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Bug fixes 🐞

  • Revert "update a3 machines local ssd to use nvme instead of scsi for better performance" by @chengcongdu in #3272
  • remove GKE reservation validation for local ssd NVMe/SCSI interface by @chengcongdu in #3281

New Contributors

Full Changelog: v1.42.0...v1.43.0

v1.42.0: Filestore deletion protection, GCP maintenance as Slurm job, Docker daemon configuration

20 Nov 19:27
1a1e22a

What's Changed

Key New Features 🎉

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Version Updates ⏫

Bug fixes 🐞

  • Refactor mount/mode setting for local SSD RAID by @tpdownes in #3214
  • Fix a bug where try was hiding extraction of gpu driver version by @ankitkinra in #3257
  • Fix the gpu_installation_config default for case where no customer input by @ankitkinra in #3259
  • SlurmGCP. Fix bug that prevents resourcePolicies clean up. by @mr0re1 in #3266

New Contributors

Full Changelog: v1.41.0...v1.42.0

v1.41.0 Adoption of Slurm 24.05 and Improvements to GKE Support

25 Oct 16:58
26fafe0

What's Changed

Key New Features 🎉

New Modules 🧱

Module Improvements 🔨

Improvements 🛠

  • Create and use non-default service accounts in GKE by @annuay-google in #3123
  • Added documentation on cloud-ops-agent installation and stackdriver removal by @jrossthomson in #3029
  • Ensure local SSD filesystem is assembled into a RAID even upon power off/on cycles by @tpdownes in #3129

Deprecations 💤

Version Updates ⏫

Bug fixes 🐞

  • Fixed the exact number constraint problem for additional vpcs in gpu_direct checks by @sharabiani in #3078
  • Provide explicit project information by @wiktorn in #3060
  • Chrome Remote Desktop: increase resilience of apt operations by @tpdownes in #3093
  • Add mount parallelstore service to mount parallelstore for every reboot by @harshthakkar01 in #3125

New Contributors

Full Changelog: v1.40.1...v1.41.0

v1.40.1 Fix issue that affected GKE blueprints due to dynamic provisioning

10 Oct 01:20
eb00254

What's Changed

Other changes

  • Revert PR#3046 and add more line breaks for readability by @ankitkinra in #3115

Full Changelog: v1.40.0...v1.40.1

v1.40.0: A3 Mega and A3 High families supported in GKE

03 Oct 21:13
f9f9256

What's Changed

Important

All HPC VM images based upon CentOS 7 have been deprecated. This means that
referring to the "hpc-centos-7" family in the "cloud-hpc-image-public"
project will fail. We recommend migrating to the "hpc-rocky-linux-8" family,
which is the new default throughout the Toolkit. If CentOS 7 is truly needed,
the final HPC CentOS 7 image can be used by its name: "hpc-centos-7-v20240712".
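
The migration described above can be sketched as a small Python helper. This is only an illustration, not part of the Toolkit: the function name migrate_instance_image is hypothetical, and the dict layout (an image setting with "project" plus either "family" or "name" keys) is an assumption about how an image reference might be represented. Only the family, project, and image names come from the note above.

```python
# Hypothetical helper; the dict layout (project + family/name keys) is an
# assumption for illustration. The string values come from the release note.
DEPRECATED_FAMILY = "hpc-centos-7"
RECOMMENDED_FAMILY = "hpc-rocky-linux-8"
FINAL_CENTOS_IMAGE = "hpc-centos-7-v20240712"
IMAGE_PROJECT = "cloud-hpc-image-public"


def migrate_instance_image(image, keep_centos=False):
    """Return a copy of an image setting that avoids the deprecated family.

    Settings that do not reference the deprecated CentOS 7 family in the
    public HPC image project are returned unchanged (as a copy).
    """
    uses_deprecated = (
        image.get("project") == IMAGE_PROJECT
        and image.get("family") == DEPRECATED_FAMILY
    )
    if not uses_deprecated:
        return dict(image)
    if keep_centos:
        # Pin the final CentOS 7 image by name instead of by family.
        return {"project": IMAGE_PROJECT, "name": FINAL_CENTOS_IMAGE}
    # Default path: move to the recommended Rocky Linux 8 family.
    return {"project": IMAGE_PROJECT, "family": RECOMMENDED_FAMILY}
```

Calling the helper with keep_centos=True pins the last published CentOS 7 image by name; the default call switches to the recommended family.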

Key New Features 🎉

New Modules 🧱

Module Improvements 🔨

Improvements 🛠

Deprecations 💤

Version Updates ⏫

Bug fixes 🐞

Other changes

  • NeMo readme instructions for preloading gpt2 tokenizer by @koallison in #3075

New Contributors

Full Changelog: v1.39.0...v1.40.0