Releases: GoogleCloudPlatform/cluster-toolkit
v1.71.0
What's Changed
Module Improvements 🔨
- Adding validations for naming resources by @vikramvs-gg in #4788
Improvements 🛠
- Add Managed Lustre support in gke-a4x blueprint by @parulbajaj01 in #4793
Bug fixes 🐞
New Contributors
Full Changelog: v1.70.0...v1.71.0
v1.70.0
What's Changed
Breaking Changes 🚨
- Removing support for maintenance_interval for reservations created by TAMs by @LAVEEN in #4748
- Migration of jobset from static manifests to helm chart and upgrading version to 0.10.1 by @shubpal07 in #4765
Module Improvements 🔨
- Add automated TPU support and GCS integration in TPU v6 blueprint by @shubpal07 in #4755
Improvements 🛠
- H4d blueprint refactored by @rachit-google in #4740
Full Changelog: v1.69.0...v1.70.0
v1.69.0
What's Changed
Key New Features 🎉
- Add NUMA-aware scheduling in GKE clusters (enabled for G4) by @kadupoornima in #4760
- Add daily PR integration tests for G4 machines by @kadupoornima in #4761
New Modules 🧱
Improvements 🛠
- Adding GKE sample for running nvidia-bug-report by @raushan2016 in #4741
- PSA update by @okrause in #4744
New Contributors
- @aslam-quad made their first contribution in #4742
Full Changelog: v1.68.0...v1.69.0
v1.68.0
What's Changed
Key New Features 🎉
- downloading libnccl2 and libnccl-dev for a3u and a4h by @rachit-google in #4680
Breaking Changes 🚨
- Allow setting use_job_duration with non-exclusive partitions by @arpit974 in #4696
- Add multi-network support in TPU v6e by @agrawalkhushi18 in #4723
- Update vpc and cloud_router versions in VPC network module by @kadupoornima in #4732
Module Improvements 🔨
- Refactoring in gke persistent module by @vikramvs-gg in #4618
- Migrate Kueue installation to use Helm chart by @shubpal07 in #4542
Improvements 🛠
- Update nvidia DRA driver version to v25.3.0 by @parulbajaj01 in #4670
- Updated A3-mega and A4-high Slurm blueprints to adopt nvidia add repository script by @rachit-google in #4667
- Update H4D blueprint: disable automatic updates, provide image info, and delete duplicate filestore by @Neelabh94 in #4644
- Add Managed Lustre support in gke-a4 by @parulbajaj01 in #4654
- Add Managed Lustre support in gke a3 ultra by @parulbajaj01 in #4700
- Adds an irdma health check to h4d nodes by @samskillman in #4704
- Enable Spot VM Provisioning For H4D by @LAVEEN in #4735
- Add slurm-gke blueprint by @ACW101 in #4607
Version Updates ⏫
Bug fixes 🐞
- Remove superfluous addition of chs logs to cloud ops config by @abbas1902 in #4679
- Adding "datacenter-gpu-manager-4-dev" as an additional installation in A* YAML files by @Neelabh94 in #4623
- minor bug fix on MFT version comparison by @ljqg in #4689
- Fix inconsistent plan on Slurm cluster reconfigure by @wiktorn in #4538
- Update process to filter out starting comments in a source yaml file by @SwarnaBharathiMantena in #4707
- Fix gke build failures by @annuay-google in #4708
- Update machine-learning/a3-ultragpu-8g/nemo-framework to fix segmentation fault error by @SwarnaBharathiMantena in #4725
New Contributors
- @mufaqam-gcl made their first contribution in #4688
- @wtempel made their first contribution in #4705
- @nikosavola made their first contribution in #4720
- @ACW101 made their first contribution in #4607
Full Changelog: v1.67.0...v1.68.0
Release v1.67.0
What's Changed
Key New Features 🎉
Module Improvements 🔨
- added nvidia-repositories script by @rachit-google in #4553
Improvements 🛠
- Install NCCL/gIB .deb and .rpm packages for A3U and A4 by @rachit-google in #4543
- updating example to use jax ai images by @pulasthi in #4575
- Enabling Spot VM For A3 Mega/High by @LAVEEN in #4634
New Contributors
Full Changelog: v1.66.0...v1.67.0
Release v1.66.0
What's Changed
Key New Features 🎉
- H4D enable gcsfuse and set cluster availability type to ZONAL by @kadupoornima in #4608
- Add G4 GKE base blueprints by @kadupoornima in #4560
Module Improvements 🔨
- Slinky upgraded to v0.3.1 by @sharabiani in #4548
- Update Managed lustre gke blueprint by @parulbajaj01 in #4603
Improvements 🛠
- Making separate integration test for nccl test in gke a3 ultra by @shubpal07 in #4622
- Upgrade to Slurm 25.05 by @LAVEEN in #4606
- Hotfix: H4D Blueprint provisioning model option update by @abbas1902 in #4640
New Contributors
- @saara-tyagi27 made their first contribution in #4619
Full Changelog: v1.65.0...v1.66.0
v1.65.1: Hotfix: H4D Blueprint provisioning model options update
What's Changed
Improvements 🛠
- Hotfix: H4D Blueprint provisioning model options update by @abbas1902 in #4640
Full Changelog: v1.65.0...v1.65.1
v1.65.0
What's Changed
Improvements 🛠
- Surface Managed Lustre support in a4x by @RachaelSTamakloe in #4576
- Expand A* gpu network wait solution by @RachaelSTamakloe in #4584
- Restart slurmctld.service before scontrol reconfigure by @RachaelSTamakloe in #4609
- Support use of other shared file locations for NCCL Tests by @RachaelSTamakloe in #4615
- Add sudo to systemctl restart by @RachaelSTamakloe in #4626
Deprecations 💤
- Deprecate Debian blueprints from a3 mega gpu by @rachit-google in #4537
Bug fixes 🐞
- Power down non-responding node if there is no instance attached by @abbas1902 in #4627
Full Changelog: v1.64.0...v1.65.0
Release v1.64.0
What's Changed
Key New Features 🎉
- GKE Managed Lustre integration by @vikramvs-gg in #4572
Breaking Changes 🚨
- updated the storage for a3Ultra to basic ssd by @rachit-google in #4516
Improvements 🛠
Version Updates ⏫
- Revert gke-node-pool module to using google-beta provider by @kadupoornima in #4577
New Contributors
Full Changelog: v1.63.0...v1.64.0
Release v1.63.0
What's Changed
Key New Features 🎉
- Switch to 6-11 slurm image versions by @abbas1902 in #4425
Breaking Changes 🚨
- updated file storage to basic hdd for A3ULTRA by @rachit-google in #4511
- Update vm boot disk and coupling it with VM instance by @arpit974 in #4558
Module Improvements 🔨
- Improvements applied to gke-pv-module by @sharabiani in #4499
- Add support for regional Instance Template in Slurm scripts by @sharabiani in #4501
- Added partial Slinky install option by @sharabiani in #4508
- Improvements on gke-node-pool module by @sharabiani in #4509
- Add GPU type to Slurm's gres.config by @sharabiani in #4546
- Add GKE Nodes support to Slurm controller by @sharabiani in #4547
Improvements 🛠
- add a4x image build recipe by @RachaelSTamakloe in #4495
- Transition DWS Flex-Start to Regional MIGs by @abbas1902 in #4491
Bug fixes 🐞
- Fix: Slurm scripts incorrectly identifying all accelerators as GPUs by @sharabiani in #4498
- Get assuredCount via specificReservation by @abbas1902 in #4540
- Fix filter for gke-nodepool instance_templates by @sharabiani in #4559
New Contributors
- @Neelabh94 made their first contribution in #4552
Full Changelog: v1.62.2...v1.63.0