Releases: GoogleCloudPlatform/cluster-toolkit
v1.79.0
v1.78.0
What's Changed
Breaking Changes 🚨
- Fix private address space for gke-a3-megagpu.yaml by @omartin2010 in #4478
Improvements 🛠
- Add precondition checks to disallow setting conflicting consumption options by @kadupoornima in #5062
Deprecations 💤
- Add deprecation notice for parallelstore module by @parulbajaj01 in #5083
- Deprecate a3u-gcs blueprint as it's no longer maintained by @bytetwin in #4871
Version Updates ⏫
- Add gIB versions v1.1.1 and v1.1.0 for arm64 by @duncanspani in #5090
New Contributors
- @AdarshK15 made their first contribution in #5095
- @duncanspani made their first contribution in #5090
- @siddhartha-quad made their first contribution in #4792
Full Changelog: v1.77.0...v1.78.0
v1.77.0
What's Changed
Key New Features 🎉
- Integrate Kueue support for GKE TPU v6 and v7x blueprints by @agrawalkhushi18 in #5007
- feat: Enable Block topology for A4X by @Neelabh94 in #5021
- Support shared reservations in gke-node-pool module by @SwarnaBharathiMantena in #5040
- Add automated GCP resource cleanup script and Cloud Build pipeline by @simrankaurb in #5039
- Add integration test for A3 high-GPU with spot VMs by @simrankaurb in #4984
- feat: Add community module for executing gcloud commands by @cboneti in #4923
Breaking Changes 🚨
- Graduate network/private-service-access to core modules by @SwarnaBharathiMantena in #5029
Improvements 🛠
- Refactor fio job template with best practices by @parulbajaj01 in #4977
- Enable h4d-vm test to run on Spot VMs by @simrankaurb in #5022
- Adding Robust destroy in cluster toolkit by @shubpal07 in #4866
Bug fixes 🐞
- Adding G4 configuration by @LAVEEN in #5024
- Use ternary operator for anywhere_cache precondition in main.tf by @Neelabh94 in #5033
Full Changelog: v1.76.0...v1.77.0
v1.76.0
What's Changed
Key New Features 🎉
- feat: Add support for Anywhere Cache in cloud-storage-bucket by @Neelabh94 in #4889
- Adding test for A3 UltraGPU JBVMs with Spot VMs by @simrankaurb in #4968
- On Spot A4 by @LAVEEN in #4953
- Enable Spot VM testing for GKE with A3 mega GPUs by @simrankaurb in #4951
- Enable Spot VM testing for a3-megagpu instances by @simrankaurb in #4901
- Add a post-deploy test specific to TPUs by @agrawalkhushi18 in #4969
Breaking Changes 🚨
- Move community/modules/project/service-account module to core modules directory by @SwarnaBharathiMantena in #4958
Module Improvements 🔨
- Make waiting for kueue installation configurable, and wait for kueue in the G4 GKE blueprint by @kadupoornima in #4973
Improvements 🛠
- Update GKE A4X Readme by @parulbajaj01 in #4955
- Add example nccl test script for slurm on gke by @ACW101 in #4960
Deprecations 💤
- Remove all references to ubuntu20.04 by @sarthakag in #4963
Bug fixes 🐞
Full Changelog: v1.75.1...v1.76.0
v1.75.1
What's Changed
Module Improvements 🔨
- Add exclusion_end_time_behavior and update release channel maintenance window by @SwarnaBharathiMantena in #4990
Full Changelog: v1.75.0...v1.75.1
v1.75.0
What's Changed
Key New Features 🎉
- Add integration test files for TPU v6e by @agrawalkhushi18 in #4906
- Enable Spot VM testing for a3-ultragpu instances by @simrankaurb in #4862
- Add integration test for TPU 7x by @agrawalkhushi18 in #4916
- Adding ML dependencies for G4 & guidance to use dual NIC by @LAVEEN in #4922
- Enable spot VM Testing for GKE: a3ultra by @simrankaurb in #4946
Breaking Changes 🚨
- Graduate cloud-storage-bucket module to core modules and update references by @SwarnaBharathiMantena in #4927
Module Improvements 🔨
- Updating Kueue default version to 0.14.4 in A4X by @shubpal07 in #4850
Improvements 🛠
- Add NCCL test validation to G4 Integration tests by @kadupoornima in #4933
- Register job_completion output in test-gke-job.yml by @agrawalkhushi18 in #4957
Bug fixes 🐞
- Minor fix: Delegating gcloud command to localhost by @simrankaurb in #4937
Full Changelog: v1.74.0...v1.75.0
v1.74.0
What's Changed
Key New Features 🎉
- Add Google Cloud NetApp Volumes support by @okrause in #4583
- Add NCCL tests for G4 NPI by @kadupoornima in #4898
- Add TPU 7x blueprint files and changes in tpu-definition module by @agrawalkhushi18 in #4887
Module Improvements 🔨
- Add force_conflicts flag when applying manifests using kubectl by @SwarnaBharathiMantena in #4874
Improvements 🛠
- Modify the wait-for-startup-script to fix test failures by @agrawalkhushi18 in #4845
- Update recommended FI_UNIVERSE_SIZE setting for startup script by @linsword13 in #4782
- Add GCS updates to GKE A4X by @parulbajaj01 in #4864
- Graduating tpu v6e from community to core by @shubpal07 in #4909
Bug fixes 🐞
- Update the nccl-tcpxo-installer, nri-device-injector, and nccl-test for a3-megagpu-8g machines by @SwarnaBharathiMantena in #4902
- Pin mypy version in pre-commit dep. to last stable version, i.e. 1.18.2, by @shubpal07 in #4913
Other changes
- Hotfix v1.73.1 (#4884) by @aslam-quad in #4910
New Contributors
- @kvenkatachala333 made their first contribution in #4912
Full Changelog: v1.73.1...v1.74.0
v1.73.1
What's Changed
Bug fixes 🐞
- Fixing gpu-test by @cboneti in #4869
- Upgraded nccl-plugin-gpudirecttcpx-dev to v1.0.14 and tcpgpudmarxd-dev to v1.0.20 via slurm-gcp repo.
Full Changelog: v1.73.0...v1.73.1
v1.73.0
What's Changed
Key New Features 🎉
- feat: Add GKE Inference Gateway support by @SinaChavoshi in #4699
- a3high single blueprint to use the tcpx patched kernel by @bytetwin in #4821
Improvements 🛠
- Initial Blueprint G4 by @LAVEEN in #4685
- Parameterise gIB NCCL RDMA plugin installer in gke a4x by @parulbajaj01 in #4843
New Contributors
- @SinaChavoshi made their first contribution in #4699
- @simrankaurb made their first contribution in #4837
Full Changelog: v1.72.0...v1.73.0
v1.72.0
What's Changed
Key New Features 🎉
- Integrating Managed lustre in TPU v6e by @shubpal07 in #4814
- Support sycomp storage by @gqiu-sycomp-com in #4798
Breaking Changes 🚨
- Enable Private Nodes by default in GKE Node Pool by @kadupoornima in #4682
Module Improvements 🔨
- Add tpu_topology as an output value for workload_policy by @agrawalkhushi18 in #4813
Improvements 🛠
- Refactor a4xhigh-slurm-blueprint.yaml by moving epilog and prolog to slurm-gcp by @Neelabh94 in #4733
- Update nccl-rdma manifest in gke a4x by @parulbajaj01 in #4817
- Adding integration test for GKE A4X by @vikramvs-gg in #4828
Bug fixes 🐞
- Fix default mount paths in Slurm controller README.md by @nikosavola in #4779
New Contributors
- @sudheer-quad made their first contribution in #4791
- @gqiu-sycomp-com made their first contribution in #4798
Full Changelog: v1.71.0...v1.72.0