v1.49.0
Highlights
- TPU support in the GKE nodepool module, with an example blueprint
- Support for Managed Lustre in pre-existing-network-storage module; Managed Lustre provisioning will be supported in a future Toolkit release
What's Changed
Key New Features 🎉
- add nvidia imex support by @ighosh98 in #3885
- TPU support with GKE nodepool module and TPU v4 2x2x2 example blueprint by @SwarnaBharathiMantena in #3817
- helm_install module implemented by @ighosh98 in #3933
- integrate support for multi-arch compliant jobset v0.8.1 by @ighosh98 in #3934
- Update vm-instance to support additional persistent disks by @tpdownes in #3935
- add support for workload policy by @ighosh98 in #3938
Breaking Changes 🚨
- Make login nodes deployable independently of "controller" by @mr0re1 in #3958
  NOTE: Attempting to re-deploy a pre-existing Slurm cluster with the new gcluster version will cause login nodes to be destroyed.
- The DWS Flex implementation changes with this release. If you would like to continue using the legacy implementation, we've added a use_bulk_insert option to our dws_flex nodeset settings. For more on DWS Flex support in Slurm, visit: https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/docs/slurm-dws-flex.md
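
As a rough illustration of the DWS Flex note above, a nodeset that opts back into the legacy behavior might look like the blueprint fragment below. This is only a sketch: the module ID `flex_nodeset`, the `use: [network]` reference, and the assumption that `use_bulk_insert` sits next to an `enabled` flag under `dws_flex` are not taken from these release notes, so check the linked DWS Flex documentation for the authoritative field names and defaults.

```yaml
# Hypothetical nodeset entry; the id and the referenced network module are placeholders.
- id: flex_nodeset
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  use: [network]
  settings:
    dws_flex:
      enabled: true
      # Assumed field placement: opts this nodeset into the legacy
      # (bulk insert) DWS Flex implementation instead of the new one.
      use_bulk_insert: true
```

Leaving `use_bulk_insert` unset would, under the same assumptions, pick up the new implementation introduced in this release.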
New Modules 🧱
- Adding Managed Lustre to Cluster Toolkit by @cdunbar13 in #3950
Module Improvements 🔨
- [GKE] Add support to enable DNS based endpoint config by @mohitchaurasia91 in #3884
- Update filestore timeout config based on high capacity tier by @mohitchaurasia91 in #3900
- split terraform bundles for gpu operator by @ighosh98 in #3911
- split crd manifest out by @ighosh98 in #3913
- Add support for filestore instance description by @mohitchaurasia91 in #3953
- Fix Packer documentation for minimum necessary IAM roles by @tpdownes in #3960
- Fix workload_policy variable definition and usage by @mohitchaurasia91 in #3963
- Add Managed Lustre to pre-existing-filestore module by @cdunbar13 in #3937
Improvements 🛠
- Update GKE version prefix for A3 Mega to v1.32.2 by @ighosh98 in #3874
- Add disk size vars for A4 by @parulbajaj01 in #3872
- Add comment description for variables in a4 blueprint by @parulbajaj01 in #3880
- Add kueue configuration support to a3 mega by @ighosh98 in #3860
- Update dra driver module by @ighosh98 in #3894
- Add MIG based DWS Flex support by @abbas1902 in #3903
- A4 Slurm: enable sudo in Slurm jobs for users with OS Admin Login role by @tpdownes in #3961
- Add sudo via OS Login to all A3 Slurm solutions by @RachaelSTamakloe in #3966
Version Updates ⏫
Bug fixes 🐞
- Fix filestore instance location var for REGIONAL tier by @mohitchaurasia91 in #3871
- add GCS updates to A3 Ultra by @ighosh98 in #3883
- GPU Operator Integration Redesign by @ighosh98 in #3892
- Add resource quota for gpu operator by @ighosh98 in #3895
- Update imex and nvidia DRA Driver configurations by @ighosh98 in #3902
- Update nccl installer for a4 by @ighosh98 in #3906
- Fix syntax errors for resource policy by @ighosh98 in #3954
- Revert network profile URI by @cdunbar13 in #3962
Full Changelog: v1.48.1...v1.49.0