Release v1.48.0
Highlights
- The Toolkit's GKE nodepool module now supports multiple nodepools (PR#3826)
- Automatic Prolog/Epilog Slurm GPU Health Checks
- Kueue v0.11.1 manifest support
What's Changed
Key New Features 🎉
- Add a4-high-vm blueprint by @samskillman in #3751
- Cloud DNS config addition to GKE Cluster module by @SwarnaBharathiMantena in #3752
- Update Slurm image reference to new family (6-9) by @abbas1902 in #3740
- Adding Automatic Prolog/Epilog Slurm GPU Health Checks by @RachaelSTamakloe in #3781
- Created GPU Operator Manifest by @ighosh98 in #3814
- add kueue v0.11.1 manifest by @ighosh98 in #3833
- Support resource manager tags on instance template and attached disks by @annuay-google in #3829
- introduce feature to enable k8s beta apis by @ighosh98 in #3840
- Add support for Kueue 0.11.1 by @mwysokin in #3830
- Integrate gpu operator in kubectl by @ighosh98 in #3838
Module Improvements 🔨
- add support for enablePrivateNode at nodepool level by @chengcongdu in #3794
- Add nodeset name as a label to all nodeset instance templates by @annuay-google in #3787
- Fix network names backward compatibility for A3 Mega and A3 High by @sharabiani in #3811
- Support multiple nodepools creation in gke nodepool module by @SwarnaBharathiMantena in #3826
- Enable higher performance self-managed NFS server configurations by @ndebuhr in #3807
- add support for resource quota in gpu-operator namespace by @ighosh98 in #3855
Improvements 🛠
- A4 GKE integration test by @annuay-google in #3718
- A3U Slurm: enable nvidia-persistenced daemon by @tpdownes in #3698
- Remove experimental tag from GKE blueprints in readme by @parulbajaj01 in #3724
- NCCL integration tests by @annuay-google in #3697
- Add NeMo and HPL Slurm GCS System Benchmarks with Ramble by @samskillman in #3726
- GCS update to GKE A4 High blueprint by @SwarnaBharathiMantena in #3749
- Update Kueue documentation by @ighosh98 in #3786
- Add comment descriptions for a3U vars by @parulbajaj01 in #3783
- Improve job template naming by @ighosh98 in #3816
- Update GPU Operator manifest definition by @ighosh98 in #3820
- Unify gke a3 ultra blueprints by @ighosh98 in #3835
- Advanced network configuration support on notebook instance community module by @caetano-colin in #3671
- updating defaults for slurm chs prolog by @RachaelSTamakloe in #3843
- Add disk size vars in deployment file for a3U by @parulbajaj01 in #3812
- Update A3U slurm threads configuration by @ighosh98 in #3853
Deprecations 💤
- Reduce number of startup scripts by @mr0re1 in #3770
- Add omnia deprecation warning and update A3U and A4 blueprints threads configurations by @ighosh98 in #3837
Version Updates ⏫
Bug fixes 🐞
- Fix issue 3748 (Error with stateful_ips iteration in MIG) by @rbekhtaoui in #3765
- Update urls to point to toolkit main by @ighosh98 in #3793
- Rollback name injection change in job template by @ighosh98 in #3821
- Add Rocky 9 compatibility for NFS by @samskillman in #3813
- fix gke a3-ultra blueprint by @ighosh98 in #3845
- Force retry use of gce service account by @samskillman in #3854
New Contributors
- @yuryu made their first contribution in #3728
- @DavidToneian-Google made their first contribution in #3766
- @rbekhtaoui made their first contribution in #3765
- @Shuang-cnt made their first contribution in #3799
- @sheepx86 made their first contribution in #3832
- @caetano-colin made their first contribution in #3671
- @ndebuhr made their first contribution in #3807
- @mwysokin made their first contribution in #3830
Full Changelog: v1.47.0...v1.48.0