Releases: GoogleCloudPlatform/cluster-toolkit
v1.4.1: Fix Application Specific Tutorials
What's Changed
- Fix spack setup scripts in tutorials. by @nick-stroud in #533
Full Changelog: v1.4.0...v1.4.1
v1.4.0: Advanced networking for Slurm V5, Version Updates, & Bug Fixes
Improvements
- schedmd-slurm-gcp-v5-partition: Added option to enable gVNIC and TIER 1 networking.
- install_ansible script: Updated to provide a generalized process for installing python, pip & ansible on a variety of OS images.
Version updates
- omnia-install: v1.0 -> v1.3 of DellHPC Omnia
- install_ansible: v2.9 -> v2.11 of Ansible
- schedmd-slurm-gcp-v5-partition, schedmd-slurm-gcp-v5-controller, schedmd-slurm-gcp-v5-login: v5.0.3 -> v5.1.0 of Slurm on GCP
What's Changed
- Remove exit 0 command from Spack install by @tpdownes in #483
- Log runner that is being executed by @nick-stroud in #484
- Bump cloud.google.com/go/compute from 1.8.0 to 1.9.0 by @dependabot in #486
- Upgrade to slurm-gcp v5.1 by @nick-stroud in #487
- Update examples and docs to list ID first, source second by @thiagosgobe in #488
- Expose bandwidth_tier on Slurm V5 compute nodes by @nick-stroud in #490
- Additional validation for blueprint_name by @kkr16 in #482
- Synchronize develop with release branch by @tpdownes in #493
- Fixing bug forcing 8chars cluster_names (vs 10). by @cboneti in #498
- Incorporate Release v1.3.0 into develop by @nick-stroud in #499
- Roll version on develop to v1.3.1 by @nick-stroud in #501
- Update ansible install script by @heyealex in #485
- Return correct code from ansible-local runners by @heyealex in #503
- Update Omnia version in omnia-install by @heyealex in #495
- fix: open in cloud shell misinterpreted variable substitution by @nick-stroud in #512
- Update Batch list command to match updated API by @nick-stroud in #513
- Roll version for minor release by @nick-stroud in #525
- Release v1.4.0 by @nick-stroud in #520
New Contributors
- @thiagosgobe made their first contribution in #488
Full Changelog: v1.3.0...v1.4.0
v1.3.0: Application specific tutorials for Gromacs, Openfoam, & WRF
Key New Features
- Application specific tutorials for Gromacs, Openfoam, & WRF that walk through running real workloads.
New Examples
- slurm-gcp-v5-ubuntu2004.yaml: Creates a slurm cluster based on the ubuntu 20.04 slurm-gcp images.
- slurm-gcp-v5-hpc-centos7.yaml: Rename of the slurm-gcp-v5-cluster.yaml example which uses the hpc-centos7 VM image.
Resource Improvements
- Slurm V5 controller and login nodes support enabling public IP addresses.
- slurm-gcp-v5-*: Remove requirement to set the `slurm_cluster_name` in slurm-gcp-v5 modules.
What's Changed
- Additional validation of setting name conventions by @heyealex in #459
- Update vm-instance to terminate on maintenance when a GPU is attached by @nick-stroud in #460
- Use simpler gcloud image for project cleanup by @heyealex in #461
- Fixing formatting in go files to pass weekly build. by @cboneti in #463
- Bump cloud.google.com/go/compute from 1.7.0 to 1.8.0 by @dependabot in #467
- Bringing develop up to date with main post-release by @nick-stroud in #471
- Bump version patch number post release by @nick-stroud in #472
- Adds tutorials for Gromacs, Openfoam, & WRF that walkthrough running real workloads by @nick-stroud in #466
- Update tutorials to use native api enablement by @nick-stroud in #473
- Improve findability of modules, examples, and tutorials by @tpdownes in #475
- Remove reference to cache override from app tutorials by @nick-stroud in #477
- Making slurm_cluster_name optional by @cboneti in #476
- Change `source` to `.` when calling startup scripts in shell runner by @heyealex in #474
- Change enabled repos by version in nfs-utils install by @heyealex in #478
- Adding Slurm-on-GCP V5 Ubuntu example by @cboneti in #479
- Fix: rename test file from 'build' to 'batch' by @nick-stroud in #480
- Adding option to enable public ips on Slurm-GCP V5 by @cboneti in #481
- Disabling Omnia tests temporarily. by @cboneti in #492
- Resolve parallel builds by @tpdownes in #494
- Rolling version to 1.3.0 by @nick-stroud in #497
- Release v1.3.0 by @heyealex in #496
Full Changelog: v1.2.1...v1.3.0
v1.2.1: Improved startup time when NFS mounting, Slurm V5 zone preferences, testing improvements, & bug fixes
Key New Features
- schedmd-slurm-gcp-v5-partition: Allows setting preferential and fully excluded zones.
Improvements
- NFS client installation time on instance startup reduced by 96%.
- Cloud Batch integration testing and other integration testing improvements.
Version updates
- github.com/daos-stack/google-cloud-daos: from 0.2.0 to 0.2.1
- github.com/SchedMD/slurm-gcp: from 5.0.2 to 5.0.3
What's Changed
- Bump github.com/spf13/afero from 1.9.0 to 1.9.2 by @dependabot in #429
- Bump patch release to 1.1.2 by @tpdownes in #430
- fix errors when missing deployment_name by @kkr16 in #428
- Update Intel DAOS community examples to use google-cloud-daos v0.2.1 by @markaolson in #427
- Add Cloud Batch job submission to integration test by @nick-stroud in #431
- Add check for startup script failure, monitoring by @heyealex in #432
- Update Batch list instructions now that Batch response is brief by @nick-stroud in #434
- Add hello world integration tests to demonstrate interaction between test files by @nick-stroud in #433
- Update Batch integration test to run in series by @nick-stroud in #436
- Rename spack post deploy test to match other post deploy tests by @nick-stroud in #440
- Enable ansible lint pre-commit hook by @nick-stroud in #435
- Make Packer test more reliable by @nick-stroud in #442
- Add zone policy variables to slurm partition by @heyealex in #438
- Fix ansible-lint errors in spack test by @nick-stroud in #443
- Breakout startup wait to new file & update Batch test by @nick-stroud in #437
- Add test-mount to Batch integration & move variables into custom_vars by @nick-stroud in #444
- Update develop to version 1.2.0 to keep in sync with main by @heyealex in #453
- Bring develop up to date with main by @heyealex in #452
- Update slurm-gcp modules to v5.0.3 by @heyealex in #449
- Remove deprecated interpolation-only expression from nfs-server output by @heyealex in #457
- Decrease overhead of nfs client package installation by @heyealex in #454
- Filter for deployment name in TCP connections widget by @heyealex in #456
- Update version to 1.3.0 by @heyealex in #458
- Merge changes from main into release branch by @heyealex in #462
- Fixing formatting in go files to pass weekly build. by @cboneti in #465
- Roll back release version patch by @nick-stroud in #468
- Release v1.2.1 by @nick-stroud in #469
Full Changelog: v1.2.0...v1.2.1
v1.2.0: HTCondor autoscaling, explicitly defined IP ranges in VPC module
Key New Features
- Autoscaling in HTCondor.
- Explicitly defined IP ranges in the VPC module.
New Resources
- htcondor-execute-point: Creates an instance template and Managed Instance Group (MIG) for creating autoscaled compute nodes. Outputs a runner for configuring the autoscaler to scale the MIG.
Improvements
- Allow explicitly defined IP ranges in the VPC module.
- The wait-for-startup module now waits for startup script completion when VMs are replaced.
- Add autoscaler to HTCondor modules.
- Docker support for HTCondor nodes.
- HTCondor Pool example added to community examples.
- HTCondor tutorial added.
Deprecations
- The following variables in the VPC module are deprecated: `primary_subnetwork`, `additional_subnetworks`, `subnetwork_size`. See the VPC README for more information.
What's Changed
- Add strict positional arguments checking to the create and expand by @danielahlin in #391
- Fix link to login node in modules README by @heyealex in #407
- Allow explicitly-defined IP ranges in VPCs by @tpdownes in #392
- Match VPC README note to Toolkit style by @tpdownes in #409
- Bring develop up to date with main by @nick-stroud in #405
- Update HTCondor installation module by @tpdownes in #412
- Reduce HTCondor SchedD update interval by @tpdownes in #408
- Patch: Fix link to login module in modules/README.md by @heyealex in #410
- Bump gopkg.in/yaml.v3 from 3.0.0 to 3.0.1 by @dependabot in #399
- Add always wait option to wait-for-startup module by @nick-stroud in #390
- Bump github.com/zclconf/go-cty from 1.9.1 to 1.10.0 by @dependabot in #400
- Add existing HTCondor autoscaler to repo by @tpdownes in #413
- Bump github.com/hashicorp/go-getter from 1.6.1 to 1.6.2 by @dependabot in #414
- Bump github.com/spf13/afero from 1.6.0 to 1.8.2 by @dependabot in #403
- Bump github.com/spf13/cobra from 1.2.1 to 1.5.0 by @dependabot in #401
- Bump cloud.google.com/go/compute from 1.5.0 to 1.7.0 by @dependabot in #402
- Bump github.com/hashicorp/hcl/v2 from 2.10.1 to 2.13.0 by @dependabot in #415
- eliminate duplicate git clone for firewall module by @kkr16 in #411
- Bump github.com/otiai10/copy from 1.6.0 to 1.7.0 by @dependabot in #416
- Install HTCondor autoscaler into filesystem and fix node deletion bug by @tpdownes in #417
- Support HTCondor execute points by @tpdownes in #418
- Print instance ID and information in daily tests by @heyealex in #420
- Enhance HTCondor pool support by @tpdownes in #421
- Improve ansible installation reliability by @tpdownes in #406
- Add public example for HTCondor Pool by @tpdownes in #419
- Bump github.com/spf13/afero from 1.8.2 to 1.9.0 by @dependabot in #422
- Ignore threads_per_core for unsupported machine types in vm-instance by @kkr16 in #382
- Add basic Cloud Batch integration test by @nick-stroud in #423
- Update pre-commit hooks by @tpdownes in #424
- Update to version 1.1.1 by @heyealex in #425
- Add HTCondor tutorial by @tpdownes in #426
- Updating DNN community module to Cloud 6.0.1 by @tpdownes in #450
- Release v1.2.0 by @heyealex in #448
New Contributors
- @danielahlin made their first contribution in #391
Full Changelog: v1.1.0...v1.2.0
v1.1.0: Google Cloud Batch, Slurm V5, Jumbo Frames, and Advanced Networking in Slurm V4
Key New Features
- Google Cloud Batch support.
- Slurm V5 support & example blueprint.
- Slurm V4 partitions now support advanced networking features such as gVNIC adapters and high egress (Tier 1) bandwidth.
- Slurm V4 partitions now support placement groups for all Compute Engine machine families that support them (A2, C2, C2D, N2, N2D).
- VPC module supports jumbo frames for higher bandwidth and lower latency performance.
New Resources
- schedmd-slurm-gcp-v5-partition: Creates a partition to be used by a slurm-controller.
- schedmd-slurm-gcp-v5-controller: Creates a Slurm controller node using slurm-gcp.
- schedmd-slurm-gcp-v5-login: Creates a Slurm login node using slurm-gcp.
- cloud-batch-job: Creates a Google Cloud Batch job template that works with other Toolkit modules.
- cloud-batch-login-node: Creates a VM that can be used for submission of Google Cloud Batch jobs.
- htcondor-configure: Creates Toolkit runners and service accounts to configure an HTCondor pool.
- htcondor-install: Creates a startup script to install HTCondor and exports a list of required APIs.
Version updates
- github.com/hashicorp/go-getter: from 1.5.11 to 1.6.1
- github.com/SchedMD/slurm-gcp//tf/modules/controller/: from 4.1.8 to 4.2
What's Changed
- Add external IP output to vm-instance module by @tpdownes in #353
- Default to not disabling services upon destroy by @tpdownes in #351
- Support extra args for ansible playbooks by @tpdownes in #352
- Bump github.com/hashicorp/go-getter from 1.5.11 to 1.6.1 by @tpdownes in #350
- Create dependabot configuration file by @tpdownes in #354
- Add support for Slurm to use the `startup_script` module by @nick-stroud in #349
- Adopt Slurm v4.2.0 module by @tpdownes in #356
- Upgrade to yaml.v3 by @nick-stroud in #347
- Improve Packer module by @tpdownes in #355
- Update VPC module to support setting MTU by @tpdownes in #363
- Add HTCondor Install module by @tpdownes in #359
- Add HTCondor Configure module by @tpdownes in #360
- Reliably detect when nodes fail to be scaled in by @tpdownes in #364
- Fix rare failure modes of monitoring test by @tpdownes in #366
- Improve detection of Slurm startup by @tpdownes in #367
- Install compatible protobuf for older Python by @tpdownes in #370
- Add security setting for go-getter by @mittz in #371
- Add headers to quota sections in README for linking by @nick-stroud in #369
- Add HTCondor Pool blueprint (experimental) by @tpdownes in #361
- Improve Slurm partition module documentation by @tpdownes in #372
- Adopt Google Private Access by default by @tpdownes in #373
- Add integration test for HTCondor by @tpdownes in #362
- Patch omnia-install to continue working with 1.0 by @heyealex in #374
- Update spack resource environments and flags by @douglasjacobsen in #346
- Add variable for slurm UID in omnia-install by @heyealex in #375
- Add provider_meta to htcondor-configure module by @tpdownes in #379
- Extend periodic cleanup to reset Filestore API by @tpdownes in #380
- Add slurm-gcp v5 controller module by @heyealex in #378
- Fix Cloud Build Filestore cleanup by @tpdownes in #383
- Update minimum Packer release by @tpdownes in #384
- Modules/slurm gcp v5 partition by @heyealex in #381
- fix: install_nfs_client_runner was using 'content' instead of 'source' by @nick-stroud in #387
- Maintenance of VPC module by @tpdownes in #386
- Address bug in Shared VPC Filestore blueprint by @tpdownes in #389
- Add slurm-gcp v5 login node module by @heyealex in #388
- Add support for Cloud Batch by @nick-stroud in #394
- Rename documentation to reference Google Cloud Batch by @nick-stroud in #397
- Add community example using slurm-gcp v5 modules by @heyealex in #393
- Update to version v1.1.0 by @nick-stroud in #398
- Release v1.1.0 by @nick-stroud in #396
Full Changelog: v1.0.0...v1.1.0
v1.0.0: General Availability
Key New Features
- Support for DAOS
- Shared VPC example
- Doc updates
Version updates
- Slurm partition, controller, and login: v4.1.8
What's Changed
- Update indentation to allow numbered lists by @heyealex in #306
- Update community module documentation by @heyealex in #313
- Changes to SSH key metadata should not trigger action by Terraform by @tpdownes in #316
- Add example for custom firewall rule applied to SSH bastion by @tpdownes in #314
- DAOS examples by @cboneti in #288
- Change global variables to deployment variables by @heyealex in #317
- Add more usage information to cmd README by @heyealex in #311
- Adopt new release of Slurm GCP by @tpdownes in #291
- Edits to examples/README by @heyealex in #318
- Set enable-oslogin: TRUE by default in VM instance module by @mittz in #319
- Updates to the main README file by @heyealex in #322
- Minor update to enable_oslogin error text by @tpdownes in #323
- Fix a typo in spack buildcache flag by @douglasjacobsen in #325
- Update the top level modules README by @heyealex in #321
- Update image building examples to use latest Slurm image family by @tpdownes in #326
- Update documentation for core modules by @heyealex in #320
- README doc review for root, cmd, and examples by @nick-stroud in #327
- Fix examples README TOC links by @nick-stroud in #328
- De-indent code blocks in intel examples readme by @heyealex in #331
- Update intel select blueprint schema by @nick-stroud in #329
- Improve example documentation by @tpdownes in #335
- Improve documentation for GitHub modules by @tpdownes in #334
- Update DAOS examples to point to google-cloud-daos v0.2.0 by @markaolson in #332
- Update find and install commands to be BSD compatible by @nick-stroud in #336
- Update version to 0.7.3 by @heyealex in #338
- Minor documentation fixes in modules/... by @nick-stroud in #337
- Update license year to 2022 by @nick-stroud in #340
- Version 0.7.3 by @heyealex in #341
- Filestore connect mode and Shared VPC example by @tpdownes in #330
- Improve error output when validation fails by @mittz in #333
- Link to Google Cloud Docs and add Open in Cloud Shell by @nick-stroud in #342
- Update version to 1.0.0 by @nick-stroud in #343
- Version 1.0.0 by @nick-stroud in #344
New Contributors
- @markaolson made their first contribution in #332
Full Changelog: v0.7.2-alpha...v1.0.0
v0.7.2-alpha: New features in `vm-instance`, updated documentation
Key New Features
- Spot provisioning and `threads_per_core` support in VM Instance module
- Updated and improved documentation
Resource Improvements
- vm-instance: Spot provisioning support
- vm-instance: Option to set `threads_per_core` to enable or disable Simultaneous Multithreading (SMT)
- vpc: Better support for supplying custom primary subnetwork
- vpc: Better dependency tracking
- startup-scripts: Better dependency tracking
Improvements
- Updated Documentation, improvements to navigation in large README files
- `make install` and `make install-user` for installing the binary globally or locally.
- Issue template added for reporting bugs in the HPC Toolkit
Bug Fixes
- Fixed: Terraform state doesn't update when overwriting a blueprint
What's Changed
- Support Spot provisioning in VM instance module by @tpdownes in #283
- Enable VPC module to accept subnetwork_name input variable by @tpdownes in #285
- Add threads-per-node option for vm-instance by @heyealex in #290
- Reduce 'suspend_time' in example to minimize destroy leaving behind compute nodes by @nick-stroud in #292
- Fix: terraform.tfstate.backup was written to terraform.tfstate during overwrite by @nick-stroud in #294
- Add `make install` option for root and user by @heyealex in #293
- Update quota documentation to match new defaults for filestore module by @nick-stroud in #297
- Add implicit dependencies in startup-scripts by @tpdownes in #298
- Add explicit dependencies in VPC module by @tpdownes in #295
- Add issue template by @heyealex in #300
- Added a TOC to examples/README, re-sorted examples by @cboneti in #301
- Update PD quota to match current example config by @nick-stroud in #302
- Update Intel Select tutorial to use new schema by @nick-stroud in #303
- Fix make tests by @tpdownes in #304
- Update name of previous resource groups folder to match new schema by @nick-stroud in #305
- Add provider_meta blocks by @nick-stroud in #309
- Update cmd README by @heyealex in #307
- Update to version 0.7.2-alpha by @heyealex in #310
- Release 0.7.2-alpha by @heyealex in #312
Full Changelog: v0.7.1-alpha...v0.7.2-alpha
v0.7.1-alpha: Documentation Additions, Updated Defaults, Bug Fixes, and Intel Select Example
Pre-release
Key New Features
- Improved documentation.
- Improved defaults on Filestore and Slurm.
- Additional modules allow specifying `project_id` independently from the global `project_id`.
- Spack install dir updated to avoid conflict with Slurm.
- Internal schema rename to match changes released in 0.7.0-alpha.
New Examples
What's Changed
- Documentation fixes by @tpdownes in #267
- Set default filestore size to lowest possible by @tpdownes in #268
- Rename internal schema data structures and variable names by @heyealex in #258
- Update modules to accept project_id as variable by @mittz in #272
- Update docs and defaults for spack install dir by @heyealex in #270
- Add troubleshooting tip for compute SA permissions by @heyealex in #271
- Update writer to run in group order by @heyealex in #274
- Update DDN EXAscaler naming and tags by @heyealex in #277
- Update slurm defaults to match recommendations by @heyealex in #275
- Lower filestore default tier to Basic HDD by @tpdownes in #276
- Add instructions for installing ansible in runners by @heyealex in #279
- Add Intel blueprints and Slurm job by @fertinaz-intel in #249
- Add links to community examples, document badges by @heyealex in #278
- Point to tutorials for quickstart in README by @heyealex in #273
- Intel blueprint updates by @tpdownes in #280
- Fix name of VM created by Intel Select Solution example by @tpdownes in #281
- Add support documentation to community modules by @heyealex in #282
- Refactor/pkg name update by @heyealex in #269
- Update to version 0.7.1-alpha by @nick-stroud in #287
- Partial Revert "Point to tutorials for quickstart in README" by @nick-stroud in #289
- Release 0.7.1-alpha by @nick-stroud in #286
New Contributors
- @fertinaz-intel made their first contribution in #249
Full Changelog: v0.7.0-alpha...v0.7.1-alpha
v0.7.0-alpha: Updated schema and component names, added community folder, new command line options
Pre-release
Key New Features
- Updated HPC Toolkit naming and schema with significant interface changes (read more below)
- Moved community contributions to community folder
- Overwrite flag (-w) optionally overwrites existing deployment folder while maintaining terraform state
- Terraform backend can be configured from the command line (--backend-config)
- Recognition of the output of ghpc as a deployment, rather than blueprint: `ghpc create` now creates a folder with `deployment_name` instead of `blueprint_name`
Naming changes
- Config YAML or Input YAML is now referred to as the HPC Blueprint
- Resource Groups are now Deployment Groups
- Blueprint Folder is now Deployment Folder
- Resources are now HPC Modules
- simple-instance is now vm-instance - Underlying module is the same
Blueprint YAML Schema Update
- `vars.deployment_name` is used by `ghpc` for creating the deployment folder name, rather than `blueprint_name`
- `resource_groups` is now `deployment_groups`
- `resources` is now `modules`, and modules are stored in `modules/` and `community/modules/`
- Sourcing embedded modules starts with `modules` or `community/modules`
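The renames listed above are mechanical enough to sketch in code. The following Python snippet is only an illustration and is not part of `ghpc`: the function names `migrate_blueprint` and `deployment_folder_name` are hypothetical, the blueprint is assumed to be already parsed into a plain dictionary, and every `resources/` source is mapped to `modules/` even though some modules moved to `community/modules/` and would need to be adjusted by hand. The module path used in the usage lines is likewise illustrative.

```python
import copy


def migrate_blueprint(old):
    """Return a copy of a pre-0.7.0 blueprint dict using the renamed keys.

    Assumption: every embedded source moves from resources/ to modules/.
    Modules that moved to community/modules/ must be fixed up manually.
    """
    new = copy.deepcopy(old)

    # resource_groups is now deployment_groups
    if "resource_groups" in new:
        new["deployment_groups"] = new.pop("resource_groups")

    for group in new.get("deployment_groups", []):
        # resources is now modules
        if "resources" in group:
            group["modules"] = group.pop("resources")

        for module in group.get("modules", []):
            if "source" not in module:
                continue
            source = module["source"]
            # Embedded sources now start with modules (or community/modules)
            if source.startswith("resources/"):
                source = "modules/" + source[len("resources/"):]
            # simple-instance is now vm-instance
            parts = source.split("/")
            if parts[-1] == "simple-instance":
                parts[-1] = "vm-instance"
            module["source"] = "/".join(parts)

    return new


def deployment_folder_name(blueprint):
    """The deployment folder is named after vars.deployment_name."""
    return blueprint["vars"]["deployment_name"]


# Usage with an illustrative (hypothetical) blueprint
old_blueprint = {
    "blueprint_name": "demo",
    "vars": {"deployment_name": "demo-01"},
    "resource_groups": [
        {
            "group": "primary",
            "resources": [
                {"source": "resources/compute/simple-instance", "id": "vm"}
            ],
        }
    ],
}
new_blueprint = migrate_blueprint(old_blueprint)
print(new_blueprint["deployment_groups"][0]["modules"][0]["source"])
print(deployment_folder_name(new_blueprint))
```

The sketch copies its input rather than editing it in place, so the original dictionary can still be compared against the migrated one when reviewing the changes.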
Example:
    deployment_groups: # Was resource_groups:
      modules: # Was resources:
        - source: modules/... # Was `- source: resources/...`
Improvements
- Addition of "Community" folder
- Overwrite option (`-w`) for creating a deployment in the same directory, retaining the terraform state and keeping a backup of one prior deployment.
- Improved instructions for deploying after create
- Support for startup-script with Packer resource
- Command Line Flag for specifying terraform state backend config (`--backend-config`)
- More reliable project ID validation
What's Changed
- Cleanup prior to create update by @nick-stroud in #219
- Restore tfstate on create overwrite by @nick-stroud in #220
- Overwrite logic for create by @nick-stroud in #221
- Improve Packer template by @tpdownes in #224
- Improve Terraform instructions to user by @tpdownes in #225
- Add overwrite-blueprint argument to create command by @nick-stroud in #222
- Improve formatting of overwrite error by @nick-stroud in #227
- Improve functionality and documentation of Packer resource by @tpdownes in #228
- Create standard gitignore file in blueprint directory by @mittz in #223
- Create flag for specifying backend config by @mittz in #232
- Add section on how to see billing reports and fix typo by @mittz in #229
- Basic Cloud Shell Tutorial by @nick-stroud in #226
- Cloud Shell Tutorial - Merge to Develop by @nick-stroud in #233
- Update CLI usage instructions to use new naming convention by @nick-stroud in #236
- Improve instructions to use GitHub client in Google Cloud Shell by @tpdownes in #235
- Update all user facing references to resources by @heyealex in #237
- Update integration tests to use 'deployment_name' by @nick-stroud in #242
- Adding tutorial for Intel Select Solutions by @cboneti in #246
- Reimplement TestProjectExists with Compute Engine API by @mittz in #247
- Community Directory Reorg by @nick-stroud in #241
- Update flat list of modules by @nick-stroud in #239
- Update schema to deployment_groups and modules by @heyealex in #243
- Revert pre-commit PR validation to sequential exec by @heyealex in #251
- Update terminology for blueprint file and deployment directory by @nick-stroud in #245
- Merge tutorial from main to develop by @nick-stroud in #250
- Change name of `simple-instance` to `vm-instance` by @heyealex in #252
- Standardize Slurm image variables by @nick-stroud in #253
- Update Packer documentation in main README by @tpdownes in #255
- Image building example for Slurm cluster by @nick-stroud in #254
- Update outdated reference to "simple instance" by @heyealex in #257
- Update create_blueprint.sh to create_deployment.sh by @heyealex in #256
- Error message for schema changes in v0.7.0a by @cboneti in #263
- Embed community modules by @heyealex in #264
- Add link to Lustre documentation in module readme by @mittz in #265
- Revert builder to not split pre-commit hooks by @heyealex in #261
- Update to version 0.7.0-alpha by @heyealex in #266
- Release 0.7.0-alpha by @heyealex in #259
Full Changelog: v0.6.0-alpha...v0.7.0-alpha