Commit f4ed7c1

Merge pull request #649 from GoogleCloudPlatform/release-candidate
Release v1.7.0
2 parents 54270c1 + 0682ebc commit f4ed7c1

100 files changed (+2647, -1005 lines)

Makefile

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,7 +1,7 @@
 # PREAMBLE
 MIN_PACKER_VERSION=1.6 # for building images
 MIN_TERRAFORM_VERSION=1.0 # for deploying modules
-MIN_GOLANG_VERSION=1.16 # for building ghpc
+MIN_GOLANG_VERSION=1.18 # for building ghpc
 
 .PHONY: install install-user tests format add-google-license install-dev-deps \
 	warn-go-missing warn-terraform-missing warn-packer-missing \
````
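
As an aside (not part of this commit), the minimum versions pinned in the Makefile preamble can be checked locally with commands along these lines; output formats vary by tool, so treat this as a rough sketch:

```shell
# Rough sanity check against the Makefile's minimum versions.
go version         # expect go1.18 or newer after this change
terraform version  # expect Terraform v1.0 or newer
packer --version   # expect Packer 1.6 or newer
```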

README.md

Lines changed: 19 additions & 249 deletions
````diff
@@ -101,71 +101,26 @@ Be aware that Cloud Shell has [several limitations][cloud-shell-limitations],
 in particular an inactivity timeout that will close running shells after 20
 minutes. Please consider it only for blueprints that are quickly deployed.
 
-## Blueprint Warnings and Errors
-
-By default, each blueprint is configured with a number of "validator" functions
-which perform basic tests of your deployment variables. If `project_id`,
-`region`, and `zone` are defined as deployment variables, then the following
-validators are enabled:
-
-```yaml
-validators:
-- validator: test_project_exists
-  inputs:
-    project_id: $(vars.project_id)
-- validator: test_region_exists
-  inputs:
-    project_id: $(vars.project_id)
-    region: $(vars.region)
-- validator: test_zone_exists
-  inputs:
-    project_id: $(vars.project_id)
-    zone: $(vars.zone)
-- validator: test_zone_in_region
-  inputs:
-    project_id: $(vars.project_id)
-    zone: $(vars.zone)
-    region: $(vars.region)
-```
-
-This configures validators that check the validity of the project ID, region,
-and zone. Additionally, it checks that the zone is in the region. Validators can
-be overwritten, however they are limited to the set of functions defined above.
-
-Validators can be explicitly set to the empty list:
+## VM Image Support
 
-```yaml
-validators: []
-```
+The HPC Toolkit officially supports the following VM images:
 
-They can also be set to 3 differing levels of behavior using the command-line
-`--validation-level` flag` for the `create` and `expand` commands:
+* HPC CentOS 7
+* Ubuntu 20.04 LTS
 
-* `"ERROR"`: If any validator fails, the deployment directory will not be
-  written. Error messages will be printed to the screen that indicate which
-  validator(s) failed and how.
-* `"WARNING"` (default): The deployment directory will be written even if any
-  validators fail. Warning messages will be printed to the screen that indicate
-  which validator(s) failed and how.
-* `"IGNORE"`: Do not execute any validators, even if they are explicitly defined
-  in a `validators` block or the default set is implicitly added.
+For more information on these and other images, see
+[docs/vm-images.md](docs/vm-images.md).
 
-For example, this command will set all validators to `WARNING` behavior:
+## Blueprint Validation
 
-```shell
-./ghpc create --validation-level WARNING examples/hpc-cluster-small.yaml
-```
-
-The flag can be shortened to `-l` as shown below using `IGNORE` to disable all
-validators.
-
-```shell
-./ghpc create -l IGNORE examples/hpc-cluster-small.yaml
-```
+The Toolkit contains "validator" functions that perform basic tests of the
+blueprint to ensure that deployment variables are valid and that the HPC
+environment can be provisioned in your Google Cloud project. Further information
+can be found in [dedicated documentation](docs/blueprint-validation.md).
 
 ## Enable GCP APIs
 
-In a new GCP project there are several apis that must be enabled to deploy your
+In a new GCP project there are several APIs that must be enabled to deploy your
 HPC cluster. These will be caught when you perform `terraform apply` but you can
 save time by enabling them upfront.
 
@@ -204,194 +159,9 @@ In the right side, expand the Filters view and then filter by label, specifying
 
 ## Troubleshooting
 
-### Network is unreachable (Slurm V5)
-
-Slurm requires access to google APIs to function. This can be achieved through one of the following methods:
-
-1. Create a [Cloud NAT](https://cloud.google.com/nat) (preferred).
-2. Setting `disable_controller_public_ips: false` &
-   `disable_login_public_ips: false` on the controller and login nodes
-   respectively.
-3. Enable
-   [private access to Google APIs](https://cloud.google.com/vpc/docs/private-access-options).
-
-By default the Toolkit VPC module will create an associated Cloud NAT so this is
-typically seen when working with the pre-existing-vpc module. If no access
-exists you will see the following errors:
-
-When you ssh into the login node or controller you will see the following
-message:
-
-```text
-*** Slurm setup failed! Please view log: /slurm/scripts/setup.log ***
-```
-
-> **_NOTE:_**: Many different potential issues could be indicated by the above
-> message, so be sure to verify issue in logs.
-
-To confirm the issue, ssh onto the controller and call `sudo cat /slurm/scripts/setup.log`. Look for
-the following logs:
-
-```text
-google_metadata_script_runner: startup-script: ERROR: [Errno 101] Network is unreachable
-google_metadata_script_runner: startup-script: OSError: [Errno 101] Network is unreachable
-google_metadata_script_runner: startup-script: ERROR: Aborting setup...
-google_metadata_script_runner: startup-script exit status 0
-google_metadata_script_runner: Finished running startup scripts.
-```
-
-You may also notice mount failure logs on the login node:
-
-```text
-INFO: Waiting for '/usr/local/etc/slurm' to be mounted...
-INFO: Waiting for '/home' to be mounted...
-INFO: Waiting for '/opt/apps' to be mounted...
-INFO: Waiting for '/etc/munge' to be mounted...
-ERROR: mount of path '/usr/local/etc/slurm' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/usr/local/etc/slurm']' returned non-zero exit status 32.
-ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
-ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
-ERROR: mount of path '/etc/munge' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/etc/munge']' returned non-zero exit status 32.
-```
-
-> **_NOTE:_**: The above logs only indicate that something went wrong with the
-> startup of the controller. Check logs on the controller to be sure it is a
-> network issue.
-
-### Failure to Create Auto Scale Nodes (Slurm)
-
-If your deployment succeeds but your jobs fail with the following error:
-
-```shell
-$ srun -N 6 -p compute hostname
-srun: PrologSlurmctld failed, job killed
-srun: Force Terminated job 2
-srun: error: Job allocation 2 has been revoked
-```
-
-Possible causes could be [insufficient quota](#insufficient-quota) or
-[placement groups](#placement-groups). Also see the
-[Slurm user guide](https://docs.google.com/document/u/1/d/e/2PACX-1vS0I0IcgVvby98Rdo91nUjd7E9u83oIMCM4arne-9_IdBg6BdV1lBpUcSje_PyHcbAaErC1rY7p4u1g/pub).
-
-#### Insufficient Quota
-
-It may be that you have sufficient quota to deploy your cluster but insufficient
-quota to bring up the compute nodes.
-
-You can confirm this by SSHing into the `controller` VM and checking the
-`resume.log` file:
-
-```shell
-$ cat /var/log/slurm/resume.log
-...
-resume.py ERROR: ... "Quota 'C2_CPUS' exceeded. Limit: 300.0 in region europe-west4.". Details: "[{'message': "Quota 'C2_CPUS' exceeded. Limit: 300.0 in region europe-west4.", 'domain': 'usageLimits', 'reason': 'quotaExceeded'}]">
-```
-
-The solution here is to [request more of the specified quota](#gcp-quotas),
-`C2 CPUs` in the example above. Alternatively, you could switch the partition's
-[machine type][partition-machine-type], to one which has sufficient quota.
-
-[partition-machine-type]: community/modules/compute/SchedMD-slurm-on-gcp-partition/README.md#input_machine_type
-
-#### Placement Groups (Slurm)
-
-By default, placement groups (also called affinity groups) are enabled on the
-compute partition. This places VMs close to each other to achieve lower network
-latency. If it is not possible to provide the requested number of VMs in the
-same placement group, the job may fail to run.
-
-Again, you can confirm this by SSHing into the `controller` VM and checking the
-`resume.log` file:
-
-```shell
-$ cat /var/log/slurm/resume.log
-...
-resume.py ERROR: group operation failed: Requested minimum count of 6 VMs could not be created.
-```
-
-One way to resolve this is to set [enable_placement][partition-enable-placement]
-to `false` on the partition in question.
-
-[partition-enable-placement]: https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/community/modules/compute/SchedMD-slurm-on-gcp-partition#input_enable_placement
-
-#### VMs Get Stuck in Status Staging When Using Placement Groups With vm-instance
-
-If VMs get stuck in `status: staging` when using the `vm-instance` module with
-placement enabled, it may be because you need to allow terraform to make more
-concurrent requests. See
-[this note](modules/compute/vm-instance/README.md#placement) in the vm-instance
-README.
-
-#### Insufficient Service Account Permissions
-
-By default, the slurm controller, login and compute nodes use the
-[Google Compute Engine Service Account (GCE SA)][def-compute-sa]. If this
-service account or a custom SA used by the Slurm modules does not have
-sufficient permissions, configuring the controller or running a job in Slurm may
-fail.
-
-If configuration of the Slurm controller fails, the error can be
-seen by viewing the startup script on the controller:
-
-```shell
-sudo journalctl -u google-startup-scripts.service | less
-```
-
-An error similar to the following indicates missing permissions for the serivce
-account:
-
-```shell
-Required 'compute.machineTypes.get' permission for ...
-```
-
-To solve this error, ensure your service account has the
-`compute.instanceAdmin.v1` IAM role:
-
-```shell
-SA_ADDRESS=<SET SERVICE ACCOUNT ADDRESS HERE>
-
-gcloud projects add-iam-policy-binding ${PROJECT_ID} \
-  --member=serviceAccount:${SA_ADDRESS} --role=roles/compute.instanceAdmin.v1
-```
-
-If Slurm failed to run a job, view the resume log on the controller instance
-with the following command:
-
-```shell
-sudo cat /var/log/slurm/resume.log
-```
-
-An error in `resume.log` simlar to the following indicates a permissions issue
-as well:
-
-```shell
-The user does not have access to service account 'PROJECT_NUMBER-compute@developer.gserviceaccount.com'. User: ''. Ask a project owner to grant you the iam.serviceAccountUser role on the service account": ['slurm-hpc-small-compute-0-0']
-```
-
-As indicated, the service account must have the compute.serviceAccountUser IAM
-role. This can be set with the following command:
-
-```shell
-SA_ADDRESS=<SET SERVICE ACCOUNT ADDRESS HERE>
-
-gcloud projects add-iam-policy-binding ${PROJECT_ID} \
-  --member=serviceAccount:${SA_ADDRESS} --role=roles/iam.serviceAccountUser
-```
-
-If the GCE SA is being used and cannot be updated, a new service account can be
-created and used with the correct permissions. Instructions for how to do this
-can be found in the [Slurm on Google Cloud User Guide][slurm-on-gcp-ug],
-specifically the section titled "Create Service Accounts".
-
-After creating the service account, it can be set via the
-`compute_node_service_account` and `controller_service_account` settings on the
-[slurm-on-gcp controller module][slurm-on-gcp-con] and the
-"login_service_account" setting on the
-[slurm-on-gcp login module][slurm-on-gcp-login].
+### Slurm Clusters
 
-[def-compute-sa]: https://cloud.google.com/compute/docs/access/service-accounts#default_service_account
-[slurm-on-gcp-ug]: https://goo.gle/slurm-gcp-user-guide
-[slurm-on-gcp-con]: community/modules/scheduler/SchedMD-slurm-on-gcp-controller/README.md
-[slurm-on-gcp-login]: community/modules/scheduler/SchedMD-slurm-on-gcp-login-node/README.md
+Please see the dedicated [troubleshooting guide for Slurm](docs/slurm-troubleshooting.md).
 
 ### Terraform Deployment
 
@@ -405,8 +175,8 @@ message. Here are some common reasons for the deployment to fail:
   [Enable GCP APIs](#enable-gcp-apis).
 * **Insufficient Quota:** The GCP project does not have enough quota to
   provision the requested resources. See [GCP Quotas](#gcp-quotas).
-* **Filestore resource limit:** When regularly deploying filestore instances
-  with a new vpc you may see an error during deployment such as:
+* **Filestore resource limit:** When regularly deploying Filestore instances
+  with a new VPC you may see an error during deployment such as:
   `System limit for internal resources has been reached`. See
   [this doc](https://cloud.google.com/filestore/docs/troubleshooting#system_limit_for_internal_resources_has_been_reached_error_when_creating_an_instance)
   for the solution.
@@ -437,7 +207,7 @@ network. These resources should be deleted manually. The first message indicates
 that a new VM has been added to a subnetwork within the VPC network. The second
 message indicates that a new firewall rule has been added to the VPC network.
 If your error message does not look like these, examine it carefully to identify
-the type of resouce to delete and its unique name. In the two messages above,
+the type of resource to delete and its unique name. In the two messages above,
 the resource names appear toward the end of the error message. The following
 links will take you directly to the areas within the Cloud Console for managing
 VMs and Firewall rules. Make certain that your project ID is selected in the
@@ -572,6 +342,6 @@ If developing on a mac, a workaround is to install GNU tooling by installing
 
 ### Contributing
 
-Please refer to the [contributing file](CONTRIBUTING.md) in our github repo, or
-to
+Please refer to the [contributing file](CONTRIBUTING.md) in our GitHub
+repository, or to
 [Google’s Open Source documentation](https://opensource.google/docs/releasing/template/CONTRIBUTING/#).
````
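
As context for the "Enable GCP APIs" section kept above (not part of this commit): enabling the APIs upfront can be done with the gcloud CLI roughly as sketched below. The exact set of services depends on the blueprint; the ones listed here are illustrative assumptions.

```shell
# Illustrative sketch: enable a few services commonly needed before
# `terraform apply`. Adjust the service list and project ID for your blueprint.
PROJECT_ID=<YOUR_PROJECT_ID>

gcloud services enable --project "${PROJECT_ID}" \
    compute.googleapis.com \
    file.googleapis.com \
    servicenetworking.googleapis.com
```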

cmd/README.md

Lines changed: 10 additions & 1 deletion
````diff
@@ -61,7 +61,16 @@ ghpc --version
 
 + `-l, --validation-level string`: sets validation level to one of ("ERROR", "WARNING", "IGNORE") (default "WARNING").
 
-+ `--vars strings`: comma-separated list of name=value variables to override YAML configuration. Can be used multiple times.
++ `--vars strings`: comma-separated list of name=value variables to override YAML configuration. Can be used multiple times. Arrays or maps containing comma-separated values must be enclosed in double quotes. The double quotes may require escaping depending on the shell used. Examples below have been tested using a `bash` shell:
+  + `--vars foo=bar,baz=2`
+  + `--vars bar=2 --vars baz=3.14`
+  + `--vars foo=true`
+  + `--vars "foo={bar: baz}"`
+  + `--vars "\"foo={bar: baz, qux: quux}\""`
+  + `--vars "\"foo={bar: baz}\"",\"b=[foo,3,3.14]\"`
+  + `--vars "\"a={foo: [bar, baz]}\"",\"b=[foo,3,3.14]\"`
+  + `--vars \"b=[foo,3,3.14]\"`
+  + `--vars \"b=[[foo,bar],3,3.14]\"`
 
 ### Example - create
 
````
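
To make the `--vars` examples concrete, a hypothetical end-to-end invocation might look like the sketch below; the blueprint path and variable values are illustrative, not taken from this commit.

```shell
# Illustrative sketch: override deployment variables at create time instead
# of editing the blueprint YAML. Values shown are placeholders.
./ghpc create examples/hpc-cluster-small.yaml \
    --vars project_id=my-project,region=us-central1 \
    --vars zone=us-central1-a
```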

cmd/root.go

Lines changed: 1 addition & 1 deletion
````diff
@@ -42,7 +42,7 @@ HPC deployments on the Google Cloud Platform.`,
 				log.Fatalf("cmd.Help function failed: %s", err)
 			}
 		},
-		Version:     "v1.6.0",
+		Version:     "v1.7.0",
 		Annotations: annotation,
 	}
 )
````

community/examples/AMD/README.md

Lines changed: 13 additions & 1 deletion
````diff
@@ -58,7 +58,19 @@ ghpc create --vars project_id=<<PROJECT_ID>> hpc-cluster-amd-slurmv5.yaml
 It will create a directory containing a Terraform module. Follow the printed
 instructions to execute Terraform.
 
-### Login to the cluster and complete installation of AOCC
+### Run an OpenFOAM test suite
+
+Browse to the [Cloud Console][console] and use the SSH feature to access the
+Slurm login node. A script has been provisioned which will activate your
+OpenFOAM environment and run a test suite of applications. The output of this
+test suite will appear in `openfoam_test` under your home directory. To execute
+the test suite, run:
+
+```shell
+bash /var/tmp/openfoam_test.sh
+```
+
+### Complete installation of AOCC
 
 Because AOCC requires acceptance of a license, we advise a manual step to
 install AOCC and OpenMPI compiled with AOCC. You can browse to the
````