Commit f4ed7c1

Merge pull request #649 from GoogleCloudPlatform/release-candidate
Release v1.7.0
2 parents 54270c1 + 0682ebc commit f4ed7c1

100 files changed (+2647, -1005 lines)

Makefile

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,7 +1,7 @@
 # PREAMBLE
 MIN_PACKER_VERSION=1.6 # for building images
 MIN_TERRAFORM_VERSION=1.0 # for deploying modules
-MIN_GOLANG_VERSION=1.16 # for building ghpc
+MIN_GOLANG_VERSION=1.18 # for building ghpc
 
 .PHONY: install install-user tests format add-google-license install-dev-deps \
 	warn-go-missing warn-terraform-missing warn-packer-missing \
````
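
As an aside (not part of this commit), the minimum versions pinned in the Makefile preamble can be checked locally with commands along these lines; output formats vary by tool, so treat this as a rough sketch:

```shell
# Rough sanity check against the Makefile's minimum versions.
go version         # expect go1.18 or newer after this change
terraform version  # expect Terraform v1.0 or newer
packer --version   # expect Packer 1.6 or newer
```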

README.md

Lines changed: 19 additions & 249 deletions
````diff
@@ -101,71 +101,26 @@ Be aware that Cloud Shell has [several limitations][cloud-shell-limitations],
 in particular an inactivity timeout that will close running shells after 20
 minutes. Please consider it only for blueprints that are quickly deployed.
 
-## Blueprint Warnings and Errors
-
-By default, each blueprint is configured with a number of "validator" functions
-which perform basic tests of your deployment variables. If `project_id`,
-`region`, and `zone` are defined as deployment variables, then the following
-validators are enabled:
-
-```yaml
-validators:
-- validator: test_project_exists
-  inputs:
-    project_id: $(vars.project_id)
-- validator: test_region_exists
-  inputs:
-    project_id: $(vars.project_id)
-    region: $(vars.region)
-- validator: test_zone_exists
-  inputs:
-    project_id: $(vars.project_id)
-    zone: $(vars.zone)
-- validator: test_zone_in_region
-  inputs:
-    project_id: $(vars.project_id)
-    zone: $(vars.zone)
-    region: $(vars.region)
-```
-
-This configures validators that check the validity of the project ID, region,
-and zone. Additionally, it checks that the zone is in the region. Validators can
-be overwritten, however they are limited to the set of functions defined above.
-
-Validators can be explicitly set to the empty list:
+## VM Image Support
 
-```yaml
-validators: []
-```
+The HPC Toolkit officially supports the following VM images:
 
-They can also be set to 3 differing levels of behavior using the command-line
-`--validation-level` flag` for the `create` and `expand` commands:
+* HPC CentOS 7
+* Ubuntu 20.04 LTS
 
-* `"ERROR"`: If any validator fails, the deployment directory will not be
-  written. Error messages will be printed to the screen that indicate which
-  validator(s) failed and how.
-* `"WARNING"` (default): The deployment directory will be written even if any
-  validators fail. Warning messages will be printed to the screen that indicate
-  which validator(s) failed and how.
-* `"IGNORE"`: Do not execute any validators, even if they are explicitly defined
-  in a `validators` block or the default set is implicitly added.
+For more information on these and other images, see
+[docs/vm-images.md](docs/vm-images.md).
 
-For example, this command will set all validators to `WARNING` behavior:
+## Blueprint Validation
 
-```shell
-./ghpc create --validation-level WARNING examples/hpc-cluster-small.yaml
-```
-
-The flag can be shortened to `-l` as shown below using `IGNORE` to disable all
-validators.
-
-```shell
-./ghpc create -l IGNORE examples/hpc-cluster-small.yaml
-```
+The Toolkit contains "validator" functions that perform basic tests of the
+blueprint to ensure that deployment variables are valid and that the HPC
+environment can be provisioned in your Google Cloud project. Further information
+can be found in [dedicated documentation](docs/blueprint-validation.md).
 
 ## Enable GCP APIs
 
-In a new GCP project there are several apis that must be enabled to deploy your
+In a new GCP project there are several APIs that must be enabled to deploy your
 HPC cluster. These will be caught when you perform `terraform apply` but you can
 save time by enabling them upfront.
 
@@ -204,194 +159,9 @@ In the right side, expand the Filters view and then filter by label, specifying
 
 ## Troubleshooting
 
-### Network is unreachable (Slurm V5)
-
-Slurm requires access to google APIs to function. This can be achieved through one of the following methods:
-
-1. Create a [Cloud NAT](https://cloud.google.com/nat) (preferred).
-2. Setting `disable_controller_public_ips: false` &
-   `disable_login_public_ips: false` on the controller and login nodes
-   respectively.
-3. Enable
-   [private access to Google APIs](https://cloud.google.com/vpc/docs/private-access-options).
-
-By default the Toolkit VPC module will create an associated Cloud NAT so this is
-typically seen when working with the pre-existing-vpc module. If no access
-exists you will see the following errors:
-
-When you ssh into the login node or controller you will see the following
-message:
-
-```text
-*** Slurm setup failed! Please view log: /slurm/scripts/setup.log ***
-```
-
-> **_NOTE:_**: Many different potential issues could be indicated by the above
-> message, so be sure to verify issue in logs.
-
-To confirm the issue, ssh onto the controller and call `sudo cat /slurm/scripts/setup.log`. Look for
-the following logs:
-
-```text
-google_metadata_script_runner: startup-script: ERROR: [Errno 101] Network is unreachable
-google_metadata_script_runner: startup-script: OSError: [Errno 101] Network is unreachable
-google_metadata_script_runner: startup-script: ERROR: Aborting setup...
-google_metadata_script_runner: startup-script exit status 0
-google_metadata_script_runner: Finished running startup scripts.
-```
-
-You may also notice mount failure logs on the login node:
-
-```text
-INFO: Waiting for '/usr/local/etc/slurm' to be mounted...
-INFO: Waiting for '/home' to be mounted...
-INFO: Waiting for '/opt/apps' to be mounted...
-INFO: Waiting for '/etc/munge' to be mounted...
-ERROR: mount of path '/usr/local/etc/slurm' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/usr/local/etc/slurm']' returned non-zero exit status 32.
-ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
-ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
-ERROR: mount of path '/etc/munge' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/etc/munge']' returned non-zero exit status 32.
-```
-
-> **_NOTE:_**: The above logs only indicate that something went wrong with the
-> startup of the controller. Check logs on the controller to be sure it is a
-> network issue.
-
-### Failure to Create Auto Scale Nodes (Slurm)
-
-If your deployment succeeds but your jobs fail with the following error:
-
-```shell
-$ srun -N 6 -p compute hostname
-srun: PrologSlurmctld failed, job killed
-srun: Force Terminated job 2
-srun: error: Job allocation 2 has been revoked
-```
-
-Possible causes could be [insufficient quota](#insufficient-quota) or
-[placement groups](#placement-groups). Also see the
-[Slurm user guide](https://docs.google.com/document/u/1/d/e/2PACX-1vS0I0IcgVvby98Rdo91nUjd7E9u83oIMCM4arne-9_IdBg6BdV1lBpUcSje_PyHcbAaErC1rY7p4u1g/pub).
-
-#### Insufficient Quota
-
-It may be that you have sufficient quota to deploy your cluster but insufficient
-quota to bring up the compute nodes.
-
-You can confirm this by SSHing into the `controller` VM and checking the
-`resume.log` file:
-
-```shell
-$ cat /var/log/slurm/resume.log
-...
-resume.py ERROR: ... "Quota 'C2_CPUS' exceeded. Limit: 300.0 in region europe-west4.". Details: "[{'message': "Quota 'C2_CPUS' exceeded. Limit: 300.0 in region europe-west4.", 'domain': 'usageLimits', 'reason': 'quotaExceeded'}]">
-```
-
-The solution here is to [request more of the specified quota](#gcp-quotas),
-`C2 CPUs` in the example above. Alternatively, you could switch the partition's
-[machine type][partition-machine-type], to one which has sufficient quota.
-
-[partition-machine-type]: community/modules/compute/SchedMD-slurm-on-gcp-partition/README.md#input_machine_type
-
-#### Placement Groups (Slurm)
-
-By default, placement groups (also called affinity groups) are enabled on the
-compute partition. This places VMs close to each other to achieve lower network
-latency. If it is not possible to provide the requested number of VMs in the
-same placement group, the job may fail to run.
-
-Again, you can confirm this by SSHing into the `controller` VM and checking the
-`resume.log` file:
-
-```shell
-$ cat /var/log/slurm/resume.log
-...
-resume.py ERROR: group operation failed: Requested minimum count of 6 VMs could not be created.
-```
-
-One way to resolve this is to set [enable_placement][partition-enable-placement]
-to `false` on the partition in question.
-
-[partition-enable-placement]: https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/community/modules/compute/SchedMD-slurm-on-gcp-partition#input_enable_placement
-
-#### VMs Get Stuck in Status Staging When Using Placement Groups With vm-instance
-
-If VMs get stuck in `status: staging` when using the `vm-instance` module with
-placement enabled, it may be because you need to allow terraform to make more
-concurrent requests. See
-[this note](modules/compute/vm-instance/README.md#placement) in the vm-instance
-README.
-
-#### Insufficient Service Account Permissions
-
-By default, the slurm controller, login and compute nodes use the
-[Google Compute Engine Service Account (GCE SA)][def-compute-sa]. If this
-service account or a custom SA used by the Slurm modules does not have
-sufficient permissions, configuring the controller or running a job in Slurm may
-fail.
-
-If configuration of the Slurm controller fails, the error can be
-seen by viewing the startup script on the controller:
-
-```shell
-sudo journalctl -u google-startup-scripts.service | less
-```
-
-An error similar to the following indicates missing permissions for the serivce
-account:
-
-```shell
-Required 'compute.machineTypes.get' permission for ...
-```
-
-To solve this error, ensure your service account has the
-`compute.instanceAdmin.v1` IAM role:
-
-```shell
-SA_ADDRESS=<SET SERVICE ACCOUNT ADDRESS HERE>
-
-gcloud projects add-iam-policy-binding ${PROJECT_ID} \
-  --member=serviceAccount:${SA_ADDRESS} --role=roles/compute.instanceAdmin.v1
-```
-
-If Slurm failed to run a job, view the resume log on the controller instance
-with the following command:
-
-```shell
-sudo cat /var/log/slurm/resume.log
-```
-
-An error in `resume.log` simlar to the following indicates a permissions issue
-as well:
-
-```shell
-The user does not have access to service account 'PROJECT_NUMBER-compute@developer.gserviceaccount.com'. User: ''. Ask a project owner to grant you the iam.serviceAccountUser role on the service account": ['slurm-hpc-small-compute-0-0']
-```
-
-As indicated, the service account must have the compute.serviceAccountUser IAM
-role. This can be set with the following command:
-
-```shell
-SA_ADDRESS=<SET SERVICE ACCOUNT ADDRESS HERE>
-
-gcloud projects add-iam-policy-binding ${PROJECT_ID} \
-  --member=serviceAccount:${SA_ADDRESS} --role=roles/iam.serviceAccountUser
-```
-
-If the GCE SA is being used and cannot be updated, a new service account can be
-created and used with the correct permissions. Instructions for how to do this
-can be found in the [Slurm on Google Cloud User Guide][slurm-on-gcp-ug],
-specifically the section titled "Create Service Accounts".
-
-After creating the service account, it can be set via the
-`compute_node_service_account` and `controller_service_account` settings on the
-[slurm-on-gcp controller module][slurm-on-gcp-con] and the
-"login_service_account" setting on the
-[slurm-on-gcp login module][slurm-on-gcp-login].
+### Slurm Clusters
 
-[def-compute-sa]: https://cloud.google.com/compute/docs/access/service-accounts#default_service_account
-[slurm-on-gcp-ug]: https://goo.gle/slurm-gcp-user-guide
-[slurm-on-gcp-con]: community/modules/scheduler/SchedMD-slurm-on-gcp-controller/README.md
-[slurm-on-gcp-login]: community/modules/scheduler/SchedMD-slurm-on-gcp-login-node/README.md
+Please see the dedicated [troubleshooting guide for Slurm](docs/slurm-troubleshooting.md).
 
 ### Terraform Deployment
 
@@ -405,8 +175,8 @@ message. Here are some common reasons for the deployment to fail:
   [Enable GCP APIs](#enable-gcp-apis).
 * **Insufficient Quota:** The GCP project does not have enough quota to
   provision the requested resources. See [GCP Quotas](#gcp-quotas).
-* **Filestore resource limit:** When regularly deploying filestore instances
-  with a new vpc you may see an error during deployment such as:
+* **Filestore resource limit:** When regularly deploying Filestore instances
+  with a new VPC you may see an error during deployment such as:
   `System limit for internal resources has been reached`. See
   [this doc](https://cloud.google.com/filestore/docs/troubleshooting#system_limit_for_internal_resources_has_been_reached_error_when_creating_an_instance)
   for the solution.
@@ -437,7 +207,7 @@ network. These resources should be deleted manually. The first message indicates
 that a new VM has been added to a subnetwork within the VPC network. The second
 message indicates that a new firewall rule has been added to the VPC network.
 If your error message does not look like these, examine it carefully to identify
-the type of resouce to delete and its unique name. In the two messages above,
+the type of resource to delete and its unique name. In the two messages above,
 the resource names appear toward the end of the error message. The following
 links will take you directly to the areas within the Cloud Console for managing
 VMs and Firewall rules. Make certain that your project ID is selected in the
@@ -572,6 +342,6 @@ If developing on a mac, a workaround is to install GNU tooling by installing
 
 ### Contributing
 
-Please refer to the [contributing file](CONTRIBUTING.md) in our github repo, or
-to
+Please refer to the [contributing file](CONTRIBUTING.md) in our GitHub
+repository, or to
 [Google’s Open Source documentation](https://opensource.google/docs/releasing/template/CONTRIBUTING/#).
````
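
As context for the "Enable GCP APIs" section kept above (not part of this commit): enabling the APIs upfront can be done with the gcloud CLI roughly as sketched below. The exact set of services depends on the blueprint; the ones listed here are illustrative assumptions.

```shell
# Illustrative sketch: enable a few services commonly needed before
# `terraform apply`. Adjust the service list and project ID for your blueprint.
PROJECT_ID=<YOUR_PROJECT_ID>

gcloud services enable --project "${PROJECT_ID}" \
    compute.googleapis.com \
    file.googleapis.com \
    servicenetworking.googleapis.com
```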

cmd/README.md

Lines changed: 10 additions & 1 deletion
````diff
@@ -61,7 +61,16 @@ ghpc --version
 
 + `-l, --validation-level string`: sets validation level to one of ("ERROR", "WARNING", "IGNORE") (default "WARNING").
 
-+ `--vars strings`: comma-separated list of name=value variables to override YAML configuration. Can be used multiple times.
++ `--vars strings`: comma-separated list of name=value variables to override YAML configuration. Can be used multiple times. Arrays or maps containing comma-separated values must be enclosed in double quotes. The double quotes may require escaping depending on the shell used. Examples below have been tested using a `bash` shell:
+  + `--vars foo=bar,baz=2`
+  + `--vars bar=2 --vars baz=3.14`
+  + `--vars foo=true`
+  + `--vars "foo={bar: baz}"`
+  + `--vars "\"foo={bar: baz, qux: quux}\""`
+  + `--vars "\"foo={bar: baz}\"",\"b=[foo,3,3.14]\"`
+  + `--vars "\"a={foo: [bar, baz]}\"",\"b=[foo,3,3.14]\"`
+  + `--vars \"b=[foo,3,3.14]\"`
+  + `--vars \"b=[[foo,bar],3,3.14]\"`
 
 ### Example - create
 
````
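
To make the `--vars` examples concrete, a hypothetical end-to-end invocation might look like the sketch below; the blueprint path and variable values are illustrative, not taken from this commit.

```shell
# Illustrative sketch: override deployment variables at create time instead
# of editing the blueprint YAML. Values shown are placeholders.
./ghpc create examples/hpc-cluster-small.yaml \
    --vars project_id=my-project,region=us-central1 \
    --vars zone=us-central1-a
```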

cmd/root.go

Lines changed: 1 addition & 1 deletion
````diff
@@ -42,7 +42,7 @@ HPC deployments on the Google Cloud Platform.`,
 				log.Fatalf("cmd.Help function failed: %s", err)
 			}
 		},
-		Version:     "v1.6.0",
+		Version:     "v1.7.0",
 		Annotations: annotation,
 	}
 )
````

community/examples/AMD/README.md

Lines changed: 13 additions & 1 deletion
````diff
@@ -58,7 +58,19 @@ ghpc create --vars project_id=<<PROJECT_ID>> hpc-cluster-amd-slurmv5.yaml
 It will create a directory containing a Terraform module. Follow the printed
 instructions to execute Terraform.
 
-### Login to the cluster and complete installation of AOCC
+### Run an OpenFOAM test suite
+
+Browse to the [Cloud Console][console] and use the SSH feature to access the
+Slurm login node. A script has been provisioned which will activate your
+OpenFOAM environment and run a test suite of applications. The output of this
+test suite will appear in `openfoam_test` under your home directory. To execute
+the test suite, run:
+
+```shell
+bash /var/tmp/openfoam_test.sh
+```
+
+### Complete installation of AOCC
 
 Because AOCC requires acceptance of a license, we advise a manual step to
 install AOCC and OpenMPI compiled with AOCC. You can browse to the
````