You may also notice mount failure logs on the login node:

```text
INFO: Waiting for '/usr/local/etc/slurm' to be mounted...
INFO: Waiting for '/home' to be mounted...
INFO: Waiting for '/opt/apps' to be mounted...
INFO: Waiting for '/etc/munge' to be mounted...
ERROR: mount of path '/usr/local/etc/slurm' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/usr/local/etc/slurm']' returned non-zero exit status 32.
ERROR: mount of path '/opt/apps' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/opt/apps']' returned non-zero exit status 32.
ERROR: mount of path '/home' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/home']' returned non-zero exit status 32.
ERROR: mount of path '/etc/munge' failed: <class 'subprocess.CalledProcessError'>: Command '['mount', '/etc/munge']' returned non-zero exit status 32.
```

> **_NOTE:_** The above logs only indicate that something went wrong with the
> startup of the controller. Check the logs on the controller to confirm whether
> it is a network issue.
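
To dig deeper, one approach (a sketch only; the instance name, zone, and log locations are placeholders and vary by deployment) is to SSH to the controller and scan its recent logs for mount or network errors:

```shell
# Placeholder instance name and zone; substitute your own.
gcloud compute ssh controller-instance-name --zone us-central1-a

# On the controller, look for mount/NFS/network errors in recent system logs.
sudo journalctl --since "1 hour ago" | grep -iE "error|nfs|mount"
```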

### Failure to Create Auto Scale Nodes (Slurm)

If your deployment succeeds but your jobs fail with the following error:
```shell
$ srun -N 6 -p compute hostname
srun: PrologSlurmctld failed, job killed
srun: Force Terminated job 2
srun: error: Job allocation 2 has been revoked
```
Possible causes could be [insufficient quota](#insufficient-quota) or
[placement groups](#placement-groups). Also see the
[Slurm user guide](https://docs.google.com/document/u/1/d/e/2PACX-1vS0I0IcgVvby98Rdo91nUjd7E9u83oIMCM4arne-9_IdBg6BdV1lBpUcSje_PyHcbAaErC1rY7p4u1g/pub).

#### Insufficient Quota

It may be that you have sufficient quota to deploy your cluster but insufficient
quota to bring up the compute nodes.

You can confirm this by SSHing into the `controller` VM and checking the
`resume.log` file:
```shell
$ cat /var/log/slurm/resume.log
...
resume.py ERROR: ... "Quota 'C2_CPUS' exceeded. Limit: 300.0 in region europe-west4.". Details: "[{'message': "Quota 'C2_CPUS' exceeded. Limit: 300.0 in region europe-west4.", 'domain': 'usageLimits', 'reason': 'quotaExceeded'}]">
```
The solution here is to [request more of the specified quota](#gcp-quotas),
`C2 CPUs` in the example above. Alternatively, you could switch the partition's
[machine type][partition-machine-type] to one which has sufficient quota.
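
To compare current usage against the limit, one option (a sketch assuming the `gcloud` CLI is configured for your project; the region and metric below come from the example error) is to list the region's quotas:

```shell
# Region and quota metric are taken from the example error above; substitute your own.
# The default output lists each quota's limit, metric, and usage.
gcloud compute regions describe europe-west4 | grep -B1 -A1 "C2_CPUS"
```
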
If Slurm failed to run a job, view the resume log on the controller instance
with the following command:
```shell
sudo cat /var/log/slurm/resume.log
```

An error in `resume.log` similar to the following indicates a permissions issue
as well:
```shell
The user does not have access to service account 'PROJECT_NUMBER-compute@developer.gserviceaccount.com'. User: ''. Ask a project owner to grant you the iam.serviceAccountUser role on the service account": ['slurm-hpc-small-compute-0-0']
```

As indicated, the user must be granted the `iam.serviceAccountUser` IAM role on
the service account.
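
As an illustrative sketch (the project number and user email are placeholders), a project owner could grant that role on the compute service account with:

```shell
# Placeholder values: replace the project number and user email with your own.
gcloud iam service-accounts add-iam-policy-binding \
  PROJECT_NUMBER-compute@developer.gserviceaccount.com \
  --member="user:username@example.com" \
  --role="roles/iam.serviceAccountUser"
```
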
+`-l, --validation-level string`: sets validation level to one of ("ERROR", "WARNING", "IGNORE") (default "WARNING").

`--vars strings`: comma-separated list of name=value variables to override YAML configuration. Can be used multiple times. Arrays or maps containing comma-separated values must be enclosed in double quotes. The double quotes may require escaping depending on the shell used. Examples below have been tested using a `bash` shell:
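
As an illustrative sketch only (the `create` subcommand, blueprint file name, and variable names are placeholders, and the exact list syntax may differ), invocations might look like:

```shell
# Hypothetical blueprint and variable names, quoted for a bash shell.
ghpc create my-blueprint.yaml --vars project_id=my-project-id

# A value containing commas must be wrapped in (possibly escaped) double quotes.
ghpc create my-blueprint.yaml --vars "zones=[us-central1-a,us-central1-b]"
```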