
Commit 9eaa870

Merge pull request #286 from GoogleCloudPlatform/develop
Release 0.7.1-alpha
2 parents: 20481b4 + 6fdf1a4


79 files changed: +2377 -1706 lines changed

README.md

Lines changed: 72 additions & 0 deletions
@@ -406,6 +406,78 @@ to `false` on the partition in question.

[partition-enable-placement]: https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/community/modules/compute/SchedMD-slurm-on-gcp-partition#input_enable_placement

#### Insufficient Service Account Permissions

By default, the Slurm controller, login, and compute nodes use the
[Google Compute Engine Service Account (GCE SA)][def-compute-sa]. If this
service account, or a custom SA used by the Slurm modules, does not have
sufficient permissions, configuring the controller or running a job in Slurm
may fail.

If configuration of the Slurm controller fails, the error can be seen by
viewing the log of the startup script on the controller:

```shell
sudo journalctl -u google-startup-scripts.service | less
```

An error similar to the following indicates missing permissions for the
service account:

```shell
Required 'compute.machineTypes.get' permission for ...
```

To resolve this error, ensure your service account has the
`compute.instanceAdmin.v1` IAM role:

```shell
SA_ADDRESS=<SET SERVICE ACCOUNT ADDRESS HERE>

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${SA_ADDRESS} --role=roles/compute.instanceAdmin.v1
```
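To confirm the binding, you can list the roles currently granted to the service
account at the project level (a sketch using standard `gcloud` policy
filtering; the same check applies after granting `roles/iam.serviceAccountUser`
below):

```shell
# List the project-level roles bound to the service account
gcloud projects get-iam-policy ${PROJECT_ID} \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:${SA_ADDRESS}" \
    --format="table(bindings.role)"
```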
If Slurm failed to run a job, view the resume log on the controller instance
with the following command:

```shell
sudo cat /var/log/slurm/resume.log
```

An error in `resume.log` similar to the following also indicates a permissions
issue:

```shell
The user does not have access to service account 'PROJECT_NUMBER-compute@developer.gserviceaccount.com'. User: ''. Ask a project owner to grant you the iam.serviceAccountUser role on the service account": ['slurm-hpc-small-compute-0-0']
```

As indicated, the service account must also have the `iam.serviceAccountUser`
IAM role, which can be granted with the following command:

```shell
SA_ADDRESS=<SET SERVICE ACCOUNT ADDRESS HERE>

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${SA_ADDRESS} --role=roles/iam.serviceAccountUser
```

If the GCE SA is being used and cannot be updated, a new service account with
the correct permissions can be created and used instead. Instructions for doing
this can be found in the [Slurm on Google Cloud User Guide][slurm-on-gcp-ug],
specifically the section titled "Create Service Accounts".
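A minimal command-line sketch of that process is shown below; the account name
`slurm-cluster-sa` is illustrative, and the user guide should be consulted for
the complete set of roles to grant:

```shell
# Create a dedicated service account (the name is illustrative)
gcloud iam service-accounts create slurm-cluster-sa \
    --project=${PROJECT_ID} \
    --display-name="Slurm cluster service account"

# Grant the roles discussed above to the new account
SA_ADDRESS=slurm-cluster-sa@${PROJECT_ID}.iam.gserviceaccount.com

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${SA_ADDRESS} --role=roles/compute.instanceAdmin.v1
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member=serviceAccount:${SA_ADDRESS} --role=roles/iam.serviceAccountUser
```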
After creating the service account, it can be set via the
`compute_node_service_account` and `controller_service_account` settings on the
[slurm-on-gcp controller module][slurm-on-gcp-con] and the
`login_service_account` setting on the
[slurm-on-gcp login module][slurm-on-gcp-login].

[def-compute-sa]: https://cloud.google.com/compute/docs/access/service-accounts#default_service_account
[slurm-on-gcp-ug]: https://goo.gle/slurm-gcp-user-guide
[slurm-on-gcp-con]: community/modules/scheduler/SchedMD-slurm-on-gcp-controller/README.md
[slurm-on-gcp-login]: community/modules/scheduler/SchedMD-slurm-on-gcp-login-node/README.md
### Terraform Deployment

When `terraform apply` fails, Terraform generally provides a useful error

cmd/create.go

Lines changed: 8 additions & 8 deletions
@@ -21,7 +21,7 @@ import (
 	"errors"
 	"fmt"
 	"hpc-toolkit/pkg/config"
-	"hpc-toolkit/pkg/reswriter"
+	"hpc-toolkit/pkg/modulewriter"
 	"log"

 	"github.com/spf13/cobra"
@@ -77,19 +77,19 @@ func runCreateCmd(cmd *cobra.Command, args []string) {
 		bpFilename = args[0]
 	}

-	blueprintConfig := config.NewBlueprintConfig(bpFilename)
-	if err := blueprintConfig.SetCLIVariables(cliVariables); err != nil {
+	deploymentConfig := config.NewDeploymentConfig(bpFilename)
+	if err := deploymentConfig.SetCLIVariables(cliVariables); err != nil {
 		log.Fatalf("Failed to set the variables at CLI: %v", err)
 	}
-	if err := blueprintConfig.SetBackendConfig(cliBEConfigVars); err != nil {
+	if err := deploymentConfig.SetBackendConfig(cliBEConfigVars); err != nil {
 		log.Fatalf("Failed to set the backend config at CLI: %v", err)
 	}
-	if err := blueprintConfig.SetValidationLevel(validationLevel); err != nil {
+	if err := deploymentConfig.SetValidationLevel(validationLevel); err != nil {
 		log.Fatal(err)
 	}
-	blueprintConfig.ExpandConfig()
-	if err := reswriter.WriteBlueprint(&blueprintConfig.Config, outputDir, overwriteDeployment); err != nil {
-		var target *reswriter.OverwriteDeniedError
+	deploymentConfig.ExpandConfig()
+	if err := modulewriter.WriteDeployment(&deploymentConfig.Config, outputDir, overwriteDeployment); err != nil {
+		var target *modulewriter.OverwriteDeniedError
 		if errors.As(err, &target) {
 			fmt.Printf("\n%s\n", err.Error())
 		} else {

cmd/expand.go

Lines changed: 6 additions & 6 deletions
@@ -58,18 +58,18 @@ func runExpandCmd(cmd *cobra.Command, args []string) {
 		bpFilename = args[0]
 	}

-	blueprintConfig := config.NewBlueprintConfig(bpFilename)
-	if err := blueprintConfig.SetCLIVariables(cliVariables); err != nil {
+	deploymentConfig := config.NewDeploymentConfig(bpFilename)
+	if err := deploymentConfig.SetCLIVariables(cliVariables); err != nil {
 		log.Fatalf("Failed to set the variables at CLI: %v", err)
 	}
-	if err := blueprintConfig.SetBackendConfig(cliBEConfigVars); err != nil {
+	if err := deploymentConfig.SetBackendConfig(cliBEConfigVars); err != nil {
 		log.Fatalf("Failed to set the backend config at CLI: %v", err)
 	}
-	if err := blueprintConfig.SetValidationLevel(validationLevel); err != nil {
+	if err := deploymentConfig.SetValidationLevel(validationLevel); err != nil {
 		log.Fatal(err)
 	}
-	blueprintConfig.ExpandConfig()
-	blueprintConfig.ExportYamlConfig(outputFilename)
+	deploymentConfig.ExpandConfig()
+	deploymentConfig.ExportBlueprint(outputFilename)
 	fmt.Printf(
 		"Expanded Environment Definition created successfully, saved as %s.\n", outputFilename)
 }

cmd/root.go

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ HPC deployments on the Google Cloud Platform.`,
 			log.Fatalf("cmd.Help function failed: %s", err)
 		}
 	},
-	Version: "v0.7.0-alpha (private preview)",
+	Version: "v0.7.1-alpha (private preview)",
 	}
 )

community/examples/intel/README.md

Lines changed: 152 additions & 0 deletions
@@ -0,0 +1,152 @@
# Intel Solutions for the HPC Toolkit

## Intel-Optimized Slurm Cluster

This document is adapted from a [Cloud Shell tutorial][tutorial] developed to
demonstrate Intel Select Solutions within the Toolkit. It expands upon that
tutorial by building custom images that save provisioning time and improve
reliability when scaling up compute nodes.

The Google Cloud [HPC VM Image][hpcvmimage] has a built-in feature enabling it
to install a Google Cloud-tested release of Intel compilers and libraries that
are known to achieve optimal performance on Google Cloud.

[tutorial]: ../../../docs/tutorials/intel-select/intel-select.md
[hpcvmimage]: https://cloud.google.com/compute/docs/instances/create-hpc-vm

## Provisioning the Intel-optimized Slurm cluster

Identify a project to work in and substitute its unique ID wherever you see
`<<PROJECT_ID>>` in the instructions below.
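Optionally, you can point the `gcloud` CLI at this project now; this is a
sketch of standard usage and is not required by the steps below:

```shell
# Set the default project for subsequent gcloud commands
gcloud config set project <<PROJECT_ID>>
```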
## Initial Setup

Before provisioning any infrastructure in this project, you should follow the
Toolkit guidance to enable [APIs][apis] and establish minimum resource
[quotas][quotas]. In particular, the following APIs should be enabled (a
command-line sketch for enabling them follows the list):

* file.googleapis.com (Cloud Filestore)
* compute.googleapis.com (Google Compute Engine)

[apis]: ../../../README.md#enable-gcp-apis
[quotas]: ../../../README.md#gcp-quotas
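A sketch of enabling both APIs from the command line, assuming the `gcloud` CLI
is authorized for this project:

```shell
# Enable the Cloud Filestore and Compute Engine APIs used by this example
gcloud services enable file.googleapis.com compute.googleapis.com \
    --project=<<PROJECT_ID>>
```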
The following available quota is also required in the region used by the
cluster:

* Filestore: 2560 GB
* C2 CPUs: 6000 (fully-scaled "compute" partition)
  * This quota is not necessary at initial deployment, but will be required to
    successfully scale the partition to its maximum size
* C2 CPUs: 4 (login node)
## Deploying the Blueprint

Use `ghpc` to provision the blueprint, supplying your project ID:

```shell
ghpc create --vars project_id=<<PROJECT_ID>> hpc-cluster-intel-select.yaml
```

It will create a set of directories containing Terraform modules and Packer
templates. **Please ignore the printed instructions** in favor of the following:

1. Provision the network and the startup scripts that install Intel software.

   ```shell
   terraform -chdir=hpc-intel-select/primary init
   terraform -chdir=hpc-intel-select/primary validate
   terraform -chdir=hpc-intel-select/primary apply
   ```

1. Capture the startup scripts to files that will be used by Packer to build
   the images.

   ```shell
   terraform -chdir=hpc-intel-select/primary output \
     -raw startup_script_startup_controller > \
     hpc-intel-select/packer/controller-image/startup_script.sh
   terraform -chdir=hpc-intel-select/primary output \
     -raw startup_script_startup_compute > \
     hpc-intel-select/packer/compute-image/startup_script.sh
   ```

1. Build the custom Slurm controller image. While this step is executing, you
   may begin the next step in parallel.

   ```shell
   cd hpc-intel-select/packer/controller-image
   packer init .
   packer validate .
   packer build -var startup_script_file=startup_script.sh .
   ```

1. Build the custom Slurm image for the login and compute nodes.

   ```shell
   cd -
   cd hpc-intel-select/packer/compute-image
   packer init .
   packer validate .
   packer build -var startup_script_file=startup_script.sh .
   ```

1. Provision the Slurm cluster.

   ```shell
   cd -
   terraform -chdir=hpc-intel-select/cluster init
   terraform -chdir=hpc-intel-select/cluster validate
   terraform -chdir=hpc-intel-select/cluster apply
   ```

## Connecting to the login node

Once the startup script has completed and Slurm reports readiness, connect to
the login node.

1. Open the following URL in a new tab. This will take you to `Compute Engine` >
   `VM instances` in the Google Cloud Console:

   ```text
   https://console.cloud.google.com/compute
   ```

   Ensure that you select the project in which you are provisioning the cluster.

1. Click on the `SSH` button associated with the `slurm-hpc-intel-select-login0`
   instance.

   This will open a separate pop-up window with a terminal into the newly
   created Slurm login VM.
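If you prefer the command line to the Cloud Console, a sketch of an equivalent
connection is shown below; the zone placeholder is an assumption and should be
replaced with the zone used by your deployment:

```shell
# SSH to the Slurm login node (substitute your deployment's zone)
gcloud compute ssh slurm-hpc-intel-select-login0 \
    --zone=<<ZONE>> --project=<<PROJECT_ID>>
```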
## Access the cluster and provision an example job

**The commands below should be run on the login node.**

1. Create a default SSH key so that you can SSH between nodes:

   ```shell
   ssh-keygen -q -N '' -f ~/.ssh/id_rsa
   cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
   chmod 0600 ~/.ssh/authorized_keys
   ```

1. Submit an example job:

   ```shell
   cp /var/tmp/dgemm_job.sh .
   sbatch dgemm_job.sh
   ```
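After submitting, standard Slurm commands can be used to watch the job and the
autoscaled nodes; this is a sketch and not part of the original tutorial:

```shell
# Show the job queue; the job may stay pending while compute nodes are provisioned
squeue
# Show partition and node state as the autoscaler brings nodes up
sinfo
# Once the job completes, its output lands in the submission directory
# (slurm-<jobid>.out is the Slurm default naming)
cat slurm-*.out
```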
## Delete the infrastructure when not in use

> **_NOTE:_** If the Slurm controller is shut down before the auto-scale nodes
> are destroyed, they will be left running.
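Before tearing anything down, you can check from the command line whether any
autoscaled compute nodes are still running; this sketch uses `gcloud`, and the
name filter is an assumption based on the node naming in this example:

```shell
# List any remaining instances whose names contain "compute"
gcloud compute instances list --project=<<PROJECT_ID>> --filter="name~compute"
```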
You can also open your browser to the VM instances page and confirm that nodes
named "compute" have been shut down and deleted by the Slurm autoscaler. Then
delete the remaining infrastructure in reverse order of creation:

```shell
terraform -chdir=hpc-intel-select/cluster destroy
terraform -chdir=hpc-intel-select/primary destroy
```
