
Commit b0a5f6f

Merge pull request #344 from GoogleCloudPlatform/develop
Version 1.0.0
2 parents c194064 + 465b471 commit b0a5f6f

21 files changed: +218 -230 lines changed

README.md

Lines changed: 28 additions & 201 deletions
@@ -10,183 +10,37 @@ networking, storage, etc.) following Google Cloud best-practices, in a repeatable
 manner. The HPC Toolkit is designed to be highly customizable and extensible,
 and intends to address the HPC deployment needs of a broad range of customers.

-## Installation
+More information can be found on the
+[Google Cloud Docs](https://cloud.google.com/hpc-toolkit/docs/overview).

-These instructions assume you are using
-[Cloud Shell](https://cloud.google.com/shell) which comes with the
-[dependencies](#dependencies) pre-installed.
+## Quickstart

-To use the HPC-Toolkit, you must clone the project from GitHub and build the
-`ghpc` binary.
+Running through the
+[quickstart tutorial](https://cloud.google.com/hpc-toolkit/docs/quickstarts/slurm-cluster)
+is the recommended path to get started with the HPC Toolkit.

-1. Execute `gh auth login`
-   * Select GitHub.com
-   * Select HTTPS
-   * Select Yes for "Authenticate Git with your GitHub credentials?"
-   * Select "Login with a web browser"
-   * Copy the one time code presented in the terminal
-   * Press [enter]
-   * Click the link https://github.com/login/device presented in the terminal
+Find a full list of tutorials [here](docs/tutorials/README.md).

-   A web browser will open, paste the one time code into the web browser prompt.
-   Continue to log into GitHub, then return to the terminal. You should see a
-   message that includes "Authentication complete."
+---

-You can now clone the Toolkit:
+If a self directed path is preferred, you can use the following commands to
+build the `ghpc` binary:

 ```shell
-gh repo clone GoogleCloudPlatform/hpc-toolkit
+git clone git@github.com:GoogleCloudPlatform/hpc-toolkit.git
+cd hpc-toolkit
+make
+./ghpc --version
+./ghpc --help
 ```

-Finally, build the toolkit.
-
-```shell
-cd hpc-toolkit && make
-```
-
-You should now have a binary named `ghpc` in the project root directory.
-Optionally, you can run `./ghpc --version` to verify the build.
-
-## Quick Start
-
-To create an HPC deployment, an HPC blueprint file needs to be written or
-adapted from one of the [core examples](examples/) or
-[community examples](community/examples/).
-
-These instructions will use
-[examples/hpc-cluster-small.yaml](examples/hpc-cluster-small.yaml), which is a
-good starting point and creates a deployment containing:
-
-* a new network
-* a filestore instance
-* a slurm login node
-* a slurm controller
-
-> **_NOTE:_** More information on the example blueprints can be found in
-> [examples/README.md](examples/README.md).
-
-These instructions assume you are using
-[Cloud Shell](https://cloud.google.com/shell) in the context of the GCP project
-you wish to deploy in, and that you are in the root directory of the hpc-toolkit
-repo cloned during [installation](#installation).
-
-Run the ghpc binary with the following command:
-
-```shell
-./ghpc create examples/hpc-cluster-small.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}"
-```
-
-> **_NOTE:_** The `--vars` argument supports comma-separated list of name=value
-> variables to override blueprint variables. This feature only supports
-> variables of string type.
-
-This will create a deployment directory named `hpc-small/`.
-
-After successfully running `ghpc create`, a short message displaying how to
-proceed is displayed. For the `hpc-cluster-small` example, the message will
-appear similar to:
-
-```shell
-terraform -chdir=hpc-cluster-small/primary init
-terraform -chdir=hpc-cluster-small/primary validate
-terraform -chdir=hpc-cluster-small/primary apply
-```
-
-Use these commands to run terraform and deploy your cluster. If the `apply` is
-successful, a message similar to the following will be displayed:
-
-```shell
-Apply complete! Resources: 13 added, 0 changed, 0 destroyed.
-```
-
-> **_NOTE:_** Before you run this for the first time you may need to enable some
-> APIs and possibly request additional quotas. See
-> [Enable GCP APIs](#enable-gcp-apis) and
-> [Small Example Quotas](examples/README.md#hpc-cluster-smallyaml).\
-> **_NOTE:_** If not using cloud shell you may need to set up
-> [GCP Credentials](#gcp-credentials).\
-> **_NOTE:_** Cloud Shell times out after 20 minutes of inactivity. This example
-> deploys in about 5 minutes but for more complex deployments it may be
-> necessary to deploy (`terraform apply`) from a cloud VM. The same process
-> above can be used, although [dependencies](#dependencies) will need to be
-> installed first.
-
-Once successfully deployed, take the following steps to run a job:
-
-* First navigate to `Compute Engine` > `VM instances` in the Google Cloud Console.
-* Next click on the `SSH` button associated with the `slurm-hpc-small-login0` instance.
-* Finally run the `hostname` command on 3 nodes by running the following command in the shell popup:
-
-```shell
-$ srun -N 3 hostname
-slurm-hpc-slurm-small-debug-0-0
-slurm-hpc-slurm-small-debug-0-1
-slurm-hpc-slurm-small-debug-0-2
-```
-
-By default, this runs the job on the `debug` partition. See details in
-[examples/](examples/README.md#compute-partition) for how to run on the more
-performant `compute` partition.
-
-This example does not contain any Packer-based modules but for completeness,
-you can use the following command to deploy a Packer-based deployment group:
-
-```shell
-cd <deployment-directory>/<packer-group>/<custom-vm-image>
-packer init .
-packer validate .
-packer build .
-```
+> **_NOTE:_** You may need to [install dependencies](#dependencies) first.

 ## HPC Toolkit Components

-The HPC Toolkit has been designed to simplify the process of deploying an HPC
-cluster on Google Cloud. The block diagram below describes the individual
-components of the HPC toolkit.
-
-```mermaid
-graph LR
-    subgraph HPC Environment Configuration
-    A(1. Provided Blueprint Examples) --> B(2. HPC Blueprint)
-    end
-    B --> D
-    subgraph Creating an HPC Deployment
-    C(3. Modules, eg. Terraform, Scripts) --> D(4. ghpc Engine)
-    D --> E(5. Deployment Directory)
-    end
-    subgraph Google Cloud
-    E --> F(6. HPC environment on GCP)
-    end
-```
-
-1. **Provided Blueprint Examples** – A set of vetted reference blueprints can be
-   found in the ./examples and ./community/examples directories. These can be
-   used to create a predefined deployment for a cluster or as a starting point
-   for creating a custom deployment.
-2. **HPC Blueprint** – The primary interface to the HPC Toolkit is an HPC
-   Blueprint file. This is a YAML file that defines which modules to use and how
-   to customize them.
-3. **HPC Modules** – The building blocks of a deployment directory are the
-   modules. Modules can be found in the ./modules and community/modules
-   directories. They are composed of terraform, packer and/or script files that
-   meet the expectations of the gHPC engine.
-4. **gHPC Engine** – The gHPC engine converts the blueprint file into a
-   self-contained deployment directory.
-5. **Deployment Directory** – A self-contained directory that can be used to
-   deploy a cluster onto Google Cloud. This is the output of the gHPC engine.
-6. **HPC environment on GCP** – After deployment, an HPC environment will be
-   available in Google Cloud.
-
-Users can configure a set of modules, and using the gHPC Engine of the HPC
-Toolkit, they can produce a deployment directory with instructions for
-deploying. Terraform is the primary method for defining the modules behind the
-HPC cluster, but other modules based on tools like ansible and Packer are
-available.
-
-The HPC Toolkit can provide extra flexibility to configure a cluster to the
-specifications of a customer by making the deployment directory available and
-editable before deploying. Any HPC customer seeking a quick on-ramp to building
-out their infrastructure on GCP can benefit from this.
+Learn about the components that make up the HPC Toolkit and more on how it works
+on the
+[Google Cloud Docs Product Overview](https://cloud.google.com/hpc-toolkit/docs/overview#components).

 ## GCP Credentials

@@ -309,23 +163,18 @@ In a new GCP project there are several apis that must be enabled to deploy your
 HPC cluster. These will be caught when you perform `terraform apply` but you can
 save time by enabling them upfront.

-List of APIs to enable ([instructions](https://cloud.google.com/apis/docs/getting-started#enabling_apis)):
-
-* Compute Engine API
-* Cloud Filestore API
-* Cloud Runtime Configuration API - _needed for `high-io` example_
+See
+[Google Cloud Docs](https://cloud.google.com/hpc-toolkit/docs/setup/configure-environment#enable-apis)
+for instructions.

 ## GCP Quotas

 You may need to request additional quota to be able to deploy and use your HPC
-cluster. For example, by default the `SchedMD-slurm-on-gcp-partition` module
-uses `c2-standard-60` VMs for compute nodes. Default quota for C2 CPUs may be as
-low as 8, which would prevent even a single node from being started.
-
-Required quotas will be based on your custom HPC configuration. Minimum quotas
-have been [documented](examples/README.md#example-blueprints) for the provided examples.
+cluster.

-Quotas can be inspected and requested at `IAM & Admin` > `Quotas`.
+See
+[Google Cloud Docs](https://cloud.google.com/hpc-toolkit/docs/setup/hpc-blueprint#request-quota)
+for more information.

 ## Billing Reports

@@ -581,30 +430,8 @@ hpc-small/

 ## Dependencies

-Much of the HPC Toolkit deployment is built using Terraform and Packer, and
-therefore they must be available in the same machine calling the toolkit. In
-addition, building the HPC Toolkit from source requires git, make, and Go to be
-installed.
-
-List of dependencies:
-
-* Terraform: version>=1.0.0 - [install instructions](https://www.terraform.io/downloads.html)
-* Packer: version>=1.6.0 - [install instructions](https://www.packer.io/downloads)
-* golang: version>=1.16 - [install instructions](https://golang.org/doc/install)
-  * To setup GOPATH and development environment: `export PATH=$PATH:$(go env GOPATH)/bin`
-* make
-* git
-
-### MacOS Additional Dependencies
-
-On macOS, `make` is packaged with the Xcode command line developer tools. To
-install, run the following command:
-
-```shell
-xcode-select --install
-```
-
-Alternatively you can build `ghpc` directly using `go build ghpc.go`.
+See
+[Cloud Docs on Installing Dependencies](https://cloud.google.com/hpc-toolkit/docs/setup/install-dependencies).

 ### Notes on Packer
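
Taken together, the README changes above move the step-by-step walkthrough into the Cloud Docs while keeping only the build commands. For orientation, here is a minimal sketch of the end-to-end flow those sections describe, assuming Cloud Shell (or installed dependencies), the `examples/hpc-cluster-small.yaml` blueprint, and the `GOOGLE_CLOUD_PROJECT` environment variable; every command below appears in the diff itself, either in the new Quickstart or in the removed Quick Start text.

```shell
# Build the ghpc binary (commands retained in the new Quickstart section).
git clone git@github.com:GoogleCloudPlatform/hpc-toolkit.git
cd hpc-toolkit
make

# Turn the example blueprint into a self-contained deployment directory
# (from the removed Quick Start; --vars only overrides string variables).
./ghpc create examples/hpc-cluster-small.yaml --vars "project_id=${GOOGLE_CLOUD_PROJECT}"

# Deploy with the Terraform commands that ghpc create prints for this example.
terraform -chdir=hpc-cluster-small/primary init
terraform -chdir=hpc-cluster-small/primary validate
terraform -chdir=hpc-cluster-small/primary apply
```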

cmd/root.go

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ HPC deployments on the Google Cloud Platform.`,
 			log.Fatalf("cmd.Help function failed: %s", err)
 		}
 	},
-	Version: "v0.7.3-alpha (private preview)",
+	Version: "v1.0.0",
 }
)
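
The `Version` string sits on the root cobra command, so it is what the auto-generated `--version` flag reports. A quick sanity check after this change, assuming a successful `make` in the repository root (the exact wording of the output is cobra's default version template):

```shell
# Rebuild and confirm the bumped version string.
make
./ghpc --version   # should now report v1.0.0 instead of v0.7.3-alpha (private preview)
```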

community/examples/README.md

Lines changed: 6 additions & 2 deletions
@@ -15,8 +15,12 @@ Examples using Intel HPC technologies can be found in the

 ### spack-gromacs.yaml

-[See description in core](../../examples/README.md#community-spack-gromacsyaml)
+[See description in core](../../examples/README.md#spack-gromacsyaml--)

 ### omnia-cluster.yaml

-[See description in core](../../examples/README.md#community-omnia-clusteryaml)
+[See description in core](../../examples/README.md#omnia-clusteryaml--)
+
+### hpc-cluster-small-sharedvpc.yaml
+
+[See description in core](../../examples/README.md#hpc-cluster-small-sharedvpcyaml--)
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
+# Copyright 2021 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+---
+
+blueprint_name: hpc-cluster-small-sharedvpc
+
+# IMPORTANT NOTES
+#
+# 1. This blueprint expects a Shared VPC to exist and has already been shared
+#    from a Host project to a Service project.
+# 2. It also anticipates that the custom steps for provisioning a Filestore
+#    instance in a Shared VPC in a service project have been completed:
+#
+#    https://cloud.google.com/filestore/docs/shared-vpc
+#
+# 3. Replace project_id, host_project_id, network_name, subnetwork_name with
+#    valid values in your environment
+
+vars:
+  project_id: ## Set GCP Project ID Here ##
+  host_project_id: your-host-project
+  network_name: your-shared-network
+  subnetwork_name: your-shared-subnetwork
+  deployment_name: hpc-small-shared-vpc
+  region: us-central1
+  zone: us-central1-c
+
+deployment_groups:
+- group: primary
+  modules:
+  - source: modules/network/pre-existing-vpc
+    kind: terraform
+    id: network1
+    settings:
+      project_id: $(vars.host_project_id)
+
+  - source: modules/file-system/filestore
+    kind: terraform
+    id: homefs
+    use: [network1]
+    settings:
+      local_mount: /home
+      project_id: $(vars.host_project_id)
+      connect_mode: PRIVATE_SERVICE_ACCESS
+
+
+  # This debug_partition will work out of the box without requesting additional GCP quota.
+  - source: community/modules/compute/SchedMD-slurm-on-gcp-partition
+    kind: terraform
+    id: debug_partition
+    use:
+    - network1
+    - homefs
+    settings:
+      partition_name: debug
+      max_node_count: 4
+      enable_placement: false
+      exclusive: false
+      machine_type: n2-standard-2
+
+  # This compute_partition is far more performant than debug_partition but may require requesting GCP quotas first.
+  - source: community/modules/compute/SchedMD-slurm-on-gcp-partition
+    kind: terraform
+    id: compute_partition
+    use:
+    - network1
+    - homefs
+    settings:
+      partition_name: compute
+      max_node_count: 20
+
+  - source: community/modules/scheduler/SchedMD-slurm-on-gcp-controller
+    kind: terraform
+    id: slurm_controller
+    use:
+    - network1
+    - homefs
+    - debug_partition  # debug partition will be default as it is listed first
+    - compute_partition
+    settings:
+      login_node_count: 1
+      shared_vpc_host_project: $(vars.host_project_id)
+
+  - source: community/modules/scheduler/SchedMD-slurm-on-gcp-login-node
+    kind: terraform
+    id: slurm_login
+    use:
+    - network1
+    - homefs
+    - slurm_controller
+    settings:
+      shared_vpc_host_project: $(vars.host_project_id)
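
For context on how this new blueprint would be consumed, here is a hedged sketch following the same workflow as the core examples. It assumes the file lands at `community/examples/hpc-cluster-small-sharedvpc.yaml` (as the community examples README entry added above suggests), that the Shared VPC and Filestore prerequisites in the header comments are satisfied, and that the deployment directory is named after `deployment_name`, as it is for the `hpc-cluster-small` example. The host project, network, and subnetwork values are the blueprint's own placeholders and must be replaced; per the README, `--vars` accepts a comma-separated list of string `name=value` overrides.

```shell
# Generate the deployment directory from the Shared VPC blueprint
# (path and variable values are illustrative placeholders).
./ghpc create community/examples/hpc-cluster-small-sharedvpc.yaml \
  --vars "project_id=${GOOGLE_CLOUD_PROJECT},host_project_id=your-host-project,network_name=your-shared-network,subnetwork_name=your-shared-subnetwork"

# deployment_name is hpc-small-shared-vpc, so the generated Terraform for
# the primary group should live under that directory.
terraform -chdir=hpc-small-shared-vpc/primary init
terraform -chdir=hpc-small-shared-vpc/primary validate
terraform -chdir=hpc-small-shared-vpc/primary apply
```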
