Commit 1c6b5a6

add GKE H4D blueprint and integration test
1 parent 119d97c commit 1c6b5a6

7 files changed

+392
-1
lines changed

examples/README.md

Lines changed: 7 additions & 0 deletions
@@ -59,6 +59,7 @@ md_toc github examples/README.md | sed -e "s/\s-\s/ * /"
 * [tutorial-fluent.yaml](#tutorial-fluentyaml--) ![community-badge] ![experimental-badge]
 * [gke-tpu-v6](#gke-tpu-v6--) ![community-badge] ![experimental-badge]
 * [xpk-n2-filestore](#xpk-n2-filestore--) ![community-badge] ![experimental-badge]
+* [gke-h4d](#gke-h4d-) ![core-badge]
 * [Blueprint Schema](#blueprint-schema)
 * [Writing an HPC Blueprint](#writing-an-hpc-blueprint)
 * [Blueprint Boilerplate](#blueprint-boilerplate)
@@ -1453,6 +1454,12 @@ python3 xpk.py info --cluster xpk-01
 
 [xpk-n2-filestore]: ../community/examples/xpk-n2-filestore/xpk-n2-filestore.yaml
 
+### [gke-h4d] ![core-badge]
+
+This blueprint uses GKE to provision a Kubernetes cluster and an H4D node pool, along with networks and service accounts. Information about H4D machines can be found in [this Google Cloud blog post](https://cloud.google.com/blog/products/compute/new-h4d-vms-optimized-for-hpc). The deployment instructions can be found in the [README](/examples/gke-h4d/README.md).
+
+[gke-h4d]: ../examples/gke-h4d
+
 ## Blueprint Schema
 
 Similar documentation can be found on

examples/gke-h4d/README.md

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+# GKE H4D Blueprint
+
+This blueprint uses GKE to provision a Kubernetes cluster and an H4D node pool, along with networks and service accounts. Information about H4D machines can be found in [this Google Cloud blog post](https://cloud.google.com/blog/products/compute/new-h4d-vms-optimized-for-hpc).
+
+> **_NOTE:_** The required GKE version for H4D support is >= 1.32.3-gke.1170000.
+
+## Steps to deploy the H4D blueprint
+
+1. Install Cluster Toolkit
+   1. Install [dependencies](https://cloud.google.com/cluster-toolkit/docs/setup/install-dependencies).
+   1. Set up [Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/setup/configure-environment).
+1. Switch to the Cluster Toolkit directory
+
+   ```sh
+   cd cluster-toolkit
+   ```
+
+1. Get the IP address for your host machine
+
+   ```sh
+   curl ifconfig.me
+   ```
+
+1. Update the vars block of the `gke-h4d-deployment.yaml` file.
+   1. `project_id`: ID of the project where you are deploying the cluster.
+   1. `deployment_name`: Name of the deployment.
+   1. `region`: Compute region used for the deployment.
+   1. `zone`: Compute zone used for the deployment.
+   1. `static_node_count`: Number of nodes to create.
+   1. `authorized_cidr`: IP address of your host machine in CIDR form, `<your-ip-address>/32`.
+1. Build the Cluster Toolkit binary
+
+   ```sh
+   make
+   ```
+
+1. Provision the GKE cluster
+
+   ```sh
+   ./gcluster deploy -d examples/gke-h4d/gke-h4d-deployment.yaml examples/gke-h4d/gke-h4d.yaml
+   ```
+
+   These four options are displayed:
+
+   ```sh
+   (D)isplay full proposed changes,
+   (A)pply proposed changes,
+   (S)top and exit,
+   (C)ontinue without applying
+   ```
+
+   Type `a` and press Enter to create the cluster.
+
+## Clean Up
+To destroy all resources associated with creating the GKE cluster, run the following command:
+
+```sh
+./gcluster destroy CLUSTER-NAME
+```
+
+Replace `CLUSTER-NAME` with the `deployment_name` used in the blueprint vars block.
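
Reviewer note: the README asks for the host IP from `curl ifconfig.me` and then for `authorized_cidr` in `<your-ip-address>/32` form. The following is a minimal sketch of that conversion, not part of the commit. The helper name `to_host_cidr` is hypothetical, and the address is taken from the RFC 5737 documentation range.

```sh
# Hypothetical helper: turn a single IPv4 address into the /32 CIDR
# that the README asks for in authorized_cidr.
to_host_cidr() {
  printf '%s/32\n' "$1"
}

# Illustration with a documentation-range address.
# In practice the argument would be the output of: curl -s ifconfig.me
to_host_cidr 203.0.113.7
```

A `/32` suffix authorizes exactly one address, so only the machine running `gcluster` and `kubectl` is allowed to reach the cluster's control plane endpoint.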
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+# Copyright 2025 "Google LLC"
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+terraform_backend_defaults:
+  type: gcs
+  configuration:
+    # The GCS bucket used for storing terraform state
+    bucket: BUCKET_NAME
+
+vars:
+  # Your GCP Project ID
+  project_id: PROJECT_ID
+
+  # This should be unique across all of your Cluster
+  # Toolkit Deployments.
+  deployment_name: DEPLOYMENT_NAME
+
+  # The GCP Region used for this deployment.
+  region: COMPUTE_REGION
+
+  # The GCP Zone used for this deployment.
+  zone: COMPUTE_ZONE
+
+  # The number of nodes to be created
+  static_node_count: NODE_COUNT
+
+  # Cidr block containing the IP of the machine calling terraform.
+  # The following line must be updated for this example to work.
+  authorized_cidr: IP_ADDRESS/SUFFIX
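
Reviewer note: every value in the `vars` block above is a placeholder. As a rough sketch of what a filled-in block could look like, the shell function below prints one from its arguments. The function name `render_vars` and all example values (project, deployment name, region, zone, node count, address) are hypothetical and only illustrate the shape of the file; H4D availability in the example zone is not implied.

```sh
# Hypothetical helper: print a filled-in vars block for the deployment file.
# Arguments: project_id deployment_name region zone node_count ip_address
render_vars() {
  cat <<EOF
vars:
  project_id: $1
  deployment_name: $2
  region: $3
  zone: $4
  static_node_count: $5
  authorized_cidr: $6/32
EOF
}

# Illustrative values only; substitute your own.
render_vars my-project gke-h4d-demo us-central1 us-central1-a 2 203.0.113.7
```

The `terraform_backend_defaults` block is left out of the sketch; its `bucket` placeholder still has to be replaced with an existing GCS bucket name.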

examples/gke-h4d/gke-h4d.yaml

Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
+# Copyright 2025 "Google LLC"
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+blueprint_name: gke-h4d
+
+vars:
+  # The following variables should be over-written in the deployment.yaml file.
+  # Your GCP Project ID
+  project_id:
+
+  # This should be unique across all of your Cluster
+  # Toolkit Deployments.
+  deployment_name: gke-h4d
+
+  # The GCP Region used for this deployment.
+  region:
+
+  # The GCP Zone used for this deployment.
+  zone:
+
+  # The number of nodes to be created.
+  static_node_count:
+
+  # Cidr block containing the IP of the machine calling terraform.
+  # The following line must be updated for this example to work.
+  authorized_cidr:
+
+  system_node_pool_disk_size_gb: 100
+  h4d_node_pool_disk_size_gb: 100
+
+
+deployment_groups:
+- group: primary
+  modules:
+  - id: gke-h4d-net
+    source: modules/network/vpc
+    settings:
+      network_name: $(vars.deployment_name)-net
+      subnetworks:
+      - subnet_name: $(vars.deployment_name)-sub
+        subnet_region: $(vars.region)
+        subnet_ip: 192.168.0.0/24
+      secondary_ranges_list:
+      - subnetwork_name: $(vars.deployment_name)-sub
+        ranges:
+        - range_name: pods
+          ip_cidr_range: 10.64.0.0/19
+        - range_name: services
+          ip_cidr_range: 10.65.0.0/19
+      firewall_rules:
+      - name: $(vars.deployment_name)-internal
+        ranges: [192.168.0.0/24]
+        allow:
+        - protocol: tcp
+          ports: ["0-65535"]
+        - protocol: udp
+          ports: ["0-65535"]
+        - protocol: icmp
+
+  - id: gke-h4d-rdma-net
+    source: modules/network/vpc
+    settings:
+      network_name: $(vars.deployment_name)-rdma-net
+      network_profile: https://www.googleapis.com/compute/beta/projects/$(vars.project_id)/global/networkProfiles/$(vars.zone)-vpc-falcon
+      network_routing_mode: REGIONAL
+      enable_cloud_router: false
+      enable_cloud_nat: false
+      subnetworks:
+      - subnet_name: $(vars.deployment_name)-rdma-sub
+        subnet_region: $(vars.region)
+        subnet_ip: 192.168.1.0/24
+        region: $(vars.region)
+
+  - id: node_pool_service_account
+    source: community/modules/project/service-account
+    settings:
+      name: gke-np-sa
+      project_roles:
+      - logging.logWriter
+      - monitoring.metricWriter
+      - monitoring.viewer
+      - stackdriver.resourceMetadata.writer
+      - storage.objectViewer
+      - artifactregistry.reader
+
+  - id: workload_service_account
+    source: community/modules/project/service-account
+    settings:
+      name: gke-wl-sa
+      project_roles:
+      - logging.logWriter
+      - monitoring.metricWriter
+      - monitoring.viewer
+      - stackdriver.resourceMetadata.writer
+      - storage.objectAdmin
+      - artifactregistry.reader
+
+  - id: h4d-cluster
+    source: modules/scheduler/gke-cluster
+    use: [gke-h4d-net, workload_service_account]
+    settings:
+      system_node_pool_machine_type: "e2-standard-16"
+      system_node_pool_disk_size_gb: $(vars.system_node_pool_disk_size_gb)
+      system_node_pool_taints: []
+      enable_multi_networking: true
+      enable_dcgm_monitoring: true
+      gcp_public_cidrs_access_enabled: false
+      enable_private_endpoint: false  # Allows access from authorized public IPs
+      configure_workload_identity_sa: true
+      master_authorized_networks:
+      - cidr_block: $(vars.authorized_cidr)  # Allows your machine to run the kubectl command. Required for multi network setup.
+        display_name: "kubectl-access-network"
+      additional_networks:
+        $(concat(
+          [{
+            network=gke-h4d-rdma-net.network_name,
+            subnetwork=gke-h4d-rdma-net.subnetwork_name,
+            subnetwork_project=vars.project_id,
+            nic_type="IRDMA",
+            queue_count=null,
+            network_ip=null,
+            stack_type=null,
+            access_config=[{nat_ip=null, public_ptr_domain_name=null, network_tier=null}],
+            ipv6_access_config=[],
+            alias_ip_range=[]
+          }]
+        ))
+      # Cluster versions cannot be updated through the toolkit after creation
+      # Please manage cluster version from the Google Cloud Console directly
+      version_prefix: "1.32."
+      release_channel: RAPID
+      maintenance_exclusions:
+      - name: no-minor-or-node-upgrades-indefinite
+        start_time: "2024-12-01T00:00:00Z"
+        end_time: "2025-12-22T00:00:00Z"
+        exclusion_scope: NO_MINOR_OR_NODE_UPGRADES
+    outputs: [instructions]
+
+  - id: h4d-pool
+    source: modules/compute/gke-node-pool
+    use: [h4d-cluster, node_pool_service_account]
+    settings:
+      machine_type: h4d-highmem-192-lssd
+      auto_upgrade: true
+      zones: [$(vars.zone)]
+      disk_size_gb: $(vars.h4d_node_pool_disk_size_gb)
+      static_node_count: $(vars.static_node_count)
+      additional_networks:
+        $(concat(
+          [{
+            network=gke-h4d-rdma-net.network_name,
+            subnetwork=gke-h4d-rdma-net.subnetwork_name,
+            subnetwork_project=vars.project_id,
+            nic_type="IRDMA",
+            queue_count=null,
+            network_ip=null,
+            stack_type=null,
+            access_config=[{nat_ip=null, public_ptr_domain_name=null, network_tier=null}],
+            ipv6_access_config=[],
+            alias_ip_range=[]
+          }]
+        ))
+    outputs: [instructions]
+
+  # Install Kueue and Jobset
+  - id: workload-manager-install
+    source: modules/management/kubectl-apply
+    use: [h4d-cluster]
+    settings:
+      kueue:
+        install: true
+      jobset:
+        install: true
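
Reviewer note: the blueprint fixes the pod and service secondary ranges at `/19` and both subnets at `/24`. The sketch below works out what those prefix lengths mean in address counts. The helper names `cidr_size` and `max_nodes` are hypothetical, and the node estimate assumes each node reserves a `/24` slice of the pod range (GKE's usual allocation at the default maximum pods per node); the commit does not state what this blueprint configures.

```sh
# Hypothetical helper: number of addresses in an IPv4 block
# with the given prefix length.
cidr_size() {
  echo $(( 1 << (32 - $1) ))
}

# Hypothetical helper: how many per-node slices of prefix length $2
# fit into a pod range of prefix length $1.
max_nodes() {
  echo $(( 1 << ($2 - $1) ))
}

cidr_size 19      # addresses in the 10.64.0.0/19 pod range
cidr_size 24      # addresses in the 192.168.0.0/24 subnet
max_nodes 19 24   # nodes the pod range can serve, assuming a /24 per node
```

Under that assumption the pod range is the tighter limit on cluster size, which is worth checking against `static_node_count` plus the system node pool before deploying a large cluster.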
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+---
+tags:
+- gke
+- m.gke-cluster
+- m.gke-node-pool
+- m.service-account
+- m.vpc
+- m.kubectl-apply
+
+timeout: 14400s  # 4hr
+steps:
+- id: gke-h4d
+  name: us-central1-docker.pkg.dev/$PROJECT_ID/hpc-toolkit-repo/test-runner
+  entrypoint: /bin/bash
+  env:
+  - "ANSIBLE_HOST_KEY_CHECKING=false"
+  - "ANSIBLE_CONFIG=/workspace/tools/cloud-build/ansible.cfg"
+  args:
+  - -c
+  - |
+    set -x -e
+    cd /workspace && make
+    BUILD_ID_FULL=$BUILD_ID
+    BUILD_ID_SHORT=$${BUILD_ID_FULL:0:6}
+    EXAMPLE_BP=examples/gke-h4d/gke-h4d.yaml
+
+    # adding vm to act as remote node
+    echo '  - id: remote-node' >> $${EXAMPLE_BP}
+    echo '    source: modules/compute/vm-instance' >> $${EXAMPLE_BP}
+    echo '    use: [gke-h4d-net]' >> $${EXAMPLE_BP}
+    echo '    settings:' >> $${EXAMPLE_BP}
+    echo '      machine_type: e2-standard-2' >> $${EXAMPLE_BP}
+    echo '      name_prefix: remote-node' >> $${EXAMPLE_BP}
+    echo '      add_deployment_name_before_prefix: true' >> $${EXAMPLE_BP}
+    echo '' >> $${EXAMPLE_BP}
+    echo '  - id: job_template_hostname' >> $${EXAMPLE_BP}
+    echo '    source: modules/compute/gke-job-template' >> $${EXAMPLE_BP}
+    echo '    use: [h4d-pool]' >> $${EXAMPLE_BP}
+    echo '    settings:' >> $${EXAMPLE_BP}
+    echo '      image: busybox' >> $${EXAMPLE_BP}
+    echo '      command:' >> $${EXAMPLE_BP}
+    echo '      - echo' >> $${EXAMPLE_BP}
+    echo '      - Hello World' >> $${EXAMPLE_BP}
+    echo '      node_count: 1' >> $${EXAMPLE_BP}
+    echo '    outputs: [instructions]' >> $${EXAMPLE_BP}
+
+    ansible-playbook tools/cloud-build/daily-tests/ansible_playbooks/base-integration-test.yml \
+      --user=sa_106486320838376751393 --extra-vars="project=${PROJECT_ID} build=$${BUILD_ID_SHORT}" \
+      --extra-vars="@tools/cloud-build/daily-tests/tests/gke-h4d.yml"
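
Reviewer note: in the step above, `$BUILD_ID` and `${PROJECT_ID}` are Cloud Build substitutions, while `$$` is Cloud Build's escape for a literal `$`, so the shell ends up evaluating `${BUILD_ID_FULL:0:6}` itself. That bash substring expansion keeps the first six characters of the build ID. The sketch below reproduces the same truncation in a POSIX-portable way; the helper name `short_build_id` and the example ID are hypothetical.

```sh
# Hypothetical helper: keep the first six characters of a build ID,
# equivalent to bash's ${BUILD_ID_FULL:0:6} used in the build step.
short_build_id() {
  printf '%.6s\n' "$1"
}

# Illustrative build ID only.
short_build_id 0a1b2c3d-4e5f-6789-abcd-ef0123456789
```

The short form is what the playbook receives as `build=...`, presumably to keep derived resource names short and unique per run.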
