
Commit 29ff7e2

Adding GKE GPU Direct TCPX example
1 parent b9f0419 commit 29ff7e2

3 files changed (+625, -0 lines changed)

# GkeGpuDirectTCPXCluster

This example deploys a [Google Cloud GPU supercomputer](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx) that is accelerator-optimized for scalable, massive models. The RGD is installed by platform administrators, who provide ML infrastructure to self-service teams.

The cluster has:

* Eight NVIDIA H100 GPUs per machine.
* Up to 200 Gbps bandwidth on the primary NIC.
* Secondary NICs (up to four on A3 High machine types), each supporting up to 200 Gbps bandwidth for GPU data transfer.

This deployment maximizes network bandwidth and throughput for high-performance GPU workloads in Google Kubernetes Engine (GKE) Standard clusters by using GPUDirect-TCPX, gVNIC, and multi-networking.

The RGD creates the following resources:

* GKE cluster
* Container Node Pools
* Network
* Subnetwork
* GKE Network and GKENetworkParamSet

Everything related to these resources is hidden from the end user, simplifying their experience.
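
For orientation, GPUDirect-TCPX multi-networking in GKE pairs a `Network` object with a `GKENetworkParamSet` for each secondary NIC. The snippet below is only an illustrative sketch of that pairing; the object names and the referenced VPC/subnet are assumptions, and the actual definitions live inside the RGD.

```yaml
# Illustrative sketch only: the names and the VPC/subnet references are
# assumptions, not copied from rgd.yaml. One such pair exists per secondary NIC.
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: vpc1
spec:
  vpc: gpu-demo-net-1          # secondary VPC created by the RGD (assumed name)
  vpcSubnet: gpu-demo-subnet-1 # matching subnetwork (assumed name)
  deviceMode: NetDevice        # hand the NIC to the pod for GPUDirect-TCPX
---
apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: vpc1
spec:
  type: Device
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: vpc1
```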

![GKE GPU A3Mega](gke-gpudirect-a3mega.png)

<!--
meta {
  title "Gke GpuDirect TCPX Cluster"
}

elements {
  gcp {
    group k8sconfig {
      name "Kubernetes Manifests"
      card kubernetes as config1 {
        name "Network"
      }
      card kubernetes as config2 {
        name "GKENetworkParamSet"
      }
    }

    group Network {
      card firewall as fw1 {
        name "firewall 1"
      }
      card firewall as fw2 {
        name "firewall 2"
      }
      card firewall as fw3 {
        name "firewall 3"
      }
      card firewall as fw4 {
        name "firewall 4"
      }

      card network as net1 {
        name "net 1"
      }
      card network as net2 {
        name "net 2"
      }
      card network as net3 {
        name "net 3"
      }
      card network as net4 {
        name "net 4"
      }
      card network as snet1 {
        name "subnet 1"
      }
      card network as snet2 {
        name "subnet 2"
      }
      card network as snet3 {
        name "subnet 3"
      }
      card network as snet4 {
        name "subnet 4"
      }
    }

    group GKE {
      card gke as cluster {
        name "cluster"
      }

      group default {
        name "Default Nodepool"
        card gke as defaultNodepool {
          name "nodepool"
        }
        card gce as generalVM {
          name "e2-medium"
        }
      }

      group gpu {
        name "GPU Nodepool"
        card gke as gpuNodepool {
          name "nodepool"
        }
        card gce as gpuVM {
          name "a3-highgpu-8g"
        }
        card gpu as nvidia {
          name "Nvidia H100"
        }
      }
    }
  }
}

paths {
  fw1 -\-> net1
  fw2 -\-> net2
  fw3 -\-> net3
  fw4 -\-> net4

  net1 -\-> snet1
  net2 -\-> snet2
  net3 -\-> snet3
  net4 -\-> snet4

  config1 -\-> config2

  defaultNodepool -\-> generalVM
  gpuNodepool -\-> gpuVM
  gpuVM -\-> nvidia

  Network -right-> GKE
  k8sconfig -right-> cluster
}
-->

## End User: GkeGpuDirectTCPXCluster

The administrator needs to install the RGD first.
The end user then creates a `GkeGpuDirectTCPXCluster` resource similar to the following:

```yaml
apiVersion: kro.run/v1alpha1
kind: GkeGpuDirectTCPXCluster
metadata:
  name: gpu-demo
  namespace: config-connector
spec:
  name: gpu-demo # Name used for all resources created as part of this RGD
  location: us-central1 # Region where the GCP resources are created
```
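
Apply the manifest to create the instance; the file name below is only an example:

```
kubectl apply -f instance.yaml
```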

They can then check the status of the applied resource:

```
kubectl get gkegpudirecttcpxcluster
kubectl get gkegpudirecttcpxcluster gpu-demo -n config-connector -o yaml
```
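
Cluster provisioning takes a while; the same command can be run with `-w` to watch the instance as it reconciles:

```
kubectl get gkegpudirecttcpxcluster gpu-demo -n config-connector -w
```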

Navigate to the GKE Clusters page in the GCP Console and verify that the cluster was created.
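
The same check can be done from the command line; this assumes the instance name and region used in the example above:

```
gcloud container clusters describe gpu-demo --region us-central1
```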

Once done, the user can delete the `GkeGpuDirectTCPXCluster` instance:

```
kubectl delete gkegpudirecttcpxcluster gpu-demo -n config-connector
```

## Administrator: ResourceGraphDefinition

The administrator needs to install the RGD in the cluster before end users can consume it:

```
kubectl apply -f rgd.yaml
```
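
`rgd.yaml` defines the full resource graph shown in the diagram above. Purely as an abbreviated sketch of its overall shape (the schema fields come from the end-user example above; the single resource entry and its values are illustrative, not the real definition):

```yaml
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: gkegpudirecttcpxcluster.kro.run
spec:
  schema:
    apiVersion: v1alpha1
    kind: GkeGpuDirectTCPXCluster
    spec:
      name: string     # exposed to the end user
      location: string # region for all GCP resources
  resources:
    # One entry per resource in the graph: networks, subnets, firewalls,
    # the cluster, and the node pools. Only one illustrative entry is shown.
    - id: network1
      template:
        apiVersion: compute.cnrm.cloud.google.com/v1beta1
        kind: ComputeNetwork
        metadata:
          name: ${schema.spec.name}-net-1
        spec:
          autoCreateSubnetworks: false
```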

Validate the RGD is installed correctly:

```
kubectl get rgd gkegpudirecttcpxcluster.kro.run
```

Once all user-created instances are deleted, the administrator can choose to delete the RGD.
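
For example, reusing the RGD name from the validation step above:

```
kubectl delete rgd gkegpudirecttcpxcluster.kro.run
```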