
GkeGpuDirectTCPXCluster

This example deploys a Google Cloud GPU supercomputer that is accelerator-optimized for scalable, massive models. The ResourceGraphDefinition (RGD) is installed by platform administrators, who provide ML infrastructure to self-service teams.

The cluster has:

  • Eight NVIDIA H100 GPUs per machine.
  • Up to 200 Gbps bandwidth on the primary NIC.
  • Secondary NICs (up to four on A3 High machine types), each supporting up to 200 Gbps bandwidth for GPU data transfer.

This deployment maximizes network bandwidth and throughput for high-performance GPU workloads in Google Kubernetes Engine (GKE) Standard clusters by using GPUDirect-TCPX, gVNIC, and multi-networking. The RGD composes the following resources:

  • GKE cluster
  • Container Node Pools
  • Network
  • Subnetwork
  • GKE Network and NetworkParams
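
To give a sense of how the administrator wires these resources together, here is an abbreviated, illustrative sketch of what the RGD could look like. This is not the full rgd.yaml shipped with the example: only two of the five resources are shown, the resource names and field values are assumptions, and the actual definition carries the complete GPUDirect-TCPX networking configuration.

```yaml
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: gkegpudirecttcpxcluster.kro.run
spec:
  # The user-facing schema: end users only ever set these two fields.
  schema:
    apiVersion: v1alpha1
    kind: GkeGpuDirectTCPXCluster
    spec:
      name: string
      location: string
  # The hidden resource graph expanded from each instance.
  resources:
    - id: network
      template:
        apiVersion: compute.cnrm.cloud.google.com/v1beta1
        kind: ComputeNetwork
        metadata:
          name: ${schema.spec.name}-net   # derived from the user's spec.name
        spec:
          autoCreateSubnetworks: false
    - id: nodepool
      template:
        apiVersion: container.cnrm.cloud.google.com/v1beta1
        kind: ContainerNodePool
        metadata:
          name: ${schema.spec.name}-a3-pool
        spec:
          location: ${schema.spec.location}
          nodeConfig:
            machineType: a3-highgpu-8g    # A3 High: 8x NVIDIA H100 per machine
            guestAccelerator:
              - type: nvidia-h100-80gb
                count: 8
            gvnic:
              enabled: true               # gVNIC is required for GPUDirect-TCPX
```

The `${schema.spec.*}` expressions are how kro threads the two user-supplied fields through every underlying resource, which is what lets the end-user API stay this small.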

Everything related to these resources is hidden from the end user, simplifying their experience.

(Architecture diagram: GKE GPU A3Mega)

End User: GkeGpuDirectTCPXCluster

The administrator needs to install the RGD first. The end user then creates a GkeGpuDirectTCPXCluster resource like this:

apiVersion: kro.run/v1alpha1
kind: GkeGpuDirectTCPXCluster
metadata:
  name: gpu-demo
  namespace: config-connector
spec:
  name: gpu-demo        # Name used for all resources created as part of this RGD
  location: us-central1 # Region where the GCP resources are created

They can then check the status of the applied resource:

kubectl get gkegpudirecttcpxcluster
kubectl get gkegpudirecttcpxcluster gpu-demo -n config-connector -o yaml

Navigate to the GKE Clusters page in the GCP Console and verify that the cluster was created.

Once done, the user can delete the GkeGpuDirectTCPXCluster instance:

kubectl delete gkegpudirecttcpxcluster gpu-demo -n config-connector

Administrator: ResourceGraphDefinition

The administrator needs to install the RGD in the cluster before users can consume it:

kubectl apply -f rgd.yaml

Validate the RGD is installed correctly:

kubectl get rgd gkegpudirecttcpxcluster.kro.run

Once all user-created instances are deleted, the administrator can choose to delete the RGD:

kubectl delete rgd gkegpudirecttcpxcluster.kro.run