
GkeGpuDirectTCPXCluster

This example deploys a Google Cloud GPU supercomputer that is accelerator-optimized for scalable, massive models. The ResourceGraphDefinition (RGD) is installed by platform administrators, who provide ML infrastructure to self-service teams.

The cluster has:

  • Eight NVIDIA H100 GPUs per machine.
  • Up to 200 Gbps bandwidth on the primary NIC.
  • Secondary NICs (up to four on A3 High machine types), each supporting up to 200 Gbps bandwidth for GPU data transfer.

This deployment maximizes network bandwidth and throughput for high-performance GPU workloads in Google Kubernetes Engine (GKE) Standard clusters by using GPUDirect-TCPX, gVNIC, and multi-networking. The RGD composes the following resources:

  • GKE cluster
  • Container Node Pools
  • Network
  • Subnetwork
  • GKE Network and NetworkParams
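
To give a sense of how the administrator wires these resources together, here is an abbreviated, illustrative sketch of what the RGD could look like. This is not the full rgd.yaml shipped with the example: only two of the five resources are shown, the resource names and field values are assumptions, and the actual definition carries the complete GPUDirect-TCPX networking configuration.

```yaml
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: gkegpudirecttcpxcluster.kro.run
spec:
  # The user-facing schema: end users only ever set these two fields.
  schema:
    apiVersion: v1alpha1
    kind: GkeGpuDirectTCPXCluster
    spec:
      name: string
      location: string
  # The hidden resource graph expanded from each instance.
  resources:
    - id: network
      template:
        apiVersion: compute.cnrm.cloud.google.com/v1beta1
        kind: ComputeNetwork
        metadata:
          name: ${schema.spec.name}-net   # derived from the user's spec.name
        spec:
          autoCreateSubnetworks: false
    - id: nodepool
      template:
        apiVersion: container.cnrm.cloud.google.com/v1beta1
        kind: ContainerNodePool
        metadata:
          name: ${schema.spec.name}-a3-pool
        spec:
          location: ${schema.spec.location}
          nodeConfig:
            machineType: a3-highgpu-8g    # A3 High: 8x NVIDIA H100 per machine
            guestAccelerator:
              - type: nvidia-h100-80gb
                count: 8
            gvnic:
              enabled: true               # gVNIC is required for GPUDirect-TCPX
```

The `${schema.spec.*}` expressions are how kro threads the two user-supplied fields through every underlying resource, which is what lets the end-user API stay this small.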

Everything related to these resources is hidden from the end user, simplifying their experience.

(Architecture diagram: GKE GPU A3Mega)

End User: GkeGpuDirectTCPXCluster

The administrator needs to install the RGD first. The end user then creates a GkeGpuDirectTCPXCluster resource like this:

apiVersion: kro.run/v1alpha1
kind: GkeGpuDirectTCPXCluster
metadata:
  name: gpu-demo
  namespace: config-connector
spec:
  name: gpu-demo        # Name used for all resources created as part of this RGD
  location: us-central1 # Region where the GCP resources are created

They can then check the status of the applied resource:

kubectl get gkegpudirecttcpxcluster
kubectl get gkegpudirecttcpxcluster gpu-demo -n config-connector -o yaml

Navigate to the GKE Clusters page in the GCP Console and verify that the cluster was created.

Once done, the user can delete the GkeGpuDirectTCPXCluster instance:

kubectl delete gkegpudirecttcpxcluster gpu-demo -n config-connector

Administrator: ResourceGraphDefinition

The administrator needs to install the RGD in the cluster before users can consume it:

kubectl apply -f rgd.yaml

Validate the RGD is installed correctly:

kubectl get rgd gkegpudirecttcpxcluster.kro.run

Once all user-created instances are deleted, the administrator can choose to delete the RGD:

kubectl delete rgd gkegpudirecttcpxcluster.kro.run