This example deploys a Google Cloud GPU supercomputer that is accelerator-optimized for scalable, massive models. The RGD is installed by platform administrators, who provide ML infrastructure to self-service teams.
The cluster has:
- Eight NVIDIA H100 GPUs per machine.
- Up to 200 Gbps bandwidth on the primary NIC.
- Secondary NICs (up to four on A3 High machine types), each supporting up to 200 Gbps bandwidth for GPU data transfer.
This deployment maximizes network bandwidth and throughput for high-performance GPU workloads in Google Kubernetes Engine (GKE) Standard clusters by using GPUDirect-TCPX, gVNIC, and multi-networking.

The RGD creates the following resources:
- GKE cluster
- Container Node Pools
- Network
- Subnetwork
- GKE Network and NetworkParams
Everything related to these resources is hidden from the end user, simplifying their experience.
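The RGD itself is a kro `ResourceGraphDefinition` that wires these resources together behind the simple `GkeGpuDirectTCPXCluster` schema. The sketch below shows its overall shape only; the schema fields and the `ComputeNetwork` template are illustrative assumptions, and the actual `rgd.yaml` defines the full set of resources listed above.

```yaml
# Illustrative sketch, not the actual rgd.yaml.
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: gkegpudirecttcpxcluster.kro.run
spec:
  schema:
    apiVersion: v1alpha1
    kind: GkeGpuDirectTCPXCluster
    spec:
      # Fields exposed to the end user (mirrors the instance spec below).
      name: string
      location: string
  resources:
    - id: network   # illustrative entry; rgd.yaml defines one entry per resource listed above
      template:
        apiVersion: compute.cnrm.cloud.google.com/v1beta1
        kind: ComputeNetwork
        metadata:
          name: ${schema.spec.name}-net
        spec:
          autoCreateSubnetworks: false
    # ...additional entries for the subnetwork, GKE cluster, node pools,
    # and GKE Network / NetworkParams resources.
```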
The administrator needs to install the RGD first.
The end user creates a `GkeGpuDirectTCPXCluster` resource, for example:
```yaml
apiVersion: kro.run/v1alpha1
kind: GkeGpuDirectTCPXCluster
metadata:
  name: gpu-demo
  namespace: config-connector
spec:
  name: gpu-demo        # Name used for all resources created as part of this RGD
  location: us-central1 # Region where the GCP resources are created
```
They can then check the status of the applied resource:
```bash
kubectl get gkegpudirecttcpxcluster -n config-connector
kubectl get gkegpudirecttcpxcluster gpu-demo -n config-connector -o yaml
```
Navigate to the GKE Clusters page in the Google Cloud console and verify that the cluster was created.
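Alternatively, cluster creation can be verified from the command line with gcloud. The commands below assume the active gcloud project is the one Config Connector provisions into, and use the name and region from the spec above:

```bash
gcloud container clusters list --filter="name:gpu-demo"
gcloud container clusters describe gpu-demo --region us-central1
```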
Once done, the user can delete the `GkeGpuDirectTCPXCluster` instance:
```bash
kubectl delete gkegpudirecttcpxcluster gpu-demo -n config-connector
```
The administrator needs to install the RGD in the cluster before users can consume it:
```bash
kubectl apply -f rgd.yaml
```
Validate that the RGD is installed correctly:
```bash
kubectl get rgd gkegpudirecttcpxcluster.kro.run
```
Once all user-created instances are deleted, the administrator can choose to delete the RGD.
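For example, assuming the RGD still carries the name shown by the validation command above, it can be removed with:

```bash
kubectl delete rgd gkegpudirecttcpxcluster.kro.run
```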