Some NVIDIA GPUs allow using a slice of the card via MIG (Multi-Instance GPU) profiles.
This is very useful if you have a big GPU, like an H100 96G, and want to make the most of it.
Instead of giving the full GPU to a single user, it can be sliced into pieces (2-7), so that user A gets a few slices, user B gets a few others, and they share the resource.
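For context, the slicing itself is configured on the node with nvidia-smi. A rough sketch of the admin steps (GPU index and profile ID are illustrative; the available profile set depends on the card, so check the list on your node first):

```
# Enable MIG mode on GPU 0 (may require a GPU reset to take effect)
nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this card supports
nvidia-smi mig -lgip

# Create a GPU instance for a chosen profile ID (pick one from the -lgip output)
# and its compute instance in one step
nvidia-smi mig -cgi <profile-id> -C
```

With the NVIDIA device plugin running on the node, each created slice is then advertised to Kubernetes as an extended resource such as nvidia.com/mig-1g.12gb.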
Now we should have an example of how to request those resources.
In Kubernetes, a pod requesting a full GPU looks like this:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-full
spec:
  restartPolicy: OnFailure
  nodeSelector:
    kubernetes.io/hostname: x2o-k8s-cluster-ai-2
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
And requesting a MIG slice, in this case a 1/7th slice of an H100 (note that each GPU can carry a different set of profiles), can look like this:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mig
spec:
  restartPolicy: OnFailure
  nodeSelector:
    kubernetes.io/hostname: x2o-k8s-cluster-ai-1
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/mig-1g.12gb: 1
Now, how should that be mapped to the config options of the spark-submit job?
Just updating gpu_pod_template.yaml makes the executor pod request both resources:
Limits:
  memory: 5734Mi
  nvidia.com/gpu: 1
  nvidia.com/mig-1g.12gb: 1
Requests:
  cpu: 3
  memory: 5734Mi
  nvidia.com/gpu: 1
  nvidia.com/mig-1g.12gb: 1
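A minimal sketch of what the MIG part of gpu_pod_template.yaml could look like (the container name is illustrative; Spark applies the template to the executor container, then merges its own resource requests on top):

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    # Illustrative name; by default Spark treats the first container
    # in the template as the executor container
    - name: spark-kubernetes-executor
      resources:
        limits:
          nvidia.com/mig-1g.12gb: 1
        requests:
          nvidia.com/mig-1g.12gb: 1
```

The extra nvidia.com/gpu: 1 in the output above most likely comes from spark.executor.resource.gpu.amount with spark.executor.resource.gpu.vendor=nvidia.com being set on the spark-submit command: Spark adds that request on top of whatever the template already declares, so the two resource requests end up combined on the executor pod.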