
Docs / Add to GPU example how to request GPU MIG slices #126

@gustavosr98

Description

Some NVIDIA GPUs allow using a slice of the GPU via MIG profiles.
This is very useful if you have a big GPU, like an H100 96G, and want to optimize its usage efficiency.
Instead of giving the full GPU to a single user, it can be sliced into pieces (2-7); then user A uses a few slices, user B uses a few other slices, and they share the resources.
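For background, MIG slicing is configured on the node with nvidia-smi before Kubernetes can see the slices. A sketch (the profile names and IDs vary by GPU model, and the ID 19 below is a placeholder to replace with one from the listing):

```shell
# List the GPU instance profiles this GPU supports (names like "1g.12gb")
nvidia-smi mig -lgip

# Create two GPU instances from a profile ID taken from the listing above,
# and also create their default compute instances (-C)
sudo nvidia-smi mig -cgi 19,19 -C
```

Once the instances exist, the NVIDIA Kubernetes device plugin (with a MIG strategy enabled) advertises them as resources like nvidia.com/mig-1g.12gb.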

Now, we should have an example of how to request those resources.

In Kubernetes, requesting a full GPU on a pod looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-full
spec:
  restartPolicy: OnFailure
  nodeSelector:
    kubernetes.io/hostname: x2o-k8s-cluster-ai-2
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1

And a request for a MIG slice, here a 1/7th slice of an H100 (note that each GPU can expose different MIG profiles), can look like this:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-mig
spec:
  restartPolicy: OnFailure
  nodeSelector:
    kubernetes.io/hostname: x2o-k8s-cluster-ai-1
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/mig-1g.12gb: 1

Now, how should that be mapped to the config options for the spark-submit job?
Just updating gpu_pod_template.yaml makes the executor pod request both resources:

    Limits:
      memory:                  5734Mi
      nvidia.com/gpu:          1
      nvidia.com/mig-1g.12gb:  1
    Requests:
      cpu:                     3
      memory:                  5734Mi
      nvidia.com/gpu:          1
      nvidia.com/mig-1g.12gb:  1
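One plausible mapping, sketched below as a minimal example rather than a verified answer for this repo: request the MIG slice only in the pod template, and do not also set the spark.executor.resource.gpu.* properties (which is what adds the nvidia.com/gpu request), so the executor pod ends up with a single GPU resource. The file name, container name, and cluster address here are assumptions:

```shell
# Sketch: executor pod template that requests ONLY the MIG slice,
# not nvidia.com/gpu (file name gpu_pod_template.yaml is from the issue above)
cat > gpu_pod_template.yaml <<'EOF'
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: executor            # container Spark merges the executor spec into
      resources:
        limits:
          nvidia.com/mig-1g.12gb: 1
EOF

# Sketch: pass the template instead of (not in addition to) the
# spark.executor.resource.gpu.* properties; <api-server> is a placeholder
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.executor.podTemplateFile=gpu_pod_template.yaml \
  ...
```

The design question to settle in the docs is whether Spark's own GPU resource properties are needed at all when the pod template carries the request; if they are, the duplicate nvidia.com/gpu limit shown above is the expected symptom of setting both.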
