Enabling GPU time slicing

To enable GPU time slicing in {productname-short}, you must configure the NVIDIA GPU Operator to allow multiple workloads to share a single GPU.

Prerequisites
  • You have logged in to {openshift-platform}.

  • You have the cluster-admin role in {openshift-platform}.

  • You have installed and configured the NVIDIA GPU Operator.

  • The relevant nodes in your deployment contain NVIDIA GPUs.

  • The GPU in your deployment supports time slicing.

  • You installed the OpenShift command line interface (oc) as described in Installing the OpenShift CLI.

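You can optionally confirm that the NVIDIA GPU Operator components are running before you begin. This is a minimal check, assuming the default nvidia-gpu-operator namespace used elsewhere in this procedure:

    oc get pods -n nvidia-gpu-operator
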
Procedure
  1. Create a config map named time-slicing-config in the namespace that is used by the GPU operator. For NVIDIA GPUs, this is the nvidia-gpu-operator namespace.

    1. Log in to the {openshift-platform} web console as a cluster administrator.

    2. In the Administrator perspective, navigate to Workloads → ConfigMaps.

    3. On the ConfigMaps page, click Create ConfigMap.

    4. On the Create ConfigMap page, for Configure via, select YAML view.

    5. In the Data field, enter the YAML code for the relevant GPU. Here is an example of a time-slicing-config config map for an NVIDIA T4 GPU:

      Note
      • You can change the number of replicas to control the number of GPU slices available for each physical GPU.

      • Increasing replicas might increase the risk of Out of Memory (OOM) errors if workloads exceed available GPU memory.

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: time-slicing-config
        namespace: nvidia-gpu-operator
      data:
        tesla-t4: |-
          version: v1
          flags:
            migStrategy: none
          sharing:
            timeSlicing:
              renameByDefault: false
              failRequestsGreaterThanOne: false
              resources:
                - name: nvidia.com/gpu
                  replicas: 4
    6. Click Create.
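
      If you prefer the command line, you can create the same config map by saving the YAML to a file and applying it with oc. This is a minimal sketch; the file name time-slicing-config.yaml is an assumption:

        oc apply -f time-slicing-config.yaml -n nvidia-gpu-operator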

  2. Update the gpu-cluster-policy cluster policy to reference the time-slicing-config config map:

    1. In the Administrator perspective, navigate to Operators → Installed Operators.

    2. Search for the NVIDIA GPU Operator, and then click the Operator name to open the Operator details page.

    3. Click the ClusterPolicy tab.

    4. Select the gpu-cluster-policy resource from the list to open the ClusterPolicy details page.

    5. Click the YAML tab and update the spec.devicePlugin section to reference the time-slicing-config config map. Here is an example of a gpu-cluster-policy cluster policy for an NVIDIA T4 GPU:

      apiVersion: nvidia.com/v1
      kind: ClusterPolicy
      metadata:
        name: gpu-cluster-policy
      spec:
        devicePlugin:
          config:
            default: tesla-t4
            name: time-slicing-config
    6. Click Save.
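
      Alternatively, you can make the same change from the CLI with a merge patch. This is a minimal sketch of an equivalent oc patch command; adjust the default value so that it matches the key in your config map:

        oc patch clusterpolicy gpu-cluster-policy --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "tesla-t4"}}}}'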

  3. Label the relevant machine set to apply time slicing:

    1. In the Administrator perspective, navigate to Compute → MachineSets.

    2. Select the machine set for GPU time slicing from the list.

    3. On the MachineSet details page, click the YAML tab and update the spec.template.spec.metadata.labels section to label the relevant machine set. Here is an example of a machine set with the appropriate machine label for an NVIDIA T4 GPU:

        spec:
          template:
            spec:
              metadata:
                labels:
                  nvidia.com/device-plugin.config: tesla-t4
    4. Click Save.
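
      The machine set label applies only to machines that the machine set provisions after you save the change. To apply time slicing to an existing GPU node without reprovisioning it, you can label the node directly. This is a sketch; <gpu-node-name> is a placeholder for your node name:

        oc label node <gpu-node-name> nvidia.com/device-plugin.config=tesla-t4 --overwrite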

Verification
  1. Log in to the OpenShift CLI.

  2. Verify that you have applied the config map correctly:

    oc get configmap time-slicing-config -n nvidia-gpu-operator -o yaml
  3. Check that the cluster policy includes the time-slicing configuration:

    oc get clusterpolicy gpu-cluster-policy -o yaml
  4. Ensure that the label is applied to nodes:

    oc get nodes --show-labels | grep nvidia.com/device-plugin.config
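  5. Optionally, confirm that a GPU node advertises the expected number of GPU replicas. With the example config map, a node with a single physical GPU should report a nvidia.com/gpu capacity of 4. Replace <gpu-node-name> with a node name from the previous step:

    oc describe node <gpu-node-name> | grep nvidia.com/gpu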
Note

If workloads do not appear to be sharing the GPU, verify that the NVIDIA device plugin is running and that the correct labels are applied.
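
For example, you can list the device plugin pods; this is a minimal check, assuming the default nvidia-gpu-operator namespace:

    oc get pods -n nvidia-gpu-operator | grep device-plugin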