Enabling NVIDIA GPUs

Before you can use NVIDIA GPUs in {productname-short}, you must install the NVIDIA GPU Operator.

Procedure
  1. To enable GPU support on an OpenShift cluster, follow the instructions in NVIDIA GPU Operator on {org-name} OpenShift Container Platform in the NVIDIA documentation.

    Important

    After you install the Node Feature Discovery (NFD) Operator, you must create an instance of NodeFeatureDiscovery. In addition, after you install the NVIDIA GPU Operator, you must create a ClusterPolicy and populate it with default values.

  2. Delete the migration-gpu-status ConfigMap.

    1. Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate ConfigMap.

    2. Search for the migration-gpu-status ConfigMap.

    3. Click the action menu (⋮) and select Delete ConfigMap from the list.

      The Delete ConfigMap dialog appears.

    4. Inspect the dialog and confirm that you are deleting the correct ConfigMap.

    5. Click Delete.

  3. Restart the dashboard replicaset.

    1. Click Workloads → Deployments.

    2. Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate deployment.

    3. Search for the rhods-dashboard deployment.

    4. Click the action menu (⋮) and select Restart Rollout from the list.

    5. Wait until the Status column indicates that all pods in the rollout have fully restarted.
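
The console steps above can also be approximated from the command line. The following is a minimal sketch, not the documented procedure: it assumes you are logged in with `oc`, and that both the migration-gpu-status ConfigMap and the rhods-dashboard deployment live in the redhat-ods-applications project. The function names are hypothetical helpers introduced for illustration.

```shell
#!/bin/sh
# Assumption: the ConfigMap and the dashboard deployment are in this project.
NAMESPACE="redhat-ods-applications"

# Hypothetical helper: list the resources that the Important note in step 1
# says must exist (a NodeFeatureDiscovery instance and a ClusterPolicy).
check_gpu_prereqs() {
    oc get nodefeaturediscovery --all-namespaces
    oc get clusterpolicy
}

# Hypothetical helper: CLI approximation of steps 2 and 3.
reset_gpu_migration() {
    # Step 2: delete the ConfigMap; do not fail if it is already gone.
    oc delete configmap migration-gpu-status -n "$NAMESPACE" --ignore-not-found

    # Step 3: restart the dashboard deployment, then block until the
    # rollout has finished.
    oc rollout restart deployment/rhods-dashboard -n "$NAMESPACE"
    oc rollout status deployment/rhods-dashboard -n "$NAMESPACE"
}
```

To use the sketch, source the file and run `check_gpu_prereqs` followed by `reset_gpu_migration`. The web console procedure remains the reference; the CLI form is mainly useful for scripted or repeated setups.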

Verification
  • The reset migration-gpu-status instance is present on the Instances tab of the AcceleratorProfile custom resource definition (CRD) details page.

  • From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:

    • NVIDIA GPU

    • Node Feature Discovery (NFD)

    • Kernel Module Management (KMM)

  • The GPU is correctly detected a few minutes after full installation of the Node Feature Discovery (NFD) and NVIDIA GPU Operators. The {openshift-platform} command line interface (CLI) displays the appropriate output for the GPU worker node. For example:

    # Expected output when the GPU is detected properly
    oc describe node <node name>
    ...
    Capacity:
      cpu:                4
      ephemeral-storage:  313981932Ki
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             16076568Ki
      nvidia.com/gpu:     1
      pods:               250
    Allocatable:
      cpu:                3920m
      ephemeral-storage:  288292006229
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             12828440Ki
      nvidia.com/gpu:     1
      pods:               250
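
For scripted checks, the GPU lines can be pulled out of that output instead of being read by eye. The sketch below assumes the `oc describe node` layout shown above; `gpu_counts` is a hypothetical helper name, not part of `oc`. It reads the describe output on standard input and prints one line per section that reports an `nvidia.com/gpu` value.

```shell
#!/bin/sh
# Hypothetical helper: extract nvidia.com/gpu values from `oc describe node`
# output supplied on stdin. Prints lines such as "Capacity=1".
gpu_counts() {
    awk '
        # Remember which section of the describe output we are in.
        /^[[:space:]]*Capacity:/    { section = "Capacity" }
        /^[[:space:]]*Allocatable:/ { section = "Allocatable" }
        # Report the GPU count for the current section.
        $1 == "nvidia.com/gpu:" && section != "" { print section "=" $2 }
    '
}
```

A typical invocation would be `oc describe node <node name> | gpu_counts`. Empty output suggests that the node does not yet advertise a GPU, which can happen in the first few minutes after the Operators are installed.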

After installing the NVIDIA GPU Operator, create an accelerator profile as described in Working with accelerator profiles.