# NVIDIA DRA Driver for GPUs

The [NVIDIA DRA Driver](https://github.com/NVIDIA/k8s-dra-driver-gpu) enables Dynamic Resource Allocation (DRA) for GPUs in Kubernetes 1.32+. This pack works with Palette to provide flexible GPU allocation using DeviceClass and ResourceClaim resources, replacing the traditional device plugin approach with a modern, CEL-based device selection mechanism.

## Prerequisites

- Kubernetes 1.32 or newer (DRA is GA in 1.34+).
- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) 25.3.0+ for driver management and CDI support.
- CDI enabled in the container runtime (containerd/CRI-O). If you manage drivers with the GPU Operator, see the values sketch after this list.
- [Node Feature Discovery](https://kubernetes-sigs.github.io/node-feature-discovery/) (NFD) for GPU detection.
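
The DRA driver relies on CDI to inject GPU devices into containers. If you deploy drivers with the GPU Operator, CDI can be turned on through the operator's Helm values instead of editing the runtime configuration by hand. A minimal sketch, assuming the standard GPU Operator chart values `cdi.enabled` and `cdi.default` (verify against the GPU Operator version you deploy):

```yaml
# GPU Operator Helm values (sketch): enable CDI so containerd/CRI-O can
# inject GPU devices from the generated CDI specifications.
cdi:
  enabled: true   # generate CDI specs and enable CDI support
  default: true   # use CDI as the default mechanism for device injection
```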

## Parameters

To deploy the NVIDIA DRA Driver, you can configure the following parameters in the pack's YAML.

| **Name** | **Description** | **Type** | **Default Value** | **Required** |
|----------|-----------------|----------|-------------------|--------------|
| `nvidiaDriverRoot` | Path to the NVIDIA driver installation. Use `/run/nvidia/driver` with the GPU Operator, `/` for host-installed drivers. | String | `/run/nvidia/driver` | No |
| `resources.gpus.enabled` | Enable GPU allocation via DRA. | Boolean | `true` | No |
| `resources.computeDomains.enabled` | Enable ComputeDomains for Multi-Node NVLink (MNNVL) on GB200 systems. | Boolean | `false` | No |
| `image.tag` | DRA driver image tag. | String | `v25.8.1` | No |
| `logVerbosity` | Log verbosity level (0-7, higher is more verbose). | String | `4` | No |
| `webhook.enabled` | Enable the admission webhook for advanced validation. | Boolean | `false` | No |

Refer to the [NVIDIA DRA Driver Helm chart](https://github.com/NVIDIA/k8s-dra-driver-gpu) for the complete list of configurable parameters.
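
For reference, the following is a values sketch that sets the parameters from the table above in the pack's YAML. The `charts.nvidia-dra-driver-gpu` wrapper follows the usage example later on this page; adjust the values for your environment.

```yaml
charts:
  nvidia-dra-driver-gpu:
    nvidiaDriverRoot: /run/nvidia/driver   # "/" for host-installed drivers
    logVerbosity: "4"
    resources:
      gpus:
        enabled: true              # allocate GPUs through DRA
      computeDomains:
        enabled: false             # enable only on MNNVL systems such as GB200
    webhook:
      enabled: false               # optional admission webhook
    image:
      tag: v25.8.1
```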

## Upgrade

N/A - This is the initial release of the NVIDIA DRA Driver pack.

## Usage

To use the NVIDIA DRA Driver pack, first create a new [add-on cluster profile](https://docs.spectrocloud.com/profiles/cluster-profiles/create-cluster-profiles/create-addon-profile/), search for the **NVIDIA DRA Driver for GPUs** pack, and configure the driver root path based on your environment:

```yaml
charts:
  nvidia-dra-driver-gpu:
    nvidiaDriverRoot: /run/nvidia/driver # Use "/" if drivers are installed on the host
```

After installation, the DRA driver creates:

- A default `DeviceClass` named `gpu.nvidia.com` (sketched below)
- `ResourceSlice` objects representing available GPUs on each node
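
The generated DeviceClass is what workload claims reference through `deviceClassName: gpu.nvidia.com`. Roughly, it selects every device advertised by the NVIDIA driver; the following is a sketch of its shape, and the exact selector expression shipped by the driver may differ between releases:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      # Match devices published by the NVIDIA DRA driver.
      expression: device.driver == 'gpu.nvidia.com'
```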

To request a GPU for your workload, create a ResourceClaimTemplate and reference it in your Pod. Click on the **Add Manifest** button to create a new manifest layer with the following content:
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: ubuntu:22.04            # example image; replace with your GPU workload
    command: ["sleep", "3600"]
    resources:
      claims:
      - name: gpu
  resourceClaims:                  # claim generated from the ResourceClaimTemplate above
  - name: gpu
    resourceClaimTemplateName: gpu-claim
```
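
The template above requests any available GPU. Because device selection is CEL-based, a request can also filter on device attributes. The following is a sketch that assumes the flat request structure shown above and a `productName` attribute published by the driver; attribute names vary by driver release, so inspect the `ResourceSlice` objects on your cluster for the authoritative list:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: a100-claim               # hypothetical name for this sketch
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            # Only match GPUs whose product name contains "A100".
            expression: device.attributes['gpu.nvidia.com'].productName.contains('A100')
```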

Once you have configured the NVIDIA DRA Driver pack, you can add it to an existing cluster profile, include it in a new add-on profile, or attach it as an add-on layer to a deployed cluster.

## References
- [NVIDIA DRA Driver Documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-intro-install.html)
- [Kubernetes DRA Documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
- [NVIDIA DRA Driver on GitHub](https://github.com/NVIDIA/k8s-dra-driver-gpu)
- [NVIDIA GPU Operator Documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html)