
Commit d628baa

committed
forde-commit-1770160653
1 parent f3a2ad0 commit d628baa

File tree

2 files changed

+35
-39
lines changed

Lines changed: 35 additions & 39 deletions
@@ -1,56 +1,52 @@
 # NVIDIA DRA Driver for GPUs
 
-This pack installs the NVIDIA Dynamic Resource Allocation (DRA) Driver for GPUs, enabling flexible GPU allocation in Kubernetes 1.32+.
+The [NVIDIA DRA Driver](https://github.com/NVIDIA/k8s-dra-driver-gpu) enables Dynamic Resource Allocation (DRA) for GPUs in Kubernetes 1.32+. This pack works with Palette to provide flexible GPU allocation using DeviceClass and ResourceClaim resources, replacing the traditional device plugin approach with a modern, CEL-based device selection mechanism.
 
-## Overview
-
-DRA is a Kubernetes feature that provides flexible request and sharing of hardware resources like GPUs. The NVIDIA DRA Driver replaces the traditional NVIDIA device plugin approach with a more modern, CEL-based device selection mechanism.
 
 ## Prerequisites
 
-- Kubernetes 1.32 or newer (DRA is GA in 1.34+)
-- NVIDIA GPU Operator 25.3.0+ (for driver management and CDI support)
-- CDI enabled in the container runtime (containerd/CRI-O)
-- Node Feature Discovery (NFD) for GPU detection
+- Kubernetes 1.32 or newer (DRA is GA in 1.34+).
+- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) 25.3.0+ for driver management and CDI support.
+- CDI enabled in the container runtime (containerd/CRI-O).
+- [Node Feature Discovery](https://kubernetes-sigs.github.io/node-feature-discovery/) (NFD) for GPU detection.
 
-## Key Features
 
-- **Dynamic GPU Allocation**: Request GPUs using DeviceClass and ResourceClaim resources
-- **CEL-based Selection**: Filter GPUs by attributes using Common Expression Language
-- **GPU Sharing**: Multiple pods can share access to the same GPU
-- **ComputeDomains**: Support for Multi-Node NVLink (MNNVL) on GB200 systems
+## Parameters
 
-## Configuration
+To deploy the NVIDIA DRA Driver, you can configure the following parameters in the pack's YAML.
 
-### Driver Root Path
+| **Name** | **Description** | **Type** | **Default Value** | **Required** |
+|---|---|---|---|---|
+| `nvidiaDriverRoot` | Path to the NVIDIA driver installation. Use `/run/nvidia/driver` with the GPU Operator, or `/` for host-installed drivers. | String | `/run/nvidia/driver` | No |
+| `resources.gpus.enabled` | Enable GPU allocation via DRA. | Boolean | `true` | No |
+| `resources.computeDomains.enabled` | Enable ComputeDomains for Multi-Node NVLink (MNNVL) on GB200 systems. | Boolean | `false` | No |
+| `image.tag` | DRA driver image tag. | String | `v25.8.1` | No |
+| `logVerbosity` | Log verbosity level (0-7; higher is more verbose). | String | `4` | No |
+| `webhook.enabled` | Enable the admission webhook for advanced validation. | Boolean | `false` | No |
 
-When using with GPU Operator (recommended):
-```yaml
-nvidiaDriverRoot: /run/nvidia/driver
-```
+Refer to the [NVIDIA DRA Driver Helm chart](https://github.com/NVIDIA/k8s-dra-driver-gpu) for the complete list of configurable parameters.
 
-When drivers are installed directly on host:
-```yaml
-nvidiaDriverRoot: /
-```
 
-### Enable/Disable Resources
+## Upgrade
+
+N/A - This is the initial release of the NVIDIA DRA Driver pack.
 
-```yaml
-resources:
-  gpus:
-    enabled: true # Enable GPU allocation
-  computeDomains:
-    enabled: false # Enable for MNNVL systems
-```
 
 ## Usage
 
+To use the NVIDIA DRA Driver pack, first create a new [add-on cluster profile](https://docs.spectrocloud.com/profiles/cluster-profiles/create-cluster-profiles/create-addon-profile/), search for the **NVIDIA DRA Driver for GPUs** pack, and configure the driver root path based on your environment:
+
+```yaml
+charts:
+  nvidia-dra-driver-gpu:
+    nvidiaDriverRoot: /run/nvidia/driver # Use "/" if drivers are installed on the host
+```
+
 After installation, the DRA driver creates:
 - A default `DeviceClass` named `gpu.nvidia.com`
 - `ResourceSlice` objects representing available GPUs on each node
 
-### Example ResourceClaimTemplate
+To request a GPU for your workload, create a ResourceClaimTemplate and reference it in your Pod. Click the **Add Manifest** button to create a new manifest layer with the following content:
 
 ```yaml
 apiVersion: resource.k8s.io/v1
@@ -63,11 +59,7 @@ spec:
       requests:
       - name: gpu
         deviceClassName: gpu.nvidia.com
-```
-
-### Example Pod Using DRA
-
-```yaml
+---
 apiVersion: v1
 kind: Pod
 metadata:
@@ -84,8 +76,12 @@ spec:
       resourceClaimTemplateName: gpu-claim
 ```
 
-## Documentation
+Once you have configured the NVIDIA DRA Driver pack, you can add it to an existing cluster profile, as an add-on profile, or as a new add-on layer to a deployed cluster.
+
+
+## References
 
 - [NVIDIA DRA Driver Documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-intro-install.html)
 - [Kubernetes DRA Documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
-- [GitHub Repository](https://github.com/NVIDIA/k8s-dra-driver-gpu)
+- [NVIDIA DRA Driver on GitHub](https://github.com/NVIDIA/k8s-dra-driver-gpu)
+- [NVIDIA GPU Operator Documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html)
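Editor's note: the new README's Parameters table lists the chart values individually. For reference, a combined pack YAML might look like the sketch below; the nesting under `charts.nvidia-dra-driver-gpu` follows the README's own usage example, but treat the exact key layout as an assumption to verify against the Helm chart's values file:

```yaml
charts:
  nvidia-dra-driver-gpu:
    nvidiaDriverRoot: /run/nvidia/driver # "/" for host-installed drivers
    logVerbosity: "4"                    # 0-7; higher is more verbose
    resources:
      gpus:
        enabled: true                    # allocate GPUs via DRA
      computeDomains:
        enabled: false                   # enable only on MNNVL (GB200) systems
    webhook:
      enabled: false                     # optional admission webhook
```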
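Editor's note: both versions of the README mention CEL-based device selection, but neither shows a selector. A minimal sketch follows, with the field layout mirroring the README's own ResourceClaimTemplate example; `selectors`/`cel` come from the Kubernetes DRA API, while the `gpu.nvidia.com` attribute name and `productName` key are assumptions to check against the attributes the driver actually publishes in its ResourceSlice objects:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: a100-claim # hypothetical name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            # Assumed attribute name; match only A100-class devices
            expression: device.attributes["gpu.nvidia.com"].productName.matches("A100")
```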

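Editor's note: the removed Key Features section stated that multiple pods can share access to the same GPU. In DRA that is typically expressed with a standalone ResourceClaim referenced by name (`resourceClaimName`) rather than a per-pod template. A sketch, with hypothetical names and an illustrative container image:

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: consumer-a
spec:
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 # illustrative image
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-gpu # additional pods referencing this claim share the device
```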