Commit eee64b5

Add information for nvidia gpu 25.10

Signed-off-by: manuelbuil <[email protected]>
1 parent: 1e8316a

File tree: 1 file changed, +36 −4 lines

docs/add-ons/gpu_operators.md

Lines changed: 36 additions & 4 deletions
@@ -48,6 +48,10 @@ The following three commands should return a correct output if the kernel driver
 ### Operator installation ###
 
 Once the OS is ready and RKE2 is running, install the GPU Operator with the following yaml manifest:
+
+<Tabs groupId="GPUoperator" queryString>
+<TabItem value="v25.3.x">
+
 ```yaml
 apiVersion: helm.cattle.io/v1
 kind: HelmChart
@@ -74,12 +78,40 @@ spec:
       - name: DEVICE_LIST_STRATEGY
         value: volume-mounts
 ```
-:::warning
-The NVIDIA operator restarts containerd with a hangup call, which restarts RKE2
+:::info
+The envvars `ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED`, `ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS` and `DEVICE_LIST_STRATEGY` are required to properly isolate GPU resources, as explained in this NVIDIA [doc](https://docs.google.com/document/d/1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8/edit?tab=t.0)
 :::
 
+</TabItem>
+<TabItem value="v25.10.x" default>
+
+```yaml
+apiVersion: helm.cattle.io/v1
+kind: HelmChart
+metadata:
+  name: gpu-operator
+  namespace: kube-system
+spec:
+  repo: https://helm.ngc.nvidia.com/nvidia
+  chart: gpu-operator
+  version: v25.10.0
+  targetNamespace: gpu-operator
+  createNamespace: true
+  valuesContent: |-
+    toolkit:
+      env:
+      - name: CONTAINERD_SOCKET
+        value: /run/k3s/containerd/containerd.sock
+```
+
 :::info
-The envvars `ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED`, `ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS` and `DEVICE_LIST_STRATEGY` are required to properly isolate GPU resources, as explained in this NVIDIA [doc](https://docs.google.com/document/d/1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8/edit?tab=t.0)
+NVIDIA GPU Operator v25.10.x uses the [Container Device Interface (CDI) specification](https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md), which simplifies operations: no extra envvars are needed to meet the security requirements, and workloads no longer need to set `runtimeClassName: nvidia`
+:::
+</TabItem>
+</Tabs>
+
+:::warning
+The NVIDIA operator restarts containerd with a hangup call, which restarts RKE2
 :::
 
 After one minute approximately, you can make the following checks to verify that everything worked as expected:
@@ -121,7 +153,7 @@ After one minute approximately, you can make the following checks to verify that
   namespace: default
 spec:
   restartPolicy: OnFailure
-  runtimeClassName: nvidia
+  # runtimeClassName: nvidia <== Only needed for v25.3.x
   containers:
   - name: cuda-container
     image: nvcr.io/nvidia/k8s/cuda-sample:nbody
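For context, the pod touched by the last hunk can be read as part of a complete CUDA test manifest. The sketch below is a reconstruction: only the fields visible in the diff (namespace, restart policy, the commented `runtimeClassName`, container name and image) are confirmed; the pod name, the GPU resource limit, and the comments are assumptions for illustration.

```yaml
# Hypothetical reconstruction of the CUDA nbody test pod.
# Fields not shown in the diff above are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-test        # assumed name, not shown in the diff
  namespace: default
spec:
  restartPolicy: OnFailure
  # runtimeClassName: nvidia <== Only needed for v25.3.x
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    resources:
      limits:
        nvidia.com/gpu: 1     # assumed: request one GPU from the device plugin
```

With the v25.10.x operator and CDI, the pod schedules onto a GPU node without the explicit runtime class; with v25.3.x, uncommenting `runtimeClassName: nvidia` remains necessary.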
