docs/add-ons/gpu_operators.md
36 additions, 4 deletions
@@ -48,6 +48,10 @@ The following three commands should return a correct output if the kernel driver
 ### Operator installation ###
 
 Once the OS is ready and RKE2 is running, install the GPU Operator with the following yaml manifest:
+
+<Tabs groupId="GPUoperator" queryString>
+<TabItem value="v25.3.x">
+
 ```yaml
 apiVersion: helm.cattle.io/v1
 kind: HelmChart
@@ -74,12 +78,40 @@ spec:
       - name: DEVICE_LIST_STRATEGY
         value: volume-mounts
 ```
-:::warning
-The NVIDIA operator restarts containerd with a hangup call which restarts RKE2
+:::info
+The envvars `ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED`, `ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS` and `DEVICE_LIST_STRATEGY` are required to properly isolate GPU resources as explained in this nvidia [doc](https://docs.google.com/document/d/1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8/edit?tab=t.0)
 :::
 
+</TabItem>
+<TabItem value="v25.10.x" default>
+
+```yaml
+apiVersion: helm.cattle.io/v1
+kind: HelmChart
+metadata:
+  name: gpu-operator
+  namespace: kube-system
+spec:
+  repo: https://helm.ngc.nvidia.com/nvidia
+  chart: gpu-operator
+  version: v25.10.0
+  targetNamespace: gpu-operator
+  createNamespace: true
+  valuesContent: |-
+    toolkit:
+      env:
+      - name: CONTAINERD_SOCKET
+        value: /run/k3s/containerd/containerd.sock
+```
+
 :::info
-The envvars `ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED`, `ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS` and `DEVICE_LIST_STRATEGY` are required to properly isolate GPU resources as explained in this nvidia [doc](https://docs.google.com/document/d/1zy0key-EL6JH50MZgwg96RPYxxXXnVUdxLZwGiyqLd8/edit?tab=t.0)
+NVIDIA GPU Operator v25.10.x uses the [Container Device Interface (CDI) specification](https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md), which simplifies operations: no extra envvars are needed to comply with the security requirements, and workloads no longer need to set `runtimeClassName: nvidia`
+:::
+</TabItem>
+</Tabs>
+
+:::warning
+The NVIDIA operator restarts containerd with a hangup call which restarts RKE2
 :::
 
 After one minute approximately, you can make the following checks to verify that everything worked as expected:
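
For context, both tabs install the operator through an RKE2 `HelmChart` resource; a minimal sketch of how such a manifest is typically deployed follows. The file name is a placeholder, the paths assume a default RKE2 server install, and the surrounding doc may prescribe a different workflow:

```bash
# Placeholder file containing the HelmChart manifest from the tab above.
# RKE2 auto-applies manifests dropped into its server manifests directory.
sudo cp gpu-operator-helmchart.yaml /var/lib/rancher/rke2/server/manifests/

# Or apply it directly against the cluster with kubectl.
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
kubectl apply -f gpu-operator-helmchart.yaml

# Watch the operator pods come up in the target namespace.
kubectl -n gpu-operator get pods -w
```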
@@ -121,7 +153,7 @@ After one minute approximately, you can make the following checks to verify that
   namespace: default
 spec:
   restartPolicy: OnFailure
-  runtimeClassName: nvidia
+  # runtimeClassName: nvidia <== Only needed for v25.3.x
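
Building on the test pod above, a hedged sketch of a quick smoke test follows; the manifest file and pod name below are placeholders standing in for the full test-pod example in the doc:

```bash
# Placeholder file/pod names; substitute the ones from the doc's test-pod manifest.
kubectl apply -f cuda-test-pod.yaml
kubectl -n default get pod cuda-test-pod
kubectl -n default logs cuda-test-pod

# The GPU should also be advertised as an allocatable resource on the node.
kubectl describe node | grep -i 'nvidia.com/gpu'
```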