# Troubleshooting NVIDIA GPU access in an ephemeral container with CDI enabled

abcdesktop runs applications as an `ephemeral container` or a `pod`. NVIDIA adds support for the [Container Device Interface](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html).

## Installed packages

nvidia-container-toolkit=1.18.1-1

```
libnvidia-container-tools 1.18.1-1 amd64 NVIDIA container runtime library (command-line tools)
nvidia-container-toolkit 1.18.1-1 amd64 NVIDIA Container toolkit
nvidia-container-toolkit-base 1.18.1-1 amd64 NVIDIA Container Toolkit Base
```
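
As a quick sanity check, you can ask the toolkit itself which CDI devices it knows about. This is a minimal sketch, assuming the `nvidia-ctk` CLI shipped with `nvidia-container-toolkit-base` is on the host `PATH` and a CDI specification has already been generated:

```bash
# Show the installed NVIDIA Container Toolkit CLI version
nvidia-ctk --version

# List the CDI device names known from the generated CDI specifications
# (for example nvidia.com/gpu=0 or nvidia.com/gpu=all)
nvidia-ctk cdi list
```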

nvidia/gpu-operator version=v25.10.1

```bash
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=v25.10.1 --set driver.enabled=false --set toolkit.enabled=false
```

> Note: in this case the NVIDIA driver and the NVIDIA Container Toolkit are already installed on the host, hence `driver.enabled=false` and `toolkit.enabled=false`
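
Before going further, it helps to confirm that the operator pods are healthy. A minimal check, assuming the operator was installed in the `gpu-operator` namespace as above:

```bash
# All gpu-operator pods should be Running or Completed
kubectl get pods -n gpu-operator

# The device plugin must advertise the GPU resource on the node
kubectl describe node | grep -i "nvidia.com/gpu"
```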

Enable CDI in the GPU operator `cluster-policy`

```bash
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/cdi/enabled", "value":true}]'
```

Read the `clusterpolicies.nvidia.com/cluster-policy`

```bash
kubectl get clusterpolicies.nvidia.com/cluster-policy -o json | jq '.spec.cdi'
```

You should get

```json
{
  "default": false,
  "enabled": true
}
```
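
If you want to check the result on the node itself, the generated CDI specifications are plain files. This is a hedged check, assuming the default CDI specification directories `/etc/cdi` and `/var/run/cdi` are used:

```bash
# Run on the Kubernetes node that hosts the GPU:
# the generated spec file describes the nvidia devices injected by CDI
ls -la /etc/cdi /var/run/cdi 2>/dev/null
```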

## Troubleshooting with a `simple-pod` requesting `nvidia.com/gpu`

- Create a file `simple-pod.yaml`

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-pod
spec:
  containers:
    - name: cuda-container
      image: ubuntu
      command: ["sh", "-c", "sleep 3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
  restartPolicy: Never
```

- Create `simple-pod`

```bash
kubectl create -f simple-pod.yaml
pod/simple-pod created
```
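
Before exec'ing into the pod, make sure it is actually scheduled and that the GPU resource was granted. A minimal sketch, assuming the pod runs in the `default` namespace:

```bash
# Wait until the container is up
kubectl wait --for=condition=Ready pod/simple-pod --timeout=120s

# Confirm the nvidia.com/gpu request and limit were accepted
kubectl describe pod simple-pod | grep -i "nvidia.com/gpu"
```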

- Exec command in `simple-pod`

List devices

```bash
kubectl exec -it simple-pod -- ls -la /dev/dri
```

The devices `/dev/dri/card1` and `/dev/dri/renderD128` are listed

```
total 0
drwxr-xr-x 2 root root 80 Jan 12 17:16 .
drwxr-xr-x 6 root root 480 Jan 12 17:16 ..
crw-rw---- 1 root root 226, 1 Jan 12 17:16 card1
crw-rw---- 1 root root 226, 128 Jan 12 17:16 renderD128
```

Get the GPU UUID from the `nvidia-smi` command line

```bash
kubectl exec -it simple-pod -- nvidia-smi -L
```

In my case it returns

```
GPU 0: Quadro M620 (UUID: GPU-b5aebea9-8a25-fb21-631b-7e5da5a60ccb)
```

### Create an ephemeral container inside `simple-pod`

Create a yaml file `custom-profile-nvidia-gpu.yaml` and replace `GPU-b5aebea9-8a25-fb21-631b-7e5da5a60ccb` with your own GPU UUID

```yaml
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: GPU-b5aebea9-8a25-fb21-631b-7e5da5a60ccb
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: all
```

Run a debug ephemeral container in `simple-pod`

```bash
kubectl debug -it simple-pod --image=ubuntu --target=cuda-container --profile=general --custom=custom-profile-nvidia-gpu.yaml -- nvidia-smi -L
```

> This command fails: `/usr/bin/nvidia-smi` doesn't exist in the ephemeral container
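
To rule out a problem with the custom profile itself, you can check that the environment variables from `custom-profile-nvidia-gpu.yaml` were applied to the debug container. This is only a sketch, reusing the same flags as above:

```bash
# The two NVIDIA_* variables from the custom profile should be printed
kubectl debug -it simple-pod --image=ubuntu --target=cuda-container --profile=general --custom=custom-profile-nvidia-gpu.yaml -- sh -c 'env | grep NVIDIA'
```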

Run a debug ephemeral container in `simple-pod`

```bash
kubectl debug -it simple-pod --image=ubuntu --target=cuda-container --profile=general --custom=custom-profile-nvidia-gpu.yaml -- ls -la /dev
```

```
Targeting container "cuda-container". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
Defaulting debug container name to debugger-nzrlt.
total 4
drwxr-xr-x 5 root root 380 Jan 12 17:47 .
drwxr-xr-x 1 root root 4096 Jan 12 17:47 ..
crw--w---- 1 root tty 136, 0 Jan 12 17:47 console
lrwxrwxrwx 1 root root 11 Jan 12 17:47 core -> /proc/kcore
lrwxrwxrwx 1 root root 13 Jan 12 17:47 fd -> /proc/self/fd
crw-rw-rw- 1 root root 1, 7 Jan 12 17:47 full
drwxrwxrwt 2 root root 40 Jan 12 17:16 mqueue
crw-rw-rw- 1 root root 1, 3 Jan 12 17:47 null
lrwxrwxrwx 1 root root 8 Jan 12 17:47 ptmx -> pts/ptmx
drwxr-xr-x 2 root root 0 Jan 12 17:47 pts
crw-rw-rw- 1 root root 1, 8 Jan 12 17:47 random
drwxrwxrwt 2 root root 40 Jan 12 17:16 shm
lrwxrwxrwx 1 root root 15 Jan 12 17:47 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root 15 Jan 12 17:47 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root 15 Jan 12 17:47 stdout -> /proc/self/fd/1
-rw-rw-rw- 1 root root 0 Jan 12 17:47 termination-log
crw-rw-rw- 1 root root 5, 0 Jan 12 17:47 tty
crw-rw-rw- 1 root root 1, 9 Jan 12 17:47 urandom
crw-rw-rw- 1 root root 1, 5 Jan 12 17:47 zero
```

> There are no `/dev/dri` and no `nvidia*` devices in this ephemeral container
> This is bad

### Delete `simple-pod`

```bash
kubectl delete -f simple-pod.yaml
```

```
pod "simple-pod" deleted from default namespace
```

## Create a `nvidia-pod` with `nvidia.com/gpu` and `runtimeClassName`

- Create a file `nvidia-pod.yaml`

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-pod
spec:
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: ubuntu
      command: ["sh", "-c", "sleep 3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
  restartPolicy: Never
```

> `runtimeClassName` is set to `nvidia`
> To list all runtime classes, run `kubectl get runtimeclasses`; with CDI support enabled it returns `nvidia`, `nvidia-cdi` and `nvidia-legacy` by default
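
For reference, this is how to inspect the runtime classes created by the GPU operator and see which handler each one maps to (the `nvidia-cdi` class is expected to use the CDI mode of the NVIDIA runtime):

```bash
# Typically prints NAME, HANDLER and AGE for nvidia, nvidia-cdi and nvidia-legacy
kubectl get runtimeclasses
```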

- Create `nvidia-pod`

```bash
kubectl create -f nvidia-pod.yaml
pod/nvidia-pod created
```
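
You can confirm that the runtime class was taken into account directly from the pod spec. A small check, assuming the pod runs in the `default` namespace:

```bash
# Should print: nvidia
kubectl get pod nvidia-pod -o jsonpath='{.spec.runtimeClassName}'
```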

- Exec command in `nvidia-pod`

```bash
kubectl exec -it nvidia-pod -- ls -la /dev/dri
```

The devices `/dev/dri/card1` and `/dev/dri/renderD128` are listed

```
total 0
drwxr-xr-x 2 root root 80 Jan 12 17:57 .
drwxr-xr-x 6 root root 480 Jan 12 17:57 ..
crw-rw---- 1 root root 226, 1 Jan 12 17:57 card1
crw-rw---- 1 root root 226, 128 Jan 12 17:57 renderD128
```

Get the GPU UUID from the `nvidia-smi` command line

```bash
kubectl exec -it nvidia-pod -- nvidia-smi -L
```

In my case it returns

```
GPU 0: Quadro M620 (UUID: GPU-b5aebea9-8a25-fb21-631b-7e5da5a60ccb)
```

### Create an ephemeral container inside `nvidia-pod`

Create a yaml file `custom-profile-nvidia-gpu.yaml` and replace `GPU-b5aebea9-8a25-fb21-631b-7e5da5a60ccb` with your own GPU UUID

```yaml
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: GPU-b5aebea9-8a25-fb21-631b-7e5da5a60ccb
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: all
```

Run a debug ephemeral container in `nvidia-pod`

```bash
kubectl debug -it nvidia-pod --image=ubuntu --target=cuda-container --profile=general --custom=custom-profile-nvidia-gpu.yaml -- nvidia-smi -L
```

The command `nvidia-smi -L` works and returns the GPU UUID

```
Targeting container "cuda-container". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
Defaulting debug container name to debugger-hm826.
GPU 0: Quadro M620 (UUID: GPU-b5aebea9-8a25-fb21-631b-7e5da5a60ccb)
```
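
If you want more detail than the UUID, the same debug container can run the full `nvidia-smi` report. This is a minimal variation of the command above, not an abcdesktop requirement:

```bash
# Prints driver version, CUDA version, memory usage and running processes
kubectl debug -it nvidia-pod --image=ubuntu --target=cuda-container --profile=general --custom=custom-profile-nvidia-gpu.yaml -- nvidia-smi
```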

Run a debug ephemeral container in `nvidia-pod`

```bash
kubectl debug -it nvidia-pod --image=ubuntu --target=cuda-container --profile=general --custom=custom-profile-nvidia-gpu.yaml -- ls -la /dev
```

```
Targeting container "cuda-container". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
Defaulting debug container name to debugger-jv2k5.
total 4
drwxr-xr-x 6 root root 500 Jan 12 18:03 .
drwxr-xr-x 1 root root 4096 Jan 12 18:03 ..
crw--w---- 1 root tty 136, 0 Jan 12 18:03 console
lrwxrwxrwx 1 root root 11 Jan 12 18:03 core -> /proc/kcore
drwxr-xr-x 3 root root 100 Jan 12 18:03 dri
lrwxrwxrwx 1 root root 13 Jan 12 18:03 fd -> /proc/self/fd
crw-rw-rw- 1 root root 1, 7 Jan 12 18:03 full
drwxrwxrwt 2 root root 40 Jan 12 17:57 mqueue
crw-rw-rw- 1 root root 1, 3 Jan 12 18:03 null
crw-rw-rw- 1 root root 195, 254 Jan 12 18:03 nvidia-modeset
crw-rw-rw- 1 root root 235, 0 Jan 12 18:03 nvidia-uvm
crw-rw-rw- 1 root root 235, 1 Jan 12 18:03 nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Jan 12 18:03 nvidia0
crw-rw-rw- 1 root root 195, 255 Jan 12 18:03 nvidiactl
lrwxrwxrwx 1 root root 8 Jan 12 18:03 ptmx -> pts/ptmx
drwxr-xr-x 2 root root 0 Jan 12 18:03 pts
crw-rw-rw- 1 root root 1, 8 Jan 12 18:03 random
drwxrwxrwt 2 root root 40 Jan 12 17:57 shm
lrwxrwxrwx 1 root root 15 Jan 12 18:03 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root 15 Jan 12 18:03 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root 15 Jan 12 18:03 stdout -> /proc/self/fd/1
-rw-rw-rw- 1 root root 0 Jan 12 18:03 termination-log
crw-rw-rw- 1 root root 5, 0 Jan 12 18:03 tty
crw-rw-rw- 1 root root 1, 9 Jan 12 18:03 urandom
crw-rw-rw- 1 root root 1, 5 Jan 12 18:03 zero
```

> The `/dev/dri` and `nvidia*` devices are listed for this ephemeral container
> The ephemeral container will work for abcdesktop

### Delete `nvidia-pod`

```bash
kubectl delete -f nvidia-pod.yaml
```

```
pod "nvidia-pod" deleted from default namespace
```

## Conclusion

Setting `runtimeClassName: nvidia` in the pod manifest allows ephemeral containers to share the pod's GPU.

## Apply `runtimeClassName` to abcdesktop config (release >= 4.3)

Get the `od.config` file

If you don't already have the config file `od.config`, run the command line

```bash
kubectl -n abcdesktop get configmap abcdesktop-config -o jsonpath='{.data.od\.config}' > od.config
```

- Edit `od.config`, update the dictionary `desktop.pod` to add `'runtimeClassName': 'nvidia'` in `spec`, and save your `od.config` file.

```
desktop.pod : {
    # default spec for all containers
    # can be overwritten on a dedicated container spec
    # a value inside mustache braces like {{ uidNumber }} is replaced at run time by the context value
    # for example {{ uidNumber }} is the uid number defined in the LDAP server
    'spec' : {
        'shareProcessNamespace': False,
        'securityContext': {
            'supplementalGroups': [ '{{ supplementalGroups }}' ],
            'runAsUser': '{{ uidNumber }}',
            'runAsGroup': '{{ gidNumber }}'
        },
        'tolerations': [],
        'runtimeClassName': 'nvidia'
    },
    ...
```

- Update the configmap `abcdesktop-config`

```bash
kubectl create -n abcdesktop configmap abcdesktop-config --from-file=od.config -o yaml --dry-run=client | kubectl replace -n abcdesktop -f -
```
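
To verify that the new setting really reached the configmap before restarting anything, you can re-read `od.config` from the cluster. A quick check, using the same jsonpath as above:

```bash
# Should print the line containing 'runtimeClassName': 'nvidia'
kubectl -n abcdesktop get configmap abcdesktop-config -o jsonpath='{.data.od\.config}' | grep runtimeClassName
```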

- Restart deployment `pyos-od`

```bash
kubectl rollout restart deployment pyos-od -n abcdesktop
```

- Create a new desktop pod to check the `runtimeClassName` (see the check below)
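
One way to check is to list the pods in the `abcdesktop` namespace together with their runtime class once a new desktop has been created; the desktop pod name depends on your abcdesktop setup:

```bash
# The newly created desktop pod should show RUNTIMECLASS=nvidia
kubectl -n abcdesktop get pods -o custom-columns=NAME:.metadata.name,RUNTIMECLASS:.spec.runtimeClassName
```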

## Links

- nvidia gpu-operator/23.6.2

[https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.2/cdi.html#support-for-multi-instance-gpu](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.2/cdi.html#support-for-multi-instance-gpu)