Description
Versions:
podman -v
podman version 4.3.1
buildah -v
buildah version 1.28.0 (image-spec 1.0.2-dev, runtime-spec 1.0.2-dev)
nvidia-container-toolkit -version
NVIDIA Container Runtime Hook version 1.12.0-rc.3
commit: 14e587d55f2a4dc2e047a88e9acc2be72cb45af8
I am attempting to get containers running with access to the GPU using rootless podman and the --userns keep-id flag. My current steps are:
Generating the CDI spec via:
nvidia-ctk cdi generate > nvidia.yaml && sudo mkdir /etc/cdi && sudo mv nvidia.yaml /etc/cdi
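(The same thing can be done in one step with the generator's --output flag, assuming this nvidia-ctk build supports it; shown only as an equivalent sketch.)

sudo mkdir -p /etc/cdi
# write the spec directly into the CDI directory
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml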
Attempt 1: Fails
podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/dri/controlD69": no such file or directory
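This error suggests the spec references DRM nodes that no longer exist on the host. A rough way to list which referenced paths are missing (assuming the spec records device nodes as plain path: entries, as in the excerpts below):

# report every /dev path mentioned in the spec that does not exist on the host
grep -oE '/dev/[^ "]+' /etc/cdi/nvidia.yaml | sort -u | while read -r dev; do
  [ -e "$dev" ] || echo "missing: $dev"
done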
I then removed references to the following in the devices section of /etc/cdi/nvidia.yaml:
- path: /dev/dri/card5
- path: /dev/dri/controlD69
- path: /dev/dri/renderD129
and removed the create-symlinks hooks in the devices section:
hooks:
- args:
  - nvidia-ctk
  - hook
  - create-symlinks
  - --link
  - ../card5::/dev/dri/by-path/pci-0000:58:00.0-card
  - --link
  - ../renderD129::/dev/dri/by-path/pci-0000:58:00.0-render
  hookName: createContainer
  path: nvidia-ctk
Finally, I also removed the nvidia-ctk hook that changes the permissions of the /dev/dri path:
- args:
  - nvidia-ctk
  - hook
  - chmod
  - --mode
  - "755"
  - --path
  - /dev/dri
  hookName: createContainer
  path: /usr/bin/nvidia-ctk
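After these edits, a quick sanity check that no stale DRM references remain, alongside the DRM nodes that actually exist on this host:

# DRM nodes present on the host vs. those still referenced by the spec
ls -l /dev/dri/
grep -n '/dev/dri' /etc/cdi/nvidia.yaml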
Attempt 2: Passes (missing SELinux modules)
podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
return get_device_properties(device).name
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
I am not concerned about this error; I believe I just need to amend some SELinux policy modules as specified here.
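For the SELinux part, the workaround I have seen suggested elsewhere (not verified here, and separate from the userns problem below) is to disable label separation for the container:

# unverified workaround for the SELinux denial only
podman run --rm --security-opt=label=disable --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"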
However, running the above with the --userns keep-id flag fails:
Attempt 3: Fails
podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Error: OCI runtime error: crun: error executing hook `/usr/bin/nvidia-container-runtime-hook` (exit code: 1)
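The hook error itself is not very descriptive. Running the same command under podman's debug logging at least shows how the hook is invoked (nothing toolkit-specific assumed here, just podman's own --log-level flag):

# capture how the OCI hooks are set up and invoked for this container
podman --log-level=debug run --rm --device nvidia.com/gpu=gpu0 --userns keep-id \
  docker.io/pytorch/pytorch true 2>&1 | grep -i hook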
I have also tried different combinations of the load-kmods and no-cgroups flags in /etc/nvidia-container-runtime/config.toml.
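For reference, these are the two keys I was toggling; both live under the [nvidia-container-cli] section:

# show the current values of the two settings
grep -nE 'load-kmods|no-cgroups' /etc/nvidia-container-runtime/config.toml
# for rootless setups, no-cgroups = true is the value usually suggested in the docs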
A lot of this troubleshooting has been guided by the following links:
- https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/issues/8
- Cannot use userns=keep-id together with nvidia-container-toolkit on rootless podman (containers/podman#15863)
- Running nvidia-container-runtime with podman is blowing up (nvidia-container-runtime#85)
I am unsure of the lifecycle of the permissions when running these hooks; however, it looks like the first place where the mapped permissions may not add up is here.
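One way to see how the device nodes look from inside the rootless user namespace, which is where I suspect the mapping breaks down, is podman unshare (host-root-owned nodes should show up under the overflow UID, e.g. 65534):

# list the NVIDIA device nodes with numeric owners as seen inside the user namespace
podman unshare ls -ln /dev/nvidia*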