Description
Versions:
podman -v
podman version 4.3.1
buildah -v
buildah version 1.28.0 (image-spec 1.0.2-dev, runtime-spec 1.0.2-dev)
nvidia-container-toolkit -version
NVIDIA Container Runtime Hook version 1.12.0-rc.3
commit: 14e587d55f2a4dc2e047a88e9acc2be72cb45af8
I am attempting to get containers running with GPU access under rootless podman and the --userns keep-id flag. My current steps are:
Generating the CDI spec via:
nvidia-ctk cdi generate > nvidia.yaml && sudo mkdir /etc/cdi && sudo mv nvidia.yaml /etc/cdi
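(As an aside, the spec can presumably also be written in one step; I have not confirmed that the --output flag is available in this release candidate, so treat this as an assumption:)
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml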
Attempt 1: Fails
podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/dri/controlD69": no such file or directory
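The failure appears to come from the generated spec referencing DRI nodes that do not exist on this host; comparing the two makes the mismatch obvious:
# DRI nodes referenced by the spec vs. the ones actually present on the host
grep -n '/dev/dri' /etc/cdi/nvidia.yaml
ls -l /dev/dri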
I then removed references to the following paths in the devices section of /etc/cdi/nvidia.yaml:
- path: /dev/dri/card5
- path: /dev/dri/controlD69
- path: /dev/dri/renderD129
and removed the create-symlinks hooks from the devices section:
hooks:
- args:
  - nvidia-ctk
  - hook
  - create-symlinks
  - --link
  - ../card5::/dev/dri/by-path/pci-0000:58:00.0-card
  - --link
  - ../renderD129::/dev/dri/by-path/pci-0000:58:00.0-render
  hookName: createContainer
  path: nvidia-ctk
Finally, I also removed the nvidia-ctk hook that changes the permissions of the /dev/dri path (a scripted version of all three edits is sketched after this excerpt).
- args:
  - nvidia-ctk
  - hook
  - chmod
  - --mode
  - "755"
  - --path
  - /dev/dri
  hookName: createContainer
  path: /usr/bin/nvidia-ctk
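I made all three edits by hand, but for anyone reproducing this, something like the following should be equivalent (a sketch only, assuming yq v4 is installed and that these entries all live under .devices[].containerEdits, as they do in my generated spec):
# Drop the stale DRI device nodes and the create-symlinks/chmod hooks from the spec
sudo yq -i '
  del(.devices[].containerEdits.deviceNodes[] |
      select(.path == "/dev/dri/card5" or
             .path == "/dev/dri/controlD69" or
             .path == "/dev/dri/renderD129")) |
  del(.devices[].containerEdits.hooks[] | select(.args | contains(["create-symlinks"]))) |
  del(.devices[].containerEdits.hooks[] | select(.args | contains(["chmod"])))
' /etc/cdi/nvidia.yaml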
Attempt 2: Pass (missing SELinux modules)
podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init() # will define _get_device_properties
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
I am not concerned about this error; I believe I just need to amend some SELinux policy modules as specified here.
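My assumption is that the fix will be either installing NVIDIA's SELinux policy module or, as a quick check, disabling label separation for the container; neither is verified here:
# Quick check: run without SELinux label separation
podman run --rm --security-opt label=disable --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
# Or install the policy module (assumption: built/downloaded as nvidia-container.pp)
sudo semodule -i nvidia-container.pp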
However, if I attempt to run the above with the --userns keep-id flag, it fails.
Attempt 3: Fail
podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Error: OCI runtime error: crun: error executing hook `/usr/bin/nvidia-container-runtime-hook` (exit code: 1)
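If it helps with triage, the hook failure can presumably be narrowed down by re-running the same command with podman's debug logging (output not included here):
podman --log-level=debug run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"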
I have also tried the different combinations of the load-kmods and no-cgroups flags in /etc/nvidia-container-runtime/config.toml.
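For reference, both flags live under the [nvidia-container-cli] section of that file; one of the combinations I tried looks like this (values shown are illustrative):
[nvidia-container-cli]
load-kmods = true
no-cgroups = true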
A lot of this troubleshooting has been guided by the following links:
- https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/issues/8
- Cannot use userns=keep-id together with nvidia-container-toolkit on rootless podman containers/podman#15863
- Running nvidia-container-runtime with podman is blowing up. nvidia-container-runtime#85
I am unsure of the lifecycle of the permissions when these hooks run; however, it looks like the first point where the mapped permissions may not add up is here.
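As a data point on how the IDs map, the device nodes can be compared as seen from the host and from inside the rootless user namespace:
# Ownership of the NVIDIA and DRI nodes as the host sees them
ls -ln /dev/nvidia* /dev/dri
# The same nodes from inside the rootless user namespace; host IDs that are not
# mapped show up as the overflow ID 65534 (nobody/nogroup)
podman unshare ls -ln /dev/nvidia* /dev/dri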