
OCI runtime error: crun: error executing hook using podman --userns keep-id #46

Closed
@osiler

Description

Versions:

	podman -v
            podman version 4.3.1

	buildah -v
	buildah version 1.28.0 (image-spec 1.0.2-dev, runtime-spec 1.0.2-dev)

	nvidia-container-toolkit -version
	NVIDIA Container Runtime Hook version 1.12.0-rc.3
	commit: 14e587d55f2a4dc2e047a88e9acc2be72cb45af8

I am attempting to run containers with GPU access under rootless podman and the --userns keep-id flag. My current steps include:

Generating the CDI spec via:

	nvidia-ctk cdi generate > nvidia.yaml && sudo mkdir /etc/cdi && sudo mv nvidia.yaml /etc/cdi
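
As a quick sanity check, the device nodes that the generated spec references can be compared against what actually exists on the host. Something along these lines works (a sketch, assuming the spec ended up in /etc/cdi/nvidia.yaml):

    # device paths the CDI spec expects
    grep 'path: /dev' /etc/cdi/nvidia.yaml | sort -u
    # device nodes actually present on the host
    ls -l /dev/nvidia* /dev/dri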

Attempt 1: Fails

podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"

Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/dri/controlD69": no such file or directory

I then removed references to the following paths from the devices section of /etc/cdi/nvidia.yaml:

    - path: /dev/dri/card5 
    - path: /dev/dri/controlD69
    - path: /dev/dri/renderD129

and removed the create-symlinks hooks in the devices section:

    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card5::/dev/dri/by-path/pci-0000:58:00.0-card
      - --link
      - ../renderD129::/dev/dri/by-path/pci-0000:58:00.0-render
      hookName: createContainer
      path: nvidia-ctk

Finally, I also removed the nvidia-ctk hook that changes the permissions of the /dev/dri path:

  - args:
    - nvidia-ctk
    - hook
    - chmod
    - --mode
    - "755"
    - --path
    - /dev/dri
    hookName: createContainer
    path: /usr/bin/nvidia-ctk

Attempt 2: Passes (missing SELinux modules)

podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

I am not concerned about this error; I believe I just need to amend some SELinux policy modules, as specified here.
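
(A workaround commonly suggested for this symptom, which I have not verified here and which disables label confinement rather than amending the policy, is to run the container with SELinux label separation turned off:

    podman run --rm --device nvidia.com/gpu=gpu0 --security-opt label=disable docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
)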

However, if I run the Attempt 2 command with the --userns keep-id flag added, it fails.

Attempt 3: Fails

podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Error: OCI runtime error: crun: error executing hook `/usr/bin/nvidia-container-runtime-hook` (exit code: 1)

I have also tried different combinations of the load-kmods and no-cgroups flags in /etc/nvidia-container-runtime/config.toml.
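
For reference, those values live under the [nvidia-container-cli] section of that file; a quick way to confirm what is currently set (a sketch, assuming the default file location):

    grep -E 'load-kmods|no-cgroups' /etc/nvidia-container-runtime/config.toml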

A lot of this troubleshooting has been guided by the following links.

I am unsure about the lifecycle of the permissions when running these hooks; however, it looks like the first point where the mapped permissions may not add up is here.
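
For what it's worth, the UID/GID mapping that keep-id actually sets up inside the container can be inspected with something like the following (illustrative only; any small image works, alpine is used here):

    podman run --rm --userns keep-id docker.io/library/alpine cat /proc/self/uid_map /proc/self/gid_map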
