Description
Versions:
podman -v
podman version 4.3.1
buildah -v
buildah version 1.28.0 (image-spec 1.0.2-dev, runtime-spec 1.0.2-dev)
nvidia-container-toolkit -version
NVIDIA Container Runtime Hook version 1.12.0-rc.3
commit: 14e587d55f2a4dc2e047a88e9acc2be72cb45af8
I am attempting to get containers running with GPU access under rootless podman and the --userns keep-id flag. My current steps are:
Generating the CDI spec via:
nvidia-ctk cdi generate > nvidia.yaml && sudo mkdir /etc/cdi && sudo mv nvidia.yaml /etc/cdi
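(As an aside, the spec can presumably also be written in one step; I have not confirmed that the --output flag is available in this release candidate, so treat this as an assumption:)
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml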
Attempt 1: Fails
podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/dri/controlD69": no such file or directory
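The failure appears to come from the generated spec referencing DRI nodes that do not exist on this host; comparing the two makes the mismatch obvious:
# DRI nodes referenced by the spec vs. the ones actually present on the host
grep -n '/dev/dri' /etc/cdi/nvidia.yaml
ls -l /dev/dri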
I then removed references to the following paths in the devices section of /etc/cdi/nvidia.yaml:
- path: /dev/dri/card5
- path: /dev/dri/controlD69
- path: /dev/dri/renderD129
and removed the create-symlinks hooks from the devices section:
hooks:
- args:
  - nvidia-ctk
  - hook
  - create-symlinks
  - --link
  - ../card5::/dev/dri/by-path/pci-0000:58:00.0-card
  - --link
  - ../renderD129::/dev/dri/by-path/pci-0000:58:00.0-render
  hookName: createContainer
  path: nvidia-ctk
Finally, I also removed the nvidia-ctk hook that changes the permissions of the /dev/dri path (a scripted version of all three edits is sketched after this excerpt).
- args:
  - nvidia-ctk
  - hook
  - chmod
  - --mode
  - "755"
  - --path
  - /dev/dri
  hookName: createContainer
  path: /usr/bin/nvidia-ctk
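I made all three edits by hand, but for anyone reproducing this, something like the following should be equivalent (a sketch only, assuming yq v4 is installed and that these entries all live under .devices[].containerEdits, as they do in my generated spec):
# Drop the stale DRI device nodes and the create-symlinks/chmod hooks from the spec
sudo yq -i '
  del(.devices[].containerEdits.deviceNodes[] |
      select(.path == "/dev/dri/card5" or
             .path == "/dev/dri/controlD69" or
             .path == "/dev/dri/renderD129")) |
  del(.devices[].containerEdits.hooks[] | select(.args | contains(["create-symlinks"]))) |
  del(.devices[].containerEdits.hooks[] | select(.args | contains(["chmod"])))
' /etc/cdi/nvidia.yaml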
Attempt 2: Pass (missing SELinux modules)
podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init() # will define _get_device_properties
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
I am not concerned about this error; I believe I just need to amend some SELinux policy modules as specified here.
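My assumption is that the fix will be either installing NVIDIA's SELinux policy module or, as a quick check, disabling label separation for the container; neither is verified here:
# Quick check: run without SELinux label separation
podman run --rm --security-opt label=disable --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
# Or install the policy module (assumption: built/downloaded as nvidia-container.pp)
sudo semodule -i nvidia-container.pp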
However, if I attempt to run the above with the --userns keep-id flag, it fails.
Attempt 3: Fail
podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Error: OCI runtime error: crun: error executing hook `/usr/bin/nvidia-container-runtime-hook` (exit code: 1)
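If it helps with triage, the hook failure can presumably be narrowed down by re-running the same command with podman's debug logging (output not included here):
podman --log-level=debug run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"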
I have also tried the different combinations of the load-kmods and no-cgroups flags in /etc/nvidia-container-runtime/config.toml.
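For reference, both flags live under the [nvidia-container-cli] section of that file; one of the combinations I tried looks like this (values shown are illustrative):
[nvidia-container-cli]
load-kmods = true
no-cgroups = true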
A lot of this troubleshooting has been guided by the following links:
- https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/issues/8
- Cannot use userns=keep-id together with nvidia-container-toolkit on rootless podman containers/podman#15863
- Running nvidia-container-runtime with podman is blowing up. nvidia-container-runtime#85
I am unsure of the lifecycle of the permissions when these hooks run; however, it looks like the first point where the mapped permissions may not add up is here.
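As a data point on how the IDs map, the device nodes can be compared as seen from the host and from inside the rootless user namespace:
# Ownership of the NVIDIA and DRI nodes as the host sees them
ls -ln /dev/nvidia* /dev/dri
# The same nodes from inside the rootless user namespace; host IDs that are not
# mapped show up as the overflow ID 65534 (nobody/nogroup)
podman unshare ls -ln /dev/nvidia* /dev/dri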