
Cannot run hello-world from NixOS #510

@ocharles

Running hello-world with --gpus all currently fails:

❯ sudo docker run --rm --gpus all hello-world
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: ldcache error: process /nix/store/c2i631h8i5vcs1sqifwxfsazhwrg6wr5-glibc-2.39-52-bin/sbin/ldconfig failed with error code: 1: unknown.

However, running nvidia-smi in the ubuntu image with the same flags works:

❯ sudo docker run --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
49b384cc7b4a: Already exists 
Digest: sha256:3f85b7caad41a95462cf5b787d8a04604c8262cdcdf9a472b8c52ef83375fe15
Status: Downloaded newer image for ubuntu:latest
Sat May 25 09:36:36 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:07:00.0  On |                  N/A |
|  0%   39C    P8              8W /  160W |     838MiB /   8188MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

I'm not sure what is special about hello-world that causes it to fail, but building my own custom image using Nix also fails, in the same way:

dotcharles on  master [!+?⇡] 
❯ cat image.nix 
{ pkgs ? import <nixpkgs> { }
, pkgsLinux ? import <nixpkgs> { system = "x86_64-linux"; }
}:

pkgs.dockerTools.buildImage {
  name = "hello-docker";
  config = {
    Cmd = [ "${pkgsLinux.linuxPackages.nvidia_x11.bin}/bin/nvidia-smi" ];
  };
}


dotcharles on  master [!+?⇡] 
❯ NIXPKGS_ALLOW_UNFREE=1 nix-build image.nix
/nix/store/9hv44m8sbg23z6l0y2zm7dhyxypd23iz-docker-image-hello-docker.tar.gz

dotcharles on  master [!+?⇡] took 2s 
❯ sudo docker load < result
Loaded image: hello-docker:9hv44m8sbg23z6l0y2zm7dhyxypd23iz

dotcharles on  master [!+?⇡] took 5s 
❯ sudo docker run --rm --gpus all hello-docker:9hv44m8sbg23z6l0y2zm7dhyxypd23iz
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: ldcache error: process /nix/store/c2i631h8i5vcs1sqifwxfsazhwrg6wr5-glibc-2.39-52-bin/sbin/ldconfig failed with error code: 1: unknown.

I'd really appreciate any pointers on debugging this!
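
One possible starting point, offered as a sketch rather than a confirmed fix: the hook error names the exact ldconfig binary that nvidia-container-cli tried to run, so that path can be pulled out of the message and run by hand to see whether it reports anything more specific. The error string below is copied from the output above; the variable names are made up for illustration, and running ldconfig -p on the host is not the same invocation the hook makes against the container root filesystem, so a clean exit here does not rule the binary out.

```shell
# Hook error text copied from the failing docker run above.
err='nvidia-container-cli: ldcache error: process /nix/store/c2i631h8i5vcs1sqifwxfsazhwrg6wr5-glibc-2.39-52-bin/sbin/ldconfig failed with error code: 1: unknown.'

# Extract the ldconfig path that nvidia-container-cli reports having executed.
ldconfig_path=$(printf '%s\n' "$err" | sed -n 's|.*process \(/[^ ]*ldconfig\) failed.*|\1|p')
echo "hook ran: $ldconfig_path"

# If that binary exists on this host, run it directly so its own stderr is visible.
# -p only prints the current cache; it does not modify anything.
if [ -x "$ldconfig_path" ]; then
  if "$ldconfig_path" -p >/dev/null; then
    echo "ldconfig -p succeeded on the host"
  else
    echo "ldconfig -p failed on the host with status $?"
  fi
else
  echo "ldconfig binary not present on this machine, skipping direct run"
fi
```

Whatever ldconfig prints on stderr here is the part the Docker error message swallows.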

My Docker config is:

dotcharles on  master [!+?⇡] 
❯ ps faux | grep docker
ollie     342826  0.0  0.1 659172 58904 pts/2    S+   10:24   0:00  |       \_ journalctl -u docker -f
ollie     352810  0.0  0.0   6680  2560 pts/1    S+   10:38   0:00          \_ /nix/store/28gpmx3z6ss3znd7fhmrzmvk3x5lnfbk-gnugrep-3.11/bin/grep --color=auto docker
root      233563  0.0  0.3 7668724 117392 ?      Ssl  May24   0:29 /nix/store/yr5bbzyw9g252mzpykv1zwpbc52qk5zq-moby-26.1.3/libexec/docker/dockerd --config-file=/nix/store/kccyg6i71h5nlwyarshcav0wx0phqyw1-daemon.json
root      233597  0.0  0.1 2540620 53520 ?       Ssl  May24   0:33  \_ containerd --config /var/run/docker/containerd/containerd.toml

dotcharles on  master [!+?⇡] 
❯ cat /nix/store/kccyg6i71h5nlwyarshcav0wx0phqyw1-daemon.json 
{
  "features": {
    "cdi": true
  },
  "group": "docker",
  "hosts": [
    "fd://"
  ],
  "live-restore": true,
  "log-driver": "journald",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "/nix/store/fdd9z75hky47fzmrma45q2k6dkdqdi7i-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-container-runtime"
    }
  }
}
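
To make explicit what this file does and does not configure, here is a small sketch that inspects a copy of it with grep and sed only. The JSON is the content shown above written to a temporary file; on a real host one would point the checks at the path passed to dockerd --config-file instead. The variable names are illustrative.

```shell
# Work on a temporary copy of the daemon.json shown above.
config=$(mktemp)
cat > "$config" <<'EOF'
{
  "features": {
    "cdi": true
  },
  "group": "docker",
  "hosts": [
    "fd://"
  ],
  "live-restore": true,
  "log-driver": "journald",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "/nix/store/fdd9z75hky47fzmrma45q2k6dkdqdi7i-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-container-runtime"
    }
  }
}
EOF

# Is Docker's CDI feature switched on?
if grep -Eq '"cdi"[[:space:]]*:[[:space:]]*true' "$config"; then cdi_enabled=yes; else cdi_enabled=no; fi

# Which binary is registered as a runtime?
runtime_path=$(sed -n 's/.*"path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$config")

# Is a default runtime set, or does every container use runc unless told otherwise?
if grep -q '"default-runtime"' "$config"; then default_runtime=set; else default_runtime=unset; fi

echo "cdi enabled:     $cdi_enabled"
echo "runtime path:    $runtime_path"
echo "default-runtime: $default_runtime"

rm -f "$config"
```

Read this way, the file enables CDI and registers the nvidia runtime but sets no default runtime, which fits the observation that the --device=nvidia.com/gpu=all form works while the --gpus all form goes through the hook that reports 'legacy' mode.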

I note that I can get hello-world to work by requesting the GPU as a CDI device with --device instead of --gpus:

sudo docker run --device=nvidia.com/gpu=all  --rm -it hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

and this also works with my custom image:

sudo docker run --device=nvidia.com/gpu=all  --rm -it hello-docker:iv79vxkrcw4gxibz50js6cbxi6z4rzix

But ultimately I want to run images from Nomad, which seems to use the failing --gpus all form rather than the --device form.

If anyone knows a way to make a "runtime" that just does what --device=nvidia.com/gpu=all does, that would also probably unblock me!

Labels: lifecycle/stale (denotes an issue or PR that has remained open with no activity and has become stale)
