Open
Labels: lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.)
Description
Running hello-world with --gpus all currently fails:
❯ sudo docker run --rm --gpus all hello-world
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: ldcache error: process /nix/store/c2i631h8i5vcs1sqifwxfsazhwrg6wr5-glibc-2.39-52-bin/sbin/ldconfig failed with error code: 1: unknown.
However, running nvidia-smi in the ubuntu image with the same flags works:
❯ sudo docker run --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
49b384cc7b4a: Already exists
Digest: sha256:3f85b7caad41a95462cf5b787d8a04604c8262cdcdf9a472b8c52ef83375fe15
Status: Downloaded newer image for ubuntu:latest
Sat May 25 09:36:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:07:00.0  On |                  N/A |
|  0%   39C    P8              8W /  160W |     838MiB /   8188MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
I'm not sure what is special about hello-world that causes it to fail, but a custom image built with Nix fails in the same way:
dotcharles on master [!+?⇡]
❯ cat image.nix
{ pkgs ? import <nixpkgs> { }
, pkgsLinux ? import <nixpkgs> { system = "x86_64-linux"; }
}:
pkgs.dockerTools.buildImage {
name = "hello-docker";
config = {
Cmd = [ "${pkgsLinux.linuxPackages.nvidia_x11.bin}/bin/nvidia-smi" ];
};
}
dotcharles on master [!+?⇡]
❯ NIXPKGS_ALLOW_UNFREE=1 nix-build image.nix
/nix/store/9hv44m8sbg23z6l0y2zm7dhyxypd23iz-docker-image-hello-docker.tar.gz
dotcharles on master [!+?⇡] took 2s
❯ sudo docker load < result
Loaded image: hello-docker:9hv44m8sbg23z6l0y2zm7dhyxypd23iz
dotcharles on master [!+?⇡] took 5s
❯ sudo docker run --rm --gpus all hello-docker:9hv44m8sbg23z6l0y2zm7dhyxypd23iz
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: ldcache error: process /nix/store/c2i631h8i5vcs1sqifwxfsazhwrg6wr5-glibc-2.39-52-bin/sbin/ldconfig failed with error code: 1: unknown.
I'd really appreciate any pointers on debugging this!
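As a starting point for debugging, the failing ldconfig path can be pulled out of the hook error and checked on the host. The sketch below is only an illustration: the helper names extract_ldconfig_path and check_ldconfig are hypothetical, and it assumes the error line keeps the "ldcache error: process <path> failed with error code" shape shown above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Docker or the NVIDIA toolkit.

# Read error text on stdin and print the ldconfig path named in an
# nvidia-container-cli "ldcache error" line. Prints nothing if no such
# line is present.
extract_ldconfig_path() {
  sed -n 's/.*ldcache error: process \([^ ]*\) failed with error code.*/\1/p'
}

# Report whether a given path exists and is executable on the host.
check_ldconfig() {
  path="$1"
  if [ -x "$path" ]; then
    echo "executable: $path"
  elif [ -e "$path" ]; then
    echo "present but not executable: $path"
  else
    echo "missing: $path"
  fi
}
```

For example, `sudo docker run --rm --gpus all hello-world 2>&1 | extract_ldconfig_path` should print the /nix/store/...-glibc-2.39-52-bin/sbin/ldconfig path, which can then be passed to check_ldconfig. This only checks the host side. It does not show what ldconfig sees inside the container's root filesystem, which may be where hello-world and the Nix-built image differ from ubuntu.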
My Docker config is:
dotcharles on master [!+?⇡]
❯ ps faux | grep docker
ollie 342826 0.0 0.1 659172 58904 pts/2 S+ 10:24 0:00 | \_ journalctl -u docker -f
ollie 352810 0.0 0.0 6680 2560 pts/1 S+ 10:38 0:00 \_ /nix/store/28gpmx3z6ss3znd7fhmrzmvk3x5lnfbk-gnugrep-3.11/bin/grep --color=auto docker
root 233563 0.0 0.3 7668724 117392 ? Ssl May24 0:29 /nix/store/yr5bbzyw9g252mzpykv1zwpbc52qk5zq-moby-26.1.3/libexec/docker/dockerd --config-file=/nix/store/kccyg6i71h5nlwyarshcav0wx0phqyw1-daemon.json
root 233597 0.0 0.1 2540620 53520 ? Ssl May24 0:33 \_ containerd --config /var/run/docker/containerd/containerd.toml
dotcharles on master [!+?⇡]
❯ cat /nix/store/kccyg6i71h5nlwyarshcav0wx0phqyw1-daemon.json
{
"features": {
"cdi": true
},
"group": "docker",
"hosts": [
"fd://"
],
"live-restore": true,
"log-driver": "journald",
"runtimes": {
"nvidia": {
"args": [],
"path": "/nix/store/fdd9z75hky47fzmrma45q2k6dkdqdi7i-container-toolkit-container-toolkit-1.15.0-rc.3/bin/nvidia-container-runtime"
}
}
}
I note that I can get hello-world to work like this:
sudo docker run --device=nvidia.com/gpu=all --rm -it hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
and this also works with my custom image:
sudo docker run --device=nvidia.com/gpu=all --rm -it hello-docker:iv79vxkrcw4gxibz50js6cbxi6z4rzix
But ultimately I want to run images from Nomad, which seems to use the --gpus all form that fails above rather than the --device form.
If anyone knows a way to make a "runtime" that just does the CDI device injection shown above, that would also probably unblock me!
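One possible direction for a runtime that goes through CDI, offered as an unverified sketch rather than a confirmed fix: the NVIDIA Container Toolkit documentation describes a cdi mode for nvidia-container-runtime, in which the runtime resolves devices from CDI specs instead of going through the legacy hook that fails above. The fragment below shows the relevant settings in the toolkit's config.toml. Where that file is generated on NixOS, and whether Nomad's tasks actually reach this runtime, are assumptions that would need checking.

```toml
# Fragment of the NVIDIA Container Toolkit config.toml (location is an
# assumption; on NixOS it is likely generated into the Nix store).
[nvidia-container-runtime]
# Use CDI specs instead of the auto-detected 'legacy' mode seen in the error.
mode = "cdi"

[nvidia-container-runtime.modes.cdi]
# Device kind assumed when a request names no vendor/class prefix.
default-kind = "nvidia.com/gpu"
```

With this mode set, the toolkit docs show devices being requested through the runtime by CDI name, for example `docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all ubuntu nvidia-smi -L`. It is not clear that --gpus all itself would take this path, so this may only help if the Nomad driver selects the nvidia runtime.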