NOTICE: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" #1133
cdesiniotis announced in Announcements
Update: Please refer to our dedicated troubleshooting section for more information on this error.
1. Executive summary
Under specific conditions, containers may be abruptly detached from the GPUs they were originally connected to. We have determined the root cause of this issue and identified the environments in which it can occur. Workarounds for the affected environments are provided at the end of this document until a proper fix is released.
2. Summary of the issue
Containerized GPU workloads may suddenly lose access to their GPUs. This situation occurs when systemd is used to manage the cgroups of the container and it is triggered to reload any unit files that have references to NVIDIA GPUs (e.g. with something as simple as a `systemctl daemon-reload`).
When the container loses access to the GPU, you will see the following error message in the console output:
`Failed to initialize NVML: Unknown Error`
The container needs to be deleted once the issue occurs.
When it is restarted (manually or automatically depending on the use of a container orchestration platform), it will regain access to the GPU.
The issue originates from the fact that recent versions of `runc` require that symlinks be present under `/dev/char` to any device nodes being injected into a container. Unfortunately, these symlinks are not present for NVIDIA devices, and the NVIDIA GPU driver does not (currently) provide a means for them to be created automatically.
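As an illustration, a minimal sketch of how the missing symlinks can be observed on a host (the exact device nodes present depend on which driver modules are loaded):

```shell
# List the NVIDIA device nodes and check for matching symlinks under /dev/char.
# On an affected host the symlinks are missing, so the second command prints nothing.
ls -l /dev/nvidia*
ls -l /dev/char/ | grep -i nvidia
```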
3. Affected environments
Affected environments are those using `runc` and enabling systemd cgroup management at the high-level container runtime.
If the system is NOT using systemd to manage cgroups, then it is NOT subject to this issue.
An exhaustive list of the affected environments is provided below:
Docker environment using `containerd` / `runc`:
Specific condition: cgroup driver enabled with systemd (e.g. parameter `"exec-opts": ["native.cgroupdriver=systemd"]` set in `/etc/docker/daemon.json`). Note that on recent releases, systemd cgroup management is the default (e.g. on Ubuntu 22.04).
Note: You can check whether Docker uses systemd cgroup management as sketched below.
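A sketch of such a check (the exact set of fields printed by `docker info` varies by version):

```shell
# Ask Docker which cgroup driver it is using; "systemd" indicates the affected configuration.
docker info | grep -i "cgroup driver"
#  Cgroup Driver: systemd
```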
K8s environment using `containerd` / `runc`:
Specific condition: `SystemdCgroup = true` set in the `containerd` configuration file (usually located here: `/etc/containerd/config.toml`).
Note: To check if containerd uses systemd cgroup management, you can inspect the configuration as sketched below.
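A sketch of such a check, assuming the default configuration path:

```shell
# Look for the systemd cgroup option under the runc runtime settings;
# on an affected node this prints: SystemdCgroup = true
grep -i "SystemdCgroup" /etc/containerd/config.toml
```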
K8s environment (including OpenShift) using `cri-o` / `runc`:
Specific condition: `cgroup_manager` enabled with `systemd` in the `cri-o` configuration file (usually located here: `/etc/crio/crio.conf` or `/etc/crio/crio.conf.d/00-default`, e.g. on OpenShift), as sketched below.
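A sketch of the corresponding check, assuming one of the default configuration paths:

```shell
# Look for the cgroup manager setting in the cri-o configuration;
# on an affected node this prints: cgroup_manager = "systemd"
grep -i "cgroup_manager" /etc/crio/crio.conf /etc/crio/crio.conf.d/* 2>/dev/null
```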
Note: Podman environments use `crun` by default and are not subject to this issue unless `runc` is configured as the low-level container runtime to be used.
4. How to check if you are affected
You can use the following steps to confirm that your system is affected. After you implement one of the workarounds (mentioned in the next section), you can repeat the steps to confirm that the error is no longer reproducible.
For Docker environments
Run a test container:
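A sketch of such a test container, assuming the NVIDIA runtime is configured for Docker; the CUDA image tag is only an example, and the set of `/dev/nvidia*` device nodes may differ on your system:

```shell
# Run a loop that lists the visible GPUs every 5 seconds,
# explicitly mounting the NVIDIA device nodes to reproduce this specific failure mode.
docker run -d --rm --runtime=nvidia --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 \
    bash -c "while true; do nvidia-smi -L; sleep 5; done"
```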
Note: Make sure to mount the different devices as shown above. They are needed to narrow the problem down to this specific issue.
If your system has more than 1 GPU, append the above command with an additional `--device` mount for each extra GPU (e.g. add `--device=/dev/nvidia1` on a system that has 2 GPUs).
Check the logs from the container:
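For example (the container ID and GPU name are placeholders):

```shell
# While the container still has access to the GPU, one line per GPU is printed every 5 seconds.
docker logs <container-id>
# GPU 0: <GPU model> (UUID: GPU-...)
```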
Then initiate a `daemon-reload` and check the logs from the container again, as sketched below:
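A sketch of this step (the container ID is again a placeholder):

```shell
# Reload systemd units on the host; this alone is enough to trigger the issue.
sudo systemctl daemon-reload

# On an affected system the container loop now starts printing:
# Failed to initialize NVML: Unknown Error
docker logs <container-id>
```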
For K8s environments
Run a test pod:
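A minimal sketch of such a pod, assuming the NVIDIA device plugin is deployed; the pod name and image tag are illustrative:

```shell
# Create a pod that requests one GPU and lists the visible GPUs every 5 seconds.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-nvidia-smi-loop
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04
    command: ["/bin/sh", "-c"]
    args: ["while true; do nvidia-smi -L; sleep 5; done"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```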
Check the logs from the pod:
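For example (the GPU name is a placeholder; the pod name matches the sketch above):

```shell
# While the pod still has access to the GPU, one line per GPU is printed every 5 seconds.
kubectl logs cuda-nvidia-smi-loop
# GPU 0: <GPU model> (UUID: GPU-...)
```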
Then initiate a `daemon-reload` on the node and check the logs from the pod again, as sketched below:
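A sketch of this step (run the reload on the node where the pod is scheduled):

```shell
# Reload systemd units on the node; this alone is enough to trigger the issue.
sudo systemctl daemon-reload

# On an affected system the pod loop now starts printing:
# Failed to initialize NVML: Unknown Error
kubectl logs cuda-nvidia-smi-loop
```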
5. Workarounds
The following workarounds are available for both standalone Docker environments and K8s environments (multiple options are presented in order of preference; the first is the most recommended):
For Docker environments
Using the `nvidia-ctk` utility:
The NVIDIA Container Toolkit v1.12.0 includes a utility for creating symlinks in `/dev/char` for all possible NVIDIA device nodes required for using GPUs in containers. It can be run as shown below. The command should be configured to run at boot on each node where GPUs will be used in containers. It requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.
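A sketch of the invocation (confirm the exact subcommand and flags with `nvidia-ctk --help` for your toolkit version):

```shell
# Create /dev/char symlinks for all NVIDIA device nodes present on the host.
sudo nvidia-ctk system create-dev-char-symlinks --create-all
```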
A simple `udev` rule to enforce this can be seen below:
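One possible rule, assuming `nvidia-ctk` is installed at `/usr/bin/nvidia-ctk`:

```
# Recreate the /dev/char symlinks whenever the NVIDIA driver is bound to a PCI device.
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
```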
A good place to install this rule would be: `/lib/udev/rules.d/71-nvidia-dev-char.rules`
In cases where the NVIDIA GPU Driver Container is used, the path to the driver installation must be specified. In this case the command should be modified as sketched below.
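A sketch of the modified invocation, reusing the `{{NVIDIA_DRIVER_ROOT}}` placeholder from below; confirm the exact flag name (`--driver-root` here) against the toolkit version in use:

```shell
# Create the /dev/char symlinks against a driver installation provided by the driver container.
sudo nvidia-ctk system create-dev-char-symlinks --create-all --driver-root={{NVIDIA_DRIVER_ROOT}}
```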
Where `{{NVIDIA_DRIVER_ROOT}}` is the path to which the NVIDIA GPU Driver container installs the NVIDIA GPU driver and creates the NVIDIA device nodes.
Explicitly disabling systemd cgroup management in Docker:
Set the parameter `"exec-opts": ["native.cgroupdriver=cgroupfs"]` in the `/etc/docker/daemon.json` file and restart Docker.
Downgrading to `docker.io` packages where systemd is not the default cgroup manager (and not overriding that, of course).
For K8s environments
Deploying GPU Operator 22.9.2 will automatically fix the issue on all K8s nodes of the cluster (the fix is integrated inside the validator pod which will run when a new node is deployed or at every reboot of the node).
For deployments using the standalone `k8s-device-plugin` (i.e. not through the use of the operator), installing a `udev` rule as described in the previous section works around this issue. Be sure to pass the correct `{{NVIDIA_DRIVER_ROOT}}` in cases where the driver container is also in use.
Explicitly disabling systemd cgroup management in `containerd` or `cri-o`:
Set `SystemdCgroup = false` in the `containerd` configuration file and restart `containerd`, or remove `cgroup_manager = "systemd"` from the `cri-o` configuration file (usually located here: `/etc/crio/crio.conf` or `/etc/crio/crio.conf.d/00-default`) and restart `cri-o`.
Downgrading to a version of the `containerd.io` package where systemd is not the default cgroup manager (and not overriding that, of course).