Commit 61f9bde

Redirect log message to stderr in nvidia runtime wrapper script
This change is required to make our nvidia runtime wrapper compliant with the OCI runtime spec. All OCI-compliant runtimes must support the operations documented at https://github.com/opencontainers/runtime-spec/blob/v1.2.1/runtime.md#operations. Before this change, our nvidia runtime wrapper was not producing the expected output when the query state operation (`state <container-id>`) was invoked AND the nvidia kernel modules happened to not be loaded. In this case, we were emitting an extra log message which caused the stdout of this command to not adhere to the schema defined in the OCI runtime spec. Redirecting the log message to stderr makes us compliant.

This issue was discovered when deploying GPU Operator 25.10.0 on nodes using cri-o. GPU Operator 25.10.0 is the first release that installs nvidia runtime handlers with cri-o by default, as opposed to installing an OCI hook file. When performing a GPU driver upgrade, pods in the gpu-operator namespace would be in the `Init:RunContainerError` state for several minutes until the new driver finished installing -- note that no nvidia driver modules are loaded during this span of several minutes. When inspecting the cri-o logs, we observed the following error message:

```
level=warning msg="Error updating the container status \"16779f4cd2414a164aae56856b491f86fe0c6b803a3b4474ada2cc0864c8e028\": failed to decode container status for 16779f4cd2414a164aae56856b491f86fe0c6b803a3b4474ada2cc0864c8e028: skipThreeBytes: expect ull, error found in #2 byte of ...|nvidia drive|..., bigger context ...|nvidia driver modules are not yet loaded, invoking /|..." id=a4b48041-edc4-48c2-8d75-4ad03cb3d8e1 name=/runtime.v1.RuntimeService/CreateContainer
```

This error message indicates cri-o failed to get the status of the container because it could not decode the JSON returned by the runtime handler.

Signed-off-by: Christopher Desiniotis <[email protected]>
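The failure mode described in the commit message can be reproduced in isolation with a small sketch. Everything here is illustrative: `fake_runtime_state`, `wrapper_before`, and `wrapper_after` are hypothetical stand-ins, and the JSON document is made-up sample data shaped like an OCI state object, not output from a real runtime.

```shell
#!/bin/sh
# Hypothetical stand-in for the low-level runtime's `state` operation.
# The JSON below is sample data, not real runtime output.
fake_runtime_state() {
    printf '%s\n' '{"ociVersion":"1.2.1","id":"demo","status":"created","bundle":"/tmp/demo"}'
}

# Wrapper behaviour before the change: the log message is written to stdout.
wrapper_before() {
    echo "nvidia driver modules are not yet loaded, invoking runc directly"
    fake_runtime_state
}

# Wrapper behaviour after the change: the log message is written to stderr.
wrapper_after() {
    echo "nvidia driver modules are not yet loaded, invoking runc directly" >&2
    fake_runtime_state
}

# Capture stdout only, as a caller decoding the state JSON would.
before_out=$(wrapper_before 2>/dev/null)
after_out=$(wrapper_after 2>/dev/null)

printf 'before, first stdout line: %s\n' "$(printf '%s\n' "$before_out" | head -n 1)"
printf 'after,  first stdout line: %s\n' "$(printf '%s\n' "$after_out" | head -n 1)"
```

With `wrapper_before`, the first line a caller reads from stdout is the log message rather than JSON, which matches the decode error quoted above. With `wrapper_after`, stdout carries only the JSON document.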
1 parent e6eaa43 commit 61f9bde

3 files changed: 5 additions and 5 deletions
cmd/nvidia-ctk-installer/toolkit/installer/executables.go

Lines changed: 1 addition & 1 deletion
@@ -162,7 +162,7 @@ func (w *render) render() (io.Reader, error) {
 {{- if (.CheckModules) }}
 cat /proc/modules | grep -e "^nvidia " >/dev/null 2>&1
 if [ "${?}" != "0" ]; then
-echo "nvidia driver modules are not yet loaded, invoking {{ .DefaultRuntimeExecutablePath }} directly"
+echo "nvidia driver modules are not yet loaded, invoking {{ .DefaultRuntimeExecutablePath }} directly" >&2
 exec {{ .DefaultRuntimeExecutablePath }} "$@"
 fi
 {{- end }}
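To exercise the guard above without a GPU node, here is a minimal sketch under stated assumptions. The `modules_loaded` and `check_and_log` functions are hypothetical helpers; the file argument is an illustrative addition (the real wrapper reads `/proc/modules` directly), `grep -q` stands in for the `>/dev/null 2>&1` redirection, and `return 1` stands in for the `exec` of the default runtime. The module lines are fabricated sample input in `/proc/modules` format.

```shell
#!/bin/sh
# modules_loaded FILE: succeeds if FILE (in /proc/modules format) lists a
# module named exactly "nvidia". The FILE parameter is an assumption made
# for testability; the real wrapper always reads /proc/modules.
modules_loaded() {
    grep -q -e "^nvidia " "$1"
}

# check_and_log FILE: mirrors the wrapper's guard, minus the exec.
# The log message goes to stderr so that stdout stays clean.
check_and_log() {
    if ! modules_loaded "$1"; then
        echo "nvidia driver modules are not yet loaded, invoking runc directly" >&2
        return 1
    fi
    return 0
}

# Fabricated sample inputs: one with the nvidia module, one with only nvidia_uvm.
tmpdir=$(mktemp -d)
printf 'nvidia 1234 0 - Live 0x0000000000000000\n' > "$tmpdir/loaded"
printf 'nvidia_uvm 1234 0 - Live 0x0000000000000000\n' > "$tmpdir/uvm_only"

loaded_rc=0
check_and_log "$tmpdir/loaded" 2>/dev/null || loaded_rc=$?

missing_rc=0
missing_stdout=$(check_and_log "$tmpdir/uvm_only" 2>/dev/null) || missing_rc=$?

# Swap the streams to capture only what was written to stderr.
missing_stderr=$(check_and_log "$tmpdir/uvm_only" 2>&1 >/dev/null) || true

rm -rf "$tmpdir"
```

Because the pattern is anchored and ends with a space, a line for `nvidia_uvm` does not count as the `nvidia` module being loaded. In the unloaded case the message appears only on stderr, and stdout stays empty.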

cmd/nvidia-ctk-installer/toolkit/installer/executables_test.go

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ func TestWrapperRender(t *testing.T) {
 expected: `#! /bin/sh
 cat /proc/modules | grep -e "^nvidia " >/dev/null 2>&1
 if [ "${?}" != "0" ]; then
-echo "nvidia driver modules are not yet loaded, invoking runc directly"
+echo "nvidia driver modules are not yet loaded, invoking runc directly" >&2
 exec runc "$@"
 fi
 /dest-dir/some-runtime \

cmd/nvidia-ctk-installer/toolkit/installer/installer_test.go

Lines changed: 3 additions & 3 deletions
@@ -170,7 +170,7 @@ func TestToolkitInstaller(t *testing.T) {
 wrapper: `#! /bin/sh
 cat /proc/modules | grep -e "^nvidia " >/dev/null 2>&1
 if [ "${?}" != "0" ]; then
-echo "nvidia driver modules are not yet loaded, invoking runc directly"
+echo "nvidia driver modules are not yet loaded, invoking runc directly" >&2
 exec runc "$@"
 fi
 NVIDIA_CTK_CONFIG_FILE_PATH=/foo/bar/baz/.config/nvidia-container-runtime/config.toml \
@@ -185,7 +185,7 @@ PATH=/foo/bar/baz:$PATH \
 wrapper: `#! /bin/sh
 cat /proc/modules | grep -e "^nvidia " >/dev/null 2>&1
 if [ "${?}" != "0" ]; then
-echo "nvidia driver modules are not yet loaded, invoking runc directly"
+echo "nvidia driver modules are not yet loaded, invoking runc directly" >&2
 exec runc "$@"
 fi
 NVIDIA_CTK_CONFIG_FILE_PATH=/foo/bar/baz/.config/nvidia-container-runtime/config.toml \
@@ -200,7 +200,7 @@ PATH=/foo/bar/baz:$PATH \
 wrapper: `#! /bin/sh
 cat /proc/modules | grep -e "^nvidia " >/dev/null 2>&1
 if [ "${?}" != "0" ]; then
-echo "nvidia driver modules are not yet loaded, invoking runc directly"
+echo "nvidia driver modules are not yet loaded, invoking runc directly" >&2
 exec runc "$@"
 fi
 NVIDIA_CTK_CONFIG_FILE_PATH=/foo/bar/baz/.config/nvidia-container-runtime/config.toml \
