After installing nvidia-container-toolkit=1.18.0-1, I hit the following crash on a Google Cloud n1-standard-4 instance with a V100 GPU. Downgrading to 1.17.9-1 resolves it.
$ docker run -it --gpus=all alpine
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: nvidia-container-runtime did not terminate successfully: exit status 2: free(): invalid pointer
SIGABRT: abort
PC=0x723e54e969fc m=0 sigcode=18446744073709551610
signal arrived during cgo execution
goroutine 1 gp=0xc000002380 m=0 mp=0xc6ad60 [syscall]:
runtime.cgocall(0x6d4990, 0xc00015ae60)
/usr/local/go/src/runtime/cgocall.go:167 +0x4b fp=0xc00015ae38 sp=0xc00015ae00 pc=0x48378b
github.com/NVIDIA/nvidia-container-toolkit/internal/nvsandboxutils._Cfunc_nvSandboxUtilsShutdown()
_cgo_gotypes.go:171 +0x45 fp=0xc00015ae60 sp=0xc00015ae38 pc=0x6762e5
github.com/NVIDIA/nvidia-container-toolkit/internal/nvsandboxutils.nvSandboxUtilsShutdown(...)
/go/src/nvidia-container-toolkit/internal/nvsandboxutils/nvsandboxutils.go:42
github.com/NVIDIA/nvidia-container-toolkit/internal/nvsandboxutils.(*library).Shutdown(0xc000173b90)
/go/src/nvidia-container-toolkit/internal/nvsandboxutils/impl.go:36 +0x19 fp=0xc00015ae78 sp=0xc00015ae60 pc=0x6768f9
github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi.(*nvmllib).tryShutdown(0xc0000d7520)
/go/src/nvidia-container-toolkit/pkg/nvcdi/lib-nvml.go:229 +0x33 fp=0xc00015aec8 sp=0xc00015ae78 pc=0x6a77d3
github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi.(*nvmllib).DeviceSpecGenerators.deferwrap1()
/go/src/nvidia-container-toolkit/pkg/nvcdi/lib-nvml.go:57 +0x25 fp=0xc00015aee0 sp=0xc00015aec8 pc=0x6a6685
github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi.(*nvmllib).DeviceSpecGenerators(0xc0000d7520, {0xc00016e9b0, 0x1, 0x1})
/go/src/nvidia-container-toolkit/pkg/nvcdi/lib-nvml.go:63 +0x126 fp=0xc00015af60 sp=0xc00015aee0 pc=0x6a6586
github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi.(*wrapper).GetDeviceSpecsByID(0x3?, {0xc00016e9b0?, 0x0?, 0x5?})
/go/src/nvidia-container-toolkit/pkg/nvcdi/wrapper.go:79 +0x22 fp=0xc00015afa8 sp=0xc00015af60 pc=0x6acb82
github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi.(*wrapper).GetSpec(0xc0000dca50, {0xc00016e9b0?, 0x792b90?, 0xc00016e9c0?})
/go/src/nvidia-container-toolkit/pkg/nvcdi/wrapper.go:56 +0x8e fp=0xc00015b1b0 sp=0xc00015afa8 pc=0x6ac70e
github.com/NVIDIA/nvidia-container-toolkit/internal/modifier.newAutomaticCDISpecModifier({0x7fb0b0, 0xc0000ae440}, 0xc000082e00, {0xc00016e960, 0x1, 0x1})
/go/src/nvidia-container-toolkit/internal/modifier/cdi.go:200 +0x95b fp=0xc00015b670 sp=0xc00015b1b0 pc=0x6b40bb
github.com/NVIDIA/nvidia-container-toolkit/internal/modifier.NewCDIModifier({0x7fb0b0, 0xc0000ae440}, 0xc000082e00, {{0x7fb0b0, 0xc0000ae440}, 0x0, 0xc0001733e0, 0x0, {0xc00017a808, 0xa, ...}, ...}, ...)
/go/src/nvidia-container-toolkit/internal/modifier/cdi.go:68 +0x6bd fp=0xc00015b970 sp=0xc00015b670 pc=0x6b2afd
github.com/NVIDIA/nvidia-container-toolkit/internal/runtime.newModeModifier({0x7fb0b0?, 0xc0000ae440?}, {0x79349c?, 0x76d0e0?}, 0x4?, {{0x7fb0b0, 0xc0000ae440}, 0x0, 0xc0001733e0, 0x0, ...})
/go/src/nvidia-container-toolkit/internal/runtime/runtime_factory.go:112 +0x1fa fp=0xc00015ba38 sp=0xc00015b970 pc=0x6bc49a
github.com/NVIDIA/nvidia-container-toolkit/internal/runtime.newSpecModifier({0x7fb0b0, 0xc0000ae440}, 0xc000082e00, {0x7fa408?, 0xc0000a06a8?}, 0xc0000d8b80)
/go/src/nvidia-container-toolkit/internal/runtime/runtime_factory.go:73 +0x105 fp=0xc00015bcb0 sp=0xc00015ba38 pc=0x6bb945
github.com/NVIDIA/nvidia-container-toolkit/internal/runtime.newNVIDIAContainerRuntime({0x7fb0b0, 0xc0000ae440}, 0xc000082e00, {0xc0000b0000, 0xf, 0xf}, 0xc0000d8b80)
/go/src/nvidia-container-toolkit/internal/runtime/runtime_factory.go:50 +0x22d fp=0xc00015bd48 sp=0xc00015bcb0 pc=0x6bb70d
github.com/NVIDIA/nvidia-container-toolkit/internal/runtime.rt.Run({0xc0000ae440?, {0x0?, 0xc0000d8600?}}, {0xc0000b0000, 0xf, 0xf})
/go/src/nvidia-container-toolkit/internal/runtime/runtime.go:82 +0x5bb fp=0xc00015bee0 sp=0xc00015bd48 pc=0x6babbb
github.com/NVIDIA/nvidia-container-toolkit/internal/runtime.(*rt).Run(0x0?, {0xc0000b0000?, 0xc51bd8?, 0xc000002380?})
<autogenerated>:1 +0x47 fp=0xc00015bf20 sp=0xc00015bee0 pc=0x6bd967
main.main()
/go/src/nvidia-container-toolkit/cmd/nvidia-container-runtime/main.go:11 +0x3b fp=0xc00015bf50 sp=0xc00015bf20 pc=0x6bda9b
runtime.main()
/usr/local/go/src/runtime/proc.go:285 +0x29d fp=0xc00015bfe0 sp=0xc00015bf50 pc=0x452d3d
runtime.goexit({})
/usr/local/go/src/runtime/asm_amd64.s:1693 +0x1 fp=0xc00015bfe8 sp=0xc00015bfe0 pc=0x48ce61
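From the frames above, the abort happens inside the cgo call to nvSandboxUtilsShutdown, which (*nvmllib).tryShutdown runs from a defer. Purely as an illustration of the failure shape (this is not the toolkit's code, and the names are made up): glibc aborts with "free(): invalid pointer" when free() receives a pointer that malloc() never returned, and because the abort is raised while C code is on the stack, the Go runtime reports "signal arrived during cgo execution", exactly as in the trace:

package main

/*
#include <stdlib.h>

// Handing free() a pointer that malloc() never returned makes glibc
// abort with "free(): invalid pointer" -- the same message as above.
static void badShutdown(void) {
    char *p = malloc(16);
    free(p + 1); // inside the block, but not the start of an allocation
}
*/
import "C"

func main() {
	// The SIGABRT is raised during the C call, so the Go runtime prints
	// "SIGABRT: abort ... signal arrived during cgo execution".
	C.badShutdown()
}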
GPU:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB           On  |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P0             25W /  300W |       1MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
The host is running Ubuntu 22.04:
$ apt-cache policy nvidia-container-toolkit
nvidia-container-toolkit:
  Installed: 1.18.0-1
  Candidate: 1.18.0-1
  Version table:
     1.18.0-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
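In case it helps anyone else hitting this, the downgrade that restores a working setup looks roughly like the following (the exact package set may vary by installation):

$ sudo apt-get install nvidia-container-toolkit=1.17.9-1 \
    nvidia-container-toolkit-base=1.17.9-1 \
    libnvidia-container-tools=1.17.9-1 \
    libnvidia-container1=1.17.9-1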
It seems to be related to native.cgroupdriver=cgroupfs in our /etc/docker/daemon.json:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
which we previously added to work around #48.
Is this expected in the new version? I'm not sure how to debug further, but I'm happy to provide any additional info.
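If it helps narrow things down, the trace goes through the automatic CDI spec generation path (nvcdi.(*wrapper).GetSpec), so I can presumably also try to reproduce it outside Docker with something like:

$ sudo nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml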