
Segmentation fault in nvsandboxutils._Cfunc_nvSandboxUtilsShutdown() on Ubuntu 22.04 (v1.18.0) #1398

@timpalpant

Description

After installing nvidia-container-toolkit=1.18.0-1 I encounter the following crash on a Google Cloud n1-standard-4 instance with a V100 GPU. Downgrading to 1.17.9-1 resolves it.

$ docker run -it --gpus=all alpine                   
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: nvidia-container-runtime did not terminate successfully: exit status 2: free(): invalid pointer
SIGABRT: abort
PC=0x723e54e969fc m=0 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1 gp=0xc000002380 m=0 mp=0xc6ad60 [syscall]:
runtime.cgocall(0x6d4990, 0xc00015ae60)
	/usr/local/go/src/runtime/cgocall.go:167 +0x4b fp=0xc00015ae38 sp=0xc00015ae00 pc=0x48378b
github.com/NVIDIA/nvidia-container-toolkit/internal/nvsandboxutils._Cfunc_nvSandboxUtilsShutdown()
	_cgo_gotypes.go:171 +0x45 fp=0xc00015ae60 sp=0xc00015ae38 pc=0x6762e5
github.com/NVIDIA/nvidia-container-toolkit/internal/nvsandboxutils.nvSandboxUtilsShutdown(...)
	/go/src/nvidia-container-toolkit/internal/nvsandboxutils/nvsandboxutils.go:42
github.com/NVIDIA/nvidia-container-toolkit/internal/nvsandboxutils.(*library).Shutdown(0xc000173b90)
	/go/src/nvidia-container-toolkit/internal/nvsandboxutils/impl.go:36 +0x19 fp=0xc00015ae78 sp=0xc00015ae60 pc=0x6768f9
github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi.(*nvmllib).tryShutdown(0xc0000d7520)
	/go/src/nvidia-container-toolkit/pkg/nvcdi/lib-nvml.go:229 +0x33 fp=0xc00015aec8 sp=0xc00015ae78 pc=0x6a77d3
github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi.(*nvmllib).DeviceSpecGenerators.deferwrap1()
	/go/src/nvidia-container-toolkit/pkg/nvcdi/lib-nvml.go:57 +0x25 fp=0xc00015aee0 sp=0xc00015aec8 pc=0x6a6685
github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi.(*nvmllib).DeviceSpecGenerators(0xc0000d7520, {0xc00016e9b0, 0x1, 0x1})
	/go/src/nvidia-container-toolkit/pkg/nvcdi/lib-nvml.go:63 +0x126 fp=0xc00015af60 sp=0xc00015aee0 pc=0x6a6586
github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi.(*wrapper).GetDeviceSpecsByID(0x3?, {0xc00016e9b0?, 0x0?, 0x5?})
	/go/src/nvidia-container-toolkit/pkg/nvcdi/wrapper.go:79 +0x22 fp=0xc00015afa8 sp=0xc00015af60 pc=0x6acb82
github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi.(*wrapper).GetSpec(0xc0000dca50, {0xc00016e9b0?, 0x792b90?, 0xc00016e9c0?})
	/go/src/nvidia-container-toolkit/pkg/nvcdi/wrapper.go:56 +0x8e fp=0xc00015b1b0 sp=0xc00015afa8 pc=0x6ac70e
github.com/NVIDIA/nvidia-container-toolkit/internal/modifier.newAutomaticCDISpecModifier({0x7fb0b0, 0xc0000ae440}, 0xc000082e00, {0xc00016e960, 0x1, 0x1})
	/go/src/nvidia-container-toolkit/internal/modifier/cdi.go:200 +0x95b fp=0xc00015b670 sp=0xc00015b1b0 pc=0x6b40bb
github.com/NVIDIA/nvidia-container-toolkit/internal/modifier.NewCDIModifier({0x7fb0b0, 0xc0000ae440}, 0xc000082e00, {{0x7fb0b0, 0xc0000ae440}, 0x0, 0xc0001733e0, 0x0, {0xc00017a808, 0xa, ...}, ...}, ...)
	/go/src/nvidia-container-toolkit/internal/modifier/cdi.go:68 +0x6bd fp=0xc00015b970 sp=0xc00015b670 pc=0x6b2afd
github.com/NVIDIA/nvidia-container-toolkit/internal/runtime.newModeModifier({0x7fb0b0?, 0xc0000ae440?}, {0x79349c?, 0x76d0e0?}, 0x4?, {{0x7fb0b0, 0xc0000ae440}, 0x0, 0xc0001733e0, 0x0, ...})
	/go/src/nvidia-container-toolkit/internal/runtime/runtime_factory.go:112 +0x1fa fp=0xc00015ba38 sp=0xc00015b970 pc=0x6bc49a
github.com/NVIDIA/nvidia-container-toolkit/internal/runtime.newSpecModifier({0x7fb0b0, 0xc0000ae440}, 0xc000082e00, {0x7fa408?, 0xc0000a06a8?}, 0xc0000d8b80)
	/go/src/nvidia-container-toolkit/internal/runtime/runtime_factory.go:73 +0x105 fp=0xc00015bcb0 sp=0xc00015ba38 pc=0x6bb945
github.com/NVIDIA/nvidia-container-toolkit/internal/runtime.newNVIDIAContainerRuntime({0x7fb0b0, 0xc0000ae440}, 0xc000082e00, {0xc0000b0000, 0xf, 0xf}, 0xc0000d8b80)
	/go/src/nvidia-container-toolkit/internal/runtime/runtime_factory.go:50 +0x22d fp=0xc00015bd48 sp=0xc00015bcb0 pc=0x6bb70d
github.com/NVIDIA/nvidia-container-toolkit/internal/runtime.rt.Run({0xc0000ae440?, {0x0?, 0xc0000d8600?}}, {0xc0000b0000, 0xf, 0xf})
	/go/src/nvidia-container-toolkit/internal/runtime/runtime.go:82 +0x5bb fp=0xc00015bee0 sp=0xc00015bd48 pc=0x6babbb
github.com/NVIDIA/nvidia-container-toolkit/internal/runtime.(*rt).Run(0x0?, {0xc0000b0000?, 0xc51bd8?, 0xc000002380?})
	<autogenerated>:1 +0x47 fp=0xc00015bf20 sp=0xc00015bee0 pc=0x6bd967
main.main()
	/go/src/nvidia-container-toolkit/cmd/nvidia-container-runtime/main.go:11 +0x3b fp=0xc00015bf50 sp=0xc00015bf20 pc=0x6bda9b
runtime.main()
	/usr/local/go/src/runtime/proc.go:285 +0x29d fp=0xc00015bfe0 sp=0xc00015bf50 pc=0x452d3d
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1693 +0x1 fp=0xc00015bfe8 sp=0xc00015bfe0 pc=0x48ce61

GPU:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB           On  |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P0             25W /  300W |       1MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Running Ubuntu 22.04:

$ apt-cache policy nvidia-container-toolkit
nvidia-container-toolkit:
  Installed: 1.18.0-1
  Candidate: 1.18.0-1
  Version table:
     1.18.0-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
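For reference, the downgrade is just an apt version pin back to the previous release from the same repo (the companion packages may need the same pin; exact version strings could differ on other setups):

$ sudo apt-get install \
    nvidia-container-toolkit=1.17.9-1 \
    nvidia-container-toolkit-base=1.17.9-1 \
    libnvidia-container-tools=1.17.9-1 \
    libnvidia-container1=1.17.9-1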

It seems to be related to native.cgroupdriver=cgroupfs in our /etc/docker/daemon.json:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

which we previously added to work around #48.
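If it helps with triage, the active cgroup driver can be read back from the daemon with docker info; with the exec-opts above it should report cgroupfs:

$ docker info --format '{{.CgroupDriver}}'
cgroupfs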

Is this expected in the new version? Not sure how to debug further, but happy to provide any additional info.
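One guess based on the trace: the crash happens while the runtime builds the automatic CDI spec (pkg/nvcdi) and then shuts down nvsandboxutils, so it might be reproducible outside Docker by generating a CDI spec directly with the toolkit CLI (untested on my side):

$ nvidia-ctk cdi generate --output=/tmp/nvidia-cdi.yaml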
