
[BUG] CUDA agent segfault at exit caused by watcher thread #521

@Sy0307

Description

Issue description:

When running GPU workloads (e.g. benchmark/gpu/workload/vec_add.cu) with the bpftime CUDA agent attached, the process sometimes segfaults on exit.
The crash happens in the CUDA watcher thread, which polls ctx->cuda_shared_mem->flag1 in bpf_attach_ctx::start_cuda_watcher_thread().

Root Cause:

The agent process opens the global shared memory with bpftime_initialize_global_shm(shm_open_type::SHM_OPEN_ONLY), i.e. it attaches to a segment it does not own.
Yet in runtime/src/bpftime_shm_internal.cpp, the global destructor __destruct_shm() unconditionally calls bpftime_destroy_global_shm(), even for this attach-only case.

bpftime_shm::~bpftime_shm() unmaps the Boost shared memory segment and calls cudaHostUnregister(base_addr) for the whole segment, which includes the cuda::CommSharedMem holding flag1/flag2.

The CUDA watcher thread is started in bpf_attach_ctx::start_cuda_watcher_thread() and is detached. Because bpf_attach_ctx is stored in a union (bpf_attach_ctx_holder) whose destructor is empty, ~bpf_attach_ctx() never runs at process exit, so cuda_watcher_should_stop is never set to true. The detached thread therefore keeps polling flag1 after the segment has been unmapped and host-unregistered, and that use-after-unmap is the segfault observed at exit.

Labels: bug