
CUDA context crash in libfabric backend progress thread when using GPU Direct RDMA #1157

@dmvevents

Description

When using the libfabric backend with GPU Direct RDMA (FI_EFA_USE_DEVICE_RDMA=1), the progress thread crashes with a CUDA context error: pthrCudaCtx_ is still NULL when the thread starts, because the context is only captured later during memory registration.

Environment

  • NIXL 0.8.0
  • AWS P5.48xlarge (H100 GPUs + EFA)
  • TRT-LLM disaggregated inference
  • FI_EFA_USE_DEVICE_RDMA=1

Symptoms

CUDA error: invalid device context
Segmentation fault during fi_read with GPU memory

Root Cause

In src/plugins/libfabric/libfabric_backend.cpp, the progress thread starts in the constructor BEFORE registerMem() is called. When the thread tries to access GPU memory via fi_read(), the CUDA context (pthrCudaCtx_) is NULL.

The UCX backend handles this correctly by restarting the thread when the context changes, but the libfabric backend does not.
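
To make the ordering concrete, here is a minimal sketch of the sequence described above (pthrCudaCtx_ and registerMem() come from the issue; the class layout and other names are illustrative, not the actual NIXL source):

#include <cuda.h>   // CUDA driver API (cuCtxGetCurrent)
#include <thread>

class LibfabricBackend {
    CUcontext pthrCudaCtx_ = nullptr;  // still NULL when the thread starts
    std::thread progress_thread_;

public:
    LibfabricBackend() {
        // The progress thread is launched here, before any memory is registered...
        progress_thread_ = std::thread(&LibfabricBackend::progressThread, this);
    }

    void registerMem() {
        // ...but the CUDA context is only captured here, after construction.
        cuCtxGetCurrent(&pthrCudaCtx_);
    }

    void progressThread() {
        // fi_read() on GPU memory then runs on a thread with no current CUDA
        // context -> "invalid device context" and the segfault above.
    }
};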

Proposed Fix

Apply the CUDA context INSIDE the progress loop on every iteration, not just once at thread start:

void LibfabricBackend::progressThread() {
    while (!progress_thread_stop_.load()) {
#ifdef HAVE_CUDA
        if (cuda_addr_wa_) {
            vramApplyCtx();  // Apply context on EVERY iteration
        }
#endif
        // ... rest of progress loop
    }
}
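
For reference, a minimal sketch of what vramApplyCtx() could do, assuming pthrCudaCtx_ holds the CUcontext captured during registerMem() (the actual NIXL implementation may differ):

void LibfabricBackend::vramApplyCtx() {
    // Make the context captured at registration time current on this thread.
    // cuCtxSetCurrent() is a cheap CUDA driver call and is effectively a
    // no-op when the context is already current, so calling it on every
    // loop iteration is acceptable.
    if (pthrCudaCtx_ != nullptr) {
        cuCtxSetCurrent(pthrCudaCtx_);
    }
}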

Workaround

None - requires source patch.

Impact

This blocks GPU Direct RDMA for all libfabric users with CUDA memory.
