## Description
When using the libfabric backend with GPU Direct RDMA (`FI_EFA_USE_DEVICE_RDMA=1`), the progress thread crashes with a CUDA context error because `pthrCudaCtx_` is NULL when the thread starts.
## Environment
- NIXL 0.8.0
- AWS P5.48xlarge (H100 GPUs + EFA)
- TRT-LLM disaggregated inference
- `FI_EFA_USE_DEVICE_RDMA=1`
## Symptoms
- `CUDA error: invalid device context`
- Segmentation fault during `fi_read` with GPU memory
## Root Cause
In `src/plugins/libfabric/libfabric_backend.cpp`, the progress thread is started in the constructor, before `registerMem()` is ever called. When the thread then tries to access GPU memory via `fi_read()`, the CUDA context (`pthrCudaCtx_`) is still NULL.

The UCX backend handles this correctly by restarting the thread when the context changes, but the libfabric backend does not.
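To make the ordering concrete, here is a small standalone sketch of the race. The class and member names below are illustrative stand-ins (only `registerMem` and the context member mirror the report); this is not NIXL source:

```cpp
// Standalone sketch of the ordering bug (hypothetical names, not NIXL code).
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct FakeBackend {
    std::atomic<void *> cudaCtx_{nullptr}; // stands in for pthrCudaCtx_
    std::atomic<bool> stop_{false};
    std::thread worker_;

    FakeBackend() {
        // The progress thread starts in the constructor, before any
        // memory registration has happened.
        worker_ = std::thread([this] {
            while (!stop_.load()) {
                if (cudaCtx_.load() == nullptr) {
                    // A real backend would issue fi_read() on GPU memory
                    // here and crash: no CUDA context is bound to this thread.
                    std::puts("progress iteration with NULL CUDA context");
                }
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
            }
        });
    }

    void registerMem() {
        // The context is only captured now, after the thread is running.
        cudaCtx_.store(reinterpret_cast<void *>(0x1));
    }

    ~FakeBackend() {
        stop_.store(true);
        worker_.join();
    }
};

int main() {
    FakeBackend b;
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    b.registerMem(); // the thread has already iterated with a NULL context
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```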
## Proposed Fix
Apply the CUDA context inside the progress loop on every iteration, not just at thread start:
```cpp
void LibfabricBackend::progressThread() {
    while (!progress_thread_stop_.load()) {
#ifdef HAVE_CUDA
        if (cuda_addr_wa_) {
            vramApplyCtx(); // Apply context on EVERY iteration
        }
#endif
        // ... rest of progress loop
    }
}
```
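For context, a minimal sketch of what such a `vramApplyCtx()` helper might do is below. This is an assumption based on the report's description (binding the context captured in `registerMem()` to the calling thread via the CUDA Driver API), not the actual NIXL implementation:

```cpp
#include <cuda.h>

// Hypothetical sketch of vramApplyCtx(); the real helper in the backend
// may differ. pthrCudaCtx points at the context captured during
// registerMem() and can still be NULL before any GPU registration.
static void vramApplyCtx(CUcontext pthrCudaCtx) {
    if (pthrCudaCtx != nullptr) {
        CUresult rc = cuCtxSetCurrent(pthrCudaCtx);
        if (rc != CUDA_SUCCESS) {
            // Log and continue; the next loop iteration will retry.
        }
    }
}
```

Since `cuCtxSetCurrent` is a thin Driver API call, invoking it on every iteration should be an inexpensive safety net, and it keeps the loop robust if the context only appears (or changes) after the thread has started.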
## Workaround
None - requires source patch.
## Impact
This blocks GPU Direct RDMA for all libfabric users with CUDA memory.