Open
Description
Race condition between fi_cq_read() and EFA driver completion posting causes sporadic completion queue errors and crashes.
Environment
- NIXL 0.8.0
- AWS EFA with libfabric backend
- High-throughput KV cache transfers
File
src/utils/libfabric/libfabric_rail.cpp
Symptoms
- -FI_EAGAIN followed by crash
- Sporadic completion queue errors
- Intermittent transfer failures
Root Cause
The completion queue is read immediately after operations are posted, with no synchronization in between. At that point the EFA driver may not yet have posted the corresponding completion.
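For context, libfabric documents -FI_EAGAIN from fi_cq_read() as "no completions available", so a read that lands before the completion has been posted should surface as that code. A minimal sketch of a bounded retry around such a read is shown here. The helper name `poll_until_complete`, the `PollResult` enum, and the `kNoCompletionYet` stand-in are hypothetical and not part of NIXL or libfabric. The stand-in assumes FI_EAGAIN maps to the system EAGAIN value, so that the example builds without libfabric headers.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/types.h>  // ssize_t

// Stand-in for -FI_EAGAIN (assumption: FI_EAGAIN equals EAGAIN).
// Real code would include <rdma/fi_errno.h> and use FI_EAGAIN directly.
constexpr ssize_t kNoCompletionYet = -EAGAIN;

enum class PollResult { Completed, TimedOut, Failed };

// Hypothetical helper: calls read_once() until it reports at least one
// completion, returns a negative code other than "try again", or
// max_attempts reads have been made.
template <typename ReadOnce>
PollResult poll_until_complete(ReadOnce read_once, std::size_t max_attempts) {
    for (std::size_t attempt = 0; attempt < max_attempts; ++attempt) {
        const ssize_t ret = read_once();
        if (ret > 0) {
            return PollResult::Completed;  // one or more entries were read
        }
        if (ret < 0 && ret != kNoCompletionYet) {
            return PollResult::Failed;     // a real error, not an empty queue
        }
        // ret == 0 or "no completion yet": the queue was empty, so read again.
    }
    return PollResult::TimedOut;
}
```

In the rail code, `read_once` would presumably be a lambda such as `[&] { return fi_cq_read(cq, &completion, 1); }`. Bounding the attempts keeps a lost completion from turning into an infinite spin.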
Proposed Fix
Add a memory barrier and a small delay before reading the completion queue:
std::atomic_thread_fence(std::memory_order_seq_cst);
sched_yield();
usleep(1000); // 1ms delay for driver sync
ret = fi_cq_read(cq, &completion, 1);
Performance Note
The 1 ms delay has negligible impact on overall throughput because transfers are batched, but it significantly improves stability under high load.
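To make the delay easier to tune and measure, the fence, yield, and sleep sequence from the proposed fix could be wrapped in a small helper. This is only a sketch. The name `read_after_delay` and its `delay_us` parameter are hypothetical, and the read itself is passed in as a callable so the example compiles without libfabric.

```cpp
#include <atomic>
#include <sched.h>      // sched_yield
#include <sys/types.h>  // ssize_t, useconds_t
#include <unistd.h>     // usleep

// Hypothetical helper: issues a full fence, yields the CPU, sleeps for
// delay_us microseconds, then performs a single read via read_once().
template <typename ReadOnce>
ssize_t read_after_delay(ReadOnce read_once, useconds_t delay_us = 1000) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    sched_yield();
    usleep(delay_us);  // defaults to the 1 ms proposed in this issue
    return read_once();
}
```

With the delay as a parameter, the throughput cost can be checked at several values (for example across the 32-rail configuration) instead of being fixed at 1 ms.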
Impact
Causes intermittent failures under high-throughput conditions, especially with 32 rails.