Race condition in fi_cq_read causes sporadic completion queue errors #1162

@dmvevents

Description

Race condition between fi_cq_read() and EFA driver completion posting causes sporadic completion queue errors and crashes.

Environment

  • NIXL 0.8.0
  • AWS EFA with libfabric backend
  • High-throughput KV cache transfers

File

src/utils/libfabric/libfabric_rail.cpp

Symptoms

  • -FI_EAGAIN followed by a crash
  • Sporadic completion queue errors
  • Intermittent transfer failures

Root Cause

The completion queue is read immediately after operations are posted, with no synchronization between the post and the read. The EFA driver may not have finished posting the completion by the time fi_cq_read() is called.

Proposed Fix

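Add a memory barrier and a small delay before reading the completion queue:

// Requires <atomic>, <sched.h>, <unistd.h>
std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence before the read
sched_yield();                                        // let other runnable threads go first
usleep(1000);  // 1 ms delay for driver sync
ret = fi_cq_read(cq, &completion, 1);  // read at most one completion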
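
For comparison: the libfabric documentation describes -FI_EAGAIN from fi_cq_read() as "no completions available yet" rather than a failure, and -FI_EAVAIL as a signal that an error entry should be fetched with fi_cq_readerr(). Below is a minimal sketch of a bounded retry loop built on that behavior. It is not the code in libfabric_rail.cpp. The helper name poll_one, the attempt limit, and the backoff are hypothetical, and the CQ read is abstracted as a callable so the snippet compiles without libfabric headers. The constant kEagain stands in for FI_EAGAIN and is assumed here to equal EAGAIN.

```cpp
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>
#include <sys/types.h>  // ssize_t

// Stand-in for FI_EAGAIN (assumed equal to EAGAIN); real code would
// include <rdma/fi_errno.h> and use FI_EAGAIN directly.
constexpr ssize_t kEagain = EAGAIN;

// Hypothetical helper: calls read_cq until it returns something other than
// -kEagain or until max_attempts calls have been made.
// Returns the last value produced by read_cq (or -kEagain if never called).
ssize_t poll_one(const std::function<ssize_t()>& read_cq,
                 int max_attempts,
                 std::chrono::microseconds backoff) {
    ssize_t ret = -kEagain;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        ret = read_cq();
        if (ret != -kEagain) {
            // > 0: number of completions read; < 0: a real error code
            // that the caller must handle (for example -FI_EAVAIL).
            return ret;
        }
        // Nothing ready yet: back off briefly before trying again.
        std::this_thread::sleep_for(backoff);
    }
    return ret;  // still -kEagain after max_attempts
}
```

In real code the callable would be something like a lambda wrapping fi_cq_read(cq, &completion, 1), and the caller would treat a final -kEagain as "try again later" rather than as an error.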

Performance Note

The 1 ms delay has negligible impact on overall throughput because transfers are batched, and it significantly improves stability under high load.
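
How negligible the delay is depends on how many reads each batch needs. The following back-of-the-envelope sketch assumes the delay is paid on every fi_cq_read() call and, hypothetically, that a single thread polls the rails one after another. The function names are illustrative only and do not come from NIXL.

```cpp
#include <cstdint>

// Upper bound on delayed reads per second for one polling thread when every
// read is preceded by a fixed sleep of delay_us microseconds (delay_us > 0).
constexpr std::uint64_t max_reads_per_second(std::uint64_t delay_us) {
    return 1000000 / delay_us;
}

// Minimum time in microseconds for one thread to visit every rail once,
// assuming the rails are polled sequentially (an assumption, not a
// statement about how NIXL schedules its rails).
constexpr std::uint64_t min_sweep_us(std::uint64_t rails, std::uint64_t delay_us) {
    return rails * delay_us;
}

// A 1 ms delay caps a single thread at 1000 reads per second.
static_assert(max_reads_per_second(1000) == 1000, "reads per second");
// 32 rails at 1 ms each need at least 32 ms per full sweep.
static_assert(min_sweep_us(32, 1000) == 32000, "sweep time");
```

Under those assumptions the cost stays small only if each batch needs few completion reads; measuring with the real rail count and batch size would confirm it.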

Impact

Causes intermittent failures under high-throughput conditions, especially with 32 rails.
