Race conditions in metadata deserialization cause intermittent failures #1159

@dmvevents

Description

Multiple race conditions in metadata serialization/deserialization cause intermittent failures during KV cache transfer setup, especially in multi-rail configurations.

Environment

  • NIXL 0.8.0
  • Multi-rail libfabric backend
  • Disaggregated inference (separate prefill/decode workers)

Issues Found

1. Empty Descriptor Crash (v81)

File: src/infra/nixl_memory_section.cpp

When a remote agent publishes metadata before its KV cache is allocated, the empty descriptor list is treated as an error:

if (s_desc.descCount()==0)
    return NIXL_ERR_NOT_FOUND;  // Should continue, not fail
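
A minimal sketch of the "continue, not fail" behavior, under stated assumptions: `Status`, `DescList`, and `loadSections` are hypothetical stand-ins written for illustration and are not NIXL's real types or functions. Only the `descCount()==0` check is taken from the snippet above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical status codes; a stand-in, not NIXL's real definition.
enum class Status { Success, NotFound };

// Hypothetical stand-in for one backend's serialized descriptor list.
struct DescList {
    std::string backend;
    std::vector<int> descs;
    std::size_t descCount() const { return descs.size(); }
};

// Load every non-empty descriptor list and skip empty ones instead of failing.
// `loaded` receives the names of the backends that actually had descriptors.
Status loadSections(const std::vector<DescList> &lists,
                    std::vector<std::string> &loaded) {
    for (const auto &s_desc : lists) {
        if (s_desc.descCount() == 0)
            continue; // nothing registered yet; not an error
        loaded.push_back(s_desc.backend);
    }
    return Status::Success;
}
```

With this shape, metadata published before allocation loads as "no sections yet" rather than aborting the whole load.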

2. Missing MemSection Marker (v82)

File: src/core/nixl_agent.cpp

A missing marker causes a permanent failure instead of a retry:

if (sd.getStr("") != "MemSection") {
    return NIXL_ERR_MISMATCH;  // Should warn and retry
}
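
One way the "warn and retry" outcome could look, as a sketch only: `LoadResult` and `checkMarker` are hypothetical names, and the real code reads the marker through `sd.getStr("")` rather than taking a string argument.

```cpp
#include <iostream>
#include <string>

// Hypothetical result type that separates "try again later" from success.
enum class LoadResult { Ok, RetryLater };

// Treat a missing marker as "metadata not ready yet" and log a warning,
// rather than returning a permanent mismatch error.
LoadResult checkMarker(const std::string &marker) {
    if (marker != "MemSection") {
        std::cerr << "warning: MemSection marker missing, will retry\n";
        return LoadResult::RetryLater;
    }
    return LoadResult::Ok;
}
```

The caller can then decide how often to retry instead of tearing down state on the first miss.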

3. Premature Backend Removal (v85)

File: src/core/nixl_listener.cpp

On metadata load failure, the backend connection is removed, which then requires a full restart:

remoteBackends.erase(remote_name);  // Should keep connection
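
A sketch of keeping the connection on a failed load, assuming the remote backends live in a map keyed by agent name. `RemoteBackend`, `BackendMap`, and `onMetadataLoad` are hypothetical; only the `remoteBackends` and `remote_name` identifiers come from the snippet above.

```cpp
#include <map>
#include <string>

// Hypothetical per-remote state: the connection and the metadata validity
// are tracked separately so one can fail without discarding the other.
struct RemoteBackend {
    bool connected = true;
    bool metadata_valid = false;
};

using BackendMap = std::map<std::string, RemoteBackend>;

// Record the outcome of a metadata load without erasing the entry,
// so a later load attempt can reuse the existing connection.
void onMetadataLoad(BackendMap &remoteBackends,
                    const std::string &remote_name,
                    bool load_ok) {
    auto it = remoteBackends.find(remote_name);
    if (it == remoteBackends.end())
        return; // unknown remote; nothing to update
    it->second.metadata_valid = load_ok;
}
```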

4. Segment Count Race (v86)

File: src/infra/nixl_memory_section.cpp

seg_count can indicate more backends than were actually serialized:

if (nixl_backend.size()==0)
    return NIXL_ERR_INVALID_PARAM;  // Should break loop
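
A sketch of the "break loop" behavior, assuming that reading past the serialized data yields an empty backend name. `readBackends` and its vector-of-strings input are hypothetical simplifications of the real deserializer.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Read up to seg_count backend names, stopping at the first empty name.
// An empty name is taken to mean fewer segments were serialized than
// seg_count advertised, so the segments read so far are kept.
std::vector<std::string> readBackends(const std::vector<std::string> &serialized,
                                      std::size_t seg_count) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < seg_count; ++i) {
        // In this sketch, running out of input produces an empty name.
        std::string nixl_backend =
            i < serialized.size() ? serialized[i] : std::string();
        if (nixl_backend.size() == 0)
            break; // stop early instead of returning an error
        out.push_back(nixl_backend);
    }
    return out;
}
```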

5. Stale Cache Entries (v88/v89)

File: src/core/nixl_agent.cpp

Empty or failed metadata loads are cached permanently, preventing retry.
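
A sketch of a cache that stores only successful, non-empty loads, so that a later call retries. `Metadata`, `MetadataCache`, and `getOrLoad` are hypothetical names for illustration; the actual caching code in nixl_agent.cpp is not shown in this issue.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical metadata payload.
struct Metadata {
    std::string blob;
};

class MetadataCache {
public:
    // Returns the cached metadata, or nullptr if the load failed or produced
    // empty metadata. Failed and empty results are never stored, so the next
    // call invokes the loader again.
    const Metadata *getOrLoad(const std::string &agent,
                              const std::function<bool(Metadata &)> &loader) {
        auto it = cache_.find(agent);
        if (it != cache_.end())
            return &it->second;
        Metadata md;
        if (!loader(md) || md.blob.empty())
            return nullptr; // not cached; the caller may retry later
        return &cache_.emplace(agent, std::move(md)).first->second;
    }

private:
    std::map<std::string, Metadata> cache_;
};
```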

Impact

These issues cause intermittent connection failures that require pod restarts to resolve, especially in multi-worker disaggregated inference setups.

Proposed Fixes

See individual patches in our repository for detailed fixes. General approach:

  • Replace hard errors with warnings + retry
  • Don't cache failed/empty metadata
  • Keep connections alive on transient failures
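
The retry part of this approach could be driven by a bounded loop such as the one below. This is a sketch under assumptions: `retryWithBackoff` is a hypothetical helper, and the attempt limit and doubling delay are illustrative choices, not values taken from the patches.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Call attempt() up to max_attempts times, sleeping between failed attempts
// and doubling the delay each time. Returns true on the first success.
bool retryWithBackoff(const std::function<bool()> &attempt,
                      int max_attempts,
                      std::chrono::milliseconds initial_delay) {
    auto delay = initial_delay;
    for (int i = 0; i < max_attempts; ++i) {
        if (attempt())
            return true;
        if (i + 1 < max_attempts) { // no sleep after the final attempt
            std::this_thread::sleep_for(delay);
            delay *= 2;
        }
    }
    return false;
}
```

Bounding the attempts keeps a genuinely broken peer from being retried forever, while transient ordering problems during setup get a chance to resolve.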
