Description
Multiple race conditions in metadata serialization/deserialization cause intermittent failures during KV cache transfer setup, especially in multi-rail configurations.
Environment
- NIXL 0.8.0
- Multi-rail libfabric backend
- Disaggregated inference (separate prefill/decode workers)
Issues Found
1. Empty Descriptor Crash (v81)
File: src/infra/nixl_memory_section.cpp
When the remote agent publishes its metadata before the KV cache has been allocated, the empty descriptor list causes a hard error:

```cpp
if (s_desc.descCount() == 0)
    return NIXL_ERR_NOT_FOUND; // Should continue, not fail
```
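A minimal sketch of the suggested behavior; `DescList` and the surrounding loop are illustrative stand-ins, not the real types in nixl_memory_section.cpp:

```cpp
#include <vector>

// Hypothetical stand-in for a NIXL descriptor list.
struct DescList {
    int count;
    int descCount() const { return count; }
};

// Deserialize a batch of descriptor lists, skipping (not failing on) empty
// ones: a remote agent may publish metadata before its KV cache exists.
int loadSections(const std::vector<DescList>& lists) {
    int loaded = 0;
    for (const DescList& s_desc : lists) {
        if (s_desc.descCount() == 0)
            continue; // previously: return NIXL_ERR_NOT_FOUND
        ++loaded;     // real code would deserialize the section here
    }
    return loaded;
}
```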
2. Missing MemSection Marker (v82)
File: src/core/nixl_agent.cpp
A missing section marker causes a permanent failure instead of a retry:

```cpp
if (sd.getStr("") != "MemSection") {
    return NIXL_ERR_MISMATCH; // Should warn and retry
}
```
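A sketch of the warn-and-retry idea; the `SerDes` type and the `NIXL_ERR_RETRY` status are illustrative assumptions, not NIXL's actual definitions:

```cpp
#include <iostream>
#include <string>

// Hypothetical status values mirroring NIXL's error-code style.
enum nixl_status_t { NIXL_SUCCESS = 0, NIXL_ERR_MISMATCH = -1, NIXL_ERR_RETRY = -2 };

// Hypothetical stand-in for the serializer/deserializer handle.
struct SerDes {
    std::string marker;
    std::string getStr(const std::string&) const { return marker; }
};

nixl_status_t checkMemSectionMarker(const SerDes& sd) {
    if (sd.getStr("") != "MemSection") {
        // The marker can be absent when deserialization races with the
        // remote publish; warn and signal a retryable condition instead
        // of failing permanently with NIXL_ERR_MISMATCH.
        std::cerr << "warning: MemSection marker missing, retrying later\n";
        return NIXL_ERR_RETRY; // hypothetical retryable status
    }
    return NIXL_SUCCESS;
}
```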
3. Premature Backend Removal (v85)
File: src/core/nixl_listener.cpp
On a metadata load failure, the backend connection is torn down, so recovery requires a full restart:

```cpp
remoteBackends.erase(remote_name); // Should keep connection
```
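A sketch of keeping the connection alive on transient failure; the container types and the retry set are illustrative only:

```cpp
#include <map>
#include <set>
#include <string>

std::map<std::string, int> remoteBackends;    // name -> connection handle (illustrative)
std::set<std::string> pendingMetadataRetries; // hypothetical retry bookkeeping

void onMetadataLoadFailure(const std::string& remote_name) {
    // Previously: remoteBackends.erase(remote_name); // forced a pod restart
    // Keep the established connection and retry the metadata load later.
    pendingMetadataRetries.insert(remote_name);
}
```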
4. Segment Count Race (v86)
File: src/infra/nixl_memory_section.cpp
seg_count can claim more backends than were actually serialized, so the loop reads an empty backend name and fails:

```cpp
if (nixl_backend.size() == 0)
    return NIXL_ERR_INVALID_PARAM; // Should break loop
```
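A sketch of the break-instead-of-fail behavior, with the serialized input modeled as a simple vector of names (an assumption for illustration):

```cpp
#include <string>
#include <vector>

// Deserialize up to seg_count backend names; seg_count can race ahead of
// what was actually serialized, so an empty name marks the real end.
std::vector<std::string> readBackends(const std::vector<std::string>& serialized,
                                      size_t seg_count) {
    std::vector<std::string> backends;
    for (size_t i = 0; i < seg_count && i < serialized.size(); ++i) {
        const std::string& nixl_backend = serialized[i];
        if (nixl_backend.size() == 0)
            break; // previously: return NIXL_ERR_INVALID_PARAM
        backends.push_back(nixl_backend);
    }
    return backends;
}
```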
5. Stale Cache Entries (v88/v89)
File: src/core/nixl_agent.cpp
Empty or failed metadata loads are cached permanently, preventing any later retry from succeeding.
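A sketch of the suggested caching rule; the cache's actual type and call sites in nixl_agent.cpp are assumptions here:

```cpp
#include <map>
#include <string>

std::map<std::string, std::string> metadataCache; // agent name -> metadata blob (illustrative)

// Cache only successful, non-empty loads so a later attempt can still succeed.
void cacheMetadata(const std::string& agent, const std::string& blob, bool load_ok) {
    if (!load_ok || blob.empty())
        return; // previously: empty/failed results were cached permanently
    metadataCache[agent] = blob;
}
```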
Impact
These issues cause intermittent connection failures that require pod restarts to resolve, especially in multi-worker disaggregated inference setups.
Proposed Fixes
See individual patches in our repository for detailed fixes. General approach (a caller-side sketch follows the list):
- Replace hard errors with warnings + retry
- Don't cache failed/empty metadata
- Keep connections alive on transient failures
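For illustration, a caller-side sketch of the retry approach; `tryLoad` stands in for one metadata load attempt (e.g. a wrapper around the agent's remote-metadata load), and the backoff parameters are arbitrary:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Drive the warn-and-retry approach from the caller side: transient
// failures back off and retry instead of tearing down connections.
bool loadWithRetry(const std::function<bool()>& tryLoad, int max_attempts = 5) {
    auto delay = std::chrono::milliseconds(50);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (tryLoad())
            return true;
        std::this_thread::sleep_for(delay);
        delay *= 2; // exponential backoff between attempts
    }
    return false; // give up only after repeated transient failures
}
```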