Description
Multiple race conditions in metadata serialization/deserialization cause intermittent failures during KV cache transfer setup, especially in multi-rail configurations.
Environment
- NIXL 0.8.0
- Multi-rail libfabric backend
- Disaggregated inference (separate prefill/decode workers)
Issues Found
1. Empty Descriptor Crash (v81)
File: src/infra/nixl_memory_section.cpp
When the remote agent publishes its metadata before the KV cache has been allocated, the empty descriptor list causes a hard error:

```cpp
if (s_desc.descCount() == 0)
    return NIXL_ERR_NOT_FOUND; // Should continue, not fail
```
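A minimal sketch of the suggested behavior; `DescList` and the surrounding loop are illustrative stand-ins, not the real types in nixl_memory_section.cpp:

```cpp
#include <vector>

// Hypothetical stand-in for a NIXL descriptor list.
struct DescList {
    int count;
    int descCount() const { return count; }
};

// Deserialize a batch of descriptor lists, skipping (not failing on) empty
// ones: a remote agent may publish metadata before its KV cache exists.
int loadSections(const std::vector<DescList>& lists) {
    int loaded = 0;
    for (const DescList& s_desc : lists) {
        if (s_desc.descCount() == 0)
            continue; // previously: return NIXL_ERR_NOT_FOUND
        ++loaded;     // real code would deserialize the section here
    }
    return loaded;
}
```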
2. Missing MemSection Marker (v82)
File: src/core/nixl_agent.cpp
A missing section marker causes a permanent failure instead of a retry:

```cpp
if (sd.getStr("") != "MemSection") {
    return NIXL_ERR_MISMATCH; // Should warn and retry
}
```
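A sketch of the warn-and-retry idea; the `SerDes` type and the `NIXL_ERR_RETRY` status are illustrative assumptions, not NIXL's actual definitions:

```cpp
#include <iostream>
#include <string>

// Hypothetical status values mirroring NIXL's error-code style.
enum nixl_status_t { NIXL_SUCCESS = 0, NIXL_ERR_MISMATCH = -1, NIXL_ERR_RETRY = -2 };

// Hypothetical stand-in for the serializer/deserializer handle.
struct SerDes {
    std::string marker;
    std::string getStr(const std::string&) const { return marker; }
};

nixl_status_t checkMemSectionMarker(const SerDes& sd) {
    if (sd.getStr("") != "MemSection") {
        // The marker can be absent when deserialization races with the
        // remote publish; warn and signal a retryable condition instead
        // of failing permanently with NIXL_ERR_MISMATCH.
        std::cerr << "warning: MemSection marker missing, retrying later\n";
        return NIXL_ERR_RETRY; // hypothetical retryable status
    }
    return NIXL_SUCCESS;
}
```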
3. Premature Backend Removal (v85)
File: src/core/nixl_listener.cpp
On a metadata load failure, the backend connection is torn down, so recovery requires a full restart:

```cpp
remoteBackends.erase(remote_name); // Should keep connection
```
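A sketch of keeping the connection alive on transient failure; the container types and the retry set are illustrative only:

```cpp
#include <map>
#include <set>
#include <string>

std::map<std::string, int> remoteBackends;    // name -> connection handle (illustrative)
std::set<std::string> pendingMetadataRetries; // hypothetical retry bookkeeping

void onMetadataLoadFailure(const std::string& remote_name) {
    // Previously: remoteBackends.erase(remote_name); // forced a pod restart
    // Keep the established connection and retry the metadata load later.
    pendingMetadataRetries.insert(remote_name);
}
```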
4. Segment Count Race (v86)
File: src/infra/nixl_memory_section.cpp
seg_count can claim more backends than were actually serialized, so the loop reads an empty backend name and fails:

```cpp
if (nixl_backend.size() == 0)
    return NIXL_ERR_INVALID_PARAM; // Should break loop
```
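A sketch of the break-instead-of-fail behavior, with the serialized input modeled as a simple vector of names (an assumption for illustration):

```cpp
#include <string>
#include <vector>

// Deserialize up to seg_count backend names; seg_count can race ahead of
// what was actually serialized, so an empty name marks the real end.
std::vector<std::string> readBackends(const std::vector<std::string>& serialized,
                                      size_t seg_count) {
    std::vector<std::string> backends;
    for (size_t i = 0; i < seg_count && i < serialized.size(); ++i) {
        const std::string& nixl_backend = serialized[i];
        if (nixl_backend.size() == 0)
            break; // previously: return NIXL_ERR_INVALID_PARAM
        backends.push_back(nixl_backend);
    }
    return backends;
}
```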
5. Stale Cache Entries (v88/v89)
File: src/core/nixl_agent.cpp
Empty or failed metadata loads are cached permanently, preventing any later retry from succeeding.
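A sketch of the suggested caching rule; the cache's actual type and call sites in nixl_agent.cpp are assumptions here:

```cpp
#include <map>
#include <string>

std::map<std::string, std::string> metadataCache; // agent name -> metadata blob (illustrative)

// Cache only successful, non-empty loads so a later attempt can still succeed.
void cacheMetadata(const std::string& agent, const std::string& blob, bool load_ok) {
    if (!load_ok || blob.empty())
        return; // previously: empty/failed results were cached permanently
    metadataCache[agent] = blob;
}
```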
Impact
These issues cause intermittent connection failures that require pod restarts to resolve, especially in multi-worker disaggregated inference setups.
Proposed Fixes
See individual patches in our repository for detailed fixes. General approach (a caller-side sketch follows the list):
- Replace hard errors with warnings + retry
- Don't cache failed/empty metadata
- Keep connections alive on transient failures
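For illustration, a caller-side sketch of the retry approach; `tryLoad` stands in for one metadata load attempt (e.g. a wrapper around the agent's remote-metadata load), and the backoff parameters are arbitrary:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Drive the warn-and-retry approach from the caller side: transient
// failures back off and retry instead of tearing down connections.
bool loadWithRetry(const std::function<bool()>& tryLoad, int max_attempts = 5) {
    auto delay = std::chrono::milliseconds(50);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (tryLoad())
            return true;
        std::this_thread::sleep_for(delay);
        delay *= 2; // exponential backoff between attempts
    }
    return false; // give up only after repeated transient failures
}
```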