Description
The Feature We Request.
Currently, NIXL appears to have a strong runtime dependency on Etcd for service discovery and metadata exchange. In a distributed inference environment, the "Control Plane" (Etcd) availability should not dictate the stability of the "Data Plane" (NIXL worker communication).
If the Etcd cluster experiences downtime, network partitioning, or jitter, NIXL clients risk failing to resolve peer information, potentially causing the entire inference job to crash. Ideally, existing workers should be able to continue communicating using the last known topology, even if the central registry is temporarily unreachable.
The Solution We Want.
I propose implementing a client-side caching mechanism—similar to the Kubernetes Informer / Reflector pattern found in client-go—to decouple NIXL's runtime availability from Etcd's immediate reachability.
The proposed architecture consists of four key components; rough C++ sketches of each follow the list below.
- Thread-Safe Local Store
  - Maintain a local in-memory map (e.g., an `std::unordered_map` protected by a `shared_mutex`) containing node metadata such as IP, Port, Label, and Rank.
  - All NIXL lookup operations (e.g., getting a peer's address) must read directly from this local cache without blocking on network I/O.
- List-Watch Mechanism (Background Sync)
  - Implement a background thread that acts as a `Reflector`.
  - List: On startup, perform a range request to Etcd to fetch the full list of nodes and populate the local store.
  - Watch: Establish a long-polling `Watch` connection to receive incremental updates (`PUT` and `DELETE` events) and update the store in real time.
- Weak Dependency & Degraded Mode
  - This is the most critical change. If the connection to Etcd is lost (e.g., the Watch stream breaks or times out):
    - The client MUST NOT purge the local cache.
    - NIXL should enter a "Degraded Mode," continuing to serve existing (stale) data from the cache.
    - The background thread should attempt to reconnect using an exponential backoff strategy.
- Lazy Failure Detection
  - Since the cache might contain stale data during an Etcd outage (e.g., a node dies but Etcd cannot notify the client), NIXL should rely on transport-layer errors.
  - If a TCP/RDMA connection fails when attempting to send data to a cached peer, the client should lazily mark that specific node as unreachable in the local store, rather than relying solely on Etcd leases.
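
To make the first component concrete, here is a minimal sketch of what such a store could look like. All names here (`NodeInfo`, `LocalNodeStore`, the exact field set) are illustrative assumptions rather than existing NIXL types; the point is only that lookups take a shared lock and never touch the network.

```cpp
// Illustrative only: names and field layout are assumptions, not existing NIXL types.
#include <cstdint>
#include <optional>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Cached metadata for one peer, mirroring what the Etcd registry holds.
struct NodeInfo {
    std::string   ip;
    std::uint16_t port = 0;
    std::string   label;
    int           rank = -1;
    bool          reachable = true;   // flipped later by lazy failure detection
};

class LocalNodeStore {
public:
    // Lookups never touch the network; they only take a shared (read) lock.
    std::optional<NodeInfo> lookup(const std::string& nodeId) const {
        std::shared_lock lock(mutex_);
        auto it = nodes_.find(nodeId);
        if (it == nodes_.end()) return std::nullopt;
        return it->second;
    }

    // Applied by the background reflector for List results and PUT events.
    void upsert(const std::string& nodeId, NodeInfo info) {
        std::unique_lock lock(mutex_);
        nodes_[nodeId] = std::move(info);
    }

    // Applied by the background reflector for DELETE events.
    void erase(const std::string& nodeId) {
        std::unique_lock lock(mutex_);
        nodes_.erase(nodeId);
    }

    // Called from the data path when a transport-level send fails.
    void markUnreachable(const std::string& nodeId) {
        std::unique_lock lock(mutex_);
        auto it = nodes_.find(nodeId);
        if (it != nodes_.end()) it->second.reachable = false;
    }

private:
    mutable std::shared_mutex                 mutex_;
    std::unordered_map<std::string, NodeInfo> nodes_;
};
```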
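For the List-Watch component, a single sync cycle could look roughly like the sketch below. `EtcdClientIface`, `EtcdEvent`, `nodeIdFromKey`, and `parseNodeInfo` are hypothetical placeholders for whatever Etcd client and key/value encoding NIXL already uses, not real APIs; the sketch reuses `LocalNodeStore` and `NodeInfo` from the previous snippet.

```cpp
// Hypothetical thin wrapper over whichever Etcd client NIXL already links against;
// only the two calls the reflector needs are modelled, and both throw on failure.
#include <atomic>
#include <string>
#include <utility>
#include <vector>

struct EtcdEvent {
    enum class Type { Put, Delete } type;
    std::string key;     // key under the node prefix
    std::string value;   // serialized NodeInfo for Put events
};

class EtcdClientIface {
public:
    virtual ~EtcdClientIface() = default;
    // List: range request over the node prefix.
    virtual std::vector<std::pair<std::string, std::string>>
    listPrefix(const std::string& prefix) = 0;
    // Watch: block until the next batch of events arrives on the prefix.
    virtual std::vector<EtcdEvent> waitForEvents(const std::string& prefix) = 0;
};

// Hypothetical helpers: extract the node id from an Etcd key, decode a value.
std::string nodeIdFromKey(const std::string& key);
NodeInfo    parseNodeInfo(const std::string& value);

// One List-Watch cycle: populate the store, then apply incremental updates
// until the watch stream breaks (an exception) or shutdown is requested.
void runListWatchOnce(EtcdClientIface& etcd, LocalNodeStore& store,
                      const std::string& prefix, const std::atomic<bool>& stop) {
    // List: full snapshot on (re)connect.
    for (auto& [key, value] : etcd.listPrefix(prefix))
        store.upsert(nodeIdFromKey(key), parseNodeInfo(value));

    // Watch: apply PUT / DELETE events as they arrive.
    while (!stop.load()) {
        for (auto& ev : etcd.waitForEvents(prefix)) {
            if (ev.type == EtcdEvent::Type::Put)
                store.upsert(nodeIdFromKey(ev.key), parseNodeInfo(ev.value));
            else
                store.erase(nodeIdFromKey(ev.key));
        }
    }
}
```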
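The degraded-mode behavior is then just the outer loop around that cycle: on any failure the cache is deliberately left untouched and the thread retries with exponential backoff. Again a hedged sketch reusing `runListWatchOnce` from the previous snippet; the 1 s to 60 s backoff bounds are arbitrary.

```cpp
// Background thread body: the cache is never purged on failure, and the thread
// reconnects with exponential backoff while NIXL keeps serving cached topology.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <exception>
#include <string>
#include <thread>

void reflectorLoop(EtcdClientIface& etcd, LocalNodeStore& store,
                   const std::string& prefix, const std::atomic<bool>& stop) {
    using namespace std::chrono;
    auto backoff = seconds(1);
    const auto maxBackoff = seconds(60);

    while (!stop.load()) {
        try {
            runListWatchOnce(etcd, store, prefix, stop);  // from the previous sketch
            backoff = seconds(1);                         // stream ended cleanly: reset
        } catch (const std::exception&) {
            // Degraded Mode: leave the local cache untouched so lookups keep
            // serving the last known topology while Etcd is unreachable.
            std::this_thread::sleep_for(backoff);
            backoff = std::min(maxBackoff, backoff * 2);
        }
    }
}
```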
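Finally, lazy failure detection only requires the data path to report transport errors back into the store. `Transport` and its `send()` are stand-ins for NIXL's real TCP/RDMA backends, not actual NIXL interfaces; the important part is the `markUnreachable()` call on failure, so the caller can fall back to another peer without waiting for an Etcd lease to expire.

```cpp
// Data-path wrapper: a transport failure toward a cached peer marks that peer
// unreachable locally instead of relying on an Etcd lease expiry.
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for NIXL's real TCP/RDMA backend; assumed to return false on error.
struct Transport {
    bool send(const std::string& ip, std::uint16_t port,
              const void* buf, std::size_t len);
};

bool sendToPeer(LocalNodeStore& store, Transport& transport,
                const std::string& nodeId, const void* buf, std::size_t len) {
    auto peer = store.lookup(nodeId);
    if (!peer || !peer->reachable)
        return false;                      // caller can pick another peer or retry later

    if (!transport.send(peer->ip, peer->port, buf, len)) {
        store.markUnreachable(nodeId);     // lazy, transport-driven failure detection
        return false;
    }
    return true;
}
```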
Describe alternatives you've considered
- Short-lived Caching: Caching lookups for a few seconds. This reduces Etcd load but does not solve the issue of prolonged Etcd outages.
- Static Configuration: Using hostfiles. This lacks the flexibility required for dynamic scaling and elastic inference.
Additional context
This enhancement would significantly improve the robustness of distributed systems built on NIXL, ensuring that transient control plane failures do not disrupt active high-performance computing tasks.