Description
The Feature We Request.
Currently, NIXL appears to have a strong runtime dependency on Etcd for service discovery and metadata exchange. In a distributed inference environment, the "Control Plane" (Etcd) availability should not dictate the stability of the "Data Plane" (NIXL worker communication).
If the Etcd cluster experiences downtime, network partitioning, or jitter, NIXL clients risk failing to resolve peer information, potentially causing the entire inference job to crash. Ideally, existing workers should be able to continue communicating using the last known topology, even if the central registry is temporarily unreachable.
The Solution We Want.
I propose implementing a client-side caching mechanism—similar to the Kubernetes Informer / Reflector pattern found in client-go—to decouple NIXL's runtime availability from Etcd's immediate reachability.
The proposed architecture consists of four key components; rough C++ sketches of each follow the list below.
- Thread-Safe Local Store
  - Maintain a local in-memory map (e.g., an `std::unordered_map` protected by a `shared_mutex`) containing node metadata such as IP, Port, Label, and Rank.
  - All NIXL lookup operations (e.g., getting a peer's address) must read directly from this local cache without blocking on network I/O.
- List-Watch Mechanism (Background Sync)
  - Implement a background thread that acts as a `Reflector`.
  - List: On startup, perform a range request to Etcd to fetch the full list of nodes and populate the local store.
  - Watch: Establish a long-polling `Watch` connection to receive incremental updates (`PUT` and `DELETE` events) and update the store in real time.
- Weak Dependency & Degraded Mode
  - This is the most critical change. If the connection to Etcd is lost (e.g., the Watch stream breaks or times out):
    - The client MUST NOT purge the local cache.
    - NIXL should enter a "Degraded Mode," continuing to serve existing (stale) data from the cache.
    - The background thread should attempt to reconnect using an exponential backoff strategy.
- Lazy Failure Detection
  - Since the cache might contain stale data during an Etcd outage (e.g., a node dies but Etcd cannot notify the client), NIXL should rely on transport-layer errors.
  - If a TCP/RDMA connection fails when attempting to send data to a cached peer, the client should lazily mark that specific node as unreachable in the local store, rather than relying solely on Etcd leases.
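
To make the first component concrete, here is a minimal sketch of what such a store could look like. All names here (`NodeInfo`, `LocalNodeStore`, the exact field set) are illustrative assumptions rather than existing NIXL types; the point is only that lookups take a shared lock and never touch the network.

```cpp
// Illustrative only: names and field layout are assumptions, not existing NIXL types.
#include <cstdint>
#include <optional>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Cached metadata for one peer, mirroring what the Etcd registry holds.
struct NodeInfo {
    std::string   ip;
    std::uint16_t port = 0;
    std::string   label;
    int           rank = -1;
    bool          reachable = true;   // flipped later by lazy failure detection
};

class LocalNodeStore {
public:
    // Lookups never touch the network; they only take a shared (read) lock.
    std::optional<NodeInfo> lookup(const std::string& nodeId) const {
        std::shared_lock lock(mutex_);
        auto it = nodes_.find(nodeId);
        if (it == nodes_.end()) return std::nullopt;
        return it->second;
    }

    // Applied by the background reflector for List results and PUT events.
    void upsert(const std::string& nodeId, NodeInfo info) {
        std::unique_lock lock(mutex_);
        nodes_[nodeId] = std::move(info);
    }

    // Applied by the background reflector for DELETE events.
    void erase(const std::string& nodeId) {
        std::unique_lock lock(mutex_);
        nodes_.erase(nodeId);
    }

    // Called from the data path when a transport-level send fails.
    void markUnreachable(const std::string& nodeId) {
        std::unique_lock lock(mutex_);
        auto it = nodes_.find(nodeId);
        if (it != nodes_.end()) it->second.reachable = false;
    }

private:
    mutable std::shared_mutex                 mutex_;
    std::unordered_map<std::string, NodeInfo> nodes_;
};
```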
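For the List-Watch component, a single sync cycle could look roughly like the sketch below. `EtcdClientIface`, `EtcdEvent`, `nodeIdFromKey`, and `parseNodeInfo` are hypothetical placeholders for whatever Etcd client and key/value encoding NIXL already uses, not real APIs; the sketch reuses `LocalNodeStore` and `NodeInfo` from the previous snippet.

```cpp
// Hypothetical thin wrapper over whichever Etcd client NIXL already links against;
// only the two calls the reflector needs are modelled, and both throw on failure.
#include <atomic>
#include <string>
#include <utility>
#include <vector>

struct EtcdEvent {
    enum class Type { Put, Delete } type;
    std::string key;     // key under the node prefix
    std::string value;   // serialized NodeInfo for Put events
};

class EtcdClientIface {
public:
    virtual ~EtcdClientIface() = default;
    // List: range request over the node prefix.
    virtual std::vector<std::pair<std::string, std::string>>
    listPrefix(const std::string& prefix) = 0;
    // Watch: block until the next batch of events arrives on the prefix.
    virtual std::vector<EtcdEvent> waitForEvents(const std::string& prefix) = 0;
};

// Hypothetical helpers: extract the node id from an Etcd key, decode a value.
std::string nodeIdFromKey(const std::string& key);
NodeInfo    parseNodeInfo(const std::string& value);

// One List-Watch cycle: populate the store, then apply incremental updates
// until the watch stream breaks (an exception) or shutdown is requested.
void runListWatchOnce(EtcdClientIface& etcd, LocalNodeStore& store,
                      const std::string& prefix, const std::atomic<bool>& stop) {
    // List: full snapshot on (re)connect.
    for (auto& [key, value] : etcd.listPrefix(prefix))
        store.upsert(nodeIdFromKey(key), parseNodeInfo(value));

    // Watch: apply PUT / DELETE events as they arrive.
    while (!stop.load()) {
        for (auto& ev : etcd.waitForEvents(prefix)) {
            if (ev.type == EtcdEvent::Type::Put)
                store.upsert(nodeIdFromKey(ev.key), parseNodeInfo(ev.value));
            else
                store.erase(nodeIdFromKey(ev.key));
        }
    }
}
```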
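The degraded-mode behavior is then just the outer loop around that cycle: on any failure the cache is deliberately left untouched and the thread retries with exponential backoff. Again a hedged sketch reusing `runListWatchOnce` from the previous snippet; the 1 s to 60 s backoff bounds are arbitrary.

```cpp
// Background thread body: the cache is never purged on failure, and the thread
// reconnects with exponential backoff while NIXL keeps serving cached topology.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <exception>
#include <string>
#include <thread>

void reflectorLoop(EtcdClientIface& etcd, LocalNodeStore& store,
                   const std::string& prefix, const std::atomic<bool>& stop) {
    using namespace std::chrono;
    auto backoff = seconds(1);
    const auto maxBackoff = seconds(60);

    while (!stop.load()) {
        try {
            runListWatchOnce(etcd, store, prefix, stop);  // from the previous sketch
            backoff = seconds(1);                         // stream ended cleanly: reset
        } catch (const std::exception&) {
            // Degraded Mode: leave the local cache untouched so lookups keep
            // serving the last known topology while Etcd is unreachable.
            std::this_thread::sleep_for(backoff);
            backoff = std::min(maxBackoff, backoff * 2);
        }
    }
}
```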
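Finally, lazy failure detection only requires the data path to report transport errors back into the store. `Transport` and its `send()` are stand-ins for NIXL's real TCP/RDMA backends, not actual NIXL interfaces; the important part is the `markUnreachable()` call on failure, so the caller can fall back to another peer without waiting for an Etcd lease to expire.

```cpp
// Data-path wrapper: a transport failure toward a cached peer marks that peer
// unreachable locally instead of relying on an Etcd lease expiry.
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for NIXL's real TCP/RDMA backend; assumed to return false on error.
struct Transport {
    bool send(const std::string& ip, std::uint16_t port,
              const void* buf, std::size_t len);
};

bool sendToPeer(LocalNodeStore& store, Transport& transport,
                const std::string& nodeId, const void* buf, std::size_t len) {
    auto peer = store.lookup(nodeId);
    if (!peer || !peer->reachable)
        return false;                      // caller can pick another peer or retry later

    if (!transport.send(peer->ip, peer->port, buf, len)) {
        store.markUnreachable(nodeId);     // lazy, transport-driven failure detection
        return false;
    }
    return true;
}
```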
Describe alternatives you've considered
- Short-lived Caching: Caching lookups for a few seconds. This reduces Etcd load but does not solve the issue of prolonged Etcd outages.
- Static Configuration: Using hostfiles. This lacks the flexibility required for dynamic scaling and elastic inference.
Additional context
This enhancement would significantly improve the robustness of distributed systems built on NIXL, ensuring that transient control plane failures do not disrupt active high-performance computing tasks.