Increase timeouts for daemon registration with Kineto (#1158)

sanrise · facebook-github-bot · commit 8f4ee661a93c · 2025-10-22T15:13:11.000-07:00
Summary:

This addresses the recent GPU timeout issues we've been experiencing, which particularly affect the last three GPUs in our 8-worker setups. The root cause is a "Thundering Herd + Timeout" problem within Dynolog's IPCMonitor.


Previously, when all eight processes simultaneously sent IPC requests to Dynolog, the single-threaded IPCMonitor would process these requests serially. Each request took approximately 10ms, causing later processes to exceed the original 50ms timeout. For instance, the Dynolog logs showed:
```
20:24:45.391549 - Process 2202 registered
20:24:45.401608 - Process 2201 registered (+10ms)
...
20:24:45.441941 - Process 2204 registered (+10ms)
20:24:45.452018 - Process 2206 registered (+10ms)
20:24:45.462101 - Process 2205 registered (+10ms)
```

Because of this serial processing, the 6th, 7th, and 8th processes (2204, 2206, and 2205 respectively) were only served after their 50ms timeout had already expired. As a result, they failed with errors like:

```
ERROR:2025-10-13 20:24:45 2204:2265 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
ERROR:2025-10-13 20:24:45 2206:2266 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
ERROR:2025-10-13 20:24:45 2205:2267 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
```

To resolve this, I've increased the IPC timeout from 50ms to 90ms by raising `maxIpcRetries` from 5 to 9 (with `kSleepUs = 10000`, i.e. 10ms per retry). We observed approximately 10ms of processing time per rank, so 8 ranks need about 80ms; 90ms adds a 10ms buffer so that all processes can register successfully even under simultaneous load. With this change, all GPUs should initialize without hitting these timeout errors.
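
The timeout arithmetic above can be checked with a small sketch. This is an illustrative model, not Kineto or Dynolog code: the helper names (`completionMs`, `timeoutMs`, `ranksRegistered`) are hypothetical, and it assumes that the client's wait budget is `maxIpcRetries * kSleepUs` and that the k-th queued request completes after roughly `k * 10ms`, which matches the 50ms and 90ms figures and the log timestamps.

```cpp
#include <cstdint>

// Hypothetical model (assumption, not Kineto code): a single-threaded monitor
// serves requests one at a time, so the k-th request (1-indexed) completes
// after about k * serviceMs milliseconds.
constexpr int32_t completionMs(int32_t k, int32_t serviceMs) {
  return k * serviceMs;
}

// Assumed client-side wait budget: maxIpcRetries attempts, kSleepUs apart.
constexpr int32_t timeoutMs(int32_t maxIpcRetries, int32_t sleepUs) {
  return maxIpcRetries * sleepUs / 1000;
}

// Number of ranks whose reply arrives within the wait budget.
constexpr int32_t ranksRegistered(int32_t ranks, int32_t serviceMs,
                                  int32_t budgetMs) {
  int32_t ok = 0;
  for (int32_t k = 1; k <= ranks; ++k) {
    if (completionMs(k, serviceMs) <= budgetMs) {
      ++ok;
    }
  }
  return ok;
}

// Old setting: 5 retries * 10ms = 50ms, so only 5 of 8 ranks register.
static_assert(timeoutMs(5, 10000) == 50, "old budget is 50ms");
static_assert(ranksRegistered(8, 10, 50) == 5, "ranks 6-8 time out");

// New setting: 9 retries * 10ms = 90ms, so all 8 ranks register.
static_assert(timeoutMs(9, 10000) == 90, "new budget is 90ms");
static_assert(ranksRegistered(8, 10, 90) == 8, "all ranks register");
```

Under this model the budget scales with the number of ranks per host, so a setup with more than 9 simultaneous registrations would presumably need a larger value.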

Differential Revision: D84573484
diff --git a/libkineto/src/IpcFabricConfigClient.cpp b/libkineto/src/IpcFabricConfigClient.cpp
@@ -95,7 +95,7 @@ IpcFabricConfigClient::IpcFabricConfigClient()
 
 // Connect to the Dynolog service through Fabric name `dynolog`
 constexpr const char* kDynoIpcName = "dynolog";
-constexpr int maxIpcRetries = 5;
+constexpr int maxIpcRetries = 9;
 constexpr int kSleepUs = 10000;
 
 int32_t IpcFabricConfigClient::registerInstance(int32_t gpu) {