Commit 8f4ee66

sanrise authored and facebook-github-bot committed
Increase timeouts for daemon registration with Kineto (#1158)
Summary:
I wanted to share an update regarding recent GPU timeout issues we've been experiencing, particularly affecting the last three GPUs in our 8-worker setups. We've identified the root cause as a "Thundering Herd + Timeout" problem within Dynolog's IPCMonitor, and I'm happy to report that a solution has been drafted.

Previously, when all eight processes simultaneously sent IPC requests to Dynolog, the single-threaded IPCMonitor would process these requests serially. Each request took approximately 10ms, causing later processes to exceed the original 50ms timeout. For instance, the Dynolog logs showed:

```
20:24:45.391549 - Process 2202 registered
20:24:45.401608 - Process 2201 registered (+10ms)
...
20:24:45.441941 - Process 2204 registered (+10ms)
20:24:45.452018 - Process 2206 registered (+10ms)
20:24:45.462101 - Process 2205 registered (+10ms)
```

This serial processing meant that the 6th, 7th, and 8th processes (2204, 2206, and 2205 respectively) were significantly delayed. As a result, they failed with errors like:

```
ERROR:2025-10-13 20:24:45 2204:2265 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
ERROR:2025-10-13 20:24:45 2206:2266 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
ERROR:2025-10-13 20:24:45 2205:2267 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
```

To resolve this, I've increased the IPC timeout to 90ms. This value was chosen because we observed approximately 10ms of processing time per rank, so for 8 ranks, plus a buffer, 90ms provides sufficient time for all processes to register successfully, even under simultaneous load, ensuring that all GPUs can initialize without encountering these timeout errors.

This change should significantly improve the stability and reliability of our GPU-accelerated workloads.

Differential Revision: D84573484
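The timing argument in the summary can be checked with a small model. The sketch below is not Kineto or Dynolog code. It assumes requests are served strictly one at a time at a fixed 10 ms each, and that a client's wait budget equals `maxIpcRetries * kSleepUs`, which is an inference from the constants in this diff rather than something the summary states. The names `timeoutBudgetMs` and `timedOutPositions` are hypothetical.

```cpp
#include <vector>

// Illustrative model only, not Kineto or Dynolog code.
// Assumption: a client gives up after maxRetries * sleepUs microseconds.
constexpr int timeoutBudgetMs(int maxRetries, int sleepUs) {
  return maxRetries * sleepUs / 1000;
}

// Assumption: all numProcs requests arrive at once and the monitor serves
// them one at a time, serviceMs each, so the request at 1-based queue
// position k receives its reply at k * serviceMs.
// Returns the queue positions whose reply arrives after budgetMs.
inline std::vector<int> timedOutPositions(int numProcs, int serviceMs,
                                          int budgetMs) {
  std::vector<int> late;
  for (int k = 1; k <= numProcs; ++k) {
    if (k * serviceMs > budgetMs) {
      late.push_back(k);
    }
  }
  return late;
}

// Budgets implied by the constants before and after this commit.
static_assert(timeoutBudgetMs(5, 10000) == 50, "budget with 5 retries");
static_assert(timeoutBudgetMs(9, 10000) == 90, "budget with 9 retries");
```

Under these assumptions, `timedOutPositions(8, 10, 50)` yields positions 6, 7, and 8, matching the three failing processes in the logs, while `timedOutPositions(8, 10, 90)` is empty. The model also shows the limit of the fix: the budget still scales with the number of simultaneous ranks, so a larger fan-in would need a larger value.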
1 parent 28ce97b commit 8f4ee66

1 file changed: +1 -1 lines changed

libkineto/src/IpcFabricConfigClient.cpp

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@ IpcFabricConfigClient::IpcFabricConfigClient()
 
 // Connect to the Dynolog service through Fabric name `dynolog`
 constexpr const char* kDynoIpcName = "dynolog";
-constexpr int maxIpcRetries = 5;
+constexpr int maxIpcRetries = 9;
 constexpr int kSleepUs = 10000;
 
 int32_t IpcFabricConfigClient::registerInstance(int32_t gpu) {
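For readers unfamiliar with how a retry count and a sleep interval combine into a timeout, here is a minimal sketch of a polling receive loop. It is a hypothetical helper written for illustration, not the Kineto implementation: `pollRecv` and the `tryRecv` callback are assumed names, and the real client may structure its wait differently.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not the Kineto implementation.
// Calls tryRecv up to maxRetries times, sleeping sleepUs microseconds after
// each failed attempt. Returns true as soon as tryRecv reports a message.
inline bool pollRecv(const std::function<bool()>& tryRecv, int maxRetries,
                     int sleepUs) {
  for (int attempt = 0; attempt < maxRetries; ++attempt) {
    if (tryRecv()) {
      return true;
    }
    std::this_thread::sleep_for(std::chrono::microseconds(sleepUs));
  }
  // The wait budget of roughly maxRetries * sleepUs is exhausted.
  return false;
}
```

With a loop of this shape, raising `maxIpcRetries` from 5 to 9 at `kSleepUs = 10000` lengthens the total wait from about 50 ms to about 90 ms without changing how often the client polls.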
