Conversation


@sanrise commented Oct 22, 2025

Summary:
This change addresses recent GPU timeout issues affecting the last three GPUs in our 8-worker setups. We traced the root cause to a "Thundering Herd + Timeout" problem in Dynolog's IPCMonitor, and this diff drafts a fix.

Previously, when all eight processes simultaneously sent IPC requests to Dynolog, the single-threaded IPCMonitor would process these requests serially. Each request took approximately 10ms, causing later processes to exceed the original 50ms timeout. For instance, the Dynolog logs showed:

```
20:24:45.391549 - Process 2202 registered
20:24:45.401608 - Process 2201 registered (+10ms)
...
20:24:45.441941 - Process 2204 registered (+10ms)
20:24:45.452018 - Process 2206 registered (+10ms)
20:24:45.462101 - Process 2205 registered (+10ms)
```

Because of this serial processing, the 6th, 7th, and 8th processes (2204, 2206, and 2205 respectively) were not served until well past the 50ms timeout and failed with errors like:

```
ERROR:2025-10-13 20:24:45 2204:2265 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
ERROR:2025-10-13 20:24:45 2206:2266 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
ERROR:2025-10-13 20:24:45 2205:2267 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
```
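
To make the timing concrete, here is a minimal, self-contained sketch (plain C++, not Dynolog code; the 10ms and 50ms figures come from the observations above) of a single-threaded monitor draining eight simultaneous registration requests:

```cpp
// Models the single-threaded IPCMonitor serving one ~10ms request at a time
// and checks which ranks would blow past a 50ms client-side timeout.
#include <cstdio>

int main() {
  constexpr int kNumRanks = 8;          // all 8 workers register at once
  constexpr int kPerRequestMs = 10;     // observed ~10ms per request
  constexpr int kClientTimeoutMs = 50;  // original IPC timeout

  for (int rank = 0; rank < kNumRanks; ++rank) {
    // With a single-threaded monitor, the i-th queued request completes only
    // after all earlier requests have been processed.
    int completionMs = (rank + 1) * kPerRequestMs;
    std::printf("rank %d served at +%dms -> %s\n", rank, completionMs,
                completionMs > kClientTimeoutMs ? "TIMEOUT" : "ok");
  }
  return 0;
}
```

Running this shows the last three ranks being served at +60ms, +70ms, and +80ms, which matches the three failing processes in the logs.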

To resolve this, I've increased the IPC timeout from the original 50ms to 90ms. With roughly 10ms of processing per rank, eight ranks take about 80ms in the worst case, so 90ms covers all of them plus a small buffer, and every process can register successfully even when all eight send requests at once. This should eliminate these timeout errors during GPU initialization and improve the stability and reliability of our GPU-accelerated workloads.
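
For reference, the sizing rule behind the new value can be written as a compile-time check; the constant names here are illustrative, not taken from the Kineto/Dynolog source:

```cpp
// Sizing rule for the new timeout, using the measurements above.
constexpr int kNumRanks = 8;       // workers per host
constexpr int kPerRequestMs = 10;  // observed serial processing cost per rank
constexpr int kIpcTimeoutMs = 90;  // new timeout chosen in this change

// Worst case: the last rank's request is served after all eight are processed.
constexpr int kWorstCaseWaitMs = kNumRanks * kPerRequestMs;  // 80ms

static_assert(kWorstCaseWaitMs < kIpcTimeoutMs,
              "90ms covers the 80ms worst case with ~10ms of headroom");

int main() { return 0; }  // nothing to run: the check happens at compile time
```

If the worker count per host or the per-request processing time grows, the same arithmetic would need to be revisited.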

Differential Revision: D84573484

meta-cla bot added the cla signed label Oct 22, 2025

meta-codesync bot commented Oct 22, 2025

@sanrise has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84573484.


@jj10306 commented Oct 23, 2025

thanks for the fix @sanrise 🙂
