Conversation


@sanrise commented Oct 22, 2025

Summary:
This change addresses recent GPU timeout issues affecting the last three GPUs in our 8-worker setups. We traced the root cause to a "Thundering Herd + Timeout" problem in Dynolog's IPCMonitor, and this diff drafts a fix.

Previously, when all eight processes simultaneously sent IPC requests to Dynolog, the single-threaded IPCMonitor would process these requests serially. Each request took approximately 10ms, causing later processes to exceed the original 50ms timeout. For instance, the Dynolog logs showed:

```
20:24:45.391549 - Process 2202 registered
20:24:45.401608 - Process 2201 registered (+10ms)
...
20:24:45.441941 - Process 2204 registered (+10ms)
20:24:45.452018 - Process 2206 registered (+10ms)
20:24:45.462101 - Process 2205 registered (+10ms)
```

Because of this serial processing, the 6th, 7th, and 8th processes (2204, 2206, and 2205 respectively) were not served until well past the 50ms timeout and failed with errors like:

```
ERROR:2025-10-13 20:24:45 2204:2265 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
ERROR:2025-10-13 20:24:45 2206:2266 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
ERROR:2025-10-13 20:24:45 2205:2267 IpcFabricConfigClient.cpp:188] Failed to receive ondemand config type=3 from dyno: IPC recv fail
```
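
To make the timing concrete, here is a minimal, self-contained sketch (plain C++, not Dynolog code; the 10ms and 50ms figures come from the observations above) of a single-threaded monitor draining eight simultaneous registration requests:

```cpp
// Models the single-threaded IPCMonitor serving one ~10ms request at a time
// and checks which ranks would blow past a 50ms client-side timeout.
#include <cstdio>

int main() {
  constexpr int kNumRanks = 8;          // all 8 workers register at once
  constexpr int kPerRequestMs = 10;     // observed ~10ms per request
  constexpr int kClientTimeoutMs = 50;  // original IPC timeout

  for (int rank = 0; rank < kNumRanks; ++rank) {
    // With a single-threaded monitor, the i-th queued request completes only
    // after all earlier requests have been processed.
    int completionMs = (rank + 1) * kPerRequestMs;
    std::printf("rank %d served at +%dms -> %s\n", rank, completionMs,
                completionMs > kClientTimeoutMs ? "TIMEOUT" : "ok");
  }
  return 0;
}
```

Running this shows the last three ranks being served at +60ms, +70ms, and +80ms, which matches the three failing processes in the logs.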

To resolve this, I've increased the IPC timeout from the original 50ms to 90ms. With roughly 10ms of processing per rank, eight ranks take about 80ms in the worst case, so 90ms covers all of them plus a small buffer, and every process can register successfully even when all eight send requests at once. This should eliminate these timeout errors during GPU initialization and improve the stability and reliability of our GPU-accelerated workloads.
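
For reference, the sizing rule behind the new value can be written as a compile-time check; the constant names here are illustrative, not taken from the Kineto/Dynolog source:

```cpp
// Sizing rule for the new timeout, using the measurements above.
constexpr int kNumRanks = 8;       // workers per host
constexpr int kPerRequestMs = 10;  // observed serial processing cost per rank
constexpr int kIpcTimeoutMs = 90;  // new timeout chosen in this change

// Worst case: the last rank's request is served after all eight are processed.
constexpr int kWorstCaseWaitMs = kNumRanks * kPerRequestMs;  // 80ms

static_assert(kWorstCaseWaitMs < kIpcTimeoutMs,
              "90ms covers the 80ms worst case with ~10ms of headroom");

int main() { return 0; }  // nothing to run: the check happens at compile time
```

If the worker count per host or the per-request processing time grows, the same arithmetic would need to be revisited.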

Differential Revision: D84573484

meta-cla bot added the cla signed label Oct 22, 2025

meta-codesync bot commented Oct 22, 2025

@sanrise has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84573484.


@jj10306 commented Oct 23, 2025

thanks for the fix @sanrise 🙂
