@wagner-austin
Contributor

Summary

During distributed training shutdown (especially with Dask), workers may close connections at different times. When one worker tries to send or receive on a socket that its peer has already closed, it gets an EPIPE (errno 32) or ECONNRESET (errno 54 on macOS, 104 on Linux) error. Previously, these errors caused fatal crashes via Log::Fatal.

This PR makes these expected shutdown errors non-fatal by returning SOCKET_ERROR instead, allowing callers to handle cleanup gracefully.

Changes

  • Added IsConnectionClosedError() helper function to identify connection-closed error codes (EPIPE, ECONNRESET, ENOTCONN, ESHUTDOWN on POSIX; WSAECONNRESET, WSAECONNABORTED, WSAESHUTDOWN, WSAENOTCONN on Windows)
  • Modified TcpSocket::Send() and TcpSocket::Recv() to return SOCKET_ERROR for these errors instead of calling Log::Fatal
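
For illustration, the classification described in the bullets above could look roughly like the sketch below. This is not the code from the PR: the helper body is reconstructed only from the error codes listed here, and `SendOutcome` and `ClassifySendResult` are hypothetical names introduced to show how a non-fatal path might be separated from a fatal one.

```cpp
#include <cerrno>

#if defined(_WIN32)
#include <winsock2.h>
#endif

// Sketch of the helper named in this PR; the actual implementation may differ.
// Returns true for error codes that mean the peer has already closed the
// connection, using the codes listed in the PR description.
inline bool IsConnectionClosedError(int err) {
#if defined(_WIN32)
  return err == WSAECONNRESET || err == WSAECONNABORTED ||
         err == WSAESHUTDOWN || err == WSAENOTCONN;
#else
  bool closed = (err == EPIPE || err == ECONNRESET || err == ENOTCONN);
#ifdef ESHUTDOWN  // not defined on every POSIX platform
  closed = closed || (err == ESHUTDOWN);
#endif
  return closed;
#endif
}

// Hypothetical outcome type, used here only for illustration.
enum class SendOutcome { kOk, kPeerClosed, kFatal };

// Hypothetical classifier: `ret` is the return value of send()/recv() and
// `err` is the error code captured right after the call. A closed peer is
// reported as kPeerClosed (the caller would return SOCKET_ERROR); any other
// failure stays on the fatal path.
inline SendOutcome ClassifySendResult(long ret, int err) {
  if (ret >= 0) return SendOutcome::kOk;
  return IsConnectionClosedError(err) ? SendOutcome::kPeerClosed
                                      : SendOutcome::kFatal;
}
```

Under this sketch, `TcpSocket::Send()` and `TcpSocket::Recv()` would return SOCKET_ERROR in the `kPeerClosed` case and call `Log::Fatal` only in the `kFatal` case.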

Related Issue

Fixes #4074

Test Plan

  • Pre-commit passes
  • Windows build compiles (MSVC)
  • Linux build compiles (GCC in WSL)
  • C++ unit tests pass (31/31)
  • Reproduction test confirms fix works (socket send on closed connection no longer crashes)
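
The reproduction test itself is not shown in this description. A minimal Linux-only sketch of the underlying condition, assuming a Unix-domain `socketpair` stands in for the TCP connection between two workers, might look like this. `SendErrnoAfterPeerClose` is a hypothetical name, and `MSG_NOSIGNAL` is used so that the failed send reports an error code instead of raising SIGPIPE.

```cpp
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical reproduction helper (Linux-only sketch, not the PR's test).
// Creates a connected socket pair, closes one end to simulate a worker that
// shut down first, then sends on the other end.
// Returns the errno seen by send(), 0 if send() unexpectedly succeeded,
// or -1 if the socket pair could not be created.
inline int SendErrnoAfterPeerClose() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;

  close(fds[1]);  // the "peer" closes its end before we send

  const char byte = 'x';
  // MSG_NOSIGNAL suppresses SIGPIPE so the error is reported via errno.
  ssize_t n = send(fds[0], &byte, 1, MSG_NOSIGNAL);
  int err = (n < 0) ? errno : 0;

  close(fds[0]);
  return err;
}
```

On Linux the returned value should be EPIPE, which is one of the codes this PR treats as a non-fatal connection-closed error.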

During distributed training shutdown, workers may close connections at
different times, causing EPIPE or ECONNRESET errors. Previously these
caused fatal crashes via Log::Fatal. Now these expected errors are
handled gracefully by returning SOCKET_ERROR to let callers clean up.

See microsoft#4074
@wagner-austin
Contributor Author

Closing - submitted prematurely. Will re-submit after verifying CI on fork.

Development

Successfully merging this pull request may close these issues.

Dask tests randomly fail with socket error code 54 or 104
