sidecar: fix thread-safety bug in UdpWriter reconnect #1219
The reconnect logic added in Netflix#1180 introduced a race condition when multiple threads concurrently trigger reconnection. The connect() method performs two non-atomic operations without synchronization, causing "Connect already invoked" exceptions.

Add synchronization to connect() and use double-checked locking in writeImpl() to prevent concurrent reconnection attempts. Add test case for concurrent reconnection to verify the fix.

Fixes Netflix#1218

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
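The locking scheme this commit message describes can be sketched roughly as follows. This is an illustration, not the actual UdpWriter code: the class name `ReconnectingUdpWriter`, the `reconnects` counter, and the `closeChannelForTest()` helper are hypothetical and exist only to make the pattern observable.

```java
import java.io.IOException;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.DatagramChannel;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for UdpWriter; shows the locking pattern only.
final class ReconnectingUdpWriter implements AutoCloseable {
  private final Object lock = new Object();
  private final SocketAddress address;
  private volatile DatagramChannel channel;
  // Hypothetical counter so callers can see how many reconnects happened.
  final AtomicInteger reconnects = new AtomicInteger();

  ReconnectingUdpWriter(SocketAddress address) throws IOException {
    this.address = address;
    connect();
  }

  // Open and connect under the lock so the two steps cannot interleave.
  private void connect() throws IOException {
    synchronized (lock) {
      DatagramChannel newChannel = DatagramChannel.open();
      try {
        newChannel.connect(address);
      } catch (Exception e) {
        try {
          newChannel.close();
        } catch (IOException closeFailure) {
          e.addSuppressed(closeFailure);
        }
        throw e;
      }
      channel = newChannel; // publish only a fully connected channel
    }
  }

  void write(ByteBuffer buffer) throws IOException {
    DatagramChannel observed = channel;
    try {
      observed.write(buffer);
    } catch (ClosedChannelException e) {
      synchronized (lock) {
        // Re-check under the lock: reconnect only if no other thread
        // has already replaced the channel this thread saw fail.
        if (channel == observed) {
          connect();
          reconnects.incrementAndGet();
        }
      }
      buffer.rewind();
      channel.write(buffer); // retry once on the current channel
    }
  }

  // Hypothetical helper that simulates the channel being closed underneath the writer.
  void closeChannelForTest() throws IOException {
    channel.close();
  }

  @Override
  public void close() throws IOException {
    synchronized (lock) {
      channel.close();
    }
  }
}
```

Comparing the field against the channel that failed means that, of several threads that saw the same closed channel, only the first to take the lock reconnects. The others fall through to the retry.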
Address additional bugs in the reconnect logic beyond the initial thread-safety fix:

1. **Resource leak**: If DatagramChannel.open() succeeds but connect() fails, the opened channel is never closed, causing file descriptor exhaustion.
2. **State corruption**: If connect() fails after opening the channel, this.channel points to an open but unconnected channel. Subsequent writes throw NotYetConnectedException (not caught by the ClosedChannelException handler), permanently breaking the writer.
3. **Data loss**: Original code dropped data after reconnection even if reconnection succeeded.

**Changes:**
- Use local variable in connect() and only assign on success
- Close channel in connect() if connection fails (prevents leak)
- Retry write after successful reconnection (prevents data loss)
- Handle case where another thread already reconnected
- Update test to verify data delivery after reconnection

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
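The state-corruption case can be reproduced in isolation. The sketch below uses a hypothetical class, `UnconnectedWriteDemo`, to write to a channel that was opened but never connected. It returns whatever exception the write raises, so a caller can see which handler would have caught it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.DatagramChannel;
import java.nio.channels.NotYetConnectedException;

// Hypothetical demo class; not part of the project.
final class UnconnectedWriteDemo {
  /** Returns the exception raised by writing to an open but unconnected channel. */
  static Exception writeWithoutConnect() throws IOException {
    try (DatagramChannel ch = DatagramChannel.open()) {
      try {
        ch.write(ByteBuffer.wrap(new byte[] {1}));
        return null; // not expected: write requires a connected channel
      } catch (ClosedChannelException e) {
        return e; // the case a ClosedChannelException handler recovers from
      } catch (NotYetConnectedException e) {
        return e; // unchecked, so it bypasses a ClosedChannelException handler
      }
    }
  }
}
```

Because NotYetConnectedException is an IllegalStateException rather than an IOException, a handler that catches only ClosedChannelException never gets a chance to reconnect.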
jzhuge
left a comment
⚠️ Resource Leak Reintroduced in connect()
The change from catch (Throwable t) to catch (IOException e) reintroduces a resource leak for non-IOException errors.
Problem
DatagramChannel.connect() can throw exceptions that are not IOException:
| Exception | When | Extends IOException? |
|---|---|---|
| SecurityException | Security manager denies permission | ❌ No (RuntimeException) |
| UnsupportedOperationException | Address type not supported | ❌ No (RuntimeException) |
| IllegalArgumentException | Invalid address | ❌ No (RuntimeException) |
| ClosedByInterruptException | Thread interrupted | ✅ Yes |
With catch (IOException e), if connect() throws SecurityException or UnsupportedOperationException, the catch block is bypassed, the newChannel is never closed, and we have a resource leak.
Suggested Fix
Restore catch (Throwable t) to ensure the channel is always closed on any failure:
    private void connect() throws IOException {
      DatagramChannel newChannel = DatagramChannel.open();
      try {
        newChannel.connect(address);
        channel = newChannel;
      } catch (Throwable t) { // Must catch Throwable to prevent resource leaks
        try {
          newChannel.close();
        } catch (IOException ignored) {
          // Suppress close exception during error handling
        }
        throw t;
      }
    }

If PMD complains about catching Throwable, we can suppress it with a comment:

    } catch (Throwable t) { // NOPMD - must catch all exceptions to prevent resource leak

Or handle the re-throw more explicitly:

    } catch (Throwable t) {
      try {
        newChannel.close();
      } catch (IOException ignored) {
      }
      if (t instanceof IOException) {
        throw (IOException) t;
      } else if (t instanceof RuntimeException) {
        throw (RuntimeException) t;
      } else if (t instanceof Error) {
        throw (Error) t;
      } else {
        // Should never happen, but satisfy compiler
        throw new IOException(t);
      }
    }

Reference
DatagramChannel.connect() javadoc lists all possible exceptions.
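To see a non-IOException come out of connect() without a security manager or network access, one option is to pass an unresolved address. The class and method names below are hypothetical, and `example.invalid` is a placeholder host that is never looked up.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.DatagramChannel;

// Hypothetical demo class; not part of the project.
final class ConnectFailureDemo {
  /** Attempts to connect to an unresolved address and returns what was thrown. */
  static Throwable connectToUnresolved() throws IOException {
    DatagramChannel ch = DatagramChannel.open();
    try {
      // createUnresolved skips the DNS lookup, so no network access is needed.
      ch.connect(InetSocketAddress.createUnresolved("example.invalid", 1234));
      return null;
    } catch (IOException e) {
      return e; // all that a catch (IOException e) block would see
    } catch (RuntimeException e) {
      return e; // UnresolvedAddressException arrives here
    } finally {
      ch.close();
    }
  }
}
```

UnresolvedAddressException extends IllegalArgumentException, so it would skip a handler that catches only IOException.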
Excellent analysis! Gemini 3 Pro identified a critical bug I completely missed.

The "Zombie Resurrection" Bug
You're absolutely correct. After
This violates the contract that

The NPE Risk
And you're right that simply setting

    } else {
      // Another thread reconnected, retry the write once with new channel
      try {
        buffer.rewind();
        channel.write(buffer); // NPE if channel is null

The Solution: Explicit
Follow up if necessary as
Change exception handling from catching IOException to Exception to prevent resource leaks when connect() throws RuntimeException.

DatagramChannel.connect() can throw:
- IOException and subclasses (ClosedChannelException, etc.)
- SecurityException (RuntimeException)
- UnsupportedOperationException (RuntimeException)
- IllegalArgumentException (RuntimeException)

Previously only IOException was caught, causing resource leaks when RuntimeException occurred. Now catch Exception to handle both IOException and RuntimeException while allowing Errors (OOM, etc.) to propagate for JVM health.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
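One way to check that catching Exception closes the channel on an unchecked failure is sketched below. `SafeConnect`, `openConnected`, and the `lastOpened` field are hypothetical names. The static field exists only so a caller can inspect the channel after the failure, and would not belong in production code.

```java
import java.io.IOException;
import java.net.SocketAddress;
import java.nio.channels.DatagramChannel;

// Hypothetical helper; a sketch of the catch (Exception e) variant described above.
final class SafeConnect {
  // Inspection hook for this example only.
  static DatagramChannel lastOpened;

  static DatagramChannel openConnected(SocketAddress address) throws IOException {
    DatagramChannel newChannel = DatagramChannel.open();
    lastOpened = newChannel;
    try {
      newChannel.connect(address);
      return newChannel;
    } catch (Exception e) {
      // Covers IOException and RuntimeException; Errors propagate untouched.
      try {
        newChannel.close();
      } catch (IOException closeFailure) {
        e.addSuppressed(closeFailure);
      }
      throw e; // precise rethrow: still only IOException or unchecked
    }
  }
}
```

Calling `SafeConnect.openConnected(InetSocketAddress.createUnresolved("example.invalid", 1234))` should fail with an unchecked exception, after which `SafeConnect.lastOpened.isOpen()` is expected to be false.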
Summary
Fix multiple bugs in UdpWriter reconnect logic that cause thread-safety issues, resource leaks, state corruption, and data loss under high concurrency.

Problems Fixed
1. Thread-Safety Bug (Original Issue)
The reconnect logic added in #1180 (commit 75ee7aea) introduced a race condition. When multiple threads concurrently trigger reconnection after a ClosedChannelException, the connect() method's two non-atomic operations can interleave:
- channel = DatagramChannel.open() (creates channel X)
- channel = DatagramChannel.open() (creates channel Y, overwrites field)
- channel.connect(address) (connects channel Y)
- channel.connect(address) on Y → FAILS: "Connect already invoked"

2. Resource Leak
If DatagramChannel.open() succeeds but channel.connect(address) fails (e.g., IOException, SecurityException, UnresolvedAddressException), the opened channel is never closed, causing file descriptor exhaustion over time.

3. State Corruption ("Death Spiral")
If connect() fails after opening the channel, this.channel points to an open but unconnected channel. When writeImpl() tries to write to it:
- Throws NotYetConnectedException (a RuntimeException, not ClosedChannelException)

4. Data Loss
The original code dropped data after reconnection even if reconnection succeeded, as it would throw e after calling connect().

Impact
High-throughput Spark jobs publishing metrics via Spectator fail under high concurrency.
Changes
Commit 1: Thread-Safety Fix
- Add lock object for synchronization
- Make channel field volatile for visibility
- Synchronize connect() to make channel creation and connection atomic
- Use double-checked locking in writeImpl() to prevent duplicate reconnection
- Synchronize close() for consistency
- Add concurrentReconnect() to verify thread-safety

Commit 2: Resource Leak & State Corruption Fix
- Use local variable in connect() and only assign to field on success
- Close channel in connect() if connection fails (prevents leak & corruption)
- Update udpReconnectIfClosed test to verify data delivery after reconnect

Test Plan
- ./gradlew :spectator-reg-sidecar:test
- concurrentReconnect() verifies concurrent reconnection safety
- udpReconnectIfClosed() verifies data delivery after reconnect

Fixes #1218
🤖 Generated with Claude Code
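The interleaving listed under the thread-safety bug can be replayed deterministically on a single thread, which may help when reasoning about the failure. The class `DoubleConnectReplay` and its static `channel` field are hypothetical stand-ins for the writer and its shared field. The comments do not say which thread runs each step, because the summary above does not say either.

```java
import java.io.IOException;
import java.net.SocketAddress;
import java.nio.channels.DatagramChannel;

// Hypothetical replay of the race; the real bug needs two threads to hit it.
final class DoubleConnectReplay {
  // Stands in for the shared, unsynchronized channel field.
  private static DatagramChannel channel;

  /** Runs the four steps in order and returns the exception from the second connect. */
  static RuntimeException replay(SocketAddress address) throws IOException {
    DatagramChannel x = DatagramChannel.open();
    channel = x; // first open: channel X
    DatagramChannel y = DatagramChannel.open();
    channel = y; // second open: channel Y overwrites the field, X is orphaned
    try {
      channel.connect(address); // one thread connects Y
      try {
        channel.connect(address); // the other thread also connects Y
        return null;
      } catch (IllegalStateException e) {
        // Depending on the JDK version this is a plain IllegalStateException
        // or its subclass AlreadyConnectedException.
        return e;
      }
    } finally {
      x.close(); // in the buggy code nothing closes X
      y.close();
    }
  }
}
```

Connecting a UDP channel only records the default peer, so a loopback address with any port, such as `new InetSocketAddress(InetAddress.getLoopbackAddress(), 9)`, is enough to trigger the second-connect failure.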