Skip to content

Conversation

@jzhuge
Copy link
Contributor

@jzhuge jzhuge commented Dec 12, 2025

Summary

Fix multiple bugs in UdpWriter reconnect logic that cause thread-safety issues, resource leaks, state corruption, and data loss under high concurrency.

Problems Fixed

1. Thread-Safety Bug (Original Issue)

The reconnect logic added in #1180 (commit 75ee7aea) introduced a race condition. When multiple threads concurrently trigger reconnection after a ClosedChannelException, the connect() method's two non-atomic operations can interleave:

  1. Thread A: channel = DatagramChannel.open() (creates channel X)
  2. Thread B: channel = DatagramChannel.open() (creates channel Y, overwrites field)
  3. Thread A: channel.connect(address) (connects channel Y)
  4. Thread B: channel.connect(address) on Y → FAILS: "Connect already invoked"

2. Resource Leak

If DatagramChannel.open() succeeds but channel.connect(address) fails (e.g., IOException, SecurityException, UnresolvedAddressException), the opened channel is never closed, causing file descriptor exhaustion over time.

3. State Corruption ("Death Spiral")

If connect() fails after opening the channel, this.channel points to an open but unconnected channel. When writeImpl() tries to write to it:

  • Throws NotYetConnectedException (a RuntimeException, not ClosedChannelException)
  • The catch block doesn't catch it
  • Self-healing logic never triggers again
  • The writer is permanently broken

4. Data Loss

The original code dropped data after reconnection even if reconnection succeeded, as it would throw e after calling connect().

Impact

High-throughput Spark jobs publishing metrics via Spectator fail under high concurrency.

Changes

Commit 1: Thread-Safety Fix

  1. Add lock object for synchronization
  2. Make channel field volatile for visibility
  3. Synchronize connect() to make channel creation and connection atomic
  4. Use double-checked locking in writeImpl() to prevent duplicate reconnection
  5. Synchronize close() for consistency
  6. Add test case concurrentReconnect() to verify thread-safety

Commit 2: Resource Leak & State Corruption Fix

  1. Use local variable in connect() and only assign to field on success
  2. Close channel in connect() if connection fails (prevents leak & corruption)
  3. Retry write after successful reconnection (prevents data loss)
  4. Handle case where another thread already reconnected
  5. Update udpReconnectIfClosed test to verify data delivery after reconnect

Test Plan

  • ✅ All existing tests pass: ./gradlew :spectator-reg-sidecar:test
  • ✅ New test concurrentReconnect() verifies concurrent reconnection safety
  • ✅ Updated test udpReconnectIfClosed() verifies data delivery after reconnect
  • ✅ Checkstyle passes

Fixes #1218

🤖 Generated with Claude Code

jzhuge and others added 2 commits December 12, 2025 00:35
The reconnect logic added in Netflix#1180 introduced a race condition when
multiple threads concurrently trigger reconnection. The connect()
method performs two non-atomic operations without synchronization,
causing "Connect already invoked" exceptions.

Add synchronization to connect() and use double-checked locking in
writeImpl() to prevent concurrent reconnection attempts.

Add test case for concurrent reconnection to verify the fix.

Fixes Netflix#1218

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Address additional bugs in the reconnect logic beyond the initial
thread-safety fix:

1. **Resource leak**: If DatagramChannel.open() succeeds but connect()
   fails, the opened channel is never closed, causing file descriptor
   exhaustion.

2. **State corruption**: If connect() fails after opening the channel,
   this.channel points to an open but unconnected channel. Subsequent
   writes throw NotYetConnectedException (not caught by the
   ClosedChannelException handler), permanently breaking the writer.

3. **Data loss**: Original code dropped data after reconnection even
   if reconnection succeeded.

**Changes:**
- Use local variable in connect() and only assign on success
- Close channel in connect() if connection fails (prevents leak)
- Retry write after successful reconnection (prevents data loss)
- Handle case where another thread already reconnected
- Update test to verify data delivery after reconnection

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@jzhuge jzhuge marked this pull request as draft December 12, 2025 09:33
@brharrington brharrington added this to the 1.9.3 milestone Dec 12, 2025
Copy link
Contributor Author

@jzhuge jzhuge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Resource Leak Reintroduced in connect()

The change from catch (Throwable t) to catch (IOException e) reintroduces a resource leak for non-IOException errors.

Problem

DatagramChannel.connect() can throw exceptions that are not IOException:

Exception When Extends IOException?
SecurityException Security manager denies permission ❌ No (RuntimeException)
UnsupportedOperationException Address type not supported ❌ No (RuntimeException)
IllegalArgumentException Invalid address ❌ No (RuntimeException)
ClosedByInterruptException Thread interrupted ✅ Yes

With catch (IOException e), if connect() throws SecurityException or UnsupportedOperationException, the catch block is bypassed, the newChannel is never closed, and we have a resource leak.

Suggested Fix

Restore catch (Throwable t) to ensure the channel is always closed on any failure:

private void connect() throws IOException {
    DatagramChannel newChannel = DatagramChannel.open();
    try {
        newChannel.connect(address);
        channel = newChannel;
    } catch (Throwable t) {  // Must catch Throwable to prevent resource leaks
        try {
            newChannel.close();
        } catch (IOException ignored) {
            // Suppress close exception during error handling
        }
        throw t;
    }
}

If PMD complains about catching Throwable, we can suppress it with a comment:

} catch (Throwable t) {  // NOPMD - must catch all exceptions to prevent resource leak

Or handle the re-throw more explicitly:

} catch (Throwable t) {
    try {
        newChannel.close();
    } catch (IOException ignored) {
    }
    if (t instanceof IOException) {
        throw (IOException) t;
    } else if (t instanceof RuntimeException) {
        throw (RuntimeException) t;
    } else if (t instanceof Error) {
        throw (Error) t;
    } else {
        // Should never happen, but satisfy compiler
        throw new IOException(t);
    }
}

Reference

DatagramChannel.connect() javadoc lists all possible exceptions.

@jzhuge
Copy link
Contributor Author

jzhuge commented Dec 12, 2025

Excellent analysis! Gemini 3 Pro identified a critical bug I completely missed.

The "Zombie Resurrection" Bug

You're absolutely correct. After close() is called:

  1. The channel is closed
  2. But channel field still points to the closed channel object
  3. A worker thread hits ClosedChannelException
  4. Checks if (channel == ch)true (both point to same closed object)
  5. Calls connect()Writer resurrects from the dead 🧟

This violates the contract that close() should permanently shut down the writer.

The NPE Risk

And you're right that simply setting channel = null in close() would cause NPE in the else block:

} else {
    // Another thread reconnected, retry the write once with new channel
    try {
        buffer.rewind();
        channel.write(buffer);  // NPE if channel is null

The Solution: Explicit closed State

Your suggested fix with the closed flag is the correct approach. It:

  • ✅ Prevents resurrection after close() is called
  • ✅ Avoids NPE by checking closed before operations
  • ✅ Makes shutdown semantics explicit and clear
  • ✅ Follows standard pattern for closeable resources

One minor refinement to your suggestion - in writeImpl(), we might want to throw an exception when closed instead of silently returning, to signal that writes are no longer accepted:

@Override public void writeImpl(String line) throws IOException {
    if (closed) {
        throw new IOException("Writer has been closed");
    }
    // ... rest of implementation
}

Or if we want to follow the existing pattern of suppressing errors (from the SidecarWriter.write() wrapper), we can silently return as you suggested.

Summary

Your review is spot-on:

  • connect() resource leak is fixed
  • ✅ Retry logic prevents data loss
  • Zombie resurrection bug - critical issue
  • ❌ Missing explicit shutdown state

The closed flag is essential for correct shutdown semantics.

Great catch! 🎯

@jzhuge
Copy link
Contributor Author

jzhuge commented Dec 12, 2025

The "Zombie Resurrection" Bug

  1. The channel is closed
  2. But channel field still points to the closed channel object
  3. A worker thread hits ClosedChannelException
  4. Checks if (channel == ch)true (both point to same closed object)
  5. Calls connect()Writer resurrects from the dead 🧟

This violates the contract that close() should permanently shut down the writer.

Follow up if necessary as close() is only called during shut down

Change exception handling from catching IOException to Exception to
prevent resource leaks when connect() throws RuntimeException.

DatagramChannel.connect() can throw:
- IOException and subclasses (ClosedChannelException, etc.)
- SecurityException (RuntimeException)
- UnsupportedOperationException (RuntimeException)
- IllegalArgumentException (RuntimeException)

Previously only IOException was caught, causing resource leaks when
RuntimeException occurred. Now catch Exception to handle both
IOException and RuntimeException while allowing Errors (OOM, etc.)
to propagate for JVM health.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@jzhuge jzhuge marked this pull request as ready for review December 12, 2025 19:33
@brharrington brharrington merged commit 7d8dd72 into Netflix:main Jan 5, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition in UdpWriter

2 participants