sidecar: fix thread-safety bug in UdpWriter reconnect #1219

jzhuge · 2025-12-12T08:37:28Z

Summary

Fix multiple bugs in UdpWriter reconnect logic that cause thread-safety issues, resource leaks, state corruption, and data loss under high concurrency.

Problems Fixed

1. Thread-Safety Bug (Original Issue)

The reconnect logic added in #1180 (commit 75ee7aea) introduced a race condition. When multiple threads concurrently trigger reconnection after a ClosedChannelException, the connect() method's two non-atomic operations can interleave:

Thread A: channel = DatagramChannel.open() (creates channel X)
Thread B: channel = DatagramChannel.open() (creates channel Y, overwrites field)
Thread A: channel.connect(address) (connects channel Y)
Thread B: channel.connect(address) on Y → FAILS: "Connect already invoked"

2. Resource Leak

If DatagramChannel.open() succeeds but channel.connect(address) fails (e.g., IOException, SecurityException, UnresolvedAddressException), the opened channel is never closed, causing file descriptor exhaustion over time.

3. State Corruption ("Death Spiral")

If connect() fails after opening the channel, this.channel points to an open but unconnected channel. When writeImpl() tries to write to it:

Throws NotYetConnectedException (a RuntimeException, not ClosedChannelException)
The catch block doesn't catch it
Self-healing logic never triggers again
The writer is permanently broken
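
The failure mode above can be reproduced in isolation. The following is a minimal sketch, not code from this PR: the class name `UnconnectedWriteDemo` and its method are made up for illustration. It writes to a channel that was opened but never connected and reports which handler sees the failure.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.DatagramChannel;
import java.nio.charset.StandardCharsets;

/** Illustrative demo only; not part of the Spectator code base. */
public class UnconnectedWriteDemo {

  /** Writes to an open but unconnected channel and reports which handler ran. */
  public static String writeToUnconnectedChannel() throws IOException {
    try (DatagramChannel ch = DatagramChannel.open()) {
      ByteBuffer buf = ByteBuffer.wrap("c:demo:1".getBytes(StandardCharsets.UTF_8));
      try {
        ch.write(buf); // connect() was never called on this channel
        return "written";
      } catch (ClosedChannelException e) {
        // This is the only path that would trigger a reconnect.
        return "reconnect path";
      } catch (RuntimeException e) {
        // NotYetConnectedException is unchecked, so it lands here instead.
        return "escaped: " + e.getClass().getSimpleName();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(writeToUnconnectedChannel()); // prints "escaped: NotYetConnectedException"
  }
}
```

Without the extra `catch (RuntimeException e)` clause, which the original handler did not have, the exception would propagate to the caller and the reconnect path would never run.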

4. Data Loss

The original code dropped the pending line even when reconnection succeeded, because it rethrew the exception (throw e) after calling connect() instead of retrying the write.

Impact

High-throughput Spark jobs publishing metrics via Spectator fail under high concurrency.

Changes

Commit 1: Thread-Safety Fix

Add lock object for synchronization
Make channel field volatile for visibility
Synchronize connect() to make channel creation and connection atomic
Use double-checked locking in writeImpl() to prevent duplicate reconnection
Synchronize close() for consistency
Add test case concurrentReconnect() to verify thread-safety

Commit 2: Resource Leak & State Corruption Fix

Use local variable in connect() and only assign to field on success
Close channel in connect() if connection fails (prevents leak & corruption)
Retry write after successful reconnection (prevents data loss)
Handle case where another thread already reconnected
Update udpReconnectIfClosed test to verify data delivery after reconnect
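
Taken together, the two commits describe a pattern that can be sketched as follows. This is a simplified stand-in written for illustration, not the actual UdpWriter source: the class name `ReconnectingUdpWriter`, the `write` method, and the `simulateChannelFailure` hook are assumptions, and shutdown handling is deliberately left out.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.DatagramChannel;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for UdpWriter; not the actual Spectator class. */
public class ReconnectingUdpWriter {
  private final Object lock = new Object();
  private final SocketAddress address;
  private volatile DatagramChannel channel;

  public ReconnectingUdpWriter(SocketAddress address) throws IOException {
    this.address = address;
    connect();
  }

  /** Opens and connects under the lock; the field only ever sees a connected channel. */
  private void connect() throws IOException {
    synchronized (lock) {
      DatagramChannel newChannel = DatagramChannel.open();
      try {
        newChannel.connect(address);
        channel = newChannel; // assign only after connect succeeded
      } catch (Exception e) {
        try {
          newChannel.close(); // avoid leaking the descriptor
        } catch (IOException ignored) {
          // Keep the original failure.
        }
        throw e;
      }
    }
  }

  public void write(String line) throws IOException {
    ByteBuffer buffer = ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8));
    DatagramChannel ch = channel;
    try {
      ch.write(buffer);
    } catch (ClosedChannelException e) {
      synchronized (lock) {
        // Double check: reconnect only if no other thread already replaced the channel.
        if (channel == ch) {
          connect();
        }
      }
      buffer.rewind();
      channel.write(buffer); // retry once so the line is not dropped
    }
  }

  /** Hypothetical hook that closes the channel to simulate a failure. */
  public void simulateChannelFailure() throws IOException {
    channel.close();
  }

  public static void main(String[] args) throws IOException {
    try (DatagramChannel receiver = DatagramChannel.open()) {
      receiver.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
      ReconnectingUdpWriter writer = new ReconnectingUdpWriter(receiver.getLocalAddress());
      writer.write("before");
      writer.simulateChannelFailure();
      writer.write("after"); // reconnects, then retries the write
      writer.simulateChannelFailure(); // release the socket at the end of the demo
    }
  }
}
```

Because Java monitors are reentrant, calling the synchronized `connect()` from inside the `synchronized (lock)` block in `write` does not deadlock.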

Test Plan

✅ All existing tests pass: ./gradlew :spectator-reg-sidecar:test
✅ New test concurrentReconnect() verifies concurrent reconnection safety
✅ Updated test udpReconnectIfClosed() verifies data delivery after reconnect
✅ Checkstyle passes

Fixes #1218

🤖 Generated with Claude Code

The reconnect logic added in Netflix#1180 introduced a race condition when multiple threads concurrently trigger reconnection. The connect() method performs two non-atomic operations without synchronization, causing "Connect already invoked" exceptions.

Add synchronization to connect() and use double-checked locking in writeImpl() to prevent concurrent reconnection attempts. Add test case for concurrent reconnection to verify the fix.

Fixes Netflix#1218

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Address additional bugs in the reconnect logic beyond the initial thread-safety fix:

1. **Resource leak**: If DatagramChannel.open() succeeds but connect() fails, the opened channel is never closed, causing file descriptor exhaustion.
2. **State corruption**: If connect() fails after opening the channel, this.channel points to an open but unconnected channel. Subsequent writes throw NotYetConnectedException (not caught by the ClosedChannelException handler), permanently breaking the writer.
3. **Data loss**: Original code dropped data after reconnection even if reconnection succeeded.

**Changes:**

- Use local variable in connect() and only assign on success
- Close channel in connect() if connection fails (prevents leak)
- Retry write after successful reconnection (prevents data loss)
- Handle case where another thread already reconnected
- Update test to verify data delivery after reconnection

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

jzhuge

⚠️ Resource Leak Reintroduced in `connect()`

The change from catch (Throwable t) to catch (IOException e) reintroduces a resource leak for non-IOException errors.

Problem

DatagramChannel.connect() can throw exceptions that are not IOException:

| Exception | When | Extends IOException? |
| --- | --- | --- |
| `SecurityException` | Security manager denies permission | ❌ No (RuntimeException) |
| `UnsupportedOperationException` | Address type not supported | ❌ No (RuntimeException) |
| `IllegalArgumentException` | Invalid address | ❌ No (RuntimeException) |
| `ClosedByInterruptException` | Thread interrupted | ✅ Yes |

With catch (IOException e), if connect() throws SecurityException or UnsupportedOperationException, the catch block is bypassed, the newChannel is never closed, and we have a resource leak.
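
The bypass can be shown without any networking. The sketch below is illustrative only: `TrackedResource` is a made-up stand-in for the channel, and its `connect()` always throws a simulated SecurityException. It compares cleanup guarded by `catch (IOException e)` with cleanup guarded by a broader catch.

```java
import java.io.Closeable;
import java.io.IOException;

/** Illustrative demo only; TrackedResource is a made-up stand-in for DatagramChannel. */
public class NarrowCatchLeakDemo {

  static final class TrackedResource implements Closeable {
    boolean closed;

    void connect() throws IOException {
      // Simulates an unchecked failure during connect.
      throw new SecurityException("simulated denial");
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Cleanup runs only for IOException; unchecked failures skip it. */
  static void connectNarrow(TrackedResource r) throws IOException {
    try {
      r.connect();
    } catch (IOException e) {
      r.close();
      throw e;
    }
  }

  /** Cleanup runs for checked and unchecked exceptions alike. */
  static void connectBroad(TrackedResource r) throws IOException {
    try {
      r.connect();
    } catch (Exception e) {
      r.close();
      throw e; // precise rethrow: still only IOException or unchecked
    }
  }

  /** Returns whether the resource was closed after the simulated failure. */
  public static boolean closedAfterFailure(boolean broad) throws IOException {
    TrackedResource r = new TrackedResource();
    try {
      if (broad) {
        connectBroad(r);
      } else {
        connectNarrow(r);
      }
    } catch (SecurityException expected) {
      // The simulated failure propagates in both variants.
    }
    return r.closed;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("narrow catch closed the resource: " + closedAfterFailure(false)); // false
    System.out.println("broad catch closed the resource: " + closedAfterFailure(true));   // true
  }
}
```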

Suggested Fix

Restore catch (Throwable t) to ensure the channel is always closed on any failure:

private void connect() throws IOException {
    DatagramChannel newChannel = DatagramChannel.open();
    try {
        newChannel.connect(address);
        channel = newChannel;
    } catch (Throwable t) {  // Must catch Throwable to prevent resource leaks
        try {
            newChannel.close();
        } catch (IOException ignored) {
            // Suppress close exception during error handling
        }
        throw t;
    }
}

If PMD complains about catching Throwable, we can suppress it with a comment:

} catch (Throwable t) {  // NOPMD - must catch all exceptions to prevent resource leak

Or handle the re-throw more explicitly:

} catch (Throwable t) {
    try {
        newChannel.close();
    } catch (IOException ignored) {
    }
    if (t instanceof IOException) {
        throw (IOException) t;
    } else if (t instanceof RuntimeException) {
        throw (RuntimeException) t;
    } else if (t instanceof Error) {
        throw (Error) t;
    } else {
        // Should never happen, but satisfy compiler
        throw new IOException(t);
    }
}

Reference

DatagramChannel.connect() javadoc lists all possible exceptions.

jzhuge · 2025-12-12T18:43:35Z

Excellent analysis! Gemini 3 Pro identified a critical bug I completely missed.

The "Zombie Resurrection" Bug

You're absolutely correct. After close() is called:

The channel is closed
But channel field still points to the closed channel object
A worker thread hits ClosedChannelException
Checks if (channel == ch) → true (both point to same closed object)
Calls connect() → Writer resurrects from the dead 🧟

This violates the contract that close() should permanently shut down the writer.

The NPE Risk

And you're right that simply setting channel = null in close() would cause NPE in the else block:

} else {
    // Another thread reconnected, retry the write once with new channel
    try {
        buffer.rewind();
        channel.write(buffer);  // NPE if channel is null

The Solution: Explicit `closed` State

Your suggested fix with the closed flag is the correct approach. It:

✅ Prevents resurrection after close() is called
✅ Avoids NPE by checking closed before operations
✅ Makes shutdown semantics explicit and clear
✅ Follows standard pattern for closeable resources

One minor refinement to your suggestion - in writeImpl(), we might want to throw an exception when closed instead of silently returning, to signal that writes are no longer accepted:

@Override
public void writeImpl(String line) throws IOException {
    if (closed) {
        throw new IOException("Writer has been closed");
    }
    // ... rest of implementation
}

Or if we want to follow the existing pattern of suppressing errors (from the SidecarWriter.write() wrapper), we can silently return as you suggested.
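
As a rough sketch of how an explicit closed flag could fit into the reconnect path, consider the following. It is written for illustration under several assumptions: the class name `ClosableUdpWriter`, the `write` method, and the `isChannelOpen` accessor are hypothetical, and it takes the throw-on-closed option rather than the silent return.

```java
import java.io.Closeable;
import java.io.IOException;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.DatagramChannel;
import java.nio.charset.StandardCharsets;

/** Hypothetical writer showing the closed-flag idea; not the actual UdpWriter. */
public class ClosableUdpWriter implements Closeable {
  private final Object lock = new Object();
  private final SocketAddress address;
  private volatile DatagramChannel channel;
  private volatile boolean closed;

  public ClosableUdpWriter(SocketAddress address) throws IOException {
    this.address = address;
    connect();
  }

  private void connect() throws IOException {
    synchronized (lock) {
      DatagramChannel newChannel = DatagramChannel.open();
      try {
        newChannel.connect(address);
        channel = newChannel;
      } catch (Exception e) {
        try {
          newChannel.close();
        } catch (IOException ignored) {
          // Keep the original failure.
        }
        throw e;
      }
    }
  }

  public void write(String line) throws IOException {
    if (closed) {
      throw new IOException("Writer has been closed");
    }
    ByteBuffer buffer = ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8));
    DatagramChannel ch = channel;
    try {
      ch.write(buffer);
    } catch (ClosedChannelException e) {
      synchronized (lock) {
        // Re-check under the lock so close() cannot be undone by a reconnect.
        if (closed) {
          throw new IOException("Writer has been closed", e);
        }
        if (channel == ch) {
          connect();
        }
      }
      buffer.rewind();
      channel.write(buffer);
    }
  }

  @Override
  public void close() throws IOException {
    synchronized (lock) {
      closed = true; // set before closing so racing writers see the flag
      channel.close();
    }
  }

  /** Hypothetical accessor used to observe that no new channel was opened. */
  public boolean isChannelOpen() {
    return channel.isOpen();
  }
}
```

The channel field is never set to null here, so the retry path cannot hit the NPE described above, and the flag check inside the lock keeps a late ClosedChannelException from reopening the writer.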

Summary

Your review is spot-on:

✅ connect() resource leak is fixed
✅ Retry logic prevents data loss
❌ Zombie resurrection bug - critical issue
❌ Missing explicit shutdown state

The closed flag is essential for correct shutdown semantics.

Great catch! 🎯

jzhuge · 2025-12-12T19:26:35Z

The "Zombie Resurrection" Bug

The channel is closed

But channel field still points to the closed channel object

A worker thread hits ClosedChannelException

Checks if (channel == ch) → true (both point to same closed object)

Calls connect() → Writer resurrects from the dead 🧟

This violates the contract that close() should permanently shut down the writer.

Will follow up if necessary, since close() is only called during shutdown.

Change exception handling from catching IOException to Exception to prevent resource leaks when connect() throws RuntimeException.

DatagramChannel.connect() can throw:

- IOException and subclasses (ClosedChannelException, etc.)
- SecurityException (RuntimeException)
- UnsupportedOperationException (RuntimeException)
- IllegalArgumentException (RuntimeException)

Previously only IOException was caught, causing resource leaks when RuntimeException occurred. Now catch Exception to handle both IOException and RuntimeException while allowing Errors (OOM, etc.) to propagate for JVM health.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

jzhuge and others added 2 commits December 12, 2025 00:35

jzhuge marked this pull request as draft December 12, 2025 09:33

brharrington added this to the 1.9.3 milestone Dec 12, 2025

fix pmd errors

5e0d5e8

jzhuge marked this pull request as ready for review December 12, 2025 19:33

brharrington added the bug label Dec 16, 2025

brharrington approved these changes Jan 5, 2026

brharrington merged commit 7d8dd72 into Netflix:main Jan 5, 2026
1 check passed
