
transport: pass network channel exceptions to close listeners #127895


Open · wants to merge 3 commits into main

Conversation

schase-es

Previously, exceptions encountered on a netty channel were caught and logged at some level, but not passed to the TcpChannel or Transport.Connection close listeners, which limited observability. This change implements that reporting and propagation: TcpChannel.onException and NodeChannels.closeAndFail now report exceptions, and the close listeners receive them. Some test infrastructure (FakeTcpChannel) and the assertions in close-listener onFailure methods have been updated accordingly.

Closes: ES-11644
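
In sketch form, the change makes a channel's close path exception-aware. The following is a minimal, self-contained illustration with hypothetical names (SketchTcpChannel, CloseListener); the actual PR works with TcpChannel, Netty4TcpChannel, NodeChannels, and ActionListener-based close listeners.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the PR's exact code: the channel records the
// exception that triggered its close, and close listeners receive it via
// onFailure instead of an unconditional onResponse.
class SketchTcpChannel {

    interface CloseListener {
        void onResponse(Void ignored); // clean close
        void onFailure(Exception e);   // exceptional close, now observable
    }

    private final List<CloseListener> closeListeners = new ArrayList<>();
    private volatile Exception closeException;

    void addCloseListener(CloseListener listener) {
        closeListeners.add(listener);
    }

    // Invoked when the underlying netty channel reports an error.
    void onException(Exception e) {
        closeException = e; // remember the cause before closing
        close();
    }

    void close() {
        for (CloseListener listener : closeListeners) {
            if (closeException != null) {
                listener.onFailure(closeException); // propagate rather than just log
            } else {
                listener.onResponse(null);
            }
        }
    }
}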

@schase-es requested a review from a team as a code owner May 8, 2025 07:59
@schase-es added the >non-issue and :Distributed Coordination/Network labels May 8, 2025
@schase-es requested a review from nicktindall May 8, 2025 07:59
@elasticsearchmachine added the Team:Distributed Coordination and v9.1.0 labels May 8, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@schase-es requested a review from DaveCTurner May 8, 2025 07:59
@schase-es (Author)

As discussed (#127736), this is the first section broken out into its own PR. I put some work into adding testing. Otherwise, I deleted one unnecessary line and added a small comment to Netty4TcpChannel.

@DaveCTurner (Contributor) left a comment

All makes sense to me, I left some comments but nothing major.

- renamed channelError and onException
- removed dead channel error logging code
- used expectThrows pattern in tests
- added closeListener test
- added assert to onFailure branch of netty channel future/listener adapter
- de-duplicated throwable/exception adapting code
- log transport errors at debug
@schase-es (Author)

I addressed all the straightforward things -- thanks for the feedback :)

I had a hard time getting a test into TcpTransportTests that wasn't a trivial test of a FakeTcpChannel. I was able to find a place in ClusterConnectionManagerTests that looked fine, but it sits at the Transport.Connection level.

Please comment with any additional ideas -- the test infrastructure in this area is a lot to take in.
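
For readers following along, the assertion shape being described looks roughly like this self-contained sketch (stand-in Connection interface; the real test works against Transport.Connection in ClusterConnectionManagerTests):

import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class CloseListenerTestSketch {

    // Stand-in for Transport.Connection, for illustration only.
    interface Connection {
        void addCloseListener(Consumer<Exception> onFailure);
        void failAndClose(Exception cause);
    }

    static void expectExceptionalClose(Connection connection) throws InterruptedException {
        CountDownLatch closed = new CountDownLatch(1);
        AtomicReference<Exception> observed = new AtomicReference<>();
        connection.addCloseListener(e -> {
            observed.set(e); // the channel exception now reaches the close listener
            closed.countDown();
        });
        connection.failAndClose(new IOException("simulated network error"));
        if (closed.await(10, TimeUnit.SECONDS) == false || observed.get() == null) {
            throw new AssertionError("close listener did not receive the exception");
        }
    }
}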

@DaveCTurner (Contributor) left a comment

Yeah, good point on the testing, it's a little tricky. The only observable change here AFAICT is how the ChannelCloseLogger now logs the exception when there is one, so I think that's what I'd try. We test this logging in org.elasticsearch.transport.netty4.ESLoggingHandlerIT#testConnectionLogging today. Can we use that? Empirically, simply starting and stopping a node yields some clean closes and some connection-reset exceptions.

That isn't really the right place for this test, though; we should create a TcpTransportIT and move it there too.
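
For what it's worth, a log-based assertion in that style might look like the following sketch. It assumes the MockLog test utility and its SeenEventExpectation, as used elsewhere in the test suite; the logger name, level, and message pattern here are illustrative guesses, not the actual ChannelCloseLogger output.

import org.apache.logging.log4j.Level;
import org.elasticsearch.test.MockLog;

// Hedged sketch: capture the transport logger and expect a close message that
// now carries the exception. Logger name and message pattern are assumptions.
try (MockLog mockLog = MockLog.capture("org.elasticsearch.transport.TcpTransport")) {
    mockLog.addExpectation(
        new MockLog.SeenEventExpectation(
            "exceptional close logged",
            "org.elasticsearch.transport.TcpTransport",
            Level.DEBUG,
            "*close connection exception caught*" // illustrative message pattern
        )
    );
    internalCluster().restartNode(nodeName); // provoke some exceptional closes
    mockLog.assertAllExpectationsMatched();
}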

@@ -308,18 +307,27 @@ protected void stopInternal() {
}, serverBootstraps::clear, () -> clientBootstrap = null);
}

private Exception exceptionFromThrowable(Throwable cause) {
@DaveCTurner (Contributor)

I have a slight preference for including the ExceptionsHelper.maybeDieOnAnotherThread(cause); call in here too, just to make it clearer that we're not silently swallowing an Error here (which would be a bad bug). I'd have moved the channel.setCloseException call in here too I think. I could be persuaded to leave it like this if you feel strongly tho.
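
Concretely, the suggestion would make the helper look something like this sketch (the names exceptionFromThrowable, maybeDieOnAnotherThread, and setCloseException come from the discussion above; the combined body is illustrative, not the merged code):

private Exception exceptionFromThrowable(Netty4TcpChannel channel, Throwable cause) {
    ExceptionsHelper.maybeDieOnAnotherThread(cause); // make Errors impossible to swallow silently
    final Exception e = cause instanceof Exception exception
        ? exception
        : new RuntimeException("unexpected throwable", cause); // wrap non-Exception throwables
    channel.setCloseException(e); // record the cause for the channel's close listeners
    return e;
}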

@schase-es (Author) May 9, 2025

Part of the reasoning for separating just this bit out was that this pattern is potentially widely used enough to belong in ExceptionsHelper. There's another exception handler at the bottom of this file that uses it, and it appears in several other netty error receivers. Another reason to wait is that there is a parallel set of channel wrappers in the HTTP domain that may have similar issues around exception reporting on close.

I also thought about making TcpChannel.setCloseException take a Throwable instead and do this adaptation internally.

Most of these exception handlers are short and do the same 2-3 things, depending on the context (sketched below). Even though that's boilerplate that could be abstracted (plus or minus the channel type and netty pattern), I lean towards being able to see those few things directly in an important exception handler rather than having to follow the code elsewhere. I think this is okay for now?
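
Those recurring 2-3 steps, in self-contained sketch form (hypothetical wiring; the real handlers delegate to ExceptionsHelper.maybeDieOnAnotherThread and the channel's close-exception bookkeeping):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Illustrative sketch only, not the PR's handler code.
class SketchExceptionHandler extends ChannelInboundHandlerAdapter {

    private volatile Exception closeException; // read later by the close path

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        // 1. Never swallow an Error (stand-in for maybeDieOnAnotherThread).
        if (cause instanceof Error error) {
            throw error;
        }
        // 2. Adapt the Throwable to an Exception.
        closeException = cause instanceof Exception e
            ? e
            : new RuntimeException("unexpected throwable", cause);
        // 3. Close the channel; the close path hands closeException to listeners.
        ctx.close();
    }
}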

- added test to check for close logging, with and without exception
- fixed up close logger formatting
- javadoc fixup for channel exception field
- further refactored remaining code with throwable -> exception in Netty4Transport
int failAttempts = 0;
do {
    // restart again if this attempt produced only clean closes
    // (i.e. the close listener has not yet counted down the latch)
    internalCluster().restartNode(nodeName);
} while (latch.await(500, TimeUnit.MILLISECONDS) == false && failAttempts++ < 10);
@DaveCTurner (Contributor)

Did you encounter cases where we don't get an exceptional close on the first try?
