
transport: pass network channel exceptions to close listeners #127895


Open · wants to merge 3 commits into main

Conversation

schase-es

Previously, exceptions encountered on a netty channel were caught and logged at some level, but not passed to the TcpChannel or Transport.Connection close listeners, which limited observability. This change implements that reporting and propagation: TcpChannel.onException and NodeChannels.closeAndFail now report exceptions, and the close listeners receive them. Some test infrastructure (FakeTcpChannel) and the assertions in close-listener onFailure methods have been updated accordingly.

Closes: ES-11644
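
In sketch form, the change makes a channel's close path exception-aware. The following is a minimal, self-contained illustration with hypothetical names (SketchTcpChannel, CloseListener); the actual PR works with TcpChannel, Netty4TcpChannel, NodeChannels, and ActionListener-based close listeners.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the PR's exact code: the channel records the
// exception that triggered its close, and close listeners receive it via
// onFailure instead of an unconditional onResponse.
class SketchTcpChannel {

    interface CloseListener {
        void onResponse(Void ignored); // clean close
        void onFailure(Exception e);   // exceptional close, now observable
    }

    private final List<CloseListener> closeListeners = new ArrayList<>();
    private volatile Exception closeException;

    void addCloseListener(CloseListener listener) {
        closeListeners.add(listener);
    }

    // Invoked when the underlying netty channel reports an error.
    void onException(Exception e) {
        closeException = e; // remember the cause before closing
        close();
    }

    void close() {
        for (CloseListener listener : closeListeners) {
            if (closeException != null) {
                listener.onFailure(closeException); // propagate rather than just log
            } else {
                listener.onResponse(null);
            }
        }
    }
}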

@schase-es requested a review from a team as a code owner May 8, 2025 07:59
@schase-es added the >non-issue and :Distributed Coordination/Network labels May 8, 2025
@schase-es requested a review from nicktindall May 8, 2025 07:59
@elasticsearchmachine added the Team:Distributed Coordination and v9.1.0 labels May 8, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@schase-es requested a review from DaveCTurner May 8, 2025 07:59
@schase-es (Author)

As discussed (#127736), this is the first section broken out into its own PR. I put some work into adding testing. Otherwise, I deleted one unnecessary line and added a small comment to Netty4TcpChannel.

@DaveCTurner (Contributor) left a comment

All makes sense to me, I left some comments but nothing major.

- renamed channelError and onException
- removed dead channel error logging code
- used expectThrows pattern in tests
- added closeListener test
- added assert to onFailure branch of netty channel future/listener adapter
- de-duplicated throwable/exception adapting code
- log transport errors at debug
@schase-es (Author)

I addressed all the straightforward things -- thanks for the feedback :)

I had a hard time getting a test into TcpTransportTests that wasn't a trivial test of a FakeTcpChannel. I was able to find a place in ClusterConnectionManagerTests that looked fine, but it sits at the Transport.Connection level.

Please comment with any additional ideas -- the test infrastructure in this area is a lot to take in.
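
For readers following along, the assertion shape being described looks roughly like this self-contained sketch (stand-in Connection interface; the real test works against Transport.Connection in ClusterConnectionManagerTests):

import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class CloseListenerTestSketch {

    // Stand-in for Transport.Connection, for illustration only.
    interface Connection {
        void addCloseListener(Consumer<Exception> onFailure);
        void failAndClose(Exception cause);
    }

    static void expectExceptionalClose(Connection connection) throws InterruptedException {
        CountDownLatch closed = new CountDownLatch(1);
        AtomicReference<Exception> observed = new AtomicReference<>();
        connection.addCloseListener(e -> {
            observed.set(e); // the channel exception now reaches the close listener
            closed.countDown();
        });
        connection.failAndClose(new IOException("simulated network error"));
        if (closed.await(10, TimeUnit.SECONDS) == false || observed.get() == null) {
            throw new AssertionError("close listener did not receive the exception");
        }
    }
}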

@DaveCTurner (Contributor) left a comment

Yeah, good point on the testing, it's a little tricky. The only observable change here AFAICT is how the ChannelCloseLogger now logs the exception when there is one, so I think that's what I'd try. We test this logging in org.elasticsearch.transport.netty4.ESLoggingHandlerIT#testConnectionLogging today. Can we use that? Empirically, simply starting and stopping a node yields some clean closes and some connection-reset exceptions.

That isn't really the right place for this test, though; we should create a TcpTransportIT and move it there too.
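
For what it's worth, a log-based assertion in that style might look like the following sketch. It assumes the MockLog test utility and its SeenEventExpectation, as used elsewhere in the test suite; the logger name, level, and message pattern here are illustrative guesses, not the actual ChannelCloseLogger output.

import org.apache.logging.log4j.Level;
import org.elasticsearch.test.MockLog;

// Hedged sketch: capture the transport logger and expect a close message that
// now carries the exception. Logger name and message pattern are assumptions.
try (MockLog mockLog = MockLog.capture("org.elasticsearch.transport.TcpTransport")) {
    mockLog.addExpectation(
        new MockLog.SeenEventExpectation(
            "exceptional close logged",
            "org.elasticsearch.transport.TcpTransport",
            Level.DEBUG,
            "*close connection exception caught*" // illustrative message pattern
        )
    );
    internalCluster().restartNode(nodeName); // provoke some exceptional closes
    mockLog.assertAllExpectationsMatched();
}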

@@ -308,18 +307,27 @@ protected void stopInternal() {
}, serverBootstraps::clear, () -> clientBootstrap = null);
}

private Exception exceptionFromThrowable(Throwable cause) {
@DaveCTurner (Contributor)

I have a slight preference for including the ExceptionsHelper.maybeDieOnAnotherThread(cause); call in here too, just to make it clearer that we're not silently swallowing an Error here (which would be a bad bug). I'd have moved the channel.setCloseException call in here too I think. I could be persuaded to leave it like this if you feel strongly tho.
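
Concretely, the suggestion would make the helper look something like this sketch (the names exceptionFromThrowable, maybeDieOnAnotherThread, and setCloseException come from the discussion above; the combined body is illustrative, not the merged code):

private Exception exceptionFromThrowable(Netty4TcpChannel channel, Throwable cause) {
    ExceptionsHelper.maybeDieOnAnotherThread(cause); // make Errors impossible to swallow silently
    final Exception e = cause instanceof Exception exception
        ? exception
        : new RuntimeException("unexpected throwable", cause); // wrap non-Exception throwables
    channel.setCloseException(e); // record the cause for the channel's close listeners
    return e;
}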

@schase-es (Author) May 9, 2025

Part of the reasoning for separating just this bit out was that this pattern is potentially widely used enough to belong in ExceptionsHelper. There's another exception handler at the bottom of this file that uses it, and it appears in several other netty error receivers. Another reason to wait is that there is a parallel set of channel wrappers in the HTTP domain that may have similar issues around exception reporting on close.

I also thought about making TcpChannel.setCloseException take a Throwable instead and do this adaptation internally.

Most of these exception handlers are short and do the same 2-3 things, depending on the context (sketched below). Even though that's boilerplate that could be abstracted (plus or minus the channel type and netty pattern), I lean towards being able to see those few things directly in an important exception handler rather than having to follow the code elsewhere. I think this is okay for now?
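
Those recurring 2-3 steps, in self-contained sketch form (hypothetical wiring; the real handlers delegate to ExceptionsHelper.maybeDieOnAnotherThread and the channel's close-exception bookkeeping):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Illustrative sketch only, not the PR's handler code.
class SketchExceptionHandler extends ChannelInboundHandlerAdapter {

    private volatile Exception closeException; // read later by the close path

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        // 1. Never swallow an Error (stand-in for maybeDieOnAnotherThread).
        if (cause instanceof Error error) {
            throw error;
        }
        // 2. Adapt the Throwable to an Exception.
        closeException = cause instanceof Exception e
            ? e
            : new RuntimeException("unexpected throwable", cause);
        // 3. Close the channel; the close path hands closeException to listeners.
        ctx.close();
    }
}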

- added test to check for close logging, with and without exception
- fixed up close logger formatting
- javadoc fixup for channel exception field
- further refactored remaining code with throwable -> exception in Netty4Transport
int failAttempts = 0;
do {
    // restart again if this attempt produced only clean closes
    // (i.e. the close listener has not yet counted down the latch)
    internalCluster().restartNode(nodeName);
} while (latch.await(500, TimeUnit.MILLISECONDS) == false && failAttempts++ < 10);
@DaveCTurner (Contributor)

Did you encounter cases where we don't get an exceptional close on the first try?
