-
Notifications
You must be signed in to change notification settings - Fork 25.2k
transport: pass network channel exceptions to close listeners #127895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
transport: pass network channel exceptions to close listeners #127895
Conversation
Previously, exceptions encountered on a netty channel were caught and logged at some level, but not passed to the TcpChannel or Transport.Connection close listeners. This limited observability. This change implements this exception reporting and passing, with TcpChannel.onException and NodeChannels.closeAndFail reporting exceptions and their close listeners receiving them. Some test infrastructure (FakeTcpChannel) and assertions in close listener onFailure methods have been updated. Closes: ES-11644
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
As discussed (#127736), this is the first section broken into its own PR. I put some work into adding testing. Otherwise, there was one deleted line that was unnecessary, and a small comment I added to the Netty4TcpChannel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All makes sense to me, I left some comments but nothing major.
modules/transport-netty4/src/main/java/org/elasticsearch/transport/netty4/Netty4Transport.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/transport/InboundHandlerTests.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/transport/TcpChannel.java
Outdated
Show resolved
Hide resolved
modules/transport-netty4/src/main/java/org/elasticsearch/transport/netty4/Netty4Transport.java
Outdated
Show resolved
Hide resolved
modules/transport-netty4/src/main/java/org/elasticsearch/transport/netty4/Netty4TcpChannel.java
Outdated
Show resolved
Hide resolved
modules/transport-netty4/src/main/java/org/elasticsearch/transport/netty4/Netty4TcpChannel.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/transport/TcpTransport.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/transport/TcpTransportTests.java
Outdated
Show resolved
Hide resolved
- renamed channelError and onException - removed dead channel error logging code - used expectThrows pattern in tests - added closeListener test - added assert to onFailure branch of netty channel future/listener adapter - de-duplicated throwable/exception adapting code - log transport errors at debug
I addressed all the straightforward things -- thanks for the feedback :) I had a hard time getting a test into TcpTranportTests that wasn't a trivial test of a FakeTcpChannel. I was able to find a place in ClusterConnectionManagerTests that looked fine, but is at the Transport.Connection level. Please comment with any additional ideas -- the test infrastructure in this area is a lot to take in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good point on the testing, it's a little tricky. The only observable change here AFAICT is how the ChannelCloseLogger
logs the exception now if there is an exception, so I think that's what I'd try. We test this logging in org.elasticsearch.transport.netty4.ESLoggingHandlerIT#testConnectionLogging
today. Can we use that? It seems empirically that simply starting and stopping a node yields some clean closes and some Connection Reset
exceptions.
This isn't really the right place for that test, we should create a TcpTransportIT
and move it there too.
modules/transport-netty4/src/main/java/org/elasticsearch/transport/netty4/Netty4TcpChannel.java
Outdated
Show resolved
Hide resolved
@@ -308,18 +307,27 @@ protected void stopInternal() { | |||
}, serverBootstraps::clear, () -> clientBootstrap = null); | |||
} | |||
|
|||
private Exception exceptionFromThrowable(Throwable cause) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a slight preference for including the ExceptionsHelper.maybeDieOnAnotherThread(cause);
call in here too, just to make it clearer that we're not silently swallowing an Error
here (which would be a bad bug). I'd have moved the channel.setCloseException
call in here too I think. I could be persuaded to leave it like this if you feel strongly tho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the reasoning for separating just this bit out, was that this pattern is potentially widely used enough to be in ExceptionsHelper. There's actually another exception handler at the bottom of this file that uses this, and it's in several other netty error receivers. Maybe another reason to wait, is that there is a parallel set of channel wrappers in the HTTP domain, that may have similar issues around exception reporting on close.
I also thought about making the TcpChannel setCloseException take a throwable instead, and do this internally.
Most of these exception handlers are short and have the same 2-3 things, depending on the context. Even though it's boilerplate and could be abstracted (plus or minus the channel type and netty pattern), I gravitate more towards being able to see those few things in an important exception handler instead of having to follow the code. I think this is okay for now?
server/src/main/java/org/elasticsearch/transport/TcpTransport.java
Outdated
Show resolved
Hide resolved
- added test to check for close logging, with and without exception - fixed up close logger formatting - javadoc fixup for channel exception field - further refactored remaining code with throwable -> exception in Netty4Transport
int failAttempts = 0; | ||
do { | ||
internalCluster().restartNode(nodeName); | ||
} while (latch.await(500, TimeUnit.MILLISECONDS) == false && failAttempts++ < 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you encounter cases where we don't get an exceptional close on the first try?
Previously, exceptions encountered on a netty channel were caught and logged at some level, but not passed to the TcpChannel or Transport.Connection close listeners. This limited observability. This change implements this exception reporting and passing, with TcpChannel.onException and NodeChannels.closeAndFail reporting exceptions and their close listeners receiving them. Some test infrastructure (FakeTcpChannel) and assertions in close listener onFailure methods have been updated.
Closes: ES-11644