Remove lots of redundant ref-counting from transport pipeline #123390
Conversation
We can do with a whole lot less ref-counting, avoiding lots of contention and speeding up the logic in general, by only incrementing ref-counts where ownership is unclear and avoiding count changes on obvious "moves".
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
```diff
 @ChannelHandler.Sharable
-public class NettyByteBufSizer extends MessageToMessageDecoder<ByteBuf> {
+public class NettyByteBufSizer extends ChannelInboundHandlerAdapter {
```
This just isn't message-to-message decoding, so there's no need to mess with the ref-counts when we're purely filtering.
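To illustrate the point, here is a minimal, self-contained model (not Netty itself; `Buf` is an illustrative stand-in) of the difference: a `MessageToMessageDecoder` retains the emitted message and releases the input even when the "decode" is an identity transform, while a plain `ChannelInboundHandlerAdapter` can pass the message along with no count changes at all.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PassThroughDemo {
    static class Buf {
        final AtomicInteger refs = new AtomicInteger(1);
        int countOps = 0; // how many times the shared counter was touched
        Buf retain() { refs.incrementAndGet(); countOps++; return this; }
        void release() { refs.decrementAndGet(); countOps++; }
    }

    // MessageToMessageDecoder-style identity "decode": retain out, release in.
    static int opsForDecodeStyle() {
        Buf in = new Buf();
        Buf out = in.retain(); // decoder adds the retained input to its output list
        in.release();          // framework releases the original inbound message
        return out.countOps;   // two atomic updates for a no-op transform
    }

    // Pass-through handler: ownership moves down the pipeline unchanged.
    static int opsForPassThrough() {
        Buf in = new Buf();
        Buf out = in;          // ctx.fireChannelRead(msg) in the real handler
        return out.countOps;   // zero count changes
    }

    public static void main(String[] args) {
        System.out.println(opsForDecodeStyle() + " vs " + opsForPassThrough());
    }
}
```

Two contended atomic updates per message versus none, just for filtering by size.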
```java
}
} else {
    bytesConsumedThisDecode += maxBytesToConsume;
    bytesConsumed += maxBytesToConsume;
    fragmentConsumer.accept(retainedContent);
    if (maxBytesToConsume == remainingToConsume) {
        try (ReleasableBytesReference retained = reference.retainedSlice(0, maxBytesToConsume)) {
```
This would love a follow-up to remove the retaining on the slice: we should have a slice mechanism that doesn't retain, and only retain in the accumulator if we need to.
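A hedged sketch of that follow-up idea: a slice that is just a view sharing its parent's ref-count, with retention deferred to the point where the slice actually escapes (e.g. into the accumulator). The `Buf` class is an illustrative stand-in, not the real `ReleasableBytesReference` API.

```java
public class SliceSketch {
    static class Buf {
        int refCnt = 1;
        int countOps = 0;                 // ref-count updates performed
        Buf slice() { return this; }      // plain view: no count change (model)
        Buf retainedSlice() { refCnt++; countOps++; return this; }
        void release() { refCnt--; countOps++; }
    }

    // Today: every slice is retained, even when consumed synchronously.
    static int opsRetainedSlice() {
        Buf b = new Buf();
        Buf s = b.retainedSlice();
        s.release();                      // must undo the retain right away
        return b.countOps;
    }

    // Follow-up: slice freely, retain only if the accumulator keeps it.
    static int opsPlainSlice(boolean accumulatorKeepsIt) {
        Buf b = new Buf();
        Buf s = b.slice();
        if (accumulatorKeepsIt) {
            s.retainedSlice();            // retain exactly when needed
        }
        return b.countOps;
    }
}
```

The synchronous path goes from two count updates to zero; only slices that genuinely outlive the decode pay for a retain.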
Where is the original reference released?
This all goes to org.elasticsearch.transport.InboundPipeline#releasePendingBytes eventually. We return the number of bytes we consumed in this method, then we release that many bytes (in a way that could use some love next, maybe :)) in that method.
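A sketch of the accounting described here, under the assumption stated above: decode reports how many bytes it consumed, and `releasePendingBytes` drops that many bytes from the head of the pending queue, closing fully-consumed references. Names mirror the description, not the exact production code.

```java
import java.util.ArrayDeque;

public class ReleasePendingSketch {
    static final class Ref {
        int length;
        boolean closed;
        Ref(int length) { this.length = length; }
        void close() { closed = true; }
    }

    final ArrayDeque<Ref> pending = new ArrayDeque<>();

    void releasePendingBytes(int bytesConsumed) {
        while (bytesConsumed > 0) {
            Ref head = pending.peekFirst();
            if (head.length <= bytesConsumed) {
                bytesConsumed -= head.length;
                pending.pollFirst().close();   // fully consumed: release it
            } else {
                head.length -= bytesConsumed;  // partially consumed: keep the tail
                bytesConsumed = 0;
            }
        }
    }
}
```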
@@ -56,81 +55,74 @@ public void close() {
```java
public void handleBytes(TcpChannel channel, ReleasableBytesReference reference) throws IOException {
    if (uncaughtException != null) {
        reference.close();
```
If we either close outright or put the buffer in the list before doing anything, we're always safe I think.
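A sketch of that invariant (illustrative names, not the real API): on the error path the reference is closed immediately; otherwise it goes into the pending list before any work that might throw, so cleanup of the list always covers it.

```java
import java.util.ArrayDeque;

public class OwnershipSketch {
    static class Ref {
        boolean closed;
        void close() { closed = true; }
    }

    final ArrayDeque<Ref> pending = new ArrayDeque<>();
    Exception uncaughtException;

    void handleBytes(Ref reference) {
        if (uncaughtException != null) {
            reference.close();          // we own it, drop it right away
            return;
        }
        pending.addLast(reference);     // tracked before any decoding runs
        // doHandleBytes() would follow here; if it throws, the reference is
        // already in `pending` and gets released by the channel cleanup.
    }
}
```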
```java
private void doHandleBytes(TcpChannel channel) throws IOException {
    do {
        CheckedConsumer<Object, IOException> decodeConsumer = f -> forwardFragment(channel, f);
        int bytesDecoded = decoder.decode(pending.peekFirst(), decodeConsumer);
```
Always try a single buffer first. Otherwise, if a message was stuck/flushed behind the second half of a larger message in a pending chunk, all the slicing etc. will reference the full buffer needlessly, which is uncool for contention reasons, especially when we do a retained slice out of the second message.
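A toy model of the "single buffer first" decision (chunks are modeled as plain byte counts; the real code works on `ReleasableBytesReference`s): we only fall back to a composite over all pending chunks when the head chunk alone can't supply a full message, so nothing pins the larger buffers.

```java
import java.util.ArrayDeque;

public class DecodeSourceSketch {
    static String decodeSource(ArrayDeque<Integer> pending, int messageLength) {
        int head = pending.peekFirst();
        if (head >= messageLength) {
            return "head";          // decode from one buffer; nothing pins the rest
        }
        int total = pending.stream().mapToInt(Integer::intValue).sum();
        return total >= messageLength ? "composite" : "need-more-bytes";
    }
}
```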
```java
public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    if (msg instanceof ByteBuf buf && buf.capacity() >= 1024) {
        int readableBytes = buf.readableBytes();
        buf = buf.discardReadBytes().capacity(readableBytes);
```
This should be moved down, or the message delineation moved up into Netty; it's really only necessary to resize the buffer if we're planning to retain it beyond the current read.
I think we should go back to the AdaptiveRecvByteBufAllocator with min=4kb, init=8kb, max=64kb and remove the resizer. Allocating 64kb chunks and then trimming them, likely copying, looks wasteful. We should not see GC pressure with HTTP content streams for bulk. Different story in transport, I guess.
Also, I'm looking at metrics and see requests with payload sizes of p50 < 8kb and p75 < 64kb. A substantial number of requests will never need a 64kb chunk.
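A toy version of adaptive receive sizing in the spirit of the min=4kb/init=8kb/max=64kb suggestion: grow when a read fills the buffer, shrink after a small read. Note this is only a simplified illustration; Netty's actual AdaptiveRecvByteBufAllocator uses a size table and waits for consecutive small reads before shrinking.

```java
public class AdaptiveSizeSketch {
    static int nextSize(int current, int lastReadBytes, int min, int max) {
        if (lastReadBytes >= current) {
            return Math.min(current * 2, max);   // buffer was full: grow
        }
        if (lastReadBytes < current / 2) {
            return Math.max(current / 2, min);   // mostly empty: shrink
        }
        return current;                          // about right: keep it
    }
}
```

With the p50 < 8kb payloads mentioned above, sizing would hover near the 4-8kb floor instead of allocating and trimming 64kb chunks.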
Yea, that sounds like a good idea to me. The problem might be in those tiny requests in some way; I think we have a lot of potential for batching things together to save latency+bytes end-to-end, in search in particular.
#121885 will make a big difference here in real clusters by removing loads and loads of free context + repetitive shard query + empty shard result messages that are definitely all < 8kb and make up a huge percentage of requests overall.
We can look into that in a follow-up? We have to be a tiny bit careful here and maybe test the situation that motivated this thing in the first place, where degenerate cases of slow/flaky networks created series of ~2k sized buffers that then assembled into 100M transport messages with enormous overhead (stuff like CCR was the problem here).
Though to be fair, there's also a real fix for that by deserialising as we receive bytes the Netty way instead of assembling the full message up-front :)
> We can look into that in a follow-up?
Of course, I was thinking aloud.
```diff
-try (ReleasableBytesReference reference = Netty4Utils.toReleasableBytesReference(buffer)) {
-    pipeline.handleBytes(channel, reference);
+try {
+    pipeline.handleBytes(channel, Netty4Utils.toReleasableBytesReference(buffer));
```
I remember talking with @DaveCTurner about reference acquisition. It's easier to trace references when we acquire and release them next to each other, visually. I think the previous version is preferable: it's clear that at the end of the try the reference will be released. When we pass a reference to a different thread or buffer chunks, we explicitly acquire/release the ref, which makes the "move" more explicit. I can't say that I prefer slightly faster but hard-to-trace references :)
Otherwise it's not trivial: every good and exceptional termination needs to release refs explicitly. I tried to trace the HTTP path, and there are places where a ref is released "out of the blue" because some exception is going on.
I know that discussion, the two of us have had it many times over :D
What it comes down to is performance vs. ease of use, pretty much.
I fully agree that making each block of code keep the ref count constant when it's not escaping an object, and incrementing it by one when escaping, makes the code much easier to read.
There is a considerable performance downside to the pattern though.
We are escaping these buffers to another thread (and do so quite often, actually).
Now the difference between incrementing on the transport thread, decrementing on whatever pool we fork to, plus decrementing the original reference on the transport thread afterwards, vs. just "moving" from one thread to another without the fork, is quite relevant. It's still costly to "free on another thread", but considerably cheaper (and more importantly: considerably more scalable) without the contention of decrementing on two separate threads.
For other ref-counted code around search hits and so on, where there's a wider user base, I think we should keep things simple for now as David suggests. But in the core transport logic I think it's quite worthwhile to save some ref-counting to keep latencies and scaling behavior as predictable as possible. (Long story short: I'd rather have an easier time understanding the performance of the approach conceptually than trade that for slightly easier to read code, if that makes any sense? :))
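A small model of the two handoff styles being compared (thread interleaving elided; only the count traffic is modeled, with illustrative names). In the retain/release pattern the shared counter is touched three times, by two different threads in real code, i.e. a contended cache line; in the "move" pattern ownership transfers with no count change and only the receiver's single release touches the counter.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class HandoffDemo {
    static class Buf {
        final AtomicInteger refs = new AtomicInteger(1);
        int countOps = 0;
        void retain() { refs.incrementAndGet(); countOps++; }
        void release() { refs.decrementAndGet(); countOps++; }
    }

    // Explicit style: retain for the receiver, receiver releases its ref,
    // sender releases the original ref afterwards.
    static int retainReleaseHandoff() {
        Buf b = new Buf();
        b.retain();        // transport thread, before forking
        b.release();       // worker thread, when done
        b.release();       // transport thread, dropping its own ref
        return b.countOps; // three contended count updates, refs now 0
    }

    // Move style: the receiver becomes the sole owner; one release total.
    static int moveHandoff() {
        Buf b = new Buf();
        b.release();       // worker thread, as the only owner
        return b.countOps;
    }
}
```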
Do you know how much performance we gain from this, roughly?
That's impossible to quantify, I'm afraid. It matters more the more cores/threads you run, and the effect is more visible in throughput under heavy load than in lightly loaded end-to-end latency benchmarks.
We are eliminating some negative cache effects here, but there are a lot more of them all over the codebase. The best we can do is deal with this stuff one by one, I guess :) Ideally we simplify counting like this PR does, more and more, and move away from ref-counting (towards just a single close and a well-understood lifecycle) wherever possible, for both performance and eventually also conformance with Netty 5's approach to buffers.
That said, I can see ref-counting take up O(1-5%) of transport thread time in some heavy search throughput benchmarks as well as some visible amount of CPU cycles from contention on releasing buffers on search threads as well. I'd expect this change may already show a visible improvement in throughput benchmarks or at least get us very close to one in a follow-up.
Nice, I hope it will show up in the nightlies :)
> I'd rather have an easier time understanding the performance of the approach conceptually than trade that for slightly easier to read code if that makes any sense?
Yeah that's where we disagree :) IMO in the long run given the overall team structure "easier to read" is a close proxy to "correct", and that overrules any performance considerations (we could have both if only we had a borrow-checker 😉).
I'm ok with this in the heart of the transport code tho, especially since the refcounting discipline in the rest of the codebase is in much better shape than it used to be. Indeed I wonder if we can further simplify and flatten all the inbound pipeline stuff.
LGTM
Thanks Mikhail!
💚 Backport successful
We can do with a whole lot less ref-counting, avoiding lots of contention and speeding up the logic in general, by only incrementing ref-counts where ownership is unclear and avoiding count changes on obvious "moves".
This PR is a step in that direction, though obviously there's still loads of remaining spots where similar optimizations could be done.