Remove lots of redundant ref-counting from transport pipeline #123390
Conversation
We can do with a whole lot less ref-counting, avoiding lots of contention and speeding up the logic in general, by only incrementing ref-counts where ownership is unclear and avoiding count changes on obvious "moves".
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
```diff
 @ChannelHandler.Sharable
-public class NettyByteBufSizer extends MessageToMessageDecoder<ByteBuf> {
+public class NettyByteBufSizer extends ChannelInboundHandlerAdapter {
```
This just isn't message-to-message decoding, so there's no need to mess with the ref-counts when we're purely filtering.
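To illustrate the point, here is a minimal, self-contained model (not Netty itself; `Buf` is an illustrative stand-in) of the difference: a `MessageToMessageDecoder` retains the emitted message and releases the input even when the "decode" is an identity transform, while a plain `ChannelInboundHandlerAdapter` can pass the message along with no count changes at all.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PassThroughDemo {
    static class Buf {
        final AtomicInteger refs = new AtomicInteger(1);
        int countOps = 0; // how many times the shared counter was touched
        Buf retain() { refs.incrementAndGet(); countOps++; return this; }
        void release() { refs.decrementAndGet(); countOps++; }
    }

    // MessageToMessageDecoder-style identity "decode": retain out, release in.
    static int opsForDecodeStyle() {
        Buf in = new Buf();
        Buf out = in.retain(); // decoder adds the retained input to its output list
        in.release();          // framework releases the original inbound message
        return out.countOps;   // two atomic updates for a no-op transform
    }

    // Pass-through handler: ownership moves down the pipeline unchanged.
    static int opsForPassThrough() {
        Buf in = new Buf();
        Buf out = in;          // ctx.fireChannelRead(msg) in the real handler
        return out.countOps;   // zero count changes
    }

    public static void main(String[] args) {
        System.out.println(opsForDecodeStyle() + " vs " + opsForPassThrough());
    }
}
```

Two contended atomic updates per message versus none, just for filtering by size.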
```java
}
} else {
    bytesConsumedThisDecode += maxBytesToConsume;
    bytesConsumed += maxBytesToConsume;
    fragmentConsumer.accept(retainedContent);
    if (maxBytesToConsume == remainingToConsume) {
        try (ReleasableBytesReference retained = reference.retainedSlice(0, maxBytesToConsume)) {
```
This would love a follow-up to remove the retaining on the slice: we should have a slice mechanism that doesn't retain, and only retain in the accumulator if we need to.
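A hedged sketch of that follow-up idea: a slice that is just a view sharing its parent's ref-count, with retention deferred to the point where the slice actually escapes (e.g. into the accumulator). The `Buf` class is an illustrative stand-in, not the real `ReleasableBytesReference` API.

```java
public class SliceSketch {
    static class Buf {
        int refCnt = 1;
        int countOps = 0;                 // ref-count updates performed
        Buf slice() { return this; }      // plain view: no count change (model)
        Buf retainedSlice() { refCnt++; countOps++; return this; }
        void release() { refCnt--; countOps++; }
    }

    // Today: every slice is retained, even when consumed synchronously.
    static int opsRetainedSlice() {
        Buf b = new Buf();
        Buf s = b.retainedSlice();
        s.release();                      // must undo the retain right away
        return b.countOps;
    }

    // Follow-up: slice freely, retain only if the accumulator keeps it.
    static int opsPlainSlice(boolean accumulatorKeepsIt) {
        Buf b = new Buf();
        Buf s = b.slice();
        if (accumulatorKeepsIt) {
            s.retainedSlice();            // retain exactly when needed
        }
        return b.countOps;
    }
}
```

The synchronous path goes from two count updates to zero; only slices that genuinely outlive the decode pay for a retain.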
Where is the original reference released?
This all goes to org.elasticsearch.transport.InboundPipeline#releasePendingBytes eventually. We return the number of bytes we consumed in this method, then we release that many bytes (in a way that could use some love next, maybe :)) in that method.
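A sketch of the accounting described here, under the assumption stated above: decode reports how many bytes it consumed, and `releasePendingBytes` drops that many bytes from the head of the pending queue, closing fully-consumed references. Names mirror the description, not the exact production code.

```java
import java.util.ArrayDeque;

public class ReleasePendingSketch {
    static final class Ref {
        int length;
        boolean closed;
        Ref(int length) { this.length = length; }
        void close() { closed = true; }
    }

    final ArrayDeque<Ref> pending = new ArrayDeque<>();

    void releasePendingBytes(int bytesConsumed) {
        while (bytesConsumed > 0) {
            Ref head = pending.peekFirst();
            if (head.length <= bytesConsumed) {
                bytesConsumed -= head.length;
                pending.pollFirst().close();   // fully consumed: release it
            } else {
                head.length -= bytesConsumed;  // partially consumed: keep the tail
                bytesConsumed = 0;
            }
        }
    }
}
```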
@@ -56,81 +55,74 @@ public void close() {
```java
public void handleBytes(TcpChannel channel, ReleasableBytesReference reference) throws IOException {
    if (uncaughtException != null) {
        reference.close();
```
If we either close outright or put the buffer in the list before doing anything, we're always safe I think.
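A sketch of that invariant (illustrative names, not the real API): on the error path the reference is closed immediately; otherwise it goes into the pending list before any work that might throw, so cleanup of the list always covers it.

```java
import java.util.ArrayDeque;

public class OwnershipSketch {
    static class Ref {
        boolean closed;
        void close() { closed = true; }
    }

    final ArrayDeque<Ref> pending = new ArrayDeque<>();
    Exception uncaughtException;

    void handleBytes(Ref reference) {
        if (uncaughtException != null) {
            reference.close();          // we own it, drop it right away
            return;
        }
        pending.addLast(reference);     // tracked before any decoding runs
        // doHandleBytes() would follow here; if it throws, the reference is
        // already in `pending` and gets released by the channel cleanup.
    }
}
```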
```java
private void doHandleBytes(TcpChannel channel) throws IOException {
    do {
        CheckedConsumer<Object, IOException> decodeConsumer = f -> forwardFragment(channel, f);
        int bytesDecoded = decoder.decode(pending.peekFirst(), decodeConsumer);
```
Always try a single buffer first. Otherwise, if a message was stuck/flushed behind the second half of a larger message in a pending chunk, all the slicing etc. will reference the full buffer needlessly, which is uncool for contention reasons, especially when we do a retained slice out of the second message.
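A toy model of the "single buffer first" decision (chunks are modeled as plain byte counts; the real code works on `ReleasableBytesReference`s): we only fall back to a composite over all pending chunks when the head chunk alone can't supply a full message, so nothing pins the larger buffers.

```java
import java.util.ArrayDeque;

public class DecodeSourceSketch {
    static String decodeSource(ArrayDeque<Integer> pending, int messageLength) {
        int head = pending.peekFirst();
        if (head >= messageLength) {
            return "head";          // decode from one buffer; nothing pins the rest
        }
        int total = pending.stream().mapToInt(Integer::intValue).sum();
        return total >= messageLength ? "composite" : "need-more-bytes";
    }
}
```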
```java
public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    if (msg instanceof ByteBuf buf && buf.capacity() >= 1024) {
        int readableBytes = buf.readableBytes();
        buf = buf.discardReadBytes().capacity(readableBytes);
```
This should be moved down, or the message delineation moved up into Netty; it's really only necessary to resize the buffer if we're planning to retain it beyond the current read.
I think we should go back to the AdaptiveRecvByteBufAllocator with min=4kb, init=8kb, max=64kb and remove the resizer. Allocating 64kb chunks and then trimming them, likely copying, looks wasteful. We should not see GC pressure with HTTP content streams for bulk. Different story in transport, I guess.
Also, I'm looking at metrics and see requests with payload sizes of p50 < 8kb and p75 < 64kb. A substantial number of requests will never need a 64kb chunk.
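A toy version of adaptive receive sizing in the spirit of the min=4kb/init=8kb/max=64kb suggestion: grow when a read fills the buffer, shrink after a small read. Note this is only a simplified illustration; Netty's actual AdaptiveRecvByteBufAllocator uses a size table and waits for consecutive small reads before shrinking.

```java
public class AdaptiveSizeSketch {
    static int nextSize(int current, int lastReadBytes, int min, int max) {
        if (lastReadBytes >= current) {
            return Math.min(current * 2, max);   // buffer was full: grow
        }
        if (lastReadBytes < current / 2) {
            return Math.max(current / 2, min);   // mostly empty: shrink
        }
        return current;                          // about right: keep it
    }
}
```

With the p50 < 8kb payloads mentioned above, sizing would hover near the 4-8kb floor instead of allocating and trimming 64kb chunks.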
Yea, that sounds like a good idea to me. The problem might be in those tiny requests in some way; I think we have a lot of potential for batching things together to save latency+bytes end-to-end, in search in particular.
#121885 will make a big difference here in real clusters by removing loads and loads of free context + repetitive shard query + empty shard result messages that are definitely all < 8kb and make up a huge percentage of requests overall.
We can look into that in a follow-up? We have to be a tiny bit careful here and maybe test the situation that motivated this thing in the first place, where degenerate cases of slow/flaky networks created series of ~2k sized buffers that then assembled into 100M transport messages with enormous overhead (stuff like CCR was the problem here).
Though to be fair, there's also a real fix for that by deserialising as we receive bytes the Netty way instead of assembling the full message up-front :)
> We can look into that in a follow-up?
Of course, I was thinking aloud.
```diff
-try (ReleasableBytesReference reference = Netty4Utils.toReleasableBytesReference(buffer)) {
-    pipeline.handleBytes(channel, reference);
+try {
+    pipeline.handleBytes(channel, Netty4Utils.toReleasableBytesReference(buffer));
```
I remember talking with @DaveCTurner about reference acquisition. It's easier to trace references when we acquire and release them next to each other, visually. I think the previous version is preferable: it's clear that at the end of the try the reference will be released. When we pass a reference to a different thread or buffer chunks, we explicitly acquire/release the ref, which makes the "move" more explicit. I can't say that I prefer slightly faster but hard-to-trace references :)
Otherwise it's not trivial: every good and exceptional termination needs to release refs explicitly. I tried to trace the HTTP path, and there are places where a ref is released "out of the blue" because some exception is going on.
I know that discussion, the two of us have had it many times over :D
What it comes down to is performance vs. ease of use, pretty much.
I fully agree that making each block of code keep the ref count constant when it's not escaping an object, and incrementing it by one when escaping, makes the code much easier to read.
There is a considerable performance downside to the pattern though.
We are escaping these buffers to another thread (and do so quite often, actually).
Now the difference between incrementing on the transport thread, decrementing on whatever pool we fork to, plus decrementing the original reference on the transport thread afterwards, vs. just "moving" from one thread to another without the fork, is quite relevant. It's still costly to "free on another thread", but considerably cheaper (and more importantly: considerably more scalable) without the contention of decrementing on two separate threads.
For other ref-counted code around search hits and so on, where there's a wider user base, I think we should keep things simple for now as David suggests. But in the core transport logic I think it's quite worthwhile to save some ref-counting to keep latencies and scaling behavior as predictable as possible. (Long story short: I'd rather have an easier time understanding the performance of the approach conceptually than trade that for slightly easier to read code, if that makes any sense? :))
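A small model of the two handoff styles being compared (thread interleaving elided; only the count traffic is modeled, with illustrative names). In the retain/release pattern the shared counter is touched three times, by two different threads in real code, i.e. a contended cache line; in the "move" pattern ownership transfers with no count change and only the receiver's single release touches the counter.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class HandoffDemo {
    static class Buf {
        final AtomicInteger refs = new AtomicInteger(1);
        int countOps = 0;
        void retain() { refs.incrementAndGet(); countOps++; }
        void release() { refs.decrementAndGet(); countOps++; }
    }

    // Explicit style: retain for the receiver, receiver releases its ref,
    // sender releases the original ref afterwards.
    static int retainReleaseHandoff() {
        Buf b = new Buf();
        b.retain();        // transport thread, before forking
        b.release();       // worker thread, when done
        b.release();       // transport thread, dropping its own ref
        return b.countOps; // three contended count updates, refs now 0
    }

    // Move style: the receiver becomes the sole owner; one release total.
    static int moveHandoff() {
        Buf b = new Buf();
        b.release();       // worker thread, as the only owner
        return b.countOps;
    }
}
```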
Do you know how much performance we gain from this, roughly?
That's impossible to quantify, I'm afraid. It matters more the more cores/threads you run, and the effect is more visible in throughput under heavy load than in lightly loaded end-to-end latency benchmarks.
We are eliminating some negative cache effects here, but there are a lot more of them all over the codebase. The best we can do is deal with this stuff one by one, I guess :) Ideally we simplify counting like this PR does, more and more, and move away from ref-counting (towards just a single close and a well-understood lifecycle) wherever possible, for both performance and eventually also conformance with Netty 5's approach to buffers.
That said, I can see ref-counting take up O(1-5%) of transport thread time in some heavy search throughput benchmarks as well as some visible amount of CPU cycles from contention on releasing buffers on search threads as well. I'd expect this change may already show a visible improvement in throughput benchmarks or at least get us very close to one in a follow-up.
Nice, I hope it will show up in the nightlies :)
> I'd rather have an easier time understanding the performance of the approach conceptually than trade that for slightly easier to read code if that makes any sense?
Yeah that's where we disagree :) IMO in the long run given the overall team structure "easier to read" is a close proxy to "correct", and that overrules any performance considerations (we could have both if only we had a borrow-checker 😉).
I'm ok with this in the heart of the transport code tho, especially since the refcounting discipline in the rest of the codebase is in much better shape than it used to be. Indeed I wonder if we can further simplify and flatten all the inbound pipeline stuff.
LGTM
Thanks Mikhail!
💚 Backport successful
We can do with a whole lot less ref-counting, avoiding lots of contention and speeding up the logic in general, by only incrementing ref-counts where ownership is unclear and avoiding count changes on obvious "moves".
This PR is a step in that direction, though obviously there's still loads of remaining spots where similar optimizations could be done.