
Remove lots of redundant ref-counting from transport pipeline #123390


Conversation

@original-brownbear (Member) commented Feb 25, 2025:

We can get by with a whole lot less ref-counting, avoiding lots of contention and speeding up the logic in general, by only incrementing ref-counts where ownership is unclear and avoiding count changes on obvious "moves".
This PR is a step in that direction, though obviously there are still loads of remaining spots where similar optimizations could be done.

@original-brownbear original-brownbear added >non-issue :Distributed Coordination/Network Http and internode communication implementations v9.1.0 labels Feb 25, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Feb 25, 2025
@elasticsearchmachine (Collaborator) commented:

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)


 @ChannelHandler.Sharable
-public class NettyByteBufSizer extends MessageToMessageDecoder<ByteBuf> {
+public class NettyByteBufSizer extends ChannelInboundHandlerAdapter {
@original-brownbear (Member Author):

This just isn't a message-to-message transformation; there's no need to mess with the ref-counts when we're purely filtering.
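For intuition, a minimal sketch (not the actual Elasticsearch class) of why this matters: MessageToMessageDecoder releases the input once decode() returns, so a decoder has to retain() anything it forwards, whereas a plain ChannelInboundHandlerAdapter can pass the same object straight through with the ref-count untouched.

```java
import io.netty.channel.ChannelHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Hypothetical pass-through handler: ownership "moves" downstream with the
// message, so no retain()/release() is needed while merely filtering/resizing.
@ChannelHandler.Sharable
class PassThroughFilter extends ChannelInboundHandlerAdapter {
    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        ctx.fireChannelRead(msg); // forward as-is, ref-count untouched
    }
}
```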

            }
        } else {
            bytesConsumedThisDecode += maxBytesToConsume;
            bytesConsumed += maxBytesToConsume;
            fragmentConsumer.accept(retainedContent);
            if (maxBytesToConsume == remainingToConsume) {
                try (ReleasableBytesReference retained = reference.retainedSlice(0, maxBytesToConsume)) {
@original-brownbear (Member Author):

This would love a follow-up to remove the retaining on the slice: we should have a slice mechanism that doesn't retain, and only retain in the accumulator if we actually need to.

@mhl-b (Contributor) commented Feb 25, 2025:

Where is the original reference released?

@original-brownbear (Member Author):

This all ends up in org.elasticsearch.transport.InboundPipeline#releasePendingBytes eventually. We return the number of bytes we consumed from this method, and then we release that many bytes in that method (in a way that could use some love next, maybe :)).
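As a hedged sketch of that accounting (simplified, illustrative structure rather than the exact InboundPipeline code): the decoder reports how many bytes it consumed, and the pipeline then closes fully consumed pending buffers, keeping only the unread tail of a partially consumed one.

```java
import java.util.ArrayDeque;
import org.elasticsearch.common.bytes.ReleasableBytesReference;

class PendingBytes {
    private final ArrayDeque<ReleasableBytesReference> pending = new ArrayDeque<>();

    // Release `bytesConsumed` bytes from the front of the pending queue.
    void releasePendingBytes(int bytesConsumed) {
        int remaining = bytesConsumed;
        while (remaining > 0) {
            ReleasableBytesReference first = pending.pollFirst();
            if (first.length() <= remaining) {
                remaining -= first.length();
                first.close();                    // fully consumed: release it
            } else {
                // partially consumed: keep only the unread tail, then drop our ref
                pending.addFirst(first.retainedSlice(remaining, first.length() - remaining));
                first.close();
                remaining = 0;
            }
        }
    }
}
```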

@@ -56,81 +55,74 @@ public void close() {

    public void handleBytes(TcpChannel channel, ReleasableBytesReference reference) throws IOException {
        if (uncaughtException != null) {
            reference.close();
@original-brownbear (Member Author):

If we either close the reference outright or put the buffer in the pending list before doing anything else, we're always safe, I think.

    private void doHandleBytes(TcpChannel channel) throws IOException {
        do {
            CheckedConsumer<Object, IOException> decodeConsumer = f -> forwardFragment(channel, f);
            int bytesDecoded = decoder.decode(pending.peekFirst(), decodeConsumer);
@original-brownbear (Member Author):

Always try a single buffer first. Otherwise, if a message was stuck/flushed behind the second half of a larger message in a pending chunk, all the slicing etc. would needlessly reference the full composite buffer, which is uncool for contention reasons, especially when we take a retained slice out of that second message.
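Sketched against the fields visible in the diff (pending, decoder, forwardFragment), the loop described above looks roughly like this; getPendingBytes() is a placeholder name for whatever assembles the composite view, not an identifier from this PR.

```java
private void doHandleBytes(TcpChannel channel) throws IOException {
    do {
        CheckedConsumer<Object, IOException> decodeConsumer = f -> forwardFragment(channel, f);
        // Fast path: feed only the head buffer so that any slices the decoder takes
        // reference one small chunk rather than a composite over all pending bytes.
        int bytesDecoded = decoder.decode(pending.peekFirst(), decodeConsumer);
        if (bytesDecoded == 0 && pending.size() > 1) {
            // Slow path: the next message straddles buffers, fall back to a composite view.
            bytesDecoded = decoder.decode(getPendingBytes(), decodeConsumer);
        }
        if (bytesDecoded == 0) {
            break; // need more bytes from the network
        }
        releasePendingBytes(bytesDecoded);
    } while (pending.isEmpty() == false);
}
```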

    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        if (msg instanceof ByteBuf buf && buf.capacity() >= 1024) {
            int readableBytes = buf.readableBytes();
            buf = buf.discardReadBytes().capacity(readableBytes);
@original-brownbear (Member Author):

This should be moved down, or the message delineation moved up into Netty; it's really only necessary to resize the buffer if we're planning to retain it beyond the current read.
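A rough sketch of that idea, with a hypothetical helper name: pay for the discard/trim copy only at the point where the buffer is actually about to be retained past the current read.

```java
import io.netty.buffer.ByteBuf;

final class BufTrim {
    // Hypothetical helper: shrink the backing allocation only when the buffer is
    // about to outlive this read; pure pass-through reads skip the copy entirely.
    static ByteBuf trimBeforeRetaining(ByteBuf buf) {
        if (buf.capacity() >= 1024) {
            int readableBytes = buf.readableBytes();
            // discardReadBytes() plus capacity(int) may copy, so only pay for it here
            buf = buf.discardReadBytes().capacity(readableBytes);
        }
        return buf;
    }
}
```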

@mhl-b (Contributor) commented Feb 25, 2025:

I think we should go back to the AdaptiveRecvByteBufAllocator with min=4kb, init=8kb, max=64kb and remove the resizer. Allocating 64kb chunks and then trimming them, likely copying, looks wasteful. We should not see GC pressure with HTTP content streams for bulk; it's a different story in transport, I guess.
Also, I'm looking at metrics and see request payload sizes of p50 < 8kb and p75 < 64kb. A substantial share of requests will never need a 64kb chunk.
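For reference, this is roughly what that suggestion would look like when configuring the server bootstrap (a sketch, not code from this PR):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.AdaptiveRecvByteBufAllocator;
import io.netty.channel.ChannelOption;

class RecvAllocatorConfig {
    static void configure(ServerBootstrap bootstrap) {
        // Let Netty adapt the receive buffer between 4 KiB and 64 KiB (starting at
        // 8 KiB) instead of always allocating 64 KiB chunks and trimming afterwards.
        bootstrap.childOption(
            ChannelOption.RCVBUF_ALLOCATOR,
            new AdaptiveRecvByteBufAllocator(4096, 8192, 65536) // min, initial, max
        );
    }
}
```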

@original-brownbear (Member Author):

Yea, that sounds like a good idea to me. The problem might be in those tiny requests in some way; I think we have a lot of potential for batching things together to save latency+bytes end-to-end, in search in particular.
#121885 will make a big difference here in real clusters by removing loads and loads of free-context, repetitive shard-query, and empty shard-result messages that are definitely all < 8kb and make up a huge percentage of requests overall.

@original-brownbear (Member Author):

We can look into that in a follow-up? We have to be a tiny bit careful here to test the situation that motivated this thing in the first place, where degenerate cases of slow/flaky networks created series of ~2k-sized buffers that then assembled into 100M transport messages with enormous overhead (stuff like CCR was the problem here).
Though to be fair, there's also a real fix for that: deserialising as we receive bytes, the Netty way, instead of assembling the full message up-front :)
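For context, the "Netty way" here is the standard ByteToMessageDecoder pattern of consuming whatever bytes are currently buffered and leaving the rest for the next read. A hypothetical length-prefixed sketch of that building block (not the actual CCR fix):

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;
import java.util.List;

// Emits complete frames as they become available instead of accumulating the
// whole logical payload into one contiguous buffer up-front.
class StreamingFrameDecoder extends ByteToMessageDecoder {
    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
        while (in.readableBytes() >= Integer.BYTES) {
            in.markReaderIndex();
            int frameLength = in.readInt();
            if (in.readableBytes() < frameLength) {
                in.resetReaderIndex(); // incomplete frame: wait for more bytes
                return;
            }
            out.add(in.readRetainedSlice(frameLength)); // hand off one frame at a time
        }
    }
}
```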

Contributor:

> We can look into that in a follow-up?

Of course, I was thinking aloud.

-        try (ReleasableBytesReference reference = Netty4Utils.toReleasableBytesReference(buffer)) {
-            pipeline.handleBytes(channel, reference);
+        try {
+            pipeline.handleBytes(channel, Netty4Utils.toReleasableBytesReference(buffer));
@mhl-b (Contributor) commented Feb 25, 2025:

I remember talking with @DaveCTurner about reference acquisition. It's easier to trace references when we acquire and release them next to each other, visually. I think the previous version is preferable: it's clear that at the end of the try block the reference will be released. When we pass a reference to a different thread or buffer chunks, we explicitly acquire/release the ref, which makes the "move" explicit. I can't say I prefer slightly faster but hard-to-trace references :)

Otherwise it's not trivial: every normal and exceptional termination needs to release refs explicitly. I tried to follow the HTTP path, and there are places where a ref is released "out of the blue" because some exception is going on.

@original-brownbear (Member Author):

I know that discussion, the two of us have had it many times over :D
What it comes down to is pretty much performance vs. ease of use.
I fully agree that having each block of code keep the ref count constant when it isn't escaping an object, and increment it by one when it is, makes the code much easier to read.
There is a considerable performance downside to that pattern, though.
We are escaping these buffers to another thread (and do so quite often, actually).
The difference between incrementing on the transport thread, decrementing on whatever pool we fork to, and then also decrementing the original reference on the transport thread afterwards, versus just "moving" the buffer from one thread to the other, is quite relevant. It's still costly to "free on another thread", but it's considerably cheaper (and more importantly: considerably more scalable) without the contention of decrementing on two separate threads.

For other ref-counted code, around search hits and the like where there's a wider user base, I think we should keep things simple for now as David suggests, but in the core transport logic I think it's quite worthwhile to save some ref-counting to keep latencies and scaling behaviour as predictable as possible. (Long story short: I'd rather have an easier time understanding the performance of the approach conceptually than trade that for slightly easier-to-read code, if that makes any sense? :))
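To make the trade-off concrete, a toy sketch of the two disciplines (RefCounted below is a minimal stand-in, not Elasticsearch's actual interface):

```java
import java.util.concurrent.ExecutorService;

final class HandOffPatterns {
    interface RefCounted {  // minimal stand-in for a ref-counted buffer
        void incRef();
        boolean decRef();
    }

    // (a) retain/release discipline: the caller keeps its own ref and the callee
    // gets another; counts end up being touched on two different threads.
    static void retainAndFork(RefCounted buffer, ExecutorService executor, Runnable work) {
        buffer.incRef();                 // +1 on the transport thread
        executor.execute(() -> {
            try {
                work.run();
            } finally {
                buffer.decRef();         // -1 on the worker thread
            }
        });
        buffer.decRef();                 // -1 again on the transport thread
    }

    // (b) "move" discipline: ownership travels with the task, so the count is
    // only touched once, by whoever finally releases the buffer.
    static void moveAndFork(RefCounted buffer, ExecutorService executor, Runnable work) {
        executor.execute(() -> {
            try {
                work.run();
            } finally {
                buffer.decRef();         // single release on the worker thread
            }
        });
    }
}
```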

Contributor:

Do you know how much performance we gain from this, roughly?

@original-brownbear (Member Author):

That's impossible to quantify, I'm afraid. It matters more the more cores/threads you run, and the effect is more visible in throughput under heavy load than it is in lightly loaded end-to-end latency benchmarks.
We are eliminating some negative cache effects here, but there are a lot more of them all over the codebase. The best we can do is deal with this stuff one by one, I guess :) Ideally, we simplify the counting more and more, like this PR does, and move away from ref-counting (towards just a single close and a well-understood lifecycle) wherever possible, both for performance and eventually also for conformance with Netty 5's approach to buffers.

@original-brownbear (Member Author):

That said, I can see ref-counting taking up O(1-5%) of transport thread time in some heavy search throughput benchmarks, as well as a visible amount of CPU cycles lost to contention when releasing buffers on search threads. I'd expect this change may already show a visible improvement in throughput benchmarks, or at least get us very close to one with a follow-up.

@mhl-b (Contributor) commented Feb 25, 2025:

Nice, I hope it will show up in the nightlies :)

Contributor:

> I'd rather have an easier time understanding the performance of the approach conceptually than trade that for slightly easier-to-read code, if that makes any sense?

Yeah, that's where we disagree :) IMO, in the long run and given the overall team structure, "easier to read" is a close proxy to "correct", and that overrules any performance considerations (we could have both if only we had a borrow-checker 😉).

I'm ok with this in the heart of the transport code tho, especially since the refcounting discipline in the rest of the codebase is in much better shape than it used to be. Indeed I wonder if we can further simplify and flatten all the inbound pipeline stuff.

@mhl-b (Contributor) left a comment:

LGTM

@original-brownbear original-brownbear added auto-backport Automatically create backport pull requests when merged v8.19.0 labels Feb 26, 2025
@original-brownbear (Member Author) commented:

Thanks Mikhail!

@original-brownbear original-brownbear merged commit da51c8c into elastic:main Feb 26, 2025
18 checks passed
@original-brownbear original-brownbear deleted the avoid-redundant-ref-adjustment branch February 26, 2025 22:11
@elasticsearchmachine (Collaborator) commented:

💚 Backport successful

Branch: 8.x (backport successful)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 26, 2025
…c#123390)

Labels

- auto-backport (Automatically create backport pull requests when merged)
- :Distributed Coordination/Network (Http and internode communication implementations)
- >non-issue
- Team:Distributed Coordination (Meta label for Distributed Coordination team)
- v8.19.0
- v9.1.0
4 participants