Stop passing search query local node fanout through transport layer #122669

Open · wants to merge 5 commits into main
Conversation

original-brownbear (Member):

We don't need to go through all the steps in the transport layer for executing query work on the local node. Doing so adds considerable overhead. E.g. with security enabled, the amount of work on transport threads goes up by about 4x for local only searches in the geonames track. The simpler a query is to rewrite and can-match and the more shards are queried on the local node, the more pronounced the overhead is in practice.

I think the win is entirely evident and it's a relatively trivial change from the search perspective (note though that some of the sub-tasks aren't a thing anymore, which IMO is both a performance win and entirely cosmetic from the user's perspective), but we need a review/OK from the security team here.

@original-brownbear added the >non-issue and :Search Foundations/Search labels on Feb 15, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine added the Team:Search Foundations and v9.1.0 labels on Feb 15, 2025
var searchTransport = getSearchTransport();
var task = getTask();
if (searchTransport.transportService().getLocalNodeConnection() == connection) {
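// executing against the local node: call the SearchService directly and skip the
// transport-layer dispatch (action lookup, listener wrapping, child-task registration,
// and the request filter chain, including security interception)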
searchService.executeQueryPhase(request, task, listener);
Member:

out of curiosity: if we do this, why do it only for the query phase? Also, couldn't this conditional be added to the sendExecuteQuery method instead? What kind of overhead does this save? I can imagine that this is a pretty common pattern, or is search the edge case?

original-brownbear (Member, Author):

Hmm, good question. I guess I only noticed this in benchmarks for search, and for the query phase specifically. For fetch it's not super visible since you mostly don't hit that many shards, and for bulk indexing you still have per-shard bulks, so the cost doesn't show up there.
I first noticed this with batched execution, where the overhead becomes super visible, but it's equally visible without it on large data nodes that already do coordination work (or if queries are heavy, like a large terms query or some geo stuff).
The overhead saved is: all the lookups in the transport layer, lots of listener wrapping, child-task registration and, most importantly, security.
But :) that's why I need a review from security here, I think. Functionally I think security still works the same way, if not more efficiently. All tests pass because we auth the top-level search request. DLS/FLS are applied as well, but somehow those cache assertions needed adjustment; seemingly we use the cache more now and I can't explain why.
The security overhead is considerable here; it seems to be well in excess of the can_match cost for most Rally runs :O

original-brownbear (Member, Author):

It's this (couldn't zoom out further :P):

[profiler screenshot: out2]

vs this:

[profiler screenshot: out]

and on a transport thread.

shard = getShard(request);
} catch (RuntimeException e) {
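// e.g. concurrent shard deletion can throw a shard-not-found exception here; fail the
// listener instead of letting it escape, since the transport layer no longer catches it on this path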
listener.onFailure(e);
return;
Member:

what is this additional catch fixing? Is it a bug that you observed?

original-brownbear (Member, Author):

Yup, concurrent shard deletion can cause a shard-not-found exception :)

Member:

should that be a separate change then? It is not a problem introduced by your change, is it?

original-brownbear (Member, Author):

It is introduced here: without this change, the transport layer catches the exception and passes it on to the listener.

@@ -342,7 +342,7 @@ public void testRequestCacheWithTemplateRoleQuery() {
// Since the DLS for the alias uses a stored script, this should cause the request cached to be disabled
assertSearchResponse(client1.prepareSearch(DLS_TEMPLATE_ROLE_QUERY_ALIAS).setRequestCache(true), Set.of("1"), Set.of("username"));
// No cache should be used
-assertCacheState(DLS_TEMPLATE_ROLE_QUERY_INDEX, 2, 2);
+assertCacheState(DLS_TEMPLATE_ROLE_QUERY_INDEX, 3, 2);
Member:

could you expand on these assertions needing to be adapted?

original-brownbear (Member, Author):

Not quite, to be honest. It seems going through the transport layer updates the thread-local context variable somehow, and then authz works differently. But I think the security folks need to look at this.
I used the same approach in the batched execution logic (fortunately it's easy to switch back to going through the transport layer if need be). I think getting a review from security for this is a soft dependency for using this path for batched execution; I'm just not sure how strong the tests are end-to-end :)

javanna (Member) commented Feb 17, 2025:

This seems to me like something that may be interesting, but not only for the query phase, and may deserve a broader discussion with the Distributed team on the transport handling, as well as with Security around possible security implications.

original-brownbear (Member, Author):

This change is not just interesting IMO; it's quite core to making full use of #121885, because it enables reuse of data structures across shards in the query phase.
Also, I effectively made this change for batched execution per data node in that PR without tripping any security assertions in how I implemented the transport action for batched execution. I don't think this is a distributed/network/transport problem in any way, but it is a security question either way it plays out. If there's a case where this is not safe, security is lacking test coverage; if it is safe, it would still be nice to understand why the cache logic works out differently and to have confirmation.
Hope that makes sense? :)

@original-brownbear original-brownbear requested a review from a team February 18, 2025 08:22
javanna (Member) commented Feb 18, 2025:

I don't quite see how this is something that only batched query-phase execution benefits from. If going through transport for local execution adds overhead, that is a problem everywhere in Elasticsearch, and it should be addressed at a different level. Could someone on @elastic/es-distributed-coordination chime in on this? Thoughts? Thanks!

original-brownbear (Member, Author):

I think it's generally an overhead-vs-complexity issue :) You always (even without security) have to go through looking up the transport action, wrapping the listener and whatnot. But on the other hand :) you get to use the same code locally and remotely, and all your top-level transport action needs is a reference to the transport service (which it has anyway), instead of essentially having all your dependencies on both the child and the top-level action.
There's always going to be overhead in going through more layers of code?

Btw, come to think of it ... I wonder if this is even that big of a deal; I just realised that we effectively have the same logic of just looping over the indices in the batched can_match execution. And funnily enough, both the can_match per-node action and the query phase transport action show up together in org.elasticsearch.xpack.security.authz.PreAuthorizationUtils#CHILD_ACTIONS_PRE_AUTHORIZED_BY_PARENT?
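For context, a toy sketch of what such a parent-to-child pre-authorization list amounts to (hypothetical shape and action names, not the actual PreAuthorizationUtils code): child actions listed under a parent action skip full per-request authorization because the parent search request was already authorized.

```java
import java.util.Map;
import java.util.Set;

// Toy sketch only: child actions listed here skip full per-request authorization
// because the parent search request was already authorized.
final class PreAuthorizedChildActionsSketch {
    static final Map<String, Set<String>> CHILD_ACTIONS_PRE_AUTHORIZED_BY_PARENT = Map.of(
        "indices:data/read/search",
        Set.of(
            "indices:data/read/search[can_match][n]", // batched can_match per data node
            "indices:data/read/search[phase/query]"   // per-shard query phase
        ));

    static boolean isPreAuthorized(String parentAction, String childAction) {
        return CHILD_ACTIONS_PRE_AUTHORIZED_BY_PARENT.getOrDefault(parentAction, Set.of()).contains(childAction);
    }
}
```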

original-brownbear (Member, Author):

Just to provide a little motivation for moving this forward: benchmarking the achieved IPC when searching larger shard counts, I see something like this:

[image: CPU profile showing IPC and i-cache miss rate for the search threads]

The deep pipeline of needlessly going through the transport layer is hurting us a lot. We're at almost a 20% i-cache miss rate and a resulting ~0.7 IPC for the search threads, and we're effectively memory-bound even on instructions. We're missing out on more than 75% of what the CPU could do per cycle here, and especially once batched execution lands, this will be the second-largest sequential bottleneck remaining.

DaveCTurner (Contributor):

> If going through transport for local execution adds overhead, that is a problem everywhere in Elasticsearch, and it should be addressed at a different level. Could someone on @elastic/es-distributed-coordination chime in on this? Thoughts? Thanks!

Invoking something via the transport service is always going to be more expensive than just calling the code directly, just by the nature of the API that org.elasticsearch.transport.TransportService#sendRequest exposes: the caller has to convert the listener into a TransportResponseHandler (and sometimes the request parameters into a single TransportRequest), and we have to look up the action to invoke from the String and then call the megamorphic TransportActionHandler#execute method rather than just calling an effectively-final method to do the work. It also has different semantics by design: e.g. it traverses the request filter chain (including security auth checks) again, creates and registers a new Task, adds more exception handling, and likely lots of other things.

It's not a lot of overhead and in most places we're not on a hot path so it's a better engineering tradeoff to write the simpler code, but on performance-critical paths like search the balance may well tip in favour of skipping all this jazz wherever possible. I wouldn't be surprised to see something similar on the indexing path too, although the indexing execution model is already sensitive to the overheads of invoking transport actions (whether local or remote): the batched execution we have on the indexing path amortizes these overheads enough that the benefits are probably way smaller. If we made search less chatty on the wire then maybe this wouldn't be so important?
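To make the shape of that overhead concrete, here's a self-contained toy sketch (made-up names, not Elasticsearch code) contrasting string-keyed dispatch through a handler registry plus listener wrapping with a direct method call:

```java
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Toy model: the generic path looks the handler up by action name and goes through a
// megamorphic functional interface; the local shortcut is a plain direct call.
final class LocalDispatchSketch {
    record QueryRequest(int shardId) {}
    record QueryResult(int shardId, long hits) {}

    static final String QUERY_ACTION = "indices:data/read/search[phase/query]";

    // stands in for the registered transport request handlers, keyed by action name
    static final Map<String, BiConsumer<QueryRequest, Consumer<QueryResult>>> HANDLERS =
        Map.of(QUERY_ACTION, LocalDispatchSketch::executeQueryPhase);

    static void executeQueryPhase(QueryRequest request, Consumer<QueryResult> listener) {
        listener.accept(new QueryResult(request.shardId(), 42L));
    }

    // generic path: lookup by action name plus listener wrapping (and, in real life,
    // the filter chain, task registration and security checks)
    static void viaTransport(QueryRequest request, Consumer<QueryResult> listener) {
        Consumer<QueryResult> wrapped = listener::accept; // stands in for a TransportResponseHandler
        HANDLERS.get(QUERY_ACTION).accept(request, wrapped);
    }

    // local shortcut: call the method that does the work directly
    static void direct(QueryRequest request, Consumer<QueryResult> listener) {
        executeQueryPhase(request, listener);
    }

    public static void main(String[] args) {
        viaTransport(new QueryRequest(0), r -> System.out.println("via transport: " + r));
        direct(new QueryRequest(0), r -> System.out.println("direct: " + r));
    }
}
```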

original-brownbear (Member, Author):

> If we made search less chatty on the wire then maybe this wouldn't be so important?

Yes :) but that's sort of why I opened this PR :D The changes in #121885 essentially work by running the hotter search-over-shards loop the same way it's done in this PR. This isn't so much a network/distributed question as a security one; you could rephrase it as: "Do we need the explicit shard-level search request for security, or is all the authentication/authorization done safely for the top-level SearchRequest already?". Looking at the code and the way can-match is implemented, plus all tests passing, I think the answer is that we don't, but that's what I'd like to see confirmation on.

From the search side, it's still nice to have this for the unbatched path as well, for code reasons. It allowed me (without adding much code over today) to have a combined data+coordinating node collect its per-shard results into a global results structure (of preallocated size == total_shards_in_search), while the data nodes collect into the same kind of structure with size == total_shards_for_this_search_on_this_node :P It's really just a trivial implementation detail.

And just generally speaking, we need to get away from building these shard-request objects and having so much isolation between searches across shards. With #121885, the cost of a query moves towards all the ceremony of parsing a query into a Lucene structure and such, in many if not most non-aggregation use cases. We can only cut down on that ceremony (and the related i-cache misses) by moving to a tighter loop that reuses state across shards. This is just step 1 on that road.
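A rough sketch of the global results structure idea mentioned above (toy types, not the PR's actual code): the coordinator preallocates one slot per shard in the whole search, while a plain data node only needs slots for the shards it hosts for this search.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Toy sketch: fixed-size, index-addressed collection of per-shard query results.
final class ShardResultsSketch {
    record ShardQueryResult(int shardIndex, long totalHits) {}

    private final AtomicReferenceArray<ShardQueryResult> results;

    // combined data+coordinating node: size == total_shards_in_search
    // plain data node:                 size == total_shards_for_this_search_on_this_node
    ShardResultsSketch(int shardCount) {
        this.results = new AtomicReferenceArray<>(shardCount);
    }

    void consumeResult(ShardQueryResult result) {
        results.set(result.shardIndex(), result); // one slot per shard, no per-result bookkeeping
    }

    ShardQueryResult get(int shardIndex) {
        return results.get(shardIndex);
    }
}
```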

tvernum (Contributor) commented Mar 10, 2025:

I think the security impact is real, and we'll need to work through a solution.

The changes to the tests are needed because the lack of interception is causing ShardSearchRequestInterceptor to no longer execute, which means requests are being cached that should not be.

It's plausible that the lack of interception has other security impacts too, though if that is the case then I would hope that there are tests in place that would have picked it up.
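Conceptually, the interception that stops happening looks something like the following toy sketch (hypothetical names, not the actual ShardSearchRequestInterceptor): the shard-level request is inspected on its way in and its request cache is switched off when document-level security with a templated role query applies.

```java
// Toy sketch only: a shard-request hook that disables the request cache so that
// DLS-filtered results never end up in the shared cache.
final class DlsRequestCacheHookSketch {
    static final class ShardRequestStub {
        boolean requestCache = true;
        final boolean templatedDlsApplies;

        ShardRequestStub(boolean templatedDlsApplies) {
            this.templatedDlsApplies = templatedDlsApplies;
        }
    }

    static void intercept(ShardRequestStub request) {
        if (request.templatedDlsApplies) {
            // bypassing the transport layer skips this hook, which is why the
            // test's expected cache-hit counts changed
            request.requestCache = false;
        }
    }
}
```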

Labels: >non-issue, :Search Foundations/Search, Team:Search Foundations, v9.1.0
Projects: None yet
5 participants