Stop passing search query local node fanout through transport layer #122669

Open · wants to merge 5 commits into main
Conversation

original-brownbear (Member):

We don't need to go through all the steps in the transport layer for executing query work on the local node. Doing so adds considerable overhead. E.g. with security enabled, the amount of work on transport threads goes up by about 4x for local only searches in the geonames track. The simpler a query is to rewrite and can-match and the more shards are queried on the local node, the more pronounced the overhead is in practice.

I think the win is entirely evident and it's a relatively trivial change from the search perspective (note though that some of the sub-tasks aren't a thing anymore, which IMO is both a performance win and entirely cosmetic from the user's perspective), but we need a review/OK from the security team here.

@original-brownbear added the >non-issue and :Search Foundations/Search labels on Feb 15, 2025
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine added the Team:Search Foundations and v9.1.0 labels on Feb 15, 2025
var searchTransport = getSearchTransport();
var task = getTask();
if (searchTransport.transportService().getLocalNodeConnection() == connection) {
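// executing against the local node: call the SearchService directly and skip the
// transport-layer dispatch (action lookup, listener wrapping, child-task registration,
// and the request filter chain, including security interception)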
searchService.executeQueryPhase(request, task, listener);
Member:

out of curiosity: if we do this, why do it only for the query phase? Also, couldn't this conditional be added to the sendExecuteQuery method instead? What kind of overhead does this save? I can imagine that this is a pretty common pattern, or is search the edge case?

original-brownbear (Member, Author):

Hmm, good question. I guess I only noticed this in benchmarks for search, and for the query phase specifically. For fetch it's not super visible since you mostly don't hit that many shards, and for bulk indexing you still have per-shard bulks, so the cost doesn't show up there.
I first noticed this with batched execution, where the overhead becomes super visible, but it's equally visible without it on large data nodes that already do coordination work (or if queries are heavy, like a large terms query or some geo stuff).
The overhead saved is: all the lookups in the transport layer, lots of listener wrapping, child-task registration and, most importantly, security.
But :) that's why I need a review from security here, I think. Functionally I think security still works the same way, if not more efficiently. All tests pass because we auth the top-level search request. DLS/FLS are applied as well, but somehow those cache assertions needed adjustment; seemingly we use the cache more now and I can't explain why.
The security overhead is considerable here; it seems to be well in excess of the can_match cost for most Rally runs :O

original-brownbear (Member, Author):

It's this (couldn't zoom out further :P):

[profiler screenshot: out2]

vs this:

[profiler screenshot: out]

and on a transport thread.

shard = getShard(request);
} catch (RuntimeException e) {
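// e.g. concurrent shard deletion can throw a shard-not-found exception here; fail the
// listener instead of letting it escape, since the transport layer no longer catches it on this path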
listener.onFailure(e);
return;
Member:

what is this additional catch fixing? Is it a bug that you observed?

original-brownbear (Member, Author):

Yup, concurrent shard deletion can cause a shard-not-found exception :)

Member:

should that be a separate change then? It is not a problem introduced by your change, is it?

original-brownbear (Member, Author):

It is introduced here: without this change, the transport layer catches the exception and passes it on to the listener.

@@ -342,7 +342,7 @@ public void testRequestCacheWithTemplateRoleQuery() {
// Since the DLS for the alias uses a stored script, this should cause the request cached to be disabled
assertSearchResponse(client1.prepareSearch(DLS_TEMPLATE_ROLE_QUERY_ALIAS).setRequestCache(true), Set.of("1"), Set.of("username"));
// No cache should be used
-assertCacheState(DLS_TEMPLATE_ROLE_QUERY_INDEX, 2, 2);
+assertCacheState(DLS_TEMPLATE_ROLE_QUERY_INDEX, 3, 2);
Member:

could you expand on these assertions needing to be adapted?

original-brownbear (Member, Author):

Not quite, to be honest. It seems going through the transport layer updates the thread-local context variable somehow, and then authz works differently. But I think the security folks need to look at this.
I used the same approach in the batched execution logic (fortunately it's easy to switch back to going through the transport layer if need be). I think getting a review from security for this is a soft dependency for using this path for batched execution; I'm just not sure how strong the tests are end-to-end :)

javanna (Member) commented Feb 17, 2025:

This seems to me like something that may be interesting, but not only for the query phase, and may deserve a broader discussion with the Distributed team on the transport handling, as well as with Security around possible security implications.

original-brownbear (Member, Author):

This change is not just interesting IMO; it's quite core to making full use of #121885, because it enables reuse of data structures across shards in the query phase.
Also, I effectively made this change for batched execution per data node in that PR without tripping any security assertions in how I implemented the transport action for batched execution. I don't think this is a distributed/network/transport problem in any way, but it is a security question either way it plays out. If there's a case where this is not safe, security is lacking test coverage; if it is safe, it would still be nice to understand why the cache logic works out differently and to have confirmation.
Hope that makes sense? :)

@original-brownbear original-brownbear requested a review from a team February 18, 2025 08:22
javanna (Member) commented Feb 18, 2025:

I don't quite see how this is something that only batched query-phase execution benefits from. If going through transport for local execution adds overhead, that is a problem everywhere in Elasticsearch, and it should be addressed at a different level. Could someone on @elastic/es-distributed-coordination chime in on this? Thoughts? Thanks!

original-brownbear (Member, Author):

I think it's generally an overhead-vs-complexity issue :) You always (even without security) have to go through looking up the transport action, wrapping the listener and whatnot. But on the other hand :) you get to use the same code locally and remotely, and all your top-level transport action needs is a reference to the transport service (which it has anyway), instead of essentially having all your dependencies on both the child and the top-level action.
There's always going to be overhead in going through more layers of code?

Btw, come to think of it ... I wonder if this is even that big of a deal; I just realised that we effectively have the same logic of just looping over the indices in the batched can_match execution. And funnily enough, both the can_match per-node action and the query phase transport action show up together in org.elasticsearch.xpack.security.authz.PreAuthorizationUtils#CHILD_ACTIONS_PRE_AUTHORIZED_BY_PARENT?
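For context, a toy sketch of what such a parent-to-child pre-authorization list amounts to (hypothetical shape and action names, not the actual PreAuthorizationUtils code): child actions listed under a parent action skip full per-request authorization because the parent search request was already authorized.

```java
import java.util.Map;
import java.util.Set;

// Toy sketch only: child actions listed here skip full per-request authorization
// because the parent search request was already authorized.
final class PreAuthorizedChildActionsSketch {
    static final Map<String, Set<String>> CHILD_ACTIONS_PRE_AUTHORIZED_BY_PARENT = Map.of(
        "indices:data/read/search",
        Set.of(
            "indices:data/read/search[can_match][n]", // batched can_match per data node
            "indices:data/read/search[phase/query]"   // per-shard query phase
        ));

    static boolean isPreAuthorized(String parentAction, String childAction) {
        return CHILD_ACTIONS_PRE_AUTHORIZED_BY_PARENT.getOrDefault(parentAction, Set.of()).contains(childAction);
    }
}
```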

original-brownbear (Member, Author):

Just to provide a little motivation for moving this forward: benchmarking the achieved IPC when searching larger shard counts, I see something like this:

[image: CPU profile showing IPC and i-cache miss rate for the search threads]

The deep pipeline of needlessly going through the transport layer is hurting us a lot. We're at almost a 20% i-cache miss rate and a resulting ~0.7 IPC for the search threads, and we're effectively memory-bound even on instructions. We're missing out on more than 75% of what the CPU could do per cycle here, and especially once batched execution lands, this will be the second-largest sequential bottleneck remaining.

DaveCTurner (Contributor):

> If going through transport for local execution adds overhead, that is a problem everywhere in Elasticsearch, and it should be addressed at a different level. Could someone on @elastic/es-distributed-coordination chime in on this? Thoughts? Thanks!

Invoking something via the transport service is always going to be more expensive than just calling the code directly, just by the nature of the API that org.elasticsearch.transport.TransportService#sendRequest exposes: the caller has to convert the listener into a TransportResponseHandler (and sometimes the request parameters into a single TransportRequest), and we have to look up the action to invoke from the String and then call the megamorphic TransportActionHandler#execute method rather than just calling an effectively-final method to do the work. It also has different semantics by design: e.g. it traverses the request filter chain (including security auth checks) again, creates and registers a new Task, adds more exception handling, and likely lots of other things.

It's not a lot of overhead and in most places we're not on a hot path so it's a better engineering tradeoff to write the simpler code, but on performance-critical paths like search the balance may well tip in favour of skipping all this jazz wherever possible. I wouldn't be surprised to see something similar on the indexing path too, although the indexing execution model is already sensitive to the overheads of invoking transport actions (whether local or remote): the batched execution we have on the indexing path amortizes these overheads enough that the benefits are probably way smaller. If we made search less chatty on the wire then maybe this wouldn't be so important?
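To make the shape of that overhead concrete, here's a self-contained toy sketch (made-up names, not Elasticsearch code) contrasting string-keyed dispatch through a handler registry plus listener wrapping with a direct method call:

```java
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Toy model: the generic path looks the handler up by action name and goes through a
// megamorphic functional interface; the local shortcut is a plain direct call.
final class LocalDispatchSketch {
    record QueryRequest(int shardId) {}
    record QueryResult(int shardId, long hits) {}

    static final String QUERY_ACTION = "indices:data/read/search[phase/query]";

    // stands in for the registered transport request handlers, keyed by action name
    static final Map<String, BiConsumer<QueryRequest, Consumer<QueryResult>>> HANDLERS =
        Map.of(QUERY_ACTION, LocalDispatchSketch::executeQueryPhase);

    static void executeQueryPhase(QueryRequest request, Consumer<QueryResult> listener) {
        listener.accept(new QueryResult(request.shardId(), 42L));
    }

    // generic path: lookup by action name plus listener wrapping (and, in real life,
    // the filter chain, task registration and security checks)
    static void viaTransport(QueryRequest request, Consumer<QueryResult> listener) {
        Consumer<QueryResult> wrapped = listener::accept; // stands in for a TransportResponseHandler
        HANDLERS.get(QUERY_ACTION).accept(request, wrapped);
    }

    // local shortcut: call the method that does the work directly
    static void direct(QueryRequest request, Consumer<QueryResult> listener) {
        executeQueryPhase(request, listener);
    }

    public static void main(String[] args) {
        viaTransport(new QueryRequest(0), r -> System.out.println("via transport: " + r));
        direct(new QueryRequest(0), r -> System.out.println("direct: " + r));
    }
}
```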

original-brownbear (Member, Author):

> If we made search less chatty on the wire then maybe this wouldn't be so important?

Yes :) but that's sort of why I opened this PR :D The changes in #121885 essentially work by running the hotter search-over-shards loop the same way it's done in this PR. This isn't so much a network/distributed question as a security one; you could rephrase it as: "Do we need the explicit shard-level search request for security, or is all the authentication/authorization done safely for the top-level SearchRequest already?". Looking at the code and the way can-match is implemented, plus all tests passing, I think the answer is that we don't, but that's what I'd like to see confirmation on.

From the search side, it's still nice to have this for the unbatched path as well, for code reasons. It allowed me (without adding much code over today) to have a combined data+coordinating node collect its per-shard results into a global results structure (of preallocated size == total_shards_in_search), while the data nodes collect into the same kind of structure with size == total_shards_for_this_search_on_this_node :P It's really just a trivial implementation detail.

And just generally speaking, we need to get away from building these shard-request objects and having so much isolation between searches across shards. With #121885, the cost of a query moves towards all the ceremony of parsing a query into a Lucene structure and such, in many if not most non-aggregation use cases. We can only cut down on that ceremony (and the related i-cache misses) by moving to a tighter loop that reuses state across shards. This is just step 1 on that road.
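A rough sketch of the global results structure idea mentioned above (toy types, not the PR's actual code): the coordinator preallocates one slot per shard in the whole search, while a plain data node only needs slots for the shards it hosts for this search.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Toy sketch: fixed-size, index-addressed collection of per-shard query results.
final class ShardResultsSketch {
    record ShardQueryResult(int shardIndex, long totalHits) {}

    private final AtomicReferenceArray<ShardQueryResult> results;

    // combined data+coordinating node: size == total_shards_in_search
    // plain data node:                 size == total_shards_for_this_search_on_this_node
    ShardResultsSketch(int shardCount) {
        this.results = new AtomicReferenceArray<>(shardCount);
    }

    void consumeResult(ShardQueryResult result) {
        results.set(result.shardIndex(), result); // one slot per shard, no per-result bookkeeping
    }

    ShardQueryResult get(int shardIndex) {
        return results.get(shardIndex);
    }
}
```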

tvernum (Contributor) commented Mar 10, 2025:

I think the security impact is real, and we'll need to work through a solution.

The changes to the tests are needed because the lack of interception is causing ShardSearchRequestInterceptor to no longer execute, which means requests are being cached that should not be.

It's plausible that the lack of interception has other security impacts too, though if that is the case then I would hope that there are tests in place that would have picked it up.
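Conceptually, the interception that stops happening looks something like the following toy sketch (hypothetical names, not the actual ShardSearchRequestInterceptor): the shard-level request is inspected on its way in and its request cache is switched off when document-level security with a templated role query applies.

```java
// Toy sketch only: a shard-request hook that disables the request cache so that
// DLS-filtered results never end up in the shared cache.
final class DlsRequestCacheHookSketch {
    static final class ShardRequestStub {
        boolean requestCache = true;
        final boolean templatedDlsApplies;

        ShardRequestStub(boolean templatedDlsApplies) {
            this.templatedDlsApplies = templatedDlsApplies;
        }
    }

    static void intercept(ShardRequestStub request) {
        if (request.templatedDlsApplies) {
            // bypassing the transport layer skips this hook, which is why the
            // test's expected cache-hit counts changed
            request.requestCache = false;
        }
    }
}
```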

Labels: >non-issue, :Search Foundations/Search, Team:Search Foundations, v9.1.0
Projects: None yet
5 participants