Stop passing search query local node fanout through transport layer #122669
Conversation
We don't need to go through all the steps in the transport layer for executing query work on the local node. Doing so adds considerable overhead. E.g. with security enabled, the amount of work on transport threads goes up by about 4x for local only searches in the `geonames` track. The simpler a query is to rewrite and can-match and the more shards are queried on the local node, the more pronounced the overhead is in practice.
Pinging @elastic/es-search-foundations (Team:Search Foundations)
```java
var searchTransport = getSearchTransport();
var task = getTask();
if (searchTransport.transportService().getLocalNodeConnection() == connection) {
    searchService.executeQueryPhase(request, task, listener);
```
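For orientation, here is a minimal sketch of the overall shape this snippet is part of, with the transport fallback filled in. The method name `executeQuery` and the exact `sendExecuteQuery` wiring and parameter types are assumptions for illustration, not the literal code in this PR.

```java
// Illustrative sketch only; signatures and surrounding context are assumed.
void executeQuery(Transport.Connection connection, ShardSearchRequest request,
                  SearchTask task, ActionListener<SearchPhaseResult> listener) {
    var searchTransport = getSearchTransport();
    if (searchTransport.transportService().getLocalNodeConnection() == connection) {
        // Local shard: call the search service directly, skipping transport-layer
        // action lookup, listener wrapping, child-task registration and the
        // per-shard security interception.
        searchService.executeQueryPhase(request, task, listener);
    } else {
        // Remote shard: keep going through the transport layer as before.
        searchTransport.sendExecuteQuery(connection, request, task, listener);
    }
}
```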
out of curiosity: if we do this, why do it only for the query phase? Also, couldn't this conditional be added to the sendExecuteQuery method instead? What kind of overhead does this save? I can imagine that this is a pretty common pattern, or is search the edge case?
Hmm, good question. I guess I only noticed this in benchmarks for search and for the query phase specifically. For fetch it's not super visible since you mostly don't hit that many shards, and for bulk indexing you still have per-shard bulks, so the cost doesn't show up there.
I first noticed this with batched execution, where the overhead becomes super visible, but it's equally visible without it for large data nodes that already do coordination work (or if queries are heavy, like a large terms query or some geo stuff).
The overhead saved is: all the lookups in the transport layer, lots of listener wrapping, child-task registration, and most importantly security.
But :) that's why I think I need a review from security here. Functionally I think security still works the same way, if not more efficiently. All tests pass because we auth the top-level search request. DLS/FLS are applied as well, but somehow those cache assertions needed adjustment; seemingly we use the cache more now and I can't explain why.
The security overhead is considerable here; it seems to be well in excess of the can_match cost for most rally runs :O
```java
    shard = getShard(request);
} catch (RuntimeException e) {
    listener.onFailure(e);
    return;
```
what is this additional catch fixing? Is it a bug that you observed?
Yep, concurrent shard deletion can cause a shard-not-found exception :)
should that be a separate change then? It is not a problem introduced by your change, is it?
It is introduced here; without this change the transport layer catches the exception and passes it on to the listener.
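To make that concrete, here is a hedged sketch of the guard in context; the `try` opening and the `shard` declaration are assumed for illustration, and only the four lines from the diff above come from the actual change.

```java
final SearchShardTarget shard;  // assumed declaration, for illustration only
try {
    // getShard may throw a RuntimeException such as ShardNotFoundException if the
    // shard was deleted concurrently with the search.
    shard = getShard(request);
} catch (RuntimeException e) {
    // With the transport layer bypassed, nothing upstream turns this exception into
    // a failed response any more, so route it to the listener explicitly.
    listener.onFailure(e);
    return;
}
```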
```diff
@@ -342,7 +342,7 @@ public void testRequestCacheWithTemplateRoleQuery() {
         // Since the DLS for the alias uses a stored script, this should cause the request cached to be disabled
         assertSearchResponse(client1.prepareSearch(DLS_TEMPLATE_ROLE_QUERY_ALIAS).setRequestCache(true), Set.of("1"), Set.of("username"));
         // No cache should be used
-        assertCacheState(DLS_TEMPLATE_ROLE_QUERY_INDEX, 2, 2);
+        assertCacheState(DLS_TEMPLATE_ROLE_QUERY_INDEX, 3, 2);
```
could you expand on these assertions needing to be adapted?
Not quite to be honest. It seems going through the transport layer updates the thread-local context variable somehow and then authz works differently. But I think the security folks need to look at this.
I used the same approach in the batched execution logic (fortunately it's easy to switch back to going through the transport layer if need be). I think getting a review from security here is a soft dependency for using this path for batched execution; I'm just not sure how strong the tests are end-to-end :)
This seems to me like something that may be interesting, but not only for the query phase; it may deserve a broader discussion with the Distributed team on the transport handling, as well as with Security around possible security implications.
This change is not just interesting IMO; it's quite core to making full use of #121885 because it enables reuse of data structures across shards in the query phase.
I don't quite see how this is a thing that only batched query phase execution benefits from. If going through transport for local execution adds overhead, that is a problem everywhere in Elasticsearch that should be addressed at a different level. Could someone on @elastic/es-distributed-coordination chime in on this? Thoughts? Thanks!
I think it's generally an overhead-vs-complexity issue :) You always (even without security) have to go through looking up the transport action, wrapping the listener and whatnot. But on the other hand :) you get to use the same code locally and remotely, and all your top-level transport action needs is a reference to the transport service (which it has anyway), instead of essentially having all your dependencies on both the child and the top-level action. Btw, come to think of it ... I wonder if this is even that big of a deal; I just realised that we effectively have the same logic of just looping over the indices in the batched execution path.
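To illustrate the dependency tradeoff being described, here are two hypothetical skeletons (the class names do not exist in the codebase): the transport-only variant needs just the `TransportService` it already has, while the short-circuit variant also has to carry the concrete `SearchService`.

```java
// Hypothetical skeletons, for illustrating the tradeoff only.
class TransportOnlyQueryDispatcher {
    private final TransportService transportService; // the only dependency needed

    TransportOnlyQueryDispatcher(TransportService transportService) {
        this.transportService = transportService;
    }
    // Every per-shard call pays for transport action lookup, listener wrapping,
    // child-task registration and (if enabled) security interception.
}

class LocalShortCircuitQueryDispatcher {
    private final TransportService transportService;
    private final SearchService searchService; // extra dependency for the local path

    LocalShortCircuitQueryDispatcher(TransportService transportService, SearchService searchService) {
        this.transportService = transportService;
        this.searchService = searchService;
    }
    // Local shards call searchService directly; remote shards still go through transport.
}
```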
Just to provide a little motivation for moving this forward: benchmarking the achieved IPC when searching larger shard counts, the deep pipeline of needlessly going through the transport layer is hurting us a lot. We're at almost a 20% i-cache miss rate and a resulting ~0.7 IPC for the search threads, effectively memory-bound even on instructions. We're missing out on more than 75% of what the CPU could do per cycle here, and especially once batched execution hits, this will be the second-largest sequential bottleneck remaining.
Invoking something via the transport service is always going to be more expensive than just calling the code directly, just by the nature of the API. It's not a lot of overhead, and in most places we're not on a hot path, so it's a better engineering tradeoff to write the simpler code; but on performance-critical paths like search the balance may well tip in favour of skipping all this jazz wherever possible. I wouldn't be surprised to see something similar on the indexing path too, although the indexing execution model is already sensitive to the overheads of invoking transport actions (whether local or remote): the batched execution we have on the indexing path amortizes these overheads enough that the benefits are probably way smaller. If we made search less chatty on the wire then maybe this wouldn't be so important?
Yes :) but that's sort of why I opened this PR :D The changes in #121885 essentially work by doing the hotter search-over-shards loop the same way it's done in this PR. This isn't so much a network/distrib question; it's for security. You could rephrase it as "Do we need the explicit shard-level search request for security, or is all the authentication/authorization done safely for the top-level SearchRequest already?". Looking at the code and the way can-match is implemented, plus all tests passing, the answer to the first part is "no" I think, but that's what I'd like to see confirmed. From the search side, it's still nice to have this for the unbatched path as well for code reasons. It allowed me (without adding much code over what we have today) to have a combined data+coordinating node collect its per-shard results into a global results structure (of preallocated size == total_shards_in_search) while the data nodes collect into the same kind of data structure with size == total_shards_for_this_search_on_this_node :P It's really just a trivial implementation detail. And just generally speaking, we need to get away from building these shard request objects and having so much isolation for searches across shards. With #121885 the cost of a query moves towards all the ceremony of parsing a query into a Lucene structure and such in many if not most non-aggregation use cases. We can only cut down on that ceremony (and the related i-cache misses) by moving to a tighter loop that reuses stuff across shards. This is just step 1 on that road.
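A rough sketch of the per-shard results reuse described above, using plain JDK types; the actual structure and names used by #121885 and this PR will differ.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical illustration: one preallocated, shard-indexed results structure
// instead of a separately built response object per shard-level request.
final class PerShardResults<R> {
    // Coordinating node: size == total shards in the search.
    // Data node: size == shards of this search that live on this node.
    private final AtomicReferenceArray<R> results;

    PerShardResults(int shardCount) {
        this.results = new AtomicReferenceArray<>(shardCount);
    }

    void set(int shardIndex, R result) {
        results.set(shardIndex, result);
    }

    R get(int shardIndex) {
        return results.get(shardIndex);
    }
}
```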
I think the security impact is real, and we'll need to work through a solution. The changes to the tests are needed because the lack of interception changes how the request cache gets used. It's plausible that the lack of interception has other security impacts too, though if that is the case then I would hope that there are tests in place that would have picked it up.
> We don't need to go through all the steps in the transport layer for executing query work on the local node. Doing so adds considerable overhead. E.g. with security enabled, the amount of work on transport threads goes up by about 4x for local only searches in the `geonames` track. The simpler a query is to rewrite and can-match and the more shards are queried on the local node, the more pronounced the overhead is in practice.

I think the win is entirely evident and it's a relatively trivial change from the search perspective (note though that some of the sub-tasks aren't a thing anymore, which IMO is both a performance win and entirely cosmetic from the user's perspective), but we need a review/ok from the security team here.