Fix SearchErrorTraceIT and friends to work with batched query execution #127150

original-brownbear · 2025-04-22T13:50:22Z

Making this work with batched execution and fixing a memory leak:

* Fix a memory leak by removing the listener on the first message. With batched execution in the mix there is really only a single message per node anyway: either the data node holds a single shard and we get a single query message, or it holds multiple shards and we get a single batched message. Since all tests issue only a single request, it's fine to remove the listener after the first message.
* Add a new hook that allows inspection of the actual response. This is needed for batched execution, since it sends a non-error response even if the data node failed all of its searches. We had this before in the `onResponseSent` hook, but checking the instance after it has been sent over the wire causes needless overhead in production code, so this moves to a "before-style" hook.
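
A minimal sketch of the remove-on-first-message idea, assuming a simplified listener registry. `MessageListener`, `ListenerRegistry` and `OneShotListener` are hypothetical names for illustration, not Elasticsearch's transport API; the point is only that a listener which unregisters itself after one message stops keeping its captured state reachable.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical minimal listener type; not the Elasticsearch TransportMessageListener.
interface MessageListener {
    void onRequestReceived(long requestId, String action);
}

// Hypothetical registry standing in for the transport service's listener list.
final class ListenerRegistry {
    private final List<MessageListener> listeners = new CopyOnWriteArrayList<>();

    void add(MessageListener listener) { listeners.add(listener); }
    boolean remove(MessageListener listener) { return listeners.remove(listener); }
    int size() { return listeners.size(); }

    void dispatch(long requestId, String action) {
        // CopyOnWriteArrayList iterates over a snapshot, so a listener may remove itself here.
        for (MessageListener listener : listeners) {
            listener.onRequestReceived(requestId, action);
        }
    }
}

// Handles exactly one message, then unregisters so the registry no longer references it.
final class OneShotListener implements MessageListener {
    private final ListenerRegistry registry;
    private int seen = 0;

    private OneShotListener(ListenerRegistry registry) {
        this.registry = registry;
    }

    static OneShotListener register(ListenerRegistry registry) {
        OneShotListener listener = new OneShotListener(registry);
        registry.add(listener);
        return listener;
    }

    int messagesSeen() { return seen; }

    @Override
    public void onRequestReceived(long requestId, String action) {
        seen++;
        registry.remove(this); // one message per node is all the tests expect
    }
}
```

This relies on the assumption stated above: one request per test, hence one message per node.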

elasticsearchmachine · 2025-04-22T13:50:46Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

original-brownbear · 2025-04-22T13:51:41Z

server/src/main/java/org/elasticsearch/transport/TransportMessageListener.java

+     * @param action the request action
+     * @param response response instance
+     */
+    default void onBeforeResponseSent(long requestId, String action, TransportResponse response) {}


@DaveCTurner WDYT? This should be OK, right? In prod the overhead is negligible, I think.
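
To illustrate why a "before-style" hook should be cheap, here is a sketch assuming a simplified sender. The hook is a default no-op method, so production code pays at most a no-op call per registered listener, and the response instance does not have to be retained after it is flushed. `ResponseListener` and `ResponseSender` are hypothetical stand-ins loosely modelled on the diff above; the real hook takes a `TransportResponse`, while this sketch uses `Object` to stay self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified listener; both hooks default to no-ops, as in the diff above.
interface ResponseListener {
    default void onBeforeResponseSent(long requestId, String action, Object response) {}
    default void onResponseSent(long requestId, String action) {}
}

// Hypothetical sender; "wire" stands in for the network channel.
final class ResponseSender {
    private final List<ResponseListener> listeners = new ArrayList<>();
    final List<String> wire = new ArrayList<>();

    void addListener(ResponseListener listener) { listeners.add(listener); }

    void send(long requestId, String action, Object response) {
        // Listeners inspect the live instance before it is serialized...
        for (ResponseListener l : listeners) {
            l.onBeforeResponseSent(requestId, action, response);
        }
        wire.add(String.valueOf(response)); // "serialize and flush"
        // ...so the after-send hook no longer needs the response object at all.
        for (ResponseListener l : listeners) {
            l.onResponseSent(requestId, action);
        }
    }
}
```

A test listener can then count per-shard failures inside a batched response even though the response itself is not an error.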

benchaplin

I left one comment below. I also have a general question: what do you mean by

We had this before in the onResponseSent hook

I was a bit surprised that we don't have tests that inspect transport responses sent from data nodes.

benchaplin · 2025-05-08T17:12:47Z

test/framework/src/main/java/org/elasticsearch/search/ErrorTraceHelper.java

+                        var r = asInstanceOf(SearchQueryThenFetchAsyncAction.NodeQueryResponse.class, response);
+                        for (Object result : r.getResults()) {
+                            if (result instanceof Exception error) {
+                                checkStacktraceStateAndRemove(error, mockTs);


Is looping and setting transportMessageHasStackTrace correct here? If one result has a stack trace, we call transportMessageHasStackTrace.set(true); if a result in a later iteration doesn't, we call transportMessageHasStackTrace.set(false). Then transportMessageHasStackTrace really means "did the last-checked shard have a stack trace?" I believe our current test cases trigger errors on all shards, so the issue is not noticeable.
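
The concern can be reproduced in isolation. This sketch is hypothetical (`LastWriteWins.flagAfterLoop` is not part of the PR) and models each shard result as a boolean saying whether it carried a stack trace: with one flag set per iteration, only the last shard result determines the outcome.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

final class LastWriteWins {
    // Hypothetical helper mirroring "set a single flag once per exception result".
    static boolean flagAfterLoop(List<Boolean> shardHasStackTrace) {
        AtomicBoolean transportMessageHasStackTrace = new AtomicBoolean();
        for (boolean hasTrace : shardHasStackTrace) {
            transportMessageHasStackTrace.set(hasTrace); // overwrites the previous iteration
        }
        return transportMessageHasStackTrace.get();
    }
}
```

With `List.of(true, false)` the flag ends up `false`, although the first shard did carry a stack trace.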

It's not great; I sort of tried to point this out in the OP by noting that we only see a single request per thread.
I just didn't want to refactor the test here. It seems we could easily invert the logic: set a boolean such as "expectsStacktrace" in the transport message listener and then simply assert inline instead of after the fact. If we can only communicate one bit back from the listener, there isn't really a mathematical way to cleanly assert on multiple things :P

Shouldn't this do something like: capture the existence of a stack trace for ALL exception results, and:

if they're all true: transportMessageHasStackTrace.set(true)

if they're all false: transportMessageHasStackTrace.set(false)

otherwise, something is wrong, so throw an assertion error
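
A sketch of the three-outcome rule, assuming each exception result has already been reduced to a boolean ("has a stack trace"). `StackTraceAggregation.allOrNothing` is a hypothetical helper; the discussion does not say how to treat an empty list, so rejecting it here is an assumption.

```java
import java.util.List;

final class StackTraceAggregation {
    // Hypothetical helper: all true -> true, all false -> false, mixed -> assertion error.
    static boolean allOrNothing(List<Boolean> hasStackTrace) {
        if (hasStackTrace.isEmpty()) {
            // Assumption: nothing to check is treated as a caller error.
            throw new IllegalArgumentException("no exception results to check");
        }
        boolean first = hasStackTrace.get(0);
        for (boolean b : hasStackTrace) {
            if (b != first) {
                throw new AssertionError("mixed stack-trace state across shard results: " + hasStackTrace);
            }
        }
        return first;
    }
}
```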

Right, that would work, but wouldn't it also make this even harder to follow? If we want to fix this, my vote would be to invest 5 more minutes and just assert inline, so that we can pass the expectation for everything to the listener at the beginning of each test :) Otherwise, with the three-outcome logic, we'd have a mix of inline assertions and "at the end of the test" assertions, which is just needlessly complex.

Ah, I see what you mean now: throw out transportMessageHasStackTrace and pass "expectsStacktrace" into the onBeforeResponseSent override to assert against. I agree that's better than the mixed assertions.
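
A sketch of the agreed direction, under simplifying assumptions: the expectation is passed to the listener when the test starts and each exception result is asserted on inside the hook, so nothing has to be reported back afterwards. `ExpectingListener` is hypothetical, and "has a stack trace" is approximated by `getStackTrace().length > 0`, which is not necessarily how ErrorTraceHelper checks serialized exceptions.

```java
import java.util.List;

// Hypothetical listener: the expectation is fixed up front and checked per result.
final class ExpectingListener {
    private final boolean expectsStackTrace;

    ExpectingListener(boolean expectsStackTrace) {
        this.expectsStackTrace = expectsStackTrace;
    }

    // Called once per outgoing response; every exception result is asserted on individually.
    void onBeforeResponseSent(List<?> results) {
        for (Object result : results) {
            if (result instanceof Exception e) {
                boolean hasTrace = e.getStackTrace().length > 0;
                if (hasTrace != expectsStackTrace) {
                    throw new AssertionError(
                        "expected stack trace=" + expectsStackTrace + " but got " + hasTrace + " for " + e);
                }
            }
        }
    }
}
```

Because each result is checked on its own, a response mixing shards with and without stack traces fails immediately instead of being reduced to a single bit.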

original-brownbear · 2025-05-08T17:38:09Z

We had this before in the onResponseSent hook

See #125163: we were wasting memory holding on to the message just for this hook, which is a no-op in production :) In that PR I removed the code that kept the message alive until it had been fully flushed to the wire.

original-brownbear added the >test and :Search Foundations/Search labels Apr 22, 2025

elasticsearchmachine added the Team:Search Foundations and v9.1.0 labels Apr 22, 2025

original-brownbear mentioned this pull request Mar 27, 2025 in [Meta] Batched Query Phase Follow-up Tasks #125788 (Open, 8 tasks)

benchaplin reviewed May 8, 2025

original-brownbear requested a review from benchaplin May 8, 2025 17:38
