
Log stack traces on data nodes before they are cleared for transport #125732

Open · wants to merge 7 commits into base: main

Conversation

benchaplin

#118266 cleared stack traces on data nodes before they were transported back to the coordinating node when error_trace=false. However, all logging of exceptions happens on the coordinating node, so that change made it impossible to debug errors via stack trace when error_trace=false.

Here, I've logged the exception on the data node right before the stack trace is cleared. It's prefixed with [nodeId][indexName][shard] to match the rest.suppressed shard failures log on the coordinating node, allowing for easy error tracing from the coordinating node to the responsible data node.

Might this flood the (debug level) logs?
This change has the potential to log [# of shards] times for each index in a search. However, this log:

logger.debug(() -> format("%s: Failed to execute [%s] lastShard [%s]", shard, request, lastShard), e);
on the coordinating node already logs [# of shards]*[# of replicas] times at the debug level, and before #118266 each of those included a stack trace. Therefore, per node, this change does not increase log volume by an order of magnitude.
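
For readers skimming the diff, the change boils down to something like the following (a simplified sketch of the wrapper discussed in the review comments below, not the exact committed code; errorTraceRequested is a renamed stand-in for the header check, and the stack-trace clearing from #118266 is elided):

static <T> ActionListener<T> maybeWrapListenerForStackTrace(
    ActionListener<T> listener,
    boolean errorTraceRequested,   // derived from the "error_trace" thread-context header
    String nodeId,
    ShardId shardId
) {
    if (errorTraceRequested) {
        return listener;           // the caller gets the stack trace anyway, nothing extra to log here
    }
    return listener.delegateResponse((l, e) -> {
        // New in this PR: log on the data node, with the full stack trace, using the same
        // [nodeId][index][shard] prefix as the coordinating node's rest.suppressed log.
        // (Later review comments raise 5xx failures to WARN.)
        logger.debug(() -> format("[%s]%s: failed to execute search request", nodeId, shardId), e);
        // ... existing code from #118266 clears the stack traces on e here ...
        l.onFailure(e);
    });
}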

@elasticsearchmachine elasticsearchmachine added v9.1.0 needs:triage Requires assignment of a team area label labels Mar 26, 2025
@benchaplin benchaplin added >bug Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations v8.18.0 v8.19.0 and removed needs:triage Requires assignment of a team area label v9.1.0 labels Mar 26, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine (Collaborator)

Hi @benchaplin, I've created a changelog YAML for you.

) {
boolean header = true;
if (version.onOrAfter(ERROR_TRACE_IN_TRANSPORT_HEADER) && threadPool.getThreadContext() != null) {
if (request.getChannelVersion().onOrAfter(ERROR_TRACE_IN_TRANSPORT_HEADER) && threadPool.getThreadContext() != null) {
header = Boolean.parseBoolean(threadPool.getThreadContext().getHeaderOrDefault("error_trace", "false"));
}
if (header == false) {
Member

I do not see where we are logging the trace? It seems we are only logging that it's actually removed.

I am thinking we should log the exception as a WARN if:

  • header == false (meaning the trace wouldn't be provided to the user)
  • the exception isn't a user exception (e.g. should be considered a 5xx)

However, I defer to @javanna here. He has much better context around what we are trying to achieve.
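
Concretely, the proposal amounts to roughly the check below (a sketch only, reusing the header, e, nodeId and shardId variables from the surrounding snippet; it mirrors the code that eventually lands in the diff, quoted near the end of this conversation):

// Escalate to WARN only when the caller won't get the trace back (header == false)
// and the failure is a server-side (5xx) error; otherwise stay at DEBUG.
org.apache.logging.log4j.util.Supplier<String> message =
    () -> format("[%s]%s: failed to execute search request", nodeId, shardId);
if (header == false && ExceptionsHelper.status(e).getStatus() >= 500) {
    logger.warn(message, e);
} else {
    logger.debug(message, e);
}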

Author

Passing e to the logger here outputs the stack trace. Maybe my log message isn't clear... do you think something like "search failed with exception:" would be better?

Interesting point on WARN - do you think a WARN log per shard for the same underlying error is acceptable here? Luca and I decided it would be difficult to dedupe these at the moment. It might become easier with batched query execution.
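
(On the first point above: in Log4j 2 the stack trace comes from passing the exception as the trailing argument, not from the message text. A minimal standalone illustration, with a hypothetical Demo class:)

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

class Demo {
    private static final Logger logger = LogManager.getLogger(Demo.class);

    static void logFailure(Exception e) {
        // The Supplier builds the message lazily; the trailing Throwable makes Log4j
        // print the full stack trace of e after the message line.
        logger.debug(() -> "search failed with exception:", e);
    }
}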

Member

Ah, I see.

I think this shouldn't be debug only. Maybe keep debug for all exceptions, but we should log at WARN for things that are 5xx, to indicate the failure.

I think adding a "Clearing stack trace before..." message is unnecessary. Logging isn't just for our debugging; it's also for users.

I am not sure indicating that the trace is being removed is overly useful here.

Author

I've updated the log message to be clearer for users, and raised the level to WARN for 5xx. Thanks for the idea there - I think it makes sense for this new message to log at the same level as rest.suppressed.

Member

@javanna left a comment

I left some comments, thanks @benchaplin !

header = Boolean.parseBoolean(threadPool.getThreadContext().getHeaderOrDefault("error_trace", "false"));
}
if (header == false) {
return listener.delegateResponse((l, e) -> {
logger.debug(
() -> format("[%s]%s Clearing stack trace before transport:", clusterService.localNode().getId(), request.shardId()),
Member

This looks like the best place to add the logging indeed, because it ensures we do the additional logging exclusively in the cases where we suppress the stack trace, right before doing so.

If we log this at debug, we are not going to see it with the default log level, are we? I think we should use warn instead at least?

The error message also looks a little misleading; all we are interested in is the error itself, so I would log the same message that we'd get on the coord node, but this time with the stack trace.

There are a couple more aspects that deserve attention, I think:

  1. if we keep on logging on the coord node, we should probably only log on the data nodes when the error trace is not requested, otherwise we just add redundant logging?
  2. if we keep on logging on the coord node, it may happen that the node acting as coord node also acts as a data node while serving a search request. That would lead to duplicated logging on that node, which may be ok but is not ideal.

Author

I've updated the log message to be clearer for users and raised the level to WARN on the same condition under which the rest.suppressed logger logs at WARN.

  1. Agreed, and that is the current behavior, as this log is only emitted inside if (header == false).
  2. That is true. I think the shard failure logs on the coord node (see my example below) are important, but an argument could be made to remove the rest.suppressed log if error_trace=false. Then again, rest.suppressed is only one log line. I imagine removing any of these logs would count as a breaking change (?), as alerts out there (like our own) might rely on them.

private BooleanSupplier transportMessageHasStackTrace;
@BeforeClass
public static void setDebugLogLevel() {
Configurator.setLevel("org.elasticsearch.search.SearchService", Level.DEBUG);
Member

Hopefully this is not required if we log at WARN?

Author

For ITs, I found it easiest to trigger 4xx failures, which I think should still log at the DEBUG level.

@@ -136,6 +150,106 @@ public void testAsyncSearchFailingQueryErrorTraceFalse() throws IOException, Int
assertFalse(transportMessageHasStackTrace.getAsBoolean());
}

public void testLoggingInAsyncSearchFailingQueryErrorTraceDefault() throws IOException, InterruptedException {
Member

You could probably fold this test into the error_trace true test? You could randomly set the flag to true, otherwise not; the result should be the same?

Author

In my view, the three potential values of error_trace (true, false, and empty) all deserve their own test so that any changes in behavior immediately lead to a test failure.

Member

Yes, I can see that, yet knowing that empty and false lead to the same behavior, we could perhaps save some code and test runtime.

Author

Heard, saving code and test runtime sounds like a good tradeoff.
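
For illustration, the folded test could look roughly like this inside SearchErrorTraceIT (createFailingSearchRequest is a hypothetical stand-in for however the existing tests build the failing query; setupIndexWithDocs and hasStackTrace are the helpers already in the class; this assumes, as the existing tests seem to, that the failing search still returns a response rather than throwing):

public void testSearchFailingQueryErrorTraceFalseOrDefault() throws IOException {
    setupIndexWithDocs();
    Request searchRequest = createFailingSearchRequest();  // hypothetical helper
    // error_trace=false and an omitted error_trace parameter should behave identically,
    // so cover both cases in one test by randomly adding the parameter.
    if (randomBoolean()) {
        searchRequest.addParameter("error_trace", "false");
    }
    getRestClient().performRequest(searchRequest);
    assertFalse(hasStackTrace.getAsBoolean());
}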

@@ -37,19 +44,27 @@ protected Collection<Class<? extends Plugin>> nodePlugins() {
return CollectionUtils.appendToCopyNoNullElements(super.nodePlugins(), MockTransportService.TestPlugin.class);
}

@BeforeClass
public static void setDebugLogLevel() {
Configurator.setLevel("org.elasticsearch.search.SearchService", Level.DEBUG);
Member

Hopefully we adjust the log level and this is no longer needed.

@@ -108,6 +123,80 @@ public void testSearchFailingQueryErrorTraceFalse() throws IOException {
assertFalse(hasStackTrace.getAsBoolean());
}

public void testLoggingInSearchFailingQueryErrorTraceDefault() throws IOException {
Member

You could probably fold this test into the error_trace true test? You could randomly set the flag to true, otherwise not; the result should be the same?

Author

(done)

private void setupIndexWithDocs() {
createIndex("test1", "test2");
private int setupIndexWithDocs() {
int numShards = between(DEFAULT_MIN_NUM_SHARDS, DEFAULT_MAX_NUM_SHARDS);
Member

The number of shards is already randomized if you call createIndex; see ESIntegTestCase#numberOfShards.
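
If the test still needs the concrete shard count (for example to assert on the number of log events), it can read the randomized value back instead of rolling its own. A sketch, assuming the two test indices from the original helper and ESIntegTestCase's getNumShards accessor:

private int setupIndexWithDocs() {
    // createIndex already picks a randomized shard count per test run
    // (see ESIntegTestCase#numberOfShards), so no explicit between(...) is needed.
    createIndex("test1", "test2");
    // ... index documents into test1 and test2 as before ...
    return getNumShards("test1").numPrimaries;  // read the randomized value back if needed
}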

@@ -32,24 +41,41 @@
public class SearchErrorTraceIT extends HttpSmokeTestCase {
private BooleanSupplier hasStackTrace;

private static final String loggerName = "org.elasticsearch.search.SearchService";
Contributor

I'd probably use SearchService.class here.

Author

Good call, done.

Map<String, Object> responseEntity = performRequestAndGetResponseEntityAfterDelay(searchRequest, TimeValue.ZERO);
String asyncExecutionId = (String) responseEntity.get("id");
Request request = new Request("GET", "/_async_search/" + asyncExecutionId);
while (responseEntity.get("is_running") instanceof Boolean isRunning && isRunning) {
Contributor

I think you can use assertBusy here?
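
For example, the polling loop could become something like this (performRequestAndGetResponseEntityAfterDelay, request and TimeValue.ZERO are the names already used in this test; assertBusy is ESTestCase's retry helper; illustrative only):

// Let assertBusy retry until the async search reports it is no longer running,
// instead of hand-rolling a while loop over the "is_running" flag.
assertBusy(() -> {
    Map<String, Object> current = performRequestAndGetResponseEntityAfterDelay(request, TimeValue.ZERO);
    assertFalse((Boolean) current.get("is_running"));
});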

@smalyshev (Contributor)

I think our main problem with this is that users are giving us (or we are giving ourselves) logs with the useful part of the backtrace removed. So I wonder whether this patch really fixes that: would the users see the missing part in the data node logs? Would they know how to get it and how to give it to us?

@@ -552,6 +557,7 @@ static <T> ActionListener<T> maybeWrapListenerForStackTrace(
}
if (header == false) {
return listener.delegateResponse((l, e) -> {
logger.debug(() -> format("[%s]%s: failed to execute search request", nodeId, shardId), e);
Contributor

Do we need to correlate the request log on the coordinating node with the error reported on the data node? If so, is the logging information here enough?

Author

Good question. I modeled the nodeId/shardId prefix so it leads us right back to the coord node's logs. If you take a look at the example below, [CrhugeEAQNGHtZ14Y6Apjg][test][2] could be taken from the r.suppressed log and grepped across the data nodes to find the new log.

Contributor

I understand the nodeId/shardId combination, which associates a coord with a data node. What if there are multiple failing requests on the coord node? How can we identify the error of a single request? Is it a use case we must consider?

@benchaplin (Author) commented Mar 28, 2025

For reference, here's a brief example of the logs we have today plus what I'm adding. I've thrown an NPE in SearchService to trigger the failure. Setup: 3 nodes, 3 primary shards, 3 replicas.

(Coord node: we get 6 of these, one per shard)

[2025-03-28T08:17:51,928][DEBUG][o.e.a.s.TransportSearchAction] [runTask-0] [meJUNXYoT1iBSTnJgI6Unw][test][2]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[test], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true, allow_selectors=true, include_failure_indices=false], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={}}] lastShard [false] org.elasticsearch.transport.RemoteTransportException: [runTask-0][127.0.0.1:9300][indices:data/read/search[phase/query]]  
Caused by: java.lang.NullPointerException: testing123
[ no stack trace ]

(Coord node: after 6 above failures)

[2025-03-28T08:17:51,946][DEBUG][o.e.a.s.TransportSearchAction] [runTask-0] All shards failed for phase: [query] org.elasticsearch.ElasticsearchException$1: testing123
[ long stack trace for ElasticsearchException ]
Caused by: java.lang.NullPointerException: testing123
[ no stack trace ]

(Coord node: WARN level for status >= 500, DEBUG otherwise)

[2025-03-28T08:17:51,946][WARN ][r.suppressed             ] [runTask-0] path: /test/_search, params: {index=test}, status: 500 Failed to execute phase [query], all shards failed; shardFailures {[CrhugeEAQNGHtZ14Y6Apjg][test][0]: org.elasticsearch.transport.RemoteTransportException: [runTask-1][127.0.0.1:9301][indices:data/read/search[phase/query]]  
Caused by: java.lang.NullPointerException: testing123  
}{[meJUNXYoT1iBSTnJgI6Unw][test][1]: org.elasticsearch.transport.RemoteTransportException: [runTask-0][127.0.0.1:9300][indices:data/read/search[phase/query]]  
Caused by: java.lang.NullPointerException: testing123  
}{[CrhugeEAQNGHtZ14Y6Apjg][test][2]: org.elasticsearch.transport.RemoteTransportException: [runTask-1][127.0.0.1:9301][indices:data/read/search[phase/query]]  
Caused by: java.lang.NullPointerException: testing123  
}
[ long stack trace ]
Caused by: java.lang.NullPointerException: testing123
[ no stack trace ]

(Data node: this PR's new log - we get 6 of these spread across the nodes, one per shard)

[2025-03-28T08:17:51,944][DEBUG][o.e.s.SearchService      ] [runTask-1] [CrhugeEAQNGHtZ14Y6Apjg][test][2]: failed to execute search request java.lang.NullPointerException: testing123
	at [email protected]/org.elasticsearch.search.SearchService.throwException(SearchService.java:768)
	at [email protected]/org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:802)
	at [email protected]/org.elasticsearch.search.SearchService.lambda$executeQueryPhase$6(SearchService.java:648)
... [ full stack trace ]

Edit - after b34afc1, the new log will match the level of the r.suppressed log, so it would be WARN in this example.

Comment on lines +560 to +569
org.apache.logging.log4j.util.Supplier<String> messageSupplier = () -> format(
"[%s]%s: failed to execute search request",
nodeId,
shardId
);
if (ExceptionsHelper.status(e).getStatus() < 500 || ExceptionsHelper.isNodeOrShardUnavailableTypeException(e)) {
logger.debug(messageSupplier, e);
} else {
logger.warn(messageSupplier, e);
}
Member

😍

Labels
auto-backport Automatically create backport pull requests when merged >bug :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v8.18.1 v8.19.0 v9.0.1 v9.1.0
6 participants