
Log stack traces on data nodes before they are cleared for transport #125732

Open · wants to merge 7 commits into base: main

Conversation

benchaplin

#118266 cleared stack traces on data nodes before they were transported back to the coordinating node when error_trace=false. However, all logging of exceptions happens on the coordinating node, so that change made it impossible to debug errors via stack trace when error_trace=false.

Here, I've logged the exception on the data node right before the stack trace is cleared. It's prefixed with [nodeId][indexName][shard] to match the rest.suppressed shard failures log on the coordinating node, allowing for easy error tracing from the coordinating node to the responsible data node.

Might this flood the (debug level) logs?
This change has the potential to log [# of shards] times for each index in a search. However, this log:

logger.debug(() -> format("%s: Failed to execute [%s] lastShard [%s]", shard, request, lastShard), e);
on the coordinating node already logs [# of shards]*[# of replicas] times at the debug level, and before #118266 each of those included a stack trace. Therefore, per node, this change does not increase log volume by an order of magnitude.
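
For readers skimming the diff, the change boils down to something like the following (a simplified sketch of the wrapper discussed in the review comments below, not the exact committed code; errorTraceRequested is a renamed stand-in for the header check, and the stack-trace clearing from #118266 is elided):

static <T> ActionListener<T> maybeWrapListenerForStackTrace(
    ActionListener<T> listener,
    boolean errorTraceRequested,   // derived from the "error_trace" thread-context header
    String nodeId,
    ShardId shardId
) {
    if (errorTraceRequested) {
        return listener;           // the caller gets the stack trace anyway, nothing extra to log here
    }
    return listener.delegateResponse((l, e) -> {
        // New in this PR: log on the data node, with the full stack trace, using the same
        // [nodeId][index][shard] prefix as the coordinating node's rest.suppressed log.
        // (Later review comments raise 5xx failures to WARN.)
        logger.debug(() -> format("[%s]%s: failed to execute search request", nodeId, shardId), e);
        // ... existing code from #118266 clears the stack traces on e here ...
        l.onFailure(e);
    });
}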

@elasticsearchmachine elasticsearchmachine added v9.1.0 needs:triage Requires assignment of a team area label labels Mar 26, 2025
@benchaplin benchaplin added >bug Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations v8.18.0 v8.19.0 and removed needs:triage Requires assignment of a team area label v9.1.0 labels Mar 26, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine (Collaborator)

Hi @benchaplin, I've created a changelog YAML for you.

) {
boolean header = true;
if (version.onOrAfter(ERROR_TRACE_IN_TRANSPORT_HEADER) && threadPool.getThreadContext() != null) {
if (request.getChannelVersion().onOrAfter(ERROR_TRACE_IN_TRANSPORT_HEADER) && threadPool.getThreadContext() != null) {
header = Boolean.parseBoolean(threadPool.getThreadContext().getHeaderOrDefault("error_trace", "false"));
}
if (header == false) {
Member

I do not see where we are logging the trace? It seems we are only logging that it's actually removed.

I am thinking we should log the exception as a WARN if:

  • header == false (meaning the trace wouldn't be provided to the user)
  • the exception isn't a user exception (e.g. should be considered a 5xx)

However, I defer to @javanna here. He has much better context around what we are trying to achieve.
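
Concretely, the proposal amounts to roughly the check below (a sketch only, reusing the header, e, nodeId and shardId variables from the surrounding snippet; it mirrors the code that eventually lands in the diff, quoted near the end of this conversation):

// Escalate to WARN only when the caller won't get the trace back (header == false)
// and the failure is a server-side (5xx) error; otherwise stay at DEBUG.
org.apache.logging.log4j.util.Supplier<String> message =
    () -> format("[%s]%s: failed to execute search request", nodeId, shardId);
if (header == false && ExceptionsHelper.status(e).getStatus() >= 500) {
    logger.warn(message, e);
} else {
    logger.debug(message, e);
}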

Author

Passing e to the logger here outputs the stack trace. Maybe my log message isn't clear... do you think something like "search failed with exception:" would be better?

Interesting point on WARN - do you think a WARN log per shard for the same underlying error is acceptable here? Luca and I decided it would be difficult to dedupe these at the moment. It might become easier with batched query execution.
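
(On the first point above: in Log4j 2 the stack trace comes from passing the exception as the trailing argument, not from the message text. A minimal standalone illustration, with a hypothetical Demo class:)

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

class Demo {
    private static final Logger logger = LogManager.getLogger(Demo.class);

    static void logFailure(Exception e) {
        // The Supplier builds the message lazily; the trailing Throwable makes Log4j
        // print the full stack trace of e after the message line.
        logger.debug(() -> "search failed with exception:", e);
    }
}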

Member

Ah, I see.

I think this shouldn't be debug only. Maybe keep debug for all exceptions, but we should log at WARN for things that are 5xx, to indicate the failure.

I think adding a "Clearing stack trace before..." message is unnecessary. Logging isn't just for our debugging; it's also for users.

I am not sure indicating that the trace is being removed is overly useful here.

Author

I've updated the log message to be clearer for users, and raised the level to WARN for 5xx. Thanks for the idea there - I think it makes sense for this new message to log at the same level as rest.suppressed.

Member

@javanna left a comment

I left some comments, thanks @benchaplin !

header = Boolean.parseBoolean(threadPool.getThreadContext().getHeaderOrDefault("error_trace", "false"));
}
if (header == false) {
return listener.delegateResponse((l, e) -> {
logger.debug(
() -> format("[%s]%s Clearing stack trace before transport:", clusterService.localNode().getId(), request.shardId()),
Member

This looks like the best place to add the logging indeed, because it ensures we do the additional logging exclusively in the cases where we suppress the stack trace, right before doing so.

If we log this at debug, we are not going to see it with the default log level, are we? I think we should use warn instead at least?

The error message also looks a little misleading; all we are interested in is the error itself, so I would log the same message that we'd get on the coord node, but this time with the stack trace.

There are a couple more aspects that deserve attention, I think:

  1. if we keep on logging on the coord node, we should probably only log on the data nodes when the error trace is not requested, otherwise we just add redundant logging?
  2. if we keep on logging on the coord node, it may happen that the node acting as coord node also acts as a data node while serving a search request. That would lead to duplicated logging on that node, which may be ok but is not ideal.

Author

I've updated the log message to be clearer for users and raised the level to WARN on the same condition under which the rest.suppressed logger logs at WARN.

  1. Agreed, and that is the current behavior, as this log is only emitted inside if (header == false).
  2. That is true. I think the shard failure logs on the coord node (see my example below) are important, but an argument could be made to remove the rest.suppressed log if error_trace=false. Then again, rest.suppressed is only one log line. I imagine removing any of these logs would count as a breaking change (?), as alerts out there (like our own) might rely on them.

private BooleanSupplier transportMessageHasStackTrace;
@BeforeClass
public static void setDebugLogLevel() {
Configurator.setLevel("org.elasticsearch.search.SearchService", Level.DEBUG);
Member

Hopefully this is not required if we log at WARN?

Author

For ITs, I found it easiest to trigger 4xx failures, which I think should still log at the DEBUG level.

@@ -136,6 +150,106 @@ public void testAsyncSearchFailingQueryErrorTraceFalse() throws IOException, Int
assertFalse(transportMessageHasStackTrace.getAsBoolean());
}

public void testLoggingInAsyncSearchFailingQueryErrorTraceDefault() throws IOException, InterruptedException {
Member

You could probably fold this test into the error_trace true test? You could randomly set the flag to true, otherwise not; the result should be the same?

Author

In my view, the three potential values of error_trace (true, false, and empty) all deserve their own test so that any changes in behavior immediately lead to a test failure.

Member

Yes, I can see that, yet knowing that empty and false lead to the same behavior, we could perhaps save some code and test runtime.

Author

Heard, saving code and test runtime sounds like a good tradeoff.
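
For illustration, the folded test could look roughly like this inside SearchErrorTraceIT (createFailingSearchRequest is a hypothetical stand-in for however the existing tests build the failing query; setupIndexWithDocs and hasStackTrace are the helpers already in the class; this assumes, as the existing tests seem to, that the failing search still returns a response rather than throwing):

public void testSearchFailingQueryErrorTraceFalseOrDefault() throws IOException {
    setupIndexWithDocs();
    Request searchRequest = createFailingSearchRequest();  // hypothetical helper
    // error_trace=false and an omitted error_trace parameter should behave identically,
    // so cover both cases in one test by randomly adding the parameter.
    if (randomBoolean()) {
        searchRequest.addParameter("error_trace", "false");
    }
    getRestClient().performRequest(searchRequest);
    assertFalse(hasStackTrace.getAsBoolean());
}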

@@ -37,19 +44,27 @@ protected Collection<Class<? extends Plugin>> nodePlugins() {
return CollectionUtils.appendToCopyNoNullElements(super.nodePlugins(), MockTransportService.TestPlugin.class);
}

@BeforeClass
public static void setDebugLogLevel() {
Configurator.setLevel("org.elasticsearch.search.SearchService", Level.DEBUG);
Member

Hopefully we adjust the log level and this is no longer needed.

@@ -108,6 +123,80 @@ public void testSearchFailingQueryErrorTraceFalse() throws IOException {
assertFalse(hasStackTrace.getAsBoolean());
}

public void testLoggingInSearchFailingQueryErrorTraceDefault() throws IOException {
Member

You could probably fold this test into the error_trace true test? You could randomly set the flag to true, otherwise not; the result should be the same?

Author

(done)

private void setupIndexWithDocs() {
createIndex("test1", "test2");
private int setupIndexWithDocs() {
int numShards = between(DEFAULT_MIN_NUM_SHARDS, DEFAULT_MAX_NUM_SHARDS);
Member

The number of shards is already randomized if you call createIndex; see ESIntegTestCase#numberOfShards.
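
If the test still needs the concrete shard count (for example to assert on the number of log events), it can read the randomized value back instead of rolling its own. A sketch, assuming the two test indices from the original helper and ESIntegTestCase's getNumShards accessor:

private int setupIndexWithDocs() {
    // createIndex already picks a randomized shard count per test run
    // (see ESIntegTestCase#numberOfShards), so no explicit between(...) is needed.
    createIndex("test1", "test2");
    // ... index documents into test1 and test2 as before ...
    return getNumShards("test1").numPrimaries;  // read the randomized value back if needed
}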

@@ -32,24 +41,41 @@
public class SearchErrorTraceIT extends HttpSmokeTestCase {
private BooleanSupplier hasStackTrace;

private static final String loggerName = "org.elasticsearch.search.SearchService";
Contributor

I'd probably use SearchService.class here.

Author

Good call, done.

Map<String, Object> responseEntity = performRequestAndGetResponseEntityAfterDelay(searchRequest, TimeValue.ZERO);
String asyncExecutionId = (String) responseEntity.get("id");
Request request = new Request("GET", "/_async_search/" + asyncExecutionId);
while (responseEntity.get("is_running") instanceof Boolean isRunning && isRunning) {
Contributor

I think you can use assertBusy here?
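
For example, the polling loop could become something like this (performRequestAndGetResponseEntityAfterDelay, request and TimeValue.ZERO are the names already used in this test; assertBusy is ESTestCase's retry helper; illustrative only):

// Let assertBusy retry until the async search reports it is no longer running,
// instead of hand-rolling a while loop over the "is_running" flag.
assertBusy(() -> {
    Map<String, Object> current = performRequestAndGetResponseEntityAfterDelay(request, TimeValue.ZERO);
    assertFalse((Boolean) current.get("is_running"));
});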

@smalyshev (Contributor)

I think our main problem with this is that users are giving us (or we are giving ourselves) logs with the useful part of the backtrace removed. So I wonder whether this patch really fixes that: would the users see the missing part in the data node logs? Would they know how to get it and how to give it to us?

@@ -552,6 +557,7 @@ static <T> ActionListener<T> maybeWrapListenerForStackTrace(
}
if (header == false) {
return listener.delegateResponse((l, e) -> {
logger.debug(() -> format("[%s]%s: failed to execute search request", nodeId, shardId), e);
Contributor

Do we need to correlate the request log on the coordinating node with the error reported on the data node? If so, is the logging information here enough?

Author

Good question. I modeled the nodeId/shardId prefix so it leads us right back to the coord node's logs. If you take a look at the example below, [CrhugeEAQNGHtZ14Y6Apjg][test][2] could be taken from the r.suppressed log and grepped across the data nodes to find the new log.

Contributor

I understand the nodeId/shardId combination, which associates a coord with a data node. What if there are multiple failing requests on the coord node? How can we identify the error of a single request? Is it a use case we must consider?

@benchaplin (Author) commented Mar 28, 2025

For reference, here's a brief example of the logs we have today plus what I'm adding. I've thrown an NPE in SearchService to trigger the failure. Setup: 3 nodes, 3 primary shards, 3 replicas.

(Coord node: we get 6 of these, one per shard)

[2025-03-28T08:17:51,928][DEBUG][o.e.a.s.TransportSearchAction] [runTask-0] [meJUNXYoT1iBSTnJgI6Unw][test][2]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[test], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true, allow_selectors=true, include_failure_indices=false], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={}}] lastShard [false] org.elasticsearch.transport.RemoteTransportException: [runTask-0][127.0.0.1:9300][indices:data/read/search[phase/query]]  
Caused by: java.lang.NullPointerException: testing123
[ no stack trace ]

(Coord node: after 6 above failures)

[2025-03-28T08:17:51,946][DEBUG][o.e.a.s.TransportSearchAction] [runTask-0] All shards failed for phase: [query] org.elasticsearch.ElasticsearchException$1: testing123
[ long stack trace for ElasticsearchException ]
Caused by: java.lang.NullPointerException: testing123
[ no stack trace ]

(Coord node: WARN level for status >= 500, DEBUG otherwise)

[2025-03-28T08:17:51,946][WARN ][r.suppressed             ] [runTask-0] path: /test/_search, params: {index=test}, status: 500 Failed to execute phase [query], all shards failed; shardFailures {[CrhugeEAQNGHtZ14Y6Apjg][test][0]: org.elasticsearch.transport.RemoteTransportException: [runTask-1][127.0.0.1:9301][indices:data/read/search[phase/query]]  
Caused by: java.lang.NullPointerException: testing123  
}{[meJUNXYoT1iBSTnJgI6Unw][test][1]: org.elasticsearch.transport.RemoteTransportException: [runTask-0][127.0.0.1:9300][indices:data/read/search[phase/query]]  
Caused by: java.lang.NullPointerException: testing123  
}{[CrhugeEAQNGHtZ14Y6Apjg][test][2]: org.elasticsearch.transport.RemoteTransportException: [runTask-1][127.0.0.1:9301][indices:data/read/search[phase/query]]  
Caused by: java.lang.NullPointerException: testing123  
}
[ long stack trace ]
Caused by: java.lang.NullPointerException: testing123
[ no stack trace ]

(Data node: this PR's new log - we get 6 of these spread across the nodes, one per shard)

[2025-03-28T08:17:51,944][DEBUG][o.e.s.SearchService      ] [runTask-1] [CrhugeEAQNGHtZ14Y6Apjg][test][2]: failed to execute search request java.lang.NullPointerException: testing123
	at [email protected]/org.elasticsearch.search.SearchService.throwException(SearchService.java:768)
	at [email protected]/org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:802)
	at [email protected]/org.elasticsearch.search.SearchService.lambda$executeQueryPhase$6(SearchService.java:648)
... [ full stack trace ]

Edit - after b34afc1, the new log will match the level of the r.suppressed log, so it would be WARN in this example.

Comment on lines +560 to +569
org.apache.logging.log4j.util.Supplier<String> messageSupplier = () -> format(
"[%s]%s: failed to execute search request",
nodeId,
shardId
);
if (ExceptionsHelper.status(e).getStatus() < 500 || ExceptionsHelper.isNodeOrShardUnavailableTypeException(e)) {
logger.debug(messageSupplier, e);
} else {
logger.warn(messageSupplier, e);
}
Member

😍

Labels
auto-backport Automatically create backport pull requests when merged >bug :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v8.18.1 v8.19.0 v9.0.1 v9.1.0
6 participants