Log stack traces on data nodes before they are cleared for transport #125732
@@ -0,0 +1,5 @@
+pr: 125732
+summary: Log stack traces on data nodes before they are cleared for transport
+area: Search
+type: bug
+issues: []
@@ -156,6 +156,7 @@
import java.util.function.Supplier;

import static org.elasticsearch.TransportVersions.ERROR_TRACE_IN_TRANSPORT_HEADER;
import static org.elasticsearch.common.Strings.format;
import static org.elasticsearch.core.TimeValue.timeValueHours;
import static org.elasticsearch.core.TimeValue.timeValueMillis;
import static org.elasticsearch.core.TimeValue.timeValueMinutes;
@@ -538,12 +539,16 @@ protected void doClose() {
      * @param <T> the type of the response
      * @param listener the action listener to be wrapped
      * @param version channel version of the request
+     * @param nodeId id of the current node
+     * @param shardId id of the shard being searched
      * @param threadPool with context where to write the new header
      * @return the wrapped action listener
      */
     static <T> ActionListener<T> maybeWrapListenerForStackTrace(
         ActionListener<T> listener,
         TransportVersion version,
+        String nodeId,
+        ShardId shardId,
         ThreadPool threadPool
     ) {
         boolean header = true;
@@ -552,6 +557,16 @@ static <T> ActionListener<T> maybeWrapListenerForStackTrace(
         }
         if (header == false) {
             return listener.delegateResponse((l, e) -> {
+                org.apache.logging.log4j.util.Supplier<String> messageSupplier = () -> format(
+                    "[%s]%s: failed to execute search request",
+                    nodeId,
+                    shardId
+                );
+                if (ExceptionsHelper.status(e).getStatus() < 500 || ExceptionsHelper.isNodeOrShardUnavailableTypeException(e)) {

Review comment: good catch on the shard unavailable errors, good to replicate what we have in RestResponse. Would it be a good idea to leave a comment about the need to keep this aligned with RestResponse, for posterity?

Reply: Good idea - added.

+                    logger.debug(messageSupplier, e);
+                } else {
+                    logger.warn(messageSupplier, e);
+                }

Review comment: 😍

                 ExceptionsHelper.unwrapCausesAndSuppressed(e, err -> {
                     err.setStackTrace(EMPTY_STACK_TRACE_ARRAY);
                     return false;
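For readers following along, here is a minimal standalone sketch (plain JDK Throwable APIs only, hypothetical class name, not the PR's ExceptionsHelper code) of what the pre-existing clearing step above amounts to: blanking out the stack trace of the exception and of each of its causes and suppressed exceptions before the response is serialized for transport. A production version would also need to guard against cycles in the cause chain.

    // Hypothetical sketch: recursively clear stack traces so they are not
    // serialized over the wire. Not the PR code.
    final class StackTraceClearingSketch {
        private static final StackTraceElement[] EMPTY_STACK_TRACE_ARRAY = new StackTraceElement[0];

        static void clearStackTraces(Throwable t) {
            for (Throwable current = t; current != null; current = current.getCause()) {
                current.setStackTrace(EMPTY_STACK_TRACE_ARRAY);
                for (Throwable suppressed : current.getSuppressed()) {
                    clearStackTraces(suppressed);
                }
            }
        }
    }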
@@ -563,7 +578,13 @@ static <T> ActionListener<T> maybeWrapListenerForStackTrace(
     }

     public void executeDfsPhase(ShardSearchRequest request, SearchShardTask task, ActionListener<SearchPhaseResult> listener) {
-        listener = maybeWrapListenerForStackTrace(listener, request.getChannelVersion(), threadPool);
+        listener = maybeWrapListenerForStackTrace(
+            listener,
+            request.getChannelVersion(),
+            clusterService.localNode().getId(),
+            request.shardId(),
+            threadPool
+        );
         final IndexShard shard = getShard(request);
         rewriteAndFetchShardRequest(shard, request, listener.delegateFailure((l, rewritten) -> {
             // fork the execution in the search thread pool
@@ -607,7 +628,13 @@ public void executeQueryPhase(ShardSearchRequest request, CancellableTask task,
         rewriteAndFetchShardRequest(
             shard,
             request,
-            maybeWrapListenerForStackTrace(listener, request.getChannelVersion(), threadPool).delegateFailure((l, orig) -> {
+            maybeWrapListenerForStackTrace(
+                listener,
+                request.getChannelVersion(),
+                clusterService.localNode().getId(),
+                request.shardId(),
+                threadPool
+            ).delegateFailure((l, orig) -> {
                 // check if we can shortcut the query phase entirely.
                 if (orig.canReturnNullResponseIfMatchNoDocs()) {
                     assert orig.scroll() == null;
@@ -805,9 +832,15 @@ private SearchPhaseResult executeQueryPhase(ShardSearchRequest request, Cancella
     }

     public void executeRankFeaturePhase(RankFeatureShardRequest request, SearchShardTask task, ActionListener<RankFeatureResult> listener) {
-        listener = maybeWrapListenerForStackTrace(listener, request.getShardSearchRequest().getChannelVersion(), threadPool);
         final ReaderContext readerContext = findReaderContext(request.contextId(), request);
         final ShardSearchRequest shardSearchRequest = readerContext.getShardSearchRequest(request.getShardSearchRequest());
+        listener = maybeWrapListenerForStackTrace(
+            listener,
+            shardSearchRequest.getChannelVersion(),
+            clusterService.localNode().getId(),
+            shardSearchRequest.shardId(),
+            threadPool
+        );
         final Releasable markAsUsed = readerContext.markAsUsed(getKeepAlive(shardSearchRequest));
         runAsync(getExecutor(readerContext.indexShard()), () -> {
             try (SearchContext searchContext = createContext(readerContext, shardSearchRequest, task, ResultsType.RANK_FEATURE, false)) {
@@ -856,8 +889,14 @@ public void executeQueryPhase(
         ActionListener<ScrollQuerySearchResult> listener,
         TransportVersion version
     ) {
-        listener = maybeWrapListenerForStackTrace(listener, version, threadPool);
         final LegacyReaderContext readerContext = (LegacyReaderContext) findReaderContext(request.contextId(), request);
+        listener = maybeWrapListenerForStackTrace(
+            listener,
+            version,
+            clusterService.localNode().getId(),
+            readerContext.indexShard().shardId(),
+            threadPool
+        );
         final Releasable markAsUsed;
         try {
             markAsUsed = readerContext.markAsUsed(getScrollKeepAlive(request.scroll()));
@@ -905,9 +944,15 @@ public void executeQueryPhase(
         ActionListener<QuerySearchResult> listener,
         TransportVersion version
     ) {
-        listener = maybeWrapListenerForStackTrace(listener, version, threadPool);
         final ReaderContext readerContext = findReaderContext(request.contextId(), request.shardSearchRequest());
         final ShardSearchRequest shardSearchRequest = readerContext.getShardSearchRequest(request.shardSearchRequest());
+        listener = maybeWrapListenerForStackTrace(
+            listener,
+            version,
+            clusterService.localNode().getId(),
+            shardSearchRequest.shardId(),
+            threadPool
+        );
         final Releasable markAsUsed = readerContext.markAsUsed(getKeepAlive(shardSearchRequest));
         rewriteAndFetchShardRequest(readerContext.indexShard(), shardSearchRequest, listener.delegateFailure((l, rewritten) -> {
             // fork the execution in the search thread pool
Review comment: I do not see where we are logging the trace? It seems we are only logging that it's actually removed. I am thinking we should log the exception as a WARN in certain cases. However, I defer to @javanna here. He has much better context around what we are trying to achieve.
Reply: Passing e to the logger here outputs the stack trace. Maybe my log message isn't clear... do you think something like "search failed with exception:" would be better? Interesting point on WARN - do you think a WARN log per shard for the same underlying error is acceptable here? Luca and I decided it would be difficult to dedupe these at the moment. It might become easier with batched query execution.
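To make the point above concrete, here is a minimal standalone sketch (hypothetical class and identifiers, not the PR code) showing that Log4j's Supplier/Throwable overload both defers building the message until the level is enabled and appends the exception's full stack trace to the log entry:

    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;
    import org.apache.logging.log4j.util.Supplier;

    public class ThrowableLoggingSketch {
        private static final Logger logger = LogManager.getLogger(ThrowableLoggingSketch.class);

        public static void main(String[] args) {
            // Hypothetical node and shard identifiers, for illustration only.
            String nodeId = "node-0";
            String shardId = "[my-index][0]";
            Exception e = new IllegalStateException("shard search failed");
            // The message is only built if the WARN level is enabled.
            Supplier<String> messageSupplier = () -> String.format("[%s]%s: failed to execute search request", nodeId, shardId);
            // Passing the Throwable as the final argument makes Log4j print its stack trace.
            logger.warn(messageSupplier, e);
        }
    }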
Review comment: Ah, I see. I think that this shouldn't be debug. Maybe debug for all exceptions, but we should log WARN for things that are 5xx, indicating the failure. I think adding a "Clearing stack trace before..." message is unnecessary. Logging isn't just for our debugging, but also for users. I am not sure indicating that the trace is being removed is overly useful here.
Reply: I've updated the log message to be clearer for users, and raised the level to WARN for 5XX. Thanks for the idea there - I think it makes sense for this new message to log at the same level as rest suppressed.
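For posterity, the level policy that was settled on can be summarized with a standalone sketch (the int status and boolean flag below are stand-ins for ExceptionsHelper.status(e) and the node/shard-unavailable check; this is not the PR code): failures below 500 and unavailable-shard failures are logged at DEBUG, while 5xx failures are logged at WARN, in line with how REST-layer suppressed-error logging treats them.

    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;

    public class FailureLogLevelSketch {
        private static final Logger logger = LogManager.getLogger(FailureLogLevelSketch.class);

        // httpStatus and shardUnavailable stand in for ExceptionsHelper.status(e).getStatus()
        // and ExceptionsHelper.isNodeOrShardUnavailableTypeException(e) from the PR.
        static void logShardFailure(int httpStatus, boolean shardUnavailable, Exception e) {
            if (httpStatus < 500 || shardUnavailable) {
                // Client-driven or expected failures: keep the logs quiet.
                logger.debug("failed to execute search request", e);
            } else {
                // Server-side (5xx) failures: surface them at WARN.
                logger.warn("failed to execute search request", e);
            }
        }
    }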