-
Notifications
You must be signed in to change notification settings - Fork 3.3k
HBASE-29372: Meta cache clear metrics and logs shouldn't use "UnknownException" #6961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
@@ -783,8 +787,7 @@ private void receiveGlobalFailure(MultiAction rsActions, ServerName server, int | |||
// any of the regions in the MultiAction and do not update cache if exception is | |||
// from failing to submit action to thread pool | |||
if (clearServerCache) { | |||
updateCachedLocations(server, regionName, row, | |||
ClientExceptionsUtil.isMetaClearingException(t) ? null : t); | |||
updateCachedLocations(server, regionName, row, t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also solves the frustration of seeing "UnknownException" when inspecting meta cache clear exception metrics. This has made it quite difficult to track down what triggered the meta cache clear.
I think it's always better to provide more context than less. Even if an exception is meta cache clearing (though it will be now), I'd still prefer to know the exact exception type that cleared the meta cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 would be good to preserve the exception for updateCachedLocations
Since we currently pass null to updateCachedLocations
if we have a meta cache clearing exception, does that means that we never update the cache clearing exception metric properly for cache clears coming from receiveGlobalFailure
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we'll do is basically "mask" the cache clearing exception by report an UnknownException
. The code for that lives in the metrics class. It's annoying b/c that coupled with the lack of any logging in this code path makes it really difficult to determine what caused these meta cache clears.
errorsByServer.reportServerError(server); | ||
Retry canRetry = errorsByServer.canTryMore(numAttempt) ? Retry.YES : Retry.NO_RETRIES_EXHAUSTED; | ||
boolean clearServerCache = false; | ||
|
||
if (!(t instanceof RejectedExecutionException)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enforces the constraints added in https://issues.apache.org/jira/browse/HBASE-27491
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better if you instead push RejectedExecutionException
down into ClientExceptionsUtil.isMetaClearingException
.
How about adding another collection of execution-exceptions for the family of various ExecutorService interaction errors, like is done with networking/connection exceptions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted, adding that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hgromer I think you dropped my earlier comment.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java
Show resolved
Hide resolved
@@ -783,8 +787,7 @@ private void receiveGlobalFailure(MultiAction rsActions, ServerName server, int | |||
// any of the regions in the MultiAction and do not update cache if exception is | |||
// from failing to submit action to thread pool | |||
if (clearServerCache) { | |||
updateCachedLocations(server, regionName, row, | |||
ClientExceptionsUtil.isMetaClearingException(t) ? null : t); | |||
updateCachedLocations(server, regionName, row, t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 would be good to preserve the exception for updateCachedLocations
Since we currently pass null to updateCachedLocations
if we have a meta cache clearing exception, does that means that we never update the cache clearing exception metric properly for cache clears coming from receiveGlobalFailure
?
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the new scope of the work is
- Pass the meta clearing exception to
updateCachedLocations
so we can properly capture the exception in the client metric instead of unknown exception - Pushing
RejectedExecutionException
down intoisMetaClearingException
I am wondering if it make sense to split these into two different issues given the change in scope? My initial thought is in agreement that RejectedExecutionException
should not be meta cache clearing, but maybe it would be beneficial to handle that change independently of the client metric fix , there may be implications beyond batch operations for that change.
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java
Show resolved
Hide resolved
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java
Show resolved
Hide resolved
It'd make sense for me to have that be it's own thing (one I probably wouldn't own to be completely honest), cc @ndimiduk since I was following his suggestion |
…t 'UnknownException'
I've updated the PR to soleley add some logging and emitting the real meta cache clearing exception in meta cache clear metrics. cc @ndimiduk @droudnitsky |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
The hadoopcheck issue seems unrelated |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is devilishly difficult to follow. There's a complex commit history around this behavior that includes attempted refactors and reverts. It seems like the test harness is pretty good at catching issues though.
It looks like you're partially undoing #4914, which I think is fine. Reading though that PR, I believe the correct implementation would have been to push knowledge of the RejectedExecutionException
down into ClientExceptionsUtil#isMetaClearingException()
. Instead it externalised the logic up here in AsyncRequestFutureImpl
, making it more difficult to follow. I think you haven't done enough and should go ahead and push the exception check down further.
Do you have any suggestions for we we verify that this exception/null pass-down hasn't subtly broken something?
errorsByServer.reportServerError(server); | ||
Retry canRetry = errorsByServer.canTryMore(numAttempt) ? Retry.YES : Retry.NO_RETRIES_EXHAUSTED; | ||
boolean clearServerCache; | ||
|
||
if (t instanceof RejectedExecutionException) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should add RejectedExecutionException
to the predicate in ClientExceptionsUtil#isMetaClearingException()
. Seems dicy to have a special case test here. Or am I missing some wider context?
hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncRequestFutureImpl.java
Show resolved
Hide resolved
Oh. I see now that you and droudnitsky are attempting to cut scope. I don't feel great about leaving the |
The refactor on #578 looked like a good effort. Too bad it had to be reverted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ndimiduk I agree it would have made sense for #4914 to introduce the change through isMetaClearingException
. My suggestion to split the scope here is driven by the complexity in this codepath , my thinking being that it'd be easier to reason about the metrics fix and pushing RejectedExecutionException
down into isMetaClearingException
in two seperate PRs, the latter change is wider in scope. I cannot think of any reason not to push the exception down into isMetaClearingException
but it may benefit from some thinking/review independent of the metrics fix, thats just my personal thought.
Looks good to me thank you Hernan
Okay fair enough, let's take this as it is. |
…xception" (apache#6961) Co-authored-by: Hernan Gelaf-Romer <[email protected]> Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Ray Mattingly <[email protected]>
…xception" (apache#6961) Co-authored-by: Hernan Gelaf-Romer <[email protected]> Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Ray Mattingly <[email protected]>
…xception" (apache#6961) Co-authored-by: Hernan Gelaf-Romer <[email protected]> Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Ray Mattingly <[email protected]>
…xception" (apache#6961) (#180) Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Ray Mattingly <[email protected]> Co-authored-by: Hernan Gelaf-Romer <[email protected]>
No description provided.