HBASE-29372: Meta cache clear metrics and logs shouldn't use "UnknownException" #6961

Merged
merged 1 commit into from
Jun 4, 2025

Conversation

Contributor

@hgromer hgromer commented May 5, 2025

No description provided.

@hgromer hgromer marked this pull request as draft May 5, 2025 17:53
@hgromer hgromer changed the title from HBASE-29265: Operation timeouts can create a pathological feedback loop with multigets to HBASE-29265: Batch calls to overloaded cluster can cause meta hotspotting on May 5, 2025
@hgromer hgromer marked this pull request as ready for review May 6, 2025 14:11
@@ -783,8 +787,7 @@ private void receiveGlobalFailure(MultiAction rsActions, ServerName server, int
   // any of the regions in the MultiAction and do not update cache if exception is
   // from failing to submit action to thread pool
   if (clearServerCache) {
-    updateCachedLocations(server, regionName, row,
-      ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
+    updateCachedLocations(server, regionName, row, t);
Contributor Author

@hgromer hgromer May 6, 2025

This also solves the frustration of seeing "UnknownException" when inspecting meta cache clear exception metrics. This has made it quite difficult to track down what triggered the meta cache clear.

I think it's always better to provide more context than less. Even if an exception is meta cache clearing (though it will be now), I'd still prefer to know the exact exception type that cleared the meta cache.

Contributor

+1 would be good to preserve the exception for updateCachedLocations

Since we currently pass null to updateCachedLocations if we have a meta cache clearing exception, does that mean that we never update the cache clearing exception metric properly for cache clears coming from receiveGlobalFailure?

Contributor Author

What we'll do is basically "mask" the cache clearing exception by reporting an UnknownException. The code for that lives in the metrics class. It's annoying because that, coupled with the lack of any logging in this code path, makes it really difficult to determine what caused these meta cache clears.
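
The masking described in this comment can be illustrated with a minimal, self-contained sketch. It is not the HBase metrics class: `CacheClearMetricsSketch`, `recordCacheClear`, and `count` are hypothetical names, and the only behaviour assumed is the one stated in the thread, namely that a null exception ends up counted as "UnknownException".

```java
import java.net.ConnectException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a client metrics class; not the real HBase API.
public class CacheClearMetricsSketch {
  static final String UNKNOWN = "UnknownException";

  private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

  /** Counts a cache clear under the cause's simple class name, or UNKNOWN when the cause is null. */
  void recordCacheClear(Throwable cause) {
    String key = cause == null ? UNKNOWN : cause.getClass().getSimpleName();
    counters.computeIfAbsent(key, k -> new LongAdder()).increment();
  }

  /** Returns how many cache clears were recorded under the given key. */
  long count(String key) {
    LongAdder adder = counters.get(key);
    return adder == null ? 0L : adder.sum();
  }

  public static void main(String[] args) {
    CacheClearMetricsSketch metrics = new CacheClearMetricsSketch();
    Throwable t = new ConnectException("connection refused");

    // Call site that replaces the cause with null: the metric loses the exception type.
    metrics.recordCacheClear(null);
    // Call site that passes the cause through: the metric is keyed by the real type.
    metrics.recordCacheClear(t);

    System.out.println(UNKNOWN + " = " + metrics.count(UNKNOWN));
    System.out.println("ConnectException = " + metrics.count("ConnectException"));
  }
}
```

Under these assumptions, the only difference between the two call sites is the counter name, which is why passing `t` through makes the metric useful for diagnosis.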

Contributor Author

hgromer commented May 6, 2025

cc @ndimiduk @rmdmattingly @krconv

errorsByServer.reportServerError(server);
Retry canRetry = errorsByServer.canTryMore(numAttempt) ? Retry.YES : Retry.NO_RETRIES_EXHAUSTED;
boolean clearServerCache = false;

if (!(t instanceof RejectedExecutionException)) {
Contributor Author

Enforces the constraints added in https://issues.apache.org/jira/browse/HBASE-27491

Member

I think it would be better if you instead push RejectedExecutionException down into ClientExceptionsUtil.isMetaClearingException.

How about adding another collection of execution-exceptions for the family of various ExecutorService interaction errors, like is done with networking/connection exceptions?
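
One way to read this suggestion is sketched below. It is an illustration only: `MetaClearingPredicateSketch` and `EXECUTION_EXCEPTIONS` are hypothetical names, the set holds just the one exception named in this thread, and treating every other exception as meta clearing is a simplifying assumption of the sketch rather than the behaviour of the real `ClientExceptionsUtil`.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical sketch; the real ClientExceptionsUtil may group exceptions differently.
public class MetaClearingPredicateSketch {
  // Family of ExecutorService interaction errors. More types could be added here.
  private static final Set<Class<? extends Throwable>> EXECUTION_EXCEPTIONS;
  static {
    Set<Class<? extends Throwable>> types = new HashSet<>();
    types.add(RejectedExecutionException.class);
    EXECUTION_EXCEPTIONS = Collections.unmodifiableSet(types);
  }

  /** True when t is an instance of one of the grouped execution exception types. */
  static boolean isExecutionException(Throwable t) {
    for (Class<? extends Throwable> type : EXECUTION_EXCEPTIONS) {
      if (type.isInstance(t)) {
        return true;
      }
    }
    return false;
  }

  /**
   * Simplified predicate: an execution exception means the request was never
   * submitted, so it says nothing about region locations and does not clear
   * the cache. Everything else is treated as clearing (assumption of this sketch).
   */
  static boolean isMetaClearingException(Throwable t) {
    return !isExecutionException(t);
  }

  public static void main(String[] args) {
    // prints "false"
    System.out.println(isMetaClearingException(new RejectedExecutionException("pool full")));
    // prints "true"
    System.out.println(isMetaClearingException(new IOException("region unreachable")));
  }
}
```

With the check living inside the predicate, a caller such as receiveGlobalFailure would no longer need its own `instanceof RejectedExecutionException` special case.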

Contributor Author

Noted, adding that

Member

@hgromer I think you dropped my earlier comment.

Contributor

@droudnitsky droudnitsky left a comment

Looks like the new scope of the work is

  1. Pass the meta clearing exception to updateCachedLocations so we can properly capture the exception in the client metric instead of unknown exception
  2. Pushing RejectedExecutionException down into isMetaClearingException

I am wondering if it makes sense to split these into two different issues given the change in scope? My initial thought is in agreement that RejectedExecutionException should not be meta cache clearing, but maybe it would be beneficial to handle that change independently of the client metric fix, since there may be implications beyond batch operations for that change.

Contributor Author

hgromer commented May 28, 2025

> Looks like the new scope of the work is
>
>   1. Pass the meta clearing exception to updateCachedLocations so we can properly capture the exception in the client metric instead of unknown exception
>   2. Pushing RejectedExecutionException down into isMetaClearingException
>
> I am wondering if it makes sense to split these into two different issues given the change in scope? My initial thought is in agreement that RejectedExecutionException should not be meta cache clearing, but maybe it would be beneficial to handle that change independently of the client metric fix, since there may be implications beyond batch operations for that change.

It'd make sense for me to have that be its own thing (one I probably wouldn't own, to be completely honest). cc @ndimiduk since I was following his suggestion.

@hgromer hgromer changed the title from HBASE-29265: Batch calls to overloaded cluster can cause meta hotspotting to HBASE-29265: Meta cache clear metrics and logs shouldn't use "UnknownException" on May 29, 2025
Contributor Author

hgromer commented May 29, 2025

I've updated the PR to solely add some logging and emit the real meta cache clearing exception in meta cache clear metrics. cc @ndimiduk @droudnitsky

Contributor Author

hgromer commented May 30, 2025

The hadoopcheck issue seems unrelated

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 44s Docker mode activated.
-0 ⚠️ yetus 0m 5s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 20s branch-2 passed
+1 💚 compile 0m 19s branch-2 passed
+1 💚 javadoc 0m 17s branch-2 passed
+1 💚 shadedjars 5m 31s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 2m 30s the patch passed
+1 💚 compile 0m 20s the patch passed
+1 💚 javac 0m 20s the patch passed
+1 💚 javadoc 0m 16s the patch passed
+1 💚 shadedjars 5m 28s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 7m 58s hbase-client in the patch passed.
27m 51s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/5/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #6961
JIRA Issue HBASE-29265
Optional Tests javac javadoc unit compile shadedjars
uname Linux c3ae91027442 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / d138309
Default Java Temurin-1.8.0_412-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/5/testReport/
Max. process+thread count 366 (vs. ulimit of 30000)
modules C: hbase-client U: hbase-client
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/5/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 52s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 32s branch-2 passed
+1 💚 compile 0m 56s branch-2 passed
+1 💚 checkstyle 0m 31s branch-2 passed
+1 💚 spotbugs 1m 15s branch-2 passed
+1 💚 spotless 1m 3s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 48s the patch passed
+1 💚 compile 1m 4s the patch passed
+1 💚 javac 1m 4s the patch passed
+1 💚 blanks 0m 1s The patch has no blanks issues.
+1 💚 checkstyle 0m 25s the patch passed
+1 💚 spotbugs 1m 13s the patch passed
+1 💚 hadoopcheck 20m 38s Patch does not cause any errors with Hadoop 2.10.2 or 3.3.6 3.4.0.
+1 💚 spotless 1m 21s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 16s The patch does not generate ASF License warnings.
39m 10s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/5/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #6961
JIRA Issue HBASE-29265
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname Linux 8cf4818e13ab 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / d138309
Default Java Eclipse Adoptium-11.0.23+9
Max. process+thread count 79 (vs. ulimit of 30000)
modules C: hbase-client U: hbase-client
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/5/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 55s Docker mode activated.
-0 ⚠️ yetus 0m 5s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 53s branch-2 passed
+1 💚 compile 0m 27s branch-2 passed
+1 💚 javadoc 0m 22s branch-2 passed
+1 💚 shadedjars 7m 12s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 33s the patch passed
+1 💚 compile 0m 23s the patch passed
+1 💚 javac 0m 23s the patch passed
+1 💚 javadoc 0m 19s the patch passed
+1 💚 shadedjars 6m 59s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 7m 56s hbase-client in the patch passed.
33m 13s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/5/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #6961
JIRA Issue HBASE-29265
Optional Tests javac javadoc unit compile shadedjars
uname Linux b5185e8fb582 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / d138309
Default Java Eclipse Adoptium-11.0.23+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/5/testReport/
Max. process+thread count 367 (vs. ulimit of 30000)
modules C: hbase-client U: hbase-client
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/5/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 41s Docker mode activated.
-0 ⚠️ yetus 0m 5s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 21s branch-2 passed
+1 💚 compile 0m 23s branch-2 passed
+1 💚 javadoc 0m 18s branch-2 passed
+1 💚 shadedjars 6m 21s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 7s the patch passed
+1 💚 compile 0m 23s the patch passed
+1 💚 javac 0m 23s the patch passed
+1 💚 javadoc 0m 17s the patch passed
+1 💚 shadedjars 6m 20s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 8m 15s hbase-client in the patch passed.
30m 32s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/5/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #6961
JIRA Issue HBASE-29265
Optional Tests javac javadoc unit compile shadedjars
uname Linux 8ef610a3672c 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / d138309
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/5/testReport/
Max. process+thread count 371 (vs. ulimit of 30000)
modules C: hbase-client U: hbase-client
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6961/5/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

Member

@ndimiduk ndimiduk left a comment

This code is devilishly difficult to follow. There's a complex commit history around this behavior that includes attempted refactors and reverts. It seems like the test harness is pretty good at catching issues though.

It looks like you're partially undoing #4914, which I think is fine. Reading through that PR, I believe the correct implementation would have been to push knowledge of the RejectedExecutionException down into ClientExceptionsUtil#isMetaClearingException(). Instead it externalised the logic up here in AsyncRequestFutureImpl, making it more difficult to follow. I think you haven't done enough and should go ahead and push the exception check down further.

Do you have any suggestions for how we verify that this exception/null pass-down hasn't subtly broken something?

errorsByServer.reportServerError(server);
Retry canRetry = errorsByServer.canTryMore(numAttempt) ? Retry.YES : Retry.NO_RETRIES_EXHAUSTED;
boolean clearServerCache;

if (t instanceof RejectedExecutionException) {
Member

I think that we should add RejectedExecutionException to the predicate in ClientExceptionsUtil#isMetaClearingException(). Seems dicey to have a special case test here. Or am I missing some wider context?

@ndimiduk ndimiduk requested a review from Apache9 June 3, 2025 09:25
Member

ndimiduk commented Jun 3, 2025

> > Looks like the new scope of the work is
> >
> >   1. Pass the meta clearing exception to updateCachedLocations so we can properly capture the exception in the client metric instead of unknown exception
> >   2. Pushing RejectedExecutionException down into isMetaClearingException
> >
> > I am wondering if it makes sense to split these into two different issues given the change in scope? My initial thought is in agreement that RejectedExecutionException should not be meta cache clearing, but maybe it would be beneficial to handle that change independently of the client metric fix, since there may be implications beyond batch operations for that change.
>
> It'd make sense for me to have that be its own thing (one I probably wouldn't own, to be completely honest). cc @ndimiduk since I was following his suggestion.

Oh. I see now that you and droudnitsky are attempting to cut scope. I don't feel great about leaving the RejectedExecutionException as it is, but if you'd prefer to isolate that change, then let's proceed with what you have.

Member

ndimiduk commented Jun 3, 2025

The refactor on #578 looked like a good effort. Too bad it had to be reverted.

Contributor

@droudnitsky droudnitsky left a comment

@ndimiduk I agree it would have made sense for #4914 to introduce the change through isMetaClearingException. My suggestion to split the scope here is driven by the complexity in this codepath. My thinking is that it'd be easier to reason about the metrics fix and pushing RejectedExecutionException down into isMetaClearingException in two separate PRs, since the latter change is wider in scope. I cannot think of any reason not to push the exception down into isMetaClearingException, but it may benefit from some thinking/review independent of the metrics fix. That's just my personal thought.

Looks good to me, thank you Hernan.

Member

ndimiduk commented Jun 3, 2025

Okay fair enough, let's take this as it is.

@hgromer hgromer changed the title from HBASE-29265: Meta cache clear metrics and logs shouldn't use "UnknownException" to HBASE-29372: Meta cache clear metrics and logs shouldn't use "UnknownException" on Jun 4, 2025
@rmdmattingly rmdmattingly merged commit 6fb44d0 into apache:branch-2 Jun 4, 2025
1 check passed
hgromer added a commit to HubSpot/hbase that referenced this pull request Jun 4, 2025
…xception" (apache#6961)

Co-authored-by: Hernan Gelaf-Romer <[email protected]>
Signed-off-by: Duo Zhang <[email protected]>
Signed-off-by: Ray Mattingly <[email protected]>
hgromer added a commit to HubSpot/hbase that referenced this pull request Jun 4, 2025
…xception" (apache#6961) (#180)

Signed-off-by: Duo Zhang <[email protected]>
Signed-off-by: Ray Mattingly <[email protected]>
Co-authored-by: Hernan Gelaf-Romer <[email protected]>
6 participants