Properly handle grpc dial errors in the throttler metric aggregation #18073
Signed-off-by: Arthur Schreiber <[email protected]>
c173bc5 to 0534dc5
    if s, ok := status.FromError(err); ok {
        return s.Code() == codes.Unavailable || s.Code() == codes.DeadlineExceeded
    }

    switch err := err.(type) {
I don't think the logic down here makes any sense, because we never have or had access to a net.OpError here. I left this in place, but I think this should be removed to be less confusing.
    func TestIsDialTCPError(t *testing.T) {
        // Verify that IsDialTCPError actually recognizes grpc dial errors
        cc, err := grpcclient.DialContext(t.Context(), ":0", true, grpc.WithTransportCredentials(insecure.NewCredentials()))
        require.NoError(t, err)
        defer cc.Close()

        err = cc.Invoke(context.Background(), "/Fail", nil, nil)

        require.True(t, base.IsDialTCPError(err))
        require.True(t, base.IsDialTCPError(fmt.Errorf("wrapped: %w", err)))
    }
We should add a require.False case here.
Added in a259112
hi! can you sign your commit please, the DCO check is failing
Done!
Signed-off-by: Mohamed Hamza <[email protected]>
a259112 to eeaad49
…r-dial-errors Signed-off-by: Mohamed Hamza <[email protected]>
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18073      +/-   ##
==========================================
+ Coverage   67.54%   67.56%   +0.02%
==========================================
  Files        1601     1601
  Lines      261484   261487       +3
==========================================
+ Hits       176612   176673      +61
+ Misses      84872    84814      -58
…18073) Signed-off-by: Arthur Schreiber <[email protected]> Signed-off-by: Mohamed Hamza <[email protected]> Co-authored-by: Mohamed Hamza <[email protected]>
…itessio#18073) Signed-off-by: Arthur Schreiber <[email protected]> Signed-off-by: Mohamed Hamza <[email protected]> Co-authored-by: Mohamed Hamza <[email protected]>
…itessio#18073) (#155) Signed-off-by: Arthur Schreiber <[email protected]> Signed-off-by: Mohamed Hamza <[email protected]> Co-authored-by: Arthur Schreiber <[email protected]>
…ic aggregation (#18073) (#18229) Signed-off-by: Arthur Schreiber <[email protected]> Signed-off-by: Mohamed Hamza <[email protected]> Co-authored-by: Arthur Schreiber <[email protected]> Co-authored-by: Mohamed Hamza <[email protected]>
…ic aggregation (#18073) (#18231) Signed-off-by: Arthur Schreiber <[email protected]> Signed-off-by: Mohamed Hamza <[email protected]> Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com> Co-authored-by: Mohamed Hamza <[email protected]>
Description
This fixes an issue in the throttler code for aggregating metric results.
Most of this code was lifted from freno, where metrics were collected using a MySQL connection, and the errors returned on connection failure were low-level TCP dial errors.
When the code was moved from freno to Vitess, the connections were changed to use gRPC, and the errors returned on connection failure are now gRPC dial errors instead. On top of that, the errors were not being fully passed through, as they were wrapped in a fmt.Errorf call (but without using %w to allow access to the original error).
This caused metric aggregation to incorrectly treat connection errors as an "unhealthy" state, causing all users of the throttler (like VReplication) to throttle.
This PR fixes the issue by correctly wrapping the error using %w and then checking the gRPC status code in IsDialTCPError.
Related Issue(s)
Fixes: #18022
Checklist
Deployment Notes