-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Properly handle grpc dial errors in the throttler metric aggregation #18073
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Properly handle grpc dial errors in the throttler metric aggregation #18073
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Signed-off-by: Arthur Schreiber <[email protected]>
c173bc5
to
0534dc5
Compare
if s, ok := status.FromError(err); ok { | ||
return s.Code() == codes.Unavailable || s.Code() == codes.DeadlineExceeded | ||
} | ||
|
||
switch err := err.(type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the logic down here makes any sense, because we never have or had access to a net.OpError
here. I left this in place, but I think this should be removed to be less confusing.
func TestIsDialTCPError(t *testing.T) { | ||
// Verify that IsDialTCPError actually recognizes grpc dial errors | ||
cc, err := grpcclient.DialContext(t.Context(), ":0", true, grpc.WithTransportCredentials(insecure.NewCredentials())) | ||
require.NoError(t, err) | ||
defer cc.Close() | ||
|
||
err = cc.Invoke(context.Background(), "/Fail", nil, nil) | ||
|
||
require.True(t, base.IsDialTCPError(err)) | ||
require.True(t, base.IsDialTCPError(fmt.Errorf("wrapped: %w", err))) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a require.False
case here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in a259112
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi! can you sign your commit please, the DCO check is failing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Signed-off-by: Mohamed Hamza <[email protected]>
a259112
to
eeaad49
Compare
…r-dial-errors Signed-off-by: Mohamed Hamza <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #18073 +/- ##
=======================================
Coverage ? 67.56%
=======================================
Files ? 1601
Lines ? 261487
Branches ? 0
=======================================
Hits ? 176673
Misses ? 84814
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Description
This fixes an issue in the throttler code for aggregating metric results.
Most of this code was lifted from freno, where metrics were collected using a MySQL connection, and the errors returned on connection failure were low level tcp dial errors.
When the code was moved from freno to Vitess, the connections were changed to use gRPC, and the errors returned on connection failure are now grpc dial errors instead. On top of that, the errors were actually not being fully passed through, as they are being wrapped in a
fmt.Errorf
call (but without using%w
to actually allow access to the original error).This caused metric aggregation to incorrectly treat connection errors as an "unhealthy" state, causing all users of the throttler (like
VReplication
) to throttle.This fixes this issue by correctly wrapping the error using
%w
and then checking the grpc status code inIsDialTCPError
.Related Issue(s)
Fixes: #18022
Checklist
Deployment Notes