Properly handle grpc dial errors in the throttler metric aggregation #18073

Open: 3 commits to be merged into base: main

@arthurschreiber arthurschreiber commented Mar 31, 2025

Description

This fixes an issue in the throttler code for aggregating metric results.

Most of this code was lifted from freno, where metrics were collected over a MySQL connection and connection failures surfaced as low-level TCP dial errors.

When the code was moved from freno to Vitess, the connections were changed to use gRPC, so connection failures now surface as gRPC dial errors instead. On top of that, the original errors were not being passed through at all: they were wrapped in a fmt.Errorf call without %w, which keeps only the error text and leaves callers unable to inspect the original error.

This caused metric aggregation to incorrectly treat connection errors as an "unhealthy" state, causing all users of the throttler (like VReplication) to throttle.

This PR fixes the issue by wrapping the error with %w and then checking the gRPC status code in IsDialTCPError.

Related Issue(s)

Fixes: #18022

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

vitess-bot bot commented Mar 31, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added the NeedsBackportReason, NeedsDescriptionUpdate, NeedsIssue, and NeedsWebsiteDocsUpdate labels Mar 31, 2025
@github-actions github-actions bot added this to the v22.0.0 milestone Mar 31, 2025
@arthurschreiber arthurschreiber marked this pull request as ready for review March 31, 2025 13:02
@arthurschreiber arthurschreiber added the Type: Bug, Component: Throttler, Backport to: release-20.0, and Backport to: release-21.0 labels and removed the NeedsDescriptionUpdate, NeedsWebsiteDocsUpdate, NeedsIssue, and NeedsBackportReason labels Mar 31, 2025
Signed-off-by: Arthur Schreiber <[email protected]>
@arthurschreiber arthurschreiber force-pushed the arthur/fix-throttler-dial-errors branch from c173bc5 to 0534dc5 Compare March 31, 2025 13:16
if s, ok := status.FromError(err); ok {
	return s.Code() == codes.Unavailable || s.Code() == codes.DeadlineExceeded
}

switch err := err.(type) {
Contributor Author

I don't think the logic down here makes any sense, because we never have or had access to a net.OpError here. I left this in place, but I think this should be removed to be less confusing.

Comment on lines 846 to 856
func TestIsDialTCPError(t *testing.T) {
	// Verify that IsDialTCPError actually recognizes grpc dial errors
	cc, err := grpcclient.DialContext(t.Context(), ":0", true, grpc.WithTransportCredentials(insecure.NewCredentials()))
	require.NoError(t, err)
	defer cc.Close()

	err = cc.Invoke(context.Background(), "/Fail", nil, nil)

	require.True(t, base.IsDialTCPError(err))
	require.True(t, base.IsDialTCPError(fmt.Errorf("wrapped: %w", err)))
}
Member

We should add a require.False case here

Contributor

Added in a259112

Member

hi! can you sign your commit please, the DCO check is failing

Contributor

Done!

@frouioui frouioui mentioned this pull request Mar 31, 2025
72 tasks
@frouioui frouioui modified the milestones: v22.0.0, v23.0.0 Apr 1, 2025
@frouioui frouioui added the Backport to: release-22.0 label Apr 1, 2025
This was referenced Apr 1, 2025
This was referenced Apr 7, 2025
Signed-off-by: Mohamed Hamza <[email protected]>
@mhamza15 mhamza15 force-pushed the arthur/fix-throttler-dial-errors branch from a259112 to eeaad49 Compare April 22, 2025 18:24

codecov bot commented Apr 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@673e0d3). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #18073   +/-   ##
=======================================
  Coverage        ?   67.56%           
=======================================
  Files           ?     1601           
  Lines           ?   261487           
  Branches        ?        0           
=======================================
  Hits            ?   176673           
  Misses          ?    84814           
  Partials        ?        0           

@vitess-bot vitess-bot mentioned this pull request Apr 24, 2025
31 tasks
Labels
Backport to: release-20.0, Backport to: release-21.0, Backport to: release-22.0, Component: Throttler, Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: Throttler doesn't ignore connection errors
3 participants