Properly handle grpc dial errors in the throttler metric aggregation #18073

Merged
merged 3 commits into from
Apr 30, 2025

Conversation

arthurschreiber
Contributor

@arthurschreiber arthurschreiber commented Mar 31, 2025

Description

This fixes an issue in the throttler code for aggregating metric results.

Most of this code was lifted from freno, where metrics were collected over a MySQL connection, and the errors returned on connection failure were low-level TCP dial errors.

When the code was moved from freno to Vitess, the connections were changed to use gRPC, so connection failures now surface as gRPC dial errors instead. On top of that, the original errors were not actually passed through: they were wrapped in a fmt.Errorf call without %w, which made the underlying error inaccessible to callers.

As a result, metric aggregation incorrectly treated connection errors as an "unhealthy" state, which caused all users of the throttler (such as VReplication) to throttle.

This PR fixes the issue by wrapping the error with %w and then checking the gRPC status code in IsDialTCPError.

Related Issue(s)

Fixes: #18022

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Contributor

vitess-bot bot commented Mar 31, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added the NeedsBackportReason, NeedsDescriptionUpdate, NeedsIssue, and NeedsWebsiteDocsUpdate labels Mar 31, 2025
@github-actions github-actions bot added this to the v22.0.0 milestone Mar 31, 2025
@arthurschreiber arthurschreiber marked this pull request as ready for review March 31, 2025 13:02
@arthurschreiber arthurschreiber added the Type: Bug, Component: Throttler, Backport to: release-20.0, and Backport to: release-21.0 labels and removed the NeedsDescriptionUpdate, NeedsWebsiteDocsUpdate, NeedsIssue, and NeedsBackportReason labels Mar 31, 2025
Signed-off-by: Arthur Schreiber <[email protected]>
@arthurschreiber arthurschreiber force-pushed the arthur/fix-throttler-dial-errors branch from c173bc5 to 0534dc5 Compare March 31, 2025 13:16
	if s, ok := status.FromError(err); ok {
		return s.Code() == codes.Unavailable || s.Code() == codes.DeadlineExceeded
	}

	switch err := err.(type) {
Contributor Author

I don't think the logic down here makes any sense, because we never have or had access to a net.OpError here. I left this in place, but I think this should be removed to be less confusing.

Comment on lines 846 to 856
func TestIsDialTCPError(t *testing.T) {
	// Verify that IsDialTCPError actually recognizes grpc dial errors
	cc, err := grpcclient.DialContext(t.Context(), ":0", true, grpc.WithTransportCredentials(insecure.NewCredentials()))
	require.NoError(t, err)
	defer cc.Close()

	err = cc.Invoke(context.Background(), "/Fail", nil, nil)

	require.True(t, base.IsDialTCPError(err))
	require.True(t, base.IsDialTCPError(fmt.Errorf("wrapped: %w", err)))
}
Member

We should add a require.False case here

Contributor

Added in a259112

Member

hi! can you sign your commit please, the DCO check is failing

Contributor

Done!

@frouioui frouioui mentioned this pull request Mar 31, 2025
72 tasks
@frouioui frouioui modified the milestones: v22.0.0, v23.0.0 Apr 1, 2025
@frouioui frouioui added the Backport to: release-22.0 label Apr 1, 2025
This was referenced Apr 1, 2025
This was referenced Apr 7, 2025
Signed-off-by: Mohamed Hamza <[email protected]>
@mhamza15 mhamza15 force-pushed the arthur/fix-throttler-dial-errors branch from a259112 to eeaad49 Compare April 22, 2025 18:24
codecov bot commented Apr 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.56%. Comparing base (673e0d3) to head (9727de2).
Report is 11 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18073      +/-   ##
==========================================
+ Coverage   67.54%   67.56%   +0.02%     
==========================================
  Files        1601     1601              
  Lines      261484   261487       +3     
==========================================
+ Hits       176612   176673      +61     
+ Misses      84872    84814      -58     

@vitess-bot vitess-bot mentioned this pull request Apr 24, 2025
31 tasks
@frouioui frouioui mentioned this pull request Apr 28, 2025
37 tasks
@deepthi deepthi merged commit b3d80b2 into vitessio:main Apr 30, 2025
103 of 104 checks passed
vitess-bot pushed a commit that referenced this pull request Apr 30, 2025
…18073)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: Mohamed Hamza <[email protected]>
vitess-bot pushed a commit that referenced this pull request Apr 30, 2025
…18073)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: Mohamed Hamza <[email protected]>
mhamza15 added a commit to github/vitess-gh that referenced this pull request May 5, 2025
…itessio#18073)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: Mohamed Hamza <[email protected]>
mhamza15 added a commit to github/vitess-gh that referenced this pull request May 5, 2025
…itessio#18073) (#155)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: Arthur Schreiber <[email protected]>
arthurschreiber added a commit to github/vitess-gh that referenced this pull request May 6, 2025
…itessio#18073)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: Mohamed Hamza <[email protected]>
arthurschreiber added a commit to github/vitess-gh that referenced this pull request May 6, 2025
…itessio#18073)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: Mohamed Hamza <[email protected]>
arthurschreiber pushed a commit that referenced this pull request May 6, 2025
…18073)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: Mohamed Hamza <[email protected]>
arthurschreiber pushed a commit that referenced this pull request May 6, 2025
…18073)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: Mohamed Hamza <[email protected]>
arthurschreiber added a commit that referenced this pull request May 6, 2025
…18073)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: Mohamed Hamza <[email protected]>
arthurschreiber pushed a commit that referenced this pull request May 6, 2025
…18073)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: Mohamed Hamza <[email protected]>
frouioui pushed a commit that referenced this pull request May 7, 2025
…ic aggregation (#18073) (#18229)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: Arthur Schreiber <[email protected]>
Co-authored-by: Mohamed Hamza <[email protected]>
frouioui pushed a commit that referenced this pull request May 7, 2025
…ic aggregation (#18073) (#18231)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Mohamed Hamza <[email protected]>
timvaillancourt added a commit to slackhq/vitess that referenced this pull request Jun 18, 2025
* [release-22.0] Bump to `v22.0.1-SNAPSHOT` after the `v22.0.0` release (vitessio#18225)

Signed-off-by: Florent Poinsard <[email protected]>
Co-authored-by: Florent Poinsard <[email protected]>

* [release-22.0] fix: Preserve multi-column TupleExpr in tuple simplifier (vitessio#18216) (vitessio#18220)

Signed-off-by: Harshit Gangal <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Harshit Gangal <[email protected]>

* [release-22.0] Properly handle grpc dial errors in the throttler metric aggregation (vitessio#18073) (vitessio#18231)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Mohamed Hamza <[email protected]>

* [release-22.0] test: TestQueryTimeoutWithShardTargeting fix flaky test (vitessio#18242) (vitessio#18250)

Signed-off-by: Harshit Gangal <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>

* [release-22.0] make sure to give MEMBER OF the correct precedence (vitessio#18237) (vitessio#18245)

Signed-off-by: Andres Taylor <[email protected]>
Co-authored-by: Andrés Taylor <[email protected]>

* [release-22.0] Fix evalengine crashes on unexpected types (vitessio#18254) (vitessio#18258)

Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>

* [release-22.0] Fix subquery merging regression introduced in vitessio#11379 (vitessio#18260) (vitessio#18263)

Signed-off-by: Arthur Schreiber <[email protected]>
Co-authored-by: Andrés Taylor <[email protected]>

* [release-22.0] json array insert test (vitessio#18284) (vitessio#18286)

Signed-off-by: Harshit Gangal <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Harshit Gangal <[email protected]>

* [release-22.0] Fix `SET` and `START TRANSACTION` in create procedure statements (vitessio#18279) (vitessio#18293)

Signed-off-by: Manan Gupta <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>

* [release-22.0] Fix deadlock in semi-sync monitor (vitessio#18276) (vitessio#18290)

Signed-off-by: Manan Gupta <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>

* [release-22.0] Upgrade the Golang version to `go1.24.3` (vitessio#18239)

Signed-off-by: GitHub <[email protected]>
Signed-off-by: Andres Taylor <[email protected]>
Co-authored-by: frouioui <[email protected]>
Co-authored-by: Andres Taylor <[email protected]>

* [release-22.0] Atomic Copy: Handle error that was ignored while streaming tables and log it (vitessio#18313) (vitessio#18316)

Signed-off-by: Rohit Nayak <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>

* [release-22.0] fix: handle dml query for None opcode (vitessio#18326) (vitessio#18345)

Signed-off-by: Harshit Gangal <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Harshit Gangal <[email protected]>

* [release-22.0] fix: keep LIMIT/OFFSET even when merging UNION queries (vitessio#18361) (vitessio#18363)

Signed-off-by: Andres Taylor <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Andres Taylor <[email protected]>

* [release-22.0] Fix: Deadlock in `Close` and `write` in semi-sync monitor. (vitessio#18359) (vitessio#18368)

Signed-off-by: Manan Gupta <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>

* [release-22.0] Upgrade the Golang version to `go1.24.4` (vitessio#18329)

Signed-off-by: GitHub <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Co-authored-by: frouioui <[email protected]>
Co-authored-by: Manan Gupta <[email protected]>

* [release-22.0] fix version issue when using --mysql-shell-speedup-restore=true (vitessio#18310) (vitessio#18356)

Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>

* [release-22.0] Split workflow with flaky vdiff2 e2e test. Skip flaky Migrate test. (vitessio#18300) (vitessio#18334)

Signed-off-by: Rohit Nayak <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>

* [release-22.0] Throttler: keep watching topo even on error (vitessio#18223) (vitessio#18322)

Signed-off-by: Shlomi Noach <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>

* [release-22.0] Code Freeze for `v22.0.1` (vitessio#18374)

Signed-off-by: Manan Gupta <[email protected]>

* [release-22.0] Release of `v22.0.1` (vitessio#18375)

Signed-off-by: Manan Gupta <[email protected]>

* add private repo config to new CI file

Signed-off-by: Tim Vaillancourt <[email protected]>

* `make generate_ci_workflows`

Signed-off-by: Tim Vaillancourt <[email protected]>

---------

Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Harshit Gangal <[email protected]>
Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Mohamed Hamza <[email protected]>
Signed-off-by: Andres Taylor <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: GitHub <[email protected]>
Signed-off-by: Rohit Nayak <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Co-authored-by: vitess-bot <[email protected]>
Co-authored-by: Florent Poinsard <[email protected]>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Harshit Gangal <[email protected]>
Co-authored-by: Mohamed Hamza <[email protected]>
Co-authored-by: Andrés Taylor <[email protected]>
Co-authored-by: frouioui <[email protected]>
Co-authored-by: Manan Gupta <[email protected]>
Co-authored-by: Manan Gupta <[email protected]>
Labels
Backport to: release-20.0, Backport to: release-21.0, Backport to: release-22.0, Component: Throttler, Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: Throttler doesn't ignore connection errors
5 participants