-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mvcc: avoid double decrement of watcher gauge on close/cancel race #19600
base: main
Are you sure you want to change the base?
mvcc: avoid double decrement of watcher gauge on close/cancel race #19600
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: kjgorman The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @kjgorman. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This occurs specifically when the watch is for a compacted revision, as there's a possible interleaving of cancel/close that invokes the `cancelWatch` function twice. It's fairly difficult to provoke the race condition but it is possible to observe on `main` the racing test can fail with a negative gauge: ``` $ go test ./... -run TestNewWatcherCountGauge/compacted_watch,_close/cancel_race --- FAIL: TestNewWatcherCountGauge (0.34s) watchable_store_test.go:86: # HELP etcd_debugging_mvcc_watcher_total Total number of watchers. # TYPE etcd_debugging_mvcc_watcher_total gauge -etcd_debugging_mvcc_watcher_total -1 +etcd_debugging_mvcc_watcher_total 0 FAIL FAIL go.etcd.io/etcd/server/v3/storage/mvcc 0.830s ? go.etcd.io/etcd/server/v3/storage/mvcc/testutil [no test files] FAIL ``` It seems as though it is partially expected for the cancel function to be invoked multiple times and to handle that safely (i.e., the existing `ch == nil` check) - the bug here is that in the `if/else if` branches it comes "too late", and multiple invocations where `wa.compacted` is true will both decrement the counter. Shifting the case up one ensures that we can't follow that decrement branch multiple times. In fact, it seems logically more sensible to put this `wa.ch == nil` case _first_, as a guard for the function being invoked multiple times, but moving i before the sync/unsynced watch set delete functions could have a greater inadvertent functional impact (i.e., if we never deleted cancelled watches from these sets it would presumably introduce a leak), so from an abundance of caution I've made the smallest change I think will fix my issue. Signed-off-by: Kieran Gorman <[email protected]>
41c0e72
to
395a038
Compare
I verified your test case against the main branch, with
With the change:
/ok-to-test |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files
... and 20 files with indirect coverage changes @@ Coverage Diff @@
## main #19600 +/- ##
==========================================
- Coverage 68.85% 68.79% -0.07%
==========================================
Files 421 421
Lines 35897 35897
==========================================
- Hits 24717 24694 -23
- Misses 9753 9778 +25
+ Partials 1427 1425 -2 Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solution looks good to me, but it's not my area of expertise.
/cc @fuweid @ahrtr @serathius
This occurs specifically when the watch is for a compacted revision, as there's a possible interleaving of cancel/close that invokes the
cancelWatch
function twice.It's fairly difficult to provoke the race condition but it is possible to observe on
main
the racing test can fail with a negative gauge:It seems as though it is partially expected for the cancel function to be invoked multiple times and to handle that safely (i.e., the existing
ch == nil
check) - the bug here is that in theif/else if
branches it comes "too late", and multiple invocations wherewa.compacted
is true will both decrement the counter. Shifting the case up one ensures that we can't follow that decrement branch multiple times.In fact, it seems logically more sensible to put this
wa.ch == nil
case first, as a guard for the function being invoked multiple times, but moving i before the sync/unsynced watch set delete functions could have a greater inadvertent functional impact (i.e., if we never deleted cancelled watches from these sets it would presumably introduce a leak), so from an abundance of caution I've made the smallest change I think will fix my issue.Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.