Skip to content

fix(graceful-eviction): retry on optimistic-lock conflicts when patching eviction tasks#7494

Open
RyanAtNetflix wants to merge 1 commit into
karmada-io:masterfrom
RyanAtNetflix:eviction-storm/02-graceful-eviction-retry-on-conflict
Open

fix(graceful-eviction): retry on optimistic-lock conflicts when patching eviction tasks#7494
RyanAtNetflix wants to merge 1 commit into
karmada-io:masterfrom
RyanAtNetflix:eviction-storm/02-graceful-eviction-retry-on-conflict

Conversation

@RyanAtNetflix
Copy link
Copy Markdown

What type of PR is this?

/kind bug

What this PR does / why we need it:

The {RB,CRB}GracefulEvictionController periodically patches Spec.GracefulEvictionTasks on bindings to remove completed or expired tasks. The patch uses MergeFromWithOptimisticLock, which encodes the cached metadata.resourceVersion and asks the apiserver to reject the write if the object has changed. The taint manager's binding-eviction worker writes the same field via Update with the same resourceVersion guard. When the two collide between each other's read and write the apiserver returns StatusReasonConflict, and there is no inline retry — the only recovery is the controller's exponential workqueue backoff, which starts at 5ms and grows to 1000s.

In a cluster-failover storm where the same binding cycles through both controllers many times in close succession, the conflict tax dominates wall-clock latency. A 5K-binding lab observed a 99.7% retry rate on the resource-binding-graceful-eviction-controller workqueue (1072 retries across 1075 adds), with workers spending most of their time asleep in backoff rather than doing real work.

This change wraps the read-modify-patch in retry.RetryOnConflict for both the namespaced and cluster-scoped variants. The signature of syncBinding shifts from (ctx, *Binding) to (ctx, types.NamespacedName) because the binding object now lives inside the retry closure: each attempt does a fresh Get so the next patch's MergeFromWithOptimisticLock base carries the latest resourceVersion. Reconcile is simplified accordingly — the apierrors.IsNotFound short-circuit moves up to wrap the syncBinding call.

The eviction event is emitted only on the successful attempt (EmitClusterEvictionEventForResourceBinding / …ForClusterResourceBinding are now called with a nil error, since by the time we reach the loop the patch has succeeded).

Which issue(s) this PR fixes:

Fixes #7483

Special notes for your reviewer:

This is the graceful-eviction-controller half of a conflict pair; the taint-manager half is in a sibling PR (#TODO link to PR1 once filed). Either PR is independently correct, but the throughput improvement only materializes when both ship — they are the two writers that collide on Spec.GracefulEvictionTasks. A third PR makes the eviction worker pool size configurable.

Tests:

  • TestRBGracefulEvictionController_syncBinding_retryOnConflict and the …CRB… twin use interceptor.Funcs to inject two consecutive Conflict responses on Patch and assert the expired eviction task is still removed once the wrapper recovers.
  • TestRBGracefulEvictionController_syncBinding_notFoundSurfacedToReconcile and the …CRB… twin verify that NotFound is surfaced from the closure so that Reconcile can short-circuit, since the closure no longer returns a no-op early.
  • go test -race ./pkg/controllers/gracefuleviction/... is clean.

One heads-up: the pre-existing TestCRBGracefulEvictionController_syncBinding tolerance on the wall-clock-derived retryAfter value is loosened from 100ms to 1s. The original tolerance was already brittle — it compared a metav1.Now() captured at test setup against one captured inside the controller after fake-client setup work. The new Get inside the retry closure pushed the gap past 100ms on slower hosts. 1s is still well within the meaning of the assertion, which is "the controller computed roughly 3 minutes." Happy to factor this differently (e.g. inject a clock) if reviewers prefer.

Local benchmark on a 5K-binding kind lab failing one cluster, with both this PR and the taint-manager PR applied: sustained ~72 RB/s, peak 108 RB/s (vs. ~1.2 RB/s on stock controller-manager defaults). Methodology and raw CSV available on request.

Does this PR introduce a user-facing change?:

`karmada-controller-manager`: Fixed an optimistic-lock conflict storm in the resource-binding and cluster-resource-binding graceful-eviction controllers that caused eviction-task cleanup to stall during cluster-failover events. Both controllers now retry on conflict in-line instead of bouncing through the workqueue's exponential backoff.

…ing eviction tasks

The {RB,CRB}GracefulEvictionController periodically patches
Spec.GracefulEvictionTasks on bindings to remove completed or expired
tasks. The patch uses MergeFromWithOptimisticLock, which encodes the
cached metadata.resourceVersion and asks the apiserver to reject the
write if the object has changed. The taint manager's binding-eviction
worker writes the same field via Update with the same resourceVersion
guard. When the two collide between each other's read and write the
apiserver returns StatusReasonConflict, and there is no inline retry —
the only recovery is the controller's exponential workqueue backoff,
which starts at 5ms and grows to 1000s.

In a cluster-failover storm where the same binding cycles through both
controllers many times in close succession, the conflict tax dominates
wall-clock latency. A 5K-binding lab observed a 99.7% retry rate on the
resource-binding-graceful-eviction-controller workqueue (1072 retries
across 1075 adds), with workers spending most of their time asleep in
backoff rather than doing real work.

This change wraps the read-modify-patch in retry.RetryOnConflict for
both the namespaced and cluster-scoped variants. The signature of
syncBinding shifts from `(ctx, *Binding)` to
`(ctx, types.NamespacedName)` because the binding object now lives
inside the retry closure: each attempt does a fresh Get so the next
patch's MergeFromWithOptimisticLock base carries the latest
resourceVersion. Reconcile is simplified accordingly — the apierrors.IsNotFound
short-circuit moves up to wrap the syncBinding call.

The eviction event is emitted only on the successful attempt
(EmitClusterEvictionEventForResourceBinding / ...ForClusterResourceBinding
are now called with a nil error, since by the time we reach the loop
the patch has succeeded).

A dedicated test
(Test{RB,CRB}GracefulEvictionController_syncBinding_retryOnConflict)
uses an interceptor.Funcs to inject two consecutive Conflict responses
on Patch and asserts that the expired eviction task is still removed
once the wrapper recovers. A companion test
(Test{RB,CRB}GracefulEvictionController_syncBinding_notFoundSurfacedToReconcile)
verifies that NotFound is surfaced from the closure so that Reconcile
can short-circuit, since the closure no longer returns a no-op early.

The pre-existing TestCRBGracefulEvictionController_syncBinding tolerance
on the wall-clock-derived retryAfter value is loosened from 100ms to 1s.
The original tolerance was already brittle: it compared a `metav1.Now()`
captured at test setup against one captured inside the controller after
fake-client setup work. The new Get inside the retry closure pushed the
gap past 100ms on slower hosts. 1s is still well within the meaning of
the assertion, which is "the controller computed roughly 3 minutes."

Signed-off-by: Ryan Copley <ryan@ryancopley.com>
Copilot AI review requested due to automatic review settings May 8, 2026 16:04
@karmada-bot karmada-bot added the kind/bug Categorizes issue or PR as related to a bug. label May 8, 2026
@karmada-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zach593 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a performance bottleneck in the graceful-eviction controllers caused by optimistic-lock conflicts. Previously, collisions with the taint-manager resulted in frequent workqueue backoffs, significantly increasing latency during cluster-failover events. By implementing inline retries for patch operations, the controllers can now resolve these conflicts immediately, drastically improving throughput and stability.

Highlights

  • Optimistic Locking Fix: Wrapped the read-modify-patch logic in retry.RetryOnConflict to handle concurrency issues between the graceful-eviction controller and the taint-manager.
  • Refactored syncBinding: Updated syncBinding to accept types.NamespacedName instead of the binding object, enabling fresh refetches of the resource within the retry closure.
  • Improved Error Handling: Updated Reconcile to properly surface NotFound errors from the retry closure, ensuring correct short-circuiting behavior.
  • Enhanced Test Coverage: Added unit tests using interceptors to simulate conflict scenarios and verify that the controllers successfully recover and perform the necessary cleanup.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 8, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses optimistic-lock conflict storms between the taint-manager eviction worker and the {RB,CRB}GracefulEvictionController when both write Spec.GracefulEvictionTasks on the same binding objects. It improves eviction-task cleanup throughput under heavy churn by retrying conflicts inline rather than relying on the controller-runtime workqueue’s exponential backoff.

Changes:

  • Wrap the read/modify/patch flow in retry.RetryOnConflict for both ResourceBinding and ClusterResourceBinding graceful-eviction controllers.
  • Refactor syncBinding to accept a types.NamespacedName so each retry attempt performs a fresh Get (ensuring the optimistic-lock base has an up-to-date resourceVersion).
  • Add unit tests that inject Conflict errors on Patch and assert successful eventual cleanup; adjust an existing CRB retry-after tolerance to reduce wall-clock brittleness.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
pkg/controllers/gracefuleviction/rb_graceful_eviction_controller.go Refactors RB graceful-eviction sync to refetch+patch under RetryOnConflict and simplifies Reconcile accordingly.
pkg/controllers/gracefuleviction/crb_graceful_eviction_controller.go Same conflict-retry refactor for CRB graceful-eviction sync path.
pkg/controllers/gracefuleviction/rb_graceful_eviction_controller_test.go Updates existing tests for new syncBinding signature and adds conflict/NotFound-focused regression tests.
pkg/controllers/gracefuleviction/crb_graceful_eviction_controller_test.go Updates existing tests for new signature, loosens retry-after tolerance, and adds conflict/NotFound regression tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the graceful eviction controllers for ClusterResourceBinding and ResourceBinding to use retry.RetryOnConflict when patching eviction tasks. This ensures that optimistic lock conflicts result in an immediate refetch and retry, improving the efficiency of concurrent updates. Unit tests were expanded to cover conflict scenarios and timing-sensitive assertions. The reviewer suggested a defensive programming improvement to reset state variables at the beginning of the retry closure to ensure robustness against future code modifications.

Comment on lines +116 to +118
evictedClusters = evictedCluster
bindingForEvent = modifiedObj
retryAfter = nextRetry(keptTask, c.GracefulEvictionTimeout, metav1.Now().Time)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The variables evictedClusters, bindingForEvent, and retryAfter are updated inside the retry closure. While they are correctly reset in early return paths (lines 91-93 and 105-106), ensure that if a Patch call fails with a conflict and is retried, these variables are not left with stale data from the failed attempt. Currently, they are only assigned after a successful Patch (lines 116-118), which is correct. However, for better clarity and to avoid potential issues if the logic grows, consider resetting them at the very beginning of the closure.

Comment on lines +117 to +119
evictedClusters = evictedCluster
bindingForEvent = modifiedObj
retryAfter = nextRetry(keptTask, c.GracefulEvictionTimeout, metav1.Now().Time)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the cluster-scoped controller, the variables evictedClusters, bindingForEvent, and retryAfter are updated inside the retry closure. They are assigned after a successful Patch (lines 117-119), which ensures that only the results of the successful attempt are used for event emission. Consider resetting these variables at the start of the closure to prevent any risk of using stale data from a previous failed attempt in case the logic is modified in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

3 participants