feat(controller-manager): make taint-manager binding-eviction concurrency configurable#7495
Conversation
…ency configurable
The NoExecuteTaintManager runs two AsyncWorkers — bindingEvictionWorker
and clusterBindingEvictionWorker — that handle the eviction queues for
ResourceBinding and ClusterResourceBinding respectively. Both worker
pools are sized by NoExecuteTaintManager.ConcurrentReconciles, which
karmada-controller-manager has been hardcoding to 3 since the taint
manager landed:
ConcurrentReconciles: 3,
There is no flag to override this. In a cluster-failover storm, where
the taint manager is the producer that appends GracefulEvictionTasks
for every binding on a failed cluster, that hardcoded ceiling caps
end-to-end eviction throughput regardless of every other tuning knob —
--resource-eviction-rate, --rate-limiter-qps, --kube-api-qps. A
5K-binding lab measured a sustained eviction rate of roughly 1.2 RB/s
at the default; lifting the ceiling to 50 in the same lab raised
sustained throughput to ~82 RB/s (peak 101).
This change introduces --concurrent-binding-eviction-syncs (matching
the existing --concurrent-cluster-syncs / --concurrent-work-syncs
naming convention) with a default of 5 — the same default the rest of
the Concurrent* family uses. Operators with large clusters and fast
control-plane apiservers can raise it to lift the eviction-rate ceiling;
operators with constrained apiservers can lower it. The default is a
modest bump from 3 to 5 rather than the 50 that maxed out throughput
in the lab, because the right sustained value depends on apiserver
sizing — picking 50 unilaterally would shift load that operators
might not be ready for. The flag's help text calls this out and points
operators at the failover-storm trade-off.
Wiring:
* cmd/controller-manager/app/options.Options gains
ConcurrentBindingEvictionSyncs and the matching pflag.
* Validation rejects values <= 0.
* pkg/controllers/context.Options carries it through to the
controller registry.
* cmd/controller-manager/app/controllermanager.go's startClusterController
reads ctx.Opts.ConcurrentBindingEvictionSyncs in place of the
hardcoded 3.
Signed-off-by: Ryan Copley <ryan@ryancopley.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a performance bottleneck in the Karmada controller-manager by making the taint-manager's binding-eviction concurrency configurable. By replacing a hardcoded limit with a tunable flag, operators can now adjust eviction throughput to better suit their specific cluster scale and API server capacity, particularly during high-load scenarios like cluster-failover storms. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a configurable concurrency limit for the taint-manager's binding-eviction reconciles, replacing a hardcoded value with the new ConcurrentBindingEvictionSyncs option and CLI flag. Feedback from the review suggests clarifying the documentation to accurately describe the limit as a global ceiling and to specify that the concurrency value applies to each eviction queue individually.
| // ConcurrentBindingEvictionSyncs is the number of taint-manager binding-eviction | ||
| // reconciles (covering both ResourceBinding and ClusterResourceBinding queues) | ||
| // that are allowed to run concurrently. Raising this lifts the per-cluster cap | ||
| // on how quickly the taint manager can append GracefulEvictionTasks during a | ||
| // cluster-failover storm; lowering it bounds load on karmada-apiserver. |
There was a problem hiding this comment.
The comment mentions a "per-cluster cap", but the NoExecuteTaintManager workers are global to the controller-manager instance. While the bottleneck is most apparent when a single cluster fails, the concurrency limit applies to the total number of bindings being processed from all clusters. Clarifying this as a global ceiling would be more accurate.
| // ConcurrentBindingEvictionSyncs is the number of taint-manager binding-eviction | |
| // reconciles (covering both ResourceBinding and ClusterResourceBinding queues) | |
| // that are allowed to run concurrently. Raising this lifts the per-cluster cap | |
| // on how quickly the taint manager can append GracefulEvictionTasks during a | |
| // cluster-failover storm; lowering it bounds load on karmada-apiserver. | |
| // ConcurrentBindingEvictionSyncs is the number of taint-manager binding-eviction | |
| // reconciles (covering both ResourceBinding and ClusterResourceBinding queues) | |
| // that are allowed to run concurrently. Raising this lifts the ceiling | |
| // on how quickly the taint manager can append GracefulEvictionTasks during a | |
| // cluster-failover storm; lowering it bounds load on karmada-apiserver. |
| flags.IntVar(&o.ConcurrentClusterPropagationPolicySyncs, "concurrent-cluster-propagation-policy-syncs", 1, "The number of ClusterPropagationPolicy that are allowed to sync concurrently.") | ||
| flags.IntVar(&o.ConcurrentResourceTemplateSyncs, "concurrent-resource-template-syncs", 5, "The number of resource templates that are allowed to sync concurrently.") | ||
| flags.IntVar(&o.ConcurrentDependentResourceSyncs, "concurrent-dependent-resource-syncs", 2, "The number of dependent resource that are allowed to sync concurrently.") | ||
| flags.IntVar(&o.ConcurrentBindingEvictionSyncs, "concurrent-binding-eviction-syncs", 5, "The number of taint-manager binding-eviction reconciles that are allowed to run concurrently across both the ResourceBinding and ClusterResourceBinding eviction queues. Raising this lifts the cap on how quickly Karmada can append GracefulEvictionTasks during a cluster-failover storm.") |
There was a problem hiding this comment.
The phrase "across both ... queues" in the help text is slightly ambiguous. Since the implementation in NoExecuteTaintManager uses this value to size the worker pool for each queue independently (meaning the total concurrency is 2 * ConcurrentBindingEvictionSyncs), it would be clearer to use "for each of" instead of "across both".
| flags.IntVar(&o.ConcurrentBindingEvictionSyncs, "concurrent-binding-eviction-syncs", 5, "The number of taint-manager binding-eviction reconciles that are allowed to run concurrently across both the ResourceBinding and ClusterResourceBinding eviction queues. Raising this lifts the cap on how quickly Karmada can append GracefulEvictionTasks during a cluster-failover storm.") | |
| flags.IntVar(&o.ConcurrentBindingEvictionSyncs, "concurrent-binding-eviction-syncs", 5, "The number of taint-manager binding-eviction reconciles that are allowed to run concurrently for each of the ResourceBinding and ClusterResourceBinding eviction queues. Raising this lifts the cap on how quickly Karmada can append GracefulEvictionTasks during a cluster-failover storm.") |
| // ConcurrentBindingEvictionSyncs is the number of taint-manager binding-eviction | ||
| // reconciles (covering both ResourceBinding and ClusterResourceBinding queues) | ||
| // that are allowed to run concurrently. |
There was a problem hiding this comment.
For clarity and consistency with the flag's behavior, it's better to specify that the concurrency limit applies to each queue individually.
| // ConcurrentBindingEvictionSyncs is the number of taint-manager binding-eviction | |
| // reconciles (covering both ResourceBinding and ClusterResourceBinding queues) | |
| // that are allowed to run concurrently. | |
| // ConcurrentBindingEvictionSyncs is the number of taint-manager binding-eviction | |
| // (for each of the ResourceBinding and ClusterResourceBinding queues) | |
| // that are allowed to run concurrently. |
There was a problem hiding this comment.
Pull request overview
Adds an operator-facing concurrency knob to the controller-manager so the taint-manager’s binding-eviction worker pool size is no longer hardcoded, improving eviction throughput during cluster-failover storms while allowing operators to tune control-plane load.
Changes:
- Introduces
--concurrent-binding-eviction-syncs(default 5) and validates it must be> 0. - Plumbs the new option through controller context and uses it to size
NoExecuteTaintManager.ConcurrentReconciles(replacing the previous hardcoded3). - Regenerates
karmada-controller-managercommand-line flags documentation.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
cmd/controller-manager/app/options/options.go |
Adds the new option field and CLI flag wiring for binding-eviction concurrency. |
cmd/controller-manager/app/options/validation.go |
Validates ConcurrentBindingEvictionSyncs is greater than zero. |
cmd/controller-manager/app/options/validation_test.go |
Extends validation tests to cover invalid concurrency values. |
pkg/controllers/context/context.go |
Extends controller context options to carry the new concurrency setting. |
cmd/controller-manager/app/controllermanager.go |
Uses the configured concurrency when constructing the taint-manager and passes it into controller context. |
docs/command-line-flags/karmada-controller-manager.md |
Documents the new flag (regenerated). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // that are allowed to run concurrently. Raising this lifts the per-cluster cap | ||
| // on how quickly the taint manager can append GracefulEvictionTasks during a | ||
| // cluster-failover storm; lowering it bounds load on karmada-apiserver. |
| flags.IntVar(&o.ConcurrentClusterPropagationPolicySyncs, "concurrent-cluster-propagation-policy-syncs", 1, "The number of ClusterPropagationPolicy that are allowed to sync concurrently.") | ||
| flags.IntVar(&o.ConcurrentResourceTemplateSyncs, "concurrent-resource-template-syncs", 5, "The number of resource templates that are allowed to sync concurrently.") | ||
| flags.IntVar(&o.ConcurrentDependentResourceSyncs, "concurrent-dependent-resource-syncs", 2, "The number of dependent resource that are allowed to sync concurrently.") | ||
| flags.IntVar(&o.ConcurrentBindingEvictionSyncs, "concurrent-binding-eviction-syncs", 5, "The number of taint-manager binding-eviction reconciles that are allowed to run concurrently across both the ResourceBinding and ClusterResourceBinding eviction queues. Raising this lifts the cap on how quickly Karmada can append GracefulEvictionTasks during a cluster-failover storm.") |
| --cluster-startup-grace-period duration Specifies the grace period of allowing a cluster to be unresponsive during startup before marking it unhealthy. (default 1m0s) | ||
| --cluster-status-update-frequency duration Specifies how often karmada-controller-manager posts cluster status to karmada-apiserver. (default 10s) | ||
| --cluster-success-threshold duration The duration of successes for the cluster to be considered healthy after recovery. (default 30s) | ||
| --concurrent-binding-eviction-syncs int The number of taint-manager binding-eviction reconciles that are allowed to run concurrently across both the ResourceBinding and ClusterResourceBinding eviction queues. Raising this lifts the cap on how quickly Karmada can append GracefulEvictionTasks during a cluster-failover storm. (default 5) |
What type of PR is this?
/kind feature
What this PR does / why we need it:
The
NoExecuteTaintManagerruns twoAsyncWorkers —bindingEvictionWorkerandclusterBindingEvictionWorker— that handle the eviction queues forResourceBindingandClusterResourceBindingrespectively. Both worker pools are sized byNoExecuteTaintManager.ConcurrentReconciles, whichkarmada-controller-managerhas been hardcoding to3since the taint manager landed:There is no flag to override this. In a cluster-failover storm, where the taint manager is the producer that appends
GracefulEvictionTasksfor every binding on a failed cluster, that hardcoded ceiling caps end-to-end eviction throughput regardless of every other tuning knob —--resource-eviction-rate,--rate-limiter-qps,--kube-api-qps. A 5K-binding lab measured a sustained eviction rate of roughly 1.2 RB/s at the default; lifting the ceiling to 50 in the same lab raised sustained throughput to ~82 RB/s (peak 101).This change introduces
--concurrent-binding-eviction-syncs(matching the existing--concurrent-cluster-syncs/--concurrent-work-syncsnaming convention) with a default of5— the same default the rest of theConcurrent*family uses. Operators with large clusters and fast control-plane apiservers can raise it to lift the eviction-rate ceiling; operators with constrained apiservers can lower it. The default is a modest bump from 3 to 5 rather than the 50 that maxed out throughput in the lab, because the right sustained value depends on apiserver sizing — picking 50 unilaterally would shift load that operators might not be ready for. The flag's help text calls this out and points operators at the failover-storm trade-off.Wiring:
cmd/controller-manager/app/options.OptionsgainsConcurrentBindingEvictionSyncsand the matchingpflag.<= 0.pkg/controllers/context.Optionscarries it through to the controller registry.cmd/controller-manager/app/controllermanager.go'sstartClusterControllerreadsctx.Opts.ConcurrentBindingEvictionSyncsin place of the hardcoded3.docs/command-line-flags/karmada-controller-manager.mdis regenerated viahack/update-command-line-flags.sh.Which issue(s) this PR fixes:
Fixes #7483
Special notes for your reviewer:
This is the third of a 3-PR series addressing eviction throughput during a cluster-failover storm. The first two PRs (#TODO link to PR1, #TODO link to PR2) eliminate an optimistic-lock conflict storm in the read-modify-write paths between the taint manager and the graceful-eviction controllers. With the conflict storm gone, the hardcoded worker-pool ceiling becomes the next bottleneck — this PR removes that ceiling without changing the default behavior in any meaningful way.
Default value is the main thing I'd like input on. I picked
5to match the rest of theConcurrent*family and to keep the default-config behavior close to the historical3. Reviewers may have a stronger view based on apiserver-sizing assumptions for typical Karmada deployments. The 5K-binding lab maxed out at 50, but I am not comfortable making that the upstream default without broader validation across cluster shapes.The flag name
--concurrent-binding-eviction-syncscovers bothResourceBindingandClusterResourceBindingeviction queues, since they shareNoExecuteTaintManager.ConcurrentReconcilestoday. If reviewers prefer separate flags per queue, I can split it.Tests:
TestValidateControllerManagerConfigurationextended to cover the new validation rule (rejecting<= 0).go test ./cmd/controller-manager/... ./pkg/controllers/context/...is clean.go build ./...is clean.hack/update-command-line-flags.shran cleanly; the regeneratedkarmada-controller-manager.mdis included in this commit so theverify-command-line-flags.shcheck stays green.Local benchmark on a 5K-binding kind lab failing one cluster, with all three PRs applied and the flag set to
50: sustained ~72 RB/s, peak 108 RB/s.Does this PR introduce a user-facing change?: