[Proposal]: Extend WorkloadRebalancer with Strategy-based Rebalancing#7662
[Proposal]: Extend WorkloadRebalancer with Strategy-based Rebalancing#7662zhy76 wants to merge 1 commit into
Conversation
Signed-off-by: 浩韵 <zuohaiyu.zhy@alibaba-inc.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a comprehensive proposal to evolve the WorkloadRebalancer into a robust, strategy-based execution framework. By moving away from a single-trigger model, the proposed design enables more sophisticated multi-cluster workload management, such as preserving ready replicas during rescheduling and performing safe, source-preserving migrations. This framework provides the necessary primitives for users to handle complex rescheduling scenarios while maintaining a clear, task-oriented lifecycle for workload distribution. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request proposes extending Karmada's WorkloadRebalancer with strategy-based rebalancing, introducing a framework for safe migration and ready-preserving rescheduling. The reviewer feedback highlights several critical design and implementation improvements: making the Strategy field optional to preserve backward compatibility, refactoring the controller's reconciliation loop to aggregate errors instead of returning early (which would skip status patching), clarifying how workload spec changes are enforced or handled during migration, and documenting the requirement for executors to persist intermediate migration states in underlying resources.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| // Strategy describes how the selected workloads should be rebalanced. | ||
| // It must be specified for the strategy-based rebalance semantics proposed here. | ||
| Strategy RebalanceStrategy `json:"strategy"` |
There was a problem hiding this comment.
To ensure backward compatibility with existing WorkloadRebalancer resources that do not have the strategy field, the Strategy field should be optional and defined as a pointer (*RebalanceStrategy) with omitempty in its JSON tag. This allows the controller to default any omitted strategy to the legacy Reschedule strategy with Full mode, preventing decoding or validation failures for older resources.
| // Strategy describes how the selected workloads should be rebalanced. | |
| // It must be specified for the strategy-based rebalance semantics proposed here. | |
| Strategy RebalanceStrategy `json:"strategy"` | |
| // Strategy describes how the selected workloads should be rebalanced. | |
| // +optional | |
| Strategy *RebalanceStrategy `json:"strategy,omitempty"` |
| if err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Returning an error immediately inside the loop will prevent subsequent workloads from being reconciled in the current run. More importantly, it skips the c.patchStatus(ctx, wr, observed) call entirely, meaning any progress or status updates made by other workloads (or even the current workload before it failed) will not be persisted.
Consider aggregating errors during the loop, continuing to reconcile other workloads, patching the status with the latest progress, and then returning the aggregated errors at the end of the reconciliation.
| if err != nil { | |
| return err | |
| } | |
| if err != nil { | |
| errs = append(errs, err) | |
| continue | |
| } |
| | --- | --- | | ||
| | `from` / `to` must be determined before the WR controller executes. | Missing `from` or `to` is treated as `InvalidStrategyArgs`. | | ||
| | `from` / `to` cannot be changed after entering `Running`. | This avoids target drift during execution, which would make source/target object state hard to recover. | | ||
| | Workload spec cannot be changed during migration. | This avoids changes to the unit list, replica count, placement, or readiness semantics while the executor is deciding which units have migrated or need rollback. | |
There was a problem hiding this comment.
Since the target workload (e.g., Deployment) is a separate resource, the WorkloadRebalancer controller cannot directly prevent users or other controllers from modifying its spec during migration unless there is an admission/validating webhook.
Please clarify how this constraint will be enforced. If a validating webhook is not planned, the executor should be designed to handle workload spec changes gracefully (e.g., by failing the migration safely or recalculating the migration units based on the new spec).
| This avoids relying on in-memory state after controller restart and avoids creating two sources of truth between `phase` and | ||
| the actual object state. |
There was a problem hiding this comment.
Since MigrationUnit is reconstructed from scratch on every reconciliation and is not persisted in the WorkloadRebalancer status, any state regarding which units are currently "in-flight" or "target open" must be stored directly in the underlying Kubernetes resources (such as annotations or spec fields on the ResourceBinding or target workload).
It would be helpful to explicitly document this requirement for the workload-specific unit executors, ensuring they are designed to persist any intermediate migration state in the actual resources rather than relying on the controller's memory or status.
|
@RainbowMango @zhzhuang-zju PTAL, Thanks~ |
There was a problem hiding this comment.
Pull request overview
This PR adds a new scheduling proposal documenting how WorkloadRebalancer could evolve into a strategy-driven execution framework for multiple rescheduling scenarios (full reschedule, preserve-ready reschedule, and source-preserving safe migration), without changing current runtime behavior yet.
Changes:
- Adds a new proposal doc defining strategy-based
WorkloadRebalancersemantics (RescheduleandSafeMigration). - Documents proposed controller/executor architecture, execution flow, progress/status model, and test plan.
- Proposes API surface changes (new
spec.strategy,spec.cancel, and updatedttlSecondsAfterFinishedsemantics).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| --- | ||
| title: Extend WorkloadRebalancer with Strategy-based Rebalancing | ||
| authors: |
|
|
||
| In a multi-cluster environment, workload distribution may gradually drift away from the desired state as clusters recover | ||
| from failures, capacity changes, new clusters are added, workloads scale out, or member cluster utilization changes. Karmada | ||
| needs a set of composable rescheduling primitives that users or operation systems can invoke in different rebalancing |
| | `spec.cancel` | Common | Requests cancellation for an ongoing rebalance. | | ||
| | `spec.ttlSecondsAfterFinished` | Common | Automatically cleans up only when all workloads finish successfully; failed, canceled, or no-progress-timeout objects are kept for troubleshooting. | | ||
| | `strategy.type` | Common | Required. Strategy name. Valid values are `Reschedule` and `SafeMigration`. | |
| // TTLSecondsAfterFinished limits successful finished WorkloadRebalancers. | ||
| // Failed, canceled, or no-progress-timeout objects should be kept for troubleshooting. | ||
| TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"` |
| // Strategy describes how the selected workloads should be rebalanced. | ||
| // It must be specified for the strategy-based rebalance semantics proposed here. | ||
| Strategy RebalanceStrategy `json:"strategy"` |
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7662 +/- ##
==========================================
- Coverage 42.06% 42.05% -0.01%
==========================================
Files 879 879
Lines 54827 54827
==========================================
- Hits 23061 23059 -2
Misses 30022 30022
- Partials 1744 1746 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
RainbowMango
left a comment
There was a problem hiding this comment.
Thanks.
/assign
Putting it into my queue.
And you are always welcome to discuss it at the community meeting.
What type of PR is this?
/kind documentation
What this PR does / why we need it:
This PR adds a proposal for extending
WorkloadRebalancerwith strategy-based rebalancing semantics.The proposal describes how
WorkloadRebalancercan evolve from a single Fresh scheduling trigger into a reusable execution framework for multiple rescheduling scenarios, including:The goal is to define common execution primitives for future rescheduling features while keeping workload-specific readiness and migration semantics behind extensible executors.
Which issue(s) this PR fixes:
Fixes #7621
Special notes for your reviewer:
This is a proposal-only PR. It does not change runtime behavior or APIs yet.
Does this PR introduce a user-facing change?: