
[Proposal]: Extend WorkloadRebalancer with Strategy-based Rebalancing #7662

Open
zhy76 wants to merge 1 commit into karmada-io:master from zhy76:feat/wr-proposal

Conversation

@zhy76 zhy76 commented Jun 23, 2026

Member

What type of PR is this?

/kind documentation

What this PR does / why we need it:

This PR adds a proposal for extending WorkloadRebalancer with strategy-based rebalancing semantics.

The proposal describes how WorkloadRebalancer can evolve from a single Fresh scheduling trigger into a reusable execution framework for multiple rescheduling scenarios, including:

  • full rescheduling;
  • preserving ready replicas while rescheduling only unavailable replicas;
  • source-preserving safe migration with workload-specific migration units.

The goal is to define common execution primitives for future rescheduling features while keeping workload-specific readiness and migration semantics behind extensible executors.

Which issue(s) this PR fixes:

Fixes #7621

Special notes for your reviewer:

This is a proposal-only PR. It does not change runtime behavior or APIs yet.

Does this PR introduce a user-facing change?:

NONE

Signed-off-by: 浩韵 <zuohaiyu.zhy@alibaba-inc.com>
Copilot AI review requested due to automatic review settings June 23, 2026 12:44
@karmada-bot karmada-bot added the kind/documentation Categorizes issue or PR as related to documentation. label Jun 23, 2026
@karmada-bot karmada-bot requested review from Tingtal and seanlaii June 23, 2026 12:44
@karmada-bot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign whitewindmills for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist


Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive proposal to evolve the WorkloadRebalancer into a robust, strategy-based execution framework. By moving away from a single-trigger model, the proposed design enables more sophisticated multi-cluster workload management, such as preserving ready replicas during rescheduling and performing safe, source-preserving migrations. This framework provides the necessary primitives for users to handle complex rescheduling scenarios while maintaining a clear, task-oriented lifecycle for workload distribution.

Highlights

  • Strategy-based Rebalancing Framework: Introduces a new strategy-based execution framework for WorkloadRebalancer, allowing for more granular control over rescheduling scenarios beyond simple 'Fresh' scheduling.
  • New Rebalancing Strategies: Defines two primary strategy families: 'Reschedule' (supporting Full and PreserveReady modes) and 'SafeMigration' (enabling source-preserving migration between member clusters).
  • Extensible Executor Pattern: Proposes a modular controller architecture where strategy-specific logic is encapsulated in executors, keeping the core controller lightweight and extensible for future workload types.
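The executor pattern described in the highlights could look roughly like the following sketch. The `Executor` interface, the `Registry` type, the `run` helper, and the output strings are all hypothetical names chosen for illustration; the proposal itself may define these differently.

```go
package main

import (
	"context"
	"fmt"
)

// Executor is a hypothetical interface: one implementation per strategy type.
type Executor interface {
	Execute(ctx context.Context, workload string) (string, error)
}

// rescheduleExecutor stands in for the Reschedule family (Full or PreserveReady).
type rescheduleExecutor struct{ mode string }

func (e rescheduleExecutor) Execute(_ context.Context, workload string) (string, error) {
	return fmt.Sprintf("Reschedule/%s: %s", e.mode, workload), nil
}

// safeMigrationExecutor stands in for the SafeMigration family.
type safeMigrationExecutor struct{}

func (safeMigrationExecutor) Execute(_ context.Context, workload string) (string, error) {
	return "SafeMigration: " + workload, nil
}

// Registry maps a strategy type name to its executor, so the core
// controller only performs a lookup and never branches on workload details.
type Registry map[string]Executor

// Dispatch selects the executor for strategyType and runs it.
func (r Registry) Dispatch(ctx context.Context, strategyType, workload string) (string, error) {
	exec, ok := r[strategyType]
	if !ok {
		return "", fmt.Errorf("unknown strategy type %q", strategyType)
	}
	return exec.Execute(ctx, workload)
}

// run wires a default registry for the two strategy families named above.
func run(strategyType, workload string) (string, error) {
	reg := Registry{
		"Reschedule":    rescheduleExecutor{mode: "Full"},
		"SafeMigration": safeMigrationExecutor{},
	}
	return reg.Dispatch(context.Background(), strategyType, workload)
}

func main() {
	for _, s := range []string{"Reschedule", "SafeMigration", "Unknown"} {
		out, err := run(s, "deploy/web")
		fmt.Println(out, err)
	}
}
```

With this shape, adding a future strategy would mean registering one more executor rather than editing the controller loop.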

@karmada-bot karmada-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 23, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request proposes extending Karmada's WorkloadRebalancer with strategy-based rebalancing, introducing a framework for safe migration and ready-preserving rescheduling. The reviewer feedback highlights several critical design and implementation improvements: making the Strategy field optional to preserve backward compatibility, refactoring the controller's reconciliation loop to aggregate errors instead of returning early (which would skip status patching), clarifying how workload spec changes are enforced or handled during migration, and documenting the requirement for executors to persist intermediate migration states in underlying resources.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +245 to +247
// Strategy describes how the selected workloads should be rebalanced.
// It must be specified for the strategy-based rebalance semantics proposed here.
Strategy RebalanceStrategy `json:"strategy"`


medium

To ensure backward compatibility with existing WorkloadRebalancer resources that do not have the strategy field, the Strategy field should be optional and defined as a pointer (*RebalanceStrategy) with omitempty in its JSON tag. This allows the controller to default any omitted strategy to the legacy Reschedule strategy with Full mode, preventing decoding or validation failures for older resources.

Suggested change
- // Strategy describes how the selected workloads should be rebalanced.
- // It must be specified for the strategy-based rebalance semantics proposed here.
- Strategy RebalanceStrategy `json:"strategy"`
+ // Strategy describes how the selected workloads should be rebalanced.
+ // +optional
+ Strategy *RebalanceStrategy `json:"strategy,omitempty"`
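To illustrate the defaulting behavior this comment asks for, here is a minimal sketch. The `Mode` field name, the `effectiveStrategy` helper, and the `decodeSpec` helper are assumptions made for the example and are not taken from the proposal; only the `Reschedule`, `SafeMigration`, and `Full` values come from the discussion.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RebalanceStrategy is a reduced stand-in for the proposed type.
// The Mode field name is an assumption for this sketch.
type RebalanceStrategy struct {
	Type string `json:"type"`
	Mode string `json:"mode,omitempty"`
}

// WorkloadRebalancerSpec is reduced to the one field under discussion.
type WorkloadRebalancerSpec struct {
	// +optional
	Strategy *RebalanceStrategy `json:"strategy,omitempty"`
}

// effectiveStrategy (hypothetical helper) returns the strategy to execute,
// falling back to Reschedule with Full mode when a legacy object omits it.
func effectiveStrategy(spec WorkloadRebalancerSpec) RebalanceStrategy {
	if spec.Strategy == nil {
		return RebalanceStrategy{Type: "Reschedule", Mode: "Full"}
	}
	return *spec.Strategy
}

// decodeSpec parses a JSON-encoded spec, as the API machinery would.
func decodeSpec(raw string) (WorkloadRebalancerSpec, error) {
	var spec WorkloadRebalancerSpec
	err := json.Unmarshal([]byte(raw), &spec)
	return spec, err
}

func main() {
	// A legacy object without the strategy field still decodes cleanly.
	legacy, err := decodeSpec(`{}`)
	fmt.Println(effectiveStrategy(legacy), err)

	// An explicit strategy is passed through unchanged.
	explicit, err := decodeSpec(`{"strategy":{"type":"SafeMigration"}}`)
	fmt.Println(effectiveStrategy(explicit), err)
}
```

Because the pointer is nil for older resources, the controller can tell "omitted" apart from "explicitly set" without a sentinel value.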

Comment on lines +372 to +374
if err != nil {
	return err
}


medium

Returning an error immediately inside the loop will prevent subsequent workloads from being reconciled in the current run. More importantly, it skips the c.patchStatus(ctx, wr, observed) call entirely, meaning any progress or status updates made by other workloads (or even the current workload before it failed) will not be persisted.

Consider aggregating errors during the loop, continuing to reconcile other workloads, patching the status with the latest progress, and then returning the aggregated errors at the end of the reconciliation.

Suggested change
- if err != nil {
- 	return err
- }
+ if err != nil {
+ 	errs = append(errs, err)
+ 	continue
+ }
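A fuller sketch of the aggregate-then-patch flow suggested here might look as follows. `reconcileAll`, `demoReconcile`, and the status strings are hypothetical stand-ins for the controller code in the proposal; `errors.Join` requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// reconcileAll is a hypothetical stand-in for the controller loop. It runs
// reconcileOne for every workload, records a per-workload result, always
// calls patchStatus, and only then returns the joined errors.
func reconcileAll(
	workloads []string,
	reconcileOne func(name string) error,
	patchStatus func(observed map[string]string) error,
) error {
	var errs []error
	observed := make(map[string]string, len(workloads))
	for _, w := range workloads {
		if err := reconcileOne(w); err != nil {
			observed[w] = "Failed: " + err.Error()
			errs = append(errs, fmt.Errorf("workload %s: %w", w, err))
			continue // keep reconciling the remaining workloads
		}
		observed[w] = "Reconciled"
	}
	// Status is patched even when some workloads failed, so progress persists.
	if err := patchStatus(observed); err != nil {
		errs = append(errs, fmt.Errorf("patch status: %w", err))
	}
	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}

// demoReconcile fails for workload "b" only, to exercise the error path.
func demoReconcile(name string) error {
	if name == "b" {
		return errors.New("binding not found")
	}
	return nil
}

func main() {
	err := reconcileAll([]string{"a", "b", "c"}, demoReconcile,
		func(observed map[string]string) error {
			fmt.Println("patched status:", observed)
			return nil
		})
	fmt.Println("returned error:", err)
}
```

Returning the joined error after patching still triggers a requeue, while the workloads that succeeded keep their recorded progress.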

| --- | --- |
| `from` / `to` must be determined before the WR controller executes. | Missing `from` or `to` is treated as `InvalidStrategyArgs`. |
| `from` / `to` cannot be changed after entering `Running`. | This avoids target drift during execution, which would make source/target object state hard to recover. |
| Workload spec cannot be changed during migration. | This avoids changes to the unit list, replica count, placement, or readiness semantics while the executor is deciding which units have migrated or need rollback. |


medium

Since the target workload (e.g., Deployment) is a separate resource, the WorkloadRebalancer controller cannot directly prevent users or other controllers from modifying its spec during migration unless there is an admission/validating webhook.

Please clarify how this constraint will be enforced. If a validating webhook is not planned, the executor should be designed to handle workload spec changes gracefully (e.g., by failing the migration safely or recalculating the migration units based on the new spec).
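One way an executor could "fail the migration safely" without a webhook is to compare the workload's `metadata.generation` at migration start with the value observed on each reconcile. The sketch below assumes that generation increments on spec changes, which holds for built-in workloads such as Deployment but should be verified per workload type. `ErrSpecChanged`, `checkSpecStable`, `nextPhase`, and the phase names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSpecChanged signals that the workload spec moved under a running migration.
var ErrSpecChanged = errors.New("workload spec changed during migration")

// checkSpecStable compares the generation recorded when the migration entered
// Running with the generation observed now.
func checkSpecStable(recordedGeneration, observedGeneration int64) error {
	if observedGeneration != recordedGeneration {
		return fmt.Errorf("%w: generation %d -> %d",
			ErrSpecChanged, recordedGeneration, observedGeneration)
	}
	return nil
}

// nextPhase (hypothetical) maps the check result to a phase: the migration
// stops at a safe point instead of acting on a stale unit list.
func nextPhase(recordedGeneration, observedGeneration int64) string {
	if err := checkSpecStable(recordedGeneration, observedGeneration); errors.Is(err, ErrSpecChanged) {
		return "Failed"
	}
	return "Running"
}

func main() {
	fmt.Println(nextPhase(3, 3))
	fmt.Println(nextPhase(3, 4), checkSpecStable(3, 4))
}
```

The alternative raised in the comment, recalculating migration units from the new spec, would replace the `Failed` branch with a rebuild step.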

Comment on lines +490 to +491
This avoids relying on in-memory state after controller restart and avoids creating two sources of truth between `phase` and
the actual object state.


medium

Since MigrationUnit is reconstructed from scratch on every reconciliation and is not persisted in the WorkloadRebalancer status, any state regarding which units are currently "in-flight" or "target open" must be stored directly in the underlying Kubernetes resources (such as annotations or spec fields on the ResourceBinding or target workload).

It would be helpful to explicitly document this requirement for the workload-specific unit executors, ensuring they are designed to persist any intermediate migration state in the actual resources rather than relying on the controller's memory or status.
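As a rough illustration of this requirement, the sketch below keeps the set of "target open" units in an annotation on the underlying object and rebuilds it from there on every reconcile. The annotation key and both helper names are hypothetical; the proposal does not name them, and a real executor would write the annotation through the Kubernetes API rather than on a bare map.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// targetOpenAnnotation is a hypothetical key; the proposal does not name one.
const targetOpenAnnotation = "example.io/migration-target-open-units"

// openUnits rebuilds the in-flight unit list purely from annotations, so a
// restarted controller sees the same state as the one that wrote it.
func openUnits(annotations map[string]string) []string {
	raw := annotations[targetOpenAnnotation] // reading a nil map is safe in Go
	if raw == "" {
		return nil
	}
	return strings.Split(raw, ",")
}

// markTargetOpen records unit as in-flight and returns the updated annotations.
// Calling it twice for the same unit is a no-op, which keeps reconciles idempotent.
func markTargetOpen(annotations map[string]string, unit string) map[string]string {
	units := openUnits(annotations)
	for _, u := range units {
		if u == unit {
			return annotations
		}
	}
	units = append(units, unit)
	sort.Strings(units)
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[targetOpenAnnotation] = strings.Join(units, ",")
	return annotations
}

func main() {
	ann := markTargetOpen(nil, "unit-b")
	ann = markTargetOpen(ann, "unit-a")

	// Simulate a controller restart: only the stored annotations survive.
	restored := map[string]string{targetOpenAnnotation: ann[targetOpenAnnotation]}
	fmt.Println(openUnits(restored))
}
```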

@zhy76

zhy76 commented Jun 23, 2026

Member Author

@RainbowMango @zhzhuang-zju PTAL, Thanks~

Copilot AI left a comment


Pull request overview

This PR adds a new scheduling proposal documenting how WorkloadRebalancer could evolve into a strategy-driven execution framework for multiple rescheduling scenarios (full reschedule, preserve-ready reschedule, and source-preserving safe migration), without changing current runtime behavior yet.

Changes:

  • Adds a new proposal doc defining strategy-based WorkloadRebalancer semantics (Reschedule and SafeMigration).
  • Documents proposed controller/executor architecture, execution flow, progress/status model, and test plan.
  • Proposes API surface changes (new spec.strategy, spec.cancel, and updated ttlSecondsAfterFinished semantics).


Comment on lines +1 to +3
---
title: Extend WorkloadRebalancer with Strategy-based Rebalancing
authors:

In a multi-cluster environment, workload distribution may gradually drift away from the desired state as clusters recover
from failures, capacity changes, new clusters are added, workloads scale out, or member cluster utilization changes. Karmada
needs a set of composable rescheduling primitives that users or operation systems can invoke in different rebalancing
Comment on lines +224 to +226
| `spec.cancel` | Common | Requests cancellation for an ongoing rebalance. |
| `spec.ttlSecondsAfterFinished` | Common | Automatically cleans up only when all workloads finish successfully; failed, canceled, or no-progress-timeout objects are kept for troubleshooting. |
| `strategy.type` | Common | Required. Strategy name. Valid values are `Reschedule` and `SafeMigration`. |
Comment on lines +249 to +251
// TTLSecondsAfterFinished limits successful finished WorkloadRebalancers.
// Failed, canceled, or no-progress-timeout objects should be kept for troubleshooting.
TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
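The success-only TTL semantics quoted above can be sketched as a small predicate. The phase names, `shouldCleanup`, and the `at` helper are assumptions for illustration; the proposal may model per-workload results differently.

```go
package main

import (
	"fmt"
	"time"
)

// Phase names are assumptions for this sketch.
const (
	phaseSuccessful        = "Successful"
	phaseFailed            = "Failed"
	phaseCanceled          = "Canceled"
	phaseNoProgressTimeout = "NoProgressTimeout"
)

// shouldCleanup reports whether a finished WorkloadRebalancer may be deleted:
// a TTL must be set, every workload must have finished successfully, and the
// TTL must have elapsed since finishedAt.
func shouldCleanup(phases []string, ttlSeconds *int32, finishedAt, now time.Time) bool {
	if ttlSeconds == nil || len(phases) == 0 {
		return false
	}
	for _, p := range phases {
		if p != phaseSuccessful {
			return false // failed, canceled, or timed-out objects are kept
		}
	}
	expiry := finishedAt.Add(time.Duration(*ttlSeconds) * time.Second)
	return !now.Before(expiry)
}

// at builds a timestamp from Unix seconds, to keep the examples short.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	ttl := int32(30)
	fmt.Println(shouldCleanup([]string{phaseSuccessful, phaseSuccessful}, &ttl, at(0), at(60)))
	fmt.Println(shouldCleanup([]string{phaseSuccessful, phaseFailed}, &ttl, at(0), at(60)))
	fmt.Println(shouldCleanup([]string{phaseSuccessful}, &ttl, at(0), at(10)))
}
```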
Comment on lines +245 to +247
// Strategy describes how the selected workloads should be rebalanced.
// It must be specified for the strategy-based rebalance semantics proposed here.
Strategy RebalanceStrategy `json:"strategy"`
@codecov-commenter



Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.05%. Comparing base (658499d) to head (586f6fc).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7662      +/-   ##
==========================================
- Coverage   42.06%   42.05%   -0.01%     
==========================================
  Files         879      879              
  Lines       54827    54827              
==========================================
- Hits        23061    23059       -2     
  Misses      30022    30022              
- Partials     1744     1746       +2     
Flag Coverage Δ
unittests 42.05% <ø> (-0.01%) ⬇️


@RainbowMango RainbowMango left a comment


Thanks.
/assign
Putting it into my queue.

And you are always welcome to discuss it at the community meeting.


Labels

kind/documentation Categorizes issue or PR as related to documentation.
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Question: Guidance on safe rescheduling for complex workloads with service continuity and resource-pool balancing goals

5 participants