scheduler: tolerate partial cluster failures in replica estimation by aeron-gh · Pull Request #7664 · karmada-io/karmada

aeron-gh · 2026-06-24T18:57:37Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

When multiple replica estimators are configured (e.g. general-estimator and scheduler-estimator), a failure to estimate a single cluster made the scheduler discard the whole estimator result. Every cluster then fell back to the general-estimator, including the clusters that the scheduler-estimator had estimated successfully.

This changes the estimation to degrade per cluster instead of all-or-nothing:

getClusterReplicasConcurrently and getClusterComponentSetsConcurrently now mark a failed cluster as UnauthenticReplica instead of leaving its slot empty, while still aggregating the error for logging.
calAvailableReplicas and calculateMultiTemplateAvailableSets merge the partial results instead of skipping the estimator entirely. mergeReplicaResults already ignores UnauthenticReplica entries, so only the clusters that succeeded are merged and a failed cluster keeps its previous value.

The descheduler path (FillUnschedulableReplicas) keeps its existing behavior and is unaffected.

Which issue(s) this PR fixes:

Fixes #6815

Special notes for your reviewer:

Unit tests cover both the single-template replica path and the multi-template component-set path:

Test_getClusterReplicasConcurrently_partialFailure
Test_runSingleTemplateEstimator_partialFailure
Test_mergeReplicaResults
a new partial-failure case in Test_calculateMultiTemplateAvailableSets

go test ./pkg/estimator/client/... ./pkg/scheduler/core/... passes.

Does this PR introduce a user-facing change?:

`karmada-scheduler`: A failure to estimate available replicas for a single cluster no longer discards the estimation results of the other clusters.

When the scheduler-estimator failed to estimate replicas for a single cluster, the entire estimator result was discarded and every cluster fell back to the general-estimator, even the clusters that had been estimated successfully.

Mark a failed cluster as UnauthenticReplica instead of dropping the whole batch, and let the caller merge the partial results. mergeReplicaResults already skips UnauthenticReplica entries, so only the clusters that succeeded are merged while a failed cluster keeps its previous value.

The same handling is applied to the multi-template component-set path. The descheduler path keeps its existing all-or-nothing behavior and is unaffected.

Signed-off-by: aeron-gh <agab0323@gmail.com>
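
The merge step mentioned in the commit message can be illustrated with a small sketch. The names here (`mergePartial`, `TargetCluster`) are hypothetical and are not the actual `mergeReplicaResults` implementation. The -1 sentinel and the rule of keeping the smaller of two valid estimates are both assumptions of this sketch.

```go
package main

import "fmt"

// UnauthenticReplica marks a cluster without a usable estimate.
// The value -1 is an assumption made for this sketch.
const UnauthenticReplica int32 = -1

// TargetCluster is a hypothetical stand-in for the scheduler's result type.
type TargetCluster struct {
	Name     string
	Replicas int32
}

// mergePartial folds one estimator's results into the running values.
// Entries carrying the sentinel are skipped, so a failed cluster keeps
// whatever value it had before. Keeping the minimum of two valid
// estimates is an assumption of this sketch.
func mergePartial(current map[string]int32, estimated []TargetCluster) {
	for _, tc := range estimated {
		if tc.Replicas == UnauthenticReplica {
			continue
		}
		if prev, ok := current[tc.Name]; !ok || tc.Replicas < prev {
			current[tc.Name] = tc.Replicas
		}
	}
}

func main() {
	// Values from an earlier estimator (for example, the general-estimator).
	current := map[string]int32{"member1": 20, "member2": 20}

	// A later estimator succeeded for member1 and failed for member2.
	estimated := []TargetCluster{
		{Name: "member1", Replicas: 8},
		{Name: "member2", Replicas: UnauthenticReplica},
	}

	mergePartial(current, estimated)
	fmt.Println(current) // prints "map[member1:8 member2:20]"
}
```

Only member1 is tightened by the second estimator. member2 keeps its earlier value instead of forcing every cluster back to the fallback.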

karmada-bot · 2026-06-24T18:57:42Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign garrybest for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/estimator/OWNERS
pkg/scheduler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

karmada-bot · 2026-06-24T18:57:47Z

Welcome @aeron-gh! It looks like this is your first PR to karmada-io/karmada 🎉

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

gemini-code-assist · 2026-06-24T19:04:23Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the robustness of the Karmada scheduler by allowing it to tolerate partial failures during replica estimation. Previously, if an estimator failed for any single cluster, the entire result set was discarded, causing a fallback to the general estimator for all clusters. The changes ensure that successful estimations are preserved and merged, while failed clusters are explicitly marked, preventing unnecessary fallback and improving scheduling accuracy.

Highlights

Partial Failure Handling: Modified replica estimation to process partial results instead of discarding the entire estimator response when a single cluster fails.
Error Propagation: Updated concurrent estimation functions to mark failed clusters as 'UnauthenticReplica' while still allowing successful results to be merged.
Testing: Added comprehensive unit tests to verify that partial failures in both single-template and multi-template estimation paths are handled correctly.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request improves estimator error handling by allowing partial results to be returned and merged when some clusters fail, rather than discarding the entire batch. Failed clusters are marked as UnauthenticReplica. Unit tests have been added to cover these partial failure scenarios. The reviewer suggested an improvement in calculateMultiTemplateAvailableSets to return early if the estimator fails completely (i.e., when the response is empty) to avoid unnecessary processing and potential log spam in production.


gemini-code-assist · 2026-06-24T19:06:12Z

 	if err != nil {
+		// Don't return early: on a partial failure resp still carries the clusters that
+		// succeeded, so fall through and build a result from what is available.
 		klog.Errorf("Failed to calculate available component set with estimator(%s) for workload(%s, kind=%s, %s): %v",
 			estCtx.estimatorName, estCtx.spec.Resource.APIVersion, estCtx.spec.Resource.Kind, namespacedKey, err)
-		return nil, err
 	}


If the estimator fails completely (i.e., err != nil and len(resp) == 0), continuing to the subsequent loop will cause klog.Warningf to be triggered for every single cluster in estCtx.clusters. This can lead to severe log spam in production environments. Returning early when len(resp) == 0 avoids this issue while still preserving the partial failure handling when some results are returned.

Suggested change

-	if err != nil {
-		// Don't return early: on a partial failure resp still carries the clusters that
-		// succeeded, so fall through and build a result from what is available.
-		klog.Errorf("Failed to calculate available component set with estimator(%s) for workload(%s, kind=%s, %s): %v",
-			estCtx.estimatorName, estCtx.spec.Resource.APIVersion, estCtx.spec.Resource.Kind, namespacedKey, err)
-		return nil, err
-	}
+	if err != nil {
+		// Don't return early: on a partial failure resp still carries the clusters that
+		// succeeded, so fall through and build a result from what is available.
+		klog.Errorf("Failed to calculate available component set with estimator(%s) for workload(%s, kind=%s, %s): %v",
+			estCtx.estimatorName, estCtx.spec.Resource.APIVersion, estCtx.spec.Resource.Kind, namespacedKey, err)
+		if len(resp) == 0 {
+			return []workv1alpha2.TargetCluster{}, err
+		}
+	}

codecov-commenter · 2026-06-24T19:16:17Z

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.10%. Comparing base (658499d) to head (7254d8b).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/estimator/client/accurate.go | 50.00% | 1 Missing ⚠️ |
| pkg/scheduler/core/util.go | 50.00% | 1 Missing ⚠️ |

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7664      +/-   ##
==========================================
+ Coverage   42.06%   42.10%   +0.04%     
==========================================
  Files         879      879              
  Lines       54827    54828       +1     
==========================================
+ Hits        23061    23087      +26     
+ Misses      30022    29994      -28     
- Partials     1744     1747       +3

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `42.10% <60.00%> (+0.04%)` | ⬆️ |


Copilot AI review requested due to automatic review settings June 24, 2026 18:57

karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 24, 2026

karmada-bot requested review from Garrybest and whitewindmills June 24, 2026 18:57

karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 24, 2026

Copilot started reviewing on behalf of aeron-gh June 24, 2026 18:58

Copilot AI reviewed Jun 24, 2026

gemini-code-assist Bot reviewed Jun 24, 2026

aeron-gh wants to merge 1 commit into karmada-io:master from aeron-gh:fix/6815-partial-estimator-failures
4 participants