scheduler: tolerate partial cluster failures in replica estimation#7664

Open
aeron-gh wants to merge 1 commit into karmada-io:master from aeron-gh:fix/6815-partial-estimator-failures

Conversation

@aeron-gh

What type of PR is this?

/kind feature

What this PR does / why we need it:

When multiple replica estimators are configured (e.g. general-estimator and scheduler-estimator), a failure to estimate a single cluster made the scheduler discard the whole estimator result. Every cluster then fell back to the general-estimator, including the clusters that the scheduler-estimator had estimated successfully.

This changes the estimation to degrade per cluster instead of all-or-nothing:

  • getClusterReplicasConcurrently and getClusterComponentSetsConcurrently now mark a failed cluster as UnauthenticReplica instead of leaving its slot empty, while still aggregating the error for logging.
  • calAvailableReplicas and calculateMultiTemplateAvailableSets merge the partial results instead of skipping the estimator entirely. mergeReplicaResults already ignores UnauthenticReplica entries, so only the clusters that succeeded are merged and a failed cluster keeps its previous value.

The descheduler path (FillUnschedulableReplicas) keeps its existing behavior and is unaffected.

Which issue(s) this PR fixes:

Fixes #6815

Special notes for your reviewer:

Unit tests cover both the single-template replica path and the multi-template component-set path:

  • Test_getClusterReplicasConcurrently_partialFailure
  • Test_runSingleTemplateEstimator_partialFailure
  • Test_mergeReplicaResults
  • a new partial-failure case in Test_calculateMultiTemplateAvailableSets

go test ./pkg/estimator/client/... ./pkg/scheduler/core/... passes.

Does this PR introduce a user-facing change?:

`karmada-scheduler`: A failure to estimate available replicas for a single cluster no longer discards the estimation results of the other clusters.

When the scheduler-estimator failed to estimate replicas for a single
cluster, the entire estimator result was discarded and every cluster
fell back to the general-estimator, even the clusters that had been
estimated successfully.

Mark a failed cluster as UnauthenticReplica instead of dropping the whole
batch, and let the caller merge the partial results. mergeReplicaResults
already skips UnauthenticReplica entries, so only the clusters that
succeeded are merged while a failed cluster keeps its previous value. The
same handling is applied to the multi-template component-set path.

The descheduler path keeps its existing all-or-nothing behavior and is
unaffected.

Signed-off-by: aeron-gh <agab0323@gmail.com>
Copilot AI review requested due to automatic review settings June 24, 2026 18:57
@karmada-bot added the kind/feature label (Categorizes issue or PR as related to a new feature.) on Jun 24, 2026
@karmada-bot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign garrybest for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot

Contributor

Welcome @aeron-gh! It looks like this is your first PR to karmada-io/karmada 🎉

@karmada-bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Jun 24, 2026

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@gemini-code-assist


Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the robustness of the Karmada scheduler by allowing it to tolerate partial failures during replica estimation. Previously, if an estimator failed for any single cluster, the entire result set was discarded, causing a fallback to the general estimator for all clusters. The changes ensure that successful estimations are preserved and merged, while failed clusters are explicitly marked, preventing unnecessary fallback and improving scheduling accuracy.

Highlights

  • Partial Failure Handling: Modified replica estimation to process partial results instead of discarding the entire estimator response when a single cluster fails.
  • Error Propagation: Updated concurrent estimation functions to mark failed clusters as 'UnauthenticReplica' while still allowing successful results to be merged.
  • Testing: Added comprehensive unit tests to verify that partial failures in both single-template and multi-template estimation paths are handled correctly.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request improves estimator error handling by allowing partial results to be returned and merged when some clusters fail, rather than discarding the entire batch. Failed clusters are marked as UnauthenticReplica. Unit tests have been added to cover these partial failure scenarios. The reviewer suggested an improvement in calculateMultiTemplateAvailableSets to return early if the estimator fails completely (i.e., when the response is empty) to avoid unnecessary processing and potential log spam in production.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 87 to 92
  if err != nil {
+     // Don't return early: on a partial failure resp still carries the clusters that
+     // succeeded, so fall through and build a result from what is available.
      klog.Errorf("Failed to calculate available component set with estimator(%s) for workload(%s, kind=%s, %s): %v",
          estCtx.estimatorName, estCtx.spec.Resource.APIVersion, estCtx.spec.Resource.Kind, namespacedKey, err)
-     return nil, err
  }

medium

If the estimator fails completely (i.e., err != nil and len(resp) == 0), continuing to the subsequent loop will cause klog.Warningf to be triggered for every single cluster in estCtx.clusters. This can lead to severe log spam in production environments. Returning early when len(resp) == 0 avoids this issue while still preserving the partial failure handling when some results are returned.

Suggested change
- if err != nil {
-     // Don't return early: on a partial failure resp still carries the clusters that
-     // succeeded, so fall through and build a result from what is available.
-     klog.Errorf("Failed to calculate available component set with estimator(%s) for workload(%s, kind=%s, %s): %v",
-         estCtx.estimatorName, estCtx.spec.Resource.APIVersion, estCtx.spec.Resource.Kind, namespacedKey, err)
-     return nil, err
- }
+ if err != nil {
+     // Don't return early: on a partial failure resp still carries the clusters that
+     // succeeded, so fall through and build a result from what is available.
+     klog.Errorf("Failed to calculate available component set with estimator(%s) for workload(%s, kind=%s, %s): %v",
+         estCtx.estimatorName, estCtx.spec.Resource.APIVersion, estCtx.spec.Resource.Kind, namespacedKey, err)
+     if len(resp) == 0 {
+         return []workv1alpha2.TargetCluster{}, err
+     }
+ }

@codecov-commenter

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.10%. Comparing base (658499d) to head (7254d8b).

Files with missing lines            Patch %   Lines
pkg/estimator/client/accurate.go    50.00%    1 Missing ⚠️
pkg/scheduler/core/util.go          50.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7664      +/-   ##
==========================================
+ Coverage   42.06%   42.10%   +0.04%     
==========================================
  Files         879      879              
  Lines       54827    54828       +1     
==========================================
+ Hits        23061    23087      +26     
+ Misses      30022    29994      -28     
- Partials     1744     1747       +3     
Flag Coverage Δ
unittests 42.10% <60.00%> (+0.04%) ⬆️


Development

Successfully merging this pull request may close these issues.

Handle partial cluster failures in scheduler estimator gracefully
