scheduler: tolerate partial cluster failures in replica estimation#7664
scheduler: tolerate partial cluster failures in replica estimation#7664aeron-gh wants to merge 1 commit into
Conversation
When the scheduler-estimator failed to estimate replicas for a single cluster, the entire estimator result was discarded and every cluster fell back to the general-estimator, even the clusters that had been estimated successfully. Mark a failed cluster as UnauthenticReplica instead of dropping the whole batch, and let the caller merge the partial results. mergeReplicaResults already skips UnauthenticReplica entries, so only the clusters that succeeded are merged while a failed cluster keeps its previous value. The same handling is applied to the multi-template component-set path. The descheduler path keeps its existing all-or-nothing behavior and is unaffected. Signed-off-by: aeron-gh <agab0323@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Welcome @aeron-gh! It looks like this is your first PR to karmada-io/karmada 🎉 |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request improves the robustness of the Karmada scheduler by allowing it to tolerate partial failures during replica estimation. Previously, if an estimator failed for any single cluster, the entire result set was discarded, causing a fallback to the general estimator for all clusters. The changes ensure that successful estimations are preserved and merged, while failed clusters are explicitly marked, preventing unnecessary fallback and improving scheduling accuracy. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request improves estimator error handling by allowing partial results to be returned and merged when some clusters fail, rather than discarding the entire batch. Failed clusters are marked as UnauthenticReplica. Unit tests have been added to cover these partial failure scenarios. The reviewer suggested an improvement in calculateMultiTemplateAvailableSets to return early if the estimator fails completely (i.e., when the response is empty) to avoid unnecessary processing and potential log spam in production.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if err != nil { | ||
| // Don't return early: on a partial failure resp still carries the clusters that | ||
| // succeeded, so fall through and build a result from what is available. | ||
| klog.Errorf("Failed to calculate available component set with estimator(%s) for workload(%s, kind=%s, %s): %v", | ||
| estCtx.estimatorName, estCtx.spec.Resource.APIVersion, estCtx.spec.Resource.Kind, namespacedKey, err) | ||
| return nil, err | ||
| } |
There was a problem hiding this comment.
If the estimator fails completely (i.e., err != nil and len(resp) == 0), continuing to the subsequent loop will cause klog.Warningf to be triggered for every single cluster in estCtx.clusters. This can lead to severe log spam in production environments. Returning early when len(resp) == 0 avoids this issue while still preserving the partial failure handling when some results are returned.
| if err != nil { | |
| // Don't return early: on a partial failure resp still carries the clusters that | |
| // succeeded, so fall through and build a result from what is available. | |
| klog.Errorf("Failed to calculate available component set with estimator(%s) for workload(%s, kind=%s, %s): %v", | |
| estCtx.estimatorName, estCtx.spec.Resource.APIVersion, estCtx.spec.Resource.Kind, namespacedKey, err) | |
| return nil, err | |
| } | |
| if err != nil { | |
| // Don't return early: on a partial failure resp still carries the clusters that | |
| // succeeded, so fall through and build a result from what is available. | |
| klog.Errorf("Failed to calculate available component set with estimator(%s) for workload(%s, kind=%s, %s): %v", | |
| estCtx.estimatorName, estCtx.spec.Resource.APIVersion, estCtx.spec.Resource.Kind, namespacedKey, err) | |
| if len(resp) == 0 { | |
| return []workv1alpha2.TargetCluster{}, err | |
| } | |
| } |
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7664 +/- ##
==========================================
+ Coverage 42.06% 42.10% +0.04%
==========================================
Files 879 879
Lines 54827 54828 +1
==========================================
+ Hits 23061 23087 +26
+ Misses 30022 29994 -28
- Partials 1744 1747 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
What type of PR is this?
/kind feature
What this PR does / why we need it:
When multiple replica estimators are configured (e.g.
general-estimatorandscheduler-estimator), a failure to estimate a single cluster made the scheduler discard the whole estimator result. Every cluster then fell back to thegeneral-estimator, including the clusters that thescheduler-estimatorhad estimated successfully.This changes the estimation to degrade per cluster instead of all-or-nothing:
getClusterReplicasConcurrentlyandgetClusterComponentSetsConcurrentlynow mark a failed cluster asUnauthenticReplicainstead of leaving its slot empty, while still aggregating the error for logging.calAvailableReplicasandcalculateMultiTemplateAvailableSetsmerge the partial results instead of skipping the estimator entirely.mergeReplicaResultsalready ignoresUnauthenticReplicaentries, so only the clusters that succeeded are merged and a failed cluster keeps its previous value.The descheduler path (
FillUnschedulableReplicas) keeps its existing behavior and is unaffected.Which issue(s) this PR fixes:
Fixes #6815
Special notes for your reviewer:
Unit tests cover both the single-template replica path and the multi-template component-set path:
Test_getClusterReplicasConcurrently_partialFailureTest_runSingleTemplateEstimator_partialFailureTest_mergeReplicaResultsTest_calculateMultiTemplateAvailableSetsgo test ./pkg/estimator/client/... ./pkg/scheduler/core/...passes.Does this PR introduce a user-facing change?: