fix: avoid overwriting cluster summaries when node/pod listing fails#7557
fix: avoid overwriting cluster summaries when node/pod listing fails#7557Ady0333 wants to merge 1 commit into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request improves the robustness of the cluster status controller by implementing early returns in the event of errors during node or pod listing. By preventing the controller from proceeding with nil or incomplete data, it ensures that existing cluster summaries are preserved rather than being incorrectly overwritten with empty values during failure scenarios. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
Hello @XiShanYongYe-Chang ! Please review this pr. |
There was a problem hiding this comment.
Code Review
This pull request updates the setCurrentClusterStatus function in cluster_status_controller.go to properly return early with an error if listing nodes or listing pods fails, preventing the execution of subsequent logic with incomplete data. I have no additional feedback to provide as there are no review comments to address.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Propagates node/pod listing failures to the caller instead of continuing with potentially incomplete cluster status calculations.
Changes:
- Return early with an error when
listNodes(...)fails. - Return early with an error when
listPods(...)fails.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
9e309c7 to
4a38592
Compare
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7557 +/- ##
==========================================
+ Coverage 41.92% 42.15% +0.23%
==========================================
Files 879 879
Lines 54328 54733 +405
==========================================
+ Hits 22776 23075 +299
- Misses 29829 29914 +85
- Partials 1723 1744 +21
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Hi @Ady0333 |
No worries @XiShanYongYe-Chang !!! Take your time... |
|
@Ady0333 Took a look at this since it's in an area I've been working around in, and a couple of questions I have:
Might also be worth a small test that the prior summaries survive the failure path. What do you think? |
|
Good catches, thanks for looking @Tej-Katika . You're right that the repro in #7511 has nodes == nil with err == nil, so the if err != nil { return } doesn't Going to switch to your suggestion guard the assignments instead of returning, so prior if err == nil && nodes != nil {
currentClusterStatus.NodeSummary = getNodeSummary(nodes)
if pods != nil {
currentClusterStatus.ResourceSummary = getResourceSummary(nodes, pods)
}
}(or splitting the two error checks similarly). Will also add a small unit test that the prior summaries |
4a38592 to
3727812
Compare
Skip overwriting NodeSummary and ResourceSummary in setCurrentClusterStatus when listNodes or listPods returns an error or nil result, preserving the last-good values on currentClusterStatus (already deep-copied from cluster.Status) while still allowing the rest of the status to refresh. Signed-off-by: Ady0333 <adityashinde1525@gmail.com>
3727812 to
3fedcfb
Compare
What type of PR is this?
/kind bug
/kind cleanup
What this PR does / why we need it:
Skip overwriting NodeSummary and ResourceSummary in setCurrentClusterStatus when listNodes or listPods returns
an error or a nil slice. The last-good values on currentClusterStatus (already deep-copied from
cluster.Status) are preserved while the rest of the status (KubernetesVersion, APIEnablements, Ready) still
refreshes that cycle.
Without this, a transient lister failure or empty result would zero out NodeSummary.TotalNum and
ResourceSummary.Allocatable, which the scheduler then reads as "cluster has no capacity" and routes replicas
elsewhere.
Before:
After:
Which issue(s) this PR fixes:
Fixes #7511
Special notes for your reviewer:
Does this PR introduce a user-facing change?
karmada-controller-manager: Fixed a bug where transient node/pod listing failures could overwrite acluster's
NodeSummaryandResourceSummarywith zeroed values, causing the scheduler to treat a healthycluster as having zero capacity.