fix: avoid overwriting cluster summaries when node/pod listing fails#7557

Open
Ady0333 wants to merge 1 commit into karmada-io:master from Ady0333:fix-cluster-status-summary-overwrite

Conversation

@Ady0333
Contributor

@Ady0333 Ady0333 commented May 26, 2026

What type of PR is this?

/kind bug
/kind cleanup


What this PR does / why we need it:

Skip overwriting NodeSummary and ResourceSummary in setCurrentClusterStatus when listNodes or listPods returns
an error or a nil slice. The last-good values on currentClusterStatus (already deep-copied from
cluster.Status) are preserved while the rest of the status (KubernetesVersion, APIEnablements, Ready) still
refreshes that cycle.

Without this, a transient lister failure or empty result would zero out NodeSummary.TotalNum and
ResourceSummary.Allocatable, which the scheduler then reads as "cluster has no capacity" and routes replicas
elsewhere.

Before:

  nodes, err := listNodes(clusterInformerManager)
  if err != nil {
      klog.ErrorS(err, "Failed to list nodes for Cluster", "cluster", cluster.GetName())
  }
  pods, err := listPods(clusterInformerManager)
  if err != nil {
      klog.ErrorS(err, "Failed to list pods for Cluster", "cluster", cluster.GetName())
  }
  currentClusterStatus.NodeSummary = getNodeSummary(nodes)
  currentClusterStatus.ResourceSummary = getResourceSummary(nodes, pods)

  if features.FeatureGate.Enabled(features.CustomizedClusterResourceModeling) {
      currentClusterStatus.ResourceSummary.AllocatableModelings = getAllocatableModelings(cluster, nodes, pods)
  }

After:

  nodes, nodesErr := listNodes(clusterInformerManager)
  if nodesErr != nil {
      klog.ErrorS(nodesErr, "Failed to list nodes, preserving previous summaries for cluster", "cluster", cluster.GetName())
  }
  pods, podsErr := listPods(clusterInformerManager)
  if podsErr != nil {
      klog.ErrorS(podsErr, "Failed to list pods, preserving previous summaries for cluster", "cluster", cluster.GetName())
  }
  if nodesErr == nil && podsErr == nil && nodes != nil && pods != nil {
      currentClusterStatus.NodeSummary = getNodeSummary(nodes)
      currentClusterStatus.ResourceSummary = getResourceSummary(nodes, pods)

      if features.FeatureGate.Enabled(features.CustomizedClusterResourceModeling) {
          currentClusterStatus.ResourceSummary.AllocatableModelings = getAllocatableModelings(cluster, nodes, pods)
      }
  }

Which issue(s) this PR fixes:

Fixes #7511

Special notes for your reviewer:

  • Minimal change, scoped to setCurrentClusterStatus.
  • No early return: the caller still reaches updateStatusIfNeeded, so other status fields keep refreshing.
  • Covers both the error path and the nil, nil repro case from the issue.
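
For reviewers who want to see the "no early return" behaviour in isolation, here is a minimal sketch. It is not the controller code: ClusterStatus, NodeSummary, refreshStatus, and errListerUnavailable are simplified, hypothetical stand-ins, and the real function works on a deep copy of cluster.Status rather than the shallow copy used here.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeSummary and ClusterStatus are simplified, hypothetical stand-ins for
// the Karmada status types; only the fields needed for the sketch are kept.
type NodeSummary struct{ TotalNum int32 }

type ClusterStatus struct {
	KubernetesVersion string
	NodeSummary       *NodeSummary
}

// errListerUnavailable is a hypothetical error used to simulate a failed list.
var errListerUnavailable = errors.New("lister unavailable")

// refreshStatus always refreshes the version, but replaces NodeSummary only
// when listing succeeded and returned a non-nil slice.
func refreshStatus(prev ClusterStatus, version string, listNodes func() ([]string, error)) ClusterStatus {
	// A shallow copy is enough for this sketch because the summary pointer is
	// replaced, never mutated. The real controller starts from a deep copy.
	cur := prev
	cur.KubernetesVersion = version

	nodes, err := listNodes()
	if err != nil {
		fmt.Println("failed to list nodes, preserving previous summary:", err)
	}
	if err == nil && nodes != nil {
		cur.NodeSummary = &NodeSummary{TotalNum: int32(len(nodes))}
	}
	return cur
}

func main() {
	prev := ClusterStatus{KubernetesVersion: "v1.30.0", NodeSummary: &NodeSummary{TotalNum: 3}}
	failing := func() ([]string, error) { return nil, errListerUnavailable }

	cur := refreshStatus(prev, "v1.31.0", failing)
	fmt.Println(cur.KubernetesVersion, cur.NodeSummary.TotalNum)
}
```

With the failing lister, the sketch still reports the refreshed version while the node count stays at the previous value of 3.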

Does this PR introduce a user-facing change?

karmada-controller-manager: Fixed a bug where transient node/pod listing failures could overwrite a
cluster's NodeSummary and ResourceSummary with zeroed values, causing the scheduler to treat a healthy
cluster as having zero capacity.

Copilot AI review requested due to automatic review settings May 26, 2026 13:18
@karmada-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mszacillo for approval. For more information see the Code Review Process.


@karmada-bot karmada-bot requested review from jwcesign and zach593 May 26, 2026 13:18
@karmada-bot karmada-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 26, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the robustness of the cluster status controller by implementing early returns in the event of errors during node or pod listing. By preventing the controller from proceeding with nil or incomplete data, it ensures that existing cluster summaries are preserved rather than being incorrectly overwritten with empty values during failure scenarios.

Highlights

  • Error Handling Improvement: Updated the setCurrentClusterStatus function to return early when listing nodes or pods fails, preventing the system from proceeding with incomplete data.
  • Data Integrity: Ensured that cluster summaries are not overwritten with zeroed values when transient errors occur during node or pod listing.

@Ady0333
Contributor Author

Ady0333 commented May 26, 2026

Hello @XiShanYongYe-Chang! Please review this PR.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the setCurrentClusterStatus function in cluster_status_controller.go to properly return early with an error if listing nodes or listing pods fails, preventing the execution of subsequent logic with incomplete data. I have no additional feedback to provide as there are no review comments to address.

Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Propagates node/pod listing failures to the caller instead of continuing with potentially incomplete cluster status calculations.

Changes:

  • Return early with an error when listNodes(...) fails.
  • Return early with an error when listPods(...) fails.

Comment thread pkg/controllers/status/cluster_status_controller.go Outdated
@Ady0333 Ady0333 force-pushed the fix-cluster-status-summary-overwrite branch from 9e309c7 to 4a38592 Compare May 26, 2026 13:25
@codecov-commenter

codecov-commenter commented May 26, 2026

Codecov Report

❌ Patch coverage is 0% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.15%. Comparing base (03b39dd) to head (3fedcfb).
⚠️ Report is 114 commits behind head on master.

Files with missing lines                               Patch %   Lines
...kg/controllers/status/cluster_status_controller.go  0.00%     12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7557      +/-   ##
==========================================
+ Coverage   41.92%   42.15%   +0.23%     
==========================================
  Files         879      879              
  Lines       54328    54733     +405     
==========================================
+ Hits        22776    23075     +299     
- Misses      29829    29914      +85     
- Partials     1723     1744      +21     
Flag        Coverage Δ
unittests   42.15% <0.00%> (+0.23%) ⬆️

@XiShanYongYe-Chang
Member

Hi @Ady0333
Sorry, I might not have time recently. I should start working on this PR in the next version, which will be early next month.

@Ady0333
Contributor Author

Ady0333 commented May 27, 2026

Hi @Ady0333 Sorry, I might not have time recently. I should start working on this PR in the next version, which will be early next month.

No worries @XiShanYongYe-Chang !!! Take your time...

@Tej-Katika
Contributor

@Ady0333 Took a look at this since it's in an area I've been working in, and I have a couple of questions:

  • The repro case in #7511 ([bug]: cluster-status-controller overwrites NodeSummary/ResourceSummary with zeros on transient lister failures, causing scheduler to treat healthy cluster as zero capacity) has listNodes/listPods returning nil with no error, and that's the case that zeroes the summary. Since err == nil there, I don't think if err != nil { return } actually catches it, so I'm not sure this prevents the overwrite the issue shows. The err != nil path also seems hard to reach, since buildInformerForCluster waits for the caches to sync before we list.

  • From the thread it sounded like the idea was to skip overwriting the summaries rather than return — and returning makes the caller bail before updateStatusIfNeeded, so the KubernetesVersion/APIEnablements/Ready refresh gets skipped that cycle too. Since currentClusterStatus is already a deep copy of cluster.Status, would return conditions, nil on the error path work better here?

Might also be worth a small test that the prior summaries survive the failure path. What do you think?
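
To make the first point concrete, here is a small sketch of why an error-only check lets the nil, nil case through. The names listNodes, countNodes, errOnlyGuard, and errAndNilGuard are hypothetical stand-ins written for this illustration, not the controller's functions; countNodes only plays the role that getNodeSummary plays in the PR.

```go
package main

import "fmt"

// listNodes is a hypothetical stand-in for a lister call that yields no data
// and no error, matching the repro described in the issue.
func listNodes() ([]string, error) { return nil, nil }

// countNodes stands in for the summary computation. len of a nil slice is 0
// in Go, so a nil input silently produces a zero total.
func countNodes(nodes []string) int { return len(nodes) }

// errOnlyGuard keeps the previous total only when an error is returned.
func errOnlyGuard(prev int) int {
	nodes, err := listNodes()
	if err != nil {
		return prev
	}
	return countNodes(nodes)
}

// errAndNilGuard also keeps the previous total when the slice is nil.
func errAndNilGuard(prev int) int {
	nodes, err := listNodes()
	if err != nil || nodes == nil {
		return prev
	}
	return countNodes(nodes)
}

func main() {
	fmt.Println("error-only guard:", errOnlyGuard(3))
	fmt.Println("error-and-nil guard:", errAndNilGuard(3))
}
```

The error-only guard overwrites the previous total of 3 with 0, while the guard that also checks for a nil slice keeps it.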

@Ady0333
Copy link
Copy Markdown
Contributor Author

Ady0333 commented Jun 1, 2026

Good catches, thanks for looking @Tej-Katika .

You're right that the repro in #7511 has nodes == nil with err == nil, so the if err != nil { return } doesn't
actually cover that path. And returning early does skip updateStatusIfNeeded, which I didn't want either.

Going to switch to your suggestion: guard the assignments instead of returning, so prior
NodeSummary/ResourceSummary are preserved (they're already on currentClusterStatus via the deep copy) and the
rest of the status still refreshes:

  if err == nil && nodes != nil {
      currentClusterStatus.NodeSummary = getNodeSummary(nodes)
      if pods != nil {
          currentClusterStatus.ResourceSummary = getResourceSummary(nodes, pods)
      }
  }

(or splitting the two error checks similarly). Will also add a small unit test that the prior summaries
survive the failure/nil path. Will push an update shortly.
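
One possible shape for that test, sketched as a table-driven check. This is only an illustration under stated assumptions: summary, update, testCase, errList, and failedCases are hypothetical names, update mirrors the combined guard from the "After" snippet in the PR description rather than calling setCurrentClusterStatus, and a real test would use the testing package against the controller itself.

```go
package main

import (
	"errors"
	"fmt"
)

// summary is a hypothetical, flattened stand-in for NodeSummary and
// ResourceSummary.
type summary struct{ nodes, pods int }

// errList is a hypothetical error used to simulate a failed list call.
var errList = errors.New("list failed")

// update recomputes the summary only when both listings succeed and both
// slices are non-nil; otherwise it returns prev unchanged.
func update(prev summary, nodes []string, nodesErr error, pods []string, podsErr error) summary {
	if nodesErr == nil && podsErr == nil && nodes != nil && pods != nil {
		return summary{nodes: len(nodes), pods: len(pods)}
	}
	return prev
}

type testCase struct {
	name     string
	nodes    []string
	nodesErr error
	pods     []string
	podsErr  error
	want     summary
}

// failedCases runs every case and returns the names of those whose result
// differs from the expected summary.
func failedCases() []string {
	prev := summary{nodes: 3, pods: 7}
	cases := []testCase{
		{name: "node list error", nodesErr: errList, pods: []string{"p"}, want: prev},
		{name: "pod list error", nodes: []string{"n"}, podsErr: errList, want: prev},
		{name: "nil nodes, no error", pods: []string{"p"}, want: prev},
		{name: "nil pods, no error", nodes: []string{"n"}, want: prev},
		{name: "both listings succeed", nodes: []string{"n1", "n2"}, pods: []string{"p"}, want: summary{nodes: 2, pods: 1}},
	}
	var failed []string
	for _, c := range cases {
		if got := update(prev, c.nodes, c.nodesErr, c.pods, c.podsErr); got != c.want {
			failed = append(failed, c.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", failedCases())
}
```

Each failure-path row expects the prior summary back unchanged, and the final row confirms that a successful listing still refreshes it.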

@Ady0333 Ady0333 force-pushed the fix-cluster-status-summary-overwrite branch from 4a38592 to 3727812 Compare June 1, 2026 04:21
@karmada-bot karmada-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 1, 2026
@Ady0333 Ady0333 requested a review from Copilot June 2, 2026 01:50
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread pkg/controllers/status/cluster_status_controller.go Outdated
Comment thread pkg/controllers/status/cluster_status_controller.go Outdated
Skip overwriting NodeSummary and ResourceSummary in setCurrentClusterStatus
when listNodes or listPods returns an error or nil result, preserving the
last-good values on currentClusterStatus (already deep-copied from
cluster.Status) while still allowing the rest of the status to refresh.

Signed-off-by: Ady0333 <adityashinde1525@gmail.com>
@Ady0333 Ady0333 force-pushed the fix-cluster-status-summary-overwrite branch from 3727812 to 3fedcfb Compare June 2, 2026 01:55
Labels

size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

6 participants