fix: preserve application failover timer across ResourceUnknown transitions #7432

Open
Ady0333 wants to merge 2 commits into karmada-io:master from Ady0333:fix/application-failover-unknown-resets-timer

Ady0333 commented Apr 25, 2026

What type of PR is this?
/kind bug


What this PR does / why we need it:
Fixes a critical bug where application failover is permanently suppressed when workloads oscillate between Unhealthy and Unknown states.


Which issue(s) this PR fixes:
Fixes #7431


Root cause:
distinguishUnhealthyClustersWithOthers was treating ResourceUnknown the same as ResourceHealthy, causing the failover timer to be deleted on every Unknown transition (agent restarts, network blips, etc.).


Changes:

  • Changed line 119 of pkg/controllers/applicationfailover/common.go to treat only ResourceHealthy as healthy
  • Unknown clusters now preserve their unhealthy timestamps
  • Updated test to reflect correct behavior
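The classification change listed above can be sketched as follows. This is a minimal, self-contained illustration rather than the actual Karmada code: the `StatusItem` type, the `classify` helper, and the `unknownCountsAsHealthy` flag are assumptions introduced here so that the behavior before and after the change can be compared side by side.

```go
package main

import "fmt"

// ResourceHealth is a stand-in for the health value reported per cluster.
// The string values are illustrative assumptions.
type ResourceHealth string

const (
	ResourceHealthy   ResourceHealth = "Healthy"
	ResourceUnhealthy ResourceHealth = "Unhealthy"
	ResourceUnknown   ResourceHealth = "Unknown"
)

// StatusItem is a hypothetical, simplified aggregated status entry.
type StatusItem struct {
	ClusterName string
	Health      ResourceHealth
}

// classify splits clusters into an unhealthy set and a healthy set.
// unknownCountsAsHealthy=true mimics the behavior described as the root cause;
// false mimics the behavior after this change, where Unknown lands in neither set.
func classify(items []StatusItem, unknownCountsAsHealthy bool) (unhealthy, healthy []string) {
	for _, it := range items {
		switch {
		case it.Health == ResourceUnhealthy:
			unhealthy = append(unhealthy, it.ClusterName)
		case it.Health == ResourceHealthy,
			it.Health == ResourceUnknown && unknownCountsAsHealthy:
			healthy = append(healthy, it.ClusterName)
		}
	}
	return unhealthy, healthy
}

func main() {
	items := []StatusItem{
		{"member1", ResourceHealthy},
		{"member2", ResourceUnhealthy},
		{"member3", ResourceUnknown},
	}
	_, before := classify(items, true)
	_, after := classify(items, false)
	fmt.Println(before) // [member1 member3]
	fmt.Println(after)  // [member1]
}
```

In this sketch an Unknown cluster is simply left out of both sets after the change, so nothing downstream has a reason to clear its recorded unhealthy timestamp.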

Does this PR introduce a user-facing change?

Fixed a critical bug where application failover would never trigger if workloads oscillated between Unhealthy and Unknown states due to infrastructure events.

…th state

Fixes a critical bug where clusters with oscillating health states
(Unhealthy→Unknown→Unhealthy) never accumulated enough unhealthy time
to trigger application failover. The Unknown state was incorrectly
treated as healthy, causing deleteIrrelevantClusters to erase the
toleration timer. Now only confirmed Healthy observations reset the timer.

Signed-off-by: Ady0333 <adityashinde1525@gmail.com>
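To make the oscillation scenario from the commit message concrete, here is a small simulation. It is a sketch under stated assumptions, not the controller's implementation: `timerMap`, `observe`, and `simulate` are hypothetical names, the plain `delete` stands in for deleteIrrelevantClusters, and the 10-second step and 30-second toleration are arbitrary example values.

```go
package main

import (
	"fmt"
	"time"
)

type Health string

const (
	Healthy   Health = "Healthy"
	Unhealthy Health = "Unhealthy"
	Unknown   Health = "Unknown"
)

// timerMap records when each cluster was first observed unhealthy (hypothetical type).
type timerMap map[string]time.Time

// observe applies one health observation at time now and reports whether the
// cluster is currently unhealthy and has been so for at least toleration.
// resetOnUnknown=true models the old behavior, false models the fix.
func (t timerMap) observe(cluster string, h Health, now time.Time, toleration time.Duration, resetOnUnknown bool) bool {
	switch {
	case h == Unhealthy:
		if _, ok := t[cluster]; !ok {
			t[cluster] = now // start the timer on the first unhealthy observation
		}
	case h == Healthy, h == Unknown && resetOnUnknown:
		delete(t, cluster) // stand-in for deleteIrrelevantClusters
	}
	start, ok := t[cluster]
	return h == Unhealthy && ok && now.Sub(start) >= toleration
}

// simulate replays seq with stepSec seconds between observations and reports
// whether failover would be triggered at any point.
func simulate(seq []Health, stepSec, tolerationSec int, resetOnUnknown bool) bool {
	t := timerMap{}
	now := time.Unix(0, 0)
	step := time.Duration(stepSec) * time.Second
	toleration := time.Duration(tolerationSec) * time.Second
	for _, h := range seq {
		if t.observe("member1", h, now, toleration, resetOnUnknown) {
			return true
		}
		now = now.Add(step)
	}
	return false
}

func main() {
	seq := []Health{Unhealthy, Unknown, Unhealthy, Unknown, Unhealthy}
	fmt.Println(simulate(seq, 10, 30, true))  // false: every Unknown restarts the timer
	fmt.Println(simulate(seq, 10, 30, false)) // true: 40s accumulated by the last observation
}
```

With the old behavior the timer never runs for more than one step, so the toleration is never reached no matter how long the oscillation lasts.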
Copilot AI review requested due to automatic review settings April 25, 2026 12:11
@karmada-bot karmada-bot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 25, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where application failover mechanisms were being incorrectly suppressed. By refining how the system distinguishes between healthy and unknown cluster states, the fix ensures that transient infrastructure events do not reset or clear essential failover timers, thereby improving the reliability of workload failover during periods of intermittent connectivity.

Highlights

  • Logic Correction: Modified the cluster health classification logic to exclude ResourceUnknown from the healthy cluster set.
  • Failover Preservation: Ensured that unhealthy timestamps are preserved during ResourceUnknown transitions to prevent suppression of failover timers.
  • Test Update: Updated unit tests to validate that clusters in an unknown state are no longer incorrectly categorized as healthy.

@karmada-bot karmada-bot requested review from jwcesign and zach593 April 25, 2026 12:11
@karmada-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign charlesqq for approval. For more information see the Code Review Process.

@karmada-bot karmada-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 25, 2026

Ady0333 commented Apr 25, 2026

Hello @XiShanYongYe-Chang! Please review this PR and let me know if any changes are required.

gemini-code-assist Bot left a comment

Code Review

This pull request modifies the logic in the distinguishUnhealthyClustersWithOthers function within pkg/controllers/applicationfailover/common.go to exclude clusters with a ResourceUnknown health status from the others slice. The associated unit test in pkg/controllers/applicationfailover/common_test.go has been updated to align with this change. I have no feedback to provide.

Copilot AI left a comment

Pull request overview

Fixes application failover timer suppression by ensuring ResourceUnknown transitions do not reset the unhealthy timestamp, allowing toleration time to accumulate across transient “unknown” periods.

Changes:

  • Updated unhealthy/other cluster classification to treat only ResourceHealthy as “others” (no longer includes ResourceUnknown).
  • Adjusted unit test expectations for the ResourceUnknown case.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pkg/controllers/applicationfailover/common.go Stops classifying ResourceUnknown as “healthy/other” so failover timestamps aren’t cleared on Unknown transitions.
pkg/controllers/applicationfailover/common_test.go Updates distinguishUnhealthyClustersWithOthers test expectation for the all-Unknown scenario.

Comment thread pkg/controllers/applicationfailover/common_test.go
Comment thread pkg/controllers/applicationfailover/common.go
codecov-commenter commented Apr 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.90%. Comparing base (923d98e) to head (b50e143).
⚠️ Report is 58 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7432      +/-   ##
==========================================
- Coverage   42.15%   41.90%   -0.26%     
==========================================
  Files         876      879       +3     
  Lines       53645    54285     +640     
==========================================
+ Hits        22614    22746     +132     
- Misses      29334    29820     +486     
- Partials     1697     1719      +22     
Flag Coverage Δ
unittests 41.90% <100.00%> (-0.26%) ⬇️

Rename 'others' to 'healthyClusters' in distinguishUnhealthyClustersWithOthers
and update all call sites for clarity. Add regression test to verify that
unhealthy timestamps are preserved when health transitions to Unknown.

Signed-off-by: Ady0333 <adityashinde1525@gmail.com>
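The regression check described in this commit can be pictured roughly as below. This is only an illustrative sketch: `unhealthySince`, `markUnhealthy`, `dropHealthy`, and `timestampPreserved` are hypothetical names and do not reflect the signatures in the Karmada repository. It replays Unhealthy, then Unknown, then Unhealthy for one cluster and checks that the first timestamp survives.

```go
package main

import (
	"fmt"
	"time"
)

// unhealthySince is a hypothetical, simplified stand-in for the controller's
// per-cluster unhealthy timestamp store.
type unhealthySince map[string]time.Time

// markUnhealthy records the first unhealthy observation and keeps it on repeats.
func (m unhealthySince) markUnhealthy(cluster string, now time.Time) {
	if _, ok := m[cluster]; !ok {
		m[cluster] = now
	}
}

// dropHealthy clears timestamps only for clusters in the healthy set.
func (m unhealthySince) dropHealthy(healthyClusters []string) {
	for _, c := range healthyClusters {
		delete(m, c)
	}
}

// timestampPreserved replays Unhealthy -> Unknown -> Unhealthy for one cluster
// and reports whether the original timestamp survived the Unknown step.
// unknownInHealthySet=true models the classification before the fix.
func timestampPreserved(unknownInHealthySet bool) bool {
	m := unhealthySince{}
	t0 := time.Unix(100, 0)
	m.markUnhealthy("member1", t0)

	// Unknown observation: the cluster is in the healthy set only under the old classification.
	var healthyClusters []string
	if unknownInHealthySet {
		healthyClusters = append(healthyClusters, "member1")
	}
	m.dropHealthy(healthyClusters)

	m.markUnhealthy("member1", t0.Add(time.Minute))
	return m["member1"].Equal(t0)
}

func main() {
	fmt.Println(timestampPreserved(false)) // true
	fmt.Println(timestampPreserved(true))  // false
}
```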
@karmada-bot karmada-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 25, 2026
Labels

kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[bug]: Application failover permanently suppressed when workload oscillates between Unhealthy and Unknown states

4 participants