fix: preserve application failover timer across ResourceUnknown transitions#7432
fix: preserve application failover timer across ResourceUnknown transitions#7432Ady0333 wants to merge 2 commits into
Conversation
…th state Fixes a critical bug where clusters with oscillating health states (Unhealthy→Unknown→Unhealthy) never accumulated enough unhealthy time to trigger application failover. The Unknown state was incorrectly treated as healthy, causing deleteIrrelevantClusters to erase the toleration timer. Now only confirmed Healthy observations reset the timer. Signed-off-by: Ady0333 <adityashinde1525@gmail.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where application failover mechanisms were being incorrectly suppressed. By refining how the system distinguishes between healthy and unknown cluster states, the fix ensures that transient infrastructure events do not reset or clear essential failover timers, thereby improving the reliability of workload failover during periods of intermittent connectivity. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hello @XiShanYongYe-Chang !!! Please review this PR and let me know if any changes are required... |
There was a problem hiding this comment.
Code Review
This pull request modifies the logic in the distinguishUnhealthyClustersWithOthers function within pkg/controllers/applicationfailover/common.go to exclude clusters with a ResourceUnknown health status from the others slice. The associated unit test in pkg/controllers/applicationfailover/common_test.go has been updated to align with this change. I have no feedback to provide.
There was a problem hiding this comment.
Pull request overview
Fixes application failover timer suppression by ensuring ResourceUnknown transitions do not reset the unhealthy timestamp, allowing toleration time to accumulate across transient “unknown” periods.
Changes:
- Updated unhealthy/other cluster classification to treat only
ResourceHealthyas “others” (no longer includesResourceUnknown). - Adjusted unit test expectations for the
ResourceUnknowncase.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/controllers/applicationfailover/common.go | Stops classifying ResourceUnknown as “healthy/other” so failover timestamps aren’t cleared on Unknown transitions. |
| pkg/controllers/applicationfailover/common_test.go | Updates distinguishUnhealthyClustersWithOthers test expectation for the all-Unknown scenario. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7432 +/- ##
==========================================
- Coverage 42.15% 41.90% -0.26%
==========================================
Files 876 879 +3
Lines 53645 54285 +640
==========================================
+ Hits 22614 22746 +132
- Misses 29334 29820 +486
- Partials 1697 1719 +22
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Rename 'others' to 'healthyClusters' in distinguishUnhealthyClustersWithOthers and update all call sites for clarity. Add regression test to verify that unhealthy timestamps are preserved when health transitions to Unknown. Signed-off-by: Ady0333 <adityashinde1525@gmail.com>
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fixes a critical bug where application failover is permanently suppressed when workloads oscillate between Unhealthy and Unknown states.
Which issue(s) this PR fixes:
Fixes #7431
Root cause:
distinguishUnhealthyClustersWithOtherswas treatingResourceUnknownthe same asResourceHealthy, causing the failover timer to be deleted on every Unknown transition (agent restarts, network blips, etc.).Changes:
ResourceHealthyas healthyDoes this PR introduce a user-facing change?