fix(e2e): retry check-cluster after HA db corruption recovery#6354
Conversation
After scaling back up from db corruption, deployment readiness only indicates that pod health checks passed, not that RAFT log catch-up is complete. Replace the immediate ovsdb-tool check-cluster call with a WaitUntil retry loop (2s interval, 30s timeout) to tolerate transient RAFT log inconsistency during recovery.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes

Hello @oilbeater, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request improves the reliability of high-availability end-to-end tests by introducing a robust retry mechanism for database health checks. Specifically, it replaces an immediate check with a WaitUntil loop after HA database corruption recovery, accounting for temporary RAFT log synchronization delays and preventing premature test failures.
Code Review
The pull request effectively addresses the flakiness of ovsdb-tool check-cluster after HA database corruption recovery by introducing a retry mechanism using WaitUntil. This is a robust approach to handle transient inconsistencies. To further enhance debuggability, especially when the WaitUntil loop eventually times out, it would be beneficial to log the stdout and stderr from failed check-cluster attempts within the retry condition. This would provide more context for troubleshooting if the database fails to become healthy within the allotted time.
```go
ginkgo.By("Waiting for db file " + dbFile + " on node " + pod.Spec.NodeName + " to be healthy")
framework.WaitUntil(2*time.Second, 30*time.Second, func(_ context.Context) (bool, error) {
	_, _, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, checkCmd)
	return err == nil, nil
```
To improve debuggability, especially if the WaitUntil loop times out, it would be helpful to capture and log the stdout and stderr when ExecShellInPod returns an error. This provides more context about why the check-cluster command failed during a retry attempt:

```go
stdout, stderr, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, checkCmd)
if err != nil {
	framework.Logf("check-cluster failed for %s on %s: err = %v, stdout = %q, stderr = %q", dbFile, pod.Spec.NodeName, err, stdout, stderr)
	return false, nil
}
return true, nil
```
Summary

- Replace the immediate `ovsdb-tool check-cluster` call with a `WaitUntil` retry loop (2s interval, 30s timeout) after HA db corruption recovery
- Tolerates transient `check-cluster` failures during RAFT log catch-up
- Follows the same retry pattern as `getDbSidsFromClusterStatus()` in the same file (added in commit 9e1109c)

Test plan

- HA db corruption recovery e2e test exercising `ovsdb-tool check-cluster`

🤖 Generated with Claude Code