
fix(e2e): wait for RAFT cluster convergence in HA db corruption test#6348

Merged
oilbeater merged 1 commit into master from fix/ha-test-wait-cluster-convergence on Feb 26, 2026

Conversation

@oilbeater
Collaborator

Summary

  • After OVN database corruption recovery, the RAFT cluster may not immediately show all servers in cluster/status output, even though pods are ready and db files pass ovsdb-tool check-cluster. This causes intermittent test failure: unexpected number of servers in ovnsb db status (expected 3, got 1).
  • Replace the immediate server count assertion in getDbSidsFromClusterStatus() with a WaitUntil poll (2s interval, 30s timeout) that waits for all pods to report the expected number of servers before validating.
  • When the cluster is already converged, the poll succeeds on the first attempt with no additional delay.

Test plan

  • Run the HA e2e test suite multiple times to verify the flaky failure no longer occurs
  • Verify no regressions in other HA tests

🤖 Generated with Claude Code

After OVN database corruption recovery, the RAFT cluster may not
immediately show all servers in `cluster/status` output even though
pods are ready and db files are healthy. This causes intermittent
failures in `getDbSidsFromClusterStatus()` when it asserts the server
count right after recovery.

Replace the immediate assertion with a WaitUntil poll (up to 30s) that
waits for all pods to report the expected number of servers before
proceeding with validation.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Feb 26, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @oilbeater, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a flakiness issue in the High Availability (HA) end-to-end tests, specifically concerning the recovery of OVN databases after corruption. It ensures that the test suite correctly waits for the underlying RAFT cluster to fully converge and report all its members before validating its state, thereby preventing premature assertions and improving the reliability of the test suite.

Highlights

  • Test Reliability Improvement: Introduced a "WaitUntil" polling mechanism in the getDbSidsFromClusterStatus function to ensure the OVN RAFT cluster has converged and reports all expected servers before proceeding with validation.
  • Flaky Test Fix: Resolved an intermittent e2e test failure where the RAFT cluster status was checked too early after OVN database corruption recovery, leading to an "unexpected number of servers" error.
  • Configurable Waiting: Configured the polling to check every 2 seconds with a 30-second timeout, ensuring that tests do not fail due to transient cluster states and avoid unnecessary delays when the cluster is already stable.
Changelog
  • test/e2e/ha/ha_test.go
    • Replaced an immediate assertion for the number of OVN DB servers with a WaitUntil loop to poll for cluster convergence.
    • Added logic to count servers based on the "at" keyword in ovsdb-tool cluster/status output.
    • Updated error messages to reflect the use of lastStdout and lastStderr from the polling loop.
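
The "at"-keyword counting mentioned in the changelog can be illustrated as follows. The sample `cluster/status` output below is illustrative, not captured from a real cluster, and the test itself iterates with `strings.SplitSeq`; `strings.Split` is used here to the same effect:

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// countServers counts the lines of cluster/status output whose
// whitespace-separated fields include the standalone token "at", which
// appears once per known server entry, so the count equals the number
// of servers the cluster currently reports.
func countServers(status string) int {
	count := 0
	for _, line := range strings.Split(status, "\n") {
		if slices.Contains(strings.Fields(line), "at") {
			count++
		}
	}
	return count
}

func main() {
	// Illustrative excerpt of a 3-node cluster's status output.
	status := `Servers:
0f7a (0f7a at tcp:10.16.0.2:6644) (self)
1b2c (1b2c at tcp:10.16.0.3:6644)
2d3e (2d3e at tcp:10.16.0.4:6644)`
	fmt.Println(countServers(status)) // 3
}
```

Because `strings.Fields` splits on whitespace, words that merely contain "at" (such as "Address:") do not inflate the count.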
Activity
  • No specific activity has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              Command              Description
Code Review          /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary      Provides a summary of the current pull request in its current state.
Comment              @gemini-code-assist  Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@dosubot dosubot bot added the test automation tests label Feb 26, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes a flaky e2e test by adding a WaitUntil poll to wait for RAFT cluster convergence. The change is correct in principle. I've suggested a refactoring to improve the implementation by removing duplicated logic and a redundant check, and to add a comment to clarify a fragile usage of a framework function. This will make the code more robust and maintainable.

Comment on lines +140 to +156
			framework.WaitUntil(2*time.Second, 30*time.Second, func(_ context.Context) (bool, error) {
				stdout, stderr, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, cmdClusterStatus(db))
				if err != nil {
					return false, nil
				}
				lastStdout, lastStderr = stdout, stderr
				var count int
				for line := range strings.SplitSeq(stdout, "\n") {
					if slices.Contains(strings.Fields(line), "at") {
						count++
					}
				}
				return count == expectedCount, nil
			}, fmt.Sprintf("ovn%s db on pod %s to show %d servers", db, pod.Name, expectedCount))

			status := parseClusterStatus(lastStdout)
			framework.ExpectHaveLen(status.Servers, expectedCount, "unexpected number of servers in ovn%s db status in pod %s: stdout = %q, stderr = %q", db, pod.Name, lastStdout, lastStderr)
Contributor


medium

There are a couple of improvements that can be made here:

  1. The logic to check for cluster convergence inside WaitUntil duplicates the server counting logic from parseClusterStatus. You can simplify this by calling parseClusterStatus inside the WaitUntil condition. This also makes the ExpectHaveLen check on line 156 redundant.
  2. The framework.WaitUntil function ignores its first argument and uses a hardcoded 2-second polling interval. The current code works by coincidence because the interval you pass matches the hardcoded one, but this is fragile.

Here is a suggested refactoring that addresses both points. It uses parseClusterStatus inside the wait loop and also adds a comment to clarify the behavior of WaitUntil.

			var status *clusterStatus
			// The first argument to WaitUntil is ignored; the poll interval is hardcoded to 2s.
			framework.WaitUntil(2*time.Second, 30*time.Second, func(_ context.Context) (bool, error) {
				stdout, stderr, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, cmdClusterStatus(db))
				if err != nil {
					return false, nil
				}
				lastStdout, lastStderr = stdout, stderr
				status = parseClusterStatus(stdout)
				return len(status.Servers) == expectedCount, nil
			}, fmt.Sprintf("ovn%s db on pod %s to show %d servers", db, pod.Name, expectedCount))

@oilbeater oilbeater merged commit 9e1109c into master Feb 26, 2026
72 of 76 checks passed
@oilbeater oilbeater deleted the fix/ha-test-wait-cluster-convergence branch February 26, 2026 08:59