
fix(e2e): wait for RAFT cluster convergence in HA db corruption test#6348

Merged
oilbeater merged 1 commit into master from fix/ha-test-wait-cluster-convergence on Feb 26, 2026

Conversation

@oilbeater
Collaborator

Summary

  • After OVN database corruption recovery, the RAFT cluster may not immediately show all servers in cluster/status output, even though pods are ready and db files pass ovsdb-tool check-cluster. This causes intermittent test failure: unexpected number of servers in ovnsb db status (expected 3, got 1).
  • Replace the immediate server count assertion in getDbSidsFromClusterStatus() with a WaitUntil poll (2s interval, 30s timeout) that waits for all pods to report the expected number of servers before validating.
  • When the cluster is already converged, the poll succeeds on the first attempt with no additional delay.

Test plan

  • Run the HA e2e test suite multiple times to verify the flaky failure no longer occurs
  • Verify no regressions in other HA tests

🤖 Generated with Claude Code

After OVN database corruption recovery, the RAFT cluster may not
immediately show all servers in `cluster/status` output even though
pods are ready and db files are healthy. This causes intermittent
failures in `getDbSidsFromClusterStatus()` when it asserts the server
count right after recovery.

Replace the immediate assertion with a WaitUntil poll (up to 30s) that
waits for all pods to report the expected number of servers before
proceeding with validation.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Feb 26, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @oilbeater, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a flakiness issue in the High Availability (HA) end-to-end tests, specifically concerning the recovery of OVN databases after corruption. It ensures that the test suite correctly waits for the underlying RAFT cluster to fully converge and report all its members before validating its state, thereby preventing premature assertions and improving the reliability of the test suite.

Highlights

  • Test Reliability Improvement: Introduced a "WaitUntil" polling mechanism in the getDbSidsFromClusterStatus function to ensure the OVN RAFT cluster has converged and reports all expected servers before proceeding with validation.
  • Flaky Test Fix: Resolved an intermittent e2e test failure where the RAFT cluster status was checked too early after OVN database corruption recovery, leading to an "unexpected number of servers" error.
  • Configurable Waiting: Configured the polling to check every 2 seconds with a 30-second timeout, ensuring that tests do not fail due to transient cluster states and avoid unnecessary delays when the cluster is already stable.
Changelog
  • test/e2e/ha/ha_test.go
    • Replaced an immediate assertion for the number of OVN DB servers with a WaitUntil loop to poll for cluster convergence.
    • Added logic to count servers based on the "at" keyword in ovsdb-tool cluster/status output.
    • Updated error messages to reflect the use of lastStdout and lastStderr from the polling loop.
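
The "at"-keyword counting mentioned in the changelog can be illustrated as follows. The sample `cluster/status` output below is illustrative, not captured from a real cluster, and the test itself iterates with `strings.SplitSeq`; `strings.Split` is used here to the same effect:

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// countServers counts the lines of cluster/status output whose
// whitespace-separated fields include the standalone token "at", which
// appears once per known server entry, so the count equals the number
// of servers the cluster currently reports.
func countServers(status string) int {
	count := 0
	for _, line := range strings.Split(status, "\n") {
		if slices.Contains(strings.Fields(line), "at") {
			count++
		}
	}
	return count
}

func main() {
	// Illustrative excerpt of a 3-node cluster's status output.
	status := `Servers:
0f7a (0f7a at tcp:10.16.0.2:6644) (self)
1b2c (1b2c at tcp:10.16.0.3:6644)
2d3e (2d3e at tcp:10.16.0.4:6644)`
	fmt.Println(countServers(status)) // 3
}
```

Because `strings.Fields` splits on whitespace, words that merely contain "at" (such as "Address:") do not inflate the count.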
Activity
  • No specific activity has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              Command              Description
Code Review          /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary      Provides a summary of the current pull request in its current state.
Comment              @gemini-code-assist  Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@dosubot dosubot bot added the test automation tests label Feb 26, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes a flaky e2e test by adding a WaitUntil poll to wait for RAFT cluster convergence. The change is correct in principle. I've suggested a refactoring to improve the implementation by removing duplicated logic and a redundant check, and to add a comment to clarify a fragile usage of a framework function. This will make the code more robust and maintainable.

Comment on lines +140 to +156
			framework.WaitUntil(2*time.Second, 30*time.Second, func(_ context.Context) (bool, error) {
				stdout, stderr, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, cmdClusterStatus(db))
				if err != nil {
					return false, nil
				}
				lastStdout, lastStderr = stdout, stderr
				var count int
				for line := range strings.SplitSeq(stdout, "\n") {
					if slices.Contains(strings.Fields(line), "at") {
						count++
					}
				}
				return count == expectedCount, nil
			}, fmt.Sprintf("ovn%s db on pod %s to show %d servers", db, pod.Name, expectedCount))

			status := parseClusterStatus(lastStdout)
			framework.ExpectHaveLen(status.Servers, expectedCount, "unexpected number of servers in ovn%s db status in pod %s: stdout = %q, stderr = %q", db, pod.Name, lastStdout, lastStderr)
Contributor


medium

There are a couple of improvements that can be made here:

  1. The logic to check for cluster convergence inside WaitUntil duplicates the server counting logic from parseClusterStatus. You can simplify this by calling parseClusterStatus inside the WaitUntil condition. This also makes the ExpectHaveLen check on line 156 redundant.
  2. The framework.WaitUntil function ignores its first argument and uses a hardcoded 2-second polling interval. The current code works by coincidence because the interval you pass matches the hardcoded one, but this is fragile.

Here is a suggested refactoring that addresses both points. It uses parseClusterStatus inside the wait loop and also adds a comment to clarify the behavior of WaitUntil.

			var status *clusterStatus
			// The first argument to WaitUntil is ignored; the poll interval is hardcoded to 2s.
			framework.WaitUntil(2*time.Second, 30*time.Second, func(_ context.Context) (bool, error) {
				stdout, stderr, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, cmdClusterStatus(db))
				if err != nil {
					return false, nil
				}
				lastStdout, lastStderr = stdout, stderr
				status = parseClusterStatus(stdout)
				return len(status.Servers) == expectedCount, nil
			}, fmt.Sprintf("ovn%s db on pod %s to show %d servers", db, pod.Name, expectedCount))

@oilbeater oilbeater merged commit 9e1109c into master Feb 26, 2026
72 of 76 checks passed
@oilbeater oilbeater deleted the fix/ha-test-wait-cluster-convergence branch February 26, 2026 08:59