fix(e2e): wait for RAFT cluster convergence in HA db corruption test#6348
Conversation
After OVN database corruption recovery, the RAFT cluster may not immediately show all servers in `cluster/status` output even though pods are ready and db files are healthy. This causes intermittent failures in `getDbSidsFromClusterStatus()` when it asserts the server count right after recovery. Replace the immediate assertion with a WaitUntil poll (up to 30s) that waits for all pods to report the expected number of servers before proceeding with validation.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
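For context, the server count the test asserts on is derived from the RAFT `cluster/status` output, where each member line in the `Servers:` section contains an `at <address>` token. A minimal, self-contained sketch of that counting logic (the sample output below is abbreviated and illustrative, not real cluster output):

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// countServers counts lines whose whitespace-separated fields contain the
// token "at", which appears once per cluster member in the "Servers:"
// section of cluster/status output.
func countServers(status string) int {
	count := 0
	for _, line := range strings.Split(status, "\n") {
		if slices.Contains(strings.Fields(line), "at") {
			count++
		}
	}
	return count
}

func main() {
	// Abbreviated sample of cluster/status output (illustrative only).
	sample := `Servers:
 1a2b (1a2b at tcp:10.0.0.1:6643) (self)
 3c4d (3c4d at tcp:10.0.0.2:6643)
 5e6f (5e6f at tcp:10.0.0.3:6643)`
	fmt.Println(countServers(sample)) // prints 3
}
```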
Summary of Changes

Hello @oilbeater, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a flakiness issue in the High Availability (HA) end-to-end tests, specifically concerning the recovery of OVN databases after corruption. It ensures that the test suite correctly waits for the underlying RAFT cluster to fully converge and report all its members before validating its state, thereby preventing premature assertions and improving the reliability of the test suite.

Highlights
Changelog
Activity
Code Review
This pull request fixes a flaky e2e test by adding a WaitUntil poll to wait for RAFT cluster convergence. The change is correct in principle. I've suggested a refactoring to improve the implementation by removing duplicated logic and a redundant check, and to add a comment to clarify a fragile usage of a framework function. This will make the code more robust and maintainable.
```go
framework.WaitUntil(2*time.Second, 30*time.Second, func(_ context.Context) (bool, error) {
	stdout, stderr, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, cmdClusterStatus(db))
	if err != nil {
		return false, nil
	}
	lastStdout, lastStderr = stdout, stderr
	var count int
	for line := range strings.SplitSeq(stdout, "\n") {
		if slices.Contains(strings.Fields(line), "at") {
			count++
		}
	}
	return count == expectedCount, nil
}, fmt.Sprintf("ovn%s db on pod %s to show %d servers", db, pod.Name, expectedCount))

status := parseClusterStatus(lastStdout)
framework.ExpectHaveLen(status.Servers, expectedCount, "unexpected number of servers in ovn%s db status in pod %s: stdout = %q, stderr = %q", db, pod.Name, lastStdout, lastStderr)
```
There are a couple of improvements that can be made here:

- The logic to check for cluster convergence inside `WaitUntil` duplicates the server counting logic from `parseClusterStatus`. You can simplify this by calling `parseClusterStatus` inside the `WaitUntil` condition. This also makes the `ExpectHaveLen` check on line 156 redundant.
- The `framework.WaitUntil` function ignores its first argument and uses a hardcoded 2-second polling interval. The current code works by coincidence because the interval you pass matches the hardcoded one, but this is fragile.

Here is a suggested refactoring that addresses both points. It uses `parseClusterStatus` inside the wait loop and also adds a comment to clarify the behavior of `WaitUntil`.
```go
var status *clusterStatus
// The first argument to WaitUntil is ignored; the poll interval is hardcoded to 2s.
framework.WaitUntil(2*time.Second, 30*time.Second, func(_ context.Context) (bool, error) {
	stdout, stderr, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, cmdClusterStatus(db))
	if err != nil {
		return false, nil
	}
	lastStdout, lastStderr = stdout, stderr
	status = parseClusterStatus(stdout)
	return len(status.Servers) == expectedCount, nil
}, fmt.Sprintf("ovn%s db on pod %s to show %d servers", db, pod.Name, expectedCount))
```
Summary
- After OVN database corruption recovery, the RAFT cluster may not immediately show all servers in `cluster/status` output, even though pods are ready and db files pass `ovsdb-tool check-cluster`. This causes an intermittent test failure: `unexpected number of servers in ovnsb db status` (expected 3, got 1).
- Replace the immediate assertion in `getDbSidsFromClusterStatus()` with a `WaitUntil` poll (2s interval, 30s timeout) that waits for all pods to report the expected number of servers before validating.

Test plan
🤖 Generated with Claude Code