fix(e2e): retry check-cluster after HA db corruption recovery#6354
Conversation
After scaling back up from db corruption, deployment readiness only indicates that pod health checks passed, not that RAFT log catch-up is complete. Replace the immediate ovsdb-tool check-cluster call with a WaitUntil retry loop (2s interval, 30s timeout) to tolerate transient RAFT log inconsistency during recovery.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes

Hello @oilbeater, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request improves the reliability of high-availability end-to-end tests by introducing a robust retry mechanism for database health checks. Specifically, it replaces an immediate check with a WaitUntil loop after HA database corruption recovery, accounting for temporary RAFT log synchronization delays and preventing premature test failures.
Code Review
The pull request effectively addresses the flakiness of ovsdb-tool check-cluster after HA database corruption recovery by introducing a retry mechanism using WaitUntil. This is a robust approach to handle transient inconsistencies. To further enhance debuggability, especially when the WaitUntil loop eventually times out, it would be beneficial to log the stdout and stderr from failed check-cluster attempts within the retry condition. This would provide more context for troubleshooting if the database fails to become healthy within the allotted time.
```go
ginkgo.By("Waiting for db file " + dbFile + " on node " + pod.Spec.NodeName + " to be healthy")
framework.WaitUntil(2*time.Second, 30*time.Second, func(_ context.Context) (bool, error) {
	_, _, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, checkCmd)
	return err == nil, nil
```
To improve debuggability, especially if the WaitUntil loop times out, it would be helpful to capture and log the stdout and stderr when ExecShellInPod returns an error. This provides more context about why the check-cluster command failed during a retry attempt:

```go
stdout, stderr, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, checkCmd)
if err != nil {
	framework.Logf("check-cluster failed for %s on %s: err = %v, stdout = %q, stderr = %q", dbFile, pod.Spec.NodeName, err, stdout, stderr)
	return false, nil
}
return true, nil
```
Summary

- Replace the immediate `ovsdb-tool check-cluster` call with a `WaitUntil` retry loop (2s interval, 30s timeout) after HA db corruption recovery
- Tolerates transient `check-cluster` failures during RAFT log catch-up
- Follows the same retry pattern as `getDbSidsFromClusterStatus()` in the same file (added in commit 9e1109c)

Test plan

- HA db corruption recovery e2e test exercising `ovsdb-tool check-cluster`

🤖 Generated with Claude Code