
fix(e2e): retry check-cluster after HA db corruption recovery#6354

Merged
oilbeater merged 1 commit into master from fix/ha-db-corruption-recovery-check on Feb 26, 2026
Conversation

@oilbeater
Collaborator

Summary

  • Replace the immediate ovsdb-tool check-cluster call with a WaitUntil retry loop (2s interval, 30s timeout) after HA db corruption recovery
  • After scaling back up, deployment readiness only indicates that pod health checks passed, not that RAFT log catch-up is complete; the recovering node may still be receiving snapshots or log entries from the leader, causing a transient RAFT log inconsistency that makes check-cluster fail
  • This follows the same pattern already used in getDbSidsFromClusterStatus() in the same file (added in commit 9e1109c)

Test plan

  • Run HA e2e test suite multiple times to verify no more flaky failures from ovsdb-tool check-cluster

🤖 Generated with Claude Code

After scaling back up from db corruption, deployment readiness only
indicates pod health checks passed, not that RAFT log catch-up is
complete. Replace the immediate ovsdb-tool check-cluster call with a
WaitUntil retry loop (2s interval, 30s timeout) to tolerate transient
RAFT log inconsistency during recovery.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Feb 26, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @oilbeater, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the reliability of high-availability end-to-end tests by introducing a robust retry mechanism for database health checks. Specifically, it replaces an immediate check with a WaitUntil loop after HA database corruption recovery, accounting for temporary RAFT log synchronization delays and preventing premature test failures.

Highlights

  • Retry Mechanism for DB Health Check: Implemented a WaitUntil retry loop for 'ovsdb-tool check-cluster' after HA database corruption recovery.
  • Addresses Flaky Tests: Mitigates flaky e2e test failures caused by transient RAFT log inconsistencies during node recovery.
  • Pattern Alignment: Follows an established retry pattern already present in the 'ha_test.go' file.
Changelog
  • test/e2e/ha/ha_test.go
    • Replaced an immediate 'ovsdb-tool check-cluster' call with a 'WaitUntil' retry loop (2s interval, 30s timeout) to check database health after HA corruption recovery.
Activity
  • The author identified a source of flaky e2e test failures related to HA database corruption recovery.
  • A solution was proposed to introduce a retry mechanism for database health checks.
  • The change aligns with an existing retry pattern within the same file.
  • A test plan was provided to verify the fix by repeatedly running the HA e2e test suite.

@dosubot dosubot bot added the test automation tests label Feb 26, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request effectively addresses the flakiness of ovsdb-tool check-cluster after HA database corruption recovery by introducing a retry mechanism using WaitUntil. This is a robust approach to handle transient inconsistencies. To further enhance debuggability, especially when the WaitUntil loop eventually times out, it would be beneficial to log the stdout and stderr from failed check-cluster attempts within the retry condition. This would provide more context for troubleshooting if the database fails to become healthy within the allotted time.

ginkgo.By("Waiting for db file " + dbFile + " on node " + pod.Spec.NodeName + " to be healthy")
framework.WaitUntil(2*time.Second, 30*time.Second, func(_ context.Context) (bool, error) {
	_, _, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, checkCmd)
	return err == nil, nil
})
Contributor


medium

To improve debuggability, especially if the WaitUntil loop times out, it would be helpful to log the stdout and stderr when ExecShellInPod returns an error. This provides more context about why the check-cluster command failed during a retry attempt.

			stdout, stderr, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, checkCmd)
			if err != nil {
				framework.Logf("check-cluster failed for %s on %s: err = %v, stdout = %q, stderr = %q", dbFile, pod.Spec.NodeName, err, stdout, stderr)
				return false, nil
			}
			return true, nil

@oilbeater oilbeater merged commit 5400b1e into master Feb 26, 2026
76 checks passed
@oilbeater oilbeater deleted the fix/ha-db-corruption-recovery-check branch February 26, 2026 17:32

Labels

size:XS This PR changes 0-9 lines, ignoring generated files. test automation tests
