
Commit 3f45d76

oilbeater and claude committed
fix(e2e): retry check-cluster after HA db corruption recovery
After scaling back up from db corruption, deployment readiness only indicates that pod health checks passed, not that RAFT log catch-up is complete. Replace the immediate ovsdb-tool check-cluster call with a WaitUntil retry loop (2s interval, 30s timeout) to tolerate transient RAFT log inconsistency during recovery.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6240894 commit 3f45d76

File tree

1 file changed: +5 −3 lines changed

test/e2e/ha/ha_test.go

Lines changed: 5 additions & 3 deletions
@@ -287,9 +287,11 @@ func corruptAndRecover(f *framework.Framework, deploy *appsv1.Deployment, dbFile
 		newNodes.Clear()
 		for pod := range slices.Values(pods.Items) {
 			newNodes.Insert(pod.Spec.NodeName)
-			ginkgo.By("Checking whether db file " + dbFile + " on node " + pod.Spec.NodeName + " is healthy")
-			stdout, stderr, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, checkCmd)
-			framework.ExpectNoError(err, fmt.Sprintf("failed to check db file %q: stdout = %q, stderr = %q", dbFile, stdout, stderr))
+			ginkgo.By("Waiting for db file " + dbFile + " on node " + pod.Spec.NodeName + " to be healthy")
+			framework.WaitUntil(2*time.Second, 30*time.Second, func(_ context.Context) (bool, error) {
+				_, _, err := framework.ExecShellInPod(context.Background(), f, pod.Namespace, pod.Name, checkCmd)
+				return err == nil, nil
+			}, fmt.Sprintf("db file %s on node %s to be healthy", dbFile, pod.Spec.NodeName))
 		}
 		framework.ExpectEqual(newNodes, nodes, "the set of nodes hosting ovn-central pods should be the same as before")
