fix(ha): detect and recover from split-brain leader with incomplete cluster membership#6363
Conversation
fix(ha): detect and recover from split-brain leader with incomplete cluster membership

After a DB file corruption recovery, a node may rejoin the cluster and receive an incomplete snapshot (e.g., due to a broken pipe during transfer). If the election timer expires before the snapshot sync completes, this node can self-elect as leader with an empty/incomplete database that is missing some cluster members from its configuration. This causes the missing servers to be permanently excluded from the cluster, resulting in a split-brain state where ovn-northd cannot find a healthy SB leader and the affected pod never becomes ready.

Add a cluster integrity check in the leader checker that verifies a leader node has the expected number of cluster members (derived from NODE_IPS). If a leader has fewer members than expected, the process exits to trigger re-election, allowing the cluster to re-form with correct membership.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue in OVN's High Availability setup where database corruption recovery could lead to a persistent split-brain scenario. By proactively checking the cluster's membership integrity, the system can now identify and self-correct situations where a leader operates with an incomplete view of its peers, ensuring robust and reliable cluster operations.

Highlights
Changelog
Activity
Code Review
This pull request introduces a cluster integrity check to detect and recover from split-brain scenarios in OVN HA setups. The overall approach of checking the number of cluster members against the expected count and forcing a re-election if there's a mismatch is sound. The implementation is straightforward. I have one suggestion to improve the robustness of parsing the cluster status output, which is currently a bit fragile.
serverCount := 0
for line := range strings.SplitSeq(output, "\n") {
	if slices.Contains(strings.Fields(line), "at") {
		serverCount++
	}
}
The current method of parsing the cluster/status output by splitting each line into fields and checking for the presence of "at" is a bit fragile. It might lead to incorrect server counts if the word "at" appears in other parts of the output for any reason in future OVN versions. A more robust approach would be to locate the "Servers:" section and count the indented lines that follow.
Additionally, strings.SplitSeq (iterated with a range expression) was only added in Go 1.24. Using strings.Split would be more compatible with older Go versions and fits well with the more robust parsing logic.
serverCount := 0
inServersSection := false
for _, line := range strings.Split(output, "\n") {
if !inServersSection {
if line == "Servers:" {
inServersSection = true
}
continue
}
// The server list is indented. An unindented or empty line
// signifies the end of the list.
if len(line) == 0 || (line[0] != ' ' && line[0] != '\t') {
break
}
serverCount++
}

fix(ha): detect and recover from split-brain leader with incomplete cluster membership

After a DB file corruption recovery, a node may rejoin the cluster and receive an incomplete snapshot (e.g., due to a broken pipe during transfer). If the election timer expires before the snapshot sync completes, this node can self-elect as leader with an empty/incomplete database that is missing some cluster members from its configuration. This causes the missing servers to be permanently excluded from the cluster, resulting in a split-brain state where ovn-northd cannot find a healthy SB leader and the affected pod never becomes ready.

Add a cluster integrity check in the leader checker that verifies a leader node has the expected number of cluster members (derived from NODE_IPS). If a leader has fewer members than expected, remove the corrupted db file and exit the process. On restart, ovn_db_pre_start will rebuild the db file from the raft header and rejoin the cluster with a clean state, breaking the split-brain cycle.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HA tests intentionally trigger leader switchover and split-brain recovery, which causes expected ovn-central pod restarts. Remove the restart check from kube-ovn-ha-e2e job to avoid false positives. Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
…luster membership (#6363) Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Summary
- Adds a cluster integrity check that verifies a leader has the expected number of cluster members (derived from NODE_IPS) and exits to trigger re-election
- Fixes a split-brain state in which the ovn-central pod never becomes ready after DB corruption recovery

Root Cause Analysis
The failure was traced through OVN logs from all 3 nodes:
- The snapshot transfer failed (Broken pipe), leaving fa10 with an incomplete/empty database

The fix detects this state by comparing the Servers count from cluster/status against the expected count from NODE_IPS. NB recovery succeeds because its leader remains stable throughout; the issue only affects SB due to the leadership transfer timing.

Test plan
- make lint passes
- make build-go compiles successfully
- Ran the "ovn db should recover automatically from db file corruption" e2e test multiple times to confirm the fix

🤖 Generated with Claude Code