
fix(ha): detect and recover from split-brain leader with incomplete cluster membership#6363

Merged
oilbeater merged 3 commits into master from fix/stale-leader-detection
Feb 27, 2026
Conversation

@oilbeater
Collaborator

Summary

  • After a DB file corruption recovery, a node may self-elect as leader with an incomplete cluster membership (missing some servers), causing a permanent split-brain state
  • Add a cluster integrity check in the leader checker that detects when a leader has fewer cluster members than expected (from NODE_IPS) and exits to trigger re-election
  • This fixes an intermittent HA e2e test failure where the third ovn-central pod never becomes ready after DB corruption recovery

Root Cause Analysis

The failure was traced through OVN logs from all 3 nodes:

  1. During DB corruption recovery, the corrupted node (fa10) rejoins the SB cluster and receives a snapshot from the leader (7770)
  2. The connection breaks (Broken pipe) during snapshot transfer, leaving fa10 with an incomplete/empty database
  3. Within 5.4 seconds (election timeout), fa10 self-elects as term 3 leader with only 1/1 vote, because it doesn't see other cluster members yet
  4. fa10's cluster configuration only contains 2 of 3 servers (missing 5c97), and this incomplete configuration gets propagated to the other follower
  5. The missing server (5c97) is permanently excluded from the cluster, and ovn-northd cannot find a healthy SB leader

The fix detects this state by comparing the Servers count from cluster/status against the expected count from NODE_IPS. NB recovery succeeds because its leader remains stable throughout; the issue only affects SB due to the leadership transfer timing.
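The comparison described above can be sketched in Go. This is a minimal illustration, not the actual code from pkg/ovn_leader_checker; the function name and the sample status text are assumptions for the sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// countClusterServers counts the entries in the "Servers:" section of
// `ovsdb-tool cluster/status` output. Server entries are indented; an
// empty or unindented line ends the section.
func countClusterServers(status string) int {
	count := 0
	inServers := false
	for _, line := range strings.Split(status, "\n") {
		if !inServers {
			if strings.TrimSpace(line) == "Servers:" {
				inServers = true
			}
			continue
		}
		if len(line) == 0 || (line[0] != ' ' && line[0] != '\t') {
			break
		}
		count++
	}
	return count
}

func main() {
	// Hypothetical status output from a leader with incomplete membership:
	// only 2 of the 3 expected servers appear.
	status := "Servers:\n" +
		"    7770 (7770 at tcp:10.0.0.1:6644) (self)\n" +
		"    fa10 (fa10 at tcp:10.0.0.2:6644)\n"

	// NODE_IPS lists the expected members, e.g. "10.0.0.1,10.0.0.2,10.0.0.3".
	expected := len(strings.Split("10.0.0.1,10.0.0.2,10.0.0.3", ","))
	got := countClusterServers(status)
	fmt.Printf("servers=%d expected=%d\n", got, expected)
	if got < expected {
		// The real checker would exit here to trigger re-election.
		fmt.Println("incomplete cluster membership detected")
	}
}
```

A check like this only needs to run on the node that believes it is leader, since the incomplete configuration originates there and propagates to followers.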

Test plan

  • Verify make lint passes
  • Verify make build-go compiles successfully
  • Run HA e2e test ovn db should recover automatically from db file corruption multiple times to confirm the fix
  • Verify the check does not trigger during normal operations (scale up/down, rolling restart)

🤖 Generated with Claude Code

…luster membership

After a DB file corruption recovery, a node may rejoin the cluster and
receive an incomplete snapshot (e.g., due to a broken pipe during
transfer). If the election timer expires before the snapshot sync
completes, this node can self-elect as leader with an empty/incomplete
database that is missing some cluster members from its configuration.

This causes the missing servers to be permanently excluded from the
cluster, resulting in a split-brain state where ovn-northd cannot find
a healthy SB leader and the affected pod never becomes ready.

Add a cluster integrity check in the leader checker that verifies a
leader node has the expected number of cluster members (derived from
NODE_IPS). If a leader has fewer members than expected, the process
exits to trigger re-election, allowing the cluster to re-form with
correct membership.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Feb 27, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue in OVN's High Availability setup where database corruption recovery could lead to a persistent split-brain scenario. By proactively checking the cluster's membership integrity, the system can now identify and self-correct situations where a leader operates with an incomplete view of its peers, ensuring robust and reliable cluster operations.

Highlights

  • Split-Brain Recovery: Implemented a mechanism to detect and recover from a split-brain state where an OVN leader has an incomplete cluster membership after database corruption recovery.
  • Cluster Integrity Check: Introduced a new checkDBClusterIntegrity function that verifies if the leader node has the expected number of cluster members, exiting to trigger re-election if an inconsistency is found.
  • HA E2E Test Fix: Resolved an intermittent HA end-to-end test failure where ovn-central pods would not become ready due to permanent exclusion of missing servers from the cluster.
Changelog
  • pkg/ovn_leader_checker/ovn.go
    • Added checkDBClusterIntegrity function to verify the number of cluster members against the expected count.
    • Integrated the checkDBClusterIntegrity function into the doOvnLeaderCheck routine for both NB and SB databases.
Activity
  • No human activity has occurred on this pull request yet.

@dosubot dosubot bot added the bug Something isn't working label Feb 27, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a cluster integrity check to detect and recover from split-brain scenarios in OVN HA setups. The overall approach of checking the number of cluster members against the expected count and forcing a re-election if there's a mismatch is sound. The implementation is straightforward. I have one suggestion to improve the robustness of parsing the cluster status output, which is currently a bit fragile.

Comment on lines +330 to +335
    serverCount := 0
    for line := range strings.SplitSeq(output, "\n") {
        if slices.Contains(strings.Fields(line), "at") {
            serverCount++
        }
    }

Severity: medium

The current method of parsing the cluster/status output by splitting each line into fields and checking for the presence of "at" is a bit fragile. It might lead to incorrect server counts if the word "at" appears in other parts of the output for any reason in future OVN versions. A more robust approach would be to locate the "Servers:" section and count the indented lines that follow.

Additionally, strings.SplitSeq with a range expression requires a recent Go toolchain (strings.SplitSeq was added in Go 1.24, building on range-over-func from Go 1.23). Using strings.Split would be more compatible with older Go versions and fits well with the more robust parsing logic.

    serverCount := 0
    inServersSection := false
    for _, line := range strings.Split(output, "\n") {
        if !inServersSection {
            if line == "Servers:" {
                inServersSection = true
            }
            continue
        }

        // The server list is indented. An unindented or empty line
        // signifies the end of the list.
        if len(line) == 0 || (line[0] != ' ' && line[0] != '\t') {
            break
        }
        serverCount++
    }

oilbeater and others added 2 commits February 27, 2026 11:41
…luster membership

After a DB file corruption recovery, a node may rejoin the cluster and
receive an incomplete snapshot (e.g., due to a broken pipe during
transfer). If the election timer expires before the snapshot sync
completes, this node can self-elect as leader with an empty/incomplete
database that is missing some cluster members from its configuration.

This causes the missing servers to be permanently excluded from the
cluster, resulting in a split-brain state where ovn-northd cannot find
a healthy SB leader and the affected pod never becomes ready.

Add a cluster integrity check in the leader checker that verifies a
leader node has the expected number of cluster members (derived from
NODE_IPS). If a leader has fewer members than expected, remove the
corrupted db file and exit the process. On restart, ovn_db_pre_start
will rebuild the db file from the raft header and rejoin the cluster
with a clean state, breaking the split-brain cycle.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
HA tests intentionally trigger leader switchover and split-brain
recovery, which causes expected ovn-central pod restarts. Remove the
restart check from kube-ovn-ha-e2e job to avoid false positives.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
@oilbeater oilbeater merged commit e229c02 into master Feb 27, 2026
76 checks passed
@oilbeater oilbeater deleted the fix/stale-leader-detection branch February 27, 2026 13:44
oilbeater added a commit that referenced this pull request Mar 31, 2026
…luster membership (#6363)

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>