
fix(e2e): reuse OVSDB client in address set tests to avoid HA deadlock #6368

Merged
oilbeater merged 1 commit into master from
fix-e2e-address-set-ovsdb-client-reuse
Mar 1, 2026

Conversation

@oilbeater
Collaborator

Summary

  • Cache the OVN NB client across WaitForAddressSet* calls using sync.Once instead of creating a new connection (2 OVSDB connections: schema + monitor) per call
  • Tolerate transient "not connected" errors during the poll loop, allowing libovsdb's automatic reconnection to recover after leader failover

Root Cause

In HA E2E tests, the address set test "should manage address set when EnableAddressSet is true" intermittently fails with either:

  1. Suite timeout (34+ minutes): The Monitor RPC hangs indefinitely during OVSDB client creation
  2. "not connected" error: The connection drops immediately after Monitor completes

Both failures share the same root cause — a deadlock in libovsdb triggered by RAFT leader change during the initial Monitor RPC setup:

  1. c.Monitor() holds rpcMutex.RLock(), waiting for monitor_cond RPC response
  2. The current OVN NB node loses leadership; the follower stops responding to monitor_cond
  3. watchForLeaderChange goroutine detects the change via _Server monitor update, tries rpcMutex.Lock() → blocked by Monitor's RLock
  4. Go's sync.RWMutex is write-preferring: once Lock() is pending, new RLock() calls also block
  5. handleInactivityProbes tries RLock() for Echo probe → also blocked
  6. Complete deadlock: no goroutine can proceed, no recovery mechanism exists
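
The write-preferring behaviour in step 4 can be reproduced with the standard library alone. The following is a minimal sketch, not libovsdb code: the local `mu` only stands in for `rpcMutex`, and `newReaderBlocked` is a hypothetical helper name. It assumes Go 1.18 or later for `TryRLock`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// newReaderBlocked reports whether a fresh read lock is refused while one
// reader holds the lock and a writer is waiting in Lock().
func newReaderBlocked() bool {
	var mu sync.RWMutex

	mu.RLock() // stands in for Monitor holding the read lock

	writerDone := make(chan struct{})
	go func() {
		mu.Lock() // stands in for the leader-change handler
		mu.Unlock()
		close(writerDone)
	}()

	// Probe with TryRLock until the writer is pending (or we give up).
	blocked := false
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if mu.TryRLock() {
			// Writer has not reached Lock() yet; undo and probe again.
			mu.RUnlock()
			time.Sleep(time.Millisecond)
			continue
		}
		blocked = true // a pending writer now excludes new readers
		break
	}

	mu.RUnlock() // release the first reader so the writer can finish
	<-writerDone
	return blocked
}

func main() {
	fmt.Println("new reader blocked behind pending writer:", newReaderBlocked())
}
```

In the sketch the first reader eventually releases its lock, so the program terminates. In the scenario described above the first reader is itself waiting on an RPC response that never arrives, which is why nothing can make progress.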

The test previously created 6 OVSDB connections (2 per WaitForAddressSet* call × 3 calls), each going through the full Connect + Monitor dance, significantly increasing the probability of hitting this race window.

Fix

  • Client caching: Reuse a single long-lived OVSDB client (as OVSDB connections are designed to be), reducing connections from 6 to 2 and eliminating repeated Monitor RPC setups
  • Poll resilience: Log and retry on transient List errors instead of immediately failing, giving libovsdb's built-in reconnection logic time to recover
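
The client caching idea can be illustrated with a small, self-contained sketch. It is not the PR's code: `stubClient`, `dialStub`, and `getCachedClient` are hypothetical names, and `dialStub` merely stands in for the Connect + Monitor setup.

```go
package main

import (
	"fmt"
	"sync"
)

// stubClient is a hypothetical stand-in for an OVSDB client.
type stubClient struct{ id int }

var (
	dialCount  int // how many times the expensive setup ran
	clientOnce sync.Once
	cachedCli  *stubClient
	cachedErr  error
)

// dialStub stands in for the expensive Connect + Monitor sequence.
func dialStub() (*stubClient, error) {
	dialCount++
	return &stubClient{id: dialCount}, nil
}

// getCachedClient returns the shared client, creating it on the first call only.
func getCachedClient() (*stubClient, error) {
	clientOnce.Do(func() {
		cachedCli, cachedErr = dialStub()
	})
	return cachedCli, cachedErr
}

func main() {
	for i := 0; i < 3; i++ {
		c, err := getCachedClient()
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println("call", i, "uses client", c.id)
	}
	fmt.Println("dial count:", dialCount) // prints "dial count: 1"
}
```

One property of `sync.Once` worth keeping in mind: the function runs exactly once, so a failed first attempt is cached as well and later calls return the same error without retrying.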

Test plan

  • Verify the address set E2E test passes in HA environments
  • Confirm no regressions in non-HA E2E test runs

🤖 Generated with Claude Code

Cache the OVN NB client across WaitForAddressSet* calls instead of
creating a new connection each time. In HA environments, repeated
connect/monitor cycles can trigger a deadlock in libovsdb when a leader
change occurs during the initial Monitor RPC setup: the Monitor goroutine
holds rpcMutex.RLock while the leader-change handler blocks on
rpcMutex.Lock, and Go's write-preferring RWMutex then also blocks the
inactivity probe's RLock, leaving no recovery path.

Also tolerate transient "not connected" errors during the poll loop
instead of failing immediately, giving libovsdb's automatic reconnection
time to recover after a leader failover.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dosubot dosubot bot added the size:M label (This PR changes 30-99 lines, ignoring generated files.) Mar 1, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an intermittent deadlock issue observed in HA E2E tests, specifically within address set management, which was caused by OVSDB client connection instability during leader failovers. By introducing client caching and enhancing the resilience of the polling mechanism to transient connection errors, the changes significantly improve the reliability and stability of these tests in high-availability environments.

Highlights

  • OVN NB Client Caching: Implemented caching of the OVN NB client across WaitForAddressSet* calls using sync.Once to reduce the number of OVSDB connections and repeated Monitor RPC setups, addressing a deadlock issue during HA leader changes.
  • Transient Error Tolerance: Modified the address set polling logic to tolerate transient "not connected" errors during client.List operations, allowing libovsdb's automatic reconnection to recover after HA leader failovers instead of immediately failing the test.
Changelog
  • test/e2e/framework/ovn_address_set.go
    • Introduced new global variables (ovnnbClientOnce, ovnnbClientErr, ovnnbCachedNbCli, ovnnbCachedModel) for caching the OVN NB client and its associated models using sync.Once.
    • Removed the defer client.Close() call in WaitForAddressSetCondition to prevent premature client closure and enable client reuse across multiple calls.
    • Modified the error handling within the WaitForAddressSetCondition poll loop to log and retry on transient client.List errors (e.g., "not connected") instead of immediately returning an error.
    • Refactored the getOVNNbClient function to initialize and return a single, cached OVN NB client and models using sync.Once, ensuring that the client is created only once per test run.

@dosubot dosubot bot added the bug (Something isn't working) and test (automation tests) labels Mar 1, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an intermittent deadlock in HA E2E tests by reusing a single OVSDB client across address set test functions, which is a good approach. The change to tolerate transient connection errors in the polling loop also improves resilience.

However, I've identified a potential issue in the client caching implementation in getOVNNbClient. The current approach will only monitor the tables provided in the first call, which could lead to subtle bugs if this framework function is reused with different tables in other tests. I've added a specific comment with details and suggestions.

Comment on lines +184 to +192
ovnnbClientOnce.Do(func() {
	ovnnbCachedNbCli, ovnnbCachedModel, ovnnbClientErr = ovs.NewDynamicOvnNbClient(
		ovnnbAddr,
		ovnNbTimeoutSeconds,
		ovsdbConnTimeout,
		ovsdbInactivityTimeout,
		tables...,
	)
})
Contributor


Priority: high

The current implementation of client caching has a potential flaw. The tables... argument is captured by the closure inside ovnnbClientOnce.Do, which means the OVSDB client will only be initialized to monitor the tables from the first call to getOVNNbClient.

Subsequent calls with a different set of tables will receive the cached client, which is not monitoring the newly requested tables. This will likely lead to unexpected behavior or test failures.

While this might work for the current usage within this file (where it's always called with ovnnb.AddressSetTable), it makes the function fragile and unsafe for general use in a framework.

Consider one of the following fixes:

  1. Rename the function to be more specific (e.g., getAddrSetOVNNbClient) and hardcode the table name, removing the tables argument. This makes its limited scope clear.
  2. Add a check to ensure all callers use the same set of tables, and fail the test otherwise.
  3. Implement a more robust caching mechanism that can handle multiple sets of tables (though this might be overly complex for this context).
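
Option 2 could be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `tableClient` and `getClientForTables` are hypothetical names, the table names in `main` are examples only, and the `slices` package requires Go 1.21 or later.

```go
package main

import (
	"fmt"
	"slices"
	"sync"
)

// tableClient is a hypothetical stand-in for a cached OVSDB client.
type tableClient struct{ tables []string }

var (
	tableClientOnce sync.Once
	cachedClient    *tableClient
	cachedTables    []string
)

// getClientForTables returns the cached client, but rejects callers that ask
// for a different table set than the one the client was created with.
func getClientForTables(tables ...string) (*tableClient, error) {
	want := slices.Clone(tables)
	slices.Sort(want) // compare table sets independent of argument order

	tableClientOnce.Do(func() {
		cachedTables = want
		cachedClient = &tableClient{tables: want}
	})

	if !slices.Equal(cachedTables, want) {
		return nil, fmt.Errorf("cached client monitors %v, but %v was requested", cachedTables, want)
	}
	return cachedClient, nil
}

func main() {
	_, err := getClientForTables("Address_Set")
	fmt.Println("first call:", err)

	_, err = getClientForTables("Logical_Switch")
	fmt.Println("mismatched call:", err)
}
```

Failing loudly on a mismatch keeps the single-client design while turning the silent "wrong tables monitored" case into an immediate, explainable test error.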

@coveralls

Pull Request Test Coverage Report for Build 22542692796

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 23.089%

Totals Coverage Status
Change from base Build 22538259073: 0.0%
Covered Lines: 12561
Relevant Lines: 54402

💛 - Coveralls

@oilbeater oilbeater merged commit ede1d71 into master Mar 1, 2026
76 of 77 checks passed
@oilbeater oilbeater deleted the fix-e2e-address-set-ovsdb-client-reuse branch March 1, 2026 12:42