fix(e2e): reuse OVSDB client in address set tests to avoid HA deadlock by oilbeater · Pull Request #6368 · kubeovn/kube-ovn

oilbeater · 2026-03-01T11:42:45Z

Summary

- Cache the OVN NB client across WaitForAddressSet* calls using sync.Once instead of creating a new connection (2 OVSDB connections: schema + monitor) per call
- Tolerate transient "not connected" errors during the poll loop, allowing libovsdb's automatic reconnection to recover after leader failover

Root Cause

In HA E2E tests, the address set test "should manage address set when EnableAddressSet is true" intermittently fails with either:

- Suite timeout (34+ minutes): The Monitor RPC hangs indefinitely during OVSDB client creation
- "not connected" error: The connection drops immediately after Monitor completes

Both failures share the same root cause, a deadlock in libovsdb triggered by a Raft leader change during the initial Monitor RPC setup:

1. c.Monitor() holds rpcMutex.RLock(), waiting for monitor_cond RPC response
2. The current OVN NB node loses leadership; the follower stops responding to monitor_cond
3. watchForLeaderChange goroutine detects the change via _Server monitor update, tries rpcMutex.Lock() → blocked by Monitor's RLock
4. Go's sync.RWMutex is write-preferring: once Lock() is pending, new RLock() calls also block
5. handleInactivityProbes tries RLock() for Echo probe → also blocked
6. Complete deadlock: no goroutine can proceed, no recovery mechanism exists
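
The write-preferring behavior in the sequence above can be reproduced with the standard library alone. The following is a minimal sketch, not libovsdb code: the function name newReaderBlocked is hypothetical, the first RLock stands in for the Monitor call, and the goroutine calling Lock stands in for the leader-change handler. It relies on TryRLock (Go 1.18+), which reports failure once a writer is queued, so the demo can observe the blocked state without hanging.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// newReaderBlocked reports whether a new reader is refused while one reader
// holds the lock and a writer is queued behind it.
func newReaderBlocked() bool {
	var mu sync.RWMutex

	// Stands in for c.Monitor() holding rpcMutex.RLock().
	mu.RLock()

	writerDone := make(chan struct{})
	go func() {
		// Stands in for the leader-change handler: blocks on the reader above.
		mu.Lock()
		mu.Unlock()
		close(writerDone)
	}()

	// Poll until the writer is queued. TryRLock succeeds while no writer is
	// pending and fails as soon as one is waiting.
	blocked := false
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if mu.TryRLock() {
			mu.RUnlock()
			time.Sleep(time.Millisecond)
			continue
		}
		blocked = true
		break
	}

	// Release the first reader so the writer can finish. In the deadlock
	// described above this release never happens.
	mu.RUnlock()
	<-writerDone
	return blocked
}

func main() {
	fmt.Println("new reader blocked:", newReaderBlocked())
}
```

In the real failure the first reader never releases, because the RPC response it is waiting for never arrives, so both the writer and every later reader wait forever.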

The test previously created 6 OVSDB connections (2 per WaitForAddressSet* call × 3 calls), each going through the full Connect + Monitor dance, significantly increasing the probability of hitting this race window.

Fix

- Client caching: Reuse a single long-lived OVSDB client (as OVSDB connections are designed to be), reducing connections from 6 to 2 and eliminating repeated Monitor RPC setups
- Poll resilience: Log and retry on transient List errors instead of immediately failing, giving libovsdb's built-in reconnection logic time to recover
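
The client caching idea can be illustrated with a small standalone sketch. It is not the code from this PR: fakeClient, dial, getClient and dialCount are hypothetical names, and dial stands in for the expensive Connect + Monitor setup. It only shows the sync.Once pattern of creating the client on first use and handing back the same instance afterwards.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeClient is a hypothetical stand-in for the OVN NB client.
type fakeClient struct{ id int }

var (
	clientOnce   sync.Once
	cachedClient *fakeClient
	clientErr    error
	dialCount    int // how many times the expensive setup ran
)

// dial is a hypothetical constructor standing in for Connect + Monitor.
func dial() (*fakeClient, error) {
	dialCount++
	return &fakeClient{id: dialCount}, nil
}

// getClient returns the process-wide client, creating it on first use only.
func getClient() (*fakeClient, error) {
	clientOnce.Do(func() {
		cachedClient, clientErr = dial()
	})
	return cachedClient, clientErr
}

func main() {
	for i := 0; i < 3; i++ {
		c, err := getClient()
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println("client id:", c.id)
	}
	fmt.Println("dial calls:", dialCount)
}
```

One property of this pattern to keep in mind: sync.Once runs its function exactly once, so a failure on the first attempt is cached too and every later call returns the same error.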

Test plan

- Verify the address set E2E test passes in HA environments
- Confirm no regressions in non-HA E2E test runs

🤖 Generated with Claude Code

Cache the OVN NB client across WaitForAddressSet* calls instead of creating a new connection each time. In HA environments, repeated connect/monitor cycles can trigger a deadlock in libovsdb when a leader change occurs during the initial Monitor RPC setup: the Monitor goroutine holds rpcMutex.RLock while the leader-change handler blocks on rpcMutex.Lock, and Go's write-preferring RWMutex then also blocks the inactivity probe's RLock, leaving no recovery path.

Also tolerate transient "not connected" errors during the poll loop instead of failing immediately, giving libovsdb's automatic reconnection time to recover after a leader failover.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist · 2026-03-01T11:42:59Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an intermittent deadlock issue observed in HA E2E tests, specifically within address set management, which was caused by OVSDB client connection instability during leader failovers. By introducing client caching and enhancing the resilience of the polling mechanism to transient connection errors, the changes significantly improve the reliability and stability of these tests in high-availability environments.

Highlights

- OVN NB Client Caching: Implemented caching of the OVN NB client across WaitForAddressSet* calls using sync.Once to reduce the number of OVSDB connections and repeated Monitor RPC setups, addressing a deadlock issue during HA leader changes.
- Transient Error Tolerance: Modified the address set polling logic to tolerate transient "not connected" errors during client.List operations, allowing libovsdb's automatic reconnection to recover after HA leader failovers instead of immediately failing the test.

Changelog

test/e2e/framework/ovn_address_set.go
- Introduced new global variables (ovnnbClientOnce, ovnnbClientErr, ovnnbCachedNbCli, ovnnbCachedModel) for caching the OVN NB client and its associated models using sync.Once.
- Removed the defer client.Close() call in WaitForAddressSetCondition to prevent premature client closure and enable client reuse across multiple calls.
- Modified the error handling within the WaitForAddressSetCondition poll loop to log and retry on transient client.List errors (e.g., "not connected") instead of immediately returning an error.
- Refactored the getOVNNbClient function to initialize and return a single, cached OVN NB client and models using sync.Once, ensuring that the client is created only once per test run.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request addresses an intermittent deadlock in HA E2E tests by reusing a single OVSDB client across address set test functions, which is a good approach. The change to tolerate transient connection errors in the polling loop also improves resilience.

However, I've identified a potential issue in the client caching implementation in getOVNNbClient. The current approach will only monitor the tables provided in the first call, which could lead to subtle bugs if this framework function is reused with different tables in other tests. I've added a specific comment with details and suggestions.

gemini-code-assist · 2026-03-01T11:44:25Z

test/e2e/framework/ovn_address_set.go

+	ovnnbClientOnce.Do(func() {
+		ovnnbCachedNbCli, ovnnbCachedModel, ovnnbClientErr = ovs.NewDynamicOvnNbClient(
+			ovnnbAddr,
+			ovnNbTimeoutSeconds,
+			ovsdbConnTimeout,
+			ovsdbInactivityTimeout,
+			tables...,
+		)
+	})


The current implementation of client caching has a potential flaw. The tables... argument is captured by the closure inside ovnnbClientOnce.Do, which means the OVSDB client will only be initialized to monitor the tables from the first call to getOVNNbClient.

Subsequent calls with a different set of tables will receive the cached client, which is not monitoring the newly requested tables. This will likely lead to unexpected behavior or test failures.

While this might work for the current usage within this file (where it's always called with ovnnb.AddressSetTable), it makes the function fragile and unsafe for general use in a framework.

Consider one of the following fixes:

- Rename the function to be more specific (e.g., getAddrSetOVNNbClient) and hardcode the table name, removing the tables argument. This makes its limited scope clear.
- Add a check to ensure all callers use the same set of tables, and fail the test otherwise.
- Implement a more robust caching mechanism that can handle multiple sets of tables (though this might be overly complex for this context).
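
As one possible shape for the second suggestion, the sketch below rejects callers that request a different table set than the one the cached client was created with. It is a standalone illustration, not the framework's code: fakeClient, getClient and sameTables are hypothetical names, and the table names are used only as example strings.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fakeClient is a hypothetical stand-in that records its monitored tables.
type fakeClient struct{ tables []string }

var (
	once   sync.Once
	cached *fakeClient
)

// sameTables compares two sorted table lists.
func sameTables(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// getClient creates the client on first use and afterwards returns an error
// if a caller asks for a different set of tables than the cached one.
func getClient(tables ...string) (*fakeClient, error) {
	want := append([]string(nil), tables...)
	sort.Strings(want)

	once.Do(func() {
		cached = &fakeClient{tables: want}
	})
	if !sameTables(cached.tables, want) {
		return nil, fmt.Errorf("cached client monitors %v, caller requested %v", cached.tables, want)
	}
	return cached, nil
}

func main() {
	if _, err := getClient("Address_Set"); err != nil {
		fmt.Println("unexpected error:", err)
	}
	if _, err := getClient("Logical_Switch"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Sorting a copy of the arguments makes the comparison independent of argument order and leaves the caller's slice untouched.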

coveralls · 2026-03-01T11:46:56Z

Pull Request Test Coverage Report for Build 22542692796

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 23.089%

Totals
Change from base Build 22538259073:	0.0%
Covered Lines:	12561
Relevant Lines:	54402

💛 - Coveralls

dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files.) label Mar 1, 2026

dosubot bot added the bug (Something isn't working) and test (automation tests) labels Mar 1, 2026

gemini-code-assist bot reviewed Mar 1, 2026

oilbeater merged commit ede1d71 into master Mar 1, 2026
76 of 77 checks passed

oilbeater deleted the fix-e2e-address-set-ovsdb-client-reuse branch March 1, 2026 12:42
