fix(e2e): reuse OVSDB client in address set tests to avoid HA deadlock#6368
fix(e2e): reuse OVSDB client in address set tests to avoid HA deadlock#6368
Conversation
Cache the OVN NB client across WaitForAddressSet* calls instead of creating a new connection each time. In HA environments, repeated connect/monitor cycles can trigger a deadlock in libovsdb when a leader change occurs during the initial Monitor RPC setup: the Monitor goroutine holds rpcMutex.RLock while the leader-change handler blocks on rpcMutex.Lock, and Go's write-preferring RWMutex then also blocks the inactivity probe's RLock, leaving no recovery path. Also tolerate transient "not connected" errors during the poll loop instead of failing immediately, giving libovsdb's automatic reconnection time to recover after a leader failover. Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves an intermittent deadlock issue observed in HA E2E tests, specifically within address set management, which was caused by OVSDB client connection instability during leader failovers. By introducing client caching and enhancing the resilience of the polling mechanism to transient connection errors, the changes significantly improve the reliability and stability of these tests in high-availability environments. Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request addresses an intermittent deadlock in HA E2E tests by reusing a single OVSDB client across address set test functions, which is a good approach. The change to tolerate transient connection errors in the polling loop also improves resilience.
However, I've identified a potential issue in the client caching implementation in getOVNNbClient. The current approach will only monitor the tables provided in the first call, which could lead to subtle bugs if this framework function is reused with different tables in other tests. I've added a specific comment with details and suggestions.
| ovnnbClientOnce.Do(func() { | ||
| ovnnbCachedNbCli, ovnnbCachedModel, ovnnbClientErr = ovs.NewDynamicOvnNbClient( | ||
| ovnnbAddr, | ||
| ovnNbTimeoutSeconds, | ||
| ovsdbConnTimeout, | ||
| ovsdbInactivityTimeout, | ||
| tables..., | ||
| ) | ||
| }) |
There was a problem hiding this comment.
The current implementation of client caching has a potential flaw. The tables... argument is captured by the closure inside ovnnbClientOnce.Do, which means the OVSDB client will only be initialized to monitor the tables from the first call to getOVNNbClient.
Subsequent calls with a different set of tables will receive the cached client, which is not monitoring the newly requested tables. This will likely lead to unexpected behavior or test failures.
While this might work for the current usage within this file (where it's always called with ovnnb.AddressSetTable), it makes the function fragile and unsafe for general use in a framework.
Consider one of the following fixes:
- Rename the function to be more specific (e.g.,
getAddrSetOVNNbClient) and hardcode the table name, removing thetablesargument. This makes its limited scope clear. - Add a check to ensure all callers use the same set of tables, and fail the test otherwise.
- Implement a more robust caching mechanism that can handle multiple sets of tables (though this might be overly complex for this context).
Pull Request Test Coverage Report for Build 22542692796Details
💛 - Coveralls |
Summary
WaitForAddressSet*calls usingsync.Onceinstead of creating a new connection (2 OVSDB connections: schema + monitor) per callRoot Cause
In HA E2E tests, the address set test "should manage address set when EnableAddressSet is true" intermittently fails with either:
MonitorRPC hangs indefinitely during OVSDB client creationMonitorcompletesBoth failures share the same root cause — a deadlock in libovsdb triggered by RAFT leader change during the initial Monitor RPC setup:
c.Monitor()holdsrpcMutex.RLock(), waiting formonitor_condRPC responsemonitor_condwatchForLeaderChangegoroutine detects the change via_Servermonitor update, triesrpcMutex.Lock()→ blocked by Monitor'sRLocksync.RWMutexis write-preferring: onceLock()is pending, newRLock()calls also blockhandleInactivityProbestriesRLock()for Echo probe → also blockedThe test previously created 6 OVSDB connections (2 per
WaitForAddressSet*call × 3 calls), each going through the full Connect + Monitor dance, significantly increasing the probability of hitting this race window.Fix
Listerrors instead of immediately failing, giving libovsdb's built-in reconnection logic time to recoverTest plan
🤖 Generated with Claude Code