Fixed the issue that watch connections cannot reflect resources from recovered clusters immediately #7074
base: master
Conversation
Summary of Changes

Hello @XiShanYongYe-Chang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a bug where watch connections in a multi-cluster environment would not immediately reflect resources from newly recovered or added clusters. The solution enhances the watch mechanism to actively invalidate existing watch connections when the cluster topology changes. This forces clients to re-establish their watches, ensuring they receive up-to-date resource information from all available clusters and improving the system's responsiveness to cluster recovery events.
Code Review
This pull request fixes an issue where watch connections did not reflect resources from recovered clusters immediately. This is achieved by tracking active watch connections and invalidating them when a cluster is added or recovered, forcing clients to reconnect and establish new watches that include the newly available cluster. The changes introduce an invalidatableWatchMux to handle watch invalidation and modify MultiClusterCache to manage these watchers. The overall approach is sound and correctly addresses the bug. I have a couple of suggestions for improvement regarding concurrency and style guide adherence.
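To make the mechanism concrete, here is a minimal sketch of what such an invalidatable watch multiplexer could look like. This is an illustration only, not the PR's actual code; the constructor and field names are assumptions:

```go
package cache

import (
	"sync"

	"k8s.io/apimachinery/pkg/watch"
)

// invalidatableWatchMux wraps a watch.Interface so the multi-cluster cache
// can terminate it proactively when cluster topology changes, forcing the
// client to reconnect and pick up newly recovered clusters.
type invalidatableWatchMux struct {
	watch.Interface
	stopOnce sync.Once
}

func newInvalidatableWatchMux(w watch.Interface) *invalidatableWatchMux {
	return &invalidatableWatchMux{Interface: w}
}

// Invalidate stops the underlying watch exactly once; the client observes
// the result channel closing and re-establishes the watch.
func (m *invalidatableWatchMux) Invalidate() {
	m.stopOnce.Do(m.Interface.Stop)
}
```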
Force-pushed from e55b90b to 8c49192
Hi @NickYadance, can you help review this?
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #7074      +/-   ##
==========================================
+ Coverage   46.55%   46.59%    +0.03%
==========================================
  Files         700      700
  Lines       48091    48149       +58
==========================================
+ Hits        22389    22433       +44
- Misses      24020    24030       +10
- Partials     1682     1686        +4
```
RainbowMango left a comment:
/assign
Looks good to me, thx @XiShanYongYe-Chang
Thanks @NickYadance
```go
// activeWatchers tracks all active watch connections for each GVR
// key: GVR string representation, value: list of active watch multiplexers
activeWatchersLock sync.RWMutex
activeWatchers     map[string][]*invalidatableWatchMux
```
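For illustration, registration and cleanup around this map might look like the following sketch; registerWatcher and removeWatcher are hypothetical helper names, not necessarily those used in the PR:

```go
func (c *MultiClusterCache) registerWatcher(gvr string, w *invalidatableWatchMux) {
	c.activeWatchersLock.Lock()
	defer c.activeWatchersLock.Unlock()
	c.activeWatchers[gvr] = append(c.activeWatchers[gvr], w)
}

func (c *MultiClusterCache) removeWatcher(gvr string, w *invalidatableWatchMux) {
	c.activeWatchersLock.Lock()
	defer c.activeWatchersLock.Unlock()
	watchers := c.activeWatchers[gvr]
	for i := range watchers {
		if watchers[i] == w {
			// Drop the finished watcher so the slice doesn't grow unbounded.
			c.activeWatchers[gvr] = append(watchers[:i], watchers[i+1:]...)
			break
		}
	}
}
```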
The invalidatableWatchMux naming is confusing; the structure name itself doesn't indicate that it holds an invalidatable watch mux.
In addition, why not take schema.GroupVersionResource as the map key? That would make the code more readable.
Updated with watchMuxWithInvalidation, wdyt?

> In addition, why not take schema.GroupVersionResource as the map key? That would make the code more readable.
Good suggestion.
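Under that suggestion, the field would use the struct key directly. A sketch, assuming the renamed watchMuxWithInvalidation type:

```go
import "k8s.io/apimachinery/pkg/runtime/schema"

// schema.GroupVersionResource is a comparable struct of three strings,
// so it works as a map key without building a string representation.
activeWatchers map[schema.GroupVersionResource][]*watchMuxWithInvalidation

// Lookup then uses the GVR the caller already has:
// watchers := c.activeWatchers[schema.GroupVersionResource{
//     Group: "apps", Version: "v1", Resource: "deployments",
// }]
```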
```go
// add/update cluster cache
clustersAdded := false
addedClusters := []string{}
```
We don't need to introduce a variable just for logging, and we already log the cluster name once a new cache is added.
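A sketch of the suggested simplification; the surrounding names (c.cache, newClusterCache, cluster.Name) are assumptions for illustration, not the PR's actual code:

```go
if _, exist := c.cache[cluster.Name]; !exist {
	// Log at the point the cache is created, so no separate slice of
	// names is needed purely for a summary log line.
	c.cache[cluster.Name] = newClusterCache(cluster) // hypothetical constructor
	klog.Infof("Added cache for cluster %s", cluster.Name)
	clustersAdded = true
}
```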
```go
// Any cluster being added to cache (whether new or recovered) should trigger invalidation
// This is critical for cluster recovery scenarios where existing watch connections
// don't include the recovered cluster's resources
```
As far as I was told, the existing watch connection will eventually (~5 min) receive the recovered cluster's resources, so this comment might not be entirely accurate.
```go
// Cluster removal is already handled by cacher.Stop() -> terminateAllWatchers()
if clustersAdded {
	klog.Infof("Cluster topology changed (clusters added: %v), invalidating all active watches to trigger reconnection", addedClusters)
	c.invalidateAllWatches()
}
```
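A minimal sketch of what invalidateAllWatches could do, consistent with the snippet above (illustrative, not the PR's exact code):

```go
func (c *MultiClusterCache) invalidateAllWatches() {
	c.activeWatchersLock.Lock()
	defer c.activeWatchersLock.Unlock()
	for gvr, watchers := range c.activeWatchers {
		for _, w := range watchers {
			w.Invalidate()
		}
		// Invalidated watchers are gone; clients register fresh ones
		// when they reconnect.
		delete(c.activeWatchers, gvr)
	}
}
```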
What bothers me is that I don't know the impact of interrupting a client's connection. How significant is it for the client?
For instance, if a client has already received some data from a healthy cluster, will it receive only the subsequent updates after re-establishing the watch, or will it get the full dataset again?
This depends on the ResourceVersion parameter used when making the watch request. For the Reflector implementation in client-go, the default behavior is to receive only incremental changes. https://deepwiki.com/search/hpascaletargetrefworkloadkinda_e78b9389-089b-4a74-83f7-7370dc9976cf
In addition, we haven't actually changed the behavior of watch reconnections; it is still handled the same way as before. The change simply terminates the watch request earlier than the default timeout, rather than waiting for the server to do so.
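To illustrate the incremental-reconnect behavior described above, here is a small client-go sketch (not part of this PR): a client that resumes from its last observed resourceVersion receives only subsequent events rather than the full dataset.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rewatch resumes a watch from lastRV; only events newer than lastRV are
// delivered, mirroring how client-go's Reflector reconnects after a watch
// is terminated.
func rewatch(ctx context.Context, cs kubernetes.Interface, lastRV string) error {
	w, err := cs.CoreV1().Pods("default").Watch(ctx, metav1.ListOptions{
		ResourceVersion: lastRV,
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		fmt.Printf("event: %s\n", ev.Type)
	}
	return nil
}
```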
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: …
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Signed-off-by: changzhen <[email protected]>
Force-pushed from 8c49192 to 54b97ff
What type of PR is this?
/kind bug
What this PR does / why we need it:
ref #6963
Which issue(s) this PR fixes:
Fixes #6963
Special notes for your reviewer:
Does this PR introduce a user-facing change?: