
Conversation

@XiShanYongYe-Chang (Member) commented Jan 4, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

ref #6963

Which issue(s) this PR fixes:

Fixes #6963

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

`karmada-search`: Fixed the issue that watch connections could not immediately reflect resources from recovered clusters.

@karmada-bot karmada-bot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 4, 2026
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 4, 2026
@gemini-code-assist

Summary of Changes

Hello @XiShanYongYe-Chang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug where watch connections in a multi-cluster environment would not immediately reflect resources from newly recovered or added clusters. The solution involves enhancing the watch mechanism to actively invalidate existing watch connections when the cluster topology changes. This forces clients to re-establish their watches, thereby ensuring they receive up-to-date resource information from all available clusters, improving the system's responsiveness to cluster recovery events.

Highlights

  • Watch Invalidation Mechanism: Introduced a new invalidatableWatchMux type and a mechanism within MultiClusterCache to track and invalidate active watch connections, ensuring clients receive up-to-date resource information.
  • Cluster Recovery Handling: The UpdateCache method now detects when new or recovered clusters are added and triggers the invalidation of all active watches, forcing clients to reconnect and include resources from these clusters.
  • Watch Registration and Cleanup: Added registerWatch and unregisterWatch methods to MultiClusterCache to manage the lifecycle of active watch connections, ensuring proper tracking and cleanup when watches are stopped.
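The registration and invalidation flow described in these highlights can be sketched roughly as below. All names here are illustrative stand-ins, not the actual karmada code: the real multiplexer wraps `watch.Interface` from k8s.io/apimachinery, and the real cache lives in `pkg/search/proxy/store`.

```go
package main

import (
	"fmt"
	"sync"
)

// watchMux is an illustrative stand-in for karmada's watch multiplexer.
// Closing done models terminating the client's watch connection so the
// client is forced to reconnect.
type watchMux struct {
	stopOnce sync.Once
	done     chan struct{}
}

func (m *watchMux) invalidate() {
	m.stopOnce.Do(func() { close(m.done) })
}

// multiClusterCache sketches the tracking described above: every active
// watch is registered under its GVR, and adding a (new or recovered)
// cluster invalidates all of them so clients re-establish their watches.
type multiClusterCache struct {
	mu       sync.Mutex
	watchers map[string][]*watchMux // keyed by a GVR string in this sketch
}

func (c *multiClusterCache) registerWatch(gvr string, m *watchMux) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.watchers[gvr] = append(c.watchers[gvr], m)
}

func (c *multiClusterCache) invalidateAllWatches() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for gvr, muxes := range c.watchers {
		for _, m := range muxes {
			m.invalidate()
		}
		delete(c.watchers, gvr)
	}
}

func main() {
	c := &multiClusterCache{watchers: map[string][]*watchMux{}}
	w := &watchMux{done: make(chan struct{})}
	c.registerWatch("apps/v1/deployments", w)

	c.invalidateAllWatches() // e.g. a recovered cluster was added to the cache
	<-w.done                 // the watch is closed; the client will reconnect
	fmt.Println("watch invalidated")
}
```

The `sync.Once` guard mirrors the usual pattern for watch channels: a stop may race with an invalidation, and the channel must only be closed once.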


@gemini-code-assist (bot) left a comment

Code Review

This pull request fixes an issue where watch connections did not reflect resources from recovered clusters immediately. This is achieved by tracking active watch connections and invalidating them when a cluster is added or recovered, forcing clients to reconnect and establish new watches that include the newly available cluster. The changes introduce an invalidatableWatchMux to handle watch invalidation and modify MultiClusterCache to manage these watchers. The overall approach is sound and correctly addresses the bug. I have a couple of suggestions for improvement regarding concurrency and style guide adherence.

@XiShanYongYe-Chang (Member Author)

Hi @NickYadance, can you help review this?

@codecov-commenter commented Jan 4, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 76.27119% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.59%. Comparing base (2a29397) to head (54b97ff).

Files with missing lines                       Patch %   Lines
pkg/search/proxy/store/util.go                 63.63%    8 Missing ⚠️
pkg/search/proxy/store/multi_cluster_cache.go  83.78%    3 Missing and 3 partials ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7074      +/-   ##
==========================================
+ Coverage   46.55%   46.59%   +0.03%     
==========================================
  Files         700      700              
  Lines       48091    48149      +58     
==========================================
+ Hits        22389    22433      +44     
- Misses      24020    24030      +10     
- Partials     1682     1686       +4     
Flag       Coverage Δ
unittests  46.59% <76.27%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@RainbowMango (Member) left a comment

/assign

@NickYadance

> Hi @NickYadance, can you help review this?

Looks good to me, thx @XiShanYongYe-Chang

@XiShanYongYe-Chang (Member Author)

Thanks @NickYadance

@RainbowMango RainbowMango added this to the v1.17 milestone Jan 6, 2026
Comment on lines 65 to 68
// activeWatchers tracks all active watch connections for each GVR
// key: GVR string representation, value: list of active watch multiplexers
activeWatchersLock sync.RWMutex
activeWatchers map[string][]*invalidatableWatchMux
Member

The invalidatableWatchMux naming is confusing; the structure's name itself doesn't indicate that it is used for holding an invalidatable watch mux.

Member

In addition, why not take schema.GroupVersionResource as the map key? That would make the code more readable.

Member Author

Updated it to watchMuxWithInvalidation, wdyt?

> In addition, why not take schema.GroupVersionResource as the map key? That would make the code more readable.

Good suggestion.
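For illustration, the GVR-keyed map could look like the sketch below. `groupVersionResource` here is a local stand-in mirroring `schema.GroupVersionResource` from k8s.io/apimachinery, which is a plain comparable struct and therefore a valid map key; `watchMuxWithInvalidation` is only a placeholder for the renamed type.

```go
package main

import "fmt"

// groupVersionResource mirrors schema.GroupVersionResource: a comparable
// struct, so it can key a map directly instead of a formatted string.
type groupVersionResource struct {
	Group, Version, Resource string
}

// watchMuxWithInvalidation is a placeholder for the renamed type.
type watchMuxWithInvalidation struct{}

func main() {
	activeWatchers := map[groupVersionResource][]*watchMuxWithInvalidation{}

	gvr := groupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}
	activeWatchers[gvr] = append(activeWatchers[gvr], &watchMuxWithInvalidation{})

	// Lookups need no string formatting and cannot collide on separators.
	fmt.Println(len(activeWatchers[gvr])) // prints 1
}
```

Besides readability, a struct key avoids having to pick (and parse) a separator convention for the string form.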


// add/update cluster cache
clustersAdded := false
addedClusters := []string{}
Member

Suggested change
addedClusters := []string{}

We don't have to introduce a variable just for logging, and we already log the cluster name once a new cache is added.

Comment on lines 133 to 135
// Any cluster being added to cache (whether new or recovered) should trigger invalidation
// This is critical for cluster recovery scenarios where existing watch connections
// don't include the recovered cluster's resources
Member

As far as I was told, the existing watch connection will eventually (~5 min) receive the recovered cluster's resources, so this comment might not be entirely accurate.

// Cluster removal is already handled by cacher.Stop() -> terminateAllWatchers()
if clustersAdded {
klog.Infof("Cluster topology changed (clusters added: %v), invalidating all active watches to trigger reconnection", addedClusters)
c.invalidateAllWatches()
Member

What bothers me is that I don't know the impact of interrupting a client's connection. How significant is it for the client?
For instance, if a client has already received some data from a healthy cluster, will it receive only the subsequent updates after re-establishing the watch, or will it get the full dataset again?

Member Author

This should depend on the ResourceVersion parameter used when making the watch request. For the Reflector implementation in client-go, the default behavior is to only receive incremental changes. https://deepwiki.com/search/hpascaletargetrefworkloadkinda_e78b9389-089b-4a74-83f7-7370dc9976cf

In addition, we haven't actually changed the behavior of watch reconnections; it is still handled the same way as before. This change simply terminates the watch request earlier than the default timeout, rather than waiting for the server to do so.
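The incremental-resumption behavior described here can be illustrated with a small simulation. This is not client-go code: `event` stands in for `watch.Event`, and resource versions are simplified to integers (real resourceVersion values are opaque strings that clients must not interpret numerically); the point is only that a watch resumed from the last observed resourceVersion delivers just the subsequent changes.

```go
package main

import "fmt"

// event is a minimal stand-in for a watch event; rv is a simplified,
// integer-ordered resource version used only for this illustration.
type event struct {
	rv   int
	name string
}

// watchFrom simulates a server answering a watch request: only events
// newer than the client's last observed resource version are delivered.
// This is why a Reflector that reconnects after its watch is terminated
// early sees just the incremental changes, not the full dataset again.
func watchFrom(all []event, lastRV int) []event {
	var out []event
	for _, e := range all {
		if e.rv > lastRV {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	history := []event{{1, "pod-a"}, {2, "pod-b"}, {3, "pod-c"}}

	// The client had watched up to rv=2 when the watch was invalidated.
	// On reconnect it resumes from resourceVersion=2 and gets only rv>2.
	for _, e := range watchFrom(history, 2) {
		fmt.Println(e.name) // prints pod-c
	}
}
```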

@karmada-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rainbowmango. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Development

Successfully merging this pull request may close these issues.

[Search] Search component cannot immediately reflect resources from recovered clusters in existing watch connections