[Search] Search component cannot immediately reflect resources from recovered clusters in existing watch connections #6963

@ryanwuer

Description

Background

We encountered this issue during chaos-engineering tests of our deployment platform, which uses Karmada's search component to aggregate Pod views across multiple Kubernetes clusters.

Environment

  • Karmada version: v1.14.5
  • Kubernetes version: v1.19.3
  • Two Kubernetes clusters: K8s1 (healthy) and K8s2 (subject to fault injection)
  • An internal deployment platform uses an informer to watch aggregated Pod resources through the search component

Steps to Reproduce

  1. Initial State: Both K8s1 and K8s2 are healthy and registered with Karmada. The deployment platform has an active watch connection to the search component showing aggregated Pods from both clusters.

  2. Fault Injection: Inject a network fault into K8s2 (drop all network packets), causing K8s2 to become NotReady.

  3. Search Component Behavior: The search component stops the informer for the K8s2 cluster (as designed in controller.go:262-286; abridged excerpt below):

    func (c *Controller) clusterAbleToCache(cluster string) (cls *clusterv1alpha1.Cluster, able bool, err error) {
        // ... cls is fetched from the cluster lister earlier in the function (omitted for brevity) ...
        if !util.IsClusterReady(&cls.Status) {
            klog.Warningf("cluster %s is notReady try to stop this cluster informer", cluster)
            c.InformerManager.Stop(cluster)
            return // bare return with named results: reports (nil, false, nil), i.e. "not able to cache"
        }
        // ... remaining checks and the (cls, true, nil) success path omitted ...
    }
  4. Fault Recovery: Remove the network fault from K8s2; K8s2 becomes Ready again.

  5. Search Component Recovery: The search component restarts the informer for K8s2 and begins caching its resources again.

  6. Resource Changes: Create new Pods or modify existing Pods in K8s2.

  7. Issue Observed: The deployment platform's existing watch connection receives no events for the new resources from K8s2; its Pod list remains in the pre-fault state.

  8. Delayed Resolution: After 5-10 minutes, the deployment platform finally sees the updated Pod list from K8s2.
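
For context, here is a minimal sketch of the watcher side used to observe this behavior. It is illustrative rather than our platform's actual code, and it assumes the kubeconfig points the client at an endpoint where Pod list/watch requests are served through karmada-search:

    // Minimal sketch of the deployment platform's watcher (illustrative only).
    // Assumption: the kubeconfig targets an endpoint where Pod list/watch is
    // served by karmada-search, so a plain core/v1 informer exercises the
    // aggregated view described above.
    package main

    import (
        "fmt"
        "time"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/karmada-search.kubeconfig")
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // A shared informer holds a single long-lived watch; events from K8s2
        // only arrive after that watch times out (5-10 min) and is re-opened.
        factory := informers.NewSharedInformerFactory(client, 0)
        podInformer := factory.Core().V1().Pods().Informer()
        podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) {
                pod := obj.(*corev1.Pod)
                fmt.Printf("ADDED %s/%s at %s\n", pod.Namespace, pod.Name, time.Now().Format(time.RFC3339))
            },
        })

        stopCh := make(chan struct{})
        factory.Start(stopCh)
        factory.WaitForCacheSync(stopCh)
        select {} // block forever; watch for Pods created in K8s2 after recovery
    }

With this running, Pods created in K8s2 after recovery only print once the informer's underlying watch is re-established.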

Why the 5-10 Minute Delay?

The delay is due to client-go's watch timeout mechanism:

  • MinWatchTimeout defaults to 5 minutes in client-go
  • The actual timeout is randomized between minWatchTimeout and 2 * minWatchTimeout, i.e. 5-10 minutes
  • Only after the watch times out and the reflector reconnects does the new Watch() call include resources from K8s2

Reference: client-go watch timeout
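
The relevant selection logic in client-go's reflector is paraphrased below (not a verbatim copy) to show why reconnects land somewhere in the 5-10 minute window:

    // Paraphrase of the per-watch timeout choice made by client-go's reflector.
    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    const minWatchTimeout = 5 * time.Minute // client-go default

    func main() {
        // Each (re)established watch picks a timeout uniformly in
        // [minWatchTimeout, 2*minWatchTimeout), i.e. 5-10 minutes.
        timeoutSeconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))
        fmt.Printf("this watch will be closed by the client after ~%d seconds\n", timeoutSeconds)
    }

Because each watch independently draws its own timeout, two clients (or two replicas of the same client) generally reconnect, and therefore see K8s2's resources, at different times.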

Expected Behavior

When a previously unavailable cluster (K8s2) recovers and its resources are cached:

  1. Existing watch connections should be notified about the cluster addition
  2. Resources from the recovered cluster should be sent as ADDED events
  3. The deployment platform should see K8s2 resources immediately, not after 5-10 minutes
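
One possible direction, shown purely as a sketch (onClusterCacheSynced and cacheWatcher are hypothetical names, not existing Karmada code): once the informer for a recovered cluster finishes its initial sync, the search component could replay the freshly cached objects as ADDED events onto watch channels that are already open, instead of waiting for clients to time out and re-list.

    // Hypothetical sketch only: these types and functions do not exist in
    // Karmada today; they illustrate replaying a recovered cluster's objects
    // as ADDED events to watchers that are already connected.
    package sketch

    import (
        "k8s.io/apimachinery/pkg/runtime"
        "k8s.io/apimachinery/pkg/watch"
    )

    // cacheWatcher stands for one long-lived watch connection held by a
    // client such as the deployment platform (hypothetical type).
    type cacheWatcher struct {
        result chan watch.Event
    }

    // onClusterCacheSynced would be invoked once the informer for a recovered
    // cluster (e.g. K8s2) has completed its initial list/sync.
    func onClusterCacheSynced(objs []runtime.Object, watchers []*cacheWatcher) {
        for _, obj := range objs {
            for _, w := range watchers {
                // Deliver the recovered cluster's resources immediately rather
                // than waiting for the client's 5-10 minute watch timeout.
                w.result <- watch.Event{Type: watch.Added, Object: obj}
            }
        }
    }

A real implementation would also have to respect each watcher's resource version, label/field selectors, and namespace scope before emitting events; the sketch ignores those details.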

Impact

  • Service Visibility: During the 5-10 minute window, the deployment platform has incomplete resource views
  • Operational Risk: Operators may make incorrect decisions based on outdated information
  • Poor User Experience: After cluster recovery, users expect immediate visibility into all resources
  • Inconsistent Views Across Replicas: When the deployment platform runs multiple replicas, each replica's watch connection times out at a different moment because of the randomized 5-10 minute timeout, so the replicas pick up the updated Pod list at different times and return inconsistent responses. Users whose requests are load-balanced across replicas may see the Pod list flip back and forth between the pre-recovery and post-recovery state, which severely degrades the user experience and makes it hard to trust the platform's data.

Metadata

Labels

kind/feature
