[Search] Search component cannot immediately reflect resources from recovered clusters in existing watch connections #6963

@ryanwuer

Description

Background

We encountered this issue during chaos-engineering tests of our deployment platform, which uses Karmada's search component to aggregate Pod views across multiple Kubernetes clusters.

Environment

  • Karmada version: v1.14.5
  • Kubernetes version: v1.19.3
  • Two Kubernetes clusters: K8s1 (healthy) and K8s2 (subject to fault injection)
  • An internal deployment platform uses an informer to watch aggregated Pod resources through the search component

Steps to Reproduce

  1. Initial State: Both K8s1 and K8s2 are healthy and registered with Karmada. The deployment platform has an active watch connection to the search component showing aggregated Pods from both clusters.

  2. Fault Injection: Inject a network fault into K8s2 (drop all network packets), causing K8s2 to become NotReady.

  3. Search Component Behavior: The search component stops the informer for the K8s2 cluster (as designed in controller.go:262-286; abridged excerpt below):

    func (c *Controller) clusterAbleToCache(cluster string) (cls *clusterv1alpha1.Cluster, able bool, err error) {
        // ... cls is fetched from the cluster lister earlier in the function (omitted for brevity) ...
        if !util.IsClusterReady(&cls.Status) {
            klog.Warningf("cluster %s is notReady try to stop this cluster informer", cluster)
            c.InformerManager.Stop(cluster)
            return // bare return with named results: reports (nil, false, nil), i.e. "not able to cache"
        }
        // ... remaining checks and the (cls, true, nil) success path omitted ...
    }
  4. Fault Recovery: Remove the network fault from K8s2; K8s2 becomes Ready again.

  5. Search Component Recovery: The search component restarts the informer for K8s2 and begins caching its resources again.

  6. Resource Changes: Create new Pods or modify existing Pods in K8s2.

  7. Issue Observed: The deployment platform's existing watch connection receives no events for the new resources from K8s2; its Pod list remains in the pre-fault state.

  8. Delayed Resolution: After 5-10 minutes, the deployment platform finally sees the updated Pod list from K8s2.
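
For context, here is a minimal sketch of the watcher side used to observe this behavior. It is illustrative rather than our platform's actual code, and it assumes the kubeconfig points the client at an endpoint where Pod list/watch requests are served through karmada-search:

    // Minimal sketch of the deployment platform's watcher (illustrative only).
    // Assumption: the kubeconfig targets an endpoint where Pod list/watch is
    // served by karmada-search, so a plain core/v1 informer exercises the
    // aggregated view described above.
    package main

    import (
        "fmt"
        "time"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/karmada-search.kubeconfig")
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // A shared informer holds a single long-lived watch; events from K8s2
        // only arrive after that watch times out (5-10 min) and is re-opened.
        factory := informers.NewSharedInformerFactory(client, 0)
        podInformer := factory.Core().V1().Pods().Informer()
        podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) {
                pod := obj.(*corev1.Pod)
                fmt.Printf("ADDED %s/%s at %s\n", pod.Namespace, pod.Name, time.Now().Format(time.RFC3339))
            },
        })

        stopCh := make(chan struct{})
        factory.Start(stopCh)
        factory.WaitForCacheSync(stopCh)
        select {} // block forever; watch for Pods created in K8s2 after recovery
    }

With this running, Pods created in K8s2 after recovery only print once the informer's underlying watch is re-established.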

Why the 5-10 Minute Delay?

The delay is due to client-go's watch timeout mechanism:

  • MinWatchTimeout defaults to 5 minutes in client-go
  • The actual timeout is randomized between minWatchTimeout and 2 * minWatchTimeout, i.e. 5-10 minutes
  • Only after the watch times out and the reflector reconnects does the new Watch() call include resources from K8s2

Reference: client-go watch timeout
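
The relevant selection logic in client-go's reflector is paraphrased below (not a verbatim copy) to show why reconnects land somewhere in the 5-10 minute window:

    // Paraphrase of the per-watch timeout choice made by client-go's reflector.
    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    const minWatchTimeout = 5 * time.Minute // client-go default

    func main() {
        // Each (re)established watch picks a timeout uniformly in
        // [minWatchTimeout, 2*minWatchTimeout), i.e. 5-10 minutes.
        timeoutSeconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))
        fmt.Printf("this watch will be closed by the client after ~%d seconds\n", timeoutSeconds)
    }

Because each watch independently draws its own timeout, two clients (or two replicas of the same client) generally reconnect, and therefore see K8s2's resources, at different times.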

Expected Behavior

When a previously unavailable cluster (K8s2) recovers and its resources are cached:

  1. Existing watch connections should be notified about the cluster addition
  2. Resources from the recovered cluster should be sent as ADDED events
  3. The deployment platform should see K8s2 resources immediately, not after 5-10 minutes
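
One possible direction, shown purely as a sketch (onClusterCacheSynced and cacheWatcher are hypothetical names, not existing Karmada code): once the informer for a recovered cluster finishes its initial sync, the search component could replay the freshly cached objects as ADDED events onto watch channels that are already open, instead of waiting for clients to time out and re-list.

    // Hypothetical sketch only: these types and functions do not exist in
    // Karmada today; they illustrate replaying a recovered cluster's objects
    // as ADDED events to watchers that are already connected.
    package sketch

    import (
        "k8s.io/apimachinery/pkg/runtime"
        "k8s.io/apimachinery/pkg/watch"
    )

    // cacheWatcher stands for one long-lived watch connection held by a
    // client such as the deployment platform (hypothetical type).
    type cacheWatcher struct {
        result chan watch.Event
    }

    // onClusterCacheSynced would be invoked once the informer for a recovered
    // cluster (e.g. K8s2) has completed its initial list/sync.
    func onClusterCacheSynced(objs []runtime.Object, watchers []*cacheWatcher) {
        for _, obj := range objs {
            for _, w := range watchers {
                // Deliver the recovered cluster's resources immediately rather
                // than waiting for the client's 5-10 minute watch timeout.
                w.result <- watch.Event{Type: watch.Added, Object: obj}
            }
        }
    }

A real implementation would also have to respect each watcher's resource version, label/field selectors, and namespace scope before emitting events; the sketch ignores those details.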

Impact

  • Service Visibility: During the 5-10 minute window, the deployment platform has incomplete resource views
  • Operational Risk: Operators may make incorrect decisions based on outdated information
  • Poor User Experience: After cluster recovery, users expect immediate visibility into all resources
  • Inconsistent Views Across Replicas: When the deployment platform runs multiple replicas, each replica's watch connection times out at a different moment because of the randomized 5-10 minute timeout, so the replicas pick up the updated Pod list at different times and return inconsistent responses. Users whose requests are load-balanced across replicas may see the Pod list flip back and forth between the pre-recovery and post-recovery state, which severely degrades the user experience and makes it hard to trust the platform's data.

Metadata

Labels

kind/feature
