Description
Background
We encountered this issue during chaos engineering tests of our deployment platform, which uses Karmada's search component to aggregate Pod views from multiple Kubernetes clusters.
Environment
- Karmada version: v1.14.5, K8s version: v1.19.3
- Two Kubernetes clusters: K8s1 (healthy) and K8s2 (subject to fault injection)
- Our internal deployment platform uses a client-go informer to watch aggregated Pod resources through the search component (a rough sketch of this setup follows this list)
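For context, the watch path looks roughly like the sketch below: a standard client-go informer pointed at the karmada-search proxy instead of a single member cluster. The kubeconfig path and the proxy URL suffix used here are placeholders for illustration, not our exact setup.

```go
// Minimal sketch of how the deployment platform watches aggregated Pods.
// Assumptions: the kubeconfig path and the karmada-search proxy URL suffix
// (/apis/search.karmada.io/v1alpha1/proxying/karmada/proxy) are placeholders.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Kubeconfig pointing at the Karmada apiserver (placeholder path).
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/karmada/karmada-apiserver.config")
	if err != nil {
		panic(err)
	}
	// Route List/Watch requests through the karmada-search proxy so they
	// return the aggregated view across member clusters.
	config.Host = config.Host + "/apis/search.karmada.io/v1alpha1/proxying/karmada/proxy"

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Printf("ADDED %s/%s\n", pod.Namespace, pod.Name)
		},
	})

	ctx := context.Background()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())
	<-ctx.Done() // block; in the real platform this runs for the process lifetime
}
```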
Steps to Reproduce
1. Initial State: Both K8s1 and K8s2 are healthy and registered with Karmada. The deployment platform has an active watch connection to the search component, showing aggregated Pods from both clusters.
2. Fault Injection: Inject a network fault into K8s2 (drop all network packets), causing K8s2 to become NotReady.
3. Search Component Behavior: The search component stops the informer for the K8s2 cluster, as designed in controller.go:262-286 (excerpted):

```go
func (c *Controller) clusterAbleToCache(cluster string) (cls *clusterv1alpha1.Cluster, able bool, err error) {
	// ... cls is fetched from the cluster lister earlier in this function ...
	if !util.IsClusterReady(&cls.Status) {
		klog.Warningf("cluster %s is notReady try to stop this cluster informer", cluster)
		c.InformerManager.Stop(cluster)
		return
	}
	// ...
}
```

4. Fault Recovery: Remove the network fault from K8s2. K8s2 becomes Ready again.
5. Search Component Recovery: The search component restarts the informer for K8s2 and begins caching its resources again.
6. Resource Changes: Create new Pods or modify existing Pods in K8s2.
7. Issue Observed: The deployment platform's existing watch connection cannot see the new resources from K8s2; the Pod list remains in the pre-fault state.
8. Delayed Resolution: Only after 5-10 minutes does the deployment platform finally see the updated Pod list from K8s2.
Why the 5-10 Minute Delay?
The delay comes from client-go's watch timeout mechanism:
- minWatchTimeout defaults to 5 minutes in client-go
- The actual timeout is randomized between minWatchTimeout and 2 * minWatchTimeout (5-10 minutes)
- Only after the watch timeout triggers a reconnection does the new Watch() call include K8s2
Reference: client-go watch timeout
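To make the range concrete, here is a small standalone snippet that mirrors the reflector's timeout computation (paraphrased from k8s.io/client-go/tools/cache/reflector.go; the exact code differs slightly between client-go versions):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// minWatchTimeout mirrors the constant used by client-go's reflector.
const minWatchTimeout = 5 * time.Minute

func main() {
	// Each Watch request is sent with a server-side timeout randomized in
	// [minWatchTimeout, 2*minWatchTimeout), i.e. 5-10 minutes. The aggregated
	// watch is only re-established (and K8s2 picked up again) after it expires.
	for i := 0; i < 3; i++ {
		timeoutSeconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))
		fmt.Printf("watch #%d would be cut after %ds (~%.1f min)\n",
			i+1, timeoutSeconds, float64(timeoutSeconds)/60)
	}
}
```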
Expected Behavior
When a previously unavailable cluster (K8s2) recovers and its resources are cached again:
- Existing watch connections should be notified about the cluster addition
- Resources from the recovered cluster should be delivered as ADDED events
- The deployment platform should see K8s2 resources immediately, not after 5-10 minutes (a rough sketch of this idea follows below)
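To make the expectation concrete, here is a rough sketch of the idea; the type and function names are hypothetical and do not reflect the actual karmada-search internals. Once the recovered cluster's informer has synced, its cached objects could be replayed as ADDED events to every watch connection that is still open, instead of waiting for those connections to hit the client-go timeout.

```go
// Illustrative sketch only; activeWatcher and onClusterResynced are
// hypothetical names, not real karmada-search types.
package sketch

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
)

// activeWatcher represents one client watch connection held open by the search proxy.
type activeWatcher struct {
	result chan watch.Event
}

// onClusterResynced would run after the recovered cluster's informer has synced:
// replay the cluster's cached objects to every open watcher as ADDED events so
// existing clients see the recovered resources immediately.
// (A real implementation would also respect resourceVersion ordering and avoid
// blocking on slow watchers.)
func onClusterResynced(store cache.Store, watchers []*activeWatcher) {
	for _, obj := range store.List() {
		runtimeObj, ok := obj.(runtime.Object)
		if !ok {
			continue
		}
		for _, w := range watchers {
			w.result <- watch.Event{Type: watch.Added, Object: runtimeObj}
		}
	}
}
```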
Impact
- Service Visibility: During the 5-10 minute window, the deployment platform has incomplete resource views
- Operational Risk: Operators may make incorrect decisions based on outdated information
- Poor User Experience: After cluster recovery, users expect immediate visibility into all resources
- Inconsistent Views Across Replicas: When the deployment platform runs multiple replicas, each replica's watch connection times out at a different random time (because of the randomized 5-10 minute timeout), so the replicas see the updated Pod list at different moments and return inconsistent responses. As requests are load-balanced across replicas, users may observe the Pod list flipping back and forth between the pre-recovery and post-recovery state. This inconsistency severely impacts user experience and makes it difficult to trust the platform's data.