Description
Environment
- Kubernetes Version: v1.26.5
- Calico Version: v3.25.1
- Installation Method: Kubespray
- CNI Plugin: Calico
- Datastore: Kubernetes API
Problem Description
We encountered a cascading failure scenario where a calico-node pod malfunction caused the calico-kube-controllers pod to become stuck in CrashLoopBackOff on the same node, effectively breaking the Calico control plane.
Detailed Scenario
1. Initial Failure: The `calico-node` pod on a specific node experiences network state corruption (veth interfaces, iptables, or routing issues).
2. Symptom: All container-based networking on that node fails:
   - Pods cannot reach the Kubernetes API server (`https://192.168.0.1:443`)
   - Host machine networking remains functional (it bypasses the container network)
   - Error logs from `calico-node`:

     ```
     [ERROR] startup/startup.go 173: failed to query Rancher's cluster state config map error=Get "https://192.168.0.1:443/api/v1/namespaces/kube-system/configmaps/full-cluster-state?timeout=2s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
     Calico node failed to start
     ```
3. Cascading Impact:
   - `calico-kube-controllers` (single replica) happens to be scheduled on this node
   - It enters `CrashLoopBackOff` because it cannot reach the API server
   - Since the node itself appears "Ready" to Kubernetes, the pod is not rescheduled
   - The Calico control plane becomes degraded/unavailable
4. Current Workaround: Manually restarting the `calico-node` pod restores networking and allows `calico-kube-controllers` to function again (see the sketch after this list).
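For completeness, the workaround can also be scripted. The following is a minimal sketch using client-go; it assumes a kubeconfig at the default location and the Kubespray/manifest defaults (namespace `kube-system`, label `k8s-app=calico-node`), which may differ in other installations. It deletes the `calico-node` pod on the affected node so the DaemonSet controller recreates it:

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	node := os.Args[1] // name of the affected node, e.g. "worker-3" (placeholder)

	// Load the local kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Find the calico-node pod scheduled on the affected node.
	pods, err := cs.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "k8s-app=calico-node",
		FieldSelector: "spec.nodeName=" + node,
	})
	if err != nil {
		panic(err)
	}

	// Deleting the pod lets the DaemonSet controller recreate it,
	// which reinitialises the node's Calico state.
	for _, p := range pods.Items {
		if err := cs.CoreV1().Pods("kube-system").Delete(context.TODO(), p.Name, metav1.DeleteOptions{}); err != nil {
			panic(err)
		}
		fmt.Println("deleted", p.Name)
	}
}
```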
Root Cause Analysis
- `calico-node` network objects (veth, routing, iptables) become abnormal, breaking all container networking on the node
- `calico-kube-controllers` runs as a single replica (to avoid race conditions)
- Kubernetes does not reschedule the pod because the node status is "Ready"
- No automatic remediation mechanism exists to detect and fix this scenario (an illustrative detection loop is sketched below)
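To make the last point concrete, here is a rough sketch of what such a remediation loop could look like. It is not an existing Calico component and is naive by design: under the same namespace/label assumptions as the sketch above, it polls for a crash-looping `calico-kube-controllers` pod and, when found, deletes the `calico-node` pod on the same node. A real mechanism would need rate limiting and a more direct health signal.

```go
package main

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const ns = "kube-system" // assumption: manifest/Kubespray install location

// inCrashLoop reports whether any container in the pod is waiting in CrashLoopBackOff.
func inCrashLoop(p corev1.Pod) bool {
	for _, cs := range p.Status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil && w.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}

func main() {
	// Assumes this runs in-cluster (on the host network, so it keeps API access
	// even when pod networking breaks) with RBAC to list/delete pods in kube-system.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	for {
		time.Sleep(30 * time.Second)

		kcs, err := cs.CoreV1().Pods(ns).List(context.TODO(),
			metav1.ListOptions{LabelSelector: "k8s-app=calico-kube-controllers"})
		if err != nil {
			log.Println("list calico-kube-controllers:", err)
			continue
		}
		for _, p := range kcs.Items {
			if !inCrashLoop(p) {
				continue
			}
			// calico-kube-controllers is crash-looping: bounce calico-node on the same node.
			nodePods, err := cs.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
				LabelSelector: "k8s-app=calico-node",
				FieldSelector: "spec.nodeName=" + p.Spec.NodeName,
			})
			if err != nil {
				log.Println("list calico-node:", err)
				continue
			}
			for _, np := range nodePods.Items {
				log.Printf("deleting %s on node %s", np.Name, p.Spec.NodeName)
				if err := cs.CoreV1().Pods(ns).Delete(context.TODO(), np.Name, metav1.DeleteOptions{}); err != nil {
					log.Println("delete:", err)
				}
			}
		}
	}
}
```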
Questions
1. High Availability for calico-kube-controllers
Is there (or will there be) support for running multiple replicas of calico-kube-controllers?
2. Automatic Remediation Strategies
What are the recommended approaches for automatically detecting and remediating calico-node failures?
3. Best Practices
What is the recommended production deployment strategy to handle this scenario?