calico-kube-controllers Single Point of Failure and Automatic Remediation #11509

@Aslan-Liu

Description

Environment

  • Kubernetes Version: v1.26.5
  • Calico Version: v3.25.1
  • Installation Method: Kubespray
  • CNI Plugin: Calico
  • Datastore: Kubernetes API

Problem Description

We encountered a cascading failure scenario where a calico-node pod malfunction caused the calico-kube-controllers pod to become stuck in CrashLoopBackOff on the same node, effectively breaking the Calico control plane.

Detailed Scenario

  1. Initial Failure: The calico-node pod on a specific node experiences network state corruption (veth interfaces, iptables, or routing issues)

  2. Symptom: All container-based networking on that node fails:

    • Pods cannot reach the Kubernetes API server (https://192.168.0.1:443)
    • Host machine networking remains functional (bypasses container network)
    • Error logs from calico-node:
    [ERROR] startup/startup.go 173: failed to query Rancher's cluster state config map
    error=Get "https://192.168.0.1:443/api/v1/namespaces/kube-system/configmaps/full-cluster-state?timeout=2s":
    net/http: request canceled (Client.Timeout exceeded while awaiting headers)
    Calico node failed to start
    
  3. Cascading Impact:

    • calico-kube-controllers (single replica) happens to be scheduled on this node
    • It enters CrashLoopBackOff because it cannot reach the API server
    • Since the node itself appears "Ready" to Kubernetes, the pod is not rescheduled
    • The Calico control plane becomes degraded/unavailable
  4. Current Workaround: Manually restarting the calico-node pod restores networking and allows calico-kube-controllers to function again.
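
For reference, the restart we perform today is nothing more than deleting the affected calico-node pod so its DaemonSet recreates it. Below is a minimal sketch of scripting that step with the Kubernetes Python client; the kube-system namespace and the k8s-app=calico-node label are assumptions based on our manifest-based install and may differ in other setups.

```python
from kubernetes import client, config


def restart_calico_node(node_name: str, namespace: str = "kube-system") -> None:
    """Delete the calico-node pod on the given node; its DaemonSet recreates it."""
    config.load_kube_config()  # run from a workstation; use load_incluster_config() in-cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(
        namespace,
        label_selector="k8s-app=calico-node",         # assumption: manifest-based install
        field_selector=f"spec.nodeName={node_name}",
    )
    for pod in pods.items:
        # Equivalent to `kubectl -n kube-system delete pod <calico-node-xxxxx>`
        v1.delete_namespaced_pod(pod.metadata.name, namespace)
        print(f"restarted {pod.metadata.name} on {node_name}")


if __name__ == "__main__":
    restart_calico_node("worker-1")  # hypothetical node name
```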

Root Cause Analysis

  • The calico-node pod's local network state (veth interfaces, routes, iptables rules) becomes corrupted, breaking all container networking on the node
  • calico-kube-controllers runs as a single replica (to avoid race conditions)
  • Kubernetes does not reschedule the pod because the node status is "Ready"
  • No automatic remediation mechanism exists to detect and fix this scenario
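
The "node is Ready but the pod cannot run" mismatch is easy to confirm from outside the cluster. A minimal sketch with the Kubernetes Python client, again assuming a manifest-based install (kube-system namespace, k8s-app=calico-kube-controllers label):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Compare the hosting node's Ready condition with the controller pod's container state.
pods = v1.list_namespaced_pod(
    "kube-system", label_selector="k8s-app=calico-kube-controllers"
)
for pod in pods.items:
    node = v1.read_node(pod.spec.node_name)
    node_ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"), "Unknown"
    )
    cs = (pod.status.container_statuses or [None])[0]
    reason = cs.state.waiting.reason if cs and cs.state and cs.state.waiting else None
    # In the failure described above this prints Ready=True alongside
    # waiting_reason=CrashLoopBackOff, which is why the scheduler never moves the pod.
    print(
        f"{pod.metadata.name}: node={pod.spec.node_name} "
        f"Ready={node_ready} waiting_reason={reason}"
    )
```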

Questions

1. High Availability for calico-kube-controllers

Is there (or will there be) support for running multiple replicas of calico-kube-controllers?

2. Automatic Remediation Strategies

What are the recommended approaches for automatically detecting and remediating calico-node failures?
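
For context, the only automation we have sketched so far is an out-of-cluster watchdog that automates the manual workaround above: when calico-kube-controllers is stuck in CrashLoopBackOff, restart the calico-node pod on the same node. This is only a sketch under the same namespace/label assumptions as before, not something we consider a proper fix.

```python
import time

from kubernetes import client, config

NAMESPACE = "kube-system"  # assumption: manifest-based install
CONTROLLERS_SELECTOR = "k8s-app=calico-kube-controllers"
NODE_SELECTOR = "k8s-app=calico-node"


def crashlooping_controller_nodes(v1: client.CoreV1Api) -> set[str]:
    """Return nodes hosting calico-kube-controllers pods stuck in CrashLoopBackOff."""
    nodes = set()
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=CONTROLLERS_SELECTOR)
    for pod in pods.items:
        for cs in pod.status.container_statuses or []:
            if cs.state.waiting and cs.state.waiting.reason == "CrashLoopBackOff":
                nodes.add(pod.spec.node_name)
    return nodes


def restart_calico_node_on(v1: client.CoreV1Api, node_name: str) -> None:
    """Delete the calico-node pod on the node; its DaemonSet recreates it."""
    pods = v1.list_namespaced_pod(
        NAMESPACE,
        label_selector=NODE_SELECTOR,
        field_selector=f"spec.nodeName={node_name}",
    )
    for pod in pods.items:
        print(f"restarting {pod.metadata.name} on {node_name}")
        v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)


def main() -> None:
    # Runs off-cluster so the watchdog itself is not affected by the broken CNI.
    config.load_kube_config()
    v1 = client.CoreV1Api()
    while True:
        for node in crashlooping_controller_nodes(v1):
            restart_calico_node_on(v1, node)
        time.sleep(60)


if __name__ == "__main__":
    main()
```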

3. Best Practices

What is the recommended production deployment strategy to handle this scenario?
