calico-kube-controllers Single Point of Failure and Automatic Remediation #11509

@Aslan-Liu

Description

Environment

  • Kubernetes Version: v1.26.5
  • Calico Version: v3.25.1
  • Installation Method: Kubespray
  • CNI Plugin: Calico
  • Datastore: Kubernetes API

Problem Description

We encountered a cascading failure scenario where a calico-node pod malfunction caused the calico-kube-controllers pod to become stuck in CrashLoopBackOff on the same node, effectively breaking the Calico control plane.

Detailed Scenario

  1. Initial Failure: The calico-node pod on a specific node experiences network state corruption (veth interfaces, iptables, or routing issues)

  2. Symptom: All container-based networking on that node fails:

    • Pods cannot reach the Kubernetes API server (https://192.168.0.1:443)
    • Host machine networking remains functional (bypasses container network)
    • Error logs from calico-node:
    [ERROR] startup/startup.go 173: failed to query Rancher's cluster state config map
    error=Get "https://192.168.0.1:443/api/v1/namespaces/kube-system/configmaps/full-cluster-state?timeout=2s":
    net/http: request canceled (Client.Timeout exceeded while awaiting headers)
    Calico node failed to start
    
  3. Cascading Impact:

    • calico-kube-controllers (single replica) happens to be scheduled on this node
    • It enters CrashLoopBackOff because it cannot reach the API server
    • Since the node itself appears "Ready" to Kubernetes, the pod is not rescheduled
    • The Calico control plane becomes degraded/unavailable
  4. Current Workaround: Manually restarting the calico-node pod restores networking and allows calico-kube-controllers to function again.
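
For reference, the restart we perform today is nothing more than deleting the affected calico-node pod so its DaemonSet recreates it. Below is a minimal sketch of scripting that step with the Kubernetes Python client; the kube-system namespace and the k8s-app=calico-node label are assumptions based on our manifest-based install and may differ in other setups.

```python
from kubernetes import client, config


def restart_calico_node(node_name: str, namespace: str = "kube-system") -> None:
    """Delete the calico-node pod on the given node; its DaemonSet recreates it."""
    config.load_kube_config()  # run from a workstation; use load_incluster_config() in-cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(
        namespace,
        label_selector="k8s-app=calico-node",         # assumption: manifest-based install
        field_selector=f"spec.nodeName={node_name}",
    )
    for pod in pods.items:
        # Equivalent to `kubectl -n kube-system delete pod <calico-node-xxxxx>`
        v1.delete_namespaced_pod(pod.metadata.name, namespace)
        print(f"restarted {pod.metadata.name} on {node_name}")


if __name__ == "__main__":
    restart_calico_node("worker-1")  # hypothetical node name
```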

Root Cause Analysis

  • The calico-node pod's local network state (veth interfaces, routes, iptables rules) becomes corrupted, breaking all container networking on the node
  • calico-kube-controllers runs as a single replica (to avoid race conditions)
  • Kubernetes does not reschedule the pod because the node status is "Ready"
  • No automatic remediation mechanism exists to detect and fix this scenario
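
The "node is Ready but the pod cannot run" mismatch is easy to confirm from outside the cluster. A minimal sketch with the Kubernetes Python client, again assuming a manifest-based install (kube-system namespace, k8s-app=calico-kube-controllers label):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Compare the hosting node's Ready condition with the controller pod's container state.
pods = v1.list_namespaced_pod(
    "kube-system", label_selector="k8s-app=calico-kube-controllers"
)
for pod in pods.items:
    node = v1.read_node(pod.spec.node_name)
    node_ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"), "Unknown"
    )
    cs = (pod.status.container_statuses or [None])[0]
    reason = cs.state.waiting.reason if cs and cs.state and cs.state.waiting else None
    # In the failure described above this prints Ready=True alongside
    # waiting_reason=CrashLoopBackOff, which is why the scheduler never moves the pod.
    print(
        f"{pod.metadata.name}: node={pod.spec.node_name} "
        f"Ready={node_ready} waiting_reason={reason}"
    )
```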

Questions

1. High Availability for calico-kube-controllers

Is there (or will there be) support for running multiple replicas of calico-kube-controllers?

2. Automatic Remediation Strategies

What are the recommended approaches for automatically detecting and remediating calico-node failures?
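
For context, the only automation we have sketched so far is an out-of-cluster watchdog that automates the manual workaround above: when calico-kube-controllers is stuck in CrashLoopBackOff, restart the calico-node pod on the same node. This is only a sketch under the same namespace/label assumptions as before, not something we consider a proper fix.

```python
import time

from kubernetes import client, config

NAMESPACE = "kube-system"  # assumption: manifest-based install
CONTROLLERS_SELECTOR = "k8s-app=calico-kube-controllers"
NODE_SELECTOR = "k8s-app=calico-node"


def crashlooping_controller_nodes(v1: client.CoreV1Api) -> set[str]:
    """Return nodes hosting calico-kube-controllers pods stuck in CrashLoopBackOff."""
    nodes = set()
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=CONTROLLERS_SELECTOR)
    for pod in pods.items:
        for cs in pod.status.container_statuses or []:
            if cs.state.waiting and cs.state.waiting.reason == "CrashLoopBackOff":
                nodes.add(pod.spec.node_name)
    return nodes


def restart_calico_node_on(v1: client.CoreV1Api, node_name: str) -> None:
    """Delete the calico-node pod on the node; its DaemonSet recreates it."""
    pods = v1.list_namespaced_pod(
        NAMESPACE,
        label_selector=NODE_SELECTOR,
        field_selector=f"spec.nodeName={node_name}",
    )
    for pod in pods.items:
        print(f"restarting {pod.metadata.name} on {node_name}")
        v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)


def main() -> None:
    # Runs off-cluster so the watchdog itself is not affected by the broken CNI.
    config.load_kube_config()
    v1 = client.CoreV1Api()
    while True:
        for node in crashlooping_controller_nodes(v1):
            restart_calico_node_on(v1, node)
        time.sleep(60)


if __name__ == "__main__":
    main()
```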

3. Best Practices

What is the recommended production deployment strategy to handle this scenario?
