Leader-election wedge: controller spins for hours logging 'no controller is watching nodes' without self-recovery #55

@diranged

Summary

Vigil can enter a state where the leader lease is held but no controller is actually running, and it stays in that state indefinitely while logging the exact diagnosis every 30 seconds. Observed in a live deployment (a sandbox cluster, see Severity): wedged for 10h 56m before manual intervention.

Related to #21, which described a gap of about 7 minutes during rolling updates. This is the same failure mode persisting for hours: the self-diagnostic message is in place, but no self-recovery action is taken.

Environment

  • Vigil v0.6.2
  • Kubernetes 1.35 on EKS
  • 2 replicas with leader election enabled
  • Deployed via ArgoCD

Symptoms

The standby pod (17h old) emitted this every 30s for nearly 11 hours:

ERROR  leader-election  leader lease acquisition is critically delayed — no controller is watching nodes
{"elapsed": "10h52m0s", "lease-duration": "15s"}

During the wedge:

  • Lease object existed, holder identity referenced a pod that was no longer running
  • Both surviving replicas could see the stale lease but neither force-acquired
  • lease-duration is 15s but the lease was never expired/replaced
  • Metrics endpoint was up; pod was Ready; liveness probe passed
  • No nodes with the startup taint were reconciled — kubectl get events showed zero node-readiness activity for the entire 11h
  • Karpenter-provisioned nodes during this window had their taint removed by some other path (a timeout or a different controller), so workloads were not stuck, but Vigil's whole purpose was silently bypassed

Manual kubectl delete lease vigil-controller.nextdoor.com -n vigil-system resolved it instantly: re-election succeeded in 6 seconds, controller picked up the next new node within ~1 second of it booting, and tracked all 13 DaemonSets through readiness normally.

Root cause hypothesis

The lease appears to have been held by a stale holder identity from a previous pod that died without releasing it. The Kubernetes leader-election library's RenewDeadline/LeaseDuration handling should have let a standby treat the lease as expired and take over, but it did not. One possibility is that the previous pod was renewing the lease right up until it was force-deleted, leaving a non-expired record with no live owner.
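For reference, the expiry rule a standby is expected to apply can be modelled in a few lines. This is a simplified sketch of my understanding of the client-go behaviour (a candidate treats the lease as held until LeaseDuration has passed since it last saw the record change, measured on its own clock), not the library's code. The type `observedLease` and the helpers `leaseValid` and `validAfter` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// observedLease is a hypothetical, simplified view of what a candidate
// remembers about the lease. It is not the client-go type.
type observedLease struct {
	holder        string        // holder identity recorded in the lease
	leaseDuration time.Duration // duration advertised in the lease record
	observedAt    time.Time     // local time the candidate last saw the record change
}

// leaseValid reports whether a candidate should still treat the lease as held.
func leaseValid(l observedLease, now time.Time) bool {
	return l.holder != "" && l.observedAt.Add(l.leaseDuration).After(now)
}

// validAfter builds a lease with the 15s duration from this issue, last
// observed at a fixed instant, and checks it elapsedSeconds later.
func validAfter(elapsedSeconds int) bool {
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	l := observedLease{holder: "dead-pod", leaseDuration: 15 * time.Second, observedAt: start}
	return leaseValid(l, start.Add(time.Duration(elapsedSeconds)*time.Second))
}

func main() {
	// 39120s is the 10h52m reported in the log line above.
	for _, s := range []int{10, 16, 39120} {
		fmt.Printf("elapsed=%ds valid=%v\n", s, validAfter(s))
	}
}
```

Under this model a record that stops changing is treated as held at 10s and as expired from 15s onward, so a dead holder alone should not block acquisition for hours. If the model is right, something else kept the standbys from acquiring, which is worth checking before settling on the hypothesis above.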

Proposed fixes

The bar is: if Vigil knows its controller isn't running, it should not stay running. The error message proves it has the signal; it just needs to act on it.

Two approaches, either or both:

  1. Liveness probe failure. Have monitorLeaseAcquisition flip the health endpoint to unhealthy after some threshold (e.g. 5× LeaseDuration, or a configurable timeout). Pod gets killed → fresh start → lease re-acquires cleanly. This is the cheapest fix and follows standard k8s patterns.

  2. Stale-holder force acquisition. When monitorLeaseAcquisition detects extended delay, look up the lease holder's pod by name. If the pod doesn't exist (or is not Ready), update the lease to take ownership.
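A minimal sketch of Option 1, assuming the health endpoint can be backed by a small handler. The `watchdog` type, its `Progress` method, the `/healthz` path, and `probeStatusAfter` are hypothetical names, not Vigil's actual code. The 15s lease duration and the 5× multiplier come from this issue.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// watchdog is a hypothetical helper, not part of Vigil: it turns
// "no leadership progress for too long" into a failing health check.
type watchdog struct {
	mu           sync.Mutex
	lastProgress time.Time        // last time leadership looked healthy
	threshold    time.Duration    // how long a stall is tolerated
	now          func() time.Time // injectable clock so the logic can be tested
}

// Progress records that this replica is leading, or saw the lease renewed.
func (w *watchdog) Progress() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.lastProgress = w.now()
}

// ServeHTTP answers 200 while the stall is within the threshold, 503 after it.
func (w *watchdog) ServeHTTP(rw http.ResponseWriter, _ *http.Request) {
	w.mu.Lock()
	stalled := w.now().Sub(w.lastProgress)
	w.mu.Unlock()
	if stalled > w.threshold {
		msg := fmt.Sprintf("no leader progress for %s", stalled)
		http.Error(rw, msg, http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(rw, "ok")
}

// probeStatusAfter simulates one liveness probe elapsedSeconds after the
// last recorded progress, with a threshold of 5 x 15s = 75s.
func probeStatusAfter(elapsedSeconds int) int {
	clock := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	w := &watchdog{threshold: 5 * 15 * time.Second, now: func() time.Time { return clock }}
	w.Progress()
	clock = clock.Add(time.Duration(elapsedSeconds) * time.Second)
	rec := httptest.NewRecorder()
	w.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	fmt.Println(probeStatusAfter(30), probeStatusAfter(120)) // prints "200 503"
}
```

What counts as progress is the main design decision. A healthy standby legitimately waits forever while another replica leads, so the clock should presumably reset whenever the lease record is seen to change, not only when this replica acquires it.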

Option 1 is simpler and lower-risk. Option 2 is more graceful but adds complexity.
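To make the added complexity of Option 2 concrete, here is a sketch of just the decision step, with the Kubernetes API calls left out. It assumes the holder identity has the form `<pod-name>_<uuid>`, which should be verified against what Vigil actually writes to the lease. `holderState`, `podNameFromHolder`, `shouldForceAcquire`, and the `forceAfter` threshold are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// holderState is the result of a hypothetical pod lookup for the lease holder.
type holderState int

const (
	holderMissing  holderState = iota // no pod with that name exists
	holderNotReady                    // pod exists but is not Ready
	holderReady                       // pod exists and is Ready
)

const (
	leaseDuration = 15 * time.Second  // value from this issue
	forceAfter    = 5 * leaseDuration // assumed threshold, mirroring Option 1
)

// podNameFromHolder extracts the pod name from a holder identity, assuming
// the "<pod-name>_<uuid>" form. Pod names cannot contain underscores, so the
// first underscore is taken as the separator. An identity without one is
// returned unchanged.
func podNameFromHolder(identity string) string {
	name, _, _ := strings.Cut(identity, "_")
	return name
}

// shouldForceAcquire reports whether a standby should take over the lease:
// only after an extended stall, and only if the holder pod is missing or
// not Ready.
func shouldForceAcquire(stalled time.Duration, holder holderState) bool {
	if stalled <= forceAfter {
		return false
	}
	return holder != holderReady
}

func main() {
	fmt.Println(podNameFromHolder("vigil-7c9d_0f3a")) // prints "vigil-7c9d"
	fmt.Println(shouldForceAcquire(10*time.Hour, holderMissing))
	fmt.Println(shouldForceAcquire(10*time.Hour, holderReady))
}
```

The lease update itself would need to go through the API server's optimistic concurrency (the object's resourceVersion) so that two standbys racing to take over cannot both succeed. Treating a not-Ready holder as dead is the riskier branch, since such a pod may still be renewing.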

Severity

In our case the wedge happened on a sandbox cluster, so the blast radius was zero. On a production taskworker cluster with high node churn, the same wedge would have let taskworker pods land before DaemonSets were Ready for the entire wedge duration, which is exactly what Vigil exists to prevent.

Labels: bug
