Leader-election wedge: controller spins for hours logging 'no controller is watching nodes' without self-recovery

## Summary

Vigil can enter a state where the leader lease is held but no controller is actually running, and remains in that state indefinitely while logging the exact diagnosis every 30 seconds. Observed in production: **10h 56m** wedged before manual intervention.

Related to #21, but #21 described a ~7 min gap during rolling updates. This is the same failure mode persisting for hours — the self-diagnostic message is in place but no self-recovery action is taken.

## Environment

- Vigil v0.6.2
- Kubernetes 1.35 on EKS
- 2 replicas with leader election enabled
- Deployed via ArgoCD

## Symptoms

The standby pod (17h old) emitted this every 30s for nearly 11 hours:

```
ERROR  leader-election  leader lease acquisition is critically delayed — no controller is watching nodes
{"elapsed": "10h52m0s", "lease-duration": "15s"}
```

During the wedge:
- Lease object existed, holder identity referenced a pod that was no longer running
- Both surviving replicas could see the stale lease but neither force-acquired
- `lease-duration` is 15s, but the stale lease was never treated as expired or replaced
- Metrics endpoint was up; pod was Ready; liveness probe passed
- No nodes with the startup taint were reconciled — `kubectl get events` showed zero `node-readiness` activity for the entire 11h
- Karpenter-provisioned nodes during this window had their taint removed by some other path (timeout / different controller), so workloads were not stuck — but Vigil's whole purpose was bypassed silently

Manual `kubectl delete lease vigil-controller.nextdoor.com -n vigil-system` resolved it instantly: re-election succeeded in 6 seconds, controller picked up the next new node within ~1 second of it booting, and tracked all 13 DaemonSets through readiness normally.

## Root cause hypothesis

The lease appears to have been held under a stale holder identity belonging to a previous pod that died without releasing it. The Kubernetes leader-election library's `LeaseDuration` should have let the surviving replicas treat the record as expired and take over (`RenewDeadline` only governs when a current leader gives up), but that did not happen, possibly because the previous pod was renewing the lease right up until it was force-deleted, leaving a non-expired record with no live owner.

## Proposed fixes

The bar is: **if Vigil knows its controller isn't running, it should not stay running.** The error message proves it has the signal — it just needs to act on it.

Two approaches, either or both:

1. **Liveness probe failure.** Have `monitorLeaseAcquisition` flip the health endpoint to unhealthy after some threshold (e.g. 5× `LeaseDuration`, or a configurable timeout). Pod gets killed → fresh start → lease re-acquires cleanly. This is the cheapest fix and follows standard k8s patterns.

2. **Stale-holder force acquisition.** When `monitorLeaseAcquisition` detects extended delay, look up the lease holder's pod by name. If the pod doesn't exist (or is not Ready), update the lease to take ownership.

Option 1 is simpler and lower-risk. Option 2 is more graceful but adds complexity.

## Severity

In our case, the wedge happened on a sandbox cluster, so the blast radius was zero. On a production taskworker cluster with high node churn, the same wedge would have let taskworker pods land before DaemonSets were Ready for the entire wedge duration, which is exactly what Vigil exists to prevent.
