- Create external HTTPRoutes without the three required pieces: `external-dns: "true"` label, `external-dns.alpha.kubernetes.io/target: vanillax.me` annotation, and `sectionName: https` — **DNS won't be created and Cloudflare tunnel routing fails silently**
- Use `Replace=true,Force=true` sync-options on Jobs — causes duplicate Job execution bug ([#24005](https://github.com/argoproj/argo-cd/issues/24005)); use ArgoCD hooks instead
- Auto-merge major Helm chart version bumps for critical infrastructure (kube-prometheus-stack, longhorn, kyverno, cilium) — **a kube-prometheus-stack v82→v83 auto-merge caused a full cluster outage on 2026-04-08 via Kyverno webhook deadlock**. Pin Renovate to minor/patch only for these charts.
- Remove infrastructure namespaces from Kyverno webhook exclusions in `values.yaml` — **longhorn-system, argocd, volsync-system, etc. MUST be excluded or a Kyverno crash causes full cluster deadlock**. See `infrastructure/controllers/kyverno/CLAUDE.md` for details.
## Nested CLAUDE.md Files
@@ -182,3 +184,4 @@ Detailed instructions load automatically when working in these directories:
infrastructure/controllers/kyverno/CLAUDE.md (57 additions, 0 deletions)
@@ -227,6 +227,63 @@ spec:
**If you need to re-process existing resources after a policy change**, do a one-time ArgoCD sync or manually trigger resource re-admission — don't enable `mutateExistingOnPolicyUpdate`.
## Critical: Webhook Deadlock Prevention
**Incident: 2026-04-08 — Full cluster outage caused by Kyverno webhook deadlock.**
### What happened
1. Renovate auto-merged a kube-prometheus-stack v82→v83 Helm chart upgrade
2. ArgoCD synced it, restarting many Prometheus pods simultaneously
3. The restart flood overwhelmed Kyverno's admission controller — its internal cache sync failed
4. Kyverno crashed with `"failed to wait for cache sync"` on v1beta1 informers
5. Kyverno's webhook was still registered with `failurePolicy: Fail`
6. **Every Deployment/StatefulSet/DaemonSet creation outside kube-system was rejected**
7. Longhorn (longhorn-system) couldn't restart → no storage → ArgoCD couldn't mount PVCs → ArgoCD died
8. Full cluster deadlock — even rebooting all nodes didn't fix it because webhook configs survive in etcd
9. Only manual deletion of webhook configurations broke the deadlock
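Step 5 is the hinge of the incident: a webhook registered with `failurePolicy: Fail` rejects requests whenever its backend is down. As a rough sketch for spotting such webhooks, the helper below filters `name<TAB>policies` lines and prints the names whose policy list includes `Fail`. The function name `fail_closed_webhooks` is hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Sketch only. Reads lines of the form "<name><TAB><failurePolicy ...>" on stdin
# and prints the names of webhook configurations that contain a Fail policy.
fail_closed_webhooks() {
  awk -F'\t' '$2 ~ /Fail/ { print $1 }'
}

# Example input source (requires cluster access, so it is left as a comment):
#   kubectl get validatingwebhookconfigurations \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.webhooks[*].failurePolicy}{"\n"}{end}' \
#     | fail_closed_webhooks
```

Any configuration listed this way blocks matching requests while its service is unavailable, which is why the namespace exclusions matter.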
### The fix
Infrastructure namespaces are excluded from Kyverno's webhook `namespaceSelector` in `values.yaml`:
```yaml
config:
  webhooks:
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values:
            - kube-system
            - longhorn-system      # Wave 1 storage
            - argocd               # Wave 0 GitOps
            - volsync-system       # Wave 1 backup controller
            - snapshot-controller  # Wave 1 snapshots
            - cert-manager         # Wave 4 but critical
            - external-secrets     # Wave 0 secrets
            - 1passwordconnect     # Wave 0 secrets
```
**Why this works:** The fail-closed PVC gate only needs to protect app namespaces (Waves 4-6). Infrastructure namespaces (Waves 0-2) must boot before Kyverno (Wave 3), so Kyverno's webhook should never gate them. cert-manager is Wave 4 but is excluded as well because it is treated as critical.
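One way to keep the exclusion list from shrinking unnoticed is a small pre-merge check. The sketch below is an assumption, not something in the repository: the function name `check_excluded_namespaces` and the idea of passing the values file path as an argument are hypothetical. It does a plain-text match on list items rather than parsing YAML.

```bash
#!/usr/bin/env bash
# Hypothetical guard: verify that every required infrastructure namespace
# still appears as a list item ("- <name>") in a Kyverno values file.
# This is a plain-text check, not a YAML parser.

REQUIRED_NAMESPACES="kube-system longhorn-system argocd volsync-system snapshot-controller cert-manager external-secrets 1passwordconnect"

check_excluded_namespaces() {
  local values_file="$1"
  local missing=0
  local ns
  for ns in $REQUIRED_NAMESPACES; do
    # Match "- <ns>", optionally followed by whitespace and a trailing comment.
    if ! grep -Eq "^[[:space:]]*-[[:space:]]+${ns}[[:space:]]*(#.*)?$" "$values_file"; then
      echo "missing from exclusion list: ${ns}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Called as `check_excluded_namespaces path/to/values.yaml`, it returns non-zero and names each missing namespace on stderr, so it could fail a CI step.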
### Emergency recovery
If Kyverno causes a webhook deadlock again:
```bash
./scripts/emergency-webhook-cleanup.sh
```
This deletes all Kyverno webhook configurations. Kyverno recreates them once it's healthy. The script is safe to run — it only removes webhook registrations, not policies or generated resources.
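The contents of `emergency-webhook-cleanup.sh` are not shown here. As a rough sketch of the same idea, the functions below select webhook configurations by name and delete them. This assumes Kyverno-owned webhook configurations have `kyverno` in their names, which should be verified on the cluster first. All function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: NOT the repository's emergency-webhook-cleanup.sh.
# Assumption: Kyverno-owned webhook configurations contain "kyverno" in their names.

# Keep only resource names that mention kyverno (reads names on stdin).
filter_kyverno_webhooks() {
  grep -i 'kyverno' || true
}

# List Kyverno webhook configurations as "<resource>.<group>/<name>" lines.
list_kyverno_webhooks() {
  kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o name \
    | filter_kyverno_webhooks
}

# Delete them one by one; requires sufficient cluster permissions.
# Nothing is deleted unless this function is called explicitly.
delete_kyverno_webhooks() {
  list_kyverno_webhooks | while read -r name; do
    kubectl delete "$name"
  done
}
```

Running `list_kyverno_webhooks` first shows what `delete_kyverno_webhooks` would remove.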
### Prevention rules
- **Never remove infrastructure namespaces from the webhook exclusion list**
- **Pin Renovate to minor/patch for critical charts** (kube-prometheus-stack, longhorn, kyverno, cilium) — major version bumps should be manually reviewed
- **Monitor Kyverno admission controller restarts** — more than 2 restarts in 10 minutes indicates a potential deadlock forming
- **If Kyverno shows `"failed to wait for cache sync"` in logs** — run the emergency cleanup script immediately, don't wait
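The last two rules can be turned into simple checks. The sketch below is illustrative only: the function names are hypothetical, and the namespace and deployment name in the usage comment are assumptions about how Kyverno is installed here.

```bash
#!/usr/bin/env bash
# Sketch: detect the cache-sync failure message in log text read from stdin.
# Returns 0 if the message is present, 1 otherwise.
has_cache_sync_failure() {
  grep -q 'failed to wait for cache sync'
}

# Returns 0 when the restart count grew by more than 2 between two samples.
# Take the samples 10 minutes apart to match the rule above.
# Arguments: <earlier restart count> <later restart count>
restarts_exceed_threshold() {
  [ $(( $2 - $1 )) -gt 2 ]
}

# Example usage (assumed namespace "kyverno" and deployment
# "kyverno-admission-controller"; adjust to the actual install):
#   kubectl logs -n kyverno deploy/kyverno-admission-controller \
#     | has_cache_sync_failure && echo "run the emergency cleanup script"
```

Both helpers only report a condition. Whether to run the cleanup script stays a manual decision.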