Commit 0ed5692

committed
up
1 parent 2c3c575 commit 0ed5692

5 files changed

Lines changed: 84 additions & 9 deletions

.wolf/anatomy.md

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -1,14 +1,14 @@
11
# anatomy.md
22

3-
> Auto-maintained by OpenWolf. Last scanned: 2026-04-09T01:45:00.204Z
3+
> Auto-maintained by OpenWolf. Last scanned: 2026-04-09T01:45:45.973Z
44
> Files: 518 tracked | Anatomy hits: 0 | Misses: 0
55
66
## ./
77

88
- `.gitattributes` — Git attributes (~6 tok)
99
- `.gitignore` — Git ignore rules (~611 tok)
1010
- `astro.config.mjs` — Astro configuration (~274 tok)
11-
- `CLAUDE.md` — OpenWolf (~3047 tok)
11+
- `CLAUDE.md` — OpenWolf (~3081 tok)
1212
- `firewalla-dns-config.txt` — Firewalla Local DNS Configuration for vanillax.me (~424 tok)
1313
- `MIGRATION_EXTERNAL_DNS.md` — Migration to ExternalDNS-Based Split DNS Architecture (~1870 tok)
1414
- `mkdocs.yml` (~269 tok)
@@ -150,7 +150,7 @@
150150

151151
## infrastructure/controllers/kyverno/
152152

153-
- `CLAUDE.md` — Kyverno Backup & Restore System (~2614 tok)
153+
- `CLAUDE.md` — Kyverno Backup & Restore System (~3270 tok)
154154
- `kustomization.yaml` — K8s Kustomization: kyverno (~338 tok)
155155
- `namespace.yaml` — K8s Namespace: kyverno (~17 tok)
156156
- `rbac-patch.yaml` — K8s ClusterRole: kyverno:background-controller:volsync (~759 tok)

.wolf/hooks/_session.json

Lines changed: 19 additions & 6 deletions
@@ -23,8 +23,8 @@
2323
"first_read": "2026-04-08T23:10:14.425Z"
2424
},
2525
"/home/vanillax/programming/talos-argocd-proxmox/infrastructure/controllers/kyverno/CLAUDE.md": {
26-
"count": 2,
27-
"tokens": 2907,
26+
"count": 3,
27+
"tokens": 3047,
2828
"first_read": "2026-04-08T23:10:21.123Z"
2929
},
3030
"/home/vanillax/programming/talos-argocd-proxmox/docs/pvc-plumber-full-flow.md": {
@@ -88,8 +88,8 @@
8888
"first_read": "2026-04-09T00:37:54.042Z"
8989
},
9090
"/home/vanillax/programming/talos-argocd-proxmox/CLAUDE.md": {
91-
"count": 2,
92-
"tokens": 2907,
91+
"count": 4,
92+
"tokens": 3047,
9393
"first_read": "2026-04-09T01:44:34.360Z"
9494
}
9595
},
@@ -189,6 +189,18 @@
189189
"action": "edit",
190190
"tokens": 211,
191191
"at": "2026-04-09T01:45:00.210Z"
192+
},
193+
{
194+
"file": "/home/vanillax/programming/talos-argocd-proxmox/infrastructure/controllers/kyverno/CLAUDE.md",
195+
"action": "edit",
196+
"tokens": 758,
197+
"at": "2026-04-09T01:45:26.520Z"
198+
},
199+
{
200+
"file": "/home/vanillax/programming/talos-argocd-proxmox/CLAUDE.md",
201+
"action": "edit",
202+
"tokens": 83,
203+
"at": "2026-04-09T01:45:45.979Z"
192204
}
193205
],
194206
"edit_counts": {
@@ -200,11 +212,12 @@
200212
"infrastructure/controllers/kyverno/policies/volsync-pvc-validate.yaml": 2,
201213
"infrastructure/controllers/kyverno/policies/volsync-pvc-mutate.yaml": 1,
202214
"infrastructure/controllers/kyverno/policies/volsync-pvc-generate.yaml": 2,
203-
"CLAUDE.md": 1
215+
"CLAUDE.md": 2,
216+
"infrastructure/controllers/kyverno/CLAUDE.md": 1
204217
},
205218
"anatomy_hits": 15,
206219
"anatomy_misses": 3,
207-
"repeated_reads_warned": 17,
220+
"repeated_reads_warned": 20,
208221
"cerebrum_warnings": 0,
209222
"stop_count": 43
210223
}

.wolf/memory.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -298,3 +298,5 @@
298298
| 21:44 | Edited my-apps/home/project-zomboid/deployment.yaml | 3→6 lines | ~46 |
299299
| 21:44 | Session end: 15 writes across 8 files (pvc.yaml, deployment.yaml, 2026-04-09-kyverno-cel-migration.md, values.yaml, emergency-webhook-cleanup.sh) | 18 reads | ~46306 tok |
300300
| 21:45 | Edited CLAUDE.md | 1→3 lines | ~197 |
301+
| 21:45 | Edited infrastructure/controllers/kyverno/CLAUDE.md | expanded (+57 lines) | ~707 |
302+
| 21:45 | Edited CLAUDE.md | 1→2 lines | ~77 |

CLAUDE.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -128,6 +128,8 @@ docs/ # Documentation
128128
- Omit Kyverno canonical defaults (`emitWarning`, `validationFailureAction`, `skipBackgroundRequests`) from policy YAML — **Kyverno webhook adds them, ArgoCD detects the diff, app shows OutOfSync**
129129
- Create external HTTPRoutes without the three required pieces: `external-dns: "true"` label, `external-dns.alpha.kubernetes.io/target: vanillax.me` annotation, and `sectionName: https`. **DNS won't be created and Cloudflare tunnel routing fails silently**
130130
- Use `Replace=true,Force=true` sync-options on Jobs — causes duplicate Job execution bug ([#24005](https://github.com/argoproj/argo-cd/issues/24005)); use ArgoCD hooks instead
131+
- Auto-merge major Helm chart version bumps for critical infrastructure (kube-prometheus-stack, longhorn, kyverno, cilium) — **a kube-prometheus-stack v82→v83 auto-merge caused a full cluster outage on 2026-04-08 via Kyverno webhook deadlock**. Pin Renovate to minor/patch only for these charts.
132+
- Remove infrastructure namespaces from Kyverno webhook exclusions in `values.yaml`. **longhorn-system, argocd, volsync-system, etc. MUST be excluded or a Kyverno crash causes full cluster deadlock**. See `infrastructure/controllers/kyverno/CLAUDE.md` for details.
131133

132134
## Nested CLAUDE.md Files
133135

@@ -182,3 +184,4 @@ Detailed instructions load automatically when working in these directories:
182184
- **[docs/argocd.md](docs/argocd.md)** - ArgoCD documentation
183185
- **[docs/vpa-resource-optimization.md](docs/vpa-resource-optimization.md)** - VPA auto-scaling
184186
- **[docs/plans/2026-03-22-alloy-otel-honeycomb-design.md](docs/plans/2026-03-22-alloy-otel-honeycomb-design.md)** - OTEL + Honeycomb observability design
187+
- **[scripts/emergency-webhook-cleanup.sh](scripts/emergency-webhook-cleanup.sh)** - Emergency recovery from Kyverno webhook deadlock

infrastructure/controllers/kyverno/CLAUDE.md

Lines changed: 57 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -227,6 +227,63 @@ spec:
227227

228228
**If you need to re-process existing resources after a policy change**, do a one-time ArgoCD sync or manually trigger resource re-admission — don't enable `mutateExistingOnPolicyUpdate`.
229229

230+
## Critical: Webhook Deadlock Prevention
231+
232+
**Incident: 2026-04-08 — Full cluster outage caused by Kyverno webhook deadlock.**
233+
234+
### What happened
235+
236+
1. Renovate auto-merged a kube-prometheus-stack v82→v83 Helm chart upgrade
237+
2. ArgoCD synced it, restarting many Prometheus pods simultaneously
238+
3. The restart flood overwhelmed Kyverno's admission controller — its internal cache sync failed
239+
4. Kyverno crashed with `"failed to wait for cache sync"` on v1beta1 informers
240+
5. Kyverno's webhook was still registered with `failurePolicy: Fail`
241+
6. **Every Deployment/StatefulSet/DaemonSet creation outside kube-system was rejected**
242+
7. Longhorn (longhorn-system) couldn't restart → no storage → ArgoCD couldn't mount PVCs → ArgoCD died
243+
8. Full cluster deadlock — even rebooting all nodes didn't fix it because webhook configs survive in etcd
244+
9. Only manual deletion of webhook configurations broke the deadlock
245+
246+
### The fix
247+
248+
Infrastructure namespaces are excluded from Kyverno's webhook `namespaceSelector` in `values.yaml`:
249+
```yaml
config:
  webhooks:
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values:
            - kube-system
            - longhorn-system      # Wave 1 storage
            - argocd               # Wave 0 GitOps
            - volsync-system       # Wave 1 backup controller
            - snapshot-controller  # Wave 1 snapshots
            - cert-manager         # Wave 4 but critical
            - external-secrets     # Wave 0 secrets
            - 1passwordconnect     # Wave 0 secrets
```
267+
268+
**Why this works:** The fail-closed PVC gate only needs to protect app namespaces (Waves 4-6). Infrastructure namespaces (Waves 0-2) must boot before Kyverno (Wave 3), so they should never be gated by Kyverno's webhook.
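
One way to confirm that the exclusion list actually reached the cluster is to compare the required namespaces against what the live webhook configuration carries. The sketch below is a hypothetical helper, not part of this repo: the function name `missing_exclusions` and the webhook name `kyverno-resource-validating-webhook-cfg` in the usage comment are assumptions, so adjust them to match the installed Kyverno release.

```bash
#!/usr/bin/env bash
# Hypothetical check: which required namespaces are absent from an exclusion list?
# The namespace list is copied from the values.yaml snippet in this section.
REQUIRED_EXCLUSIONS="kube-system longhorn-system argocd volsync-system snapshot-controller cert-manager external-secrets 1passwordconnect"

# missing_exclusions "<space-separated excluded namespaces>"
# Prints each required namespace that is not in the given list, one per line.
missing_exclusions() {
  local excluded ns
  excluded=" $1 "
  for ns in $REQUIRED_EXCLUSIONS; do
    case "$excluded" in
      *" $ns "*) ;;        # present, nothing to report
      *) echo "$ns" ;;     # absent: the webhook would gate this namespace
    esac
  done
}

# Example against a live cluster (assumed webhook name; not run automatically):
#   live=$(kubectl get validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg \
#     -o jsonpath='{.webhooks[0].namespaceSelector.matchExpressions[0].values[*]}')
#   missing_exclusions "$live"
```

Empty output means every infrastructure namespace is excluded. Any printed name is a namespace that would still be blocked if Kyverno went down.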
269+
270+
### Emergency recovery
271+
272+
If Kyverno causes a webhook deadlock again:
273+
```bash
./scripts/emergency-webhook-cleanup.sh
```
277+
278+
This deletes all Kyverno webhook configurations. Kyverno recreates them once it's healthy. The script is safe to run — it only removes webhook registrations, not policies or generated resources.
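
The script itself is not shown in this commit. As a rough idea of what a cleanup like this typically does, the sketch below selects webhook configurations whose names contain `kyverno` and deletes them on request. The name-based filter, the function names, and the `--apply` flag are all assumptions for illustration. Check the real script before relying on any of this.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a Kyverno webhook cleanup. This is NOT the contents of
# scripts/emergency-webhook-cleanup.sh, which this commit does not show.

# Reads `kubectl get ... -o name` lines on stdin and keeps the Kyverno ones.
# Assumption: Kyverno's webhook configurations have "kyverno" in their names.
filter_kyverno_webhooks() {
  grep -i 'kyverno' || true   # no match is not an error
}

# Lists Kyverno webhook configurations; deletes them only when given --apply.
cleanup_kyverno_webhooks() {
  local targets
  targets=$(kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o name \
    | filter_kyverno_webhooks)
  if [ -z "$targets" ]; then
    echo "no Kyverno webhook configurations found"
    return 0
  fi
  echo "$targets"
  if [ "${1:-}" = "--apply" ]; then
    echo "$targets" | xargs kubectl delete
  fi
}

# Usage (not run automatically):
#   cleanup_kyverno_webhooks           # dry run, prints what would be deleted
#   cleanup_kyverno_webhooks --apply   # actually deletes
```

Defaulting to a dry run is a deliberate choice in this sketch: during an outage it is easy to run the wrong command, and printing the targets first costs only a few seconds.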
279+
280+
### Prevention rules
281+
282+
- **Never remove infrastructure namespaces from the webhook exclusion list**
283+
- **Pin Renovate to minor/patch for critical charts** (kube-prometheus-stack, longhorn, kyverno, cilium) — major version bumps should be manually reviewed
284+
- **Monitor Kyverno admission controller restarts** — more than 2 restarts in 10 minutes indicates a potential deadlock forming
285+
- **If Kyverno shows `"failed to wait for cache sync"` in logs** — run the emergency cleanup script immediately, don't wait
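
The last two rules can be turned into a simple periodic check. This is a minimal sketch under stated assumptions: the `kyverno` namespace and the `kyverno-admission-controller` deployment name in the usage comment follow common Helm chart defaults and may differ in this cluster, and both helper functions are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical health check helpers for the Kyverno admission controller.

# Returns 0 if the log text on stdin contains the cache-sync failure message.
has_cache_sync_failure() {
  grep -q 'failed to wait for cache sync'
}

# restarts_exceeded PREVIOUS CURRENT
# Compares two restart counts sampled 10 minutes apart. Returns 0 when the
# difference is more than 2, the threshold given in the rules above.
restarts_exceeded() {
  [ $(( $2 - $1 )) -gt 2 ]
}

# Example wiring (assumed names; not run automatically):
#   kubectl logs -n kyverno deploy/kyverno-admission-controller --since=10m \
#     | has_cache_sync_failure && echo "cache sync failure: run the cleanup script"
```

Because a pod's restart count is cumulative, the sketch compares two samples rather than reading a single value. How the samples are collected and stored is left open here.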
286+
230287
## Debugging Backup/Restore
231288

232289
```bash
