Commit 41a764f

committed
fix overload
1 parent 6292dad commit 41a764f

7 files changed

Lines changed: 48 additions & 26 deletions

CLAUDE.md

Lines changed: 4 additions & 2 deletions
@@ -112,8 +112,10 @@ docs/ # Documentation
112112
- Manually create or delete ReplicationSource/ReplicationDestination (Kyverno manages these)
113113
- Use legacy `nfs:` block for NFS PVs (mountOptions silently ignored — use CSI)
114114
- Use `RollingUpdate` strategy on Deployments with RWO PVCs (causes Multi-Attach deadlock)
115-
- Use `background: true` on Kyverno generate policies — **causes API server overload from continuous background scanning of all matching resources; use `background: false` and rely on admission-time generation**
116-
- Use `mutateExistingOnPolicyUpdate: true` on Kyverno generate policies — **re-evaluates ALL matching resources cluster-wide on any policy change, creating UpdateRequest storms**
115+
- Use `background: true` on Kyverno generate policies — **causes API server overload from continuous background scanning; use `background: false`**
116+
- Use `mutateExistingOnPolicyUpdate: true` on Kyverno generate policies — **re-evaluates ALL matching resources cluster-wide on any policy change**
117+
- Use `synchronize: true` on Kyverno generate policies — **drift watchers create UpdateRequests on every controller status update, hammering the API server; use `synchronize: false`**
118+
- Omit Kyverno canonical defaults (`emitWarning`, `validationFailureAction`, `skipBackgroundRequests`) from policy YAML — **Kyverno webhook adds them, ArgoCD detects the diff, app shows OutOfSync**
117119

118120
## Nested CLAUDE.md Files
119121

infrastructure/controllers/argocd/apps/kyverno-app.yaml

Lines changed: 6 additions & 14 deletions
@@ -36,24 +36,16 @@ spec:
3636
ignoreDifferences:
3737
- group: kyverno.io
3838
kind: ClusterPolicy
39-
jqPathExpressions:
40-
- .metadata.generation
41-
# Kyverno webhook adds these defaults on admission
42-
- .spec.emitWarning
43-
- .spec.validationFailureAction
44-
- .spec.rules[].skipBackgroundRequests
39+
jsonPointers:
40+
- /metadata/generation
4541
- group: kyverno.io
4642
kind: ClusterCleanupPolicy
47-
jqPathExpressions:
48-
- .metadata.generation
49-
- .spec.validationFailureAction
43+
jsonPointers:
44+
- /metadata/generation
5045
- group: kyverno.io
5146
kind: Policy
52-
jqPathExpressions:
53-
- .metadata.generation
54-
- .spec.emitWarning
55-
- .spec.validationFailureAction
56-
- .spec.rules[].skipBackgroundRequests
47+
jsonPointers:
48+
- /metadata/generation
5749
# Kyverno injects caBundle into webhooks after creation
5850
- group: admissionregistration.k8s.io
5951
kind: MutatingWebhookConfiguration

infrastructure/controllers/kyverno/CLAUDE.md

Lines changed: 8 additions & 3 deletions
@@ -207,18 +207,23 @@ spec:
207207

208208
**Never use `mutateExistingOnPolicyUpdate: true` on generate policies**. This re-evaluates ALL matching resources cluster-wide whenever the policy YAML changes — even a comment edit triggers it. Combined with background scanning, this caused a 23-hour API server overload incident (2026-03-25).
209209

210-
**The safe pattern for all Kyverno generate policies:**
210+
**The safe pattern for all Kyverno generate policies (canonical form):**
211211
```yaml
212212
spec:
213213
mutateExistingOnPolicyUpdate: false # REQUIRED — prevents cluster-wide re-evaluation on policy change
214214
background: false # REQUIRED — prevents continuous background scanning
215+
emitWarning: false # Kyverno default — include to match canonical form for ArgoCD sync
216+
validationFailureAction: Audit # Kyverno default — include to match canonical form for ArgoCD sync
215217
rules:
216218
- name: my-generate-rule
219+
skipBackgroundRequests: true # Kyverno default — include to match canonical form for ArgoCD sync
217220
generate:
218-
synchronize: true # OK - sync enforcement happens via admission webhook, not background scan
221+
synchronize: false # REQUIRED - prevents drift watchers that generate UpdateRequests on every controller status update
219222
```
220223
221-
`synchronize: true` still works with `background: false` because sync is enforced through the admission controller. Generated resources are created/recreated when the trigger resource goes through admission (e.g., PVC creation via ArgoCD sync).
224+
**Why `synchronize: false`**: With `synchronize: true`, Kyverno watches every generated resource (ExternalSecrets, ReplicationSources, etc.) and creates UpdateRequests whenever their controllers update status. With ~114 watched resources, this generates hundreds of thousands of API calls. Resources are still created on admission (PVC creation via ArgoCD sync) — they just aren't re-synced on drift.
225+
226+
**Why canonical form**: Kyverno's admission webhook adds `emitWarning`, `validationFailureAction`, and `skipBackgroundRequests` as defaults. If these aren't in git, ArgoCD detects the diff and shows OutOfSync. Writing the defaults explicitly keeps ArgoCD happy.
222227

223228
**If you need to re-process existing resources after a policy change**, do a one-time ArgoCD sync or manually trigger resource re-admission — don't enable `mutateExistingOnPolicyUpdate`.
224229
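
The canonical-form rule described in this file could be checked mechanically before a policy reaches ArgoCD. The following is a minimal sketch, not part of this commit: `missing_canonical_defaults` is a hypothetical helper name, and the check is text-based (it does not parse YAML). It only confirms that each key appears at least once on a non-comment line, so it will not notice a single rule that lacks `skipBackgroundRequests` when another rule in the same file has it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption, not in the repository): print the Kyverno
# canonical defaults that never appear as a YAML key in the given policy file.
missing_canonical_defaults() {
  local file="$1" key missing=""
  for key in emitWarning validationFailureAction skipBackgroundRequests; do
    # Match "key:" after optional indentation and an optional list dash.
    # Comment lines start with "#", so they cannot match this pattern.
    if ! grep -Eq "^[[:space:]]*(-[[:space:]]+)?${key}:" "$file"; then
      missing="${missing:+$missing }${key}"
    fi
  done
  # Prints an empty line when nothing is missing.
  printf '%s\n' "$missing"
}
```

A CI step could call `missing_canonical_defaults path/to/policy.yaml` and fail when the output is non-empty. A stricter check would need a YAML-aware tool to inspect every rule individually.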

infrastructure/controllers/kyverno/policies/volsync-pvc-backup-restore.yaml

Lines changed: 10 additions & 3 deletions
@@ -14,11 +14,14 @@ metadata:
1414
spec:
1515
mutateExistingOnPolicyUpdate: false
1616
background: false
17+
emitWarning: false
18+
validationFailureAction: Audit
1719
rules:
1820
# Rule 0: Gate PVC creation on PVC Plumber availability (FAIL-CLOSED)
1921
# If PVC Plumber is unreachable, DENY PVC creation to prevent data loss
2022
# during disaster recovery. Apps retry via ArgoCD backoff until Plumber is healthy.
2123
- name: require-pvc-plumber-available
24+
skipBackgroundRequests: true
2225
match:
2326
any:
2427
- resources:
@@ -60,6 +63,7 @@ spec:
6063
# Rule 1: Conditionally add dataSourceRef if backup exists in Kopia
6164
# IMPORTANT: Only trigger on CREATE to avoid race conditions during PVC deletion
6265
- name: add-datasource-if-backup-exists
66+
skipBackgroundRequests: true
6367
match:
6468
any:
6569
- resources:
@@ -103,6 +107,7 @@ spec:
103107

104108
# Rule 2: Generate ExternalSecret for Kopia repository credentials
105109
- name: generate-kopia-secret
110+
skipBackgroundRequests: true
106111
match:
107112
any:
108113
- resources:
@@ -121,7 +126,7 @@ spec:
121126
- volsync-system
122127
- kyverno
123128
generate:
124-
synchronize: true
129+
synchronize: false
125130
apiVersion: external-secrets.io/v1
126131
kind: ExternalSecret
127132
name: "volsync-{{request.object.metadata.name}}"
@@ -160,6 +165,7 @@ spec:
160165
# Rule 3: Generate ReplicationSource (backup schedule)
161166
# IMPORTANT: Only create backup AFTER PVC is Bound to avoid conflicts with restore
162167
- name: generate-replication-source
168+
skipBackgroundRequests: true
163169
match:
164170
any:
165171
- resources:
@@ -190,7 +196,7 @@ spec:
190196
operator: GreaterThanOrEquals
191197
value: "2h"
192198
generate:
193-
synchronize: true
199+
synchronize: false
194200
apiVersion: volsync.backube/v1alpha1
195201
kind: ReplicationSource
196202
name: "{{request.object.metadata.name}}-backup"
@@ -225,6 +231,7 @@ spec:
225231

226232
# Rule 4: Generate ReplicationDestination (restore capability)
227233
- name: generate-replication-destination
234+
skipBackgroundRequests: true
228235
match:
229236
any:
230237
- resources:
@@ -243,7 +250,7 @@ spec:
243250
- volsync-system
244251
- kyverno
245252
generate:
246-
synchronize: true
253+
synchronize: false
247254
apiVersion: volsync.backube/v1alpha1
248255
kind: ReplicationDestination
249256
name: "{{request.object.metadata.name}}-backup"

infrastructure/controllers/kyverno/policies/vpa-auto-generate.yaml

Lines changed: 6 additions & 2 deletions
@@ -15,11 +15,14 @@ metadata:
1515
spec:
1616
mutateExistingOnPolicyUpdate: false
1717
background: false
18+
emitWarning: false
19+
validationFailureAction: Audit
1820
rules:
1921
# Rule 1: Infrastructure and monitoring — recommend only
2022
# Resources in these namespaces are manually tuned via GitOps.
2123
# VPA provides recommendations but does not auto-apply them.
2224
- name: generate-vpa-recommend-only
25+
skipBackgroundRequests: true
2326
match:
2427
any:
2528
- resources:
@@ -64,7 +67,7 @@ spec:
6467
- k8sgpt
6568
- pod-cleanup
6669
generate:
67-
synchronize: true
70+
synchronize: false
6871
apiVersion: autoscaling.k8s.io/v1
6972
kind: VerticalPodAutoscaler
7073
name: "{{request.object.metadata.name}}"
@@ -91,6 +94,7 @@ spec:
9194
# running pods. This prevents OOM kills on bursty workloads (e.g. Kafka,
9295
# TubeSync) where steady-state usage is low but peak usage needs headroom.
9396
- name: generate-vpa-auto-tune
97+
skipBackgroundRequests: true
9498
match:
9599
any:
96100
- resources:
@@ -137,7 +141,7 @@ spec:
137141
- k8sgpt
138142
- pod-cleanup
139143
generate:
140-
synchronize: true
144+
synchronize: false
141145
apiVersion: autoscaling.k8s.io/v1
142146
kind: VerticalPodAutoscaler
143147
name: "{{request.object.metadata.name}}"

infrastructure/controllers/kyverno/policies/vpa-min-allowed.yaml

Lines changed: 3 additions & 0 deletions
@@ -15,8 +15,11 @@ metadata:
1515
handles the rest. Example: vpa.kubernetes.io/min-memory: "2Gi"
1616
spec:
1717
background: false
18+
emitWarning: false
19+
validationFailureAction: Audit
1820
rules:
1921
- name: inject-min-allowed
22+
skipBackgroundRequests: true
2023
match:
2124
any:
2225
- resources:

scripts/validate-kyverno-policies.sh

Lines changed: 11 additions & 2 deletions
@@ -2,9 +2,10 @@
22
# Validates Kyverno generate policies for dangerous settings that cause API server overload.
33
# Used by: CI pipeline, pre-commit hook
44
#
5-
# background: true on generate policies → continuous background scanning (~30s loop)
5+
# background: true → continuous background scanning (~30s loop)
66
# mutateExistingOnPolicyUpdate: true → re-evaluates ALL matching resources on policy change
7-
# Both caused a 23-hour API server overload incident (2026-03-25).
7+
# synchronize: true → drift watchers create UpdateRequests on every controller status update
8+
# All three caused a 23-hour API server overload incident (2026-03-25).
89
set -euo pipefail
910

1011
ERRORS=0
@@ -27,6 +28,14 @@ for file in $(grep -rl 'kind: ClusterPolicy\|kind: Policy' infrastructure/contro
2728
echo ""
2829
ERRORS=$((ERRORS + 1))
2930
fi
31+
32+
if grep -q 'synchronize: true' "$file"; then
33+
echo "ERROR: ${file}"
34+
echo " synchronize: true creates drift watchers that generate UpdateRequests on every controller update."
35+
echo " Use: synchronize: false"
36+
echo ""
37+
ERRORS=$((ERRORS + 1))
38+
fi
3039
done
3140

3241
if [ "$ERRORS" -gt 0 ]; then
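
One caveat about the new check: `grep -q 'synchronize: true'` matches any line containing that text, including YAML comments, so a policy that merely mentions the old value in a comment would be flagged. A comment-aware variant might look like the sketch below. `has_enabled_setting` is a hypothetical helper name and is not part of this commit. It is still a text match, not a YAML parse.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption, not in the repository): succeed only if
# "<key>: true" appears as an actual YAML key, not inside a comment line.
has_enabled_setting() {
  local file="$1" key="$2"
  # Optional indentation, optional list dash, then "key: true" followed by
  # whitespace or end of line (so a trailing "# comment" is still allowed).
  grep -Eq "^[[:space:]]*(-[[:space:]]+)?${key}:[[:space:]]*true([[:space:]]|\$)" "$file"
}
```

Inside the validation loop this could replace the plain `grep -q` call, for example `if has_enabled_setting "$file" synchronize; then ... fi`, with the same error output as the existing checks.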
