Skip to content

Commit e12c537

Browse files
authored
fix: revert failure metric, NaN/Inf guard, and validator mutation (#277)
* fix: revert failure metric, NaN/Inf guard, and validator mutation - Add attune_revert_failures_total metric so failed revert attempts are visible to monitoring and alerting. Previously, a failed revert (the operator's worst failure mode) was invisible to Prometheus. - Filter NaN/Inf samples in QueryRangeGrouped to prevent non-finite values from Prometheus flowing into the recommendation engine. - Stop validator from mutating the input AttunePolicy when UpdateStrategy is nil; use a local variable instead. - Remove redundant header clone in headerTransport.RoundTrip (req.Clone already deep-copies headers). - Add FROM/TO resource values to the revert log for easier debugging. - Add Grafana dashboard panel, PrometheusRule alert, docs, and troubleshooting section for the new metric. Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca> * fix: add revert failures panel to source Grafana dashboard Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca> --------- Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
1 parent 5482c4d commit e12c537

15 files changed

Lines changed: 236 additions & 23 deletions

File tree

charts/attune/README.md

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -48,10 +48,10 @@ helm install attune oci://ghcr.io/attune-io/charts/attune \
4848
| logging.format | string | `"json"` | Log format (json, text) |
4949
| logging.level | string | `"info"` | Log level (debug, info, warn, error) |
5050
| maxConcurrentReconciles | string | `""` | Maximum number of AttunePolicy reconciles running in parallel. Increase for large clusters with many policies (e.g. 4 for 200+ policies). |
51-
| metrics | object | `{"enabled":true,"port":8080,"prometheusRule":{"additionalLabels":{},"enabled":false,"rules":{"budgetExhausted":{"enabled":true,"for":"30m","severity":"warning"},"dataQuality":{"enabled":true,"for":"30m","severity":"warning"},"degraded":{"enabled":true,"for":"5m","severity":"critical"},"highRevertRate":{"enabled":true,"for":"15m","severity":"critical","threshold":"0.5"},"prometheusUnreachable":{"enabled":true,"for":"10m","severity":"warning"},"reconcileErrors":{"enabled":true,"for":"10m","severity":"warning","threshold":"0"},"reconcileStale":{"enabled":true,"for":"5m","severity":"warning","staleDuration":"30m"},"requestsClamped":{"enabled":true,"for":"1h","severity":"info"}}},"serviceMonitor":{"additionalLabels":{},"enabled":false,"interval":"30s"}}` | Metrics endpoint |
51+
| metrics | object | `{"enabled":true,"port":8080,"prometheusRule":{"additionalLabels":{},"enabled":false,"rules":{"budgetExhausted":{"enabled":true,"for":"30m","severity":"warning"},"dataQuality":{"enabled":true,"for":"30m","severity":"warning"},"degraded":{"enabled":true,"for":"5m","severity":"critical"},"highRevertRate":{"enabled":true,"for":"15m","severity":"critical","threshold":"0.5"},"prometheusUnreachable":{"enabled":true,"for":"10m","severity":"warning"},"reconcileErrors":{"enabled":true,"for":"10m","severity":"warning","threshold":"0"},"reconcileStale":{"enabled":true,"for":"5m","severity":"warning","staleDuration":"30m"},"requestsClamped":{"enabled":true,"for":"1h","severity":"info"},"revertFailures":{"enabled":true,"for":"5m","severity":"critical"}}},"serviceMonitor":{"additionalLabels":{},"enabled":false,"interval":"30s"}}` | Metrics endpoint |
5252
| metrics.prometheusRule.additionalLabels | object | `{}` | Additional labels for the PrometheusRule |
5353
| metrics.prometheusRule.enabled | bool | `false` | Create a PrometheusRule for out-of-the-box alerting. Requires the Prometheus Operator CRDs (monitoring.coreos.com/v1). |
54-
| metrics.prometheusRule.rules | object | `{"budgetExhausted":{"enabled":true,"for":"30m","severity":"warning"},"dataQuality":{"enabled":true,"for":"30m","severity":"warning"},"degraded":{"enabled":true,"for":"5m","severity":"critical"},"highRevertRate":{"enabled":true,"for":"15m","severity":"critical","threshold":"0.5"},"prometheusUnreachable":{"enabled":true,"for":"10m","severity":"warning"},"reconcileErrors":{"enabled":true,"for":"10m","severity":"warning","threshold":"0"},"reconcileStale":{"enabled":true,"for":"5m","severity":"warning","staleDuration":"30m"},"requestsClamped":{"enabled":true,"for":"1h","severity":"info"}}` | Override default alert rules. Each key matches a rule name; set enabled: false to disable individual rules. |
54+
| metrics.prometheusRule.rules | object | `{"budgetExhausted":{"enabled":true,"for":"30m","severity":"warning"},"dataQuality":{"enabled":true,"for":"30m","severity":"warning"},"degraded":{"enabled":true,"for":"5m","severity":"critical"},"highRevertRate":{"enabled":true,"for":"15m","severity":"critical","threshold":"0.5"},"prometheusUnreachable":{"enabled":true,"for":"10m","severity":"warning"},"reconcileErrors":{"enabled":true,"for":"10m","severity":"warning","threshold":"0"},"reconcileStale":{"enabled":true,"for":"5m","severity":"warning","staleDuration":"30m"},"requestsClamped":{"enabled":true,"for":"1h","severity":"info"},"revertFailures":{"enabled":true,"for":"5m","severity":"critical"}}` | Override default alert rules. Each key matches a rule name; set enabled: false to disable individual rules. |
5555
| metrics.prometheusRule.rules.budgetExhausted.for | string | `"30m"` | How long the condition must persist before firing |
5656
| metrics.prometheusRule.rules.dataQuality.for | string | `"30m"` | How long the condition must persist before firing |
5757
| metrics.prometheusRule.rules.highRevertRate.for | string | `"15m"` | How long the condition must persist before firing |
@@ -60,6 +60,7 @@ helm install attune oci://ghcr.io/attune-io/charts/attune \
6060
| metrics.prometheusRule.rules.reconcileErrors.threshold | string | `"0"` | Error rate threshold (per second, averaged over 5m) |
6161
| metrics.prometheusRule.rules.reconcileStale.staleDuration | string | `"30m"` | Fire when no reconcile completes within this duration |
6262
| metrics.prometheusRule.rules.requestsClamped.for | string | `"1h"` | How long the condition must persist before firing |
63+
| metrics.prometheusRule.rules.revertFailures.for | string | `"5m"` | How long the condition must persist before firing |
6364
| metrics.serviceMonitor.additionalLabels | object | `{}` | Additional labels for the ServiceMonitor |
6465
| metrics.serviceMonitor.enabled | bool | `false` | Create a ServiceMonitor for Prometheus Operator |
6566
| metrics.serviceMonitor.interval | string | `"30s"` | Scrape interval |

charts/attune/files/grafana-dashboard.json

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -300,6 +300,40 @@
300300
}
301301
]
302302
},
303+
{
304+
"fieldConfig": {
305+
"defaults": {
306+
"custom": {
307+
"drawStyle": "line",
308+
"fillOpacity": 10
309+
},
310+
"color": {
311+
"mode": "fixed",
312+
"fixedColor": "red"
313+
}
314+
}
315+
},
316+
"gridPos": {
317+
"h": 8,
318+
"w": 8,
319+
"x": 16,
320+
"y": 3
321+
},
322+
"id": 32,
323+
"options": {
324+
"tooltip": {
325+
"mode": "multi"
326+
}
327+
},
328+
"title": "Revert Failures",
329+
"type": "timeseries",
330+
"targets": [
331+
{
332+
"expr": "sum by (namespace, workload) (rate(attune_revert_failures_total[$__rate_interval]))",
333+
"legendFormat": "{{ namespace }}/{{ workload }}"
334+
}
335+
]
336+
},
303337
{
304338
"fieldConfig": {
305339
"defaults": {

charts/attune/templates/prometheusrule.yaml

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -126,4 +126,18 @@ spec:
126126
clamping resource requests to container limits for {{ .Values.metrics.prometheusRule.rules.requestsClamped.for }}.
127127
Consider increasing limits or switching to controlledValues: RequestsAndLimits.
128128
{{- end }}
129+
{{- if .Values.metrics.prometheusRule.rules.revertFailures.enabled }}
130+
- alert: AttuneRevertFailures
131+
expr: sum by (namespace, workload) (rate(attune_revert_failures_total[5m])) > 0
132+
for: {{ .Values.metrics.prometheusRule.rules.revertFailures.for }}
133+
labels:
134+
severity: {{ .Values.metrics.prometheusRule.rules.revertFailures.severity }}
135+
annotations:
136+
summary: attune resize revert is failing
137+
description: >-
138+
Workload {{ "{{ $labels.namespace }}" }}/{{ "{{ $labels.workload }}" }}
139+
had a resize revert failure. The pod may be running with
140+
problematic resources. Check operator logs for the revert error
141+
and verify RBAC for the pods/resize subresource.
142+
{{- end }}
129143
{{- end }}

charts/attune/tests/prometheusrule_test.yaml

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ tests:
2020
path: spec.groups[0].name
2121
value: attune
2222

23-
- it: should include all eight default alerts
23+
- it: should include all nine default alerts
2424
set:
2525
metrics:
2626
enabled: true
@@ -29,7 +29,7 @@ tests:
2929
asserts:
3030
- lengthEqual:
3131
path: spec.groups[0].rules
32-
count: 8
32+
count: 9
3333

3434
- it: should disable individual rules
3535
set:
@@ -45,7 +45,7 @@ tests:
4545
asserts:
4646
- lengthEqual:
4747
path: spec.groups[0].rules
48-
count: 6
48+
count: 7
4949

5050
- it: should apply additional labels
5151
set:

charts/attune/values.schema.json

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -253,6 +253,15 @@
253253
"for": { "type": "string", "default": "1h" },
254254
"severity": { "type": "string", "default": "info" }
255255
}
256+
},
257+
"revertFailures": {
258+
"type": "object",
259+
"additionalProperties": false,
260+
"properties": {
261+
"enabled": { "type": "boolean", "default": true },
262+
"for": { "type": "string", "default": "5m" },
263+
"severity": { "type": "string", "default": "critical" }
264+
}
256265
}
257266
}
258267
}

charts/attune/values.yaml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -136,6 +136,11 @@ metrics:
136136
# -- How long the condition must persist before firing
137137
for: 1h
138138
severity: info
139+
revertFailures:
140+
enabled: true
141+
# -- How long the condition must persist before firing
142+
for: 5m
143+
severity: critical
139144

140145
grafanaDashboard:
141146
# -- Create a ConfigMap with the Grafana dashboard (auto-discovered by Grafana sidecar)

deploy/grafana/dashboard.json

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -338,6 +338,40 @@
338338
}
339339
]
340340
},
341+
{
342+
"fieldConfig": {
343+
"defaults": {
344+
"custom": {
345+
"drawStyle": "line",
346+
"fillOpacity": 10
347+
},
348+
"color": {
349+
"mode": "fixed",
350+
"fixedColor": "red"
351+
}
352+
}
353+
},
354+
"gridPos": {
355+
"h": 8,
356+
"w": 8,
357+
"x": 16,
358+
"y": 3
359+
},
360+
"id": 32,
361+
"options": {
362+
"tooltip": {
363+
"mode": "multi"
364+
}
365+
},
366+
"title": "Revert Failures",
367+
"type": "timeseries",
368+
"targets": [
369+
{
370+
"expr": "sum by (namespace, workload) (rate(attune_revert_failures_total[$__rate_interval]))",
371+
"legendFormat": "{{ namespace }}/{{ workload }}"
372+
}
373+
]
374+
},
341375
{
342376
"datasource": {
343377
"type": "prometheus",

docs/guides/troubleshooting.md

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -401,6 +401,34 @@ Common causes:
401401
- **restart**: the application crashes at the new resource level. Check application logs.
402402
- **notready**: readiness probe fails post-resize. Verify probe configuration.
403403

404+
### Revert failures
405+
406+
**Symptom**: Entries in `.status.resizeHistory` show `result: Failed`, or
407+
`attune_revert_failures_total` is incrementing.
408+
409+
**Cause**: The operator detected a safety issue (OOMKill, throttle, etc.)
410+
and tried to revert the pod to its original resources, but the `/resize`
411+
subresource call failed. The pod remains at the post-resize resource level.
412+
413+
**Fix**: Check operator logs for the revert error:
414+
415+
```bash
416+
kubectl logs -l app.kubernetes.io/name=attune --tail=100 | grep "Failed to revert"
417+
```
418+
419+
Common causes:
420+
421+
- **Conflict**: another controller (HPA, VPA) is modifying the same pod.
422+
Use `attune_revert_failures_total` to track frequency.
423+
- **Pod evicted**: the pod was evicted between the safety check and revert.
424+
- **RBAC**: the operator ServiceAccount lacks `update` on the `pods/resize`
425+
subresource.
426+
427+
```promql
428+
# Alert when reverts are failing
429+
sum by (namespace, workload) (rate(attune_revert_failures_total[5m])) > 0
430+
```
431+
404432
### Resizes not happening during expected window
405433

406434
**Symptom**: Operator logs "Outside resize window, skipping resize" even

docs/reference/metrics.md

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,24 @@ Total number of resize reverts triggered by the safety monitor.
2424
| `workload` | Workload name |
2525
| `reason` | `oomkill`, `throttle`, `restart`, `notready`, `re-fetch-failed`, or `annotation-persist-failed` |
2626

27+
### attune_revert_failures_total
28+
29+
Total number of failed resize revert attempts. A non-zero value means the
30+
operator tried to restore a pod's original resources but the `/resize`
31+
subresource call failed, leaving the pod running with post-resize resources
32+
that may be causing issues.
33+
34+
| Label | Description |
35+
|-------|-------------|
36+
| `namespace` | Workload namespace |
37+
| `workload` | Workload name |
38+
| `reason` | Same reason labels as `attune_reverts_total` |
39+
40+
```promql
41+
# Alert when reverts are failing
42+
sum by (namespace, workload) (rate(attune_revert_failures_total[5m])) > 0
43+
```
44+
2745
### attune_prometheus_query_errors_total
2846

2947
Total number of failed Prometheus queries.

internal/controller/resize.go

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -550,6 +550,7 @@ func (r *AttunePolicyReconciler) resizeContainer(
550550
revertFailed := false
551551
if revertErr := monitor.RevertPod(ctx, revertRecord); revertErr != nil {
552552
logger.Error(revertErr, "Failed to revert pod after "+reason, "pod", pod.Name)
553+
operatormetrics.RevertFailuresTotal.WithLabelValues(pod.Namespace, workloadName, reason).Inc()
553554
revertFailed = true
554555
}
555556
if !revertFailed {

0 commit comments

Comments
 (0)