Commit 8742839

docs: complete SLO guardrails documentation across all references (#307)
* docs: add slo:<name> revert reason across all documentation

  Add slo:<name> as a valid revert reason to every location that enumerates revert reasons (metrics.md, troubleshooting.md, cli.md, api.md, first-30-days.md, canary-rollout.md, quickstart.md, index.md, safety.md). The SLO guardrails feature was fully implemented in code and tested (PR #306) but the documentation predated the feature and did not list the new reason type.

  Also adds an SLO guardrails subsection to docs/architecture/safety.md explaining the fail-open design, evaluation window behavior, and mitigation guidance.

  Minor: add Deployment metadata labels to the SLO guardrails E2E test for consistency with other E2E tests.

  Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

* docs: add SLO guardrails to SPEC.md auto-revert triggers and fix stale code comments

  - SPEC.md: add SLO guardrail breach as 5th auto-revert trigger in section 7.2, add sloGuardrails and safetyObservationPeriod to the updateStrategy spec example
  - internal/safety/monitor.go: update SafetyVerdict.Reason comment to include slo:<name>, update CheckPod docstring to list all 6 checks in actual execution order (including throttle and SLO guardrails)

  Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

* docs: complete safety check enumeration in remaining doc locations

  Update 4 locations that still listed incomplete safety check types:

  - resize-api.md Mermaid diagram label
  - SPEC.md directory tree comment
  - SPEC.md Phase 3 checklist
  - why-attune.md comparison table

  All now list all 5 check types: OOMKill, throttle, restart, NotReady, SLO guardrails.

  Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

* docs: add 4 missing updateStrategy fields to inheritable fields table

  Add safetyObservationPeriod, sloGuardrails, canary, and initialSizing to the Inheritable UpdateStrategy Fields table in configuration.md. These fields are part of the UpdateStrategy struct and were correctly listed in the namespace defaults summary row but missing from the dedicated table.

  Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

---------

Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
1 parent 0deb244 commit 8742839

15 files changed

Lines changed: 53 additions & 17 deletions

docs/SPEC.md

Lines changed: 11 additions & 3 deletions
@@ -224,8 +224,15 @@ spec:
 observationPeriod: 30m # monitor canary pods for this long (minimum: 1m)
 # Cooldown between resize cycles
 cooldown: 1h # default: 1h, min: 1m
-# Automatic revert on OOMKill or excessive CPU throttle
+# Automatic revert on OOMKill, throttle, restarts, NotReady, or SLO breach
 autoRevert: true # default: true
+safetyObservationPeriod: 5m # observe pod post-resize (default: 5m, min: 1m)
+sloGuardrails: # optional: application-level SLO checks post-resize
+  - name: p99-latency
+    query: "histogram_quantile(0.99, rate(http_duration_seconds_bucket{namespace=\"{{ .Namespace }}\"}[5m]))"
+    threshold: "0.5"
+    comparison: above # revert if value > threshold
+    evaluationWindow: 5m # wait before checking (default: 5m, min: 1m)

 # Priority/weight for conflict resolution
 # When multiple policies match a workload, highest weight wins
@@ -773,6 +780,7 @@ When `autoRevert: true` (default), the Safety Monitor watches resized pods for:
 2. **CPU Throttle**: CPU throttle ratio exceeds 50% (configurable) post-resize
 3. **Excessive Restarts**: Container restart count increases by 2+ post-resize
 4. **Pod Not Ready**: Pod becomes NotReady within observation period
+5. **SLO Guardrail Breach**: Application-level PromQL query breached its threshold after `evaluationWindow` elapsed (fails open on query errors)

 On trigger:
 1. Restore original resources via `/resize` subresource
@@ -1380,7 +1388,7 @@ attune/
 │ │ ├── engine.go # Pod resize via /resize subresource
 │ │ └── engine_test.go
 │ ├── safety/
-│ │ ├── monitor.go # OOMKill, throttle, restart, auto-revert
+│ │ ├── monitor.go # OOMKill, throttle, restart, NotReady, SLO guardrails, auto-revert
 │ │ └── monitor_test.go
 │ ├── throttle/ # Shared throttle checker interface
 │ ├── transform/ # Informer cache transform functions
@@ -1467,7 +1475,7 @@ attune/

 ### Phase 3: Safety & Intelligence

-- [x] Safety monitor (OOMKill, throttle, restart detection)
+- [x] Safety monitor (OOMKill, throttle, restart, NotReady, SLO guardrails)
 - [x] Auto-revert mechanism
 - [x] Confidence-based recommendation widening
 - [x] Time-of-day-aware algorithm
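
The `sloGuardrails` example added to SPEC.md embeds a `{{ .Namespace }}` placeholder in its PromQL query. The placeholder resembles Go `text/template` syntax, but the diff does not show how the operator expands it. The following is a minimal sketch that assumes `text/template` is used; `queryData` and `renderQuery` are hypothetical names chosen for illustration, not identifiers from the Attune codebase.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// queryData is a hypothetical context type. Only the Namespace field is
// suggested by the `{{ .Namespace }}` placeholder in the spec example.
type queryData struct {
	Namespace string
}

// renderQuery expands template placeholders in a guardrail query.
// It returns an error if the template is malformed or references a
// field that queryData does not define.
func renderQuery(raw string, data queryData) (string, error) {
	tmpl, err := template.New("guardrail").Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse guardrail query: %w", err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("render guardrail query: %w", err)
	}
	return buf.String(), nil
}

func main() {
	// Query copied from the SPEC.md example, with YAML escaping removed.
	raw := `histogram_quantile(0.99, rate(http_duration_seconds_bucket{namespace="{{ .Namespace }}"}[5m]))`
	query, err := renderQuery(raw, queryData{Namespace: "my-app"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(query)
}
```

Under this assumption, a misspelled placeholder surfaces as a render error rather than as a silently empty label matcher, which would make a guardrail query match nothing.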

docs/architecture/resize-api.md

Lines changed: 1 addition & 1 deletion
@@ -158,7 +158,7 @@ stateDiagram-v2

 state SafetyObservation {
 direction TB
-[*] --> Monitoring: Watching for OOMKill, throttle, restarts
+[*] --> Monitoring: Watching for OOMKill, throttle, restarts, NotReady, SLO breach
 Monitoring --> Safe: No violations after observation period
 Monitoring --> Unsafe: Violation detected
 }

docs/architecture/safety.md

Lines changed: 20 additions & 1 deletion
@@ -91,6 +91,25 @@ The pod's Ready condition is `False`, meaning readiness probes are failing.
 allocation changes. Some applications expose health endpoints that degrade
 under CPU throttling.

+### SLO guardrails
+
+After a resize, the safety monitor can evaluate application-level PromQL
+queries to detect degradation. Each guardrail specifies a query, a
+threshold, and a comparison direction (`above` or `below`).
+
+The check runs only after the guardrail's `evaluationWindow` (default: 5m,
+minimum: 1m) elapses post-resize. This delay gives the application time
+to stabilize before comparing against SLO thresholds.
+
+If a guardrail query breaches its threshold, the resize is reverted with
+reason `slo:<guardrail-name>`. The monitor **fails open**: if a query
+returns an error, NaN, or Inf, the guardrail is skipped with a log
+message rather than triggering a false revert.
+
+**Mitigation**: review the guardrail's PromQL query and threshold in
+`updateStrategy.sloGuardrails`. Adjust the threshold, widen the
+`evaluationWindow`, or remove the guardrail if the metric is unreliable.
+
 ## Observation period

 After a resize, the operator observes the pod for a configurable period
@@ -252,5 +271,5 @@ Before resizing, the controller checks for potential conflicts:
 When multiple consecutive resizes are reverted (visible in
 `.status.resizeHistory`), the policy's parameters likely need adjustment
 before further resizes should be attempted. Check the revert reasons
-(`oomkill`, `restart`, `notready`) and adjust overheads or cooldown
+(`oomkill`, `restart`, `notready`, `throttle`, `slo:<name>`) and adjust overheads or cooldown
 accordingly.

docs/getting-started/first-30-days.md

Lines changed: 1 addition & 1 deletion
@@ -130,7 +130,7 @@ kubectl attune history -n my-app

 The history command shows each resize with its result and reason. If a
 resize is reverted, the **REASON** column tells you why (oomkill, restart,
-notready, throttle).
+notready, throttle, or slo:&lt;name&gt; for SLO guardrail breaches).

 !!! warning
 If you see repeated reverts, increase the overhead or adjust

docs/getting-started/quickstart.md

Lines changed: 1 addition & 1 deletion
@@ -177,7 +177,7 @@ kubectl get attunepolicy my-app
 ```

 The `RESIZED` column increments as pods are resized in place. If a safety
-violation occurs (OOMKill, excessive restarts, pod NotReady), the operator
+violation occurs (OOMKill, excessive restarts, pod NotReady, or SLO guardrail breach), the operator
 auto-reverts the affected pods.

 !!! tip

docs/guides/canary-rollout.md

Lines changed: 2 additions & 2 deletions
@@ -45,7 +45,7 @@ spec:
 pods. Only running pods without an active resize or pending deletion qualify.
 3. **In-place resize**: the operator calls `UpdateResize` on each selected pod.
 4. **Observation**: during `observationPeriod`, the safety monitor checks for
-OOMKill, restart spikes, and pod NotReady.
+OOMKill, restart spikes, pod NotReady, CPU throttle, and SLO guardrail breaches.
 5. **Verdict**: if all canary pods remain healthy, the resize is considered
 successful. If any violation is detected, the operator auto-reverts.
 6. **Cooldown**: the operator waits for the `cooldown` duration before the
@@ -86,7 +86,7 @@ kubectl get attunepolicy my-app -o jsonpath='{.status.resizeHistory}' | jq '.[]
 ```

 !!! warning
-If you see repeated reverts, review the `reason` field (oomkill, restart,
+If you see repeated reverts, review the `reason` field (oomkill, restart, throttle, slo:&lt;name&gt;,
 notready) and consider increasing the overhead or adjusting bounds
 before retrying.


docs/guides/troubleshooting.md

Lines changed: 1 addition & 0 deletions
@@ -400,6 +400,7 @@ Common causes:
 - **throttle**: CPU throttle ratio exceeded 50% post-resize. Increase `cpu.overhead`.
 - **restart**: the application crashes at the new resource level. Check application logs.
 - **notready**: readiness probe fails post-resize. Verify probe configuration.
+- **slo:&lt;name&gt;**: an SLO guardrail query breached its threshold after resize. Review the guardrail's PromQL query and threshold in `updateStrategy.sloGuardrails`.

 ### Revert failures

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ the operator built to use it.

 - **In-place resize** via the Kubernetes 1.32+ `/resize` subresource
 - **Graduated rollout**: Observe, Recommend, OneShot, Canary, Auto
-- **Auto-revert** on OOMKill, CPU throttle, restart spikes, or pod NotReady
+- **Auto-revert** on OOMKill, CPU throttle, restart spikes, pod NotReady, or SLO guardrail breach
 - **HPA coexistence** without death spirals
 - **Confidence scaling** for sparse data
 - **Time-of-day awareness** for bursty workloads

docs/reference/api.md

Lines changed: 2 additions & 2 deletions
@@ -177,7 +177,7 @@ spec:
 | `resizeHistory[].to` | `string` | New value |
 | `resizeHistory[].method` | `string` | `InPlace` or `Eviction` |
 | `resizeHistory[].result` | `string` | `Success`, `Failed`, `Reverted`, or `Evicted` |
-| `resizeHistory[].reason` | `string` | Why a resize was reverted or failed (e.g. `oomkill`, `restart`, `notready`). Empty for successful resizes. |
+| `resizeHistory[].reason` | `string` | Why a resize was reverted or failed (e.g. `oomkill`, `restart`, `notready`, `slo:<name>`). Empty for successful resizes. |
 | `workloadErrors[].workload` | `string` | Workload name that encountered an error during reconciliation |
 | `workloadErrors[].error` | `string` | Human-readable error description |
 | `canary.phase` | `string` | `CanaryInProgress` or `FullRollout` |
@@ -229,7 +229,7 @@ View them with `kubectl describe attunepolicy <name>` or
 | `BudgetExhausted` | Warning | The per-reconcile resize budget was exhausted before all workloads could be resized |
 | `InfeasibleBlocked` | Warning | A resize was blocked because it would exceed node capacity |
 | `ResizeSkipped` | Warning | A resize was skipped (e.g. pod in bad state, rolling out) |
-| `Reverted` | Warning | A resize was reverted due to safety observation failure (OOMKill, CPU throttle, restarts) |
+| `Reverted` | Warning | A resize was reverted due to safety observation failure (OOMKill, CPU throttle, restarts, or SLO guardrail breach) |
 | `Evicted` | Warning | A pod was evicted as a fallback when in-place resize was not possible |
 | `StaleRecommendation` | Warning | Recommendations are stale (no fresh Prometheus data) |
 | `CooldownActive` | Normal | Resize deferred because the cooldown period has not elapsed |
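
Because `resizeHistory[].reason` now carries a structured value (`slo:<name>`) alongside the fixed strings, tooling that reads the field may want to separate the guardrail name from the prefix. The sketch below shows one way to do that; `guardrailName` is a hypothetical helper for consumers of the status field, not part of the Attune API.

```go
package main

import (
	"fmt"
	"strings"
)

// sloPrefix is the documented prefix for SLO guardrail revert reasons.
const sloPrefix = "slo:"

// guardrailName reports the guardrail name when reason has the form
// "slo:<name>". It returns false for the fixed reasons (oomkill, restart,
// notready, throttle), for empty reasons, and for a bare "slo:" prefix.
func guardrailName(reason string) (string, bool) {
	if !strings.HasPrefix(reason, sloPrefix) {
		return "", false
	}
	name := strings.TrimPrefix(reason, sloPrefix)
	return name, name != ""
}

func main() {
	for _, reason := range []string{"oomkill", "throttle", "slo:p99-latency", ""} {
		if name, ok := guardrailName(reason); ok {
			fmt.Printf("%q: SLO guardrail %q breached\n", reason, name)
		} else {
			fmt.Printf("%q: not an SLO guardrail revert\n", reason)
		}
	}
}
```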

docs/reference/cli.md

Lines changed: 1 addition & 1 deletion
@@ -197,7 +197,7 @@ kubectl attune history -n production
 | TO | New resource value |
 | METHOD | `InPlace` or `Eviction` |
 | RESULT | `Success`, `Failed`, `Reverted`, or `Evicted` |
-| REASON | Why a resize was reverted or failed (`oomkill`, `restart`, `notready`, `throttle`, etc.). Shows `-` for successful resizes. |
+| REASON | Why a resize was reverted or failed (`oomkill`, `restart`, `notready`, `throttle`, `slo:<name>`). Shows `-` for successful resizes. |

 ### wizard
