docs: complete SLO guardrails documentation across all references (#307)

SebTardif · web-flow · commit 874283909c5a · 2026-06-07T06:36:38.000Z
* docs: add slo:<name> revert reason across all documentation

  Add slo:<name> as a valid revert reason to every location that enumerates revert reasons (metrics.md, troubleshooting.md, cli.md, api.md, first-30-days.md, canary-rollout.md, quickstart.md, index.md, safety.md). The SLO guardrails feature was fully implemented in code and tested (PR #306) but the documentation predated the feature and did not list the new reason type.

  Also adds an SLO guardrails subsection to docs/architecture/safety.md explaining the fail-open design, evaluation window behavior, and mitigation guidance.

  Minor: add Deployment metadata labels to the SLO guardrails E2E test for consistency with other E2E tests.

  Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

* docs: add SLO guardrails to SPEC.md auto-revert triggers and fix stale code comments

  - SPEC.md: add SLO guardrail breach as 5th auto-revert trigger in section 7.2, add sloGuardrails and safetyObservationPeriod to the updateStrategy spec example
  - internal/safety/monitor.go: update SafetyVerdict.Reason comment to include slo:<name>, update CheckPod docstring to list all 6 checks in actual execution order (including throttle and SLO guardrails)

  Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

* docs: complete safety check enumeration in remaining doc locations

  Update 4 locations that still listed incomplete safety check types:

  - resize-api.md Mermaid diagram label
  - SPEC.md directory tree comment
  - SPEC.md Phase 3 checklist
  - why-attune.md comparison table

  All now list all 5 check types: OOMKill, throttle, restart, NotReady, SLO guardrails.

  Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

* docs: add 4 missing updateStrategy fields to inheritable fields table

  Add safetyObservationPeriod, sloGuardrails, canary, and initialSizing to the Inheritable UpdateStrategy Fields table in configuration.md. These fields are part of the UpdateStrategy struct and were correctly listed in the namespace defaults summary row but missing from the dedicated table.

  Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>

---------

Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
diff --git a/docs/SPEC.md b/docs/SPEC.md
@@ -224,8 +224,15 @@ spec:
       observationPeriod: 30m  # monitor canary pods for this long (minimum: 1m)
     # Cooldown between resize cycles
     cooldown: 1h              # default: 1h, min: 1m
-    # Automatic revert on OOMKill or excessive CPU throttle
+    # Automatic revert on OOMKill, throttle, restarts, NotReady, or SLO breach
     autoRevert: true          # default: true
+    safetyObservationPeriod: 5m  # observe pod post-resize (default: 5m, min: 1m)
+    sloGuardrails:            # optional: application-level SLO checks post-resize
+      - name: p99-latency
+        query: "histogram_quantile(0.99, rate(http_duration_seconds_bucket{namespace=\"{{ .Namespace }}\"}[5m]))"
+        threshold: "0.5"
+        comparison: above     # revert if value > threshold
+        evaluationWindow: 5m  # wait before checking (default: 5m, min: 1m)
 
   # Priority/weight for conflict resolution
   # When multiple policies match a workload, highest weight wins
@@ -773,6 +780,7 @@ When `autoRevert: true` (default), the Safety Monitor watches resized pods for:
 2. **CPU Throttle**: CPU throttle ratio exceeds 50% (configurable) post-resize
 3. **Excessive Restarts**: Container restart count increases by 2+ post-resize
 4. **Pod Not Ready**: Pod becomes NotReady within observation period
+5. **SLO Guardrail Breach**: Application-level PromQL query breached its threshold after `evaluationWindow` elapsed (fails open on query errors)
 
 On trigger:
 1. Restore original resources via `/resize` subresource
@@ -1380,7 +1388,7 @@ attune/
 │   │   ├── engine.go            # Pod resize via /resize subresource
 │   │   └── engine_test.go
 │   ├── safety/
-│   │   ├── monitor.go           # OOMKill, throttle, restart, auto-revert
+│   │   ├── monitor.go           # OOMKill, throttle, restart, NotReady, SLO guardrails, auto-revert
 │   │   └── monitor_test.go
 │   ├── throttle/                # Shared throttle checker interface
 │   ├── transform/               # Informer cache transform functions
@@ -1467,7 +1475,7 @@ attune/
 
 ### Phase 3: Safety & Intelligence
 
-- [x] Safety monitor (OOMKill, throttle, restart detection)
+- [x] Safety monitor (OOMKill, throttle, restart, NotReady, SLO guardrails)
 - [x] Auto-revert mechanism
 - [x] Confidence-based recommendation widening
 - [x] Time-of-day-aware algorithm
diff --git a/docs/architecture/resize-api.md b/docs/architecture/resize-api.md
@@ -158,7 +158,7 @@ stateDiagram-v2
 
     state SafetyObservation {
         direction TB
-        [*] --> Monitoring: Watching for OOMKill, throttle, restarts
+        [*] --> Monitoring: Watching for OOMKill, throttle, restarts, NotReady, SLO breach
         Monitoring --> Safe: No violations after observation period
         Monitoring --> Unsafe: Violation detected
     }
diff --git a/docs/architecture/safety.md b/docs/architecture/safety.md
@@ -91,6 +91,25 @@ The pod's Ready condition is `False`, meaning readiness probes are failing.
 allocation changes. Some applications expose health endpoints that degrade
 under CPU throttling.
 
+### SLO guardrails
+
+After a resize, the safety monitor can evaluate application-level PromQL
+queries to detect degradation. Each guardrail specifies a query, a
+threshold, and a comparison direction (`above` or `below`).
+
+The check runs only after the guardrail's `evaluationWindow` (default: 5m,
+minimum: 1m) elapses post-resize. This delay gives the application time
+to stabilize before comparing against SLO thresholds.
+
+If a guardrail query breaches its threshold, the resize is reverted with
+reason `slo:<guardrail-name>`. The monitor **fails open**: if a query
+returns an error, NaN, or Inf, the guardrail is skipped with a log
+message rather than triggering a false revert.
+
+**Mitigation**: review the guardrail's PromQL query and threshold in
+`updateStrategy.sloGuardrails`. Adjust the threshold, widen the
+`evaluationWindow`, or remove the guardrail if the metric is unreliable.
+
 ## Observation period
 
 After a resize, the operator observes the pod for a configurable period
@@ -252,5 +271,5 @@ Before resizing, the controller checks for potential conflicts:
 When multiple consecutive resizes are reverted (visible in
 `.status.resizeHistory`), the policy's parameters likely need adjustment
 before further resizes should be attempted. Check the revert reasons
-(`oomkill`, `restart`, `notready`) and adjust overheads or cooldown
+(`oomkill`, `restart`, `notready`, `throttle`, `slo:<name>`) and adjust overheads or cooldown
 accordingly.
diff --git a/docs/getting-started/first-30-days.md b/docs/getting-started/first-30-days.md
@@ -130,7 +130,7 @@ kubectl attune history -n my-app
 
 The history command shows each resize with its result and reason. If a
 resize is reverted, the **REASON** column tells you why (oomkill, restart,
-notready, throttle).
+notready, throttle, or slo:&lt;name&gt; for SLO guardrail breaches).
 
 !!! warning
     If you see repeated reverts, increase the overhead or adjust
diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
@@ -177,7 +177,7 @@ kubectl get attunepolicy my-app
 ```
 
 The `RESIZED` column increments as pods are resized in place. If a safety
-violation occurs (OOMKill, excessive restarts, pod NotReady), the operator
+violation occurs (OOMKill, excessive restarts, pod NotReady, or SLO guardrail breach), the operator
 auto-reverts the affected pods.
 
 !!! tip
diff --git a/docs/guides/canary-rollout.md b/docs/guides/canary-rollout.md
@@ -45,7 +45,7 @@ spec:
    pods. Only running pods without an active resize or pending deletion qualify.
 3. **In-place resize**: the operator calls `UpdateResize` on each selected pod.
 4. **Observation**: during `observationPeriod`, the safety monitor checks for
-   OOMKill, restart spikes, and pod NotReady.
+   OOMKill, restart spikes, pod NotReady, CPU throttle, and SLO guardrail breaches.
 5. **Verdict**: if all canary pods remain healthy, the resize is considered
    successful. If any violation is detected, the operator auto-reverts.
 6. **Cooldown**: the operator waits for the `cooldown` duration before the
@@ -86,7 +86,7 @@ kubectl get attunepolicy my-app -o jsonpath='{.status.resizeHistory}' | jq '.[]
 ```
 
 !!! warning
-    If you see repeated reverts, review the `reason` field (oomkill, restart,
+    If you see repeated reverts, review the `reason` field (oomkill, restart, throttle, slo:&lt;name&gt;,
     notready) and consider increasing the overhead or adjusting bounds
     before retrying.
 
diff --git a/docs/guides/troubleshooting.md b/docs/guides/troubleshooting.md
@@ -400,6 +400,7 @@ Common causes:
 - **throttle**: CPU throttle ratio exceeded 50% post-resize. Increase `cpu.overhead`.
 - **restart**: the application crashes at the new resource level. Check application logs.
 - **notready**: readiness probe fails post-resize. Verify probe configuration.
+- **slo:&lt;name&gt;**: an SLO guardrail query breached its threshold after resize. Review the guardrail's PromQL query and threshold in `updateStrategy.sloGuardrails`.
 
 ### Revert failures
 
diff --git a/docs/index.md b/docs/index.md
@@ -50,7 +50,7 @@ the operator built to use it.
 
 - **In-place resize** via the Kubernetes 1.32+ `/resize` subresource
 - **Graduated rollout**: Observe, Recommend, OneShot, Canary, Auto
-- **Auto-revert** on OOMKill, CPU throttle, restart spikes, or pod NotReady
+- **Auto-revert** on OOMKill, CPU throttle, restart spikes, pod NotReady, or SLO guardrail breach
 - **HPA coexistence** without death spirals
 - **Confidence scaling** for sparse data
 - **Time-of-day awareness** for bursty workloads
diff --git a/docs/reference/api.md b/docs/reference/api.md
@@ -177,7 +177,7 @@ spec:
 | `resizeHistory[].to` | `string` | New value |
 | `resizeHistory[].method` | `string` | `InPlace` or `Eviction` |
 | `resizeHistory[].result` | `string` | `Success`, `Failed`, `Reverted`, or `Evicted` |
-| `resizeHistory[].reason` | `string` | Why a resize was reverted or failed (e.g. `oomkill`, `restart`, `notready`). Empty for successful resizes. |
+| `resizeHistory[].reason` | `string` | Why a resize was reverted or failed (e.g. `oomkill`, `restart`, `notready`, `slo:<name>`). Empty for successful resizes. |
 | `workloadErrors[].workload` | `string` | Workload name that encountered an error during reconciliation |
 | `workloadErrors[].error` | `string` | Human-readable error description |
 | `canary.phase` | `string` | `CanaryInProgress` or `FullRollout` |
@@ -229,7 +229,7 @@ View them with `kubectl describe attunepolicy <name>` or
 | `BudgetExhausted` | Warning | The per-reconcile resize budget was exhausted before all workloads could be resized |
 | `InfeasibleBlocked` | Warning | A resize was blocked because it would exceed node capacity |
 | `ResizeSkipped` | Warning | A resize was skipped (e.g. pod in bad state, rolling out) |
-| `Reverted` | Warning | A resize was reverted due to safety observation failure (OOMKill, CPU throttle, restarts) |
+| `Reverted` | Warning | A resize was reverted due to safety observation failure (OOMKill, CPU throttle, restarts, or SLO guardrail breach) |
 | `Evicted` | Warning | A pod was evicted as a fallback when in-place resize was not possible |
 | `StaleRecommendation` | Warning | Recommendations are stale (no fresh Prometheus data) |
 | `CooldownActive` | Normal | Resize deferred because the cooldown period has not elapsed |
diff --git a/docs/reference/cli.md b/docs/reference/cli.md
@@ -197,7 +197,7 @@ kubectl attune history -n production
 | TO | New resource value |
 | METHOD | `InPlace` or `Eviction` |
 | RESULT | `Success`, `Failed`, `Reverted`, or `Evicted` |
-| REASON | Why a resize was reverted or failed (`oomkill`, `restart`, `notready`, `throttle`, etc.). Shows `-` for successful resizes. |
+| REASON | Why a resize was reverted or failed (`oomkill`, `restart`, `notready`, `throttle`, `slo:<name>`). Shows `-` for successful resizes. |
 
 ### wizard
 
diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md
@@ -243,6 +243,10 @@ that do not set them explicitly. Policy-level values always take precedence.
 | `maxTotalMemoryIncrease` | quantity | (none) | Max aggregate memory increase per cycle |
 | `schedule` | object | (none) | Time windows, days of week, timezone |
 | `export` | object | (none) | Metrics export configuration |
+| `safetyObservationPeriod` | duration | `5m` | Post-resize observation window (min: 1m) |
+| `sloGuardrails` | list | `[]` | Application-level SLO PromQL checks after resize |
+| `canary` | object | (none) | Canary rollout configuration (percentage, observationPeriod) |
+| `initialSizing` | bool | `false` | Enable mutating webhook for pod creation |
 
 Example: set a cluster-wide maintenance window and budget cap via
 `AttuneDefaults`, then individual policies inherit them unless overridden:
diff --git a/docs/reference/metrics.md b/docs/reference/metrics.md
@@ -22,7 +22,7 @@ Total number of resize reverts triggered by the safety monitor.
 |-------|-------------|
 | `namespace` | Workload namespace |
 | `workload` | Workload name |
-| `reason` | `oomkill`, `throttle`, `restart`, `notready`, `re-fetch-failed`, or `annotation-persist-failed` |
+| `reason` | `oomkill`, `throttle`, `restart`, `notready`, `slo:<name>`, `re-fetch-failed`, or `annotation-persist-failed` |
 
 ### attune_revert_failures_total
 
diff --git a/docs/why-attune.md b/docs/why-attune.md
@@ -318,7 +318,7 @@ across the capabilities that matter most.
 | **Primary function** | Recommend + apply | VPA dashboard | CLI recommender | VPA applier | Usage-based controller | **Recommend + in-place apply** |
 | **Resize method** | Evict/recreate, InPlaceOrRecreate (1.33+) | No resize | No resize | Cron-based rollout | Rolling restart | **In-place only** |
 | **HPA compatible** | No (conflicts on CPU metric) | N/A | N/A | N/A | No | **Yes** |
-| **Safety system** | Minimal (PDB only) | N/A | N/A | min-diff thresholds | None | **Multi-layer (OOMKill, throttle, revert)** |
+| **Safety system** | Minimal (PDB only) | N/A | N/A | min-diff thresholds | None | **Multi-layer (OOMKill, throttle, restart, NotReady, SLO guardrails)** |
 | **Time-of-day aware** | No (24h half-life histogram) | No | No | No | No | **Yes (hourly profiles)** |
 | **Graduated rollout** | No (all-or-nothing) | N/A | N/A | No | No | **5 modes (Observe to Auto)** |
 | **Per-resource config** | containerPolicies[] | N/A | CLI flags | Annotations per resource | N/A | **Typed CRD (cpu/memory sections)** |
diff --git a/internal/safety/monitor.go b/internal/safety/monitor.go
@@ -85,7 +85,7 @@ type ResizeRecord struct {
 // SafetyVerdict is the result of checking a resized pod for problems.
 type SafetyVerdict struct {
 	Safe    bool
-	Reason  string // "oomkill", "throttle", "restart", "notready", ""
+	Reason  string // "oomkill", "throttle", "restart", "notready", "slo:<name>", ""
 	Message string
 	// ThrottleDeferred is true when the throttle check was skipped because the
 	// resize happened less than 5 minutes ago (the Prometheus rate window still
@@ -181,7 +181,9 @@ func CheckCriticalStatuses(pod *corev1.Pod, record ResizeRecord) *SafetyVerdict
 //  1. Pod existence (deleted pods are considered safe).
 //  2. OOMKill events that occurred after the resize.
 //  3. Restart count increases of 2 or more since the resize.
-//  4. Pod Ready condition.
+//  4. CPU throttle ratio (after a 5m grace period).
+//  5. SLO guardrail queries (if configured, after evaluationWindow).
+//  6. Pod Ready condition.
 func (m *Monitor) CheckPod(ctx context.Context, record ResizeRecord, now time.Time) (SafetyVerdict, error) {
 	pod, err := m.client.CoreV1().Pods(record.Namespace).Get(ctx, record.PodName, metav1.GetOptions{})
 	if err != nil {
diff --git a/test/e2e/slo-guardrails/chainsaw-test.yaml b/test/e2e/slo-guardrails/chainsaw-test.yaml
@@ -24,6 +24,8 @@ spec:
               metadata:
                 name: slo-app
                 namespace: e2e-slo-guardrails
+                labels:
+                  app: slo-app
               spec:
                 replicas: 1
                 selector:
