|
| 1 | +# ADR-044: Breakfix MTBI Metric |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +NVSentinel measures how long it takes to repair a faulty node (MTTR / |
| 6 | +remediation duration) and counts how many times each node is quarantined |
| 7 | +(`fault_quarantine_nodes_quarantined_total{node}`), but it has no visibility into |
| 8 | +the *time between* failures — how long a node typically runs healthy before |
| 9 | +breaking again. Operators cannot today answer: "How long do nodes typically run |
| 10 | +between failures?", "What is the distribution (median / p90) of healthy uptime |
| 11 | +between incidents?", or "Is the fleet's time-between-failures trending down?" |
| 12 | + |
| 13 | +### Definition |
| 14 | + |
| 15 | +MTBI is defined as below: |
| 16 | + |
| 17 | +``` |
| 18 | +MTBI = generatedTimestamp(incident N+1) − generatedTimestamp(incident N) |
| 19 | +``` |
| 20 | + |
| 21 | +## Decision |
| 22 | + |
| 23 | +Emit a per-incident interval histogram `fault_quarantine_node_mtbi_seconds` |
| 24 | +from fault-quarantine, computed at quarantine time. The previous |
| 25 | +incident's `generatedTimestamp` is carried forward via an **annotation-piggyback**: |
| 26 | +a dedicated node annotation written in the annotation patch fault-quarantine |
| 27 | +already applies at the `Quarantined` transition. |
| 28 | + |
| 29 | +## Design |
| 30 | + |
| 31 | +To expose the interval distribution, emit one histogram observation per |
| 32 | +incident. The interval is computed at the moment a node is quarantined, using a |
| 33 | +dedicated node annotation as the carry-forward for the previous incident's |
| 34 | +`generatedTimestamp`. |
| 35 | + |
| 36 | +### Mechanism (annotation-piggyback at quarantine) |
| 37 | + |
| 38 | +- **Annotation key:** `nvsentinel.nvidia.com/last-incident-timestamp`, declared as |
| 39 | + a `const` in `fault-quarantine/pkg/common` alongside the existing quarantine |
| 40 | + annotation keys. |
| 41 | +- **Value:** the incident's `generatedTimestamp` serialized as **RFC3339Nano in |
| 42 | + UTC** |
| 43 | +- **Written:** only at the `Quarantined` transition, folded into the annotation |
| 44 | + patch fault-quarantine already applies (the `annotationsMap` passed to |
| 45 | + `QuarantineNodeAndSetAnnotations`). No extra API call. |
| 46 | +- **Read:** at the start of each new quarantine, from the node object already |
| 47 | + fetched during reconcile. This requires adding the new key to the allowlist in |
| 48 | + `getNodeQuarantineAnnotations`, otherwise the value is |
| 49 | + filtered out before the quarantine path ever sees it. |
| 50 | +- **Lifecycle:** excluded from every annotation-removal list, so it persists across |
| 51 | + cycles and is overwritten on the next quarantine. This relies on the k8s client |
| 52 | + patch helpers (`UnQuarantineNodeAndRemoveAnnotations`, |
| 53 | + `HandleManualUncordonCleanup`, `HandleManualUntaintCleanup`) removing only the |
| 54 | + explicitly-listed keys rather than wholesale-stripping nvsentinel annotations. |
| 55 | + |
| 56 | +```mermaid |
| 57 | +flowchart TD |
| 58 | + A["Incident N+1: node enters Quarantined<br/>genTs(N+1) = current generatedTimestamp"] --> B["Read node annotation last-incident-timestamp"] |
| 59 | + B --> C{Annotation present and parseable?} |
| 60 | + C -->|"No (first incident / invalid)"| F["Write annotation = genTs(N+1)"] |
| 61 | + C -->|Yes| D{"interval = genTs(N+1) - genTs(N) > 0?"} |
| 62 | + D -->|No| F |
| 63 | + D -->|Yes| E["Emit MTBI metric = interval"] |
| 64 | + E --> F["Write annotation = genTs(N+1)"] |
| 65 | +``` |
| 66 | + |
| 67 | +### Metric |
| 68 | + |
| 69 | +| Field | Value | |
| 70 | +|---|---| |
| 71 | +| Name | `fault_quarantine_node_mtbi_seconds` | |
| 72 | +| Type | Histogram | |
| 73 | +| Buckets | `ExponentialBucketsRange(900, 2592000, 14)` — 14 buckets from 15 minutes to 30 days (~1 month), ~1.78× growth per bucket | |
| 74 | +| Description | Fault-to-fault interval between consecutive incidents on a node: `generatedTimestamp(N+1) − generatedTimestamp(N)`. Emitted at quarantine start when a prior incident annotation exists and the interval is positive. | |
| 75 | + |
| 76 | +### Queries |
| 77 | +Few example grafana queries: |
| 78 | + |
| 79 | +```promql |
| 80 | +# p90 MTBI by cluster (seconds) |
| 81 | +histogram_quantile( |
| 82 | + 0.90, |
| 83 | + sum by (cluster, le) ( |
| 84 | + rate(fault_quarantine_node_mtbi_seconds_bucket{env="prod"}[$__range]) |
| 85 | + ) |
| 86 | +) |
| 87 | +
|
| 88 | +# p90 MTBI by csp (seconds) |
| 89 | +histogram_quantile( |
| 90 | + 0.90, |
| 91 | + sum by (csp, le) ( |
| 92 | + rate(fault_quarantine_node_mtbi_seconds_bucket{env="prod"}[$__range]) |
| 93 | + ) |
| 94 | +) |
| 95 | +
|
| 96 | +``` |
| 97 | + |
| 98 | +### Pros |
| 99 | + |
| 100 | +- **No extra I/O** — no per-incident DB query and no extra API round trip: the |
| 101 | + previous timestamp is read from the node already fetched during reconcile and |
| 102 | + written within the annotation patch quarantine already applies. No datastore lookup. |
| 103 | +- **Robust to restarts, manual actions, and reprovision** — state lives on the |
| 104 | + node and is touched only at the `Quarantined` transition, so pod restarts and |
| 105 | + manual uncordon/untaint never interfere; a fresh node has no annotation and is |
| 106 | + correctly treated as a first incident. |
| 107 | +- **Self-correcting** — a corrupted/overwritten annotation affects at most one |
| 108 | + interval and heals on the next incident. |
| 109 | + |
| 110 | +### Cons |
| 111 | + |
| 112 | +- **Externally mutable state** — node annotation is externally |
| 113 | + mutable (an operator/controller overwriting it corrupts a single interval; |
| 114 | + see self-correcting above). |
| 115 | + |
| 116 | +## Implementation |
| 117 | + |
| 118 | +1. **`fault-quarantine/pkg/common/common.go`** — add a const for the annotation |
| 119 | + key, e.g. `NodeLastIncidentTimestampAnnotationKey = "nvsentinel.nvidia.com/last-incident-timestamp"`. |
| 120 | +2. **`fault-quarantine/pkg/metrics/metrics.go`** — add `NodeMTBISeconds` |
| 121 | + (`Histogram`, buckets `ExponentialBucketsRange(900, 2592000, 14)` |
| 122 | + — 15 minutes to 30 days). |
| 123 | +3. **`fault-quarantine/pkg/reconciler/reconciler.go`** — |
| 124 | + - **Read allowlist:** add `NodeLastIncidentTimestampAnnotationKey` to the |
| 125 | + `quarantineKeys` list in `getNodeQuarantineAnnotations`. Without this the key |
| 126 | + is filtered out of the `annotations` map and MTBI never emits. |
| 127 | + - **Observe + write:** in the fresh `Quarantined` path, pass the event (for |
| 128 | + `generatedTimestamp`) and the prior `annotations` into the annotation-building |
| 129 | + step. `prepareAnnotations` currently takes only |
| 130 | + `(ctx, taintsToBeApplied, labelsMap, isCordoned)`, so either extend its |
| 131 | + signature or perform the read/observe/write in `applyQuarantine` where the |
| 132 | + event is in scope. Logic: |
| 133 | + - if the prior annotation is present, parseable (RFC3339Nano), and |
| 134 | + `interval = genTs(N+1) − genTs(N) > 0`, observe the histogram; |
| 135 | + - always write the current `generatedTimestamp` into the annotation map that |
| 136 | + is handed to `QuarantineNodeAndSetAnnotations`; |
| 137 | + - skip emission (but still write) on first incident, unparseable value, |
| 138 | + missing/zero `generatedTimestamp`, or non-positive interval. |
| 139 | + - Do **not** add the key to any annotation-removal list. |
| 140 | +4. **`docs/METRICS.md`** — document `fault_quarantine_node_mtbi_seconds`. |
| 141 | + |
| 142 | +### Edge cases |
| 143 | + |
| 144 | +| Scenario | Handling | |
| 145 | +|---|---| |
| 146 | +| First incident on a node | Annotation absent → skip emission, write current `generatedTimestamp` | |
| 147 | +| Unparseable / corrupted annotation | Skip emission, overwrite with current `generatedTimestamp` (self-heals next incident) | |
| 148 | +| Non-positive interval (clock skew, out-of-order events, manual/force quarantine) | Skip emission, still overwrite annotation | |
| 149 | +| Compound events in one cycle | Recorded as `AlreadyQuarantined`; the quarantine annotation patch runs only on the `Quarantined` transition | |
| 150 | +| Manual uncordon between incidents | No effect — annotation is written/read only at quarantine | |
| 151 | +| Pod restart | No effect — state lives on the node | |
| 152 | +| Node reprovisioned | Fresh node has no annotation → treated as first incident | |
| 153 | + |
| 154 | +### Testing |
| 155 | + |
| 156 | +- First incident: annotation absent → no observation, annotation written. |
| 157 | +- Second incident: valid prior annotation → one observation equal to the interval. |
| 158 | +- Non-positive interval (N+1 ≤ N): no observation, annotation overwritten. |
| 159 | +- Unparseable prior annotation: no observation, annotation overwritten. |
| 160 | +- Missing/zero `generatedTimestamp`: no observation, no write. |
| 161 | +- Pod restart mid-cycle: state on node intact, next quarantine emits correctly. |
| 162 | +- Manual uncordon/untaint between incidents: annotation untouched, interval still |
| 163 | + measured at next quarantine. |
| 164 | +- Annotation survives an unquarantine cycle (not in any removal list). |
| 165 | + |
| 166 | + |
| 167 | +## Alternatives Considered |
| 168 | + |
| 169 | +### Per-incident datastore query at quarantine |
| 170 | +Query the previous `Quarantined` document for the node (`FindOne` by |
| 171 | +`healthevent.nodename`, sort `createdAt` desc) and compute the interval. |
| 172 | + |
| 173 | +**Rejected** as the primary mechanism because: it adds a DB round trip per |
| 174 | +incident and depends on a compound index over |
| 175 | +`(nodename, nodequarantined, createdAt)` to avoid scans. The annotation-piggyback |
| 176 | +achieves the same result with no extra I/O and no index dependency. |
| 177 | + |
| 178 | +## References |
| 179 | + |
| 180 | +- [METRICS.md](../METRICS.md) |
| 181 | +- [`fault-quarantine/pkg/metrics/metrics.go`](../../fault-quarantine/pkg/metrics/metrics.go) — `TotalNodesQuarantined`, remediation-duration histograms |
| 182 | +- [`fault-quarantine/pkg/reconciler/reconciler.go`](../../fault-quarantine/pkg/reconciler/reconciler.go) — `updateQuarantineMetrics`, `prepareAnnotations`, `Quarantined` transition, annotation removal lists |
| 183 | +- [`data-models/protobufs/health_event.proto`](../../data-models/protobufs/health_event.proto) — `generatedTimestamp`, `nodeQuarantined` |
0 commit comments