Skip to content

Commit 035b76f

Browse files
committed
feat: design doc for breakfix mtbi
1 parent 5d1a4d5 commit 035b76f

1 file changed

Lines changed: 183 additions & 0 deletions

File tree

docs/designs/044-breakfix-mtbi.md

Lines changed: 183 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,183 @@
1+
# ADR-044: Breakfix MTBI Metric
2+
3+
## Context
4+
5+
NVSentinel measures how long it takes to repair a faulty node (MTTR /
6+
remediation duration) and counts how many times each node is quarantined
7+
(`fault_quarantine_nodes_quarantined_total{node}`), but it has no visibility into
8+
the *time between* failures — how long a node typically runs healthy before
9+
breaking again. Operators cannot today answer: "How long do nodes typically run
10+
between failures?", "What is the distribution (median / p90) of healthy uptime
11+
between incidents?", or "Is the fleet's time-between-failures trending down?"
12+
13+
### Definition
14+
15+
MTBI is defined as below:
16+
17+
```
18+
MTBI = generatedTimestamp(incident N+1) − generatedTimestamp(incident N)
19+
```
20+
21+
## Decision
22+
23+
Emit a per-incident interval histogram `fault_quarantine_node_mtbi_seconds`
24+
from fault-quarantine, computed at quarantine time. The previous
25+
incident's `generatedTimestamp` is carried forward via an **annotation-piggyback**:
26+
a dedicated node annotation written in the annotation patch fault-quarantine
27+
already applies at the `Quarantined` transition.
28+
29+
## Design
30+
31+
To expose the interval distribution, emit one histogram observation per
32+
incident. The interval is computed at the moment a node is quarantined, using a
33+
dedicated node annotation as the carry-forward for the previous incident's
34+
`generatedTimestamp`.
35+
36+
### Mechanism (annotation-piggyback at quarantine)
37+
38+
- **Annotation key:** `nvsentinel.nvidia.com/last-incident-timestamp`, declared as
39+
a `const` in `fault-quarantine/pkg/common` alongside the existing quarantine
40+
annotation keys.
41+
- **Value:** the incident's `generatedTimestamp` serialized as **RFC3339Nano in
42+
UTC**
43+
- **Written:** only at the `Quarantined` transition, folded into the annotation
44+
patch fault-quarantine already applies (the `annotationsMap` passed to
45+
`QuarantineNodeAndSetAnnotations`). No extra API call.
46+
- **Read:** at the start of each new quarantine, from the node object already
47+
fetched during reconcile. This requires adding the new key to the allowlist in
48+
`getNodeQuarantineAnnotations`, otherwise the value is
49+
filtered out before the quarantine path ever sees it.
50+
- **Lifecycle:** excluded from every annotation-removal list, so it persists across
51+
cycles and is overwritten on the next quarantine. This relies on the k8s client
52+
patch helpers (`UnQuarantineNodeAndRemoveAnnotations`,
53+
`HandleManualUncordonCleanup`, `HandleManualUntaintCleanup`) removing only the
54+
explicitly-listed keys rather than wholesale-stripping nvsentinel annotations.
55+
56+
```mermaid
57+
flowchart TD
58+
A["Incident N+1: node enters Quarantined<br/>genTs(N+1) = current generatedTimestamp"] --> B["Read node annotation last-incident-timestamp"]
59+
B --> C{Annotation present and parseable?}
60+
C -->|"No (first incident / invalid)"| F["Write annotation = genTs(N+1)"]
61+
C -->|Yes| D{"interval = genTs(N+1) - genTs(N) > 0?"}
62+
D -->|No| F
63+
D -->|Yes| E["Emit MTBI metric = interval"]
64+
E --> F["Write annotation = genTs(N+1)"]
65+
```
66+
67+
### Metric
68+
69+
| Field | Value |
70+
|---|---|
71+
| Name | `fault_quarantine_node_mtbi_seconds` |
72+
| Type | Histogram |
73+
| Buckets | `ExponentialBucketsRange(900, 2592000, 14)` — 14 buckets from 15 minutes to 30 days (~1 month), ~1.78× growth per bucket |
74+
| Description | Fault-to-fault interval between consecutive incidents on a node: `generatedTimestamp(N+1) − generatedTimestamp(N)`. Emitted at quarantine start when a prior incident annotation exists and the interval is positive. |
75+
76+
### Queries
77+
Few example grafana queries:
78+
79+
```promql
80+
# p90 MTBI by cluster (seconds)
81+
histogram_quantile(
82+
0.90,
83+
sum by (cluster, le) (
84+
rate(fault_quarantine_node_mtbi_seconds_bucket{env="prod"}[$__range])
85+
)
86+
)
87+
88+
# p90 MTBI by csp (seconds)
89+
histogram_quantile(
90+
0.90,
91+
sum by (csp, le) (
92+
rate(fault_quarantine_node_mtbi_seconds_bucket{env="prod"}[$__range])
93+
)
94+
)
95+
96+
```
97+
98+
### Pros
99+
100+
- **No extra I/O** — no per-incident DB query and no extra API round trip: the
101+
previous timestamp is read from the node already fetched during reconcile and
102+
written within the annotation patch quarantine already applies. No datastore lookup.
103+
- **Robust to restarts, manual actions, and reprovision** — state lives on the
104+
node and is touched only at the `Quarantined` transition, so pod restarts and
105+
manual uncordon/untaint never interfere; a fresh node has no annotation and is
106+
correctly treated as a first incident.
107+
- **Self-correcting** — a corrupted/overwritten annotation affects at most one
108+
interval and heals on the next incident.
109+
110+
### Cons
111+
112+
- **Externally mutable state** — node annotation is externally
113+
mutable (an operator/controller overwriting it corrupts a single interval;
114+
see self-correcting above).
115+
116+
## Implementation
117+
118+
1. **`fault-quarantine/pkg/common/common.go`** — add a const for the annotation
119+
key, e.g. `NodeLastIncidentTimestampAnnotationKey = "nvsentinel.nvidia.com/last-incident-timestamp"`.
120+
2. **`fault-quarantine/pkg/metrics/metrics.go`** — add `NodeMTBISeconds`
121+
(`Histogram`, buckets `ExponentialBucketsRange(900, 2592000, 14)`
122+
— 15 minutes to 30 days).
123+
3. **`fault-quarantine/pkg/reconciler/reconciler.go`**
124+
- **Read allowlist:** add `NodeLastIncidentTimestampAnnotationKey` to the
125+
`quarantineKeys` list in `getNodeQuarantineAnnotations`. Without this the key
126+
is filtered out of the `annotations` map and MTBI never emits.
127+
- **Observe + write:** in the fresh `Quarantined` path, pass the event (for
128+
`generatedTimestamp`) and the prior `annotations` into the annotation-building
129+
step. `prepareAnnotations` currently takes only
130+
`(ctx, taintsToBeApplied, labelsMap, isCordoned)`, so either extend its
131+
signature or perform the read/observe/write in `applyQuarantine` where the
132+
event is in scope. Logic:
133+
- if the prior annotation is present, parseable (RFC3339Nano), and
134+
`interval = genTs(N+1) − genTs(N) > 0`, observe the histogram;
135+
- always write the current `generatedTimestamp` into the annotation map that
136+
is handed to `QuarantineNodeAndSetAnnotations`;
137+
- skip emission (but still write) on first incident, unparseable value,
138+
missing/zero `generatedTimestamp`, or non-positive interval.
139+
- Do **not** add the key to any annotation-removal list.
140+
4. **`docs/METRICS.md`** — document `fault_quarantine_node_mtbi_seconds`.
141+
142+
### Edge cases
143+
144+
| Scenario | Handling |
145+
|---|---|
146+
| First incident on a node | Annotation absent → skip emission, write current `generatedTimestamp` |
147+
| Unparseable / corrupted annotation | Skip emission, overwrite with current `generatedTimestamp` (self-heals next incident) |
148+
| Non-positive interval (clock skew, out-of-order events, manual/force quarantine) | Skip emission, still overwrite annotation |
149+
| Compound events in one cycle | Recorded as `AlreadyQuarantined`; the quarantine annotation patch runs only on the `Quarantined` transition |
150+
| Manual uncordon between incidents | No effect — annotation is written/read only at quarantine |
151+
| Pod restart | No effect — state lives on the node |
152+
| Node reprovisioned | Fresh node has no annotation → treated as first incident |
153+
154+
### Testing
155+
156+
- First incident: annotation absent → no observation, annotation written.
157+
- Second incident: valid prior annotation → one observation equal to the interval.
158+
- Non-positive interval (N+1 ≤ N): no observation, annotation overwritten.
159+
- Unparseable prior annotation: no observation, annotation overwritten.
160+
- Missing/zero `generatedTimestamp`: no observation, no write.
161+
- Pod restart mid-cycle: state on node intact, next quarantine emits correctly.
162+
- Manual uncordon/untaint between incidents: annotation untouched, interval still
163+
measured at next quarantine.
164+
- Annotation survives an unquarantine cycle (not in any removal list).
165+
166+
167+
## Alternatives Considered
168+
169+
### Per-incident datastore query at quarantine
170+
Query the previous `Quarantined` document for the node (`FindOne` by
171+
`healthevent.nodename`, sort `createdAt` desc) and compute the interval.
172+
173+
**Rejected** as the primary mechanism because: it adds a DB round trip per
174+
incident and depends on a compound index over
175+
`(nodename, nodequarantined, createdAt)` to avoid scans. The annotation-piggyback
176+
achieves the same result with no extra I/O and no index dependency.
177+
178+
## References
179+
180+
- [METRICS.md](../METRICS.md)
181+
- [`fault-quarantine/pkg/metrics/metrics.go`](../../fault-quarantine/pkg/metrics/metrics.go)`TotalNodesQuarantined`, remediation-duration histograms
182+
- [`fault-quarantine/pkg/reconciler/reconciler.go`](../../fault-quarantine/pkg/reconciler/reconciler.go)`updateQuarantineMetrics`, `prepareAnnotations`, `Quarantined` transition, annotation removal lists
183+
- [`data-models/protobufs/health_event.proto`](../../data-models/protobufs/health_event.proto)`generatedTimestamp`, `nodeQuarantined`

0 commit comments

Comments
 (0)