Commit da8e804

Authored by chernistry, auto-heal-fixup, and github-actions[bot]
feat(supervisor): operator supervisor surface with signed escalation receipts (#1805)
* feat(supervisor): operator supervisor surface with signed escalation receipts

Adds `bernstein supervisor status` and `bernstein supervisor escalate` as an operator-facing surface over the existing stalled_manager, watchdog, and spawn_supervisor detectors. The detectors remain the source of truth; the new command aggregates and renders.

Each escalation appends a signed receipt to the audit chain. The receipt carries the worker id, worktree id, last N audit entries, identity tokens, stall reason, recommended action, and the previous chain digest. The receipt verifies offline against the install's Ed25519 public key.

The recommended_action field is a pure function of the chain slice at stall time (stall_reason + audit_entries + respawn_budget_remaining). Two operators verifying the same receipt arrive at the byte-identical recommendation.

A cross-worktree fence assertion refuses receipt assembly when the stuck session leaked into a sibling worktree's resolution events.

The aggregator surfaces a stuck-count + oldest-stall summary line on `bernstein status` and `bernstein fleet`. The TUI gains a SupervisorPane that highlights stalled / parked / no-progress sessions with the same recommended action. Worker badges learn a STUCK status so the dashboard's existing widgets stay consistent with the pane.

Closes #1800.

Files touched:
- src/bernstein/core/orchestration/supervisor_receipt.py (new)
- src/bernstein/core/orchestration/supervisor_aggregator.py (new)
- src/bernstein/cli/commands/supervisor_cmd.py (new)
- src/bernstein/cli/main.py
- src/bernstein/cli/commands/status_cmd.py
- src/bernstein/cli/commands/fleet_cmd.py
- src/bernstein/core/lifecycle/hooks.py (worker.escalated event)
- src/bernstein/tui/status_bar.py (SupervisorPane + helpers)
- src/bernstein/tui/worker_badges.py (STUCK status)
- docs/api/supervisor.md (new)
- tests/unit/test_supervisor_receipt.py (new)
- tests/integration/test_supervisor_chain_roundtrip.py (new)
- tests/snapshot/test_supervisor_pane_snapshot.py (new)

* chore(ci): regenerate contract drift allow-lists

Auto-applied by contract-drift-autofix.yml on PR #1805. Regenerated via scripts/regen_contract_drift.py. Refs #1273.
Source CI run: https://github.com/sipyourdrink-ltd/bernstein/actions/runs/26249756348

* fix(supervisor): address CodeRabbit must-address findings

- Validate `--reason` strips to non-empty before any state mutation in `bernstein supervisor escalate`. A whitespace-only reason previously passed through into the receipt body.
- Log silent-broad-exception paths in the supervisor summary helpers (status + fleet) and in the install-fingerprint lookup so an empty summary surfaces a cause in the orchestrator log instead of looking identical to a healthy run.
- Refuse to silently reset the audit chain anchor when the audit log directory exists but is unreadable. The previous fallback to the genesis sentinel would have let a fresh receipt skip the chain head and break the tamper-evidence guarantee.
- Switch receipt filenames to nanosecond timestamps and open the file exclusively (``"x"``) so two escalations in the same second cannot silently overwrite each other.
- Escape every dynamic field interpolated into the supervisor TUI pane's Rich markup; an upstream caller passing ``[``/``]`` in a worker id or role no longer corrupts the dashboard layout.

bot-ack: 3284009273
bot-ack: 3284009276
bot-ack: 3284009287
bot-ack: 3284009299
bot-ack: 3284009302
bot-ack: 3284009305
bot-ack: 3284009328

---------

Co-authored-by: auto-heal-fixup <auto-heal-fixup@bernstein.local>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
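The receipt-filename fix in the commit message (nanosecond timestamps plus an exclusive open) can be illustrated with a minimal sketch. The function name `write_receipt_exclusive`, the `<timestamp>-<worker_id>.json` naming scheme, and the `timestamp_ns` parameter are assumptions made for illustration; the actual code in `supervisor_receipt.py` is not shown in this excerpt.

```python
from __future__ import annotations

import json
import time
from pathlib import Path


def write_receipt_exclusive(
    receipts_dir: Path,
    worker_id: str,
    receipt: dict,
    timestamp_ns: int | None = None,
) -> Path:
    """Persist one receipt without ever overwriting an existing file.

    Hypothetical helper: the naming scheme below is an assumption.
    """
    receipts_dir.mkdir(parents=True, exist_ok=True)
    # Nanosecond resolution makes same-second collisions unlikely.
    stamp = time.time_ns() if timestamp_ns is None else timestamp_ns
    path = receipts_dir / f"{stamp}-{worker_id}.json"
    # Mode "x" raises FileExistsError instead of truncating an existing
    # file, so a collision is loud rather than silent.
    with open(path, "x", encoding="utf-8") as fh:
        json.dump(receipt, fh, sort_keys=True)
    return path
```

The two measures are complementary: the finer timestamp reduces the chance of a name collision, and the exclusive open turns any remaining collision into an error instead of a lost receipt.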
1 parent 2daf12b commit da8e804

15 files changed

Lines changed: 2676 additions & 0 deletions

docs/api/supervisor.md

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
# Supervisor surface

This document describes the JSON shapes the `bernstein supervisor`
command emits. Two surfaces are documented:

1. **Aggregated supervisor snapshot** - the body returned by
   `bernstein supervisor status --json` and embedded as the
   `supervisor` field in `bernstein status --json`.
2. **Signed escalation receipt** - the envelope persisted under
   `.sdd/runtime/supervisor/receipts/` whenever the supervisor or the
   operator escalates a stalled worker.

Both shapes are versioned via an explicit `schema_version` field.

## Supervisor snapshot

```jsonc
{
  "schema_version": "1.0.0",
  "generated_ts": 1700000000.0,
  "stuck_count": 2,
  "oldest_stall_age_s": 95.0,
  "workers": [
    {
      "worker_id": "abc123def456",
      "session_id": "sess-abc123",
      "role": "backend",
      "task_id": "t-12",
      "worktree_id": "wt-007",
      "last_heartbeat_age_s": 42.0,
      "is_stuck": false,
      "stall_reason": "unknown",
      "recommended_action": "inspect",
      "respawn_budget_remaining": 3,
      "stuck_since_ts": null,
      "details": {"status": "working"}
    }
  ]
}
```

### Fields

| Field | Type | Description |
|-------|------|-------------|
| `schema_version` | string | Aggregator schema version. Currently `1.0.0`. |
| `generated_ts` | float | Unix timestamp the snapshot was captured. |
| `stuck_count` | integer | Number of workers with `is_stuck=true`. |
| `oldest_stall_age_s` | float \| null | Age, in seconds, of the oldest currently-stuck worker; `null` when no worker is stuck or no stall timestamp is available. |
| `workers[].worker_id` | string | Operator-decodable worker handle. |
| `workers[].session_id` | string | Adapter session id. |
| `workers[].role` | string | Worker role (`manager`, `backend`, `qa`, ...). |
| `workers[].task_id` | string | Current task id, or empty string. |
| `workers[].worktree_id` | string | Worktree the worker is running in. |
| `workers[].last_heartbeat_age_s` | float \| null | Seconds since the last heartbeat; `null` when none recorded. |
| `workers[].is_stuck` | bool | True iff at least one detector classifies the row as stuck. |
| `workers[].stall_reason` | string | One of `manager_no_children`, `watchdog_model_question`, `respawn_budget_exhausted`, `heartbeat_stale`, `no_progress`, or `unknown`. |
| `workers[].recommended_action` | string | One of `respawn`, `escalate`, `park`, `inspect`. Deterministic over the chain slice (see below). |
| `workers[].respawn_budget_remaining` | integer | Respawns remaining under the session's budget. |
| `workers[].stuck_since_ts` | float \| null | Unix timestamp the stall first fired; `null` when not known. |
| `workers[].details` | object | Free-form detector context. The aggregator currently includes `status` (raw agent status). |
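
As a consumer-side illustration of the fields above, the following sketch recomputes the stuck rows and the oldest stall age from a parsed snapshot. The helper names are hypothetical, and the formula `generated_ts - min(stuck_since_ts)` is an assumption about how `oldest_stall_age_s` relates to the per-worker timestamps; the aggregator's own computation is not shown here.

```python
from __future__ import annotations

from typing import Any


def stuck_workers(snapshot: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the worker rows the snapshot marks as stuck."""
    return [w for w in snapshot.get("workers", []) if w.get("is_stuck")]


def oldest_stall_age_s(snapshot: dict[str, Any]) -> float | None:
    """Age of the oldest stall, or None when no stall timestamp is known.

    Assumption: the age is measured from ``generated_ts`` back to the
    earliest ``stuck_since_ts`` among stuck workers.
    """
    stamps = [
        w["stuck_since_ts"]
        for w in stuck_workers(snapshot)
        if w.get("stuck_since_ts") is not None
    ]
    if not stamps:
        return None
    return snapshot["generated_ts"] - min(stamps)
```

A script reading `bernstein supervisor status --json` could use helpers like these to cross-check `stuck_count` against `len(stuck_workers(snapshot))`.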

## Escalation receipt envelope

```jsonc
{
  "schema_version": "1.0.0",
  "worker_id": "abc123def456",
  "worktree_id": "wt-007",
  "session_id": "sess-abc123",
  "stall_reason": "manager_no_children",
  "recommended_action": "escalate",
  "audit_entries": [
    {
      "event_type": "stalled_manager",
      "session_id": "sess-abc123",
      "details": {"runtime_s": 120.0, "hook_event_count": 12}
    }
  ],
  "identity": {
    "install_rev": "abc123def4567890",
    "keyid": "...64 hex chars...",
    "run_id": "run-2026-05-21-001"
  },
  "prev_chain_digest": "...64 hex chars...",
  "payload_digest": "...64 hex chars...",
  "signature_b64": "...base64 Ed25519 signature...",
  "details": {
    "operator_reason": "wedged on credential rotation",
    "respawn_budget_remaining": 0
  }
}
```

### Receipt fields

| Field | Type | Description |
|-------|------|-------------|
| `schema_version` | string | Receipt schema version. Currently `1.0.0`. |
| `worker_id` | string | Stable worker identifier. |
| `worktree_id` | string | Worktree the worker was running in. |
| `session_id` | string | Adapter session id. |
| `stall_reason` | string | Structured stall reason - same vocabulary as the aggregator. |
| `recommended_action` | string | Deterministic action - same vocabulary as the aggregator. |
| `audit_entries` | array of object | Captured chain slice (default 16 trailing entries) leading up to the stall. |
| `identity.install_rev` | string | Operator-decodable install fingerprint. |
| `identity.keyid` | string | sha256 of the Ed25519 public key (hex). |
| `identity.run_id` | string | Orchestrator run id, when known. |
| `prev_chain_digest` | string | HMAC of the previous audit-chain entry. Links the receipt into the tamper-evident audit log. |
| `payload_digest` | string | sha256 of the canonical signing payload. Lets verifiers detect a swapped signature blob. |
| `signature_b64` | string | base64-encoded Ed25519 signature over the canonical payload. |
| `details` | object | Free-form context. The CLI populates `operator_reason` and `respawn_budget_remaining`. |

### Determinism contract

`recommended_action` is a **pure function** of the receipt's
`(stall_reason, audit_entries, respawn_budget_remaining)`. The function

* never reads files or environment,
* never opens a socket,
* never reads a wall clock.

Two operators handed the same receipt bytes (or independently
reassembled receipts from the same chain prefix) compute the
byte-identical `recommended_action`. The contract is enforced by the
unit test
`tests/unit/test_supervisor_receipt.py::test_recommended_action_determinism`,
which drives the same chain slice through the function from two
different temp dirs and asserts equality.
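
To make the shape of such a pure function concrete, here is a minimal sketch. The decision rules are entirely hypothetical (this document does not state the real mapping from stall reasons to actions), and the function name `recommend_action` is an assumption; only the input triple and the output vocabulary come from the text above.

```python
from __future__ import annotations

from typing import Any


def recommend_action(
    stall_reason: str,
    audit_entries: list[dict[str, Any]],
    respawn_budget_remaining: int,
) -> str:
    """Derive an action from the chain slice alone.

    Hypothetical rules for illustration only. The function touches no
    files, environment, sockets, or clocks, so equal inputs always
    yield equal outputs.
    """
    # Assumed rule: a slice that already records an escalation is not
    # worth respawning again.
    already_escalated = any(
        entry.get("event_type") == "worker.escalated" for entry in audit_entries
    )
    if stall_reason == "watchdog_model_question":
        return "park"
    if stall_reason in ("manager_no_children", "respawn_budget_exhausted"):
        return "escalate"
    if stall_reason in ("heartbeat_stale", "no_progress"):
        if respawn_budget_remaining > 0 and not already_escalated:
            return "respawn"
        return "escalate"
    return "inspect"
```

Because every branch depends only on the arguments, two verifiers holding the same receipt necessarily agree on the result, which is the property the contract describes.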

### Cross-worktree fence

Every receipt asserts that the stuck session never crossed worktree
boundaries during the stall window. An audit entry whose
`event_type` ends in `.resolved` or starts with `cross_worktree.` and
references the stuck `session_id` from a sibling `worktree_id` is a
fence violation and aborts receipt assembly. Verifiers re-run the same
check from the receipt bytes alone, so a tampered audit slice that
smuggled a leak past assembly fails verification.
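
The fence rule can be sketched as a filter over the audit slice. This is an illustration, not the shipped check: the function name `fence_violations` is hypothetical, and it assumes each audit entry carries its own `worktree_id` key, which the example receipt above does not show.

```python
from __future__ import annotations

from typing import Any


def fence_violations(
    audit_entries: list[dict[str, Any]],
    session_id: str,
    worktree_id: str,
) -> list[dict[str, Any]]:
    """Return audit entries showing the stuck session in a sibling worktree.

    Assumption: entries expose ``event_type``, ``session_id`` and
    ``worktree_id`` as top-level keys.
    """
    violations = []
    for entry in audit_entries:
        event_type = entry.get("event_type", "")
        fenced_event = event_type.endswith(".resolved") or event_type.startswith(
            "cross_worktree."
        )
        if not fenced_event or entry.get("session_id") != session_id:
            continue
        entry_worktree = entry.get("worktree_id")
        # Only an entry from a *different* worktree counts as a leak.
        if entry_worktree is not None and entry_worktree != worktree_id:
            violations.append(entry)
    return violations
```

Under this sketch, receipt assembly would abort whenever the returned list is non-empty, and a verifier would run the same function over the receipt's embedded `audit_entries`.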

### Verification

The standalone verifier loads only the public side of the install
Ed25519 keypair (`<workdir>/.sdd/runtime/supervisor/install.key.pub`,
PEM-encoded). It

1. recomputes `payload_digest` over the canonical signing bytes and
   asserts byte-equality with the receipt's `payload_digest`,
2. re-asserts the cross-worktree fence,
3. re-derives `recommended_action` from the embedded slice and
   asserts equality with the receipt's `recommended_action`,
4. verifies the Ed25519 signature over the canonical bytes.

A receipt that survives all four checks is byte-portable: any auditor
holding the install's public key validates it offline without
contacting the orchestrator.
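
The four checks can be sketched as one function that reports which of them failed. Several things here are assumptions rather than documented behavior: the canonical form (sorted-key compact JSON with `payload_digest` and `signature_b64` excluded), the names `canonical_bytes` and `verify_receipt`, and the use of injected callables for the fence, the action derivation, and the signature check. In practice `signature_ok` would wrap an Ed25519 verification from a cryptography library.

```python
from __future__ import annotations

import base64
import hashlib
import json
from typing import Any, Callable

# Assumption: these two fields are not part of the signed payload.
_UNSIGNED_FIELDS = ("payload_digest", "signature_b64")


def canonical_bytes(receipt: dict[str, Any]) -> bytes:
    """Serialize the signed fields deterministically (assumed canonical form)."""
    body = {k: v for k, v in receipt.items() if k not in _UNSIGNED_FIELDS}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_receipt(
    receipt: dict[str, Any],
    *,
    fence_ok: Callable[[dict[str, Any]], bool],
    derive_action: Callable[[dict[str, Any]], str],
    signature_ok: Callable[[bytes, bytes], bool],
) -> list[str]:
    """Run the four checks and return the names of those that failed."""
    failures = []
    payload = canonical_bytes(receipt)
    # 1. The digest must match the canonical bytes.
    if hashlib.sha256(payload).hexdigest() != receipt.get("payload_digest"):
        failures.append("payload_digest")
    # 2. The cross-worktree fence must still hold.
    if not fence_ok(receipt):
        failures.append("fence")
    # 3. The re-derived action must equal the recorded one.
    if derive_action(receipt) != receipt.get("recommended_action"):
        failures.append("recommended_action")
    # 4. The signature must verify over the canonical bytes.
    try:
        signature = base64.b64decode(receipt.get("signature_b64") or "", validate=True)
    except ValueError:
        signature = b""
    if not signature_ok(payload, signature):
        failures.append("signature")
    return failures
```

Returning the list of failed checks, rather than stopping at the first failure, is a design choice of this sketch; it lets an auditor see at a glance whether a receipt was altered in one place or several.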

src/bernstein/cli/commands/fleet_cmd.py

Lines changed: 30 additions & 0 deletions
@@ -137,6 +137,36 @@ def _fallback_table_render(aggregator: FleetAggregator, config: FleetConfig) ->
     _console.print(table)
     _console.print(format_footer(config, rows, total))

+    supervisor_line = _fleet_supervisor_summary_line()
+    if supervisor_line:
+        _console.print(f"[dim]{supervisor_line}[/dim]")
+
+
+def _fleet_supervisor_summary_line() -> str:
+    """Return the stuck-count summary across the fleet's primary workspace.
+
+    The fleet view aggregates many projects but a single operator sits
+    inside one workspace, so we surface the supervisor snapshot for that
+    workspace as the most actionable signal. Returns an empty string on
+    any aggregator failure so the fleet command never errors here.
+    Failures are logged so an operator-visible drop can be debugged from
+    the orchestrator log without restarting the fleet view.
+    """
+    try:
+        from pathlib import Path as _Path
+
+        from bernstein.core.defaults import AGENT
+        from bernstein.core.orchestration.supervisor_aggregator import (
+            aggregator_snapshot,
+            format_summary_line,
+        )
+
+        snapshot = aggregator_snapshot(_Path.cwd(), heartbeat_stale_s=AGENT.heartbeat_stale_s)
+    except Exception:  # pragma: no cover - fleet renderer must never raise
+        logger.exception("fleet supervisor-summary aggregation failed")
+        return ""
+    return format_summary_line(snapshot)
+

 def _parse_bind(bind: str) -> tuple[str, int]:
     text = bind.strip()

src/bernstein/cli/commands/status_cmd.py

Lines changed: 50 additions & 0 deletions
@@ -3,6 +3,7 @@
 from __future__ import annotations

 import json
+import logging
 import os
 import sys
 import time
@@ -24,6 +25,8 @@
 from bernstein.core.agent_discovery import AgentCapabilities, DiscoveryResult, discover_agents_cached
 from bernstein.tui.worker_badges import format_worker_badge, get_badge_for_worker

+logger = logging.getLogger(__name__)
+
 _NOT_AUTHENTICATED_MSG = "not authenticated"

 _STORAGE_BACKEND_LABEL = "Storage backend"
@@ -184,6 +187,11 @@ def status(as_json: bool, no_color: bool, view_mode: str | None) -> None:
     if snapshots:
         data["rate_limit_meters"] = snapshots

+    # Attach the supervisor summary (stuck-count + oldest-stall age).
+    # Operators reading ``bernstein status`` should not have to remember
+    # the dedicated supervisor command to spot a wedged worker.
+    data["supervisor"] = _supervisor_status_summary(Path.cwd())
+
     if as_json or is_json():
         print_json(data)
         return
@@ -200,6 +208,48 @@ def status(as_json: bool, no_color: bool, view_mode: str | None) -> None:

     render_status(data, console=con, view_config=vc)

+    supervisor_line = _supervisor_summary_line(Path.cwd())
+    if supervisor_line:
+        con.print(f"[dim]{supervisor_line}[/dim]")
+
+
+def _supervisor_status_summary(workdir: Path) -> dict[str, Any]:
+    """Return the supervisor stuck-count summary for ``bernstein status --json``.
+
+    Returns an empty dict if the aggregator raises - the command must
+    never fail on a missing or malformed runtime tree. Failures are
+    logged so an operator can correlate an empty summary with a real
+    cause instead of treating silence as healthy.
+    """
+    try:
+        from bernstein.core.defaults import AGENT
+        from bernstein.core.orchestration.supervisor_aggregator import (
+            aggregator_snapshot,
+            snapshot_to_dict,
+        )
+
+        snapshot = aggregator_snapshot(workdir, heartbeat_stale_s=AGENT.heartbeat_stale_s)
+    except Exception:  # pragma: no cover - status must never error on this
+        logger.exception("supervisor status summary failed")
+        return {}
+    return snapshot_to_dict(snapshot)
+
+
+def _supervisor_summary_line(workdir: Path) -> str:
+    """Return the one-line supervisor summary string for the human view."""
+    try:
+        from bernstein.core.defaults import AGENT
+        from bernstein.core.orchestration.supervisor_aggregator import (
+            aggregator_snapshot,
+            format_summary_line,
+        )
+
+        snapshot = aggregator_snapshot(workdir, heartbeat_stale_s=AGENT.heartbeat_stale_s)
+    except Exception:  # pragma: no cover - status must never error on this
+        logger.exception("supervisor summary line render failed")
+        return ""
+    return format_summary_line(snapshot)
+

 # ---------------------------------------------------------------------------
 # ps - process visibility
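
The summary helpers in this diff share one pattern: run the aggregator, and on any exception log the traceback and return an empty value so the CLI command keeps working. A minimal standalone sketch of that pattern follows; `logged_fallback` is a hypothetical name and is not part of the commit.

```python
from __future__ import annotations

import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def logged_fallback(compute: Callable[[], T], fallback: T, what: str) -> T:
    """Run ``compute``; on any exception, log it and return ``fallback``.

    Logging with ``logger.exception`` keeps the traceback, so an empty
    result can be told apart from a healthy one in the log.
    """
    try:
        return compute()
    except Exception:
        logger.exception("%s failed", what)
        return fallback
```

The broad `except Exception` is deliberate in this pattern: a status renderer that must never raise trades strictness for availability, and the log entry is what keeps the failure from going unnoticed.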
