
Commit 052f64e

EchoOfDawn, Instar Agent (echo), and claude authored
feat(monitoring): ProcessFootprintMonitor — per-machine process-footprint measurement (the climb missing before the panic) (#1291)
* feat(monitoring): ProcessFootprintMonitor — per-machine process-footprint measurement (the climb missing before the panic)

  The ResourceLedger samples CPU%/RSS but not the per-machine PROCESS COUNT — the signal that actually climbed (dominated by idle MCP servers) until the host hit a kernel limit and panicked on 2026-06-26. This adds that missing measurement.

  - New observe-only ProcessFootprintMonitor: samples agent-relevant processes on an interval, classifies them (agent-cli / mcp via MCP_PROCESS_SIGNATURES / other-node), rolling-window TREND. Pure core; production ps scan via withSyncOp, fail-safe.
  - GET /resources/footprint — read-only status (Bearer-gated; 503 when disabled).
  - Declared in GUARD_MANIFEST (joins /guards). Ships DARK (developmentAgent gate); threshold heads-up opt-in (measure-first; sink wiring tracked as increment 2).
  - Never kills/gates — the reapers reclaim.

  Tests: 11 unit + 3 integration + 4 e2e.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(release): plain-language 'What to Tell Your User' in footprint fragment

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(observability): document ProcessFootprintMonitor (restore class doc-coverage floor)

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci: re-trigger (flaky TopicProfileOrchestrator §10.4 circuit-breaker test; passes locally 44/44, unrelated to this PR)

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Instar Agent (echo) <echo@instar.local>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 269250a commit 052f64e

15 files changed

Lines changed: 841 additions & 2 deletions
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+{
+  "ts": "2026-06-27T00:27:54.943Z",
+  "slug": "process-footprint-monitor",
+  "suggestedTier": 2,
+  "declaredTier": 1,
+  "riskFloor": 2,
+  "riskFloorReasons": [
+    "irreversibility: src/core/PostUpdateMigrator.ts touches PostUpdateMigrator",
+    "migration / fleet-rollout surface: src/core/PostUpdateMigrator.ts touches PostUpdateMigrator (fleet migration machinery)",
+    "new capability: new route (router.<verb>() added",
+    "new capability: new exported class / subsystem added",
+    "new capability: new config key added"
+  ],
+  "belowFloor": true,
+  "files": 7,
+  "loc": 313,
+  "causalAutopsy": {
+    "origin": "prior-pr",
+    "relatedPrs": [
+      1290
+    ],
+    "notes": "Follow-up to the 2026-06-26 resource-exhaustion kernel panic (os_refcnt overflow) whose disk arm was fixed in #1290. The panic's process/handle arm had NO measurement: the ResourceLedger samples CPU%/RSS but not the per-machine PROCESS COUNT that actually climbed (dominated by idle MCP servers). This adds that missing measurement so the next climb is visible."
+  },
+  "verdict": "pass"
+}

site/src/content/docs/features/observability.md

Lines changed: 21 additions & 0 deletions
@@ -192,3 +192,24 @@ When a session goes quiet at its prompt, a background loop decides whether it st
 The tail-gating itself lives in one shared helper, `paneTail` (`liveTail` / `stripLineLead` / `wasGlyphLed`), so "what counts as the live tail" has a single definition rather than a copy per consumer — the same helper `StuckSignatureClassifier` uses for its honest turn-receipts. The capture is widened to clear Claude Code's input-box chrome (which renders well below the error line), so a genuine error can't be pushed off-screen.

 The classifier's signal feeds the **existing** recovery actuator — it emits `apiErrorAtIdle`, which `RateLimitSentinel` turns into a non-destructive backoff → nudge → verify → escalate loop (it never restarts a session on its own; the worst case of a wrong signal is one wasted nudge the verify step proves was a no-op). Every classify decision (fired vs suppressed) is recorded once per idle episode, so a wave of suppressions on genuine errors is observable rather than a silent under-fire. This keeps the idle-error path consistent with the broader [Signal vs. Authority](/foundations/north-star/) posture: the brittle detector signals, the full-context actuator decides.
+
+## Process footprint (the climb measurement)
+
+CPU and memory sampling tells you how *hard* the machine is working, but not how *many*
+processes are running — and it was the slow climb of the process count (several agent
+stacks plus their heavy, mostly-idle MCP servers: a whole Chromium for Playwright, an
+Electron) that went unwatched until the host hit a kernel limit and panicked on
+2026-06-26. The `ProcessFootprintMonitor` adds exactly that missing measurement. On an
+interval it counts the agent-relevant processes on the machine and classifies them —
+agent CLIs, MCP servers (matched by the same allow-listed signatures the MCP cleanup
+sweep uses), and other node — keeping a bounded rolling window so a TREND (rising /
+stable / falling) is visible.
+
+It is **observe-only**: it never kills, throttles, or gates anything (reclaiming
+processes is the reapers' job). Read it at `GET /resources/footprint` → `{ enabled,
+latest: { total, byKind, rssBytes }, trend, overThreshold, samples }`. It ships dark
+(rides the developmentAgent gate, so it dogfoods on a dev agent before any fleet
+rollout) and every reading path fails safe (a failed scan keeps the last sample rather
+than crashing). An optional threshold heads-up exists but is **off by default**
+(measure first). It registers in the guard posture, so `GET /guards` shows whether it is
+on.
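
As a rough illustration of reading the endpoint described above, the sketch below requests `GET /resources/footprint` and turns the response into a one-line summary. The helper names (`readFootprint`, `describeFootprint`, `FetchLike`), the base URL, and the token handling are assumptions made for this example. The response fields follow the shape quoted in the docs, and the real payload may carry more fields than are typed here.

```typescript
// Response shape as quoted in the docs above; the field types are inferred.
type FootprintKind = 'agent-cli' | 'mcp' | 'other-node';

interface FootprintSample {
  ts: number;
  total: number;
  byKind: Record<FootprintKind, number>;
  rssBytes: number;
}

interface FootprintStatus {
  enabled: boolean;
  latest: FootprintSample | null;
  trend: 'rising' | 'stable' | 'falling' | 'insufficient-data';
  overThreshold: boolean;
  samples: FootprintSample[];
}

// Minimal fetch-like signature so the sketch does not depend on one runtime.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ status: number; json(): Promise<unknown> }>;

// Hypothetical helper: resolves to null when the monitor is disabled (HTTP 503).
async function readFootprint(
  fetchFn: FetchLike,
  baseUrl: string, // assumption: e.g. "http://localhost:<port>"
  token: string, // assumption: the Bearer token used for the other routes
): Promise<FootprintStatus | null> {
  const res = await fetchFn(`${baseUrl}/resources/footprint`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (res.status === 503) return null; // monitor disabled
  if (res.status !== 200) throw new Error(`unexpected status ${res.status}`);
  return (await res.json()) as FootprintStatus;
}

// Turn a status into a one-line, human-readable summary.
function describeFootprint(status: FootprintStatus | null): string {
  if (status === null) return 'footprint monitor disabled';
  if (status.latest === null) return 'no samples yet';
  const { total, byKind, rssBytes } = status.latest;
  const mib = Math.round(rssBytes / (1024 * 1024));
  return (
    `${total} processes (${byKind['agent-cli']} agent CLIs, ${byKind.mcp} MCP, ` +
    `${byKind['other-node']} other node), ${mib} MiB RSS, trend ${status.trend}`
  );
}
```

On a runtime with a global `fetch`, a call such as `describeFootprint(await readFootprint(fetch, baseUrl, token))` would be one way to use it.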

src/core/PostUpdateMigrator.ts

Lines changed: 2 additions & 1 deletion
@@ -6933,7 +6933,8 @@ Create worktrees for collaborator repos with \`instar worktree create <branch>\`
 **Resource Usage (CPU + memory)** — Your ResourceLedger now continuously samples CPU% and memory (RSS) for your server process and every running session, alongside the existing durable rate-limit-event record. Read-only observability — it never gates.
 - Current + windowed (avg/peak) usage per source plus an aggregate: \`curl -H "Authorization: Bearer $AUTH" "http://localhost:${port}/resources/summary?sinceHours=1"\` → \`{ sampleCount, sources: [{ source, currentCpuPercent, currentRssBytes, avgCpuPercent, peakCpuPercent, peakRssBytes, ... }] }\` (\`source\` is \`agent-server\`, \`session:<id>\`, or \`aggregate\`). Recent raw samples: \`GET /resources/samples?sinceHours=1&source=aggregate&limit=20\`.
 - The dashboard "Resource Usage" tab renders all of this in plain language.
-- **When to use** (PROACTIVE): when the user asks "how much CPU / memory am I using right now?", "what's eating resources?", or "is this agent heavy?" → \`GET /resources/summary\` (or point them at the Resource Usage dashboard tab). Read the durable numbers instead of guessing. (Spec: \`docs/specs/per-agent-resource-ledger.md\`.)
+- **Process footprint** (the climb measurement): a per-machine count of your processes — agent CLIs + the heavy, mostly-idle MCP servers (a whole Chromium for Playwright, an Electron) + other node — sampled on an interval with a rolling-window TREND. The signal that was MISSING when steady-state process accumulation went unwatched until the host hit a kernel limit and panicked. \`curl -H "Authorization: Bearer $AUTH" "http://localhost:${port}/resources/footprint"\` → \`{ enabled, latest: { total, byKind, rssBytes }, trend, overThreshold, samples }\`. Observe-only (never kills/gates); ships dark (developmentAgent gate); the threshold heads-up is opt-in (\`monitoring.processFootprintMonitor.alertEnabled\`). 503 when disabled.
+- **When to use** (PROACTIVE): when the user asks "how much CPU / memory am I using right now?", "what's eating resources?", or "is this agent heavy?" → \`GET /resources/summary\` (or point them at the Resource Usage dashboard tab). When asked "how many processes am I running?" / "is the footprint climbing toward another crash?" → \`GET /resources/footprint\`. Read the durable numbers instead of guessing. (Spec: \`docs/specs/per-agent-resource-ledger.md\`.)
 `;
 content += '\n' + section;
 patched = true;

src/core/types.ts

Lines changed: 17 additions & 0 deletions
@@ -4617,6 +4617,23 @@ export interface MonitoringConfig {
     /** Retain CPU/mem samples for this many days; older rows pruned (default: 7). */
     retentionDays?: number;
   };
+  /**
+   * ProcessFootprintMonitor — observe-only per-machine process-count measurement
+   * (the climb signal missing before the 2026-06-26 resource-exhaustion panic).
+   * `enabled` undefined resolves via the developmentAgent gate (live on echo, dark
+   * on the fleet); `false` → null + `/resources/footprint` 503s. Never gates.
+   */
+  processFootprintMonitor?: {
+    enabled?: boolean;
+    /** Sampling cadence (ms) (default: 5min). */
+    sampleIntervalMs?: number;
+    /** Rolling-window size for the trend (default: 288 = 24h at 5-min cadence). */
+    windowSamples?: number;
+    /** Process count at/over which the (opt-in) heads-up fires (default: 220; 0 disables). */
+    alertThreshold?: number;
+    /** Opt-in heads-up — measure first (default: false). */
+    alertEnabled?: boolean;
+  };
   /**
    * ActiveWorkSilenceSentinel — topic-independent watchdog: a session that was
    * actively producing output goes silent for N minutes. Covers the gap left
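
The `processFootprintMonitor` comment in this hunk says an undefined `enabled` resolves through the developmentAgent gate, while an explicit `false` keeps the monitor off. A minimal sketch of that resolution might look like the following. The resolver name and the `isDevelopmentAgent` flag are hypothetical stand-ins (the actual gate wiring is not part of this hunk), and the defaults are the ones stated in the field comments.

```typescript
// Optional user-facing config, mirroring the fields added to MonitoringConfig.
interface FootprintUserConfig {
  enabled?: boolean;
  sampleIntervalMs?: number;
  windowSamples?: number;
  alertThreshold?: number;
  alertEnabled?: boolean;
}

// Fully resolved config with every field present.
interface FootprintResolvedConfig {
  enabled: boolean;
  sampleIntervalMs: number;
  windowSamples: number;
  alertThreshold: number;
  alertEnabled: boolean;
}

// Defaults as documented in the field comments.
const FOOTPRINT_DEFAULTS = {
  sampleIntervalMs: 5 * 60 * 1000, // 5 minutes
  windowSamples: 288, // 24h at a 5-minute cadence
  alertThreshold: 220,
  alertEnabled: false,
};

// Hypothetical resolver: `isDevelopmentAgent` stands in for the developmentAgent gate.
function resolveFootprintConfig(
  user: FootprintUserConfig | undefined,
  isDevelopmentAgent: boolean,
): FootprintResolvedConfig {
  return {
    // An explicit true/false wins; undefined falls back to the gate.
    enabled: user?.enabled ?? isDevelopmentAgent,
    sampleIntervalMs: user?.sampleIntervalMs ?? FOOTPRINT_DEFAULTS.sampleIntervalMs,
    windowSamples: user?.windowSamples ?? FOOTPRINT_DEFAULTS.windowSamples,
    alertThreshold: user?.alertThreshold ?? FOOTPRINT_DEFAULTS.alertThreshold,
    alertEnabled: user?.alertEnabled ?? FOOTPRINT_DEFAULTS.alertEnabled,
  };
}
```

With the defaults, the window covers `windowSamples * sampleIntervalMs` = 288 × 5 min = 24 h of history, which matches the comment on `windowSamples`.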
Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
+/**
+ * ProcessFootprintMonitor — the per-machine process-footprint measurement that
+ * was MISSING when steady-state process accumulation (multiple full agent stacks
+ * + their heavy MCP servers — a whole Chromium, an Electron) climbed unwatched
+ * until the host hit a kernel limit and panicked (2026-06-26, os_refcnt overflow).
+ *
+ * The host spawn-cap bounds INSTANTANEOUS spawn bursts; the idle-session reapers
+ * bound idle SESSIONS. Neither MEASURES the slow climb of the total process count.
+ * This monitor does exactly that and nothing more: on an interval it counts the
+ * instar-relevant processes on this machine, classifies them (agent CLIs vs MCP
+ * servers vs other node), keeps a bounded rolling window so a TREND is visible,
+ * and — only when explicitly enabled — raises ONE de-duplicated heads-up when the
+ * count crosses a threshold. It is OBSERVE-ONLY: it never kills, throttles, or
+ * gates anything (the reapers own reclamation). Ships DARK by default.
+ *
+ * Pure core: all process input is injected (`listProcesses`) so the classifier and
+ * trend logic are unit-testable without scanning the real host.
+ */
+
+import { execFileSync } from 'node:child_process';
+import { MCP_PROCESS_SIGNATURES } from './mcpProcessSignatures.js';
+import { withSyncOp } from '../core/InFlightSyncOpMarker.js';
+
+/** A live process as seen by the scanner (only the fields we classify on). */
+export interface FootprintProcess {
+  pid: number;
+  /** Full command line (argv joined) — matched against signatures/patterns. */
+  command: string;
+  /** Resident set size in bytes (0 if unknown). */
+  rssBytes: number;
+}
+
+export type FootprintKind = 'agent-cli' | 'mcp' | 'other-node';
+
+/** One point-in-time footprint reading. */
+export interface FootprintSample {
+  ts: number;
+  /** Total instar-relevant processes counted. */
+  total: number;
+  byKind: Record<FootprintKind, number>;
+  /** Summed RSS of the counted processes. */
+  rssBytes: number;
+}
+
+export interface ProcessFootprintMonitorConfig {
+  /** Master switch — DARK by default. When false, start() is a no-op. */
+  enabled: boolean;
+  sampleIntervalMs: number;
+  /** Ring-buffer size (how many samples of history to keep for the trend). */
+  windowSamples: number;
+  /**
+   * Total-process count at/above which a heads-up is raised. 0 disables the alert
+   * regardless of `alertEnabled`. The alert is observe-only (one attention item).
+   */
+  alertThreshold: number;
+  /** Alert is opt-in even when the monitor is enabled (measure first). */
+  alertEnabled: boolean;
+}
+
+export const DEFAULT_PROCESS_FOOTPRINT_MONITOR_CONFIG: ProcessFootprintMonitorConfig = {
+  enabled: false, // DARK by default (observe-only, but no reason to sample on the fleet yet)
+  sampleIntervalMs: 5 * 60 * 1000,
+  windowSamples: 288, // 24h at 5-min cadence
+  alertThreshold: 220, // the panic snapshot showed ~280 node refs; warn well before
+  alertEnabled: false, // opt-in: measure before paging
+};
+
+export interface ProcessFootprintMonitorDeps {
+  /** Returns the instar-relevant processes on this machine. Injected for tests. */
+  listProcesses: () => FootprintProcess[];
+  now?: () => number;
+  /** Observe-only heads-up sink (the attention queue). Absent ⇒ alert is inert. */
+  emitAttention?: (item: { id: string; title: string; body: string }) => void;
+}
+
+/**
+ * Production scanner: enumerate the host's processes via `ps`. Returns [] on any
+ * failure (fail-safe — a missing reading must never crash the monitor). The scan
+ * is off-hot-path (the monitor samples on a multi-minute interval, ships dark) and
+ * funnels through withSyncOp so the in-flight marker sees the blocking spawn.
+ */
+export function defaultListProcesses(): FootprintProcess[] {
+  let out: string;
+  try {
+    // lint-allow-blocking-scan: off-hot-path (multi-minute sampling interval, dark
+    // by default), bounded 15s timeout — same posture as the AgentWorktreeReaper's
+    // lsof scan. The monitor only READS process metadata; it never kills or gates.
+    out = withSyncOp(() => execFileSync('ps', ['-A', '-o', 'pid=,rss=,command='], {
+      encoding: 'utf-8', timeout: 15_000, maxBuffer: 32 * 1024 * 1024,
+    }));
+  } catch {
+    return []; // @silent-fallback-ok — no ps reading ⇒ no sample (keeps last)
+  }
+  const procs: FootprintProcess[] = [];
+  for (const line of out.split('\n')) {
+    const m = line.match(/^\s*(\d+)\s+(\d+)\s+(.*)$/);
+    if (!m) continue;
+    procs.push({ pid: Number(m[1]), rssBytes: Number(m[2]) * 1024 /* ps rss is KB */, command: m[3] });
+  }
+  return procs;
+}
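
To make the parsing step in `defaultListProcesses` concrete, here is a small standalone sketch that applies the same regular expression to canned `ps -A -o pid=,rss=,command=` text instead of spawning `ps`. The function names `parsePsOutput` and `scanOrNull` are made up for this example. `scanOrNull` is an illustrative variant, not the committed behavior: as written in the diff, a failed scan returns `[]`, which the monitor's `sample()` appears to record as a zero-count reading rather than keeping the previous one, so returning `null` is one way a caller could tell "scan failed" apart from "nothing matched".

```typescript
interface FootprintProcess {
  pid: number;
  command: string;
  rssBytes: number;
}

// Parse text in the `pid rss command` layout; lines that do not match are skipped.
function parsePsOutput(out: string): FootprintProcess[] {
  const procs: FootprintProcess[] = [];
  for (const line of out.split('\n')) {
    const m = line.match(/^\s*(\d+)\s+(\d+)\s+(.*)$/);
    if (!m) continue;
    procs.push({
      pid: Number(m[1]),
      rssBytes: Number(m[2]) * 1024, // ps reports rss in KiB
      command: m[3] ?? '',
    });
  }
  return procs;
}

// Hypothetical variant: null means "the scan itself failed", [] means "no rows".
function scanOrNull(runPs: () => string): FootprintProcess[] | null {
  try {
    return parsePsOutput(runPs());
  } catch {
    return null;
  }
}
```

For the input `"  101  2048 node /opt/app/server.js"`, the parser yields pid `101`, `rssBytes` of 2048 × 1024 = 2097152, and the full command string.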
+
+/** Classify a single process. Returns null for processes we don't count. */
+export function classifyFootprintProcess(p: FootprintProcess): FootprintKind | null {
+  const cmd = (p.command || '').toLowerCase();
+  if (!cmd) return null;
+  // MCP servers — the heavy, mostly-idle ones (Chromium for Playwright, Electron,
+  // mcp-remote bridges). Matched via the SAME allow-listed signatures the reaper uses.
+  for (const sig of MCP_PROCESS_SIGNATURES) {
+    if (sig.commandIncludesAll.every((needle) => cmd.includes(needle.toLowerCase()))) {
+      return 'mcp';
+    }
+  }
+  // Agent CLIs — the per-session reasoning processes.
+  if (/\b(claude|codex|gemini)\b/.test(cmd) && !cmd.includes('grep')) return 'agent-cli';
+  // Other instar node processes (servers, lifelines, MCP wrappers not matched above).
+  if (/\bnode\b/.test(cmd) || cmd.includes('/.instar/') || cmd.includes('instar/dist')) return 'other-node';
+  return null;
+}
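
The classifier is first-match-wins: MCP signatures are checked before the agent-CLI pattern, which is checked before the generic node pattern. The sketch below reproduces that ordering in a self-contained form so the precedence can be tried on sample command lines. `EXAMPLE_SIGNATURES` is a stand-in invented for this example, because the real `MCP_PROCESS_SIGNATURES` entries are not shown in this diff, and `classifyCommand` is a hypothetical name.

```typescript
type FootprintKind = 'agent-cli' | 'mcp' | 'other-node';

// Stand-in for MCP_PROCESS_SIGNATURES. These needles are assumptions for
// illustration only; the real allow-list lives in mcpProcessSignatures.
const EXAMPLE_SIGNATURES: { commandIncludesAll: string[] }[] = [
  { commandIncludesAll: ['playwright', 'mcp'] },
];

// Same check order as the diff: MCP signature, then agent CLI, then other node.
function classifyCommand(command: string): FootprintKind | null {
  const cmd = command.toLowerCase();
  if (!cmd) return null;
  for (const sig of EXAMPLE_SIGNATURES) {
    if (sig.commandIncludesAll.every((needle) => cmd.includes(needle.toLowerCase()))) {
      return 'mcp';
    }
  }
  if (/\b(claude|codex|gemini)\b/.test(cmd) && !cmd.includes('grep')) return 'agent-cli';
  if (/\bnode\b/.test(cmd) || cmd.includes('/.instar/') || cmd.includes('instar/dist')) {
    return 'other-node';
  }
  return null;
}
```

Because the agent-CLI pattern is a word-boundary match on the whole command line, any command whose path or arguments contain `claude`, `codex`, or `gemini` as a separate token would land in `agent-cli` unless an MCP signature claims it first. That trade-off is worth keeping in mind when reading the `byKind` split.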
+
+/** Build a footprint sample from a process list (pure). */
+export function buildFootprintSample(procs: FootprintProcess[], ts: number): FootprintSample {
+  const byKind: Record<FootprintKind, number> = { 'agent-cli': 0, mcp: 0, 'other-node': 0 };
+  let total = 0;
+  let rssBytes = 0;
+  for (const p of procs) {
+    const kind = classifyFootprintProcess(p);
+    if (!kind) continue;
+    byKind[kind]++;
+    total++;
+    rssBytes += Math.max(0, p.rssBytes || 0);
+  }
+  return { ts, total, byKind, rssBytes };
+}
+
+export interface FootprintStatus {
+  enabled: boolean;
+  latest: FootprintSample | null;
+  /** Direction over the window: rising if the latest exceeds the window median by
+   * a margin, falling if below, else stable. Coarse on purpose. */
+  trend: 'rising' | 'stable' | 'falling' | 'insufficient-data';
+  windowSize: number;
+  alertThreshold: number;
+  alertEnabled: boolean;
+  /** True while the most recent sample is at/over the threshold. */
+  overThreshold: boolean;
+  samples: FootprintSample[];
+}
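
The `trend` field compares the latest total against the window median with a margin. A standalone sketch of that rule, using the constants that appear in this commit's `computeTrend` (at least 4 samples, margin of `max(2, ceil(15% of median))`, upper-middle element as the median for even-length windows), is shown below. The free function `trendOf` is a hypothetical name that takes plain totals rather than the monitor's ring buffer.

```typescript
type Trend = 'rising' | 'stable' | 'falling' | 'insufficient-data';

// Median-plus-margin trend over a window of process totals (oldest first).
function trendOf(totals: number[]): Trend {
  if (totals.length < 4) return 'insufficient-data';
  const sorted = [...totals].sort((a, b) => a - b);
  // Upper-middle element for even lengths: coarse on purpose.
  const median = sorted[Math.floor(sorted.length / 2)] ?? 0;
  const latest = totals[totals.length - 1] ?? 0;
  const margin = Math.max(2, Math.ceil(median * 0.15));
  if (latest >= median + margin) return 'rising';
  if (latest <= median - margin) return 'falling';
  return 'stable';
}
```

Worked through for `[100, 102, 101, 130]`: the sorted window is `[100, 101, 102, 130]`, the median is 102, and the margin is `ceil(15.3) = 16`, so the latest value 130 clears 118 and the trend is `rising`. With `[100, 101, 102, 103]` the latest value 103 stays inside the 86 to 118 band and the trend is `stable`.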
+
+export class ProcessFootprintMonitor {
+  private readonly cfg: ProcessFootprintMonitorConfig;
+  private readonly deps: ProcessFootprintMonitorDeps;
+  private readonly now: () => number;
+  private ring: FootprintSample[] = [];
+  private timer?: NodeJS.Timeout;
+  /** Per-episode alert latch: one heads-up per threshold-crossing episode. */
+  private alerted = false;
+
+  constructor(deps: ProcessFootprintMonitorDeps, cfg?: Partial<ProcessFootprintMonitorConfig>) {
+    this.deps = deps;
+    this.cfg = { ...DEFAULT_PROCESS_FOOTPRINT_MONITOR_CONFIG, ...(cfg ?? {}) };
+    this.now = deps.now ?? (() => Date.now());
+  }
+
+  start(): void {
+    if (this.timer || !this.cfg.enabled) return;
+    this.sample(); // one immediate reading
+    this.timer = setInterval(() => this.sample(), this.cfg.sampleIntervalMs);
+    if (typeof this.timer.unref === 'function') this.timer.unref();
+  }
+
+  stop(): void {
+    if (this.timer) { clearInterval(this.timer); this.timer = undefined; }
+  }
+
+  /** Take one reading (also callable directly in tests). Returns the sample. */
+  sample(): FootprintSample {
+    let procs: FootprintProcess[];
+    try { procs = this.deps.listProcesses(); }
+    catch { return this.ring[this.ring.length - 1] ?? buildFootprintSample([], this.now()); } // fail-safe: keep last
+    const s = buildFootprintSample(procs, this.now());
+    this.ring.push(s);
+    while (this.ring.length > this.cfg.windowSamples) this.ring.shift();
+    this.maybeAlert(s);
+    return s;
+  }
+
+  private maybeAlert(s: FootprintSample): void {
+    if (!this.cfg.alertEnabled || this.cfg.alertThreshold <= 0) return;
+    if (s.total >= this.cfg.alertThreshold) {
+      if (!this.alerted && this.deps.emitAttention) {
+        this.alerted = true; // one per episode
+        this.deps.emitAttention({
+          id: 'process-footprint:over-threshold',
+          title: `Process footprint high (${s.total} processes)`,
+          body: `This machine is running ${s.total} instar-relevant processes ` +
+            `(${s.byKind['agent-cli']} agent CLIs, ${s.byKind.mcp} MCP servers, ` +
+            `${s.byKind['other-node']} other node) — at/over the ${this.cfg.alertThreshold} ` +
+            `heads-up threshold. Steady-state process accumulation is the footprint that ` +
+            `preceded the resource-exhaustion panic; consider offloading idle MCP servers ` +
+            `or consolidating agent stacks.`,
+        });
+      }
+    } else if (s.total < this.cfg.alertThreshold * 0.9) {
+      this.alerted = false; // re-arm with hysteresis once it recovers
+    }
+  }
+
+  private computeTrend(): FootprintStatus['trend'] {
+    if (this.ring.length < 4) return 'insufficient-data';
+    const totals = this.ring.map((s) => s.total).slice().sort((a, b) => a - b);
+    const median = totals[Math.floor(totals.length / 2)];
+    const latest = this.ring[this.ring.length - 1].total;
+    const margin = Math.max(2, Math.ceil(median * 0.15));
+    if (latest >= median + margin) return 'rising';
+    if (latest <= median - margin) return 'falling';
+    return 'stable';
+  }
+
+  status(): FootprintStatus {
+    return {
+      enabled: this.cfg.enabled,
+      latest: this.ring[this.ring.length - 1] ?? null,
+      trend: this.computeTrend(),
+      windowSize: this.ring.length,
+      alertThreshold: this.cfg.alertThreshold,
+      alertEnabled: this.cfg.alertEnabled,
+      overThreshold: (this.ring[this.ring.length - 1]?.total ?? 0) >= this.cfg.alertThreshold && this.cfg.alertThreshold > 0,
+      samples: this.ring.slice(),
+    };
+  }
+}
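
The heads-up in `maybeAlert` is latched per episode and re-arms only once the count drops below 90% of the threshold, so a count hovering around the threshold does not produce repeated alerts. The sketch below isolates that hysteresis as a pure step function. The names `AlertLatch` and `shouldEmit` are hypothetical, and the sketch assumes the alert is enabled and an attention sink is present (in the class, both are preconditions for the latch to engage).

```typescript
interface AlertLatch {
  alerted: boolean;
}

// Returns true when a heads-up would be emitted for this reading.
// Mutates the latch: set on the first crossing, cleared below 90% of threshold.
function shouldEmit(latch: AlertLatch, total: number, threshold: number): boolean {
  if (threshold <= 0) return false; // a threshold of 0 disables the alert
  if (total >= threshold) {
    if (latch.alerted) return false; // already alerted in this episode
    latch.alerted = true;
    return true;
  }
  if (total < threshold * 0.9) latch.alerted = false; // re-arm after recovery
  return false;
}

// Replay a series of totals against a fresh latch (helper for the walkthrough).
function replay(totals: number[], threshold: number): boolean[] {
  const latch: AlertLatch = { alerted: false };
  return totals.map((t) => shouldEmit(latch, t, threshold));
}
```

With the default threshold of 220 the re-arm level is 198. Replaying `[210, 225, 230, 205, 221, 190, 222]` emits on 225, stays quiet on 230, does not re-arm at 205 (still at or above 198), stays quiet on 221, re-arms at 190, and emits again on 222.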

src/monitoring/guardManifest.ts

Lines changed: 10 additions & 0 deletions
@@ -348,6 +348,16 @@ export const GUARD_MANIFEST: readonly GuardManifestEntry[] = [
     component: 'ResourceLedger',
     description: 'CPU/memory sampling + rate-limit-event ledger (read-only observability).',
   },
+  {
+    key: 'monitoring.processFootprintMonitor.enabled',
+    kind: 'config',
+    configPath: 'monitoring.processFootprintMonitor.enabled',
+    defaultEnabled: false, // dark on the fleet; ON for dev agents via the developmentAgent gate
+    process: 'server',
+    expectRuntime: false,
+    component: 'ProcessFootprintMonitor',
+    description: 'Per-machine process-footprint count + trend (observe-only; the climb measurement missing before the 2026-06-26 panic).',
+  },
   {
     key: 'monitoring.memoryMonitoring',
     kind: 'config',
