Add OOM prioritization and graceful degradation #159
Open
gnguralnick wants to merge 40 commits into
added 11 commits
June 10, 2026 11:22
Introduces a memory-watchdog service that classifies the container's process tree into eight OOM-priority tiers, keeps each process's oom_score_adj in sync, and sheds whole tiers (most-expendable first) under sustained memory pressure -- so the container degrades gracefully instead of at the kernel's whim, and the user's interface and agents are protected.
- libs/memory_watchdog: tier classifier, /proc + tmux probes, oom_score_adj tagger, whole-tier shedder, shed-event ledger and published status file, and a watchdog loop that also supervises bootstrap/telegram/terminal (the mirror of bootstrap restarting the watchdog -- a closed recovery loop).
- bootstrap: restart backoff + crash-loop circuit breaker; paused services are recorded in the ledger and surfaced in the banner.
- Agent labels: user-created chats/initial-chat tagged user_created=true, workers tagged agent_created=true, driving tier 5 vs tier 7.
- SessionStart hook injects a "you were stopped for memory" notice into a revived agent; dead-worker-recovery doc gains ledger-check guidance.
- system_interface: /api/memory-status endpoint + a calm pressure banner that appears only under sustained pressure.
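The oom_score_adj tagging step can be sketched roughly as below. This is a minimal illustration, not the package's tagger: the function name `tag_oom_score_adj`, the `proc_root` parameter (added so the sketch can be exercised against a temporary directory), and the positive-only range check are assumptions.

```python
from pathlib import Path


def tag_oom_score_adj(pid: int, score: int, proc_root: str = "/proc") -> bool:
    """Write `score` to <proc_root>/<pid>/oom_score_adj if it differs.

    Returns True when a write happened, False when the value was already in
    sync or the process could not be tagged (for example, it already exited).
    Hypothetical helper: only non-negative scores are accepted here.
    """
    if not 0 <= score <= 1000:
        raise ValueError("score must be between 0 and 1000")
    target = Path(proc_root) / str(pid) / "oom_score_adj"
    try:
        if target.read_text().strip() == str(score):
            return False  # already in sync, nothing to write
        target.write_text(str(score))
        return True
    except OSError:
        # The process vanished between the snapshot and the write, or the
        # write was not permitted; the caller can retry on the next poll.
        return False
```

Returning a boolean rather than raising keeps a polling loop simple: a process that exits mid-pass is an expected race, not an error.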
The runtime path layout was hardcoded independently in the watchdog, the system interface, and the notice hook, and the MEMORY_WATCHDOG_RUNTIME_DIR override was honored only by the writer. Make memory_watchdog.ledger the single source: it resolves relative to MNGR_AGENT_WORK_DIR and honors the override; the system interface now imports status_path() rather than rebuilding it; and the stdlib-only hook honors the same base and override so producer and consumers can't resolve to different files.
Problem: system_probe.py had a public read_agent_label_sets() that was a pure pass-through over the private _read_agent_label_sets(), unlike the module's other probes which are public directly -- needless indirection. Fix: merged the two into a single public read_agent_label_sets(); the public name and signature are unchanged so the watchdog import still works.
Problem: ShedRecord.timestamp was documented as 'Nanosecond-precision'
but the producing format strftime('...%f000Z') yields microseconds (%f)
with a literal '000' pad, so the last three digits are always zero --
the resolution is microsecond, not nanosecond.
Fix: updated the field description to say microsecond-precision; the
timestamp string format is unchanged (it still sorts and parses correctly).
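A short check of the claim above, using a fixed instant so the result is reproducible. The format string is the one quoted in the later commit message; the variable names are illustrative only.

```python
from datetime import datetime, timezone

# A fixed instant with a non-zero microsecond component.
moment = datetime(2026, 6, 10, 11, 22, 33, 123456, tzinfo=timezone.utc)

# %f expands to exactly six microsecond digits; the literal "000" only pads
# the fraction to nine digits, it adds no real precision.
stamp = moment.strftime("%Y-%m-%dT%H:%M:%S.%f000Z")
print(stamp)  # 2026-06-10T11:22:33.123456000Z

fraction = stamp.split(".")[1].rstrip("Z")
print(len(fraction), fraction[-3:])  # 9 000
```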
Problem: The nanosecond-precision ISO timestamp format "%Y-%m-%dT%H:%M:%S.%f000Z" was hand-rolled and duplicated across ledger.py, shedder.py, and watchdog.py, reimplementing imbue_common.format_nanosecond_iso_timestamp (already a dependency, and the helper mngr uses for event/discovery timestamps) via a fragile %f-plus-literal-000 trick. Fix: Added a single in-package now_iso_timestamp() helper in data_types.py that delegates to format_nanosecond_iso_timestamp, plus one ISO_TIMESTAMP_FORMAT constant for the strptime parser, and routed all three producers and the parser through them so producers and reader cannot drift. Output is byte-identical, so the ledger format and the SessionStart hook's string comparisons are unchanged.
Problem: ShedRecord.timestamp in libs/memory_watchdog/data_types.py described itself as 'Microsecond-precision', but the value is produced by now_iso_timestamp() -> format_nanosecond_iso_timestamp(), which emits nanosecond precision (matching ISO_TIMESTAMP_FORMAT). The description was the only timestamp field in the module that misstated its precision. Fix: Change the field description to 'Nanosecond-precision' so the documentation matches the actual format. Description-string only; no runtime, schema, or serialization change.
This reverts commit 77f1843.
…cond" This reverts commit b52044a.
…ation # Conflicts: # uv.lock
gnguralnick
commented
Jun 16, 2026
added 3 commits
June 22, 2026 15:35
…atchdog onto supervisord
origin/main replaced the tmux-based service manager (services.toml + a bootstrap restart loop + per-service svc-<name> windows) with supervisord, and retired the telegram service. The OOM watchdog was built directly on that old model, so this merge re-architects it to fit supervisord:
- memory-watchdog is now a [program:*] in supervisord.conf instead of a services.toml entry.
- The classifier no longer tiers services by tmux window name (every service would have collapsed into the single "bootstrap" supervisord pane and never been shed). It now walks supervisord's child processes and tiers each by its command line; the services tmux session is derived from supervisord's pane ancestor rather than an unreliable `tmux display-message`.
- Dropped the bootstrap crash-loop circuit breaker (supervisord owns restarts now) and the watchdog's tmux window-supervision of bootstrap/telegram/terminal (supervisord owns liveness; telegram is retired). blocked_services plumbing is kept but dormant, reserved for a future supervisorctl-based crash-loop signal.
- Re-ported the /api/memory-status endpoint from the old FastAPI server onto the new Flask server; the frontend memory-pressure banner merged cleanly.
The OOM core (tier classification, oom_score_adj tagging, tier shedding, ledger, status file, the SessionStart shed-notice hook, and user_created/agent_created agent labeling) is unchanged -- it never depended on how services are launched.
gnguralnick
commented
Jun 23, 2026
gnguralnick
commented
Jun 23, 2026
gnguralnick
commented
Jun 24, 2026
gnguralnick
commented
Jun 24, 2026
added 11 commits
June 24, 2026 15:15
…via agent env; refresh banner copy and docs
- Extract the ledger/status on-disk layout into a dependency-free memory_watchdog.paths module, imported by the ledger, the system interface status reader, and the revival hook, so the layout cannot drift between producer and consumers.
- Resolve the services tmux session from MNGR_AGENT_NAME rather than walking supervisord's pane ancestor, removing find_services_session_name and the tmux current-session fallback.
- Update the memory-pressure banner copy.
- Fix stale pre-supervisord references (services.toml, svc-<name> windows, watchdog-as-supervisor) in OOM_DRILL.md, data_types.py, and the ledger docstring.
…ity indicator
- Over-shedding: the shed loop re-read /proc/meminfo immediately after SIGKILL, before the kernel reclaimed the freed pages, so it escalated through cheaper tiers into the user's own agents even when shedding a single large agent-child had already freed enough. Replace the synchronous re-read with a pure select_tiers_to_shed that projects each tier's reclaim from its processes' resident memory and stops escalating once that clears the relief threshold; the next poll corrects any under-estimate.
- supervisord detection: the image launches supervisord as 'python3 /usr/bin/supervisord', so argv[0]'s basename is the interpreter and the classifier never recognized it, leaving every service in the protected infrastructure tier (never shed). Match the basename of either of the first two argv tokens.
- UI: the activity indicator rendered a cached 'Thinking...' from the agent-state websocket even after that socket disconnected, so a killed or finished agent could stay pinned on 'Thinking...' indefinitely. Show a muted 'Reconnecting...' whenever the socket is down; the server pushes a fresh snapshot on reconnect.
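The supervisord detection fix can be illustrated with a small sketch. The function name `is_supervisord` is hypothetical; only the rule (match the basename of either of the first two argv tokens) comes from the commit message.

```python
import os


def is_supervisord(argv: list[str]) -> bool:
    """Return True if either of the first two argv tokens is named supervisord.

    Checking only argv[0] misses 'python3 /usr/bin/supervisord', where the
    first token is the interpreter rather than the program.
    """
    return any(os.path.basename(token) == "supervisord" for token in argv[:2])


print(is_supervisord(["python3", "/usr/bin/supervisord"]))  # True
print(is_supervisord(["/usr/bin/supervisord", "-c", "supervisord.conf"]))  # True
print(is_supervisord(["python3", "hog.py"]))  # False
```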
The banner dumped the raw shed labels inline. Replace that with a calm count of how much background work was paused, plus a chevron that expands an itemized list -- each entry naming what was paused, its kind (agent subprocess / worker / service / agent), and how much memory it freed.
…agent
Shed entries showed only the interpreter name ("python3") with no owner.
Now the classifier labels an agent subprocess past the interpreter
("python3 hog.py", "pytest") and tags it with the agent whose session it
ran under, threaded through the shed records, status file, and API. The
banner dropdown shows e.g. "python3 hog.py — Agent subprocess from alice".
owning_agent_name is distinct from the revival-notice agent_name, so
attributing a subprocess never implies the agent itself was shed.
- Render the paused items as a table (Process / Creator / Freed) instead of an inline list; the creator column is just the owning agent's name for a subprocess (e.g. 'hogtest') rather than 'Agent subprocess from hogtest'.
- Fix the expand pushing the chat input and terminal button off-screen: give the workspace pane min-h-0 + overflow-hidden so it shrinks when the banner grows, and cap the dropdown at 38vh with internal scroll.
…nly focus ring The hover underline collided awkwardly with the chevron and the click focus box looked boxed-in. Replace with a faint hover chip behind the count, and show a focus ring only for keyboard navigation.
The revival SessionStart hook resolved the shed ledger relative to each agent's own MNGR_AGENT_WORK_DIR. The watchdog (writer) and system interface run under the system-services agent (work dir /mngr/code), but a worker runs in its own worktree, so its hook read a nonexistent worktree-local ledger and never told the revived worker it had been paused. Pin MEMORY_WATCHDOG_RUNTIME_DIR in the agent env (same mechanism as TICKETS_DIR) so writer and every reader share one ledger. Also persist owning_agent_name in the shed ledger so the durable record is not lossier than the live status file.
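The resolution rule that makes writer and readers agree can be sketched as follows. This is an assumption-laden illustration: the function name `runtime_dir`, the `runtime/memory_watchdog` fallback layout, and the example worktree path are hypothetical stand-ins, not the contents of memory_watchdog.paths.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def runtime_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the watchdog runtime directory (hypothetical helper).

    A pinned MEMORY_WATCHDOG_RUNTIME_DIR wins; otherwise fall back to a path
    under the agent's own work dir, which differs per worktree.
    """
    env = os.environ if env is None else env
    override = env.get("MEMORY_WATCHDOG_RUNTIME_DIR")
    if override:
        return Path(override)
    work_dir = env.get("MNGR_AGENT_WORK_DIR", ".")
    return Path(work_dir) / "runtime" / "memory_watchdog"


# Without the pin, a worker resolves to its own worktree and misses the ledger.
worker_env = {"MNGR_AGENT_WORK_DIR": "/mngr/worktree/worker-example"}
print(runtime_dir(worker_env))

# With the pin in the agent env, every reader lands on the shared location.
pinned_env = {**worker_env, "MEMORY_WATCHDOG_RUNTIME_DIR": "/mngr/code/runtime/memory_watchdog"}
print(runtime_dir(pinned_env))
```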
Shedding a whole tier to reclaim one large hog took down everything else in that tier -- the agent's transcript streamer, its lead's report poll, bare sleeps -- freeing almost nothing but breaking observability and coordination for every agent involved. Select and kill individual processes instead, ordered by tier shed-priority then resident size, stopping the instant the projected reclaim clears relief. A single hog is now shed on its own. Also never shed a process below a small resident floor (10 MiB): killing it cannot meaningfully relieve pressure, so doing so is pure collateral.
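A minimal sketch of per-process selection under the rules described above. The `Proc` type, the function name, and the tier number 8 for agent subprocesses are assumptions for illustration; the ordering (tier shed-priority, then resident size), the early stop, and the 10 MiB floor come from the commit message.

```python
from dataclasses import dataclass

MIN_RSS_BYTES = 10 * 1024 * 1024  # 10 MiB resident floor


@dataclass(frozen=True)
class Proc:
    pid: int
    tier: int
    rss_bytes: int


def select_processes_to_shed(procs, bytes_needed, shed_order=(8, 7, 6, 5)):
    """Pick individual processes until projected reclaim covers bytes_needed.

    shed_order lists sheddable tiers, most expendable first (tier 8 for agent
    subprocesses is an assumed number). Tiers not listed are never shed.
    """
    rank = {tier: index for index, tier in enumerate(shed_order)}
    candidates = [p for p in procs if p.tier in rank and p.rss_bytes >= MIN_RSS_BYTES]
    # Most expendable tier first; within a tier, largest resident size first.
    candidates.sort(key=lambda p: (rank[p.tier], -p.rss_bytes))
    chosen, projected = [], 0
    for proc in candidates:
        if projected >= bytes_needed:
            break  # projected reclaim already clears relief
        chosen.append(proc)
        projected += proc.rss_bytes
    return chosen


GIB, MIB = 1024**3, 1024**2
procs = [
    Proc(pid=101, tier=8, rss_bytes=2 * GIB),    # the hog
    Proc(pid=102, tier=8, rss_bytes=4 * MIB),    # tiny helper, under the floor
    Proc(pid=201, tier=7, rss_bytes=500 * MIB),  # worker agent
    Proc(pid=301, tier=5, rss_bytes=800 * MIB),  # user-created agent
]
print([p.pid for p in select_processes_to_shed(procs, bytes_needed=1 * GIB)])  # [101]
```

Because the function is pure (no kills, no /proc reads), it can be unit-tested directly, and any under-estimate is left for the next poll to correct.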
added 14 commits
June 25, 2026 13:01
…tall
When the watchdog pauses a worker's own agent, the worker never reports, so the lead's background poll would just sit until its 30-minute timeout -- and the lead had no way to tell 'paused for memory' from 'still working'. Have the await poll watch the shed ledger (via the watchdog's own path module, --name opt-in) and, on seeing the worker's agent shed, return a distinct code with an actionable message: revive with 'mngr start <name> --restart' (a plain message/start will not relaunch a shed agent), then re-send the task. Document the same in lead-proxy.md.
A worker shed mid memory-heavy task would, on revival, often just re-run the same command and be shed again. Extend the notice to say so explicitly: don't blindly re-run a memory-intensive task; find a lower-memory approach first (smaller batches, streaming, releasing data) and only retry if you can.
…pers An agent's mngr machinery -- the background-task loop, the transcript streamers it spawns (which feed the UI), and a lead's worker-report poll (create_worker.py await) -- shared the expendable agent-child tier with real work. So a memory shed took them down for ~nothing: the UI went dark and, worse, the lead's poll (its only signal that a worker was paused) died with the worker it watched. Classify these helpers as never-shed infrastructure by command pattern, regardless of depth, so the lead's eyes on a worker outlive the worker.
The await poll's shed check bounded on 'sheds after this poll started', which broke the realistic case: when a worker is shed its report poll dies too, the lead re-runs the poll, and that re-run -- started after the shed -- would ignore the very shed it should report. Use the revival hook's own pending-shed notion instead: a worker shed not yet followed by a notice_delivered (revival) marker. Works whether the poll survives the shed (now the common case, since the poll is protected) or is re-run after it.
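The pending-shed notion can be sketched over a JSONL ledger as below. The record fields `type` and `agent_name`, and the `"shed"` record type, are assumptions made for illustration; only the idea (a shed not yet followed by a notice_delivered marker) is taken from the commit message.

```python
import json


def has_pending_shed(ledger_lines, agent_name):
    """Return True if the agent's latest shed has no later notice_delivered.

    ledger_lines is an iterable of JSONL strings in append order. The field
    names used here are hypothetical, not the real ledger schema.
    """
    pending = False
    for line in ledger_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("agent_name") != agent_name:
            continue
        if record.get("type") == "shed":
            pending = True
        elif record.get("type") == "notice_delivered":
            pending = False
    return pending


ledger = [
    '{"type": "shed", "agent_name": "alice"}',
    '{"type": "notice_delivered", "agent_name": "alice"}',
    '{"type": "shed", "agent_name": "bob"}',
]
print(has_pending_shed(ledger, "alice"))  # False
print(has_pending_shed(ledger, "bob"))    # True
```

Unlike a "sheds after this poll started" bound, this check gives the same answer whether the poll survived the shed or was re-run after it.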
A lead following the skill easily omits --name, which silently disabled the shed-ledger watch (the poll then just waits out its timeout on a paused worker). Default the worker name to the task file's directory name -- every flow stages the task at runtime/<flow>/<NAME>/task.md where <NAME> is the worker's mngr agent name -- so the pause detection works without the flag. --name still overrides; a wrong-derived name simply never matches a ledger record, so it cannot false-fire.
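The name derivation amounts to taking the task file's parent directory name. A sketch, with `derive_worker_name` as a hypothetical function name and an example path that follows the runtime/<flow>/<NAME>/task.md convention described above:

```python
from pathlib import Path
from typing import Optional


def derive_worker_name(task_file: str, explicit_name: Optional[str] = None) -> str:
    """Use the explicit name when given, else the task file's directory name."""
    if explicit_name:
        return explicit_name
    return Path(task_file).parent.name


print(derive_worker_name("runtime/launch-task/alice/task.md"))         # alice
print(derive_worker_name("runtime/launch-task/alice/task.md", "bob"))  # bob
```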
Agents whose subprocess is shed mid-session only see exit 137 -- there is no push notification for that case (the revival notice only covers an agent's own restart). Observed workers diagnosed it by reading the shed ledger directly and never reached for the dealing-with-the-unexpected skill, so put the guidance in the always-loaded CLAUDE.md: suspect the watchdog on an unexplained exit 137, confirm via the ledger, and find a lower-memory approach rather than blindly re-running.
Problem: the watchdog module docstring, the supervisord program comment, and the README all still said the watchdog 'sheds whole tiers', but this branch changed shedding to be per-process (largest-first within each shed-ordered tier, stopping at the relief threshold, with a minimum-RSS floor). The wording contradicted the actual behavior and the module's own internal comments. Fix: reword all three to describe per-process shedding while keeping the tier ordering guarantee explicit.
… recovery docs Problem: dead-worker-recovery.md and OOM_DRILL.md described reviving a memory-shed worker with a plain 'mngr start <worker>' / 'mngr message', but this branch's code (create_worker.py await notice, lead-proxy.md) states a shed agent will only relaunch with 'mngr start <worker> --restart'. The shed-revival guidance therefore gave a command that does not work. Fix: use '--restart' in the shed-pressure revival paths of both docs and note that a plain start/message will not relaunch a shed agent. The generic non-shed restart path (a crashed-but-not-shed worker) is left on plain 'mngr start', which is correct there.
Problem: libs/memory_watchdog/README.md claimed the SessionStart notice hook (scripts/claude_shed_notice_hook.py) duplicates the on-disk path layout because it cannot import the package. The hook actually imports the shared, dependency-free memory_watchdog.paths helper (via sys.path) precisely to avoid duplicating the layout -- as the hook's own docstring states -- so the README contradicted the implementation. Fix: rewrite the Paths section to name memory_watchdog.paths as the single dependency-free source of truth (re-exported through ledger, imported by the system interface) and describe the hook as importing that same helper rather than duplicating the layout.
Problem: _read_memory_status in apps/system_interface server.py documents that a missing/unreadable/future-schema watchdog status file leaves the banner hidden rather than erroring, but its try/except only caught OSError and JSONDecodeError. A status file that parses to a non-dict (e.g. 'null') or carries a non-numeric value where a number is expected raised AttributeError/TypeError/ValueError out of the projection, surfacing as an HTTP 500 -- contradicting the documented invariant. Fix: guard the JSON parse and the projection into the response model together, broadening the caught exceptions to also include AttributeError, TypeError, and ValueError so any malformed status content falls back to the healthy (no-banner) response, still logged once (not silent). Added a regression test for a top-level non-dict status file.
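The guard described in the fix can be sketched like this. It is not the server's `_read_memory_status`: the response keys `under_pressure` and `shed_count` are hypothetical, and the real code projects into a response model rather than a dict.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("memory_status_sketch")

# Hypothetical "healthy" response: no pressure, so the banner stays hidden.
HEALTHY = {"under_pressure": False, "shed_count": 0}


def read_memory_status(path):
    """Read the status file, falling back to HEALTHY on any malformed content."""
    try:
        raw = json.loads(Path(path).read_text())
        # Parse and projection sit in the same try block: a top-level non-dict
        # raises AttributeError on .get, a bad number raises TypeError/ValueError.
        return {
            "under_pressure": bool(raw.get("under_pressure", False)),
            "shed_count": int(raw.get("shed_count", 0)),
        }
    except (OSError, json.JSONDecodeError, AttributeError, TypeError, ValueError) as exc:
        logger.warning("unreadable memory status %s: %s", path, exc)
        return dict(HEALTHY)
```

Keeping the projection inside the guarded block is the point of the fix: a file that parses as valid JSON can still be the wrong shape.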
Problem: README step 3 listed the shed order as 'subprocesses, then auxiliary services, then worker agents', but the implementation sheds worker agents (tier 7) before auxiliary services (tier 6) -- per SHEDDABLE_TIERS_IN_SHED_ORDER and the tier ranks. The README's own tier table and shedder.py already state the correct order; only this prose line was stale. Fix: swapped the two clauses so the prose matches the implemented order (subprocesses, then worker agents, then auxiliary services, then user agents).
Problem: the final else branch in the pane-classification pass was commented 'A prefixed session we cannot interpret', but that branch runs only when no agent name could be resolved -- which happens for NON-prefixed sessions, since every prefixed session resolves to a non-None agent name and takes the prior branch. The comment described the opposite condition. Fix: reword the comment to describe the actual case (a non-services session whose name lacks the agent prefix, protected like a user agent).
Problem: the "was the worker shed for memory pressure?" block in .agents/skills/launch-task/references/dead-worker-recovery.md grepped/cat'd the shed ledger and status file via repo-root-relative paths (runtime/memory_watchdog/...). This reference is executed by a lead agent whose cwd is its own worktree (/mngr/worktree/<lead>-<hash>/), not /mngr/code, so the relative paths resolved to a nonexistent worktree-local location: the grep silently matched nothing (falsely implying the worker was never shed, prompting a revive straight back into memory pressure) and the cat failed. The shared ledger lives at /mngr/code/runtime/memory_watchdog/... -- pinned via MEMORY_WATCHDOG_RUNTIME_DIR for the Python readers, but that env var does not affect a shell grep/cat. CLAUDE.md already documents this ledger with the absolute path for the same reason. Fix: use the absolute /mngr/code/runtime/memory_watchdog/... paths in that block, matching CLAUDE.md, with a short note on why relative would miss the file.
Problem: The watchdog README's tier table listed RECOVERY (tier 3) as containing 'bootstrap, this watchdog' and the Tier enum comment described it as 'the service manager and this watchdog'. The classifier never assigns either to RECOVERY: the bootstrap pane shell and supervisord (the service manager) are classified as INFRASTRUCTURE (tier 1), as asserted by classifier_test, and the only process mapped to RECOVERY is the watchdog itself. The stale wording predates the supervisord migration. Fix: List only the watchdog under RECOVERY in the README table, and note in the enum comment that supervisord and the bootstrap launcher are tier-1 infrastructure, so the docs match what the classifier actually does.
gnguralnick
commented
Jun 26, 2026
gnguralnick
commented
Jun 26, 2026
gnguralnick
commented
Jun 26, 2026
gnguralnick
commented
Jun 26, 2026
Comment on lines +8 to +10
1. Snapshots the process tree (`/proc`), the tmux panes, and the host's agent
   labels, then classifies every process into one of eight OOM-priority tiers
   (see `data_types.Tier`).
Contributor
this fundamentally seems like a weird way to make this -- can we not have the processes launched with the right priority in the first place?
having an extra service to watch and poll runs the risk of something suddenly allocating tons of memory and either getting killed when it shouldn't or causing something else to get killed when it shouldn't
Summary
When a workspace container runs out of memory, victim selection is currently at the kernel's whim -- it can silently kill an agent, a build, or even the system interface, with nothing recorded or recovered. This adds an ordered, observable degradation layer.
A new `memory-watchdog` service:
- Classifies the process tree into eight OOM-priority tiers from `/proc` and the tmux panes, keyed by service window and agent labels. Pane shells are always spared so windows survive shedding and bootstrap can still detect/restart service exits. An agent's own coordination/observability machinery -- the mngr background-task loop, the transcript streamers it spawns (which feed the UI), and a lead's `create_worker.py await` report poll -- is classified as never-shed infrastructure rather than as expendable agent subprocesses: each is tiny but load-bearing, so shedding one frees nothing while blinding the UI or severing lead↔worker coordination.
- Sets `oom_score_adj` per tier (positive-only, so no `CAP_SYS_RESOURCE` is needed) -- the kernel-level lever under the runc runtime (lima).
- Sheds individual processes under sustained pressure, most-expendable tier first and largest resident size first within a tier, projecting the reclaim from resident memory rather than re-reading `/proc` between kills (an immediate re-read over-sheds: a just-killed process's pages aren't reclaimed yet, so usage still looks full and the shedder escalates needlessly into the user's own agents). Processes below a small resident floor (10 MiB) are never shed -- killing one frees too little to matter, so it would be pure collateral. User-created agents are the last resort; tiers 1-4 are never shed. This is the real mechanism under gVisor, where in-container `oom_score_adj` can't steer the host's victim selection.
- Records every shed in a ledger (`/mngr/code/runtime/memory_watchdog/events/shed/events.jsonl`, backed up) and publishes a status file the UI reads. Each record carries the shed process's owning agent.

Liveness of the watchdog itself, and of every background service, is owned by supervisord: it restarts the watchdog if it dies, and restarts any service the watchdog sheds. The watchdog only decides what to shed; it does not supervise.
Supporting changes:
- User-created chats and the initial chat are tagged `user_created=true`; workers are tagged `agent_created=true`. These drive tier 5 (protected, last resort) vs tier 7; unlabeled agents default protectively to tier 5.
- A SessionStart hook (`scripts/claude_shed_notice_hook.py`) injects a one-time "you were stopped to relieve memory pressure; your background tasks were cancelled and not restarted; don't blindly re-run a memory-heavy task -- find a lower-memory approach first" notice into a revived agent, tracked by a delivery marker in the ledger. The watchdog (writer), the system-interface status reader, and every agent's hook resolve the one shared ledger via a pinned `MEMORY_WATCHDOG_RUNTIME_DIR` (`.mngr/settings.toml`) -- so the notice fires even for worker agents, whose work dir is a worktree rather than the main checkout.
- When a worker agent is shed, the lead's `create_worker.py await` report poll (now protected, so it survives the shed) detects it through the ledger -- the worker name is derived from the task-file path, so it works even without an explicit `--name` -- and returns a distinct exit code (75) with an actionable message: revive the worker with `mngr start <name> --restart` (a plain `mngr message`/`start` will not relaunch a shed agent), then re-send its task. This replaces a previous silent timeout where the lead lost its only signal that the worker had been paused.
- The dead-worker-recovery doc gains ledger-check guidance (… `--restart`; twice-shed escalates to the user).
- `CLAUDE.md` documents how any agent should react to an unexplained exit 137 (confirm against the ledger; find a lower-memory approach rather than re-running).
- `system_interface`: a `/api/memory-status` endpoint (which stays healthy and hides the banner rather than 500ing on a malformed status file) and a calm, non-alarming full-width strip above the workspace that appears only under sustained pressure (zero layout impact otherwise). Collapsed, it shows the reassuring message plus a count of paused background tasks; expanded, it shows a Process / Creator / Freed table -- each row naming the paused command (with an `×N` multiplier when several of a kind were shed), who it belonged to (an agent subprocess is attributed to its owning agent by name, otherwise a friendly tier label), and how much memory it freed, plus any blocked system services.

A manual OOM drill procedure is documented in `libs/memory_watchdog/OOM_DRILL.md`; the pure classification/shed/breaker/IO logic is unit-tested, and the shed-and-recover path was exercised end-to-end against live docker containers (worker subprocess shed → worker survives and reacts; worker agent shed → lead's poll reports it and revives the worker → revived worker receives the notice).
Scope notes
- … `--restart` docker args were added. This layer handles the in-container degradation that machinery can't see.
- … (`MemAvailable`), not swap, so under heavy swapping true pressure can be understated -- noted for a possible follow-up, not addressed here.
Testing
- `cd libs/memory_watchdog && uv run pytest`: 45 passed (process-granular shedding incl. the spare-the-helpers regression, the classifier helper-protection case, ledger persistence, status round-trip, ratchets).
- `uv run pytest .agents/skills/launch-task/scripts/create_worker_test.py`: 50 passed (shed-ledger detection, pending-shed vs already-revived, worker-name derivation, the await poll lifecycle).
- `cd apps/system_interface && uv run pytest -m "not tmux and not modal and not docker and not docker_sdk and not acceptance and not release"`: 487 passed (incl. the new memory-status malformed-file regression test). The one failure is the pre-existing macOS-only environment issue (`mngr observe` needs a pytest-enabled local profile; fails identically on `main`; passes in CI/Linux).
- Frontend (`npm run test` in `apps/system_interface/frontend`): 193 passed, including `MemoryPressureBanner` and the updated `ActivityIndicator` tests.
- `libs/bootstrap` + meta/per-lib ratchets pass.