
Add OOM prioritization and graceful degradation #159

Open
gnguralnick wants to merge 40 commits into main from gabriel/oom-prioritization

@gnguralnick commented Jun 12, 2026

Summary

When a workspace container runs out of memory, victim selection is currently at the kernel's whim -- it can silently kill an agent, a build, or even the system interface, with nothing recorded or recovered. This adds an ordered, observable degradation layer.

A new memory-watchdog service:

  • Classifies every process into 8 OOM-priority tiers (infrastructure → UI → recovery → durability → user agents → auxiliary services → worker agents → agent build/test/browser subprocesses) by walking /proc and the tmux panes, keyed by service window and agent labels. Pane shells are always spared so windows survive shedding and bootstrap can still detect/restart service exits. An agent's own coordination/observability machinery -- the mngr background-task loop, the transcript streamers it spawns (which feed the UI), and a lead's create_worker.py await report poll -- is classified as never-shed infrastructure rather than as expendable agent subprocesses: each is tiny but load-bearing, so shedding one frees nothing while blinding the UI or severing lead↔worker coordination.
  • Tags each process's oom_score_adj per tier (positive-only, so no CAP_SYS_RESOURCE is needed) -- the kernel-level lever under the runc runtime (lima).
  • Sheds individual processes under sustained pressure, most-expendable tier first and largest-resident-first within a tier, stopping the instant the projected reclaim clears the relief threshold -- so a single large hog is shed on its own rather than taking its whole tier down with it. Which processes to shed is decided up front by projecting how much resident memory each would free, rather than re-reading /proc between kills (an immediate re-read over-sheds: a just-killed process's pages aren't reclaimed yet, so usage still looks full and the shedder escalates needlessly into the user's own agents). Processes below a small resident floor (10 MiB) are never shed -- killing one frees too little to matter, so it would be pure collateral. User-created agents are the last resort; tiers 1-4 are never shed. This is the real mechanism under gVisor, where in-container oom_score_adj can't steer the host's victim selection.
  • Records every kill to an append-only ledger (/mngr/code/runtime/memory_watchdog/events/shed/events.jsonl, backed up) and publishes a status file the UI reads. Each record carries the shed process's owning agent.
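For reviewers, the per-process selection described in the third bullet can be sketched as a pure function. This is an illustration under stated assumptions, not the code in this branch: `Proc`, `select_victims`, `SHED_ORDER`, and `bytes_to_free` are hypothetical names. Only the ordering (most-expendable tier first, largest resident first), the 10 MiB floor, and the stop-at-relief rule are taken from the description.

```python
from dataclasses import dataclass

MIB = 1024 * 1024
RESIDENT_FLOOR_BYTES = 10 * MIB  # processes below this floor are never shed
SHED_ORDER = (8, 7, 6, 5)        # most expendable first; tiers 1-4 never appear


@dataclass(frozen=True)
class Proc:
    pid: int
    tier: int
    rss_bytes: int


def select_victims(procs: list[Proc], bytes_to_free: int) -> list[Proc]:
    """Choose victims up front by projected reclaim, without re-reading /proc."""
    victims: list[Proc] = []
    if bytes_to_free <= 0:
        return victims
    projected = 0
    for tier in SHED_ORDER:
        eligible = [
            p for p in procs
            if p.tier == tier and p.rss_bytes >= RESIDENT_FLOOR_BYTES
        ]
        # Largest resident first, so one big hog can satisfy the target alone.
        for proc in sorted(eligible, key=lambda p: p.rss_bytes, reverse=True):
            victims.append(proc)
            projected += proc.rss_bytes
            if projected >= bytes_to_free:
                return victims  # stop the instant projected reclaim clears relief
    return victims
```

With a 900 MiB tier-8 hog and an 800 MiB target, this returns only the hog; the 4 MiB helper in the same tier is never considered because it sits under the floor.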

Liveness of the watchdog itself, and of every background service, is owned by supervisord: it restarts the watchdog if it dies, and restarts any service the watchdog sheds. The watchdog only decides what to shed; it does not supervise.

Supporting changes:

  • bootstrap: restart backoff + crash-loop circuit breaker; paused services are recorded in the ledger and shown in the banner.
  • Agent labels: user-created chats and the initial chat are tagged user_created=true; workers are tagged agent_created=true. These drive tier 5 (protected, last resort) vs tier 7; unlabeled agents default protectively to tier 5.
  • Revival notice: a SessionStart hook (scripts/claude_shed_notice_hook.py) injects a one-time "you were stopped to relieve memory pressure; your background tasks were cancelled and not restarted; don't blindly re-run a memory-heavy task -- find a lower-memory approach first" notice into a revived agent, tracked by a delivery marker in the ledger. The watchdog (writer), the system-interface status reader, and every agent's hook resolve the one shared ledger via a pinned MEMORY_WATCHDOG_RUNTIME_DIR (.mngr/settings.toml) -- so the notice fires even for worker agents, whose work dir is a worktree rather than the main checkout.
  • Lead coordination: when the watchdog sheds a worker's own agent, the lead's create_worker.py await report poll (now protected, so it survives the shed) detects it through the ledger -- the worker name is derived from the task-file path, so it works even without an explicit --name -- and returns a distinct exit code (75) with an actionable message: revive the worker with mngr start <name> --restart (a plain mngr message/start will not relaunch a shed agent), then re-send its task. This replaces a previous silent timeout where the lead lost its only signal that the worker had been paused.
  • Dead-worker recovery: the reference doc gains ledger-check + revival guidelines (read the ledger by absolute path; don't revive while pressure is elevated; revive with --restart; twice-shed escalates to the user). CLAUDE.md documents how any agent should react to an unexplained exit 137 (confirm against the ledger; find a lower-memory approach rather than re-running).
  • system_interface — pressure banner: a /api/memory-status endpoint (which stays healthy and hides the banner rather than 500ing on a malformed status file) and a calm, non-alarming full-width strip above the workspace that appears only under sustained pressure (zero layout impact otherwise). Collapsed, it shows the reassuring message plus a count of paused background tasks; expanded, it shows a Process / Creator / Freed table -- each row naming the paused command (with an ×N multiplier when several of a kind were shed), who it belonged to (an agent subprocess is attributed to its owning agent by name, otherwise a friendly tier label), and how much memory it freed, plus any blocked system services.
  • system_interface — activity indicator hardening: the per-agent activity strip now renders a calm "Reconnecting…" while the agents websocket is disconnected instead of pinning a stale "Thinking…"/"Running tool…". This matters when the watchdog sheds an agent: without it, a killed agent would otherwise appear stuck mid-turn forever. The server pushes a fresh state snapshot on reconnect, restoring the true state within one cycle.

A manual OOM drill procedure is documented in libs/memory_watchdog/OOM_DRILL.md; the pure classification/shed/breaker/IO logic is unit-tested, and the shed-and-recover path was exercised end-to-end against live docker containers (worker subprocess shed → worker survives and reacts; worker agent shed → lead's poll reports it and revives the worker → revived worker receives the notice).

Scope notes

  • The minds desktop client's existing host-restart tier already recovers whole-container death (gVisor sandbox OOM), so no --restart docker args were added. This layer handles the in-container degradation that machinery can't see.
  • Not implemented: a config-driven auto-revive list (the plan's optional, default-empty hook) and the intelligent "watcher agent" -- both deferred.
  • The watchdog's arm threshold is computed from RAM availability (MemAvailable), not swap, so under heavy swapping true pressure can be understated -- noted for a possible follow-up, not addressed here.
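The swap caveat in the last scope note can be made concrete with a small sketch. It assumes the pressure signal is derived from `MemTotal` and `MemAvailable` in `/proc/meminfo`; `parse_meminfo`, `used_fraction`, and the sample values are hypothetical and are not taken from the watchdog.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo content into {field: value in kB}."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_fraction(meminfo: dict[str, int]) -> float:
    """Fraction of RAM in use, from MemTotal and MemAvailable only."""
    total = meminfo["MemTotal"]
    return (total - meminfo["MemAvailable"]) / total


# Made-up sample: swap is almost exhausted, yet the RAM-only figure is 0.875.
SAMPLE = """MemTotal:        8000000 kB
MemFree:          200000 kB
MemAvailable:    1000000 kB
SwapTotal:       2000000 kB
SwapFree:         100000 kB
"""

print(used_fraction(parse_meminfo(SAMPLE)))
```

Because `SwapFree` never enters the calculation, a container that is swapping heavily can report the same fraction as one that is not, which is the understatement the note describes.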

Testing

  • cd libs/memory_watchdog && uv run pytest: 45 passed (process-granular shedding incl. the spare-the-helpers regression, the classifier helper-protection case, ledger persistence, status round-trip, ratchets).
  • uv run pytest .agents/skills/launch-task/scripts/create_worker_test.py: 50 passed (shed-ledger detection, pending-shed vs already-revived, worker-name derivation, the await poll lifecycle).
  • cd apps/system_interface && uv run pytest -m "not tmux and not modal and not docker and not docker_sdk and not acceptance and not release": 487 passed (incl. the new memory-status malformed-file regression test). The one failure is the pre-existing macOS-only environment issue (mngr observe needs a pytest-enabled local profile; fails identically on main; passes in CI/Linux).
  • Frontend (npm run test in apps/system_interface/frontend): 193 passed, including MemoryPressureBanner and the updated ActivityIndicator tests.
  • libs/bootstrap + meta/per-lib ratchets pass.

Gabriel Guralnick added 11 commits June 10, 2026 11:22
Introduces a memory-watchdog service that classifies the container's
process tree into eight OOM-priority tiers, keeps each process's
oom_score_adj in sync, and sheds whole tiers (most-expendable first)
under sustained memory pressure -- so the container degrades gracefully
instead of at the kernel's whim, and the user's interface and agents are
protected.

- libs/memory_watchdog: tier classifier, /proc + tmux probes,
  oom_score_adj tagger, whole-tier shedder, shed-event ledger and
  published status file, and a watchdog loop that also supervises
  bootstrap/telegram/terminal (the mirror of bootstrap restarting the
  watchdog -- a closed recovery loop).
- bootstrap: restart backoff + crash-loop circuit breaker; paused
  services are recorded in the ledger and surfaced in the banner.
- Agent labels: user-created chats/initial-chat tagged user_created=true,
  workers tagged agent_created=true, driving tier 5 vs tier 7.
- SessionStart hook injects a "you were stopped for memory" notice into a
  revived agent; dead-worker-recovery doc gains ledger-check guidance.
- system_interface: /api/memory-status endpoint + a calm pressure banner
  that appears only under sustained pressure.
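A minimal sketch of the oom_score_adj tagger listed above, under stated assumptions: the tier-to-score table and the `tag_process` helper are hypothetical, and the real mapping in `libs/memory_watchdog` may differ. What it illustrates is the positive-only constraint: every value stays in 0..1000, so the tagger never asks for a negative adjustment.

```python
from pathlib import Path

# Hypothetical mapping: a higher score makes the kernel prefer that process
# as an OOM victim. All values are non-negative (positive-only tagging).
TIER_TO_OOM_SCORE_ADJ = {1: 0, 2: 100, 3: 200, 4: 300, 5: 400, 6: 600, 7: 800, 8: 1000}


def tag_process(pid: int, tier: int, proc_root: Path = Path("/proc")) -> bool:
    """Write the tier's score to <proc_root>/<pid>/oom_score_adj.

    Returns False when the process has already exited or the write is refused,
    so a caller can simply skip it until the next poll.
    """
    target = proc_root / str(pid) / "oom_score_adj"
    try:
        target.write_text(f"{TIER_TO_OOM_SCORE_ADJ[tier]}\n")
    except OSError:
        return False
    return True
```

Taking `proc_root` as a parameter is a design choice for testability: the same function can be pointed at a temporary directory instead of the live `/proc`.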
The runtime path layout was hardcoded independently in the watchdog, the
system interface, and the notice hook, and the MEMORY_WATCHDOG_RUNTIME_DIR
override was honored only by the writer. Make memory_watchdog.ledger the
single source: it resolves relative to MNGR_AGENT_WORK_DIR and honors the
override; the system interface now imports status_path() rather than
rebuilding it; and the stdlib-only hook honors the same base and override
so producer and consumers can't resolve to different files.

Problem: system_probe.py had a public read_agent_label_sets() that was a
pure pass-through over the private _read_agent_label_sets(), unlike the
module's other probes which are public directly -- needless indirection.
Fix: merged the two into a single public read_agent_label_sets(); the
public name and signature are unchanged so the watchdog import still works.

Problem: ShedRecord.timestamp was documented as 'Nanosecond-precision'
but the producing format strftime('...%f000Z') yields microseconds (%f)
with a literal '000' pad, so the last three digits are always zero --
the resolution is microsecond, not nanosecond.
Fix: updated the field description to say microsecond-precision; the
timestamp string format is unchanged (it still sorts and parses correctly).
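The precision point can be checked directly with the standard library. The format string below is the one quoted in this commit message; the sample instant and the `LEGACY_FORMAT` name are made up for illustration.

```python
from datetime import datetime, timezone

# Format string as quoted in the commit message above.
LEGACY_FORMAT = "%Y-%m-%dT%H:%M:%S.%f000Z"

moment = datetime(2026, 6, 10, 11, 22, 33, 123456, tzinfo=timezone.utc)
stamp = moment.strftime(LEGACY_FORMAT)

# %f emits exactly six digits (microseconds); the trailing "000" is a literal
# pad, so the last three fractional digits are always zero.
print(stamp)  # 2026-06-10T11:22:33.123456000Z
```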
Problem: The nanosecond-precision ISO timestamp format "%Y-%m-%dT%H:%M:%S.%f000Z"
was hand-rolled and duplicated across ledger.py, shedder.py, and watchdog.py,
reimplementing imbue_common.format_nanosecond_iso_timestamp (already a dependency,
and the helper mngr uses for event/discovery timestamps) via a fragile
%f-plus-literal-000 trick.
Fix: Added a single in-package now_iso_timestamp() helper in data_types.py that
delegates to format_nanosecond_iso_timestamp, plus one ISO_TIMESTAMP_FORMAT
constant for the strptime parser, and routed all three producers and the parser
through them so producers and reader cannot drift. Output is byte-identical, so
the ledger format and the SessionStart hook's string comparisons are unchanged.

Problem: ShedRecord.timestamp in libs/memory_watchdog/data_types.py described
itself as 'Microsecond-precision', but the value is produced by
now_iso_timestamp() -> format_nanosecond_iso_timestamp(), which emits
nanosecond precision (matching ISO_TIMESTAMP_FORMAT). The description was the
only timestamp field in the module that misstated its precision.
Fix: Change the field description to 'Nanosecond-precision' so the
documentation matches the actual format. Description-string only; no runtime,
schema, or serialization change.
Comment thread apps/system_interface/frontend/src/views/MemoryPressureBanner.ts Outdated
Gabriel Guralnick added 3 commits June 22, 2026 15:35
…atchdog onto supervisord

origin/main replaced the tmux-based service manager (services.toml + a bootstrap
restart loop + per-service svc-<name> windows) with supervisord, and retired the
telegram service. The OOM watchdog was built directly on that old model, so this
merge re-architects it to fit supervisord:

- memory-watchdog is now a [program:*] in supervisord.conf instead of a
  services.toml entry.
- The classifier no longer tiers services by tmux window name (every service
  would have collapsed into the single "bootstrap" supervisord pane and never
  been shed). It now walks supervisord's child processes and tiers each by its
  command line; the services tmux session is derived from supervisord's pane
  ancestor rather than an unreliable `tmux display-message`.
- Dropped the bootstrap crash-loop circuit breaker (supervisord owns restarts
  now) and the watchdog's tmux window-supervision of bootstrap/telegram/terminal
  (supervisord owns liveness; telegram is retired). blocked_services plumbing is
  kept but dormant, reserved for a future supervisorctl-based crash-loop signal.
- Re-ported the /api/memory-status endpoint from the old FastAPI server onto the
  new Flask server; the frontend memory-pressure banner merged cleanly.

The OOM core (tier classification, oom_score_adj tagging, tier shedding, ledger,
status file, the SessionStart shed-notice hook, and user_created/agent_created
agent labeling) is unchanged -- it never depended on how services are launched.
Comment thread libs/memory_watchdog/src/memory_watchdog/classifier.py Outdated
Comment thread libs/memory_watchdog/src/memory_watchdog/classifier.py Outdated
Comment thread libs/memory_watchdog/src/memory_watchdog/watchdog.py
Comment thread scripts/claude_shed_notice_hook.py Outdated
Gabriel Guralnick added 11 commits June 24, 2026 15:15
…via agent env; refresh banner copy and docs

- Extract the ledger/status on-disk layout into a dependency-free
  memory_watchdog.paths module, imported by the ledger, the system
  interface status reader, and the revival hook, so the layout cannot
  drift between producer and consumers.
- Resolve the services tmux session from MNGR_AGENT_NAME rather than
  walking supervisord's pane ancestor, removing find_services_session_name
  and the tmux current-session fallback.
- Update the memory-pressure banner copy.
- Fix stale pre-supervisord references (services.toml, svc-<name>
  windows, watchdog-as-supervisor) in OOM_DRILL.md, data_types.py, and
  the ledger docstring.

…ity indicator

- Over-shedding: the shed loop re-read /proc/meminfo immediately after
  SIGKILL, before the kernel reclaimed the freed pages, so it escalated
  through cheaper tiers into the user's own agents even when shedding a
  single large agent-child had already freed enough. Replace the
  synchronous re-read with a pure select_tiers_to_shed that projects each
  tier's reclaim from its processes' resident memory and stops escalating
  once that clears the relief threshold; the next poll corrects any
  under-estimate.
- supervisord detection: the image launches supervisord as
  'python3 /usr/bin/supervisord', so argv[0]'s basename is the
  interpreter and the classifier never recognized it, leaving every
  service in the protected infrastructure tier (never shed). Match the
  basename of either of the first two argv tokens.
- UI: the activity indicator rendered a cached 'Thinking...' from the
  agent-state websocket even after that socket disconnected, so a killed
  or finished agent could stay pinned on 'Thinking...' indefinitely.
  Show a muted 'Reconnecting...' whenever the socket is down; the server
  pushes a fresh snapshot on reconnect.
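The supervisord detection fix in the second item above amounts to a basename check over the first two argv tokens. A minimal sketch, with `is_supervisord` as a hypothetical name rather than the classifier's actual function:

```python
import os


def is_supervisord(argv: list[str]) -> bool:
    """True if either of the first two argv tokens has basename 'supervisord'.

    This covers both a direct launch ('/usr/bin/supervisord ...') and an
    interpreter launch ('python3 /usr/bin/supervisord ...').
    """
    return any(os.path.basename(token) == "supervisord" for token in argv[:2])
```

Limiting the check to the first two tokens keeps an unrelated process that merely passes `supervisord` as a later argument from matching.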
The banner dumped the raw shed labels inline. Replace that with a calm
count of how much background work was paused, plus a chevron that expands
an itemized list -- each entry naming what was paused, its kind (agent
subprocess / worker / service / agent), and how much memory it freed.

…agent

Shed entries showed only the interpreter name ("python3") with no owner.
Now the classifier labels an agent subprocess past the interpreter
("python3 hog.py", "pytest") and tags it with the agent whose session it
ran under, threaded through the shed records, status file, and API. The
banner dropdown shows e.g. "python3 hog.py — Agent subprocess from alice".
owning_agent_name is distinct from the revival-notice agent_name, so
attributing a subprocess never implies the agent itself was shed.

- Render the paused items as a table (Process / Creator / Freed) instead
  of an inline list; the creator column is just the owning agent's name
  for a subprocess (e.g. 'hogtest') rather than 'Agent subprocess from
  hogtest'.
- Fix the expand pushing the chat input and terminal button off-screen:
  give the workspace pane min-h-0 + overflow-hidden so it shrinks when the
  banner grows, and cap the dropdown at 38vh with internal scroll.

…nly focus ring

The hover underline collided awkwardly with the chevron and the click
focus box looked boxed-in. Replace with a faint hover chip behind the
count, and show a focus ring only for keyboard navigation.

The revival SessionStart hook resolved the shed ledger relative to each
agent's own MNGR_AGENT_WORK_DIR. The watchdog (writer) and system
interface run under the system-services agent (work dir /mngr/code), but a
worker runs in its own worktree, so its hook read a nonexistent
worktree-local ledger and never told the revived worker it had been paused.
Pin MEMORY_WATCHDOG_RUNTIME_DIR in the agent env (same mechanism as
TICKETS_DIR) so writer and every reader share one ledger.

Also persist owning_agent_name in the shed ledger so the durable record is
not lossier than the live status file.
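A sketch of what persisting `owning_agent_name` in an append-only JSONL ledger could look like. The helper names and every record field other than `owning_agent_name` are assumptions for illustration; the branch's `ShedRecord` schema is not reproduced here.

```python
import json
from pathlib import Path


def append_shed_record(ledger_path: Path, record: dict) -> None:
    """Append one JSON object per line; existing lines are never rewritten."""
    ledger_path.parent.mkdir(parents=True, exist_ok=True)
    with ledger_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


def read_shed_records(ledger_path: Path) -> list[dict]:
    """Read the ledger back in append order; a missing file means no sheds yet."""
    if not ledger_path.exists():
        return []
    with ledger_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A record for a shed subprocess might then carry `{"label": "python3 hog.py", "owning_agent_name": "alice", "freed_bytes": 943718400}`, so the durable ledger holds the same attribution as the live status file.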
Shedding a whole tier to reclaim one large hog took down everything else in
that tier -- the agent's transcript streamer, its lead's report poll, bare
sleeps -- freeing almost nothing but breaking observability and coordination
for every agent involved. Select and kill individual processes instead, ordered
by tier shed-priority then resident size, stopping the instant the projected
reclaim clears relief. A single hog is now shed on its own. Also never shed a
process below a small resident floor (10 MiB): killing it cannot meaningfully
relieve pressure, so doing so is pure collateral.
Gabriel Guralnick added 14 commits June 25, 2026 13:01
…tall

When the watchdog pauses a worker's own agent, the worker never reports, so the
lead's background poll would just sit until its 30-minute timeout -- and the
lead had no way to tell 'paused for memory' from 'still working'. Have the
await poll watch the shed ledger (via the watchdog's own path module, --name
opt-in) and, on seeing the worker's agent shed, return a distinct code with an
actionable message: revive with 'mngr start <name> --restart' (a plain message/
start will not relaunch a shed agent), then re-send the task. Document the same
in lead-proxy.md.

A worker shed mid memory-heavy task would, on revival, often just re-run the
same command and be shed again. Extend the notice to say so explicitly: don't
blindly re-run a memory-intensive task; find a lower-memory approach first
(smaller batches, streaming, releasing data) and only retry if you can.

…pers

An agent's mngr machinery -- the background-task loop, the transcript streamers
it spawns (which feed the UI), and a lead's worker-report poll (create_worker.py
await) -- shared the expendable agent-child tier with real work. So a memory
shed took them down for ~nothing: the UI went dark and, worse, the lead's poll
(its only signal that a worker was paused) died with the worker it watched.
Classify these helpers as never-shed infrastructure by command pattern,
regardless of depth, so the lead's eyes on a worker outlive the worker.

The await poll's shed check bounded on 'sheds after this poll started', which
broke the realistic case: when a worker is shed its report poll dies too, the
lead re-runs the poll, and that re-run -- started after the shed -- would ignore
the very shed it should report. Use the revival hook's own pending-shed notion
instead: a worker shed not yet followed by a notice_delivered (revival) marker.
Works whether the poll survives the shed (now the common case, since the poll is
protected) or is re-run after it.
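The pending-shed notion above can be sketched as a scan over ledger events in append order. The event shape (`type` and `agent_name` keys) and the `has_pending_shed` name are assumptions for illustration; only the rule itself, a shed not yet followed by a `notice_delivered` marker, comes from this commit message.

```python
def has_pending_shed(events: list[dict], worker: str) -> bool:
    """True if the worker's latest shed has no later notice_delivered marker.

    `events` is assumed to be in ledger (append) order.
    """
    pending = False
    for event in events:
        if event.get("agent_name") != worker:
            continue
        if event.get("type") == "shed":
            pending = True
        elif event.get("type") == "notice_delivered":
            pending = False  # the worker was revived and told about the shed
    return pending
```

Because the check does not depend on when the poll started, it gives the same answer whether the poll survived the shed or was re-run afterwards.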
A lead following the skill easily omits --name, which silently disabled the
shed-ledger watch (the poll then just waits out its timeout on a paused worker).
Default the worker name to the task file's directory name -- every flow stages
the task at runtime/<flow>/<NAME>/task.md where <NAME> is the worker's mngr agent
name -- so the pause detection works without the flag. --name still overrides; a
wrong-derived name simply never matches a ledger record, so it cannot false-fire.
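The default-name rule reduces to taking the task file's parent directory name. A minimal sketch; `derive_worker_name` and the example flow directory are hypothetical, while the `runtime/<flow>/<NAME>/task.md` layout is the one stated above.

```python
from pathlib import PurePosixPath
from typing import Optional


def derive_worker_name(task_file: str, explicit_name: Optional[str] = None) -> str:
    """Prefer an explicit --name; otherwise use the task file's directory name."""
    if explicit_name:
        return explicit_name
    return PurePosixPath(task_file).parent.name


# "some-flow" is a made-up flow directory used only for this example.
print(derive_worker_name("runtime/some-flow/alice-worker/task.md"))  # alice-worker
```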
Agents whose subprocess is shed mid-session only see exit 137 -- there is no
push notification for that case (the revival notice only covers an agent's own
restart). Observed workers diagnosed it by reading the shed ledger directly and
never reached for the dealing-with-the-unexpected skill, so put the guidance in
the always-loaded CLAUDE.md: suspect the watchdog on an unexplained exit 137,
confirm via the ledger, and find a lower-memory approach rather than blindly
re-running.

Problem: the watchdog module docstring, the supervisord program comment, and
the README all still said the watchdog 'sheds whole tiers', but this branch
changed shedding to be per-process (largest-first within each shed-ordered
tier, stopping at the relief threshold, with a minimum-RSS floor). The wording
contradicted the actual behavior and the module's own internal comments.
Fix: reword all three to describe per-process shedding while keeping the tier
ordering guarantee explicit.

… recovery docs

Problem: dead-worker-recovery.md and OOM_DRILL.md described reviving a
memory-shed worker with a plain 'mngr start <worker>' / 'mngr message', but this
branch's code (create_worker.py await notice, lead-proxy.md) states a shed agent
will only relaunch with 'mngr start <worker> --restart'. The shed-revival
guidance therefore gave a command that does not work.
Fix: use '--restart' in the shed-pressure revival paths of both docs and note
that a plain start/message will not relaunch a shed agent. The generic non-shed
restart path (a crashed-but-not-shed worker) is left on plain 'mngr start',
which is correct there.

Problem: libs/memory_watchdog/README.md claimed the SessionStart notice
hook (scripts/claude_shed_notice_hook.py) duplicates the on-disk path
layout because it cannot import the package. The hook actually imports
the shared, dependency-free memory_watchdog.paths helper (via sys.path)
precisely to avoid duplicating the layout -- as the hook's own docstring
states -- so the README contradicted the implementation.
Fix: rewrite the Paths section to name memory_watchdog.paths as the
single dependency-free source of truth (re-exported through ledger,
imported by the system interface) and describe the hook as importing
that same helper rather than duplicating the layout.

Problem: _read_memory_status in apps/system_interface server.py documents that
a missing/unreadable/future-schema watchdog status file leaves the banner
hidden rather than erroring, but its try/except only caught OSError and
JSONDecodeError. A status file that parses to a non-dict (e.g. 'null') or
carries a non-numeric value where a number is expected raised
AttributeError/TypeError/ValueError out of the projection, surfacing as an
HTTP 500 -- contradicting the documented invariant.
Fix: guard the JSON parse and the projection into the response model together,
broadening the caught exceptions to also include AttributeError, TypeError, and
ValueError so any malformed status content falls back to the healthy (no-banner)
response, still logged once (not silent). Added a regression test for a
top-level non-dict status file.
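A sketch of the guard pattern this fix describes, where the JSON parse and the projection share one `try` block. The response shape, field names, and function names are hypothetical and do not mirror `_read_memory_status`; the logging mentioned above is omitted for brevity.

```python
import json
from pathlib import Path


def healthy_status() -> dict:
    """The no-banner response used whenever the status file cannot be trusted."""
    return {"under_pressure": False, "paused": []}


def read_memory_status(status_path: Path) -> dict:
    """Parse and project inside ONE guard so malformed content never raises."""
    try:
        raw = json.loads(status_path.read_text(encoding="utf-8"))
        return {
            # A top-level non-dict (for example null) raises AttributeError here.
            "under_pressure": bool(raw.get("under_pressure", False)),
            "paused": [
                {
                    "label": str(item.get("label")),
                    # Non-numeric or missing values raise ValueError/TypeError.
                    "freed_bytes": int(item.get("freed_bytes")),
                }
                for item in raw.get("paused", [])
            ],
        }
    except (OSError, json.JSONDecodeError, AttributeError, TypeError, ValueError):
        return healthy_status()
```

Guarding the projection together with the parse is what closes the gap: a file that is valid JSON but the wrong shape fails in the projection, not in `json.loads`.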
Problem: README step 3 listed the shed order as 'subprocesses, then auxiliary
services, then worker agents', but the implementation sheds worker agents
(tier 7) before auxiliary services (tier 6) -- per SHEDDABLE_TIERS_IN_SHED_ORDER
and the tier ranks. The README's own tier table and shedder.py already state
the correct order; only this prose line was stale.
Fix: swapped the two clauses so the prose matches the implemented order
(subprocesses, then worker agents, then auxiliary services, then user agents).

Problem: the final else branch in the pane-classification pass was commented
'A prefixed session we cannot interpret', but that branch runs only when no
agent name could be resolved -- which happens for NON-prefixed sessions, since
every prefixed session resolves to a non-None agent name and takes the prior
branch. The comment described the opposite condition.
Fix: reword the comment to describe the actual case (a non-services session
whose name lacks the agent prefix, protected like a user agent).

Problem: the "was the worker shed for memory pressure?" block in
.agents/skills/launch-task/references/dead-worker-recovery.md grepped/cat'd the
shed ledger and status file via repo-root-relative paths
(runtime/memory_watchdog/...). This reference is executed by a lead agent whose
cwd is its own worktree (/mngr/worktree/<lead>-<hash>/), not /mngr/code, so the
relative paths resolved to a nonexistent worktree-local location: the grep
silently matched nothing (falsely implying the worker was never shed, prompting a
revive straight back into memory pressure) and the cat failed. The shared ledger
lives at /mngr/code/runtime/memory_watchdog/... -- pinned via
MEMORY_WATCHDOG_RUNTIME_DIR for the Python readers, but that env var does not
affect a shell grep/cat. CLAUDE.md already documents this ledger with the
absolute path for the same reason.

Fix: use the absolute /mngr/code/runtime/memory_watchdog/... paths in that block,
matching CLAUDE.md, with a short note on why relative would miss the file.

Problem: The watchdog README's tier table listed RECOVERY (tier 3) as
containing 'bootstrap, this watchdog' and the Tier enum comment described
it as 'the service manager and this watchdog'. The classifier never
assigns either to RECOVERY: the bootstrap pane shell and supervisord (the
service manager) are classified as INFRASTRUCTURE (tier 1), as asserted by
classifier_test, and the only process mapped to RECOVERY is the watchdog
itself. The stale wording predates the supervisord migration.
Fix: List only the watchdog under RECOVERY in the README table, and note
in the enum comment that supervisord and the bootstrap launcher are tier-1
infrastructure, so the docs match what the classifier actually does.
@gnguralnick gnguralnick marked this pull request as ready for review June 25, 2026 23:11
Comment thread .agents/shared/references/lead-proxy.md
Comment thread .agents/skills/launch-task/scripts/create_worker.py
Comment thread libs/memory_watchdog/src/memory_watchdog/classifier.py
Comment thread libs/memory_watchdog/src/memory_watchdog/classifier.py
Comment on lines +8 to +10
1. Snapshots the process tree (`/proc`), the tmux panes, and the host's agent
labels, then classifies every process into one of eight OOM-priority tiers
(see `data_types.Tier`).

this fundamentally seems like a weird way to make this -- can we not have the processes launched with the right priority in the first place?

having an extra service to watch and poll runs the risk of something suddenly allocating tons of memory and either getting killed when it shouldn't or causing something else to get killed when it shouldn't

2 participants