40 commits
6e26cd0
Add blueprint plan for OOM prioritization and graceful degradation
Jun 10, 2026
6d019d5
Add OOM prioritization and graceful degradation
Jun 12, 2026
b7286eb
Merge remote-tracking branch 'origin/main' into gabriel/oom-prioritiz…
Jun 12, 2026
a80048b
Single-source the watchdog runtime path across writer and readers
Jun 12, 2026
0843b46
Collapse redundant agent-label-scan wrapper into one public function
Jun 12, 2026
b52044a
Correct ShedRecord.timestamp precision description to microsecond
Jun 12, 2026
60a7923
Reuse shared nanosecond ISO timestamp helper in memory_watchdog
Jun 12, 2026
77f1843
Correct ShedRecord.timestamp precision in its field description
Jun 12, 2026
eed958f
Revert "Correct ShedRecord.timestamp precision in its field description"
Jun 12, 2026
31fe025
Revert "Correct ShedRecord.timestamp precision description to microse…
Jun 12, 2026
eb2c5d9
Merge remote-tracking branch 'origin/main' into gabriel/oom-prioritiz…
Jun 15, 2026
4837d6d
Merge origin/main into gabriel/oom-prioritization; re-architect OOM w…
Jun 22, 2026
435c5eb
Merge remote-tracking branch 'origin/main' into gabriel/oom-prioritiz…
Jun 23, 2026
5fdfde7
Merge remote-tracking branch 'origin/main' into gabriel/oom-prioritiz…
Jun 23, 2026
1ac0673
Refactor watchdog paths into shared module; resolve services session …
Jun 24, 2026
3bc42ca
Fix watchdog over-shedding and supervisord detection; harden UI activ…
Jun 25, 2026
e217431
Apply prettier formatting to ActivityIndicator
Jun 25, 2026
6cae525
Memory banner: collapse shed details into an expandable count + dropdown
Jun 25, 2026
9c703e7
Apply prettier formatting to MemoryPressureBanner
Jun 25, 2026
742620e
Memory banner: specific subprocess labels attributed to their owning …
Jun 25, 2026
ac9c40c
Memory banner: Process/Creator table + fix expand overflow
Jun 25, 2026
f933656
Apply prettier formatting to MemoryPressureBanner
Jun 25, 2026
98a83e5
Memory banner: drop toggle underline, use a hover chip and keyboard-o…
Jun 25, 2026
ebfe662
Memory watchdog: pin shared ledger dir so worker revival notices fire
Jun 25, 2026
b58f5d1
Memory watchdog: shed individual processes, not whole tiers
Jun 25, 2026
1cb7412
launch-task: surface a memory-pause to the lead instead of a silent s…
Jun 25, 2026
94c0060
Revival notice: tell a paused agent to retry with less memory, not as-is
Jun 25, 2026
a47083b
Memory watchdog: never shed an agent's coordination/observability hel…
Jun 25, 2026
cab6a00
launch-task: detect a worker pause by pending shed, not poll-start time
Jun 25, 2026
bab71eb
launch-task: auto-derive the worker name for await's shed watch
Jun 25, 2026
ac8e59a
Document the memory watchdog and how to react to a shed (exit 137)
Jun 25, 2026
9105c99
memory_watchdog: describe per-process shedding, not whole-tier
Jun 25, 2026
defd996
memory_watchdog: use 'mngr start --restart' to revive a shed agent in…
Jun 25, 2026
58d4f87
Correct README: notice hook imports shared path helper, not duplicate
Jun 25, 2026
c857784
Keep the memory-status endpoint from 500ing on a malformed status file
Jun 25, 2026
b7e2c1a
memory_watchdog README: correct the shed-order prose
Jun 25, 2026
1a35265
memory_watchdog classifier: fix the fallback-branch comment
Jun 25, 2026
e7f0e0d
dead-worker-recovery: read the shed ledger by absolute path
Jun 25, 2026
1ffc361
Correct RECOVERY-tier docs: only the watchdog, not bootstrap
Jun 25, 2026
644f805
Merge remote-tracking branch 'origin/main' into gabriel/oom-prioritiz…
Jun 26, 2026
11 changes: 10 additions & 1 deletion .agents/shared/references/lead-proxy.md
Original file line number Diff line number Diff line change
@@ -10,7 +10,8 @@ Start a background poll for the report file with `create_worker.py await`. It
reads `finish_report_path` from the task file's frontmatter, blocks until that
file appears, prints its contents, and exits 0; on timeout it exits non-zero
(code 124). Run it with Bash's `run_in_background: true` so it returns the
instant the report lands.
instant the report lands. Pass `--name <WORKER_NAME>` so the poll also watches
the memory watchdog's shed ledger.

`await` is a generic poll-until-file primitive; the gate cycle below is this
flow's *use* of it. Non-interactive callers that launch a tightly-scoped agent
@@ -20,6 +21,7 @@ and wait for one finish report use the same `await` (or the synchronous
```bash
# Run with Bash run_in_background: true
uv run .agents/skills/launch-task/scripts/create_worker.py await \
--name <WORKER_NAME> \
--task-file <TASK_FILE>
```

@@ -29,6 +31,13 @@ plus a body. If await exits non-zero (timeout) without printing a report, do
*not* immediately treat it as a terminal failure -- see "Diagnose worker
liveness" below.

If await exits with code 75, the worker's own agent was **paused by the memory
watchdog** to relieve memory pressure: it will not report until revived. This is
not a worker bug -- revive it with `mngr start <WORKER_NAME> --restart` (a plain
`mngr message` or `mngr start` does not relaunch a shed agent), then re-send its
task. On restart the worker is told it was paused, so it can re-check state
before continuing.
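
The exit-code handling described above could be sketched as follows. This is a minimal illustration, not part of the PR: `next_action_for_await` and the strings it returns are hypothetical, and only the codes 0, 124, and 75 and the `mngr start <WORKER_NAME> --restart` command are taken from the text.

```python
# Hypothetical helper: maps an `await` exit code to the lead's next step.
AWAIT_OK = 0
AWAIT_TIMEOUT_RC = 124  # timed out without a report
AWAIT_SHED_RC = 75      # worker paused by the memory watchdog


def next_action_for_await(returncode: int, worker_name: str) -> str:
    """Return a short description of what the lead should do next."""
    if returncode == AWAIT_OK:
        return "read the report"
    if returncode == AWAIT_SHED_RC:
        # A plain `mngr start` or `mngr message` does not relaunch a shed agent.
        return f"run: mngr start {worker_name} --restart, then re-send the task"
    if returncode == AWAIT_TIMEOUT_RC:
        return "diagnose worker liveness before invoking the failure flow"
    return "unexpected exit code; inspect stderr"
```

Keeping 75 separate from 124 is what lets a caller branch like this without parsing stderr.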

## Diagnose worker liveness before invoking failure flow

If the timeout trips without a report appearing, the worker may still be
6 changes: 5 additions & 1 deletion .agents/skills/launch-task/SKILL.md
@@ -100,11 +100,15 @@ Poll with `create_worker.py await` as a background task
(`run_in_background: true`) and continue with whatever else you were doing. It
reads `finish_report_path` from the task file
(`runtime/launch-task/$NAME/reports/report.md`), blocks until the worker pushes
back, then prints the report.
back, then prints the report. Pass `--name $NAME` so the poll also watches the
memory watchdog's shed ledger: if the worker's own agent is paused for memory
pressure (so it will never report until revived), the poll surfaces that
promptly and actionably instead of waiting out the full timeout.

```bash
# Run with Bash run_in_background: true
uv run .agents/skills/launch-task/scripts/create_worker.py await \
--name $NAME \
--task-file runtime/launch-task/$NAME/task.md
```

22 changes: 22 additions & 0 deletions .agents/skills/launch-task/references/dead-worker-recovery.md
@@ -2,6 +2,28 @@

When a worker (sub-agent created via `launch-task`) is in `STOPPED` state -- claude session died mid-iteration, but the worktree (and any uncommitted work in it) is still intact -- the default path is to restart it, not to manually salvage. `mngr start` only re-creates the tmux session and re-execs claude in the existing worktree; it does not touch git state, so uncommitted changes survive the restart.

## First: was the worker shed for memory pressure?

A worker can die because the **memory watchdog** shed it -- the container was running out of memory and the watchdog killed the most-expendable work first. Check the shed ledger before reviving, because reviving into ongoing memory pressure just gets the worker killed again:

```bash
# Did the watchdog shed this worker? (look for your worker's name)
# Absolute path: the ledger is shared at /mngr/code/runtime/, but your cwd is
# your own worktree, so a relative `runtime/...` would miss it.
grep '"agent_name": *"<worker>"' /mngr/code/runtime/memory_watchdog/events/shed/events.jsonl

# Is the container still under pressure right now?
cat /mngr/code/runtime/memory_watchdog/status.json # is_under_pressure, used_fraction
```

Revival guidelines when a worker was shed:

- **If pressure is still elevated** (`is_under_pressure` is true, or `used_fraction` is near the threshold): do NOT revive. Surface the situation to the user and let them decide -- a worker revived now will likely just be shed again, deepening the crunch.
- **If pressure has cleared**: revive at most once with `mngr start <worker> --restart`, then re-establish your report poll. A shed agent needs `--restart` -- a plain `mngr start` or `mngr message` will not relaunch it. On restart it is told it was paused, so it can re-check state before continuing; re-send its task.
- **If the same worker has already been shed twice** (two `process_shed` lines naming it): stop. Do not keep reviving. Surface to the user with the ledger details -- something about this worker's footprint is incompatible with the current memory budget.

If the worker was *not* in the ledger, it died for some other reason (e.g. a claude crash); proceed with the normal restart path below, where a plain `mngr start` suffices.
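
The revival guidelines above can be read as a small decision table. The sketch below is one possible encoding, assuming the ledger is JSONL with the `type` and `agent_name` fields shown in this PR; `count_sheds`, `revival_decision`, and the returned labels are hypothetical names. The `used_fraction` check is left out because the threshold value is not given here.

```python
import json


def count_sheds(ledger_text: str, worker: str) -> int:
    """Count `process_shed` records naming `worker` in JSONL ledger text."""
    count = 0
    for line in ledger_text.splitlines():
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or malformed lines
        if (
            isinstance(record, dict)
            and record.get("type") == "process_shed"
            and record.get("agent_name") == worker
        ):
            count += 1
    return count


def revival_decision(shed_count: int, is_under_pressure: bool) -> str:
    """Map the ledger and status findings to one of the documented outcomes."""
    if shed_count == 0:
        return "plain-restart"   # not shed: the normal `mngr start` path applies
    if shed_count >= 2:
        return "escalate"        # shed twice: stop reviving, surface to the user
    if is_under_pressure:
        return "do-not-revive"   # surface to the user and let them decide
    return "restart-once"        # `mngr start <worker> --restart`, then re-send task
```

The shed-twice check comes before the pressure check so that a repeatedly shed worker is never revived, even once pressure has cleared.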

## Default: restart the worker and resume

1. Bring claude back up in the existing worktree:
124 changes: 124 additions & 0 deletions .agents/skills/launch-task/scripts/create_worker.py
@@ -114,6 +114,11 @@
# matching coreutils ``timeout``'s convention so the prose's mental model
# carries over.
_AWAIT_TIMEOUT_RC = 124
# Distinct exit code for an await that stopped early because the worker's own
# agent was shed by the memory watchdog (so it will never report until revived).
# Separate from the timeout code so the lead can tell "paused for memory" apart
# from "still running, just slow".
_AWAIT_SHED_RC = 75


def _normalize_dir(value: str) -> str:
@@ -224,6 +229,20 @@ def _read_finish_report_path(task_file: Path) -> Path:
return Path(value)


def _worker_name_from_task_file(task_file: Path) -> str:
    """Best-effort worker (mngr agent) name, from the task file's directory.

    Every flow stages the task at ``runtime/<flow>/<NAME>/task.md`` where
    ``<NAME>`` is the worker's mngr agent name (launch passes the same ``<NAME>``
    as both the directory and ``mngr create <NAME>``). So the parent directory
    name is the worker name -- which is what lets ``await`` watch the shed ledger
    for this worker even when the caller did not pass ``--name`` explicitly. If
    the derived name is wrong it simply never matches a ledger record (no false
    positive), so this is safe as a default.
    """
    return task_file.resolve().parent.name
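
As a quick illustration of the derivation above, the following sketch applies the same expression to a path that follows the `runtime/<flow>/<NAME>/task.md` convention. The worker name `my-worker` is a made-up example, and the function is a standalone copy for demonstration, not an import from `create_worker.py`.

```python
from pathlib import Path


def worker_name_from_task_file(task_file: Path) -> str:
    """Standalone copy of the derivation: the task file's parent directory name."""
    return task_file.resolve().parent.name


# The file does not need to exist: Path.resolve() is non-strict by default.
name = worker_name_from_task_file(Path("runtime/launch-task/my-worker/task.md"))
print(name)
```

Because `resolve()` follows symlinks, a task directory reached through a symlink yields the name of the link's target, not the link itself.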


class Runner:
"""Indirection over ``subprocess.run`` so tests can intercept commands.

@@ -357,6 +376,11 @@ def launch(
template,
"--label",
f"workspace={workspace}",
# Marks this as an agent-created (worker) agent so the memory
# watchdog classifies it at tier 7 -- shed before user-created
# agents (tier 5) under memory pressure.
"--label",
"agent_created=true",
],
check=True,
)
@@ -382,13 +406,70 @@ def launch(
return 0


def _resolve_shed_ledger_path() -> Path | None:
    """Locate the memory watchdog's shed ledger via the watchdog's own path
    module, so await resolves the exact file the watchdog writes and the revival
    hook reads -- no second copy of the layout to drift. Returns None if the
    watchdog package can't be imported, in which case the shed check is skipped
    (await falls back to plain timeout behaviour).
    """
    try:
        watchdog_src = (
            Path(__file__).resolve().parents[4] / "libs" / "memory_watchdog" / "src"
        )
        if watchdog_src.is_dir() and str(watchdog_src) not in sys.path:
            sys.path.insert(0, str(watchdog_src))
        from memory_watchdog.paths import shed_ledger_path

        return shed_ledger_path()
    except ImportError:
        return None


def _worker_has_pending_shed(ledger_path: Path, worker_name: str) -> bool:
    """Whether the worker's own agent is currently shed and not yet revived.

    Uses the same "pending" notion as the revival hook: a ``process_shed`` record
    whose ``agent_name`` is the worker (the watchdog stamps ``agent_name`` only
    when it sheds an agent's main process -- tier 5/7 -- not a mere subprocess),
    newer than the latest ``notice_delivered`` marker for that worker (which the
    revival hook writes when the worker restarts). So a shed that has already been
    followed by a revival does not count, while a shed not yet revived does --
    regardless of whether this await was started before or after the shed (the
    realistic case is a re-run poll started *after* the worker died).
    """
    try:
        text = ledger_path.read_text(encoding="utf-8")
    except OSError:
        return False
    last_delivered = ""
    shed_timestamps: list[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        # A valid JSON line need not be an object (e.g. a bare number or list);
        # skip it rather than fail on ``.get``.
        if not isinstance(record, dict):
            continue
        if record.get("agent_name") != worker_name:
            continue
        record_type = record.get("type")
        if record_type == "notice_delivered":
            last_delivered = max(last_delivered, str(record.get("up_to_timestamp", "")))
        elif record_type == "process_shed":
            shed_timestamps.append(str(record.get("timestamp", "")))
    return any(timestamp > last_delivered for timestamp in shed_timestamps)
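
To make the "pending shed" rule concrete, here is a small self-contained sketch that runs the same comparison over an in-memory ledger. The record fields (`type`, `agent_name`, `timestamp`, `up_to_timestamp`) come from the function above; the timestamp values and their fixed-width ISO-8601 format are assumptions for illustration, and `has_pending_shed` is a simplified stand-in, not the PR's function.

```python
import json
from typing import Iterable


def has_pending_shed(ledger_lines: Iterable[str], worker_name: str) -> bool:
    """True if a shed for `worker_name` is newer than its last delivered notice."""
    last_delivered = ""
    sheds: list[str] = []
    for line in ledger_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict) or record.get("agent_name") != worker_name:
            continue
        if record.get("type") == "notice_delivered":
            last_delivered = max(last_delivered, str(record.get("up_to_timestamp", "")))
        elif record.get("type") == "process_shed":
            sheds.append(str(record.get("timestamp", "")))
    # String comparison is only chronological if timestamps share one
    # fixed-width format (assumed here).
    return any(ts > last_delivered for ts in sheds)


# Assumed example records: w1 is shed, revived, then shed again; w2 is shed once.
ledger = [
    json.dumps({"type": "process_shed", "agent_name": "w1",
                "timestamp": "2026-06-25T10:00:00.000000000Z"}),
    json.dumps({"type": "notice_delivered", "agent_name": "w1",
                "up_to_timestamp": "2026-06-25T10:00:00.000000000Z"}),
    json.dumps({"type": "process_shed", "agent_name": "w1",
                "timestamp": "2026-06-25T11:30:00.000000000Z"}),
    json.dumps({"type": "process_shed", "agent_name": "w2",
                "timestamp": "2026-06-25T09:00:00.000000000Z"}),
]

print(has_pending_shed(ledger[:2], "w1"))  # shed already followed by a notice
print(has_pending_shed(ledger, "w1"))      # second shed has no notice yet
```

The comparison is purely lexicographic, so it gives chronological answers only while every timestamp in the ledger uses the same fixed-width format and timezone.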


def await_report(
report_path: Path,
timeout_seconds: float,
poll_interval_seconds: float,
sleeper: Callable[[float], None] = time.sleep,
clock: Callable[[], float] = time.monotonic,
out: TextIO | None = None,
worker_name: str | None = None,
shed_ledger_path: Path | None = None,
) -> int:
"""Block until ``report_path`` exists, then print its contents.

@@ -397,6 +478,15 @@ def await_report(
on stderr so the caller diagnoses worker liveness per lead-proxy.md rather
than treating the timeout as a terminal failure.

If ``worker_name`` and ``shed_ledger_path`` are supplied, each poll also
checks whether the worker's own agent was shed by the memory watchdog. A shed
worker will never report until it is revived, so rather than wait out the full
timeout we surface an actionable message and return ``_AWAIT_SHED_RC`` -- this
is what turns the lead's silent "poll died / timed out" into "your worker was
paused for memory; revive it". The report file is still checked first each
loop, so a report that landed before the shed (or a worker revived and
reporting) still wins.

``sleeper``/``clock`` are injected so tests can drive the poll loop without
real time. The file is checked before the first sleep, so a report already
present returns immediately.
@@ -407,6 +497,23 @@ def await_report(
if report_path.is_file():
stream.write(report_path.read_text(encoding="utf-8"))
return 0
if (
worker_name is not None
and shed_ledger_path is not None
and _worker_has_pending_shed(shed_ledger_path, worker_name)
):
print(
f"create_worker: worker '{worker_name}' was stopped by the memory "
"watchdog to relieve memory pressure -- its agent process was shed "
"and its background tasks (including its own report poll) were "
"cancelled, so it will NOT report until it is revived. Revive it "
f"with: mngr start {worker_name} --restart (a plain `mngr message` "
"or `mngr start` will not relaunch a shed agent), then re-send its "
"task. On restart it is told it was paused, so it can re-check "
"state before continuing.",
file=sys.stderr,
)
return _AWAIT_SHED_RC
if clock() >= deadline:
print(
f"create_worker: timed out after {timeout_seconds:g}s waiting for "
@@ -601,10 +708,20 @@ def _run_await(args: argparse.Namespace) -> int:
# file; let the ValueError raise for a full traceback rather than swallowing
# it into a terse exit-2 message (matches ``launch``'s handling above).
report_path = _read_finish_report_path(args.task_file)
# Watch the watchdog's shed ledger so a worker paused for memory pressure
# surfaces promptly (and actionably) instead of as a silent 30-minute
# timeout. The worker name defaults to the task file's directory name (the
# runtime/<flow>/<NAME>/ convention) so this works even when the caller did
# not pass --name explicitly -- which a lead following the skill easily
# forgets. --name overrides when given.
worker_name = args.name or _worker_name_from_task_file(args.task_file)
shed_ledger = _resolve_shed_ledger_path()
return await_report(
report_path=report_path,
timeout_seconds=args.timeout,
poll_interval_seconds=args.poll_interval,
worker_name=worker_name,
shed_ledger_path=shed_ledger,
)
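
The `sleeper`/`clock` injection that `await_report` relies on can be exercised without real waiting. The sketch below is a simplified stand-in for the poll loop (file check, then deadline check, then sleep) with no shed check; `poll_for_file` and `FakeTime` are hypothetical names, and only the 0 and 124 return codes are taken from the source.

```python
import tempfile
from pathlib import Path
from typing import Callable


def poll_for_file(
    path: Path,
    timeout_seconds: float,
    poll_interval_seconds: float,
    sleeper: Callable[[float], None],
    clock: Callable[[], float],
) -> int:
    """Return 0 once `path` exists, or 124 when the deadline passes."""
    deadline = clock() + timeout_seconds
    while True:
        if path.is_file():  # checked before the first sleep
            return 0
        if clock() >= deadline:
            return 124
        sleeper(poll_interval_seconds)


class FakeTime:
    """Deterministic clock whose sleep() just advances the current time."""

    def __init__(self) -> None:
        self.now = 0.0
        self.sleeps = 0

    def clock(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds
        self.sleeps += 1


fake = FakeTime()
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.md"
    # No report yet: the loop "sleeps" three times, then times out.
    timed_out_rc = poll_for_file(report, 30.0, 10.0, fake.sleep, fake.clock)
    report.write_text("done", encoding="utf-8")
    # Report present: returns immediately without sleeping again.
    found_rc = poll_for_file(report, 30.0, 10.0, fake.sleep, fake.clock)
```

With a fake clock, a 30-minute timeout path can be tested in microseconds, and the number of sleeps becomes an assertable value.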


@@ -677,6 +794,13 @@ def main(argv: Sequence[str] | None = None, runner: Runner | None = None) -> int
help="Same task file as launch; its frontmatter `finish_report_path` "
"names the file to wait for.",
)
await_parser.add_argument(
"--name",
default=None,
help="Worker name, used to watch the memory watchdog's shed ledger so a "
"worker paused for memory pressure is surfaced promptly (and actionably) "
"instead of as a silent timeout. Defaults to the task file's parent "
"directory name.",
)
await_parser.add_argument(
"--timeout",
default=_DEFAULT_TIMEOUT,