imbue-ai · gnguralnick · Jun 10, 2026 · Jun 12, 2026 · Jun 12, 2026 · Jun 12, 2026
diff --git a/.agents/shared/references/lead-proxy.md b/.agents/shared/references/lead-proxy.md
@@ -10,7 +10,8 @@ Start a background poll for the report file with `create_worker.py await`. It
 reads `finish_report_path` from the task file's frontmatter, blocks until that
 file appears, prints its contents, and exits 0; on timeout it exits non-zero
 (code 124). Run it with Bash's `run_in_background: true` so it returns the
-instant the report lands.
+instant the report lands. Pass `--name <WORKER_NAME>` so the poll also watches
+the memory watchdog's shed ledger.
 
 `await` is a generic poll-until-file primitive; the gate cycle below is this
 flow's *use* of it. Non-interactive callers that launch a tightly-scoped agent
@@ -20,6 +21,7 @@ and wait for one finish report use the same `await` (or the synchronous
 ```bash
 # Run with Bash run_in_background: true
 uv run .agents/skills/launch-task/scripts/create_worker.py await \
+    --name <WORKER_NAME> \
     --task-file <TASK_FILE>
 ```
 
@@ -29,6 +31,13 @@ plus a body. If await exits non-zero (timeout) without printing a report, do
 *not* immediately treat it as a terminal failure -- see "Diagnose worker
 liveness" below.
 
+If await exits with code 75, the worker's own agent was **paused by the memory
+watchdog** to relieve memory pressure: it will not report until revived. This is
+not a worker bug -- revive it with `mngr start <WORKER_NAME> --restart` (a plain
+`mngr message` or `mngr start` does not relaunch a shed agent), then re-send its
+task. On restart the worker is told it was paused, so it can re-check state
+before continuing.
+
 ## Diagnose worker liveness before invoking failure flow
 
 If the timeout trips without a report appearing, the worker may still be

diff --git a/.agents/skills/launch-task/SKILL.md b/.agents/skills/launch-task/SKILL.md
@@ -100,11 +100,15 @@ Poll with `create_worker.py await` as a background task
 (`run_in_background: true`) and continue with whatever else you were doing. It
 reads `finish_report_path` from the task file
 (`runtime/launch-task/$NAME/reports/report.md`), blocks until the worker pushes
-back, then prints the report.
+back, then prints the report. Pass `--name $NAME` so the poll also watches the
+memory watchdog's shed ledger: if the worker's own agent is paused for memory
+pressure (so it will never report until revived), the poll surfaces that
+promptly and actionably instead of waiting out the full timeout.
 
 ```bash
 # Run with Bash run_in_background: true
 uv run .agents/skills/launch-task/scripts/create_worker.py await \
+    --name $NAME \
     --task-file runtime/launch-task/$NAME/task.md
 ```
 

diff --git a/.agents/skills/launch-task/references/dead-worker-recovery.md b/.agents/skills/launch-task/references/dead-worker-recovery.md
@@ -2,6 +2,28 @@
 
 When a worker (sub-agent created via `launch-task`) is in `STOPPED` state -- claude session died mid-iteration, but the worktree (and any uncommitted work in it) is still intact -- the default path is to restart it, not to manually salvage. `mngr start` only re-creates the tmux session and re-execs claude in the existing worktree; it does not touch git state, so uncommitted changes survive the restart.
 
+## First: was the worker shed for memory pressure?
+
+A worker can die because the **memory watchdog** shed it -- the container was running out of memory and the watchdog killed the most-expendable work first. Check the shed ledger before reviving, because reviving into ongoing memory pressure just gets the worker killed again:
+
+```bash
+# Did the watchdog shed this worker? (look for your worker's name)
+# Absolute path: the ledger is shared at /mngr/code/runtime/, but your cwd is
+# your own worktree, so a relative `runtime/...` would miss it.
+grep '"agent_name": *"<worker>"' /mngr/code/runtime/memory_watchdog/events/shed/events.jsonl
+
+# Is the container still under pressure right now?
+cat /mngr/code/runtime/memory_watchdog/status.json   # is_under_pressure, used_fraction
+```
+
+Revival guidelines when a worker was shed:
+
+- **If pressure is still elevated** (`is_under_pressure` is true, or `used_fraction` is near the threshold): do NOT revive. Surface the situation to the user and let them decide -- reviving now will likely just be shed again and deepen the crunch.
+- **If pressure has cleared**: revive at most once with `mngr start <worker> --restart`, then re-establish your report poll. A shed agent needs `--restart` -- a plain `mngr start` or `mngr message` will not relaunch it. On restart it is told it was paused, so it can re-check state before continuing; re-send its task.
+- **If the same worker has already been shed twice** (two `process_shed` lines naming it): stop. Do not keep reviving. Surface to the user with the ledger details -- something about this worker's footprint is incompatible with the current memory budget.
+
+If the worker was *not* in the ledger, it died for some other reason (e.g. a claude crash); proceed with the normal restart path below, where a plain `mngr start` suffices.
+
 ## Default: restart the worker and resume
 
 1. Bring claude back up in the existing worktree:

diff --git a/.agents/skills/launch-task/scripts/create_worker.py b/.agents/skills/launch-task/scripts/create_worker.py
@@ -114,6 +114,11 @@
 # matching coreutils ``timeout``'s convention so the prose's mental model
 # carries over.
 _AWAIT_TIMEOUT_RC = 124
+# Distinct exit code for an await that stopped early because the worker's own
+# agent was shed by the memory watchdog (so it will never report until revived).
+# Separate from the timeout code so the lead can tell "paused for memory" apart
+# from "still running, just slow".
+_AWAIT_SHED_RC = 75
 
 
 def _normalize_dir(value: str) -> str:
@@ -224,6 +229,20 @@ def _read_finish_report_path(task_file: Path) -> Path:
     return Path(value)
 
 
+def _worker_name_from_task_file(task_file: Path) -> str:
+    """Best-effort worker (mngr agent) name, from the task file's directory.
+
+    Every flow stages the task at ``runtime/<flow>/<NAME>/task.md`` where
+    ``<NAME>`` is the worker's mngr agent name (launch passes the same ``<NAME>``
+    as both the directory and ``mngr create <NAME>``). So the parent directory
+    name is the worker name -- which is what lets ``await`` watch the shed ledger
+    for this worker even when the caller did not pass ``--name`` explicitly. If
+    the derived name is wrong it simply never matches a ledger record (no false
+    positive), so this is safe as a default.
+    """
+    return task_file.resolve().parent.name
+
+
 class Runner:
     """Indirection over ``subprocess.run`` so tests can intercept commands.
 
@@ -357,6 +376,11 @@ def launch(
             template,
             "--label",
             f"workspace={workspace}",
+            # Marks this as an agent-created (worker) agent so the memory
+            # watchdog classifies it at tier 7 -- shed before user-created
+            # agents (tier 5) under memory pressure.
+            "--label",
+            "agent_created=true",
         ],
         check=True,
     )
@@ -382,13 +406,70 @@ def launch(
     return 0
 
 
+def _resolve_shed_ledger_path() -> Path | None:
+    """Locate the memory watchdog's shed ledger via the watchdog's own path
+    module, so await resolves the exact file the watchdog writes and the revival
+    hook reads -- no second copy of the layout to drift. Returns None if the
+    watchdog package can't be imported, in which case the shed check is skipped
+    (await falls back to plain timeout behaviour).
+    """
+    try:
+        watchdog_src = (
+            Path(__file__).resolve().parents[4] / "libs" / "memory_watchdog" / "src"
+        )
+        if watchdog_src.is_dir() and str(watchdog_src) not in sys.path:
+            sys.path.insert(0, str(watchdog_src))
+        from memory_watchdog.paths import shed_ledger_path
+
+        return shed_ledger_path()
+    except ImportError:
+        return None
+
+
+def _worker_has_pending_shed(ledger_path: Path, worker_name: str) -> bool:
+    """Whether the worker's own agent is currently shed and not yet revived.
+
+    Uses the same "pending" notion as the revival hook: a ``process_shed`` record
+    whose ``agent_name`` is the worker (the watchdog stamps ``agent_name`` only
+    when it sheds an agent's main process -- tier 5/7 -- not a mere subprocess),
+    newer than the latest ``notice_delivered`` marker for that worker (which the
+    revival hook writes when the worker restarts). So a shed that has already been
+    followed by a revival does not count, while a shed not yet revived does --
+    regardless of whether this await was started before or after the shed (the
+    realistic case is a re-run poll started *after* the worker died).
+    """
+    try:
+        text = ledger_path.read_text(encoding="utf-8")
+    except OSError:
+        return False
+    last_delivered = ""
+    shed_timestamps: list[str] = []
+    for line in text.splitlines():
+        if not line.strip():
+            continue
+        try:
+            record = json.loads(line)
+        except json.JSONDecodeError:
+            continue
+        if record.get("agent_name") != worker_name:
+            continue
+        record_type = record.get("type")
+        if record_type == "notice_delivered":
+            last_delivered = max(last_delivered, str(record.get("up_to_timestamp", "")))
+        elif record_type == "process_shed":
+            shed_timestamps.append(str(record.get("timestamp", "")))
+    return any(timestamp > last_delivered for timestamp in shed_timestamps)
+
+
 def await_report(
     report_path: Path,
     timeout_seconds: float,
     poll_interval_seconds: float,
     sleeper: Callable[[float], None] = time.sleep,
     clock: Callable[[], float] = time.monotonic,
     out: TextIO | None = None,
+    worker_name: str | None = None,
+    shed_ledger_path: Path | None = None,
 ) -> int:
     """Block until ``report_path`` exists, then print its contents.
 
@@ -397,6 +478,15 @@ def await_report(
     on stderr so the caller diagnoses worker liveness per lead-proxy.md rather
     than treating the timeout as a terminal failure.
 
+    If ``worker_name`` and ``shed_ledger_path`` are supplied, each poll also
+    checks whether the worker's own agent was shed by the memory watchdog. A shed
+    worker will never report until it is revived, so rather than wait out the full
+    timeout we surface an actionable message and return ``_AWAIT_SHED_RC`` -- this
+    is what turns the lead's silent "poll died / timed out" into "your worker was
+    paused for memory; revive it". The report file is still checked first each
+    loop, so a report that landed before the shed (or a worker revived and
+    reporting) still wins.
+
     ``sleeper``/``clock`` are injected so tests can drive the poll loop without
     real time. The file is checked before the first sleep, so a report already
     present returns immediately.
@@ -407,6 +497,23 @@ def await_report(
         if report_path.is_file():
             stream.write(report_path.read_text(encoding="utf-8"))
             return 0
+        if (
+            worker_name is not None
+            and shed_ledger_path is not None
+            and _worker_has_pending_shed(shed_ledger_path, worker_name)
+        ):
+            print(
+                f"create_worker: worker '{worker_name}' was stopped by the memory "
+                "watchdog to relieve memory pressure -- its agent process was shed "
+                "and its background tasks (including its own report poll) were "
+                "cancelled, so it will NOT report until it is revived. Revive it "
+                f"with: mngr start {worker_name} --restart  (a plain `mngr message` "
+                "or `mngr start` will not relaunch a shed agent), then re-send its "
+                "task. On restart it is told it was paused, so it can re-check "
+                "state before continuing.",
+                file=sys.stderr,
+            )
+            return _AWAIT_SHED_RC
         if clock() >= deadline:
             print(
                 f"create_worker: timed out after {timeout_seconds:g}s waiting for "
@@ -601,10 +708,20 @@ def _run_await(args: argparse.Namespace) -> int:
     # file; let the ValueError raise for a full traceback rather than swallowing
     # it into a terse exit-2 message (matches ``launch``'s handling above).
     report_path = _read_finish_report_path(args.task_file)
+    # Watch the watchdog's shed ledger so a worker paused for memory pressure
+    # surfaces promptly (and actionably) instead of as a silent 30-minute
+    # timeout. The worker name defaults to the task file's directory name (the
+    # runtime/<flow>/<NAME>/ convention) so this works even when the caller did
+    # not pass --name explicitly -- which a lead following the skill easily
+    # forgets. --name overrides when given.
+    worker_name = args.name or _worker_name_from_task_file(args.task_file)
+    shed_ledger = _resolve_shed_ledger_path()
     return await_report(
         report_path=report_path,
         timeout_seconds=args.timeout,
         poll_interval_seconds=args.poll_interval,
+        worker_name=worker_name,
+        shed_ledger_path=shed_ledger,
     )
 
 
@@ -677,6 +794,13 @@ def main(argv: Sequence[str] | None = None, runner: Runner | None = None) -> int
         help="Same task file as launch; its frontmatter `finish_report_path` "
         "names the file to wait for.",
     )
+    await_parser.add_argument(
+        "--name",
+        default=None,
+        help="Worker name. When given, await also watches the memory watchdog's "
+        "shed ledger so a worker paused for memory pressure is surfaced promptly "
+        "(and actionably) instead of as a silent timeout.",
+    )
     await_parser.add_argument(
         "--timeout",
         default=_DEFAULT_TIMEOUT,