| name | babysit-job |
|---|---|
| description | Monitor/babysit a job continuously and recover on failure. Use when asked to babysit, monitor, or watch a job, pipeline, workflow, or training run. |
Monitor a job continuously and recover on failure. For Zephyr pipelines, delegate to babysit-zephyr instead. Otherwise, follow this skill — Iris is the execution backend.
- `job_id`: Iris job ID in canonical format `/<user>/<job>` (e.g., `/dlwh/iris-run-train_tiny_model_tpu-20260302-185630`)
- `config`: Iris config path (e.g., `lib/iris/examples/marin.yaml`). When the user refers to a cluster by shorthand name (e.g., "marin_dev", "marin-dev", "marin", "coreweave"), resolve it to the matching config file under `lib/iris/examples/`. Common mappings:
  - `marin` / `marin_prod` -> `lib/iris/examples/marin.yaml`
  - `marin_dev` / `marin-dev` -> `lib/iris/examples/marin-dev.yaml`
  - `coreweave` -> `lib/iris/examples/coreweave.yaml`
- `resubmit_command`: exact Iris submit command for resubmission; must include `--no-wait`
  - For Marin TPU training jobs, use `--extra marin:tpu` (not `--extra marin:cpu`)
  - For TPU jobs, the resubmit command must request TPU resources with `--tpu <variant>`. `--reserve <variant>` only holds capacity; it does not attach TPU devices to the task container.
Example resubmit command:
uv run iris --config lib/iris/examples/marin.yaml job run --no-wait --extra marin:tpu --tpu v5litepod-16 -- python experiments/tutorials/train_tiny_model_tpu.py
If any required field is missing, ask for it before proceeding.
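The input rules above can be checked mechanically before the loop starts. The following is a minimal sketch, not part of Iris: `resolve_config` and `check_resubmit_command` are hypothetical helper names, and the check assumes flags are passed as separate tokens (`--tpu v5litepod-16`, not `--tpu=v5litepod-16`).

```python
import shlex

# Shorthand -> config path, copied from the common mappings listed above.
CLUSTER_CONFIGS = {
    "marin": "lib/iris/examples/marin.yaml",
    "marin_prod": "lib/iris/examples/marin.yaml",
    "marin_dev": "lib/iris/examples/marin-dev.yaml",
    "marin-dev": "lib/iris/examples/marin-dev.yaml",
    "coreweave": "lib/iris/examples/coreweave.yaml",
}


def resolve_config(name_or_path: str) -> str:
    """Map a shorthand cluster name to its config path; pass other input through."""
    return CLUSTER_CONFIGS.get(name_or_path.strip().lower(), name_or_path)


def check_resubmit_command(command: str, is_marin_tpu_job: bool = False) -> list:
    """Return a list of problems; an empty list means these checks passed."""
    tokens = shlex.split(command)
    # Only inspect Iris flags, i.e. tokens before the standalone "--".
    iris_args = tokens[: tokens.index("--")] if "--" in tokens else tokens
    problems = []
    if "--no-wait" not in iris_args:
        problems.append("missing --no-wait")
    if is_marin_tpu_job:
        if "--tpu" not in iris_args:
            problems.append("missing --tpu <variant>; --reserve only holds capacity")
        # Assumption: --extra takes its value as the next token.
        extras = [b for a, b in zip(iris_args, iris_args[1:]) if a == "--extra"]
        if "marin:tpu" not in extras:
            problems.append("missing --extra marin:tpu")
    return problems
```

If `check_resubmit_command` returns any problems, ask the user for a corrected command instead of guessing one.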
- Recovery is stop then resubmit at the job level.
- Cluster-level actions are out of scope. Do not restart, recreate, or otherwise mutate the cluster unless the user gives explicit consent in the current thread.
- For TPU bad-node errors, escalate to debug-tpu.
- Assign a single monitoring owner when the loop starts.
- Keep this loop running until one of the following:
  - the job reaches a terminal state and the user has acknowledged next action
  - a user-specified stopping point is reached
  - an unrecoverable error is found and reported to the user
- Do not stop early after seeing first loss lines, first eval, or first W&B link.
- Expect monitoring to commonly take 4-5 hours for ferry-scale runs.
- For GPT Codex specifically: unless otherwise directed, do not end your turn just to give a status update; keep monitoring until terminal state or until the user's goal is reached.
- If the user requests continuous monitoring, do not end the turn while monitoring is active; continue until terminal state, a user-specified stopping point, or an unrecoverable error.
- If handoff is needed, transfer ownership explicitly with: current `job_id`, latest error/signal, W&B link(s), and resubmission metadata.
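The stop conditions above reduce to a small predicate. This is an illustrative sketch only; `LoopStatus` and `should_stop` are hypothetical names, and the flags would be set by whoever owns the loop.

```python
from dataclasses import dataclass


@dataclass
class LoopStatus:
    job_terminal: bool = False            # job reached a terminal state
    user_acknowledged: bool = False       # user acknowledged the next action
    stop_point_reached: bool = False      # user-specified stopping point
    unrecoverable_reported: bool = False  # unrecoverable error found and reported
    seen_first_loss: bool = False         # informational only, never a stop reason


def should_stop(status: LoopStatus) -> bool:
    """True only for the three stop conditions listed above."""
    return (
        (status.job_terminal and status.user_acknowledged)
        or status.stop_point_reached
        or status.unrecoverable_reported
    )
```

Early signals such as first loss lines are deliberately left out of the condition, so seeing them never ends the loop.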
- Cadence default after startup stabilization is `sleep 570`.
- Startup stabilization sequence (after submit/resubmit):
  - once the job is submitted, sleep `120` and check for immediate failure
  - if still alive, switch to the normal `570` cadence
- Tool-runtime workaround:
  - keep one long-running monitor process/session
  - poll the same session in ~30 second chunks as needed by tool runtime limits
  - repeated no-output polls are expected while waiting for the next 570-second check
- Single monitor process rule:
  - run only one active monitor loop per job to avoid duplicate SSH tunnel and port-binding conflicts
- Sleep must be foreground (max ~10 min due to tool timeout).
- Loop control is at agent level, not bash.
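The cadence numbers above fit together as follows. This is a minimal sketch with hypothetical names; it only computes durations and does not sleep, since loop control stays at agent level.

```python
import math

STARTUP_SLEEP_S = 120     # one-time check after submit/resubmit
NORMAL_SLEEP_S = 570      # default cadence after startup stabilization
POLL_CHUNK_S = 30         # approximate poll size under tool runtime limits
FOREGROUND_LIMIT_S = 600  # ~10 min tool timeout for a foreground sleep


def next_sleep_seconds(just_submitted: bool) -> int:
    """Pick the sleep for the next iteration of the monitoring loop."""
    return STARTUP_SLEEP_S if just_submitted else NORMAL_SLEEP_S


def polls_per_wait(sleep_s: int, chunk_s: int = POLL_CHUNK_S) -> int:
    """How many polls of about chunk_s seconds one wait spans."""
    return math.ceil(sleep_s / chunk_s)
```

With these values, one normal wait spans `polls_per_wait(570)` = 19 polls, most of which are expected to return no output.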
When testing or using marin-mcp-babysitter, keep the MCP server resident and
verify the job through MCP tools, not only through Iris CLI commands.
- Keep the controller tunnel and MCP server in named, restartable sessions (`screen`, `tmux`, or one long-running exec session). Record session names, ports, and log paths in the state file.
- Start MCP with a stable local controller URL and streamable HTTP transport:
  `uv run --package marin marin-mcp-babysitter --controller-url <URL> --cluster <CLUSTER> --transport streamable-http --host 127.0.0.1 --port <PORT>`
- Verify with `iris_job_summary` and `iris_tail_logs`. For heartbeat-style monitoring, report: job state, latest progress/tick/log line, timestamp, and error signal.
- If the MCP server is reachable but tool calls fail with connection refused to the controller URL, restart only the smoke-test tunnel/session. Do not restart or mutate the Iris cluster.
- If a sandbox blocks direct localhost TCP probes, run the probe inside an existing long-lived session and write a small JSON result under `scratch/`.
- For bounded smoke tests, create a thread heartbeat only after the job is submitted, MCP is reachable, and at least one expected log/progress line has appeared. Delete the heartbeat and stop the smoke-test sessions/listeners when the job reaches the expected terminal state.
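The localhost probe mentioned above could look like the sketch below, run from inside the long-lived session. The function names, the result fields, and the `mcp_probe.json` file name are assumptions for illustration; only the "small JSON result under `scratch/`" requirement comes from this skill.

```python
import json
import socket
import time
from pathlib import Path


def probe_tcp(host: str, port: int, timeout_s: float = 2.0) -> dict:
    """Try one TCP connect and return a small JSON-serializable result."""
    result = {
        "host": host,
        "port": port,
        "ts": int(time.time() * 1000),
        "reachable": False,
        "error": None,
    }
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            result["reachable"] = True
    except OSError as exc:  # refused, timed out, or blocked by the sandbox
        result["error"] = str(exc)
    return result


def write_probe_result(result: dict, out_dir: str = "scratch",
                       name: str = "mcp_probe.json") -> Path:
    """Write the probe result as JSON, creating the directory if needed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / name
    path.write_text(json.dumps(result, indent=2))
    return path
```

A TCP connect only shows that something is listening on the port; the MCP tools (`iris_job_summary`, `iris_tail_logs`) are still needed to verify the job itself.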
Write the state to `scratch/<create_timestamp>_monitoring_state.json`, creating the
`scratch` directory if needed. `<create_timestamp>` has the format `YYYYMMDD-HHMM`.
Track `restart_count` to detect flapping. Add the MCP fields when a resident MCP
server is part of the monitoring setup. The state file allows resuming after a context reset.
{
"ts": <timestamp_ms>,
"job_id": "<JOB_ID>",
"config": "<IRIS_CONFIG_PATH>",
"mcp_url": "http://127.0.0.1:<PORT>/mcp",
"tunnel_session": "<SESSION_NAME>",
"server_session": "<SESSION_NAME>",
"tunnel_log": "scratch/<TUNNEL_LOG>",
"server_log": "scratch/<SERVER_LOG>",
"resubmit_command": "<IRIS_JOB_RUN_COMMAND_WITH_NO_WAIT>",
"restart_count": 0
}

1. SLEEP
- if just submitted/restarted: sleep 120 once
- otherwise: sleep 570
2. CHECK LOGS
uv run iris --config <CONFIG> job logs --since-seconds 900 --include-children <JOB_ID> | rg -i -e "loss|error|traceback|exception|resource_exhausted|oom|compiler_base\.cc:2587|program hbm requirement|largest program allocations|ownerdiederror|dead node|node death|autoscaler unsatisfied resources|no accelerator found|failed_precondition|device or resource busy"
3. CHECK STATUS
uv run iris --config <CONFIG> job list --json --prefix <JOB_ID>
Terminal success: JOB_STATE_SUCCEEDED
Terminal non-success: JOB_STATE_FAILED, JOB_STATE_KILLED, JOB_STATE_WORKER_FAILED, JOB_STATE_UNSCHEDULABLE
Non-terminal: JOB_STATE_PENDING, JOB_STATE_BUILDING, JOB_STATE_RUNNING
If `pending_reason` indicates worker scale-up/capacity wait, treat as scheduler
capacity wait — do not run cluster update/recreate/restart actions. Continue
waiting on cadence, or stop+resubmit only if user explicitly asks.
Treat RUNNING as controller-level signal only; confirm allocation via expected
W&B run when possible.
3a. ON TERMINAL STATE / OOM-LIKE SIGNAL — get a structured per-task summary
(final state, exit, duration, peak memory) instead of grepping logs:
uv run iris --config <CONFIG> job summary --json <JOB_ID>
Fast postmortem: e.g. "13/14 shards peaked near the container memory limit
and failed with exit 137" → cgroup OOM, raise `--memory` on resubmit.
4. PRINT W&B RUN IDS/LINKS (once per training run)
5. REPORT PROGRESS (format: ~<current>/<exact_max>)
6. EVALUATE (terminal? error? stalled? -> recover or continue)
7. RECOVER (STOP -> RESUBMIT)
- If current job is still non-terminal, stop it first:
uv run iris --config <CONFIG> job stop <JOB_ID>
- Then resubmit:
<RESUBMIT_COMMAND>
- Capture `job_id` from output (line like `Job submitted: /<user>/<job>`).
- Iris nuance:
- if `resubmit_command` omits `--job-name`, Iris auto-generates a fresh id each resubmission.
- if `resubmit_command` uses a fixed `--job-name`, Iris may reuse the same id
after terminal completion by replacing the finished job.
- Update state file: `job_id=<NEW_JOB_ID>`, `restart_count += 1`.
- Go to step 1.
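The bookkeeping parts of the loop (state file naming, state classification in step 3, and the job id capture and state update in step 7) can be sketched as below. All helper names are hypothetical. The sketch takes the job state as a plain string because the JSON shape printed by `job list --json` is not specified here, and it assumes the submit output contains a line like `Job submitted: /<user>/<job>`.

```python
import json
import re
import time
from datetime import datetime
from pathlib import Path

TERMINAL_SUCCESS = {"JOB_STATE_SUCCEEDED"}
TERMINAL_NON_SUCCESS = {
    "JOB_STATE_FAILED",
    "JOB_STATE_KILLED",
    "JOB_STATE_WORKER_FAILED",
    "JOB_STATE_UNSCHEDULABLE",
}
NON_TERMINAL = {"JOB_STATE_PENDING", "JOB_STATE_BUILDING", "JOB_STATE_RUNNING"}

SUBMITTED_RE = re.compile(r"Job submitted:\s*(/[^/\s]+/\S+)")


def state_path(created: datetime, scratch_dir: str = "scratch") -> Path:
    """Build scratch/<YYYYMMDD-HHMM>_monitoring_state.json."""
    stamp = created.strftime("%Y%m%d-%H%M")
    return Path(scratch_dir) / f"{stamp}_monitoring_state.json"


def classify_state(state: str) -> str:
    """Map an Iris job state name to the categories used in step 3."""
    if state in TERMINAL_SUCCESS:
        return "terminal_success"
    if state in TERMINAL_NON_SUCCESS:
        return "terminal_non_success"
    if state in NON_TERMINAL:
        return "non_terminal"
    return "unknown"


def parse_submitted_job_id(output: str):
    """Return the canonical job id from submit output, or None if absent."""
    match = SUBMITTED_RE.search(output)
    return match.group(1) if match else None


def record_resubmission(path, new_job_id: str) -> dict:
    """Update job_id, bump restart_count, refresh ts, and rewrite the file."""
    path = Path(path)
    state = json.loads(path.read_text())
    state["job_id"] = new_job_id
    state["restart_count"] = state.get("restart_count", 0) + 1
    state["ts"] = int(time.time() * 1000)
    path.write_text(json.dumps(state, indent=2))
    return state
```

A steadily climbing `restart_count` in the file is the flapping signal; the threshold at which to stop and report is left to the monitoring owner.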
When EVALUATE detects an error, before recovery:
- Analyze logs: look for `Traceback`, `Error`, `Exception`. Identify file and line.
- If small fix (typo, missing import, wrong variable name): fix it, then RECOVER.
- If complex (architectural, unclear cause, broad investigation): report to user, exit loop.
Small-fix examples: NameError, ImportError, SyntaxError, obvious KeyError.
Complex examples: OOM, TPU/XLA HBM exhaustion, distributed training failures, data loading issues, unclear multi-file stack traces.
- Treat TPU/XLA HBM reports as failure even without literal OOM:
  - `Program hbm requirement ...`
  - `Largest program allocations in hbm`
- If progress stalls across multiple intervals with `OwnerDiedError`, dead node, or unsatisfied resources -> mark `degraded` and notify user.
- If same error repeats after one fix attempt, do not retry blindly; report to user.
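The triage rules above can be approximated with a first-pass classifier. This is a rough sketch under stated assumptions: `triage` and `next_action` are hypothetical names, the patterns cover only the examples listed in this skill, and whether a `KeyError` is "obvious" remains a judgment call that no regex captures.

```python
import re

# Small-fix exception names taken from the examples above.
SMALL_FIX_RE = re.compile(r"\b(NameError|ImportError|SyntaxError|KeyError)\b")

# Complex signals; HBM reports count as failures even without a literal "OOM".
COMPLEX_RE = re.compile(
    r"resource_exhausted|\boom\b|program hbm requirement"
    r"|largest program allocations in hbm|ownerdiederror|dead node",
    re.IGNORECASE,
)

GENERIC_ERROR_RE = re.compile(r"Traceback|Error|Exception")


def triage(log_text: str) -> str:
    """Return 'complex', 'small_fix', or 'none' for a chunk of log text."""
    if COMPLEX_RE.search(log_text):
        return "complex"
    if SMALL_FIX_RE.search(log_text):
        return "small_fix"
    if GENERIC_ERROR_RE.search(log_text):
        return "complex"  # unclear cause: report rather than guess
    return "none"


def next_action(category: str, repeated_after_fix: bool = False) -> str:
    """Map a triage category to the loop's next step."""
    if category == "none":
        return "continue"
    if category == "small_fix" and not repeated_after_fix:
        return "fix_then_recover"
    return "report_to_user"
```

The classifier checks complex signals first, so a log that contains both an HBM report and a `KeyError` is reported to the user rather than patched.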
- Debug Zephyr pipeline issues -> debug-zephyr-job
- Debug TPU bad-node errors -> debug-tpu
- Debug running tasks with `iris task exec` -> debug-iris-job
- Iris `job list --prefix` requires canonical job names (`/<user>/<job>`), not short names.
- Iris monitoring is job-level; cluster updates are not part of normal recovery.
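A quick guard against passing a short name to `job list --prefix` could look like this. It is a sketch only: `is_canonical_job_id` is a hypothetical helper, and the pattern assumes that a canonical name is `/<user>/<job>` with no whitespace, possibly followed by further path segments.

```python
import re

# Assumption: a leading slash, a user segment, then a non-empty job part.
CANONICAL_JOB_RE = re.compile(r"^/[^/\s]+/\S+$")


def is_canonical_job_id(job_id: str) -> bool:
    """True if job_id looks like /<user>/<job> rather than a short name."""
    return bool(CANONICAL_JOB_RE.match(job_id))
```

If the check fails, ask the user for the full `/<user>/<job>` name rather than guessing the user prefix.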