Commit cd31015

Document MCP babysitting workflow
1 parent 825f8a2 commit cd31015

1 file changed

Lines changed: 30 additions & 1 deletion

.agents/skills/babysit-job/SKILL.md

@@ -70,17 +70,46 @@ If any required field is missing, ask for it before proceeding.
 - Sleep must be foreground (max ~10 min due to tool timeout).
 - Loop control is at agent level, not bash.

+## MCP-Assisted Monitoring
+
+When testing or using `marin-mcp-babysitter`, keep the MCP server resident and
+verify the job through MCP tools, not only through Iris CLI commands.
+
+- Keep the controller tunnel and MCP server in named, restartable sessions
+  (`screen`, `tmux`, or one long-running exec session). Record session names,
+  ports, and log paths in the state file.
+- Start MCP with a stable local controller URL and streamable HTTP transport:
+  `uv run --package marin marin-mcp-babysitter --controller-url <URL> --cluster <CLUSTER> --transport streamable-http --host 127.0.0.1 --port <PORT>`
+- Verify with `iris_job_summary` and `iris_tail_logs`. For heartbeat-style
+  monitoring, report: job state, latest progress/tick/log line, timestamp, and
+  error signal.
+- If the MCP server is reachable but tool calls fail with connection refused to
+  the controller URL, restart only the smoke-test tunnel/session. Do not restart
+  or mutate the Iris cluster.
+- If a sandbox blocks direct localhost TCP probes, run the probe inside an
+  existing long-lived session and write a small JSON result under `scratch/`.
+- For bounded smoke tests, create a thread heartbeat only after the job is
+  submitted, MCP is reachable, and at least one expected log/progress line has
+  appeared. Delete the heartbeat and stop the smoke-test sessions/listeners when
+  the job reaches the expected terminal state.
+
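The probe bullet in the section above could be implemented roughly as follows. This is a minimal sketch, not part of the commit: the helper names `probe_tcp` and `write_probe_result`, the result keys, and the default file name `mcp_probe.json` are all assumptions chosen for illustration. It uses only the Python standard library and would be run inside the long-lived session that can reach the MCP port.

```python
import json
import socket
import time
from pathlib import Path


def probe_tcp(host: str, port: int, timeout_s: float = 2.0) -> dict:
    """Attempt one TCP connection and describe the outcome as a plain dict."""
    started_ms = int(time.time() * 1000)
    try:
        # create_connection raises OSError (refused, unreachable, timed out).
        with socket.create_connection((host, port), timeout=timeout_s):
            reachable, error = True, None
    except OSError as exc:
        reachable, error = False, f"{type(exc).__name__}: {exc}"
    return {
        "host": host,
        "port": port,
        "reachable": reachable,
        "error": error,
        "ts": started_ms,
    }


def write_probe_result(
    result: dict,
    out_dir: str = "scratch",
    name: str = "mcp_probe.json",  # hypothetical file name
) -> Path:
    """Write the probe result as JSON under out_dir, creating it if needed."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / name
    path.write_text(json.dumps(result, indent=2) + "\n")
    return path
```

Calling `write_probe_result(probe_tcp("127.0.0.1", port))` with the port recorded in the state file leaves a small JSON file that the agent can read from the sandbox without opening a socket itself.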
 ## State File

 Write to `scratch/<create_timestamp>_monitoring_state.json`, create the `scratch`
 directory if needed. `<create_timestamp>` should have format `YYYYMMDD-HHMM`.
-Track `restart_count` to detect flapping. State file allows resume after context reset.
+Track `restart_count` to detect flapping. Add MCP fields when a resident MCP
+server is part of the monitoring setup. State file allows resume after context reset.

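As an illustration of the state-file convention described above, the following sketch builds the `YYYYMMDD-HHMM` file name, creates the directory, and increments `restart_count` on each restart. The function names are hypothetical and are not defined by the skill; only the path pattern and the `ts` and `restart_count` fields come from the document.

```python
import json
import time
from datetime import datetime
from pathlib import Path


def state_file_path(now: datetime, scratch_dir: str = "scratch") -> Path:
    """Return scratch/<YYYYMMDD-HHMM>_monitoring_state.json for the given time."""
    stamp = now.strftime("%Y%m%d-%H%M")
    return Path(scratch_dir) / f"{stamp}_monitoring_state.json"


def write_state(path: Path, state: dict) -> None:
    """Write the state as JSON, refreshing `ts` (milliseconds since epoch)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = dict(state, ts=int(time.time() * 1000))
    path.write_text(json.dumps(payload, indent=2) + "\n")


def record_restart(path: Path) -> int:
    """Increment restart_count in the state file and return the new value."""
    state = json.loads(path.read_text())
    state["restart_count"] = state.get("restart_count", 0) + 1
    write_state(path, state)
    return state["restart_count"]
```

Because the path is computed once at creation time, an agent resuming after a context reset would need to locate the existing file rather than call `state_file_path` again; a rising `restart_count` in that file is the flapping signal.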
 ```json
 {
   "ts": <timestamp_ms>,
   "job_id": "<JOB_ID>",
   "config": "<IRIS_CONFIG_PATH>",
+  "mcp_url": "http://127.0.0.1:<PORT>/mcp",
+  "tunnel_session": "<SESSION_NAME>",
+  "server_session": "<SESSION_NAME>",
+  "tunnel_log": "scratch/<TUNNEL_LOG>",
+  "server_log": "scratch/<SERVER_LOG>",
   "resubmit_command": "<IRIS_JOB_RUN_COMMAND_WITH_NO_WAIT>",
   "restart_count": 0
 }
