@@ -70,17 +70,46 @@ If any required field is missing, ask for it before proceeding.
 - Sleep must be foreground (max ~10 min due to tool timeout).
 - Loop control is at agent level, not bash.
 
+## MCP-Assisted Monitoring
+
+When testing or using `marin-mcp-babysitter`, keep the MCP server resident and
+verify the job through MCP tools, not only through Iris CLI commands.
+
+- Keep the controller tunnel and MCP server in named, restartable sessions
+  (`screen`, `tmux`, or one long-running exec session). Record session names,
+  ports, and log paths in the state file.
+- Start MCP with a stable local controller URL and streamable HTTP transport:
+  `uv run --package marin marin-mcp-babysitter --controller-url <URL> --cluster <CLUSTER> --transport streamable-http --host 127.0.0.1 --port <PORT>`
+- Verify with `iris_job_summary` and `iris_tail_logs`. For heartbeat-style
+  monitoring, report: job state, latest progress/tick/log line, timestamp, and
+  error signal.
+- If the MCP server is reachable but tool calls fail with connection refused to
+  the controller URL, restart only the smoke-test tunnel/session. Do not restart
+  or mutate the Iris cluster.
+- If a sandbox blocks direct localhost TCP probes, run the probe inside an
+  existing long-lived session and write a small JSON result under `scratch/`.
+- For bounded smoke tests, create a thread heartbeat only after the job is
+  submitted, MCP is reachable, and at least one expected log/progress line has
+  appeared. Delete the heartbeat and stop the smoke-test sessions/listeners when
+  the job reaches the expected terminal state.
+
 ## State File
 
 Write to `scratch/<create_timestamp>_monitoring_state.json`, create the `scratch`
 directory if needed. `<create_timestamp>` should have format `YYYYMMDD-HHMM`.
-Track `restart_count` to detect flapping. State file allows resume after context reset.
+Track `restart_count` to detect flapping. Add MCP fields when a resident MCP
+server is part of the monitoring setup. State file allows resume after context reset.
 
 ```json
 {
   "ts": <timestamp_ms>,
   "job_id": "<JOB_ID>",
   "config": "<IRIS_CONFIG_PATH>",
+  "mcp_url": "http://127.0.0.1:<PORT>/mcp",
+  "tunnel_session": "<SESSION_NAME>",
+  "server_session": "<SESSION_NAME>",
+  "tunnel_log": "scratch/<TUNNEL_LOG>",
+  "server_log": "scratch/<SERVER_LOG>",
   "resubmit_command": "<IRIS_JOB_RUN_COMMAND_WITH_NO_WAIT>",
   "restart_count": 0
 }
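
For readers who want to see the state file from this hunk produced programmatically, here is a minimal sketch. It uses only the field names, the `scratch/` location, and the `YYYYMMDD-HHMM` timestamp format from the diff. The function name `write_monitoring_state`, its parameters, and the choice to omit the MCP fields when no port is given are assumptions made for illustration; the skill itself does not define any such helper.

```python
import json
from datetime import datetime
from pathlib import Path


def write_monitoring_state(job_id, config, resubmit_command, *, port=None,
                           tunnel_session=None, server_session=None,
                           tunnel_log=None, server_log=None,
                           restart_count=0, scratch_dir="scratch", now=None):
    """Write the monitoring state file and return its path.

    Hypothetical helper: only the JSON keys, the scratch/ location, and the
    timestamp format come from the diff; everything else is an assumption.
    """
    now = now or datetime.now()
    scratch = Path(scratch_dir)
    scratch.mkdir(parents=True, exist_ok=True)  # create scratch/ if needed

    state = {
        "ts": int(now.timestamp() * 1000),  # timestamp in milliseconds
        "job_id": job_id,
        "config": config,
    }
    # Assumption: MCP fields are written only when a resident MCP server
    # (identified here by its local port) is part of the monitoring setup.
    if port is not None:
        state.update({
            "mcp_url": f"http://127.0.0.1:{port}/mcp",
            "tunnel_session": tunnel_session,
            "server_session": server_session,
            "tunnel_log": tunnel_log,
            "server_log": server_log,
        })
    state["resubmit_command"] = resubmit_command
    state["restart_count"] = restart_count

    # <create_timestamp> uses the YYYYMMDD-HHMM format from the skill text.
    path = scratch / f"{now.strftime('%Y%m%d-%H%M')}_monitoring_state.json"
    path.write_text(json.dumps(state, indent=2) + "\n")
    return path
```

Called with a port, session names, and log paths, the helper writes the same keys in the same order as the JSON template above; called without a port, it writes only the pre-existing fields.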