Merged
31 changes: 30 additions & 1 deletion .agents/skills/babysit-job/SKILL.md
@@ -70,17 +70,46 @@ If any required field is missing, ask for it before proceeding.
- Sleep must be foreground (max ~10 min due to tool timeout).
- Loop control is at agent level, not bash.

## MCP-Assisted Monitoring

When testing or using `marin-mcp-babysitter`, keep the MCP server resident and
verify the job through MCP tools, not only through Iris CLI commands.

- Keep the controller tunnel and MCP server in named, restartable sessions
(`screen`, `tmux`, or one long-running exec session). Record session names,
ports, and log paths in the state file.
- Start MCP with a stable local controller URL and streamable HTTP transport:
`uv run --package marin marin-mcp-babysitter --controller-url <URL> --cluster <CLUSTER> --transport streamable-http --host 127.0.0.1 --port <PORT>`
- Verify with `iris_job_summary` and `iris_tail_logs`. For heartbeat-style
monitoring, report: job state, latest progress/tick/log line, timestamp, and
error signal.
- If the MCP server is reachable but tool calls fail with connection refused to
the controller URL, restart only the smoke-test tunnel/session. Do not restart
or mutate the Iris cluster.
- If a sandbox blocks direct localhost TCP probes, run the probe inside an
existing long-lived session and write a small JSON result under `scratch/`.
- For bounded smoke tests, create a thread heartbeat only after the job is
submitted, MCP is reachable, and at least one expected log/progress line has
appeared. Delete the heartbeat and stop the smoke-test sessions/listeners when
the job reaches the expected terminal state.
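
The localhost probe described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not part of the skill itself: the function names, the result fields, and the `mcp_probe.json` file name are assumptions chosen for the example.

```python
import json
import socket
import time
from pathlib import Path


def probe_tcp(host: str, port: int, timeout_s: float = 2.0) -> dict:
    """Attempt one TCP connection and describe the outcome as a JSON-ready dict."""
    result = {"host": host, "port": port, "ts": int(time.time() * 1000)}
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            result["reachable"] = True
            result["error"] = None
    except OSError as exc:
        result["reachable"] = False
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result


def write_probe_result(result: dict, scratch_dir: Path = Path("scratch")) -> Path:
    """Write the probe result under scratch/ so it can be read outside the session."""
    scratch_dir.mkdir(parents=True, exist_ok=True)
    out_path = scratch_dir / "mcp_probe.json"  # hypothetical file name
    out_path.write_text(json.dumps(result, indent=2) + "\n")
    return out_path
```

Run inside the long-lived session, a call such as `write_probe_result(probe_tcp("127.0.0.1", <PORT>))` leaves a small file that the agent can read even when it cannot open the socket itself.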

## State File

Write to `scratch/<create_timestamp>_monitoring_state.json`, create the `scratch`
directory if needed. `<create_timestamp>` should have format `YYYYMMDD-HHMM`.
-Track `restart_count` to detect flapping. State file allows resume after context reset.
+Track `restart_count` to detect flapping. Add MCP fields when a resident MCP
+server is part of the monitoring setup. State file allows resume after context reset.

```json
{
  "ts": <timestamp_ms>,
  "job_id": "<JOB_ID>",
  "config": "<IRIS_CONFIG_PATH>",
  "mcp_url": "http://127.0.0.1:<PORT>/mcp",
  "tunnel_session": "<SESSION_NAME>",
  "server_session": "<SESSION_NAME>",
  "tunnel_log": "scratch/<TUNNEL_LOG>",
  "server_log": "scratch/<SERVER_LOG>",
  "resubmit_command": "<IRIS_JOB_RUN_COMMAND_WITH_NO_WAIT>",
  "restart_count": 0
}
```
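
As a rough sketch of how the state file conventions above might be handled in Python (the skill does not prescribe an implementation), the helpers below build the `YYYYMMDD-HHMM` file name, create `scratch/` on demand, and bump `restart_count`. The function names, the use of local time, and the `flap_threshold` value are assumptions made for illustration.

```python
import json
import time
from datetime import datetime
from pathlib import Path
from typing import Optional


def new_state_path(scratch_dir: Path = Path("scratch"),
                   now: Optional[datetime] = None) -> Path:
    """Build scratch/<create_timestamp>_monitoring_state.json (local time assumed)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M")
    return scratch_dir / f"{stamp}_monitoring_state.json"


def save_state(path: Path, state: dict) -> None:
    """Write the state, creating the scratch directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    state = {**state, "ts": int(time.time() * 1000)}  # refresh the millisecond timestamp
    path.write_text(json.dumps(state, indent=2) + "\n")


def record_restart(path: Path, flap_threshold: int = 3) -> bool:
    """Increment restart_count; return True once it reaches the (assumed) flap threshold."""
    state = json.loads(path.read_text())
    state["restart_count"] = state.get("restart_count", 0) + 1
    save_state(path, state)
    return state["restart_count"] >= flap_threshold
```

Because the file is rewritten in full on every update, a fresh agent context can reload it with a single `json.loads` call and continue from the recorded `restart_count`.
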
5 changes: 5 additions & 0 deletions lib/marin/pyproject.toml
@@ -17,6 +17,7 @@ dependencies = [
"fasteners>=0.19",
"flask>=3.1.3",
"marin-fray",
"marin-iris",
"marin-rigging",
"fsspec>=2025.3.0",
"gcsfs",
@@ -31,6 +32,7 @@ dependencies = [
"lxml-html-clean>=0.4.4",
"markdownify>=0.14.1",
"multiprocess==0.70.16",
"mcp>=1.25.0",
"numpy",
"openai",
"pandas>=2.0",
@@ -52,6 +54,9 @@ dependencies = [
[project.license]
file = "../../LICENSE"

[project.scripts]
marin-mcp-babysitter = "marin.mcp.babysitter:main"

[dependency-groups]
test = [
"pytest>=8.3.2",
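
The `[project.scripts]` entry above uses the standard `module:attribute` form: on install, a `marin-mcp-babysitter` command is generated that imports `marin.mcp.babysitter` and calls its `main`. The sketch below illustrates that lookup in plain Python. `resolve_entry_point` is a hypothetical helper written for this example, not something the PR adds, and it is demonstrated on a standard-library target so it runs without Marin installed.

```python
import importlib


def resolve_entry_point(spec: str):
    """Resolve a 'package.module:attr' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    if attr_path:
        # Walk dotted attributes, e.g. "Class.method".
        for attr in attr_path.split("."):
            obj = getattr(obj, attr)
    return obj


# Stand-in target; the real spec in this diff is "marin.mcp.babysitter:main".
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```
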
4 changes: 4 additions & 0 deletions lib/marin/src/marin/mcp/__init__.py
@@ -0,0 +1,4 @@
# Copyright The Marin Authors
# SPDX-License-Identifier: Apache-2.0

"""MCP servers for Marin operational workflows."""