docs(tools/mcp): scope policy — environment tooling, not workflow policy (#1712)

ChenhanYu · claude · web-flow · commit 463229ad34c7 · 2026-06-14T04:15:04.000Z
## Summary Adds `tools/mcp/SCOPE.md` documenting the design boundary for the MCP server family (`modelopt-mcp` + `nmm-sandbox-mcp` + `pensieve-intern-mcp`): - **In scope:** universal verb-shaped operations on the cluster, launcher, agent engine — environment tooling that every workflow can rely on as pre-knowledge. - **Out of scope:** workflow-specific logic ("run an EAGLE3 cell", "publish a specdec release"). That belongs in SPEC text + agent reasoning, composed out of these primitives. ## Why Surfaced during the OMNIML-5123 follow-up discussion. The 14 tools currently in scope across the three servers are deliberately a small, closed set. Letting workflow-specific tools sneak in would sprawl the catalog, force per-workflow opt-in, and break the "MCP-as-pre-knowledge" promise that makes the catalog useful as a stable baseline for pensieve-intern's agents. SCOPE.md documents: - The test: would this tool be useful across *any* workflow that uses the same environment? - A side-by-side table of in-scope vs out-of-scope tool shapes - Symptoms that a tool is misclassified - Why the line matters (interface-drift blast radius collapses, agent tool catalog stays learnable) - Four practical questions to ask before adding a tool - A reference list of the 14 currently-in-scope tools ## Anchor [OMNIML-5123](https://jirasw.nvidia.com/browse/OMNIML-5123) (Epic). Docs-only, no code changes.  ## Summary by CodeRabbit * **Documentation** * Added scope documentation clarifying that MCP environment tooling covers universal verb-shaped operations (job submission, verification, artifact reading) while workflow-specific policies belong elsewhere. * Documented inclusion criteria and tool inventory guidance for MCP servers.  Signed-off-by: Chenhan Yu <chenhany@nvidia.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/tools/mcp/README.md b/tools/mcp/README.md
@@ -132,6 +132,10 @@ Three constants drive the surface here:
 2. **Filesystem is the source of truth.** Status + logs both read from nemo_run's experiment dir. No in-memory registry — survives MCP server restarts. The bridge module never holds per-job state across calls.
 3. **`verify_setup` is auto-called by `submit_job` by default.** The probe takes ~1 second; the cost of a misconfigured submission is 30+ seconds of cluster timeout or container-pull. Always-on verification pays back immediately. Callers can pass `skip_verify=True` when they just probed.
 
+## Scope: environment tooling, not workflow policy
+
+See [SCOPE.md](SCOPE.md) for the policy that gates what belongs in this MCP family. Short version: tools here are universal verb-shaped operations on the cluster / launcher / engine (`submit_job`, `verify_setup`, `read_cluster_artifact`, …). Workflow-specific logic ("run an EAGLE3 cell", "publish a specdec release") stays in SPEC text + agent reasoning, composed out of these primitives. The policy applies to `nmm-sandbox-mcp` and `pensieve-intern-mcp` too.
+
 ## Internal companion (NVIDIA only)
 
 For NVIDIA-internal users running on the in-house clusters, there's a companion server [`nmm-sandbox-mcp`](https://gitlab-master.nvidia.com/omniml/integration/nmm-sandbox/-/tree/main/tools/mcp) that adds:
diff --git a/tools/mcp/SCOPE.md b/tools/mcp/SCOPE.md
@@ -0,0 +1,102 @@
+# MCP scope policy — environment tooling, not workflow policy
+
+This applies to `modelopt-mcp` and its sibling servers
+(`nmm-sandbox-mcp`, `pensieve-intern-mcp`).
+
+## The principle
+
+These MCP servers host **environment tooling** — operations on the
+cluster, launcher, or agent engine that are *universal across
+workflows*. They do **not** host workflow-specific logic.
+
+Workflow-specific logic — "run an EAGLE3 training cell", "publish a
+specdec release" — lives in **SPEC text + agent reasoning**, composed
+out of the environment primitives below.
+
+## The test
+
+A tool belongs in this MCP family if and only if it would be useful
+on its own across *any* workflow that uses the same environment.
+
+| ✅ Environment tooling | ❌ Workflow policy |
+|---|---|
+| `submit_job(yaml_path, ...)` | `bench_eagle3_against_baseline()` |
+| `resolve_cluster_factory(name)` | `create_specdec_release_pr()` |
+| `verify_setup(executor, ...)` | `run_qwen3_quantize_sweep()` |
+| `open_draft_pr(target_repo, ...)` | `dispatch_intern_epic(workflow, ...)` |
+| `read_cluster_artifact(experiment_id, path)` | `auto_skip_lts_failure()` |
+
+## Symptoms a tool is misclassified
+
+- Tool name contains a workload identifier (eagle3, specdec, a model
+  family, a specific algorithm)
+- Tool's job is "do these N steps in this order" rather than one
+  well-defined operation
+- Changes to a workflow's policy (a new model, a new sweep dimension,
+  a renamed stage) require changing the tool's code
+- Two unrelated workflows would *not* naturally compose the tool
+
+If any of these is true, the abstraction belongs in the **composition
+layer** (SPEC text + the agent's reasoning), not the **primitive
+layer** (MCP).
+
+## Why the line matters
+
+Today's three MCPs deliberately host a small, closed set of
+verb-shaped operations on the cluster, launcher, and engine.
+That choice:
+
+- Keeps the agent's tool catalog learnable → less hallucination,
+  shorter reasoning chains
+- Makes the MCPs **pre-knowledge** every workflow can rely on
+  without per-SPEC opt-in (the SPEC stops carrying CLI invocation
+  details; the tool description IS the documentation; the schema
+  IS the contract)
+- Collapses interface-drift blast radius: a launcher refactor →
+  the MCP's bridge layer absorbs the change → SPECs are unaffected.
+  (Compare to today's failure mode, where renaming `launch_train.sh
+  --model` → `--config` silently broke every SPEC that hardcoded
+  the flag form.)
+
+If we cross the line and add workflow-specific MCP tools, the
+catalog sprawls and the value collapses. Each new workflow wants
+its own tools; old SPECs reference tools they no longer need;
+the model spends its budget on tool discovery instead of reasoning;
+the "pre-knowledge" promise breaks because new tools require
+per-workflow opt-in.
+
+## Practical guidance for adding new tools
+
+Before adding a tool, ask:
+
+1. **Would this tool exist whether or not workflow X existed?**
+   If no, it's workflow policy. Compose it in a SPEC instead.
+2. **Does the tool's signature contain any workflow-specific knobs?**
+   If yes, those knobs *are* the workflow policy. Refactor to a
+   primitive that takes generic args.
+3. **Would two unrelated workflows naturally compose this tool?**
+   If yes, it's environment tooling. Add it.
+4. **Is the tool's output the same shape across all callers?**
+   If no, the tool is doing workflow-specific shaping in disguise.
+
+When a piece of work feels like it wants an MCP tool but fails
+these tests, the right move is usually to add a *helper module*
+(Python in `modules/Model-Optimizer/tools/...`) and let SPECs
+invoke it via shell — or, better, to refactor the work so it
+composes the existing environment tools.
+
+## Current scope (reference)
+
+The 14 tools currently in scope, by server:
+
+- **modelopt-mcp** (9): `list_examples`, `verify_setup`, `submit_job`,
+  `job_status`, `job_logs`, `wait_for_experiment`,
+  `provision_passwordless_ssh_dry_run`, `read_cluster_artifact`,
+  `open_draft_pr`
+- **nmm-sandbox-mcp** (3): `list_internal_clusters`,
+  `resolve_cluster_factory`, `submit_via_gitlab_ci`
+- **pensieve-intern-mcp** (2): `clear_labels`, `report_verified`
+
+Phase 3 plans (OMNIML-5133 NEL integration, OMNIML-5134 checkpoint
+introspection) extend this set with more environment primitives;
+both pass the test above.