Commit 463229a

ChenhanYu and claude authored
docs(tools/mcp): scope policy — environment tooling, not workflow policy (#1712)
## Summary

Adds `tools/mcp/SCOPE.md` documenting the design boundary for the MCP server family (`modelopt-mcp` + `nmm-sandbox-mcp` + `pensieve-intern-mcp`):

- **In scope:** universal verb-shaped operations on the cluster, launcher, agent engine: environment tooling that every workflow can rely on as pre-knowledge.
- **Out of scope:** workflow-specific logic ("run an EAGLE3 cell", "publish a specdec release"). That belongs in SPEC text + agent reasoning, composed out of these primitives.

## Why

Surfaced during the OMNIML-5123 follow-up discussion. The 14 tools currently in scope across the three servers are deliberately a small, closed set. Letting workflow-specific tools sneak in would sprawl the catalog, force per-workflow opt-in, and break the "MCP-as-pre-knowledge" promise that makes the catalog useful as a stable baseline for pensieve-intern's agents.

SCOPE.md documents:

- The test: would this tool be useful across *any* workflow that uses the same environment?
- A side-by-side table of in-scope vs out-of-scope tool shapes
- Symptoms that a tool is misclassified
- Why the line matters (interface-drift blast radius collapses, agent tool catalog stays learnable)
- Four practical questions to ask before adding a tool
- A reference list of the 14 currently-in-scope tools

## Anchor

[OMNIML-5123](https://jirasw.nvidia.com/browse/OMNIML-5123) (Epic). Docs-only, no code changes.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  * Added scope documentation clarifying that MCP environment tooling covers universal verb-shaped operations (job submission, verification, artifact reading) while workflow-specific policies belong elsewhere.
  * Documented inclusion criteria and tool inventory guidance for MCP servers.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 9f37fe1 commit 463229a

2 files changed

Lines changed: 106 additions & 0 deletions

tools/mcp/README.md

Lines changed: 4 additions & 0 deletions
@@ -132,6 +132,10 @@ Three constants drive the surface here:
2. **Filesystem is the source of truth.** Status + logs both read from nemo_run's experiment dir. There is no in-memory registry, so state survives MCP server restarts. The bridge module never holds per-job state across calls.
3. **`verify_setup` is auto-called by `submit_job` by default.** The probe takes ~1 second; the cost of a misconfigured submission is 30+ seconds of cluster timeout or container-pull. Always-on verification pays back immediately. Callers can pass `skip_verify=True` when they just probed.

## Scope: environment tooling, not workflow policy

See [SCOPE.md](SCOPE.md) for the policy that gates what belongs in this MCP family. Short version: tools here are universal verb-shaped operations on the cluster / launcher / engine (`submit_job`, `verify_setup`, `read_cluster_artifact`, …). Workflow-specific logic ("run an EAGLE3 cell", "publish a specdec release") stays in SPEC text + agent reasoning, composed out of these primitives. The policy applies to `nmm-sandbox-mcp` and `pensieve-intern-mcp` too.

## Internal companion (NVIDIA only)

For NVIDIA-internal users running on the in-house clusters, there's a companion server [`nmm-sandbox-mcp`](https://gitlab-master.nvidia.com/omniml/integration/nmm-sandbox/-/tree/main/tools/mcp) that adds:

tools/mcp/SCOPE.md

Lines changed: 102 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,102 @@
# MCP scope policy: environment tooling, not workflow policy

This applies to `modelopt-mcp` and its sibling servers
(`nmm-sandbox-mcp`, `pensieve-intern-mcp`).

## The principle

These MCP servers host **environment tooling**: operations on the
cluster, launcher, or agent engine that are *universal across
workflows*. They do **not** host workflow-specific logic.

Workflow-specific logic ("run an EAGLE3 training cell", "publish a
specdec release") lives in **SPEC text + agent reasoning**, composed
out of the environment primitives below.

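
As a rough illustration of this split, the sketch below composes a workflow out of generic primitives. Only the tool names `verify_setup`, `submit_job`, and `read_cluster_artifact` come from the catalog; the function bodies, return shapes, `run_workflow`, and both file paths are hypothetical stand-ins made up so the example runs, and the real tools may behave differently.

```python
# Stand-in primitives. The names match the catalog; the bodies, signatures
# and return values are hypothetical and exist only to make this runnable.
calls = []

def verify_setup(executor):
    calls.append("verify_setup")
    return {"ok": True}

def submit_job(yaml_path):
    calls.append("submit_job")
    return {"experiment_id": "exp-0001"}

def read_cluster_artifact(experiment_id, path):
    calls.append("read_cluster_artifact")
    return "<artifact contents>"


def run_workflow(yaml_path, report_path):
    """Composition layer: step ordering and workflow choices live here
    (in practice, in SPEC text plus agent reasoning), not in an MCP tool."""
    if not verify_setup("local")["ok"]:
        raise RuntimeError("environment not ready")
    job = submit_job(yaml_path)
    return read_cluster_artifact(job["experiment_id"], report_path)


# The workload identifier shows up only in the arguments, never in a tool name.
report = run_workflow("examples/eagle3/train.yaml", "results/report.yaml")
print(calls)
```

Swapping in a different workflow would change the arguments and the composition function, while the three primitives stay as they are.
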
## The test

A tool belongs in this MCP family if and only if it would be useful
on its own across *any* workflow that uses the same environment.

| ✅ Environment tooling | ❌ Workflow policy |
|---|---|
| `submit_job(yaml_path, ...)` | `bench_eagle3_against_baseline()` |
| `resolve_cluster_factory(name)` | `create_specdec_release_pr()` |
| `verify_setup(executor, ...)` | `run_qwen3_quantize_sweep()` |
| `open_draft_pr(target_repo, ...)` | `dispatch_intern_epic(workflow, ...)` |
| `read_cluster_artifact(experiment_id, path)` | `auto_skip_lts_failure()` |

## Symptoms a tool is misclassified

- Tool name contains a workload identifier (eagle3, specdec, a model
  family, a specific algorithm)
- Tool's job is "do these N steps in this order" rather than one
  well-defined operation
- Changes to a workflow's policy (a new model, a new sweep dimension,
  a renamed stage) require changing the tool's code
- Two unrelated workflows would *not* naturally compose the tool

If any of these is true, the abstraction belongs in the **composition
layer** (SPEC text + the agent's reasoning), not the **primitive
layer** (MCP).

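
The first symptom is mechanical enough to check automatically. The sketch below is one possible heuristic, not part of the MCP servers: the identifier list and the helper `names_a_workload` are assumptions for illustration, and the candidate names are taken from the table in "The test".

```python
# Hypothetical, non-exhaustive list of workload identifiers. A real check
# would need a maintained list; this one is assumed for the sketch.
WORKLOAD_IDENTIFIERS = ("eagle3", "specdec", "qwen3")

def names_a_workload(tool_name):
    """Return True if any underscore-separated token is a workload identifier."""
    tokens = tool_name.lower().split("_")
    return any(token in WORKLOAD_IDENTIFIERS for token in tokens)

candidates = [
    "submit_job",
    "read_cluster_artifact",
    "bench_eagle3_against_baseline",
    "create_specdec_release_pr",
    "run_qwen3_quantize_sweep",
    "dispatch_intern_epic",
]
flagged = [name for name in candidates if names_a_workload(name)]
print(flagged)
```

Note that `dispatch_intern_epic` passes this name check even though the table lists it as workflow policy. The name heuristic covers only the first symptom, so the remaining three still need human or agent judgment.
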
## Why the line matters

Today's three MCPs deliberately host a small, closed set of
verb-shaped operations on the cluster, launcher, and engine.
That choice:

- Keeps the agent's tool catalog learnable → less hallucination,
  shorter reasoning chains
- Makes the MCPs **pre-knowledge** every workflow can rely on
  without per-SPEC opt-in (the SPEC stops carrying CLI invocation
  details; the tool description IS the documentation; the schema
  IS the contract)
- Collapses interface-drift blast radius: a launcher refactor →
  the MCP's bridge layer absorbs the change → SPECs are unaffected.
  (Compare to today's failure mode, where renaming `launch_train.sh
  --model` to `--config` silently broke every SPEC that hardcoded
  the flag form.)

If we cross the line and add workflow-specific MCP tools, the
catalog sprawls and the value collapses. Each new workflow wants
its own tools; old SPECs reference tools they no longer need;
the model spends its budget on tool discovery instead of reasoning;
the "pre-knowledge" promise breaks because new tools require
per-workflow opt-in.

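
The interface-drift point can be made concrete with a minimal sketch. It does not show how the actual bridge module is written: `build_launch_command`, the constant name, and the config path are hypothetical, and only the script name and the two flag spellings come from the rename mentioned above.

```python
# Hypothetical bridge: the single place that knows the launcher's flag
# spelling. Before the rename this constant would have been "--model".
LAUNCHER_CONFIG_FLAG = "--config"

def build_launch_command(config_path):
    """Translate a generic argument into the launcher's current CLI form."""
    return ["launch_train.sh", LAUNCHER_CONFIG_FLAG, config_path]

# A caller (SPEC or agent) passes generic arguments and never spells the flag.
cmd = build_launch_command("configs/train.yaml")
print(cmd)
```

Under this arrangement a flag rename means editing one constant in the bridge, whereas SPECs that hardcode the flag each have to be found and fixed separately.
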
## Practical guidance for adding new tools

Before adding a tool, ask:

1. **Would this tool exist whether or not workflow X existed?**
   If no, it's workflow policy. Compose it in a SPEC instead.
2. **Does the tool's signature contain any workflow-specific knobs?**
   If yes, those knobs *are* the workflow policy. Refactor to a
   primitive that takes generic args.
3. **Would two unrelated workflows naturally compose this tool?**
   If yes, it's environment tooling. Add it.
4. **Is the tool's output the same shape across all callers?**
   If no, the tool is doing workflow-specific shaping in disguise.

When a piece of work feels like it wants an MCP tool but fails
these tests, the right move is usually to add a *helper module*
(Python in `modules/Model-Optimizer/tools/...`) and let SPECs
invoke it via shell or, better, to refactor the work so it
composes the existing environment tools.

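
The four questions can also be written down as a small review checklist. This is a sketch for reviewers and not tooling that exists in the repository: `ToolProposal` and `failed_questions` are hypothetical names, and the answers recorded for the two example tools are illustrative judgments consistent with the table in "The test".

```python
from dataclasses import dataclass

@dataclass
class ToolProposal:
    name: str
    exists_without_any_one_workflow: bool    # question 1
    has_workflow_specific_knobs: bool        # question 2
    composed_by_unrelated_workflows: bool    # question 3
    same_output_shape_for_all_callers: bool  # question 4

def failed_questions(proposal):
    """Return the numbers of the questions the proposal fails (empty = passes)."""
    failed = []
    if not proposal.exists_without_any_one_workflow:
        failed.append(1)
    if proposal.has_workflow_specific_knobs:
        failed.append(2)
    if not proposal.composed_by_unrelated_workflows:
        failed.append(3)
    if not proposal.same_output_shape_for_all_callers:
        failed.append(4)
    return failed

ok = ToolProposal("read_cluster_artifact", True, False, True, True)
bad = ToolProposal("run_qwen3_quantize_sweep", False, True, False, False)
print(failed_questions(ok), failed_questions(bad))
```

A non-empty result is a prompt to move the work into a SPEC or a helper module, as described above, and should be read as a review aid, not an automatic verdict.
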
## Current scope (reference)

The 14 tools currently in scope, by server:

- **modelopt-mcp** (9): `list_examples`, `verify_setup`, `submit_job`,
  `job_status`, `job_logs`, `wait_for_experiment`,
  `provision_passwordless_ssh_dry_run`, `read_cluster_artifact`,
  `open_draft_pr`
- **nmm-sandbox-mcp** (3): `list_internal_clusters`,
  `resolve_cluster_factory`, `submit_via_gitlab_ci`
- **pensieve-intern-mcp** (2): `clear_labels`, `report_verified`

Phase 3 plans (OMNIML-5133 NEL integration, OMNIML-5134 checkpoint
introspection) extend this set with more environment primitives;
both pass the test above.
