Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion tools/mcp/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ Mode is determined by which args you pass, not by which tool you call. One tool,
|---|---|
| `list_examples` | Enumerate bundled launcher YAMLs under `tools/launcher/examples/` with model + description metadata extracted from each YAML. Discovery primitive — call this first when you don't know which YAML to launch. |
| `verify_setup(executor, ...)` | Fail-fast probe for the named executor. Docker: `docker info` (daemon up) + `docker info --format` runtime-registry check (looks for `"nvidia"` runtime registered by the NVIDIA Container Toolkit — no image pull, daemon-fast). Slurm: `ssh -o BatchMode=yes -o ConnectTimeout=5` to the cluster login node. Returns structured failure on auth / network / daemon issues — no exception. |
| `submit_job(yaml_path, hf_local? \| cluster_host?, ...)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). |
| `submit_job(yaml_path, hf_local? \| cluster_host?, ..., dry_run?)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). **Pass `dry_run=True`** to validate the YAML via `launch.py --dryrun --yes` without contacting the cluster / spawning a container / running sbatch — returns `{ok, dry_run: True, validated: bool, diagnostic?, exit_code, stdout_tail, stderr_tail, argv}` instead of `experiment_id`. Used by verify-task workflow stages (deployment_support, hidden_state_dump_support, mlm_eval, ...). |
| `job_status(experiment_id)` | Filesystem-based status from nemo_run's experiment dir (`_DONE`, `status_*.out`). Returns `done` / `failed` / `running` plus per-task statuses. No in-memory registry; survives MCP server restarts. |
| `job_logs(experiment_id, task?, tail?)` | Read `log_<task>.out` from the experiment dir. Per-task filtering + optional tail to truncate. |
| `wait_for_experiment(experiment_id, timeout_sec?, poll_interval_sec?)` | Block until `job_status` returns `done` / `failed`, or until `timeout_sec` elapses. Single tool call replaces the agent's `while True: status; sleep` loop — saves tool-call turns and avoids overshooting the poll interval. Returns the final status plus `waited_seconds`. |
Expand Down
182 changes: 181 additions & 1 deletion tools/mcp/modelopt_mcp/bridge.py
Original file line number Diff line number Diff line change
Expand Up @@ -419,19 +419,41 @@ def submit_job_impl(
job_name: str | None,
extra_overrides: dict[str, str] | None,
skip_verify: bool,
dry_run: bool = False,
) -> dict:
"""Submit a launcher YAML.

Mode is determined by mutually-exclusive args:
* ``hf_local`` set → Docker (local GPU)
* ``cluster_host`` set → Slurm (remote SSH)
* Neither set → error
* Neither set → error (unless ``dry_run=True``)
* Both set → error

When ``dry_run=True``, the launcher is invoked with ``--dryrun`` —
the YAML is parsed and validated but no cluster contact / no
container spawn / no sbatch happens. ``hf_local`` and
``cluster_host`` are optional in dry-run mode (pass one to validate
that the YAML's executor-specific config compiles for the intended
target; omit both to validate just the YAML shape). ``verify_setup``
is skipped automatically — there's nothing to talk to.

The actual orchestration is delegated to the launcher's
``core.run_jobs``. We don't re-implement nemo_run integration here —
that lives upstream.
"""
# ---- Dry-run branch (no cluster contact) -----------------------
if dry_run:
return _submit_job_dry_run(
yaml_path=yaml_path,
hf_local=hf_local,
cluster_host=cluster_host,
cluster_user=cluster_user,
identity=identity,
job_dir=job_dir,
job_name=job_name,
extra_overrides=extra_overrides,
)

# ---- Mode resolution -------------------------------------------
if hf_local and cluster_host:
return {
Expand Down Expand Up @@ -681,6 +703,164 @@ def submit_job_impl(
}


def _submit_job_dry_run(
*,
yaml_path: str,
hf_local: str | None,
cluster_host: str | None,
cluster_user: str | None,
identity: str | None,
job_dir: str | None,
job_name: str | None,
extra_overrides: dict[str, str] | None,
) -> dict:
"""Validate a launcher YAML by running ``launch.py --dryrun``.

No cluster contact, no container spawn, no sbatch. Used by
verify-task workflow stages (deployment_support,
hidden_state_dump_support, mlm_eval, ...) that just need to confirm
a YAML compiles before declaring support is ready.

Returns ``{ok, dry_run: True, validated: bool, diagnostic?: str,
exit_code: int|None, stdout_tail: str, stderr_tail: str,
argv: list[str]}``. Never returns ``experiment_id`` or ``pid`` —
there's nothing to track. ``diagnostic`` is present only on the
failure / timeout branches (the validated-success branch omits
it since there's nothing to diagnose).
Comment on lines +724 to +729

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] The docstring + server.py:221 Field description both advertise validated: bool and argv: list[str] as if always present in the return shape, but the pre-flight ok=False branches don't include them:

  • yaml_not_found (line 735): no validated, no argv, no exit_code/stdout_tail/stderr_tail
  • launcher_dir_not_found (line 773): same omissions
  • dry_run_timeout (line 813): has argv/exit_code/stdout_tail/stderr_tail but no validated

A defensively-coded caller using .get("validated") won't break, but an agent reading the schema description from the MCP tool definition will reasonably assume validated is always there and may write result["validated"]. Lowest-effort fix: mark them conditional in both doc sites (validated?: bool, argv?: list[str]) and add a one-liner noting that pre-flight failures (ok=False, reason ∈ {yaml_not_found, launcher_dir_not_found}) carry reason + diagnostic only. Higher-effort fix (CodeRabbit's suggestion): populate validated=False, argv=[] etc. on all error branches so the contract is uniform — slightly heavier, but eliminates the question entirely.

Non-blocking; just helpful for downstream agents.

"""
# Same path resolution as the live submit, so dry-run and live use
# exactly the same YAML.
abs_yaml = _normalize_yaml_path(yaml_path)
if not abs_yaml.exists():
return {
"ok": False,
"dry_run": True,
"reason": "yaml_not_found",
"yaml_path": yaml_path,
"resolved_path": str(abs_yaml),
"diagnostic": (
f"YAML not found at {abs_yaml}. Pass a path under "
f"tools/launcher/examples/ (relative), an absolute path, "
f"or one of the examples returned by list_examples."
),
}

# Build argv — launch.py supports --dryrun as a flag that prevents
# actual submission while still exercising the YAML loader, factory
# resolution, and arg parser. Same argv shape as live submit minus
# `--yes` pairs with `--dryrun` in every launcher CLI example (see
# `tools/launcher/CLAUDE.md:28` and `:93`, plus `tools/launcher/docs/
# contributing.md:24`). Without it, nemo_run's `run.cli.entrypoint`
# blocks on its confirmation prompt — and since we're capturing
# stdout (no TTY), the prompt would hang until the 60-second
# timeout fires.
argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dryrun", "--yes"]
if hf_local:
argv.append(f"hf_local={hf_local}")
if cluster_user:
argv.append(f"user={cluster_user}")
if identity:
argv.append(f"identity={identity}")
if job_dir:
argv.append(f"job_dir={job_dir}")
if job_name:
argv.append(f"job_name={job_name}")
for k, v in (extra_overrides or {}).items():
argv.append(f"{k}={v}")

launcher_dir = _THIS_DIR.parent.parent / "launcher"
if not launcher_dir.exists():
return {
"ok": False,
"dry_run": True,
"reason": "launcher_dir_not_found",
"diagnostic": (
f"Expected tools/launcher/ at {launcher_dir} but it "
f"doesn't exist. modelopt-mcp must be installed from a "
f"Model-Optimizer clone or via uvx-from-git."
),
}

# Propagate env so the launcher's factory resolution matches what
# the live submit would see (mainly: SLURM_HOST for slurm-factory
# default when cluster_host is set).
child_env = os.environ.copy()
child_env.setdefault("NEMORUN_HOME", os.getcwd())
if cluster_host:
child_env["SLURM_HOST"] = cluster_host

# Dry-run is fast (no network, no container) — 60s timeout is
# generous. Same subprocess invocation shape as the live-submit
# branch above (line 590): list-form argv, no shell, inherited
# env. ``argv`` members are string-literal constants
# ("uv", "run", "launch.py", "--yaml", "--dryrun"), validated
# filesystem paths (``str(abs_yaml)``, ``str(launcher_dir)``), or
# key=value override strings sourced from typed MCP-tool args.
# B603 false-positive matches the precedent in this module's
# `submit_job_impl` (Popen at line 563 + run at line 590), the
# verify probes (line 197 + 251), and the SSH probe (line 326).
try:
proc = subprocess.run( # nosec B603
argv,
cwd=str(launcher_dir),
env=child_env,
capture_output=True,
text=True,
timeout=60,
check=False,
Comment on lines +803 to +810

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the new inline # nosec bypass on subprocess invocation.

Line 788 introduces # nosec B603, which violates the repository security rule against Bandit bypass comments. Please remove the inline bypass and address the finding without suppression.

As per coding guidelines, “Bandit security checks must pass without exceptions. # nosec comments are not allowed as a bypass for security checks.”

🧰 Tools
🪛 ast-grep (0.43.0)

[error] 787-795: Command coming from incoming request
Context: subprocess.run( # nosec B603
argv,
cwd=str(launcher_dir),
env=child_env,
capture_output=True,
text=True,
timeout=60,
check=False,
)
Note: [CWE-20].

(subprocess-from-request)


[error] 787-795: Use of unsanitized data to create processes
Context: subprocess.run( # nosec B603
argv,
cwd=str(launcher_dir),
env=child_env,
capture_output=True,
text=True,
timeout=60,
check=False,
)
Note: [CWE-78].

(os-system-unsanitized-data)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 788 - 795, Remove the inline #
nosec B603 bypass comment from the subprocess.run call in the bridge.py file.
The subprocess invocation is already secure since it passes argv as a list
without shell=True enabled, which prevents shell injection vulnerabilities.
Simply delete the nosec bypass comment and the Bandit security check should pass
without any code changes to the subprocess.run call itself.

Source: Coding guidelines

)
except subprocess.TimeoutExpired as e:
return {
"ok": False,
"dry_run": True,
"reason": "dry_run_timeout",
"exit_code": None,
"stdout_tail": (e.stdout or b"").decode(errors="replace")[-2000:] if e.stdout else "",
"stderr_tail": (e.stderr or b"").decode(errors="replace")[-2000:] if e.stderr else "",
"diagnostic": (
"launch.py --dryrun did not return within 60 seconds. "
"This usually means a YAML import / factory resolution "
"hung."
),
"argv": argv,
}

stdout_tail = str(proc.stdout or "")[-2000:]
stderr_tail = str(proc.stderr or "")[-2000:]

if proc.returncode != 0:
return {
"ok": True, # The tool itself ran cleanly
"dry_run": True,
"validated": False, # ...but the YAML failed validation
"exit_code": proc.returncode,
"stdout_tail": stdout_tail,
"stderr_tail": stderr_tail,
"diagnostic": (
f"launch.py --dryrun rejected the YAML (exit code "
f"{proc.returncode}). Common reasons: invalid YAML "
f"syntax, missing required fields, factory function "
f"not registered, or a referenced file (HF model path, "
f"container tag) doesn't exist. See stderr_tail for the "
f"specific error."
),
"argv": argv,
}

# Success branch returns the same field set as the failure branch
# (plus diagnostic-free since there's nothing to diagnose) so the
# caller can read stderr_tail / exit_code uniformly.
return {
"ok": True,
"dry_run": True,
"validated": True,
"exit_code": 0,
"stdout_tail": stdout_tail,
"stderr_tail": stderr_tail,
"argv": argv,
}
Comment thread
coderabbitai[bot] marked this conversation as resolved.
Comment on lines +853 to +861

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Compatibility] Return-shape drift between success / failure / timeout branches.

The three terminal returns have inconsistent fields:

field success (L833) failure L815-831 timeout L798-809
stdout_tail
stderr_tail
exit_code
diagnostic

Callers parsing this response (the verify-task workflow stages mentioned in the PR description) have to guard every key with .get() or branch on validated/reason first. The stderr_tail asymmetry is the most likely to bite — a caller that expects stderr_tail for surfacing the error in a workflow step gets a KeyError on the success path.

The bridge docstring at lines 724–726 also describes a diagnostics: str field that none of the branches actually return (singular diagnostic only).

Fix: make all three branches return the same keys (filling missing ones with "" or None):

def _dry_run_response(*, ok, validated, exit_code, stdout_tail, stderr_tail, argv, reason=None, diagnostic=None):
    out = {"ok": ok, "dry_run": True, "validated": validated,
           "exit_code": exit_code, "stdout_tail": stdout_tail,
           "stderr_tail": stderr_tail, "argv": argv}
    if reason is not None:
        out["reason"] = reason
    if diagnostic is not None:
        out["diagnostic"] = diagnostic
    return out

…and update the docstring at L724-726 to match (drop diagnostics, the actual key is diagnostic).



# ---------------------------------------------------------------------------
# job_status / job_logs — filesystem-based
# ---------------------------------------------------------------------------
Expand Down
23 changes: 23 additions & 0 deletions tools/mcp/modelopt_mcp/server.py
Original file line number Diff line number Diff line change
Expand Up @@ -207,6 +207,28 @@ def submit_job(
)
),
] = False,
dry_run: Annotated[
bool,
Field(
description=(
"If True, run `launch.py --dryrun --yes` to validate that "
"the YAML parses, the factory resolves, and any "
"referenced files exist — without contacting the "
"cluster, spawning a container, or running sbatch. "
"Used by verify-task workflow stages (deployment_support, "
"hidden_state_dump_support, mlm_eval, ...) that only "
"need to confirm a YAML compiles. Returns "
"`{ok, dry_run: True, validated: bool, diagnostic?: str, "
"exit_code: int|None, stdout_tail: str, stderr_tail: str, "
"argv: list[str]}` with no `experiment_id`. Skips "
Comment on lines +221 to +223

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Documented return shape doesn't match bridge.py implementation for all dry-run branches.

The Field description states the dry-run return will have validated: bool (required), exit_code: int|None, stdout_tail: str, stderr_tail: str, and argv: list[str]. However, cross-referencing with bridge.py's _submit_job_dry_run implementation (context snippet 2) shows three of five return paths omit some of these fields:

  1. yaml_not_found (bridge.py:727-737): missing validated, exit_code, stdout_tail, stderr_tail, argv
  2. launcher_dir_not_found (bridge.py:775-784): missing validated, exit_code, stdout_tail, stderr_tail, argv
  3. dry_run_timeout (bridge.py:803-816): missing validated

Client code expecting result["validated"] on all dry-run returns would raise KeyError on these branches. The implementation also returns an undocumented reason field in error cases.

Either:

  • Fix the bridge.py implementation to always include the documented fields (e.g., validated=False for error cases, empty strings for missing stdout/stderr, empty list for argv), or
  • Update this Field description to clarify that the documented shape applies only to the success/launcher-rejection cases and that error cases return a different subset
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/server.py` around lines 221 - 223, The Field
description in server.py documents a specific return shape for dry-run responses
with required fields validated, exit_code, stdout_tail, stderr_tail, and argv,
but the _submit_job_dry_run function in bridge.py returns different subsets of
fields across different error branches (yaml_not_found, launcher_dir_not_found,
and dry_run_timeout paths omit these fields). Fix this inconsistency by either:
(1) updating the _submit_job_dry_run implementation in bridge.py to ensure all
three error paths return the complete documented set of fields with sensible
defaults (e.g., validated=False, empty strings for stdout_tail and stderr_tail,
empty list for argv, and appropriate exit_code values), or (2) update the Field
description in server.py to document the actual return shapes for each case and
clarify which fields are conditionally present. Option 1 (fixing the
implementation) is preferred as it maintains a consistent API contract.

"verify_setup automatically — "
"no cluster contact happens in dry-run. `hf_local` / "
"`cluster_host` are optional in this mode (pass one to "
"validate executor-specific config, omit both to validate "
"just the YAML shape)."
)
),
] = False,
) -> dict:
return bridge.submit_job_impl(
yaml_path=yaml_path,
Expand All @@ -218,6 +240,7 @@ def submit_job(
job_name=job_name,
extra_overrides=extra_overrides,
skip_verify=skip_verify,
dry_run=dry_run,
)

@mcp.tool(
Expand Down
Loading
Loading