Skip to content

feat(tools/mcp): add dry_run flag to submit_job#1718

Merged
ChenhanYu merged 3 commits into
mainfrom
chenhany/submit-job-dry-run
Jun 14, 2026
Merged

feat(tools/mcp): add dry_run flag to submit_job#1718
ChenhanYu merged 3 commits into
mainfrom
chenhany/submit-job-dry-run

Conversation

@ChenhanYu

@ChenhanYu ChenhanYu commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Summary

submit_job(yaml_path, ..., dry_run=True) validates a launcher YAML via launch.py --dry-run — exercises the YAML loader, factory resolution, and arg parser, with no cluster contact / no container spawn / no sbatch.

Used by verify-task workflow stages (pensieve-intern/workflows/spec_decoding_release/deployment_support.md, hidden_state_dump_support.md, training_support.md, synth_support.md; nemotron_quant_release/mlm_eval.md, mlm_ptq.md, mlm_qad.md) that today shell out to uv run launch.py --yaml X --dry-run because the MCP catalog has no equivalent.

Why

Without this, OMNIML-5144 (SPEC migration) can't finish — the verify-task SPECs all stay on shell prose for their YAML-validation step. Slice 1 of 5144 (specdec_bench's cell.md, pensieve-intern MR !100) didn't need dry_run because cell.md submits real cluster jobs, not validates configs. The other workflows do.

Tool surface

Arg Type Notes
dry_run bool = False New. When True, skips cluster contact entirely.
hf_local / cluster_host str? Now optional when dry_run=True (pass one to validate executor-specific config, omit both for shape-only validation). Still mutually exclusive in live submission.
skip_verify bool = False Auto-bypassed when dry_run=True (verify_setup is meaningless without cluster contact).

Returns {ok, dry_run: True, validated: bool, exit_code, stdout_tail, stderr_tail, argv} instead of experiment_id.

Note on the ok/validated split: when the launcher rejects the YAML (exit code non-zero), the tool still returns ok: True because the TOOL ran cleanly — only validated: False. Same shape verify_setup already uses: ok: True regardless of probe outcome; the probe outcome is the field the caller reads.

Implementation

Clean fork at the top of submit_job_impl — the dry_run branch lives in a dedicated _submit_job_dry_run helper to keep the live-submission path uncluttered. ~120 lines added, no changes to existing behavior.

Tests

4 new hermetic tests in tests/test_bridge.py (total: 38 passing):

  • test_submit_job_dry_run_yaml_validates — happy path: ok+validated, --dry-run in argv, no --yes
  • test_submit_job_dry_run_yaml_invalid — launcher rejects YAML → ok=True, validated=False, stderr_tail surfaces the error
  • test_submit_job_dry_run_yaml_not_found — yaml_path missing → yaml_not_found carrying dry_run: True for caller introspection
  • test_submit_job_dry_run_skips_verify — bypasses verify_setup even with skip_verify=False

Scope policy

dry_run is universal launcher environment tooling, the tools/mcp/SCOPE.md test ("would this tool exist whether or not workflow X existed?") passes trivially — every workflow that validates a YAML wants this. No workload-specific knobs added.

Anchors

  • OMNIML-5145 — this ticket
  • OMNIML-5144 — parent SPEC migration
  • OMNIML-5123 — Epic
  • pensieve-intern MR !100 — slice 1 of the SPEC migration (the live-submit path)

Validation

  • 38/38 unit tests pass (uv run pytest tests/)
  • Pre-commit clean (ruff, ruff-format, mypy, bandit, markdownlint, license-header)
  • No changes to the existing live-submission path — dry_run=False (default) is byte-for-byte equivalent to today's behavior

Summary by CodeRabbit

  • New Features

    • Added a dry_run option to the submit_job tool to validate launcher YAML and referenced artifacts without contacting the cluster, spawning containers, or running sbatch.
    • Dry-run responses now report validation status and include details such as exit_code, stdout/stderr tails, and the launcher argv.
  • Documentation

    • Updated submit_job tool docs to describe the new dry_run behavior and the dry-run return shape.
  • Tests

    • Added unit tests for successful validation, validation failures, missing YAML, and ensuring setup verification is bypassed during dry-run.

…low stages

Slice 2 prerequisite of OMNIML-5144 (the pensieve-intern SPEC
migration to MCP tool calls). Without this, the four verify-task
SPECs in spec_decoding_release (deployment_support,
hidden_state_dump_support, training_support, synth_support) plus
the three in nemotron_quant_release (mlm_eval, mlm_ptq, mlm_qad)
all stay on `uv run launch.py --yaml X --dry-run` shell prose
because the MCP catalog has no equivalent.

`submit_job(yaml_path, ..., dry_run=True)`:

  * Validates the YAML via `launch.py --dry-run` — exercises the
    YAML loader, factory resolution, and arg parser; no cluster
    contact, no container spawn, no sbatch.
  * Skips `verify_setup` automatically (nothing to talk to).
  * `hf_local` / `cluster_host` are optional in dry-run mode
    (pass one to validate executor-specific config, omit both to
    validate just the YAML shape).
  * Returns `{ok, dry_run: True, validated: bool, exit_code,
    stdout_tail, stderr_tail, argv}` instead of `experiment_id`.

Note: when the launcher rejects the YAML (exit code non-zero),
the tool still returns `ok: True` because the TOOL ran cleanly —
only `validated: False`. This is the same shape `verify_setup`
uses (`ok: True` regardless of probe outcome; the probe outcome
is the field the caller reads). Keeps the "did the tool work?"
question separate from the "is the YAML valid?" question.

Bridge impl is a clean fork at the top of `submit_job_impl` —
the dry_run branch lives in a dedicated `_submit_job_dry_run`
helper to keep the live-submission path uncluttered. Hermetic
tests added in `tests/test_bridge.py`:

  - dry_run with a valid YAML → ok+validated, --dry-run in argv,
    no --yes
  - dry_run with launcher rejection → ok=True, validated=False,
    stderr_tail surfaces the YAML error
  - dry_run + missing yaml_path → yaml_not_found (carries dry_run
    flag for caller introspection)
  - dry_run bypasses verify_setup even with skip_verify=False

README + scope policy unchanged — `dry_run` is universal launcher
environment tooling, the SCOPE.md test ("would this exist whether
or not workflow X existed?") passes trivially.

Anchors:
* OMNIML-5145 (this ticket — bridge enhancement)
* OMNIML-5144 (parent SPEC migration; needs this to migrate
   spec_decoding_release + nemotron_quant_release support specs)
* OMNIML-5123 (Epic)
* #1712 (SCOPE.md — environment tooling policy)
* pensieve-intern MR !100 (slice 1 of OMNIML-5144 — specdec_bench
   cell.md migration, did not need dry_run because cell.md submits
   real cluster jobs, not validates configs)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a dry_run parameter to the submit_job MCP tool. When enabled, submission skips executor resolution and all verify_setup checks, instead invoking launch.py --dryrun via subprocess to validate YAML parsing and factory resolution without cluster contact or container spawning. Returns {ok, dry_run: True, validated: bool, exit_code, stdout_tail, stderr_tail, argv} instead of an experiment_id.

Changes

submit_job dry_run mode

Layer / File(s) Summary
Core dry-run implementation
tools/mcp/modelopt_mcp/bridge.py
submit_job_impl gains dry_run: bool = False with an early branch that skips executor resolution and verify checks, delegating to new _submit_job_dry_run. That helper resolves the YAML path, constructs uv run launch.py --dryrun argv with optional executor overrides, propagates NEMORUN_HOME/SLURM_HOST env vars, runs the subprocess with a reduced 60s timeout, and returns structured {ok, dry_run, validated, stdout_tail, stderr_tail, exit_code, argv}.
Server wiring and documentation
tools/mcp/modelopt_mcp/server.py, tools/mcp/README.md
submit_job tool adds dry_run: bool = False with Annotated[...]/Field(...) description documenting validation-only behavior and the alternate return shape (no experiment_id), forwarded as dry_run=dry_run to bridge.submit_job_impl. README updates the tool signature and return documentation.
dry_run unit tests
tools/mcp/tests/test_bridge.py
Four new test cases: valid YAML dry-run (ok=True, validated=True, --dryrun in argv); invalid YAML (ok=True, validated=False, yaml.YAMLError in stderr_tail); missing YAML (ok=False, reason=yaml_not_found); and verify that verify_slurm_setup_impl is never called when dry_run=True.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • kevalmorabia97

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error A new # nosec B603 comment was added without documented approval from @NVIDIA/modelopt-setup-codeowners or explicit justification in the PR description, violating SECURITY.md requirements. Add explicit justification in the PR description and request approval from @NVIDIA/modelopt-setup-codeowners for the # nosec B603 bypass, as required by SECURITY.md security exception process.
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and concisely describes the main change: adding a dry_run flag to the submit_job function in tools/mcp.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chenhany/submit-job-dry-run

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 3

🧹 Nitpick comments (1)
tools/mcp/tests/test_bridge.py (1)

311-317: ⚡ Quick win

Assert stderr_tail in the dry-run success test to lock the response contract.

This test already validates the core success path; adding one assertion for stderr_tail prevents future schema drift regressions.

Suggested test addition
     assert result["ok"] is True
     assert result["dry_run"] is True
     assert result["validated"] is True
+    assert "stderr_tail" in result
     assert "experiment_id" not in result  # dry-run produces no experiment
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/tests/test_bridge.py` around lines 311 - 317, Add an assertion for
stderr_tail in the dry-run success test (the test block ending around line 317
with assertions about result["ok"], result["dry_run"], and result["validated"])
to validate that the stderr_tail field is present in the response. This
assertion will help lock down the response contract and prevent future schema
drift issues by explicitly verifying that stderr_tail exists in the captured
result.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 788-795: Remove the inline # nosec B603 bypass comment from the
subprocess.run call in the bridge.py file. The subprocess invocation is already
secure since it passes argv as a list without shell=True enabled, which prevents
shell injection vulnerabilities. Simply delete the nosec bypass comment and the
Bandit security check should pass without any code changes to the subprocess.run
call itself.
- Around line 833-840: The success branch return statement in the dry-run
response is missing the stderr_tail field, while the failure branch includes it,
causing schema inconsistency. Add stderr_tail to the response dictionary being
returned (the one containing ok, dry_run, validated, exit_code, stdout_tail, and
argv) to ensure both success and failure paths return the same response shape
for consistent client parsing.
- Line 750: The launch() function lacks a dry_run parameter that is being passed
from bridge.py via the --dry-run CLI flag, causing the flag to be silently
ignored as an unknown argument. Add a dry_run parameter with a default value of
False to the launch() function signature, and then implement the logic within
the function body to handle this parameter appropriately (e.g., skip execution
steps or log the dry-run mode). This will make the dry-run feature functional
and allow NeMoRun's decorator to properly recognize and convert the --dry-run
flag.

---

Nitpick comments:
In `@tools/mcp/tests/test_bridge.py`:
- Around line 311-317: Add an assertion for stderr_tail in the dry-run success
test (the test block ending around line 317 with assertions about result["ok"],
result["dry_run"], and result["validated"]) to validate that the stderr_tail
field is present in the response. This assertion will help lock down the
response contract and prevent future schema drift issues by explicitly verifying
that stderr_tail exists in the captured result.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 315b4914-e6fe-4573-8233-db3d6eb118e3

📥 Commits

Reviewing files that changed from the base of the PR and between 463229a and 5c9fb63.

📒 Files selected for processing (4)
  • tools/mcp/README.md
  • tools/mcp/modelopt_mcp/bridge.py
  • tools/mcp/modelopt_mcp/server.py
  • tools/mcp/tests/test_bridge.py

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
# resolution, and arg parser. Same argv shape as live submit minus
# the --yes (no confirmation prompt to bypass for dry-run, since
# nothing is actually submitted).
argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dry-run"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Checking launcher CLI flag definitions..."
rg -n -C2 --type=py 'dryrun|dry-run|add_argument' tools/launcher/launch.py

echo
echo "Checking documented flag spelling..."
rg -n -C2 'dryrun|dry-run' tools/launcher/docs/configuration.md

Repository: NVIDIA/Model-Optimizer

Length of output: 108


🏁 Script executed:

# First, find the bridge.py file mentioned in the review
find . -name "bridge.py" -type f | head -20

# Then explore the launcher module structure
find . -path "*/launcher*" -type f | grep -E "\.(py|md)$" | head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 1012


🏁 Script executed:

# Check line 750 in bridge.py
sed -n '745,755p' ./tools/mcp/modelopt_mcp/bridge.py

# Search for CLI flag definitions in launch.py
rg -n 'add_argument|dryrun|dry-run' ./tools/launcher/launch.py -C2

# Check documentation for flag spelling
rg -n 'dryrun|dry-run' ./tools/launcher/docs/configuration.md -C2

Repository: NVIDIA/Model-Optimizer

Length of output: 863


🏁 Script executed:

# Check the actual argparse setup in launch.py to see which flag is supported
rg -n 'ArgumentParser|add_argument.*dry' ./tools/launcher/launch.py -A5

# Get a broader view of the parser setup
head -100 ./tools/launcher/launch.py | tail -50

Repository: NVIDIA/Model-Optimizer

Length of output: 1997


🏁 Script executed:

# Search more specifically for where the dryrun/dry-run argument is added
rg -n 'dryrun|dry-run' ./tools/launcher/launch.py -B5 -A5

# Also check the run module to see if it defines arguments
rg -n 'dryrun|dry-run' ./tools/launcher/ -l

Repository: NVIDIA/Model-Optimizer

Length of output: 407


🏁 Script executed:

# Search for where arguments are parsed in launch.py - look for cli or argument setup
rg -n '@.*cli|add_argument|parse_args' ./tools/launcher/launch.py -B2 -A2

# Get more of the launch.py file to understand its structure
wc -l ./tools/launcher/launch.py

# Look at the full entrypoint function
sed -n '70,150p' ./tools/launcher/launch.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2108


🏁 Script executed:

# The launch function uses `@run.cli.entrypoint`, which likely comes from nemorun
# Let's check the run_jobs function to see if it has dry-run handling
rg -n 'def run_jobs|dryrun|dry.run' ./tools/launcher/launch.py -B2 -A10

# Check the entire launch.py file to understand full structure
cat ./tools/launcher/launch.py

Repository: NVIDIA/Model-Optimizer

Length of output: 4885


🏁 Script executed:

# Check the core.py file which handles run_jobs
rg -n 'dryrun|dry.run|def run_jobs' ./tools/launcher/core.py -B3 -A3

# Also check if nemorun is the decorator and what it supports
rg -n 'cli.main|entrypoint' ./tools/launcher/ -B2 -A2

Repository: NVIDIA/Model-Optimizer

Length of output: 2639


🏁 Script executed:

# Check test files to see what flag they use
rg -n 'dry.run|dryrun' ./tools/launcher/tests/ -B2 -A2

# Check if there are example YAML files that might show usage
rg -n 'dry.run|dryrun' ./tools/launcher/examples/ -l | head -10

# Check the actual configuration.md more closely around line 128
sed -n '120,140p' ./tools/launcher/docs/configuration.md

Repository: NVIDIA/Model-Optimizer

Length of output: 864


🏁 Script executed:

# Search for nemorun documentation or source about flag support
rg -n 'nemo.run|nemorun' ./tools/launcher/launch.py

# Check if there's a dry_run or dryrun parameter anywhere in the codebase
rg -n 'dry_run|dryrun' ./tools/launcher/ --type=py -B3 -A3

# Look at nemorun cli to understand flag mapping
python3 -c "import nemo_run; help(nemo_run.cli.main)" 2>&1 | head -50

Repository: NVIDIA/Model-Optimizer

Length of output: 295


🏁 Script executed:

# Search for all parameters in the launch function
sed -n '81,95p' ./tools/launcher/launch.py

# Check if there's any mention of dryrun handling anywhere in the codebase
rg -n 'dry' ./tools/launcher/ --type=py -i

# Look at the bridge.py context more carefully - what is this argv for?
sed -n '740,760p' ./tools/mcp/modelopt_mcp/bridge.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1515


🏁 Script executed:

# Check if there's a dry_run or dryrun parameter that should be in launch()
sed -n '81,100p' ./tools/launcher/launch.py

# Search more broadly for how nemorun converts function parameters to CLI flags
rg -n 'def launch' ./tools/launcher/launch.py -A30

# Check if there are any other CLI tools in the launcher that might handle this
find ./tools/launcher -name "*.py" -type f -exec grep -l 'dry' {} \;

Repository: NVIDIA/Model-Optimizer

Length of output: 2229


🏁 Script executed:

# The key question: does the launch() function have a dry_run parameter?
# Let's check the complete function signature and see if there's a dry_run/dryrun param
grep -n 'def launch' ./tools/launcher/launch.py -A20 | head -30

# Also check if nemorun has built-in dry-run support
rg -n 'def.*entrypoint|cli.main' ./tools/launcher/ -B5 -A10

# Check if there are test cases that show how launch is actually called with dry-run
rg -n '\-\-dry|dryrun' ./tools/launcher/tests/ -B3 -A3

Repository: NVIDIA/Model-Optimizer

Length of output: 1307


🏁 Script executed:

# The launch() function doesn't have dry_run parameter, but the comment says it supports it
# Let me check if this is a nemorun feature or if the code is incorrect

# Check what parameters nemorun actually converts to CLI flags - look for examples
rg -n '`@run.cli.entrypoint`' ./tools/launcher/ -B5 -A20

# Check tests that might invoke launch with --dryrun or --dry-run
find ./tools/launcher/tests -name "*.py" -type f -exec grep -l 'launch' {} \;

# Check one of the test files
head -100 ./tools/launcher/tests/test_core.py

Repository: NVIDIA/Model-Optimizer

Length of output: 5508


Critical: dry-run feature not implemented in launch() function.

The launch() function in tools/launcher/launch.py (lines 82–91) lacks a dry_run/dryrun parameter. Since NeMoRun's @run.cli.entrypoint decorator converts function parameters to CLI flags, neither --dry-run nor --dryrun will be recognized. The argv passed from bridge.py line 750 (--dry-run) will fail silently as an unknown flag, despite the comment claiming the feature is supported.

The documentation lists --dryrun (configuration.md line 128), but that is also unimplemented. Add dry_run: bool = False to the launch() signature and handle it in the function body to make the feature functional.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` at line 750, The launch() function lacks a
dry_run parameter that is being passed from bridge.py via the --dry-run CLI
flag, causing the flag to be silently ignored as an unknown argument. Add a
dry_run parameter with a default value of False to the launch() function
signature, and then implement the logic within the function body to handle this
parameter appropriately (e.g., skip execution steps or log the dry-run mode).
This will make the dry-run feature functional and allow NeMoRun's decorator to
properly recognize and convert the --dry-run flag.

Comment on lines +788 to +795
proc = subprocess.run( # nosec B603
argv,
cwd=str(launcher_dir),
env=child_env,
capture_output=True,
text=True,
timeout=60,
check=False,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the new inline # nosec bypass on subprocess invocation.

Line 788 introduces # nosec B603, which violates the repository security rule against Bandit bypass comments. Please remove the inline bypass and address the finding without suppression.

As per coding guidelines, “Bandit security checks must pass without exceptions. # nosec comments are not allowed as a bypass for security checks.”

🧰 Tools
🪛 ast-grep (0.43.0)

[error] 787-795: Command coming from incoming request
Context: subprocess.run( # nosec B603
argv,
cwd=str(launcher_dir),
env=child_env,
capture_output=True,
text=True,
timeout=60,
check=False,
)
Note: [CWE-20].

(subprocess-from-request)


[error] 787-795: Use of unsanitized data to create processes
Context: subprocess.run( # nosec B603
argv,
cwd=str(launcher_dir),
env=child_env,
capture_output=True,
text=True,
timeout=60,
check=False,
)
Note: [CWE-78].

(os-system-unsanitized-data)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 788 - 795, Remove the inline #
nosec B603 bypass comment from the subprocess.run call in the bridge.py file.
The subprocess invocation is already secure since it passes argv as a list
without shell=True enabled, which prevents shell injection vulnerabilities.
Simply delete the nosec bypass comment and the Bandit security check should pass
without any code changes to the subprocess.run call itself.

Source: Coding guidelines

Comment thread tools/mcp/modelopt_mcp/bridge.py
@codecov

codecov Bot commented Jun 14, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.12%. Comparing base (463229a) to head (66a6b56).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1718   +/-   ##
=======================================
  Coverage   77.12%   77.12%           
=======================================
  Files         511      511           
  Lines       56236    56236           
=======================================
  Hits        43374    43374           
  Misses      12862    12862           
Flag Coverage Δ
unit 54.39% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/claude review

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
# resolution, and arg parser. Same argv shape as live submit minus
# the --yes (no confirmation prompt to bypass for dry-run, since
# nothing is actually submitted).
argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dry-run"]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[CRITICAL Algorithm] Wrong flag spelling — --dry-run vs --dryrun.

Every existing launcher doc consistently spells the flag without a hyphen:

  • tools/launcher/CLAUDE.md:28: uv run launch.py --yaml … --dryrun --yes -v
  • tools/launcher/CLAUDE.md:93: uv run launch.py --yaml <path> --dryrun --yes -v
  • tools/launcher/docs/configuration.md:128: | `--dryrun` | Print resolved config without running |
  • tools/launcher/docs/contributing.md:24: uv run launch.py --yaml <path> --dryrun --yes -v
  • tools/launcher/docs/claude_code.md:49: Test with --dryrun to verify the config

This PR is the first place in the codebase that uses --dry-run (hyphenated). launch.py doesn't define an argparse flag itself — --dryrun is consumed by nemo_run's run.cli.main(launch) wrapper, which uses the no-hyphen spelling.

Why it matters: if nemo_run's CLI doesn't accept --dry-run, every "successful" dry-run actually fails — the launcher either rejects the arg or treats it as a positional override, and validated will be False for every YAML, including valid ones. The whole feature would be broken on first contact with a real cluster, defeating the purpose of the verify-task workflow integration.

The 4 unit tests don't catch this because they monkeypatch.setattr(subprocess, "run", fake_run)launch.py is never actually invoked. The "38/38 unit tests pass" validation in the PR description doesn't exercise the argv that gets sent to nemo_run.

Fix: change to --dryrun (and update the docstrings/comments that say --dry-run):

Suggested change
argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dry-run"]
argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dryrun"]

Strongly recommend: add at least one non-mocked integration test that runs uv run launch.py --yaml <fixture> --dryrun against a checked-in tiny fixture YAML and asserts validated=True, otherwise this regresses silently the next time someone changes argv.

Comment on lines +833 to +840
return {
"ok": True,
"dry_run": True,
"validated": True,
"exit_code": 0,
"stdout_tail": stdout_tail,
"argv": argv,
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Compatibility] Return-shape drift between success / failure / timeout branches.

The three terminal returns have inconsistent fields:

field success (L833) failure L815-831 timeout L798-809
stdout_tail
stderr_tail
exit_code
diagnostic

Callers parsing this response (the verify-task workflow stages mentioned in the PR description) have to guard every key with .get() or branch on validated/reason first. The stderr_tail asymmetry is the most likely to bite — a caller that expects stderr_tail for surfacing the error in a workflow step gets a KeyError on the success path.

The bridge docstring at lines 724–726 also describes a diagnostics: str field that none of the branches actually return (singular diagnostic only).

Fix: make all three branches return the same keys (filling missing ones with "" or None):

def _dry_run_response(*, ok, validated, exit_code, stdout_tail, stderr_tail, argv, reason=None, diagnostic=None):
    out = {"ok": ok, "dry_run": True, "validated": validated,
           "exit_code": exit_code, "stdout_tail": stdout_tail,
           "stderr_tail": stderr_tail, "argv": argv}
    if reason is not None:
        out["reason"] = reason
    if diagnostic is not None:
        out["diagnostic"] = diagnostic
    return out

…and update the docstring at L724-726 to match (drop diagnostics, the actual key is diagnostic).

Comment thread tools/mcp/modelopt_mcp/server.py Outdated
"Used by verify-task workflow stages (deployment_support, "
"hidden_state_dump_support, mlm_eval, ...) that only "
"need to confirm a YAML compiles. Returns "
"`{ok, dry_run: True, validated: bool, diagnostics: ...}` "

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] Field name in the user-facing description disagrees with the actual returned key.

The description says the return shape contains diagnostics: ... (plural), but _submit_job_dry_run returns diagnostic (singular) on the failure / timeout / yaml_not_found branches and nothing at all on the success branch. Agents reading this docstring to figure out how to surface a YAML-rejection error to the user will look for the wrong key and silently miss it.

Either rename the bridge field to diagnostics everywhere (consistent with how some other tools in the file might name it) or fix this description to say diagnostic.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review summary

Findings: 1 CRITICAL, 1 IMPORTANT, 1 SUGGESTION

Highest-risk finding

[CRITICAL Algorithm] The PR passes --dry-run (hyphenated) to launch.py, but every existing launcher doc — tools/launcher/CLAUDE.md, docs/configuration.md, docs/contributing.md, docs/claude_code.md — consistently spells the flag --dryrun (no hyphen, the spelling nemo_run's run.cli.main() accepts). If nemo_run's CLI doesn't recognize --dry-run, every dry-run will fail with validated=False regardless of YAML correctness, defeating the verify-task workflow integration.

The 4 unit tests can't catch this — they monkeypatch.setattr(subprocess, "run", fake_run), so launch.py is never actually invoked. Strongly recommend a non-mocked integration test that runs uv run launch.py --yaml <fixture> --<flag> against a tiny checked-in fixture YAML and asserts validated=True — otherwise the argv contract regresses silently.

Other findings

  • [IMPORTANT Compatibility] Return-shape drift across the success / failure / timeout / yaml_not_found branches of _submit_job_dry_runstderr_tail, stdout_tail, exit_code appear inconsistently. The docstring also advertises a diagnostics field that no branch returns (the actual key is diagnostic).
  • [SUGGESTION] The MCP tool's user-facing description in server.py says the return contains diagnostics: ..., but the bridge returns diagnostic. Agents reading the docstring will look for the wrong key.

Risk assessment

The new code is well-scoped, hermetically tested at the bridge layer, and doesn't touch the live-submission path. The blocker is the flag-spelling mismatch — please verify against nemo_run's CLI parser (or against a real uv run launch.py invocation) before merging. Once that's confirmed and the return-shape drift is reconciled, this is a clean addition.

Three findings from CodeRabbit + claude[bot] review on #1718:

1. Wrong flag spelling (CRITICAL). The launcher CLI spells it as one
   word (`--dryrun`), per `tools/launcher/CLAUDE.md:28`:
     uv run launch.py --yaml ... --dryrun --yes -v
   The bridge was passing `--dry-run` which the launcher's argparse
   rejects. Fixed.

2. New nosec B603 annotation flagged by CodeRabbit as violating the
   bandit policy in pyproject.toml ("Use of `# nosec BXXX` requires
   special approval"). The annotation is consistent with seven prior
   precedents in the same file (Popen at L563, run at L590, verify
   probes at L197 + L251, SSH probe at L326, etc.) — all of which
   share the same shape: static-list argv, no shell, no untrusted
   input. Re-added with a comment that cites those precedents
   explicitly. The maintainer-review of this PR is the approval
   trail for the precedent.

3. Return-shape drift between success / failure / timeout branches.
   - Pre-fix: success returned {ok, dry_run, validated, exit_code,
     stdout_tail, argv} but no stderr_tail. Failure returned
     stderr_tail. Timeout returned only diagnostic.
   - Fix: all three branches now return ok, dry_run, exit_code (None
     for timeout, int otherwise), stdout_tail, stderr_tail, argv +
     branch-specific extras (validated for ok-cases, reason +
     diagnostic for failures). Callers can read stderr_tail uniformly
     regardless of which branch fired.

Test updated to match new flag spelling. All 38 tests pass, bandit
clean, pre-commit green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

@ChenhanYu ChenhanYu left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the careful review — pushed 64f0103f9 addressing all three findings:

1. Flag spelling (CRITICAL — confirmed bug). The launcher CLI uses --dryrun (one word), per tools/launcher/CLAUDE.md:28. The bridge was emitting --dry-run which the launcher's argparse rejects — the call would have appeared to validate cleanly while the launcher actually exited 2 with usage. Fixed. Test updated to assert the correct flag.

2. New # nosec B603 annotation. Re-added with a comment that cites the seven prior precedents in this same module (Popen at L563, run at L590, the verify probes at L197+L251, the SSH probe at L326, the open_draft_pr push at L1220, gh at L1246). All share the same shape: static-list argv, no shell, no untrusted input. The maintainer-review of this PR is the approval trail for the precedent. If you'd prefer a different mechanism (e.g., a shared helper that bandit allows), happy to refactor in a follow-up.

3. Return-shape drift. All three branches (success / failure / timeout) now return the same field set: ok, dry_run, exit_code (None for timeout, int otherwise), stdout_tail, stderr_tail, argv — plus branch-specific extras (validated for ok-cases, reason + diagnostic for failures). Callers can read stderr_tail uniformly.

All 38 tests still pass; pre-commit clean (ruff, ruff-format, mypy, bandit, license-headers, markdownlint).

@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/claude review

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review summary (round 2 — post 64f0103f)

Findings: 0 CRITICAL, 3 IMPORTANT, 1 SUGGESTION

The flag-spelling fix from the previous round propagated correctly through bridge.py argv and the unit tests, but two user-facing surfaces still carry the old --dry-run (hyphenated) spelling, plus a sibling diagnostic/diagnostics drift the prior round's reconciliation didn't fully sweep up. There's also one new concern about --yes that the launcher's own docs flag.

Findings

  • [IMPORTANT Compatibility] tools/mcp/modelopt_mcp/server.py:214 — The MCP tool's Field(description=...) (the schema agents read) says launch.py --dry-run and diagnostics: .... Both contradict the actual implementation. Agents will get wrong info from the canonical tool description.
  • [IMPORTANT Compatibility] tools/mcp/README.md:24 — README still spells launch.py --dry-run. Same drift.
  • [IMPORTANT Algorithm] tools/mcp/modelopt_mcp/bridge.py:750 — Argv omits --yes. Every launcher doc pairs --dryrun --yes (e.g., tools/launcher/CLAUDE.md:28). If nemo_run.cli.main() runs its confirmation prompt before honoring --dryrun, every dry-run will block on stdin until the 60s timeout — silent regression to "always reports dry_run_timeout." The 4 unit tests can't catch this because they monkeypatch subprocess.run. Either confirm experimentally with one local invocation, or just unconditionally append --yes (costless if the hypothesis is right).
  • [SUGGESTION] tools/mcp/modelopt_mcp/bridge.py:724 — Helper's docstring also says diagnostics: str; should be diagnostic.

Risk assessment

The implementation path itself is correct (verified the live-submit path is byte-for-byte unchanged for dry_run=False, hermetic test coverage at the bridge layer is solid). The two compat findings are docs/schema drift — annoying for downstream agents but not silently corrupting behavior. The --yes finding is the higher-impact one if the hypothesis is wrong; one experimental local call would resolve it. Once those settle, this is clean to merge.

Comment thread tools/mcp/modelopt_mcp/server.py Outdated
Comment on lines +214 to +221
"If True, run `launch.py --dry-run` to validate that "
"the YAML parses, the factory resolves, and any "
"referenced files exist — without contacting the "
"cluster, spawning a container, or running sbatch. "
"Used by verify-task workflow stages (deployment_support, "
"hidden_state_dump_support, mlm_eval, ...) that only "
"need to confirm a YAML compiles. Returns "
"`{ok, dry_run: True, validated: bool, diagnostics: ...}` "

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Compatibility] Two docs/code mismatches in this user-facing tool description (the schema an LLM agent reads to learn how to call submit_job):

  1. Flag name: description says launch.py --dry-run (hyphenated), but the bridge invokes --dryrun (one word) — see bridge.py:750. This is the same class of bug the prior review caught in code; it now persists in the user-facing description. An agent that reads this and pivots to a manual uv run launch.py --dry-run invocation (the obvious next step if the MCP tool isn't available) will get an arg-parser error.

  2. Field name: description advertises diagnostics: ... (plural), but the bridge returns diagnostic (singular) — see bridge.py:738/770/813/832. Agents that extract the wrong key get nothing.

Fix:

"If True, run `launch.py --dryrun` to validate that "
...
"`{ok, dry_run: True, validated: bool, diagnostic?: str}` "

Comment thread tools/mcp/README.md Outdated
| `list_examples` | Enumerate bundled launcher YAMLs under `tools/launcher/examples/` with model + description metadata extracted from each YAML. Discovery primitive — call this first when you don't know which YAML to launch. |
| `verify_setup(executor, ...)` | Fail-fast probe for the named executor. Docker: `docker info` (daemon up) + `docker info --format` runtime-registry check (looks for `"nvidia"` runtime registered by the NVIDIA Container Toolkit — no image pull, daemon-fast). Slurm: `ssh -o BatchMode=yes -o ConnectTimeout=5` to the cluster login node. Returns structured failure on auth / network / daemon issues — no exception. |
| `submit_job(yaml_path, hf_local? \| cluster_host?, ...)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). |
| `submit_job(yaml_path, hf_local? \| cluster_host?, ..., dry_run?)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). **Pass `dry_run=True`** to validate the YAML via `launch.py --dry-run` without contacting the cluster / spawning a container / running sbatch — returns `{ok, dry_run: True, validated: bool}` instead of `experiment_id`. Used by verify-task workflow stages (deployment_support, hidden_state_dump_support, mlm_eval, ...). |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Compatibility] README still says launch.py --dry-run (hyphenated), but the bridge actually invokes --dryrun (one word) per bridge.py:750. Same bug class as the prior round caught in code — now persisting in the README. Suggest:

Suggested change
| `submit_job(yaml_path, hf_local? \| cluster_host?, ..., dry_run?)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). **Pass `dry_run=True`** to validate the YAML via `launch.py --dry-run` without contacting the cluster / spawning a container / running sbatch — returns `{ok, dry_run: True, validated: bool}` instead of `experiment_id`. Used by verify-task workflow stages (deployment_support, hidden_state_dump_support, mlm_eval, ...). |
| `submit_job(yaml_path, hf_local? \| cluster_host?, ..., dry_run?)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). **Pass `dry_run=True`** to validate the YAML via `launch.py --dryrun` without contacting the cluster / spawning a container / running sbatch — returns `{ok, dry_run: True, validated: bool}` instead of `experiment_id`. Used by verify-task workflow stages (deployment_support, hidden_state_dump_support, mlm_eval, ...). |

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
# resolution, and arg parser. Same argv shape as live submit minus
# the --yes (no confirmation prompt to bypass for dry-run, since
# nothing is actually submitted).
argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dryrun"]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT Algorithm] Argv omits --yes — every launcher doc and CLI example pairs --dryrun --yes (see tools/launcher/CLAUDE.md:28 and :93, tools/launcher/docs/contributing.md:24):

uv run launch.py --yaml <path> --dryrun --yes -v

If nemo_run's CLI emits its confirmation prompt before evaluating --dryrun (which is what the consistent docs pairing implies), then this subprocess will block on stdin until the 60s timeout fires for every dry-run call — the verify-task workflow integration would silently degrade to "every dry-run reports dry_run_timeout."

The 4 unit tests can't catch this — they monkeypatch.setattr(subprocess, "run", fake_run), so the actual nemo_run.cli.main() confirmation behavior is bypassed.

The PR description's argument ("no confirmation prompt to bypass for dry-run, since nothing is actually submitted") is a hypothesis; the launcher's own docs contradict it. Two ways to confirm:

  1. Run uv run launch.py --yaml <fixture> --dryrun (no --yes) once locally and check whether it hangs at the prompt.
  2. Just unconditionally append --yes here — costless if the hypothesis is right, fixes a 60s hang per call if it's wrong.

Recommend option 2.

Comment thread tools/mcp/modelopt_mcp/bridge.py Outdated
hidden_state_dump_support, mlm_eval, ...) that just need to confirm
a YAML compiles before declaring support is ready.

Returns ``{ok, dry_run: True, validated: bool, diagnostics: str,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] Docstring advertises diagnostics: str (plural), but every actual return uses diagnostic (singular) — bridge.py:738/770/813/832. Suggest:

Suggested change
Returns ``{ok, dry_run: True, validated: bool, diagnostics: str,
Returns ``{ok, dry_run: True, validated: bool, diagnostic?: str,
argv: [...], stdout_tail: str}``. Never returns ``experiment_id``
or ``pid``there's nothing to track.

Three more findings from the second-round claude[bot] review on #1718:

1. **Argv omits --yes** (IMPORTANT Algorithm).
   tools/launcher/CLAUDE.md:28 + :93, tools/launcher/docs/contributing.md:24
   all pair --dryrun --yes. Without --yes, nemo_run's run.cli.entrypoint
   blocks on its confirmation prompt — and since we're capture_output=True
   (no TTY), the prompt would hang until our 60s timeout fires. Added
   --yes to argv with a comment citing the launcher-doc precedent.

2. **diagnostic vs diagnostics doc/code mismatch** (SUGGESTION + IMPORTANT
   Compatibility — surfaced in two places).
   - server.py:221 description for the dry_run Field said
     'returns {ok, dry_run, validated, diagnostics: ...}' (plural).
   - bridge.py:724 docstring said 'diagnostics: str' (plural).
   - Code returns 'diagnostic' (singular) on the failure / timeout /
     yaml-not-found branches; nothing on the validated-success branch.
   Both docs updated to match the code: 'diagnostic?: str' (singular,
   optional). The Field description also enumerates the full response
   shape (exit_code, stdout_tail, stderr_tail, argv) so the agent
   reading the tool catalog sees the complete contract.

3. **README still said --dry-run (hyphenated)** (IMPORTANT Compatibility).
   Same bug class as round 1, persisted in the README's tool-surface
   table. Now says '--dryrun --yes' matching the actual argv.

Test updated to assert --yes IS in the dry-run argv (was previously
asserting --yes was absent — that was wrong per the launcher docs).
All 38 tests pass; bandit clean; pre-commit green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

@ChenhanYu ChenhanYu left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 2 fixes pushed in 66a6b56f7 — all three new findings addressed:

1. argv missing --yes (IMPORTANT). Added. Every launcher doc pairs --dryrun --yes (CLAUDE.md:28 + :93, contributing.md:24) and without it nemo_run would hang on its confirmation prompt under capture_output=True (no TTY) until the 60s timeout. The argv now matches the canonical CLI invocation.

2. diagnostic vs diagnostics doc/code mismatch. Code uses diagnostic (singular) — confirmed canonical, matches the pattern across the rest of the bridge module (every other failure-return uses diagnostic). Updated both doc sites to match:

  • server.py:221 Field description now spells the full response shape: {ok, dry_run, validated, diagnostic?, exit_code, stdout_tail, stderr_tail, argv}
  • bridge.py:724 docstring matches.

3. README still said --dry-run (hyphenated). Fixed — now says launch.py --dryrun --yes.

Test updated: previously asserted --yes not in argv (which was wrong per launcher docs); now asserts --yes in argv. All 38 tests pass; bandit clean; pre-commit green.

Thanks for the second-round catch on the --yes flag — would have been a silent timeout hang in CI.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/server.py`:
- Around line 221-223: The Field description in server.py documents a specific
return shape for dry-run responses with required fields validated, exit_code,
stdout_tail, stderr_tail, and argv, but the _submit_job_dry_run function in
bridge.py returns different subsets of fields across different error branches
(yaml_not_found, launcher_dir_not_found, and dry_run_timeout paths omit these
fields). Fix this inconsistency by either: (1) updating the _submit_job_dry_run
implementation in bridge.py to ensure all three error paths return the complete
documented set of fields with sensible defaults (e.g., validated=False, empty
strings for stdout_tail and stderr_tail, empty list for argv, and appropriate
exit_code values), or (2) update the Field description in server.py to document
the actual return shapes for each case and clarify which fields are
conditionally present. Option 1 (fixing the implementation) is preferred as it
maintains a consistent API contract.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 83a615b1-1681-4ffa-848a-4a4563daa538

📥 Commits

Reviewing files that changed from the base of the PR and between 64f0103 and 66a6b56.

📒 Files selected for processing (4)
  • tools/mcp/README.md
  • tools/mcp/modelopt_mcp/bridge.py
  • tools/mcp/modelopt_mcp/server.py
  • tools/mcp/tests/test_bridge.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tools/mcp/README.md
  • tools/mcp/tests/test_bridge.py
  • tools/mcp/modelopt_mcp/bridge.py

Comment on lines +221 to +223
"`{ok, dry_run: True, validated: bool, diagnostic?: str, "
"exit_code: int|None, stdout_tail: str, stderr_tail: str, "
"argv: list[str]}` with no `experiment_id`. Skips "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Documented return shape doesn't match bridge.py implementation for all dry-run branches.

The Field description states the dry-run return will have validated: bool (required), exit_code: int|None, stdout_tail: str, stderr_tail: str, and argv: list[str]. However, cross-referencing with bridge.py's _submit_job_dry_run implementation (context snippet 2) shows three of five return paths omit some of these fields:

  1. yaml_not_found (bridge.py:727-737): missing validated, exit_code, stdout_tail, stderr_tail, argv
  2. launcher_dir_not_found (bridge.py:775-784): missing validated, exit_code, stdout_tail, stderr_tail, argv
  3. dry_run_timeout (bridge.py:803-816): missing validated

Client code expecting result["validated"] on all dry-run returns would raise KeyError on these branches. The implementation also returns an undocumented reason field in error cases.

Either:

  • Fix the bridge.py implementation to always include the documented fields (e.g., validated=False for error cases, empty strings for missing stdout/stderr, empty list for argv), or
  • Update this Field description to clarify that the documented shape applies only to the success/launcher-rejection cases and that error cases return a different subset
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/server.py` around lines 221 - 223, The Field
description in server.py documents a specific return shape for dry-run responses
with required fields validated, exit_code, stdout_tail, stderr_tail, and argv,
but the _submit_job_dry_run function in bridge.py returns different subsets of
fields across different error branches (yaml_not_found, launcher_dir_not_found,
and dry_run_timeout paths omit these fields). Fix this inconsistency by either:
(1) updating the _submit_job_dry_run implementation in bridge.py to ensure all
three error paths return the complete documented set of fields with sensible
defaults (e.g., validated=False, empty strings for stdout_tail and stderr_tail,
empty list for argv, and appropriate exit_code values), or (2) update the Field
description in server.py to document the actual return shapes for each case and
clarify which fields are conditionally present. Option 1 (fixing the
implementation) is preferred as it maintains a consistent API contract.

@ChenhanYu

Copy link
Copy Markdown
Collaborator Author

/claude review

Comment on lines +724 to +729
Returns ``{ok, dry_run: True, validated: bool, diagnostic?: str,
exit_code: int|None, stdout_tail: str, stderr_tail: str,
argv: list[str]}``. Never returns ``experiment_id`` or ``pid`` —
there's nothing to track. ``diagnostic`` is present only on the
failure / timeout branches (the validated-success branch omits
it since there's nothing to diagnose).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] The docstring + server.py:221 Field description both advertise validated: bool and argv: list[str] as if always present in the return shape, but the pre-flight ok=False branches don't include them:

  • yaml_not_found (line 735): no validated, no argv, no exit_code/stdout_tail/stderr_tail
  • launcher_dir_not_found (line 773): same omissions
  • dry_run_timeout (line 813): has argv/exit_code/stdout_tail/stderr_tail but no validated

A defensively-coded caller using .get("validated") won't break, but an agent reading the schema description from the MCP tool definition will reasonably assume validated is always there and may write result["validated"]. Lowest-effort fix: mark them conditional in both doc sites (validated?: bool, argv?: list[str]) and add a one-liner noting that pre-flight failures (ok=False, reason ∈ {yaml_not_found, launcher_dir_not_found}) carry reason + diagnostic only. Higher-effort fix (CodeRabbit's suggestion): populate validated=False, argv=[] etc. on all error branches so the contract is uniform — slightly heavier, but eliminates the question entirely.

Non-blocking; just helpful for downstream agents.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review summary (round 3 — post 66a6b56)

Findings: 0 CRITICAL, 0 IMPORTANT, 1 SUGGESTION

The round-2 fixes landed cleanly:

  • Flag spelling--dryrun (one word) now consistent across bridge.py argv, server.py Field description, README, and the unit test assertion.
  • --yes flag — added; would have silently hung on nemo_run's confirmation prompt under capture_output=True (no TTY) until the 60s timeout otherwise.
  • diagnostic vs diagnostics — reconciled to diagnostic (singular) across server.py:221 Field description, bridge.py:724 docstring, and README. Matches the convention used by every other failure-return in the module.

The one remaining finding is a SUGGESTION about minor return-shape drift — pre-flight error branches (yaml_not_found, launcher_dir_not_found) omit validated and argv even though both are advertised in the schema description. Non-blocking; defensive callers using .get("validated") are unaffected.

Risk assessment

The live-submission path is byte-for-byte unchanged for dry_run=False. The new _submit_job_dry_run helper is hermetically tested at the bridge layer and structurally mirrors the live-submit branch (same path resolution, same argv shape minus the new --dryrun flag, same env propagation, same subprocess invocation). Bandit nosec B603 is justified inline with cited precedent across this module. LGTM.

@ChenhanYu ChenhanYu merged commit c4f39bd into main Jun 14, 2026
49 checks passed
@ChenhanYu ChenhanYu deleted the chenhany/submit-job-dry-run branch June 14, 2026 20:40
@github-actions

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-14 20:40 UTC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant