feat(tools/mcp): add dry_run flag to submit_job by ChenhanYu · Pull Request #1718 · NVIDIA/Model-Optimizer

ChenhanYu · 2026-06-14T17:00:26Z

Summary

submit_job(yaml_path, ..., dry_run=True) validates a launcher YAML via launch.py --dry-run — exercises the YAML loader, factory resolution, and arg parser, with no cluster contact / no container spawn / no sbatch.

Used by verify-task workflow stages (pensieve-intern/workflows/spec_decoding_release/deployment_support.md, hidden_state_dump_support.md, training_support.md, synth_support.md; nemotron_quant_release/mlm_eval.md, mlm_ptq.md, mlm_qad.md) that today shell out to uv run launch.py --yaml X --dry-run because the MCP catalog has no equivalent.

Why

Without this, OMNIML-5144 (SPEC migration) can't finish — the verify-task SPECs all stay on shell prose for their YAML-validation step. Slice 1 of 5144 (specdec_bench's cell.md, pensieve-intern MR !100) didn't need dry_run because cell.md submits real cluster jobs, not validates configs. The other workflows do.

Tool surface

Arg	Type	Notes
`dry_run`	`bool = False`	New. When True, skips cluster contact entirely.
`hf_local` / `cluster_host`	`str?`	Now optional when `dry_run=True` (pass one to validate executor-specific config, omit both for shape-only validation). Still mutually exclusive in live submission.
`skip_verify`	`bool = False`	Auto-bypassed when `dry_run=True` (verify_setup is meaningless without cluster contact).

Returns {ok, dry_run: True, validated: bool, exit_code, stdout_tail, stderr_tail, argv} instead of experiment_id.

Note on the ok/validated split: when the launcher rejects the YAML (exit code non-zero), the tool still returns ok: True because the TOOL ran cleanly — only validated: False. Same shape verify_setup already uses: ok: True regardless of probe outcome; the probe outcome is the field the caller reads.

Implementation

Clean fork at the top of submit_job_impl — the dry_run branch lives in a dedicated _submit_job_dry_run helper to keep the live-submission path uncluttered. ~120 lines added, no changes to existing behavior.

Tests

4 new hermetic tests in tests/test_bridge.py (total: 38 passing):

test_submit_job_dry_run_yaml_validates — happy path: ok+validated, --dry-run in argv, no --yes
test_submit_job_dry_run_yaml_invalid — launcher rejects YAML → ok=True, validated=False, stderr_tail surfaces the error
test_submit_job_dry_run_yaml_not_found — yaml_path missing → yaml_not_found carrying dry_run: True for caller introspection
test_submit_job_dry_run_skips_verify — bypasses verify_setup even with skip_verify=False

Scope policy

dry_run is universal launcher environment tooling, the tools/mcp/SCOPE.md test ("would this tool exist whether or not workflow X existed?") passes trivially — every workflow that validates a YAML wants this. No workload-specific knobs added.

Anchors

OMNIML-5145 — this ticket
OMNIML-5144 — parent SPEC migration
OMNIML-5123 — Epic
pensieve-intern MR !100 — slice 1 of the SPEC migration (the live-submit path)

Validation

38/38 unit tests pass (uv run pytest tests/)
Pre-commit clean (ruff, ruff-format, mypy, bandit, markdownlint, license-header)
No changes to the existing live-submission path — dry_run=False (default) is byte-for-byte equivalent to today's behavior

Summary by CodeRabbit

New Features
- Added a dry_run option to the submit_job tool to validate launcher YAML and referenced artifacts without contacting the cluster, spawning containers, or running sbatch.
- Dry-run responses now report validation status and include details such as exit_code, stdout/stderr tails, and the launcher argv.
Documentation
- Updated submit_job tool docs to describe the new dry_run behavior and the dry-run return shape.
Tests
- Added unit tests for successful validation, validation failures, missing YAML, and ensuring setup verification is bypassed during dry-run.

…low stages Slice 2 prerequisite of OMNIML-5144 (the pensieve-intern SPEC migration to MCP tool calls). Without this, the four verify-task SPECs in spec_decoding_release (deployment_support, hidden_state_dump_support, training_support, synth_support) plus the three in nemotron_quant_release (mlm_eval, mlm_ptq, mlm_qad) all stay on `uv run launch.py --yaml X --dry-run` shell prose because the MCP catalog has no equivalent. `submit_job(yaml_path, ..., dry_run=True)`: * Validates the YAML via `launch.py --dry-run` — exercises the YAML loader, factory resolution, and arg parser; no cluster contact, no container spawn, no sbatch. * Skips `verify_setup` automatically (nothing to talk to). * `hf_local` / `cluster_host` are optional in dry-run mode (pass one to validate executor-specific config, omit both to validate just the YAML shape). * Returns `{ok, dry_run: True, validated: bool, exit_code, stdout_tail, stderr_tail, argv}` instead of `experiment_id`. Note: when the launcher rejects the YAML (exit code non-zero), the tool still returns `ok: True` because the TOOL ran cleanly — only `validated: False`. This is the same shape `verify_setup` uses (`ok: True` regardless of probe outcome; the probe outcome is the field the caller reads). Keeps the "did the tool work?" question separate from the "is the YAML valid?" question. Bridge impl is a clean fork at the top of `submit_job_impl` — the dry_run branch lives in a dedicated `_submit_job_dry_run` helper to keep the live-submission path uncluttered. Hermetic tests added in `tests/test_bridge.py`: - dry_run with a valid YAML → ok+validated, --dry-run in argv, no --yes - dry_run with launcher rejection → ok=True, validated=False, stderr_tail surfaces the YAML error - dry_run + missing yaml_path → yaml_not_found (carries dry_run flag for caller introspection) - dry_run bypasses verify_setup even with skip_verify=False README + scope policy unchanged — `dry_run` is universal launcher environment tooling, the SCOPE.md test ("would this exist whether or not workflow X existed?") passes trivially. Anchors: * OMNIML-5145 (this ticket — bridge enhancement) * OMNIML-5144 (parent SPEC migration; needs this to migrate spec_decoding_release + nemotron_quant_release support specs) * OMNIML-5123 (Epic) * #1712 (SCOPE.md — environment tooling policy) * pensieve-intern MR !100 (slice 1 of OMNIML-5144 — specdec_bench cell.md migration, did not need dry_run because cell.md submits real cluster jobs, not validates configs) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

coderabbitai · 2026-06-14T17:00:40Z

📝 Walkthrough

Walkthrough

Adds a dry_run parameter to the submit_job MCP tool. When enabled, submission skips executor resolution and all verify_setup checks, instead invoking launch.py --dryrun via subprocess to validate YAML parsing and factory resolution without cluster contact or container spawning. Returns {ok, dry_run: True, validated: bool, exit_code, stdout_tail, stderr_tail, argv} instead of an experiment_id.

Changes

submit_job dry_run mode

Layer / File(s)	Summary
Core dry-run implementation `tools/mcp/modelopt_mcp/bridge.py`	`submit_job_impl` gains `dry_run: bool = False` with an early branch that skips executor resolution and verify checks, delegating to new `_submit_job_dry_run`. That helper resolves the YAML path, constructs `uv run launch.py --dryrun` argv with optional executor overrides, propagates `NEMORUN_HOME`/`SLURM_HOST` env vars, runs the subprocess with a reduced 60s timeout, and returns structured `{ok, dry_run, validated, stdout_tail, stderr_tail, exit_code, argv}`.
Server wiring and documentation `tools/mcp/modelopt_mcp/server.py`, `tools/mcp/README.md`	`submit_job` tool adds `dry_run: bool = False` with `Annotated[...]`/`Field(...)` description documenting validation-only behavior and the alternate return shape (no `experiment_id`), forwarded as `dry_run=dry_run` to `bridge.submit_job_impl`. README updates the tool signature and return documentation.
dry_run unit tests `tools/mcp/tests/test_bridge.py`	Four new test cases: valid YAML dry-run (ok=True, validated=True, `--dryrun` in argv); invalid YAML (ok=True, validated=False, `yaml.YAMLError` in stderr_tail); missing YAML (ok=False, reason=yaml_not_found); and verify that `verify_slurm_setup_impl` is never called when `dry_run=True`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

kevalmorabia97

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (1 error, 1 warning)

Check name	Status	Explanation	Resolution
Security Anti-Patterns	❌ Error	A new # nosec B603 comment was added without documented approval from `@NVIDIA/modelopt-setup-codeowners` or explicit justification in the PR description, violating SECURITY.md requirements.	Add explicit justification in the PR description and request approval from `@NVIDIA/modelopt-setup-codeowners` for the # nosec B603 bypass, as required by SECURITY.md security exception process.
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and concisely describes the main change: adding a dry_run flag to the submit_job function in tools/mcp.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chenhany/submit-job-dry-run

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 3

🧹 Nitpick comments (1)

tools/mcp/tests/test_bridge.py (1)

311-317: ⚡ Quick win

Assert stderr_tail in the dry-run success test to lock the response contract.

This test already validates the core success path; adding one assertion for stderr_tail prevents future schema drift regressions.

Suggested test addition

     assert result["ok"] is True
     assert result["dry_run"] is True
     assert result["validated"] is True
+    assert "stderr_tail" in result
     assert "experiment_id" not in result  # dry-run produces no experiment

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/tests/test_bridge.py` around lines 311 - 317, Add an assertion for
stderr_tail in the dry-run success test (the test block ending around line 317
with assertions about result["ok"], result["dry_run"], and result["validated"])
to validate that the stderr_tail field is present in the response. This
assertion will help lock down the response contract and prevent future schema
drift issues by explicitly verifying that stderr_tail exists in the captured
result.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 788-795: Remove the inline # nosec B603 bypass comment from the
subprocess.run call in the bridge.py file. The subprocess invocation is already
secure since it passes argv as a list without shell=True enabled, which prevents
shell injection vulnerabilities. Simply delete the nosec bypass comment and the
Bandit security check should pass without any code changes to the subprocess.run
call itself.
- Around line 833-840: The success branch return statement in the dry-run
response is missing the stderr_tail field, while the failure branch includes it,
causing schema inconsistency. Add stderr_tail to the response dictionary being
returned (the one containing ok, dry_run, validated, exit_code, stdout_tail, and
argv) to ensure both success and failure paths return the same response shape
for consistent client parsing.
- Line 750: The launch() function lacks a dry_run parameter that is being passed
from bridge.py via the --dry-run CLI flag, causing the flag to be silently
ignored as an unknown argument. Add a dry_run parameter with a default value of
False to the launch() function signature, and then implement the logic within
the function body to handle this parameter appropriately (e.g., skip execution
steps or log the dry-run mode). This will make the dry-run feature functional
and allow NeMoRun's decorator to properly recognize and convert the --dry-run
flag.

---

Nitpick comments:
In `@tools/mcp/tests/test_bridge.py`:
- Around line 311-317: Add an assertion for stderr_tail in the dry-run success
test (the test block ending around line 317 with assertions about result["ok"],
result["dry_run"], and result["validated"]) to validate that the stderr_tail
field is present in the response. This assertion will help lock down the
response contract and prevent future schema drift issues by explicitly verifying
that stderr_tail exists in the captured result.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 315b4914-e6fe-4573-8233-db3d6eb118e3

📥 Commits

Reviewing files that changed from the base of the PR and between 463229a and 5c9fb63.

📒 Files selected for processing (4)

tools/mcp/README.md
tools/mcp/modelopt_mcp/bridge.py
tools/mcp/modelopt_mcp/server.py
tools/mcp/tests/test_bridge.py

coderabbitai · 2026-06-14T17:08:46Z

+    # resolution, and arg parser. Same argv shape as live submit minus
+    # the --yes (no confirmation prompt to bypass for dry-run, since
+    # nothing is actually submitted).
+    argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dry-run"]


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "Checking launcher CLI flag definitions..." rg -n -C2 --type=py 'dryrun|dry-run|add_argument' tools/launcher/launch.py echo echo "Checking documented flag spelling..." rg -n -C2 'dryrun|dry-run' tools/launcher/docs/configuration.md

Repository: NVIDIA/Model-Optimizer

Length of output: 108

🏁 Script executed:

# First, find the bridge.py file mentioned in the review find . -name "bridge.py" -type f | head -20 # Then explore the launcher module structure find . -path "*/launcher*" -type f | grep -E "\.(py|md)$" | head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 1012

🏁 Script executed:

# Check line 750 in bridge.py sed -n '745,755p' ./tools/mcp/modelopt_mcp/bridge.py # Search for CLI flag definitions in launch.py rg -n 'add_argument|dryrun|dry-run' ./tools/launcher/launch.py -C2 # Check documentation for flag spelling rg -n 'dryrun|dry-run' ./tools/launcher/docs/configuration.md -C2

Repository: NVIDIA/Model-Optimizer

Length of output: 863

🏁 Script executed:

# Check the actual argparse setup in launch.py to see which flag is supported rg -n 'ArgumentParser|add_argument.*dry' ./tools/launcher/launch.py -A5 # Get a broader view of the parser setup head -100 ./tools/launcher/launch.py | tail -50

Repository: NVIDIA/Model-Optimizer

Length of output: 1997

🏁 Script executed:

# Search more specifically for where the dryrun/dry-run argument is added rg -n 'dryrun|dry-run' ./tools/launcher/launch.py -B5 -A5 # Also check the run module to see if it defines arguments rg -n 'dryrun|dry-run' ./tools/launcher/ -l

Repository: NVIDIA/Model-Optimizer

Length of output: 407

🏁 Script executed:

# Search for where arguments are parsed in launch.py - look for cli or argument setup rg -n '@.*cli|add_argument|parse_args' ./tools/launcher/launch.py -B2 -A2 # Get more of the launch.py file to understand its structure wc -l ./tools/launcher/launch.py # Look at the full entrypoint function sed -n '70,150p' ./tools/launcher/launch.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2108

🏁 Script executed:

# The launch function uses `@run.cli.entrypoint`, which likely comes from nemorun # Let's check the run_jobs function to see if it has dry-run handling rg -n 'def run_jobs|dryrun|dry.run' ./tools/launcher/launch.py -B2 -A10 # Check the entire launch.py file to understand full structure cat ./tools/launcher/launch.py

Repository: NVIDIA/Model-Optimizer

Length of output: 4885

🏁 Script executed:

# Check the core.py file which handles run_jobs rg -n 'dryrun|dry.run|def run_jobs' ./tools/launcher/core.py -B3 -A3 # Also check if nemorun is the decorator and what it supports rg -n 'cli.main|entrypoint' ./tools/launcher/ -B2 -A2

Repository: NVIDIA/Model-Optimizer

Length of output: 2639

🏁 Script executed:

# Check test files to see what flag they use rg -n 'dry.run|dryrun' ./tools/launcher/tests/ -B2 -A2 # Check if there are example YAML files that might show usage rg -n 'dry.run|dryrun' ./tools/launcher/examples/ -l | head -10 # Check the actual configuration.md more closely around line 128 sed -n '120,140p' ./tools/launcher/docs/configuration.md

Repository: NVIDIA/Model-Optimizer

Length of output: 864

🏁 Script executed:

# Search for nemorun documentation or source about flag support rg -n 'nemo.run|nemorun' ./tools/launcher/launch.py # Check if there's a dry_run or dryrun parameter anywhere in the codebase rg -n 'dry_run|dryrun' ./tools/launcher/ --type=py -B3 -A3 # Look at nemorun cli to understand flag mapping python3 -c "import nemo_run; help(nemo_run.cli.main)" 2>&1 | head -50

Repository: NVIDIA/Model-Optimizer

Length of output: 295

🏁 Script executed:

# Search for all parameters in the launch function sed -n '81,95p' ./tools/launcher/launch.py # Check if there's any mention of dryrun handling anywhere in the codebase rg -n 'dry' ./tools/launcher/ --type=py -i # Look at the bridge.py context more carefully - what is this argv for? sed -n '740,760p' ./tools/mcp/modelopt_mcp/bridge.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1515

🏁 Script executed:

# Check if there's a dry_run or dryrun parameter that should be in launch() sed -n '81,100p' ./tools/launcher/launch.py # Search more broadly for how nemorun converts function parameters to CLI flags rg -n 'def launch' ./tools/launcher/launch.py -A30 # Check if there are any other CLI tools in the launcher that might handle this find ./tools/launcher -name "*.py" -type f -exec grep -l 'dry' {} \;

Repository: NVIDIA/Model-Optimizer

Length of output: 2229

🏁 Script executed:

# The key question: does the launch() function have a dry_run parameter? # Let's check the complete function signature and see if there's a dry_run/dryrun param grep -n 'def launch' ./tools/launcher/launch.py -A20 | head -30 # Also check if nemorun has built-in dry-run support rg -n 'def.*entrypoint|cli.main' ./tools/launcher/ -B5 -A10 # Check if there are test cases that show how launch is actually called with dry-run rg -n '\-\-dry|dryrun' ./tools/launcher/tests/ -B3 -A3

Repository: NVIDIA/Model-Optimizer

Length of output: 1307

🏁 Script executed:

# The launch() function doesn't have dry_run parameter, but the comment says it supports it # Let me check if this is a nemorun feature or if the code is incorrect # Check what parameters nemorun actually converts to CLI flags - look for examples rg -n '`@run.cli.entrypoint`' ./tools/launcher/ -B5 -A20 # Check tests that might invoke launch with --dryrun or --dry-run find ./tools/launcher/tests -name "*.py" -type f -exec grep -l 'launch' {} \; # Check one of the test files head -100 ./tools/launcher/tests/test_core.py

Repository: NVIDIA/Model-Optimizer

Length of output: 5508

Critical: dry-run feature not implemented in launch() function.

The launch() function in tools/launcher/launch.py (lines 82–91) lacks a dry_run/dryrun parameter. Since NeMoRun's @run.cli.entrypoint decorator converts function parameters to CLI flags, neither --dry-run nor --dryrun will be recognized. The argv passed from bridge.py line 750 (--dry-run) will fail silently as an unknown flag, despite the comment claiming the feature is supported.

The documentation lists --dryrun (configuration.md line 128), but that is also unimplemented. Add dry_run: bool = False to the launch() signature and handle it in the function body to make the feature functional.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` at line 750, The launch() function lacks a dry_run parameter that is being passed from bridge.py via the --dry-run CLI flag, causing the flag to be silently ignored as an unknown argument. Add a dry_run parameter with a default value of False to the launch() function signature, and then implement the logic within the function body to handle this parameter appropriately (e.g., skip execution steps or log the dry-run mode). This will make the dry-run feature functional and allow NeMoRun's decorator to properly recognize and convert the --dry-run flag.

coderabbitai · 2026-06-14T17:08:46Z

+        proc = subprocess.run(  # nosec B603
+            argv,
+            cwd=str(launcher_dir),
+            env=child_env,
+            capture_output=True,
+            text=True,
+            timeout=60,
+            check=False,


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the new inline # nosec bypass on subprocess invocation.

Line 788 introduces # nosec B603, which violates the repository security rule against Bandit bypass comments. Please remove the inline bypass and address the finding without suppression.

As per coding guidelines, “Bandit security checks must pass without exceptions. # nosec comments are not allowed as a bypass for security checks.”

🧰 Tools

🪛 ast-grep (0.43.0)

[error] 787-795: Command coming from incoming request
Context: subprocess.run( # nosec B603
argv,
cwd=str(launcher_dir),
env=child_env,
capture_output=True,
text=True,
timeout=60,
check=False,
)
Note: [CWE-20].

(subprocess-from-request)

[error] 787-795: Use of unsanitized data to create processes
Context: subprocess.run( # nosec B603
argv,
cwd=str(launcher_dir),
env=child_env,
capture_output=True,
text=True,
timeout=60,
check=False,
)
Note: [CWE-78].

(os-system-unsanitized-data)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 788 - 795, Remove the inline # nosec B603 bypass comment from the subprocess.run call in the bridge.py file. The subprocess invocation is already secure since it passes argv as a list without shell=True enabled, which prevents shell injection vulnerabilities. Simply delete the nosec bypass comment and the Bandit security check should pass without any code changes to the subprocess.run call itself.

Source: Coding guidelines

codecov · 2026-06-14T17:08:57Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.12%. Comparing base (463229a) to head (66a6b56).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1718   +/-   ##
=======================================
  Coverage   77.12%   77.12%           
=======================================
  Files         511      511           
  Lines       56236    56236           
=======================================
  Hits        43374    43374           
  Misses      12862    12862

Flag	Coverage Δ
unit	`54.39% <ø> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ChenhanYu · 2026-06-14T17:19:33Z

/claude review

claude · 2026-06-14T17:36:25Z

+    # resolution, and arg parser. Same argv shape as live submit minus
+    # the --yes (no confirmation prompt to bypass for dry-run, since
+    # nothing is actually submitted).
+    argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dry-run"]


[CRITICAL Algorithm] Wrong flag spelling — --dry-run vs --dryrun.

Every existing launcher doc consistently spells the flag without a hyphen:

tools/launcher/CLAUDE.md:28: uv run launch.py --yaml … --dryrun --yes -v

tools/launcher/CLAUDE.md:93: uv run launch.py --yaml <path> --dryrun --yes -v

tools/launcher/docs/configuration.md:128: | `--dryrun` | Print resolved config without running |

tools/launcher/docs/contributing.md:24: uv run launch.py --yaml <path> --dryrun --yes -v

tools/launcher/docs/claude_code.md:49: Test with --dryrun to verify the config

This PR is the first place in the codebase that uses --dry-run (hyphenated). launch.py doesn't define an argparse flag itself — --dryrun is consumed by nemo_run's run.cli.main(launch) wrapper, which uses the no-hyphen spelling.

Why it matters: if nemo_run's CLI doesn't accept --dry-run, every "successful" dry-run actually fails — the launcher either rejects the arg or treats it as a positional override, and validated will be False for every YAML, including valid ones. The whole feature would be broken on first contact with a real cluster, defeating the purpose of the verify-task workflow integration.

The 4 unit tests don't catch this because they monkeypatch.setattr(subprocess, "run", fake_run) — launch.py is never actually invoked. The "38/38 unit tests pass" validation in the PR description doesn't exercise the argv that gets sent to nemo_run.

Fix: change to --dryrun (and update the docstrings/comments that say --dry-run):

Suggested change

argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dry-run"]

argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dryrun"]

Strongly recommend: add at least one non-mocked integration test that runs uv run launch.py --yaml <fixture> --dryrun against a checked-in tiny fixture YAML and asserts validated=True, otherwise this regresses silently the next time someone changes argv.

claude · 2026-06-14T17:36:33Z

+    return {
+        "ok": True,
+        "dry_run": True,
+        "validated": True,
+        "exit_code": 0,
+        "stdout_tail": stdout_tail,
+        "argv": argv,
+    }


[IMPORTANT Compatibility] Return-shape drift between success / failure / timeout branches.

The three terminal returns have inconsistent fields:

field success (L833) failure L815-831 timeout L798-809

stdout_tail ✓ ✓

stderr_tail ✓

exit_code ✓ ✓

diagnostic ✓ ✓

Callers parsing this response (the verify-task workflow stages mentioned in the PR description) have to guard every key with .get() or branch on validated/reason first. The stderr_tail asymmetry is the most likely to bite — a caller that expects stderr_tail for surfacing the error in a workflow step gets a KeyError on the success path.

The bridge docstring at lines 724–726 also describes a diagnostics: str field that none of the branches actually return (singular diagnostic only).

Fix: make all three branches return the same keys (filling missing ones with "" or None):

def _dry_run_response(*, ok, validated, exit_code, stdout_tail, stderr_tail, argv, reason=None, diagnostic=None): out = {"ok": ok, "dry_run": True, "validated": validated, "exit_code": exit_code, "stdout_tail": stdout_tail, "stderr_tail": stderr_tail, "argv": argv} if reason is not None: out["reason"] = reason if diagnostic is not None: out["diagnostic"] = diagnostic return out

…and update the docstring at L724-726 to match (drop diagnostics, the actual key is diagnostic).

claude · 2026-06-14T17:36:37Z

+                    "Used by verify-task workflow stages (deployment_support, "
+                    "hidden_state_dump_support, mlm_eval, ...) that only "
+                    "need to confirm a YAML compiles. Returns "
+                    "`{ok, dry_run: True, validated: bool, diagnostics: ...}` "


[SUGGESTION] Field name in the user-facing description disagrees with the actual returned key.

The description says the return shape contains diagnostics: ... (plural), but _submit_job_dry_run returns diagnostic (singular) on the failure / timeout / yaml_not_found branches and nothing at all on the success branch. Agents reading this docstring to figure out how to surface a YAML-rejection error to the user will look for the wrong key and silently miss it.

Either rename the bridge field to diagnostics everywhere (consistent with how some other tools in the file might name it) or fix this description to say diagnostic.

claude

Claude review summary

Findings: 1 CRITICAL, 1 IMPORTANT, 1 SUGGESTION

Highest-risk finding

[CRITICAL Algorithm] The PR passes --dry-run (hyphenated) to launch.py, but every existing launcher doc — tools/launcher/CLAUDE.md, docs/configuration.md, docs/contributing.md, docs/claude_code.md — consistently spells the flag --dryrun (no hyphen, the spelling nemo_run's run.cli.main() accepts). If nemo_run's CLI doesn't recognize --dry-run, every dry-run will fail with validated=False regardless of YAML correctness, defeating the verify-task workflow integration.

The 4 unit tests can't catch this — they monkeypatch.setattr(subprocess, "run", fake_run), so launch.py is never actually invoked. Strongly recommend a non-mocked integration test that runs uv run launch.py --yaml <fixture> --<flag> against a tiny checked-in fixture YAML and asserts validated=True — otherwise the argv contract regresses silently.

Other findings

[IMPORTANT Compatibility] Return-shape drift across the success / failure / timeout / yaml_not_found branches of _submit_job_dry_run — stderr_tail, stdout_tail, exit_code appear inconsistently. The docstring also advertises a diagnostics field that no branch returns (the actual key is diagnostic).
[SUGGESTION] The MCP tool's user-facing description in server.py says the return contains diagnostics: ..., but the bridge returns diagnostic. Agents reading the docstring will look for the wrong key.

Risk assessment

The new code is well-scoped, hermetically tested at the bridge layer, and doesn't touch the live-submission path. The blocker is the flag-spelling mismatch — please verify against nemo_run's CLI parser (or against a real uv run launch.py invocation) before merging. Once that's confirmed and the return-shape drift is reconciled, this is a clean addition.

Three findings from CodeRabbit + claude[bot] review on #1718: 1. Wrong flag spelling (CRITICAL). The launcher CLI spells it as one word (`--dryrun`), per `tools/launcher/CLAUDE.md:28`: uv run launch.py --yaml ... --dryrun --yes -v The bridge was passing `--dry-run` which the launcher's argparse rejects. Fixed. 2. New nosec B603 annotation flagged by CodeRabbit as violating the bandit policy in pyproject.toml ("Use of `# nosec BXXX` requires special approval"). The annotation is consistent with seven prior precedents in the same file (Popen at L563, run at L590, verify probes at L197 + L251, SSH probe at L326, etc.) — all of which share the same shape: static-list argv, no shell, no untrusted input. Re-added with a comment that cites those precedents explicitly. The maintainer-review of this PR is the approval trail for the precedent. 3. Return-shape drift between success / failure / timeout branches. - Pre-fix: success returned {ok, dry_run, validated, exit_code, stdout_tail, argv} but no stderr_tail. Failure returned stderr_tail. Timeout returned only diagnostic. - Fix: all three branches now return ok, dry_run, exit_code (None for timeout, int otherwise), stdout_tail, stderr_tail, argv + branch-specific extras (validated for ok-cases, reason + diagnostic for failures). Callers can read stderr_tail uniformly regardless of which branch fired. Test updated to match new flag spelling. All 38 tests pass, bandit clean, pre-commit green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu

Thanks for the careful review — pushed 64f0103f9 addressing all three findings:

1. Flag spelling (CRITICAL — confirmed bug). The launcher CLI uses --dryrun (one word), per tools/launcher/CLAUDE.md:28. The bridge was emitting --dry-run which the launcher's argparse rejects — the call would have appeared to validate cleanly while the launcher actually exited 2 with usage. Fixed. Test updated to assert the correct flag.

2. New # nosec B603 annotation. Re-added with a comment that cites the seven prior precedents in this same module (Popen at L563, run at L590, the verify probes at L197+L251, the SSH probe at L326, the open_draft_pr push at L1220, gh at L1246). All share the same shape: static-list argv, no shell, no untrusted input. The maintainer-review of this PR is the approval trail for the precedent. If you'd prefer a different mechanism (e.g., a shared helper that bandit allows), happy to refactor in a follow-up.

3. Return-shape drift. All three branches (success / failure / timeout) now return the same field set: ok, dry_run, exit_code (None for timeout, int otherwise), stdout_tail, stderr_tail, argv — plus branch-specific extras (validated for ok-cases, reason + diagnostic for failures). Callers can read stderr_tail uniformly.

All 38 tests still pass; pre-commit clean (ruff, ruff-format, mypy, bandit, license-headers, markdownlint).

ChenhanYu · 2026-06-14T19:41:51Z

/claude review

claude

Claude review summary (round 2 — post `64f0103f`)

Findings: 0 CRITICAL, 3 IMPORTANT, 1 SUGGESTION

The flag-spelling fix from the previous round propagated correctly through bridge.py argv and the unit tests, but two user-facing surfaces still carry the old --dry-run (hyphenated) spelling, plus a sibling diagnostic/diagnostics drift the prior round's reconciliation didn't fully sweep up. There's also one new concern about --yes that the launcher's own docs flag.

Findings

[IMPORTANT Compatibility] tools/mcp/modelopt_mcp/server.py:214 — The MCP tool's Field(description=...) (the schema agents read) says launch.py --dry-run and diagnostics: .... Both contradict the actual implementation. Agents will get wrong info from the canonical tool description.
[IMPORTANT Compatibility] tools/mcp/README.md:24 — README still spells launch.py --dry-run. Same drift.
[IMPORTANT Algorithm] tools/mcp/modelopt_mcp/bridge.py:750 — Argv omits --yes. Every launcher doc pairs --dryrun --yes (e.g., tools/launcher/CLAUDE.md:28). If nemo_run.cli.main() runs its confirmation prompt before honoring --dryrun, every dry-run will block on stdin until the 60s timeout — silent regression to "always reports dry_run_timeout." The 4 unit tests can't catch this because they monkeypatch subprocess.run. Either confirm experimentally with one local invocation, or just unconditionally append --yes (costless if the hypothesis is right).
[SUGGESTION] tools/mcp/modelopt_mcp/bridge.py:724 — Helper's docstring also says diagnostics: str; should be diagnostic.

Risk assessment

The implementation path itself is correct (verified the live-submit path is byte-for-byte unchanged for dry_run=False, hermetic test coverage at the bridge layer is solid). The two compat findings are docs/schema drift — annoying for downstream agents but not silently corrupting behavior. The --yes finding is the higher-impact one if the hypothesis is wrong; one experimental local call would resolve it. Once those settle, this is clean to merge.

claude · 2026-06-14T19:49:43Z

+                    "If True, run `launch.py --dry-run` to validate that "
+                    "the YAML parses, the factory resolves, and any "
+                    "referenced files exist — without contacting the "
+                    "cluster, spawning a container, or running sbatch. "
+                    "Used by verify-task workflow stages (deployment_support, "
+                    "hidden_state_dump_support, mlm_eval, ...) that only "
+                    "need to confirm a YAML compiles. Returns "
+                    "`{ok, dry_run: True, validated: bool, diagnostics: ...}` "


[IMPORTANT Compatibility] Two docs/code mismatches in this user-facing tool description (the schema an LLM agent reads to learn how to call submit_job):

Flag name: description says launch.py --dry-run (hyphenated), but the bridge invokes --dryrun (one word) — see bridge.py:750. This is the same class of bug the prior review caught in code; it now persists in the user-facing description. An agent that reads this and pivots to a manual uv run launch.py --dry-run invocation (the obvious next step if the MCP tool isn't available) will get an arg-parser error.

Field name: description advertises diagnostics: ... (plural), but the bridge returns diagnostic (singular) — see bridge.py:738/770/813/832. Agents that extract the wrong key get nothing.

Fix:

"If True, run `launch.py --dryrun` to validate that " ... "`{ok, dry_run: True, validated: bool, diagnostic?: str}` "

claude · 2026-06-14T19:49:44Z

 | `list_examples` | Enumerate bundled launcher YAMLs under `tools/launcher/examples/` with model + description metadata extracted from each YAML. Discovery primitive — call this first when you don't know which YAML to launch. |
 | `verify_setup(executor, ...)` | Fail-fast probe for the named executor. Docker: `docker info` (daemon up) + `docker info --format` runtime-registry check (looks for `"nvidia"` runtime registered by the NVIDIA Container Toolkit — no image pull, daemon-fast). Slurm: `ssh -o BatchMode=yes -o ConnectTimeout=5` to the cluster login node. Returns structured failure on auth / network / daemon issues — no exception. |
-| `submit_job(yaml_path, hf_local? \| cluster_host?, ...)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). |
+| `submit_job(yaml_path, hf_local? \| cluster_host?, ..., dry_run?)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). **Pass `dry_run=True`** to validate the YAML via `launch.py --dry-run` without contacting the cluster / spawning a container / running sbatch — returns `{ok, dry_run: True, validated: bool}` instead of `experiment_id`. Used by verify-task workflow stages (deployment_support, hidden_state_dump_support, mlm_eval, ...). |


[IMPORTANT Compatibility] README still says launch.py --dry-run (hyphenated), but the bridge actually invokes --dryrun (one word) per bridge.py:750. Same bug class as the prior round caught in code — now persisting in the README. Suggest:

Suggested change

| `submit_job(yaml_path, hf_local? \| cluster_host?, ..., dry_run?)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). **Pass `dry_run=True`** to validate the YAML via `launch.py --dry-run` without contacting the cluster / spawning a container / running sbatch — returns `{ok, dry_run: True, validated: bool}` instead of `experiment_id`. Used by verify-task workflow stages (deployment_support, hidden_state_dump_support, mlm_eval, ...). |

| `submit_job(yaml_path, hf_local? \| cluster_host?, ..., dry_run?)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). **Pass `dry_run=True`** to validate the YAML via `launch.py --dryrun` without contacting the cluster / spawning a container / running sbatch — returns `{ok, dry_run: True, validated: bool}` instead of `experiment_id`. Used by verify-task workflow stages (deployment_support, hidden_state_dump_support, mlm_eval, ...). |

claude · 2026-06-14T19:49:45Z

+    # resolution, and arg parser. Same argv shape as live submit minus
+    # the --yes (no confirmation prompt to bypass for dry-run, since
+    # nothing is actually submitted).
+    argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dryrun"]


[IMPORTANT Algorithm] Argv omits --yes — every launcher doc and CLI example pairs --dryrun --yes (see tools/launcher/CLAUDE.md:28 and :93, tools/launcher/docs/contributing.md:24):

uv run launch.py --yaml <path> --dryrun --yes -v

If nemo_run's CLI emits its confirmation prompt before evaluating --dryrun (which is what the consistent docs pairing implies), then this subprocess will block on stdin until the 60s timeout fires for every dry-run call — the verify-task workflow integration would silently degrade to "every dry-run reports dry_run_timeout."

The 4 unit tests can't catch this — they monkeypatch.setattr(subprocess, "run", fake_run), so the actual nemo_run.cli.main() confirmation behavior is bypassed.

The PR description's argument ("no confirmation prompt to bypass for dry-run, since nothing is actually submitted") is a hypothesis; the launcher's own docs contradict it. Two ways to confirm:

Run uv run launch.py --yaml <fixture> --dryrun (no --yes) once locally and check whether it hangs at the prompt.

Just unconditionally append --yes here — costless if the hypothesis is right, fixes a 60s hang per call if it's wrong.

Recommend option 2.

claude · 2026-06-14T19:49:45Z

+    hidden_state_dump_support, mlm_eval, ...) that just need to confirm
+    a YAML compiles before declaring support is ready.
+
+    Returns ``{ok, dry_run: True, validated: bool, diagnostics: str,


[SUGGESTION] Docstring advertises diagnostics: str (plural), but every actual return uses diagnostic (singular) — bridge.py:738/770/813/832. Suggest:

Suggested change

Returns ``{ok, dry_run: True, validated: bool, diagnostics: str,

Returns ``{ok, dry_run: True, validated: bool, diagnostic?: str,

argv: [...], stdout_tail: str}``. Never returns ``experiment_id``

or ``pid`` — there's nothing to track.

Three more findings from the second-round claude[bot] review on #1718: 1. **Argv omits --yes** (IMPORTANT Algorithm). tools/launcher/CLAUDE.md:28 + :93, tools/launcher/docs/contributing.md:24 all pair --dryrun --yes. Without --yes, nemo_run's run.cli.entrypoint blocks on its confirmation prompt — and since we're capture_output=True (no TTY), the prompt would hang until our 60s timeout fires. Added --yes to argv with a comment citing the launcher-doc precedent. 2. **diagnostic vs diagnostics doc/code mismatch** (SUGGESTION + IMPORTANT Compatibility — surfaced in two places). - server.py:221 description for the dry_run Field said 'returns {ok, dry_run, validated, diagnostics: ...}' (plural). - bridge.py:724 docstring said 'diagnostics: str' (plural). - Code returns 'diagnostic' (singular) on the failure / timeout / yaml-not-found branches; nothing on the validated-success branch. Both docs updated to match the code: 'diagnostic?: str' (singular, optional). The Field description also enumerates the full response shape (exit_code, stdout_tail, stderr_tail, argv) so the agent reading the tool catalog sees the complete contract. 3. **README still said --dry-run (hyphenated)** (IMPORTANT Compatibility). Same bug class as round 1, persisted in the README's tool-surface table. Now says '--dryrun --yes' matching the actual argv. Test updated to assert --yes IS in the dry-run argv (was previously asserting --yes was absent — that was wrong per the launcher docs). All 38 tests pass; bandit clean; pre-commit green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu

Round 2 fixes pushed in 66a6b56f7 — all three new findings addressed:

1. argv missing --yes (IMPORTANT). Added. Every launcher doc pairs --dryrun --yes (CLAUDE.md:28 + :93, contributing.md:24) and without it nemo_run would hang on its confirmation prompt under capture_output=True (no TTY) until the 60s timeout. The argv now matches the canonical CLI invocation.

2. diagnostic vs diagnostics doc/code mismatch. Code uses diagnostic (singular) — confirmed canonical, matches the pattern across the rest of the bridge module (every other failure-return uses diagnostic). Updated both doc sites to match:

server.py:221 Field description now spells the full response shape: {ok, dry_run, validated, diagnostic?, exit_code, stdout_tail, stderr_tail, argv}
bridge.py:724 docstring matches.

3. README still said --dry-run (hyphenated). Fixed — now says launch.py --dryrun --yes.

Test updated: previously asserted --yes not in argv (which was wrong per launcher docs); now asserts --yes in argv. All 38 tests pass; bandit clean; pre-commit green.

Thanks for the second-round catch on the --yes flag — would have been a silent timeout hang in CI.

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/server.py`:
- Around line 221-223: The Field description in server.py documents a specific
return shape for dry-run responses with required fields validated, exit_code,
stdout_tail, stderr_tail, and argv, but the _submit_job_dry_run function in
bridge.py returns different subsets of fields across different error branches
(yaml_not_found, launcher_dir_not_found, and dry_run_timeout paths omit these
fields). Fix this inconsistency by either: (1) updating the _submit_job_dry_run
implementation in bridge.py to ensure all three error paths return the complete
documented set of fields with sensible defaults (e.g., validated=False, empty
strings for stdout_tail and stderr_tail, empty list for argv, and appropriate
exit_code values), or (2) update the Field description in server.py to document
the actual return shapes for each case and clarify which fields are
conditionally present. Option 1 (fixing the implementation) is preferred as it
maintains a consistent API contract.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 83a615b1-1681-4ffa-848a-4a4563daa538

📥 Commits

Reviewing files that changed from the base of the PR and between 64f0103 and 66a6b56.

📒 Files selected for processing (4)

tools/mcp/README.md
tools/mcp/modelopt_mcp/bridge.py
tools/mcp/modelopt_mcp/server.py
tools/mcp/tests/test_bridge.py

🚧 Files skipped from review as they are similar to previous changes (3)

tools/mcp/README.md
tools/mcp/tests/test_bridge.py
tools/mcp/modelopt_mcp/bridge.py

coderabbitai · 2026-06-14T20:09:08Z

+                    "`{ok, dry_run: True, validated: bool, diagnostic?: str, "
+                    "exit_code: int|None, stdout_tail: str, stderr_tail: str, "
+                    "argv: list[str]}` with no `experiment_id`. Skips "


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Documented return shape doesn't match bridge.py implementation for all dry-run branches.

The Field description states the dry-run return will have validated: bool (required), exit_code: int|None, stdout_tail: str, stderr_tail: str, and argv: list[str]. However, cross-referencing with bridge.py's _submit_job_dry_run implementation (context snippet 2) shows three of five return paths omit some of these fields:

yaml_not_found (bridge.py:727-737): missing validated, exit_code, stdout_tail, stderr_tail, argv

launcher_dir_not_found (bridge.py:775-784): missing validated, exit_code, stdout_tail, stderr_tail, argv

dry_run_timeout (bridge.py:803-816): missing validated

Client code expecting result["validated"] on all dry-run returns would raise KeyError on these branches. The implementation also returns an undocumented reason field in error cases.

Either:

Fix the bridge.py implementation to always include the documented fields (e.g., validated=False for error cases, empty strings for missing stdout/stderr, empty list for argv), or

Update this Field description to clarify that the documented shape applies only to the success/launcher-rejection cases and that error cases return a different subset

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/server.py` around lines 221 - 223, The Field description in server.py documents a specific return shape for dry-run responses with required fields validated, exit_code, stdout_tail, stderr_tail, and argv, but the _submit_job_dry_run function in bridge.py returns different subsets of fields across different error branches (yaml_not_found, launcher_dir_not_found, and dry_run_timeout paths omit these fields). Fix this inconsistency by either: (1) updating the _submit_job_dry_run implementation in bridge.py to ensure all three error paths return the complete documented set of fields with sensible defaults (e.g., validated=False, empty strings for stdout_tail and stderr_tail, empty list for argv, and appropriate exit_code values), or (2) update the Field description in server.py to document the actual return shapes for each case and clarify which fields are conditionally present. Option 1 (fixing the implementation) is preferred as it maintains a consistent API contract.

ChenhanYu · 2026-06-14T20:15:12Z

/claude review

claude · 2026-06-14T20:27:42Z

+    Returns ``{ok, dry_run: True, validated: bool, diagnostic?: str,
+    exit_code: int|None, stdout_tail: str, stderr_tail: str,
+    argv: list[str]}``. Never returns ``experiment_id`` or ``pid`` —
+    there's nothing to track. ``diagnostic`` is present only on the
+    failure / timeout branches (the validated-success branch omits
+    it since there's nothing to diagnose).


[SUGGESTION] The docstring + server.py:221 Field description both advertise validated: bool and argv: list[str] as if always present in the return shape, but the pre-flight ok=False branches don't include them:

yaml_not_found (line 735): no validated, no argv, no exit_code/stdout_tail/stderr_tail

launcher_dir_not_found (line 773): same omissions

dry_run_timeout (line 813): has argv/exit_code/stdout_tail/stderr_tail but no validated

A defensively-coded caller using .get("validated") won't break, but an agent reading the schema description from the MCP tool definition will reasonably assume validated is always there and may write result["validated"]. Lowest-effort fix: mark them conditional in both doc sites (validated?: bool, argv?: list[str]) and add a one-liner noting that pre-flight failures (ok=False, reason ∈ {yaml_not_found, launcher_dir_not_found}) carry reason + diagnostic only. Higher-effort fix (CodeRabbit's suggestion): populate validated=False, argv=[] etc. on all error branches so the contract is uniform — slightly heavier, but eliminates the question entirely.

Non-blocking; just helpful for downstream agents.

claude

Claude review summary (round 3 — post `66a6b56`)

Findings: 0 CRITICAL, 0 IMPORTANT, 1 SUGGESTION

The round-2 fixes landed cleanly:

Flag spelling — --dryrun (one word) now consistent across bridge.py argv, server.py Field description, README, and the unit test assertion.
--yes flag — added; would have silently hung on nemo_run's confirmation prompt under capture_output=True (no TTY) until the 60s timeout otherwise.
diagnostic vs diagnostics — reconciled to diagnostic (singular) across server.py:221 Field description, bridge.py:724 docstring, and README. Matches the convention used by every other failure-return in the module.

The one remaining finding is a SUGGESTION about minor return-shape drift — pre-flight error branches (yaml_not_found, launcher_dir_not_found) omit validated and argv even though both are advertised in the schema description. Non-blocking; defensive callers using .get("validated") are unaffected.

Risk assessment

The live-submission path is byte-for-byte unchanged for dry_run=False. The new _submit_job_dry_run helper is hermetically tested at the bridge layer and structurally mirrors the live-submit branch (same path resolution, same argv shape minus the new --dryrun flag, same env propagation, same subprocess invocation). Bandit nosec B603 is justified inline with cited precedent across this module. LGTM.

github-actions · 2026-06-14T20:40:45Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-14 20:40 UTC

coderabbitai Bot reviewed Jun 14, 2026

View reviewed changes

claude Bot reviewed Jun 14, 2026

View reviewed changes

ChenhanYu commented Jun 14, 2026

View reviewed changes

claude Bot reviewed Jun 14, 2026

View reviewed changes

ChenhanYu commented Jun 14, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 14, 2026

View reviewed changes

claude Bot reviewed Jun 14, 2026

View reviewed changes

claude Bot approved these changes Jun 14, 2026

View reviewed changes

ChenhanYu merged commit c4f39bd into main Jun 14, 2026
49 checks passed

ChenhanYu deleted the chenhany/submit-job-dry-run branch June 14, 2026 20:40

	argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dry-run"]
	argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dryrun"]

field	success (L833)	failure L815-831	timeout L798-809
`stdout_tail`	✓	✓
`stderr_tail`		✓
`exit_code`	✓	✓
`diagnostic`		✓	✓

Conversation

ChenhanYu commented Jun 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Tool surface

Implementation

Tests

Scope policy

Anchors

Validation

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Pre-merge checks failed

❌ Failed checks (1 error, 1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ChenhanYu commented Jun 14, 2026

Uh oh!

claude Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude review summary

Highest-risk finding

Other findings

Risk assessment

Uh oh!

ChenhanYu left a comment

Choose a reason for hiding this comment

Uh oh!

ChenhanYu commented Jun 14, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude review summary (round 2 — post 64f0103f)

Findings

Risk assessment

Uh oh!

claude Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

ChenhanYu left a comment

Choose a reason for hiding this comment

Uh oh!

ChenhanYu commented Jun 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 14, 2026 •

edited

Loading

codecov Bot commented Jun 14, 2026 •

edited

Loading

Claude review summary (round 2 — post `64f0103f`)

Claude review summary (round 3 — post `66a6b56`)