feat(tools/mcp): add dry_run flag to submit_job#1718
Conversation
…low stages
Slice 2 prerequisite of OMNIML-5144 (the pensieve-intern SPEC
migration to MCP tool calls). Without this, the four verify-task
SPECs in spec_decoding_release (deployment_support,
hidden_state_dump_support, training_support, synth_support) plus
the three in nemotron_quant_release (mlm_eval, mlm_ptq, mlm_qad)
all stay on `uv run launch.py --yaml X --dry-run` shell prose
because the MCP catalog has no equivalent.
`submit_job(yaml_path, ..., dry_run=True)`:
* Validates the YAML via `launch.py --dry-run` — exercises the
YAML loader, factory resolution, and arg parser; no cluster
contact, no container spawn, no sbatch.
* Skips `verify_setup` automatically (nothing to talk to).
* `hf_local` / `cluster_host` are optional in dry-run mode
(pass one to validate executor-specific config, omit both to
validate just the YAML shape).
* Returns `{ok, dry_run: True, validated: bool, exit_code,
stdout_tail, stderr_tail, argv}` instead of `experiment_id`.
Note: when the launcher rejects the YAML (exit code non-zero),
the tool still returns `ok: True` because the TOOL ran cleanly —
only `validated: False`. This is the same shape `verify_setup`
uses (`ok: True` regardless of probe outcome; the probe outcome
is the field the caller reads). Keeps the "did the tool work?"
question separate from the "is the YAML valid?" question.
Bridge impl is a clean fork at the top of `submit_job_impl` —
the dry_run branch lives in a dedicated `_submit_job_dry_run`
helper to keep the live-submission path uncluttered. Hermetic
tests added in `tests/test_bridge.py`:
- dry_run with a valid YAML → ok+validated, --dry-run in argv,
no --yes
- dry_run with launcher rejection → ok=True, validated=False,
stderr_tail surfaces the YAML error
- dry_run + missing yaml_path → yaml_not_found (carries dry_run
flag for caller introspection)
- dry_run bypasses verify_setup even with skip_verify=False
README + scope policy unchanged — `dry_run` is universal launcher
environment tooling, the SCOPE.md test ("would this exist whether
or not workflow X existed?") passes trivially.
Anchors:
* OMNIML-5145 (this ticket — bridge enhancement)
* OMNIML-5144 (parent SPEC migration; needs this to migrate
spec_decoding_release + nemotron_quant_release support specs)
* OMNIML-5123 (Epic)
* #1712 (SCOPE.md — environment tooling policy)
* pensieve-intern MR !100 (slice 1 of OMNIML-5144 — specdec_bench
cell.md migration, did not need dry_run because cell.md submits
real cluster jobs, not validates configs)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
📝 WalkthroughWalkthroughAdds a Changessubmit_job dry_run mode
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Suggested reviewers
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tools/mcp/tests/test_bridge.py (1)
311-317: ⚡ Quick winAssert
stderr_tailin the dry-run success test to lock the response contract.This test already validates the core success path; adding one assertion for
stderr_tailprevents future schema drift regressions.Suggested test addition
assert result["ok"] is True assert result["dry_run"] is True assert result["validated"] is True + assert "stderr_tail" in result assert "experiment_id" not in result # dry-run produces no experiment🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/tests/test_bridge.py` around lines 311 - 317, Add an assertion for stderr_tail in the dry-run success test (the test block ending around line 317 with assertions about result["ok"], result["dry_run"], and result["validated"]) to validate that the stderr_tail field is present in the response. This assertion will help lock down the response contract and prevent future schema drift issues by explicitly verifying that stderr_tail exists in the captured result.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 788-795: Remove the inline # nosec B603 bypass comment from the
subprocess.run call in the bridge.py file. The subprocess invocation is already
secure since it passes argv as a list without shell=True enabled, which prevents
shell injection vulnerabilities. Simply delete the nosec bypass comment and the
Bandit security check should pass without any code changes to the subprocess.run
call itself.
- Around line 833-840: The success branch return statement in the dry-run
response is missing the stderr_tail field, while the failure branch includes it,
causing schema inconsistency. Add stderr_tail to the response dictionary being
returned (the one containing ok, dry_run, validated, exit_code, stdout_tail, and
argv) to ensure both success and failure paths return the same response shape
for consistent client parsing.
- Line 750: The launch() function lacks a dry_run parameter that is being passed
from bridge.py via the --dry-run CLI flag, causing the flag to be silently
ignored as an unknown argument. Add a dry_run parameter with a default value of
False to the launch() function signature, and then implement the logic within
the function body to handle this parameter appropriately (e.g., skip execution
steps or log the dry-run mode). This will make the dry-run feature functional
and allow NeMoRun's decorator to properly recognize and convert the --dry-run
flag.
---
Nitpick comments:
In `@tools/mcp/tests/test_bridge.py`:
- Around line 311-317: Add an assertion for stderr_tail in the dry-run success
test (the test block ending around line 317 with assertions about result["ok"],
result["dry_run"], and result["validated"]) to validate that the stderr_tail
field is present in the response. This assertion will help lock down the
response contract and prevent future schema drift issues by explicitly verifying
that stderr_tail exists in the captured result.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 315b4914-e6fe-4573-8233-db3d6eb118e3
📒 Files selected for processing (4)
tools/mcp/README.mdtools/mcp/modelopt_mcp/bridge.pytools/mcp/modelopt_mcp/server.pytools/mcp/tests/test_bridge.py
| # resolution, and arg parser. Same argv shape as live submit minus | ||
| # the --yes (no confirmation prompt to bypass for dry-run, since | ||
| # nothing is actually submitted). | ||
| argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dry-run"] |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Checking launcher CLI flag definitions..."
rg -n -C2 --type=py 'dryrun|dry-run|add_argument' tools/launcher/launch.py
echo
echo "Checking documented flag spelling..."
rg -n -C2 'dryrun|dry-run' tools/launcher/docs/configuration.mdRepository: NVIDIA/Model-Optimizer
Length of output: 108
🏁 Script executed:
# First, find the bridge.py file mentioned in the review
find . -name "bridge.py" -type f | head -20
# Then explore the launcher module structure
find . -path "*/launcher*" -type f | grep -E "\.(py|md)$" | head -30Repository: NVIDIA/Model-Optimizer
Length of output: 1012
🏁 Script executed:
# Check line 750 in bridge.py
sed -n '745,755p' ./tools/mcp/modelopt_mcp/bridge.py
# Search for CLI flag definitions in launch.py
rg -n 'add_argument|dryrun|dry-run' ./tools/launcher/launch.py -C2
# Check documentation for flag spelling
rg -n 'dryrun|dry-run' ./tools/launcher/docs/configuration.md -C2Repository: NVIDIA/Model-Optimizer
Length of output: 863
🏁 Script executed:
# Check the actual argparse setup in launch.py to see which flag is supported
rg -n 'ArgumentParser|add_argument.*dry' ./tools/launcher/launch.py -A5
# Get a broader view of the parser setup
head -100 ./tools/launcher/launch.py | tail -50Repository: NVIDIA/Model-Optimizer
Length of output: 1997
🏁 Script executed:
# Search more specifically for where the dryrun/dry-run argument is added
rg -n 'dryrun|dry-run' ./tools/launcher/launch.py -B5 -A5
# Also check the run module to see if it defines arguments
rg -n 'dryrun|dry-run' ./tools/launcher/ -lRepository: NVIDIA/Model-Optimizer
Length of output: 407
🏁 Script executed:
# Search for where arguments are parsed in launch.py - look for cli or argument setup
rg -n '@.*cli|add_argument|parse_args' ./tools/launcher/launch.py -B2 -A2
# Get more of the launch.py file to understand its structure
wc -l ./tools/launcher/launch.py
# Look at the full entrypoint function
sed -n '70,150p' ./tools/launcher/launch.pyRepository: NVIDIA/Model-Optimizer
Length of output: 2108
🏁 Script executed:
# The launch function uses `@run.cli.entrypoint`, which likely comes from nemorun
# Let's check the run_jobs function to see if it has dry-run handling
rg -n 'def run_jobs|dryrun|dry.run' ./tools/launcher/launch.py -B2 -A10
# Check the entire launch.py file to understand full structure
cat ./tools/launcher/launch.pyRepository: NVIDIA/Model-Optimizer
Length of output: 4885
🏁 Script executed:
# Check the core.py file which handles run_jobs
rg -n 'dryrun|dry.run|def run_jobs' ./tools/launcher/core.py -B3 -A3
# Also check if nemorun is the decorator and what it supports
rg -n 'cli.main|entrypoint' ./tools/launcher/ -B2 -A2Repository: NVIDIA/Model-Optimizer
Length of output: 2639
🏁 Script executed:
# Check test files to see what flag they use
rg -n 'dry.run|dryrun' ./tools/launcher/tests/ -B2 -A2
# Check if there are example YAML files that might show usage
rg -n 'dry.run|dryrun' ./tools/launcher/examples/ -l | head -10
# Check the actual configuration.md more closely around line 128
sed -n '120,140p' ./tools/launcher/docs/configuration.mdRepository: NVIDIA/Model-Optimizer
Length of output: 864
🏁 Script executed:
# Search for nemorun documentation or source about flag support
rg -n 'nemo.run|nemorun' ./tools/launcher/launch.py
# Check if there's a dry_run or dryrun parameter anywhere in the codebase
rg -n 'dry_run|dryrun' ./tools/launcher/ --type=py -B3 -A3
# Look at nemorun cli to understand flag mapping
python3 -c "import nemo_run; help(nemo_run.cli.main)" 2>&1 | head -50Repository: NVIDIA/Model-Optimizer
Length of output: 295
🏁 Script executed:
# Search for all parameters in the launch function
sed -n '81,95p' ./tools/launcher/launch.py
# Check if there's any mention of dryrun handling anywhere in the codebase
rg -n 'dry' ./tools/launcher/ --type=py -i
# Look at the bridge.py context more carefully - what is this argv for?
sed -n '740,760p' ./tools/mcp/modelopt_mcp/bridge.pyRepository: NVIDIA/Model-Optimizer
Length of output: 1515
🏁 Script executed:
# Check if there's a dry_run or dryrun parameter that should be in launch()
sed -n '81,100p' ./tools/launcher/launch.py
# Search more broadly for how nemorun converts function parameters to CLI flags
rg -n 'def launch' ./tools/launcher/launch.py -A30
# Check if there are any other CLI tools in the launcher that might handle this
find ./tools/launcher -name "*.py" -type f -exec grep -l 'dry' {} \;Repository: NVIDIA/Model-Optimizer
Length of output: 2229
🏁 Script executed:
# The key question: does the launch() function have a dry_run parameter?
# Let's check the complete function signature and see if there's a dry_run/dryrun param
grep -n 'def launch' ./tools/launcher/launch.py -A20 | head -30
# Also check if nemorun has built-in dry-run support
rg -n 'def.*entrypoint|cli.main' ./tools/launcher/ -B5 -A10
# Check if there are test cases that show how launch is actually called with dry-run
rg -n '\-\-dry|dryrun' ./tools/launcher/tests/ -B3 -A3Repository: NVIDIA/Model-Optimizer
Length of output: 1307
🏁 Script executed:
# The launch() function doesn't have dry_run parameter, but the comment says it supports it
# Let me check if this is a nemorun feature or if the code is incorrect
# Check what parameters nemorun actually converts to CLI flags - look for examples
rg -n '`@run.cli.entrypoint`' ./tools/launcher/ -B5 -A20
# Check tests that might invoke launch with --dryrun or --dry-run
find ./tools/launcher/tests -name "*.py" -type f -exec grep -l 'launch' {} \;
# Check one of the test files
head -100 ./tools/launcher/tests/test_core.pyRepository: NVIDIA/Model-Optimizer
Length of output: 5508
Critical: dry-run feature not implemented in launch() function.
The launch() function in tools/launcher/launch.py (lines 82–91) lacks a dry_run/dryrun parameter. Since NeMoRun's @run.cli.entrypoint decorator converts function parameters to CLI flags, neither --dry-run nor --dryrun will be recognized. The argv passed from bridge.py line 750 (--dry-run) will fail silently as an unknown flag, despite the comment claiming the feature is supported.
The documentation lists --dryrun (configuration.md line 128), but that is also unimplemented. Add dry_run: bool = False to the launch() signature and handle it in the function body to make the feature functional.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/mcp/modelopt_mcp/bridge.py` at line 750, The launch() function lacks a
dry_run parameter that is being passed from bridge.py via the --dry-run CLI
flag, causing the flag to be silently ignored as an unknown argument. Add a
dry_run parameter with a default value of False to the launch() function
signature, and then implement the logic within the function body to handle this
parameter appropriately (e.g., skip execution steps or log the dry-run mode).
This will make the dry-run feature functional and allow NeMoRun's decorator to
properly recognize and convert the --dry-run flag.
| proc = subprocess.run( # nosec B603 | ||
| argv, | ||
| cwd=str(launcher_dir), | ||
| env=child_env, | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=60, | ||
| check=False, |
There was a problem hiding this comment.
Remove the new inline # nosec bypass on subprocess invocation.
Line 788 introduces # nosec B603, which violates the repository security rule against Bandit bypass comments. Please remove the inline bypass and address the finding without suppression.
As per coding guidelines, “Bandit security checks must pass without exceptions. # nosec comments are not allowed as a bypass for security checks.”
🧰 Tools
🪛 ast-grep (0.43.0)
[error] 787-795: Command coming from incoming request
Context: subprocess.run( # nosec B603
argv,
cwd=str(launcher_dir),
env=child_env,
capture_output=True,
text=True,
timeout=60,
check=False,
)
Note: [CWE-20].
(subprocess-from-request)
[error] 787-795: Use of unsanitized data to create processes
Context: subprocess.run( # nosec B603
argv,
cwd=str(launcher_dir),
env=child_env,
capture_output=True,
text=True,
timeout=60,
check=False,
)
Note: [CWE-78].
(os-system-unsanitized-data)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/mcp/modelopt_mcp/bridge.py` around lines 788 - 795, Remove the inline #
nosec B603 bypass comment from the subprocess.run call in the bridge.py file.
The subprocess invocation is already secure since it passes argv as a list
without shell=True enabled, which prevents shell injection vulnerabilities.
Simply delete the nosec bypass comment and the Bandit security check should pass
without any code changes to the subprocess.run call itself.
Source: Coding guidelines
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1718 +/- ##
=======================================
Coverage 77.12% 77.12%
=======================================
Files 511 511
Lines 56236 56236
=======================================
Hits 43374 43374
Misses 12862 12862
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
/claude review |
| # resolution, and arg parser. Same argv shape as live submit minus | ||
| # the --yes (no confirmation prompt to bypass for dry-run, since | ||
| # nothing is actually submitted). | ||
| argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dry-run"] |
There was a problem hiding this comment.
[CRITICAL Algorithm] Wrong flag spelling — --dry-run vs --dryrun.
Every existing launcher doc consistently spells the flag without a hyphen:
tools/launcher/CLAUDE.md:28:uv run launch.py --yaml … --dryrun --yes -vtools/launcher/CLAUDE.md:93:uv run launch.py --yaml <path> --dryrun --yes -vtools/launcher/docs/configuration.md:128:| `--dryrun` | Print resolved config without running |tools/launcher/docs/contributing.md:24:uv run launch.py --yaml <path> --dryrun --yes -vtools/launcher/docs/claude_code.md:49:Test with --dryrun to verify the config
This PR is the first place in the codebase that uses --dry-run (hyphenated). launch.py doesn't define an argparse flag itself — --dryrun is consumed by nemo_run's run.cli.main(launch) wrapper, which uses the no-hyphen spelling.
Why it matters: if nemo_run's CLI doesn't accept --dry-run, every "successful" dry-run actually fails — the launcher either rejects the arg or treats it as a positional override, and validated will be False for every YAML, including valid ones. The whole feature would be broken on first contact with a real cluster, defeating the purpose of the verify-task workflow integration.
The 4 unit tests don't catch this because they monkeypatch.setattr(subprocess, "run", fake_run) — launch.py is never actually invoked. The "38/38 unit tests pass" validation in the PR description doesn't exercise the argv that gets sent to nemo_run.
Fix: change to --dryrun (and update the docstrings/comments that say --dry-run):
| argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dry-run"] | |
| argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dryrun"] |
Strongly recommend: add at least one non-mocked integration test that runs uv run launch.py --yaml <fixture> --dryrun against a checked-in tiny fixture YAML and asserts validated=True, otherwise this regresses silently the next time someone changes argv.
| return { | ||
| "ok": True, | ||
| "dry_run": True, | ||
| "validated": True, | ||
| "exit_code": 0, | ||
| "stdout_tail": stdout_tail, | ||
| "argv": argv, | ||
| } |
There was a problem hiding this comment.
[IMPORTANT Compatibility] Return-shape drift between success / failure / timeout branches.
The three terminal returns have inconsistent fields:
| field | success (L833) | failure L815-831 | timeout L798-809 |
|---|---|---|---|
stdout_tail |
✓ | ✓ | |
stderr_tail |
✓ | ||
exit_code |
✓ | ✓ | |
diagnostic |
✓ | ✓ |
Callers parsing this response (the verify-task workflow stages mentioned in the PR description) have to guard every key with .get() or branch on validated/reason first. The stderr_tail asymmetry is the most likely to bite — a caller that expects stderr_tail for surfacing the error in a workflow step gets a KeyError on the success path.
The bridge docstring at lines 724–726 also describes a diagnostics: str field that none of the branches actually return (singular diagnostic only).
Fix: make all three branches return the same keys (filling missing ones with "" or None):
def _dry_run_response(*, ok, validated, exit_code, stdout_tail, stderr_tail, argv, reason=None, diagnostic=None):
out = {"ok": ok, "dry_run": True, "validated": validated,
"exit_code": exit_code, "stdout_tail": stdout_tail,
"stderr_tail": stderr_tail, "argv": argv}
if reason is not None:
out["reason"] = reason
if diagnostic is not None:
out["diagnostic"] = diagnostic
return out…and update the docstring at L724-726 to match (drop diagnostics, the actual key is diagnostic).
| "Used by verify-task workflow stages (deployment_support, " | ||
| "hidden_state_dump_support, mlm_eval, ...) that only " | ||
| "need to confirm a YAML compiles. Returns " | ||
| "`{ok, dry_run: True, validated: bool, diagnostics: ...}` " |
There was a problem hiding this comment.
[SUGGESTION] Field name in the user-facing description disagrees with the actual returned key.
The description says the return shape contains diagnostics: ... (plural), but _submit_job_dry_run returns diagnostic (singular) on the failure / timeout / yaml_not_found branches and nothing at all on the success branch. Agents reading this docstring to figure out how to surface a YAML-rejection error to the user will look for the wrong key and silently miss it.
Either rename the bridge field to diagnostics everywhere (consistent with how some other tools in the file might name it) or fix this description to say diagnostic.
There was a problem hiding this comment.
Claude review summary
Findings: 1 CRITICAL, 1 IMPORTANT, 1 SUGGESTION
Highest-risk finding
[CRITICAL Algorithm] The PR passes --dry-run (hyphenated) to launch.py, but every existing launcher doc — tools/launcher/CLAUDE.md, docs/configuration.md, docs/contributing.md, docs/claude_code.md — consistently spells the flag --dryrun (no hyphen, the spelling nemo_run's run.cli.main() accepts). If nemo_run's CLI doesn't recognize --dry-run, every dry-run will fail with validated=False regardless of YAML correctness, defeating the verify-task workflow integration.
The 4 unit tests can't catch this — they monkeypatch.setattr(subprocess, "run", fake_run), so launch.py is never actually invoked. Strongly recommend a non-mocked integration test that runs uv run launch.py --yaml <fixture> --<flag> against a tiny checked-in fixture YAML and asserts validated=True — otherwise the argv contract regresses silently.
Other findings
- [IMPORTANT Compatibility] Return-shape drift across the success / failure / timeout / yaml_not_found branches of
_submit_job_dry_run—stderr_tail,stdout_tail,exit_codeappear inconsistently. The docstring also advertises adiagnosticsfield that no branch returns (the actual key isdiagnostic). - [SUGGESTION] The MCP tool's user-facing description in
server.pysays the return containsdiagnostics: ..., but the bridge returnsdiagnostic. Agents reading the docstring will look for the wrong key.
Risk assessment
The new code is well-scoped, hermetically tested at the bridge layer, and doesn't touch the live-submission path. The blocker is the flag-spelling mismatch — please verify against nemo_run's CLI parser (or against a real uv run launch.py invocation) before merging. Once that's confirmed and the return-shape drift is reconciled, this is a clean addition.
Three findings from CodeRabbit + claude[bot] review on #1718: 1. Wrong flag spelling (CRITICAL). The launcher CLI spells it as one word (`--dryrun`), per `tools/launcher/CLAUDE.md:28`: uv run launch.py --yaml ... --dryrun --yes -v The bridge was passing `--dry-run` which the launcher's argparse rejects. Fixed. 2. New nosec B603 annotation flagged by CodeRabbit as violating the bandit policy in pyproject.toml ("Use of `# nosec BXXX` requires special approval"). The annotation is consistent with seven prior precedents in the same file (Popen at L563, run at L590, verify probes at L197 + L251, SSH probe at L326, etc.) — all of which share the same shape: static-list argv, no shell, no untrusted input. Re-added with a comment that cites those precedents explicitly. The maintainer-review of this PR is the approval trail for the precedent. 3. Return-shape drift between success / failure / timeout branches. - Pre-fix: success returned {ok, dry_run, validated, exit_code, stdout_tail, argv} but no stderr_tail. Failure returned stderr_tail. Timeout returned only diagnostic. - Fix: all three branches now return ok, dry_run, exit_code (None for timeout, int otherwise), stdout_tail, stderr_tail, argv + branch-specific extras (validated for ok-cases, reason + diagnostic for failures). Callers can read stderr_tail uniformly regardless of which branch fired. Test updated to match new flag spelling. All 38 tests pass, bandit clean, pre-commit green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
ChenhanYu
left a comment
There was a problem hiding this comment.
Thanks for the careful review — pushed 64f0103f9 addressing all three findings:
1. Flag spelling (CRITICAL — confirmed bug). The launcher CLI uses --dryrun (one word), per tools/launcher/CLAUDE.md:28. The bridge was emitting --dry-run which the launcher's argparse rejects — the call would have appeared to validate cleanly while the launcher actually exited 2 with usage. Fixed. Test updated to assert the correct flag.
2. New # nosec B603 annotation. Re-added with a comment that cites the seven prior precedents in this same module (Popen at L563, run at L590, the verify probes at L197+L251, the SSH probe at L326, the open_draft_pr push at L1220, gh at L1246). All share the same shape: static-list argv, no shell, no untrusted input. The maintainer-review of this PR is the approval trail for the precedent. If you'd prefer a different mechanism (e.g., a shared helper that bandit allows), happy to refactor in a follow-up.
3. Return-shape drift. All three branches (success / failure / timeout) now return the same field set: ok, dry_run, exit_code (None for timeout, int otherwise), stdout_tail, stderr_tail, argv — plus branch-specific extras (validated for ok-cases, reason + diagnostic for failures). Callers can read stderr_tail uniformly.
All 38 tests still pass; pre-commit clean (ruff, ruff-format, mypy, bandit, license-headers, markdownlint).
|
/claude review |
There was a problem hiding this comment.
Claude review summary (round 2 — post 64f0103f)
Findings: 0 CRITICAL, 3 IMPORTANT, 1 SUGGESTION
The flag-spelling fix from the previous round propagated correctly through bridge.py argv and the unit tests, but two user-facing surfaces still carry the old --dry-run (hyphenated) spelling, plus a sibling diagnostic/diagnostics drift the prior round's reconciliation didn't fully sweep up. There's also one new concern about --yes that the launcher's own docs flag.
Findings
- [IMPORTANT Compatibility]
tools/mcp/modelopt_mcp/server.py:214— The MCP tool'sField(description=...)(the schema agents read) sayslaunch.py --dry-runanddiagnostics: .... Both contradict the actual implementation. Agents will get wrong info from the canonical tool description. - [IMPORTANT Compatibility]
tools/mcp/README.md:24— README still spellslaunch.py --dry-run. Same drift. - [IMPORTANT Algorithm]
tools/mcp/modelopt_mcp/bridge.py:750— Argv omits--yes. Every launcher doc pairs--dryrun --yes(e.g.,tools/launcher/CLAUDE.md:28). Ifnemo_run.cli.main()runs its confirmation prompt before honoring--dryrun, every dry-run will block on stdin until the 60s timeout — silent regression to "always reportsdry_run_timeout." The 4 unit tests can't catch this because they monkeypatchsubprocess.run. Either confirm experimentally with one local invocation, or just unconditionally append--yes(costless if the hypothesis is right). - [SUGGESTION]
tools/mcp/modelopt_mcp/bridge.py:724— Helper's docstring also saysdiagnostics: str; should bediagnostic.
Risk assessment
The implementation path itself is correct (verified the live-submit path is byte-for-byte unchanged for dry_run=False, hermetic test coverage at the bridge layer is solid). The two compat findings are docs/schema drift — annoying for downstream agents but not silently corrupting behavior. The --yes finding is the higher-impact one if the hypothesis is wrong; one experimental local call would resolve it. Once those settle, this is clean to merge.
| "If True, run `launch.py --dry-run` to validate that " | ||
| "the YAML parses, the factory resolves, and any " | ||
| "referenced files exist — without contacting the " | ||
| "cluster, spawning a container, or running sbatch. " | ||
| "Used by verify-task workflow stages (deployment_support, " | ||
| "hidden_state_dump_support, mlm_eval, ...) that only " | ||
| "need to confirm a YAML compiles. Returns " | ||
| "`{ok, dry_run: True, validated: bool, diagnostics: ...}` " |
There was a problem hiding this comment.
[IMPORTANT Compatibility] Two docs/code mismatches in this user-facing tool description (the schema an LLM agent reads to learn how to call submit_job):
-
Flag name: description says
launch.py --dry-run(hyphenated), but the bridge invokes--dryrun(one word) — seebridge.py:750. This is the same class of bug the prior review caught in code; it now persists in the user-facing description. An agent that reads this and pivots to a manualuv run launch.py --dry-runinvocation (the obvious next step if the MCP tool isn't available) will get an arg-parser error. -
Field name: description advertises
diagnostics: ...(plural), but the bridge returnsdiagnostic(singular) — seebridge.py:738/770/813/832. Agents that extract the wrong key get nothing.
Fix:
"If True, run `launch.py --dryrun` to validate that "
...
"`{ok, dry_run: True, validated: bool, diagnostic?: str}` "| | `list_examples` | Enumerate bundled launcher YAMLs under `tools/launcher/examples/` with model + description metadata extracted from each YAML. Discovery primitive — call this first when you don't know which YAML to launch. | | ||
| | `verify_setup(executor, ...)` | Fail-fast probe for the named executor. Docker: `docker info` (daemon up) + `docker info --format` runtime-registry check (looks for `"nvidia"` runtime registered by the NVIDIA Container Toolkit — no image pull, daemon-fast). Slurm: `ssh -o BatchMode=yes -o ConnectTimeout=5` to the cluster login node. Returns structured failure on auth / network / daemon issues — no exception. | | ||
| | `submit_job(yaml_path, hf_local? \| cluster_host?, ...)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). | | ||
| | `submit_job(yaml_path, hf_local? \| cluster_host?, ..., dry_run?)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). **Pass `dry_run=True`** to validate the YAML via `launch.py --dry-run` without contacting the cluster / spawning a container / running sbatch — returns `{ok, dry_run: True, validated: bool}` instead of `experiment_id`. Used by verify-task workflow stages (deployment_support, hidden_state_dump_support, mlm_eval, ...). | |
There was a problem hiding this comment.
[IMPORTANT Compatibility] README still says launch.py --dry-run (hyphenated), but the bridge actually invokes --dryrun (one word) per bridge.py:750. Same bug class as the prior round caught in code — now persisting in the README. Suggest:
| | `submit_job(yaml_path, hf_local? \| cluster_host?, ..., dry_run?)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). **Pass `dry_run=True`** to validate the YAML via `launch.py --dry-run` without contacting the cluster / spawning a container / running sbatch — returns `{ok, dry_run: True, validated: bool}` instead of `experiment_id`. Used by verify-task workflow stages (deployment_support, hidden_state_dump_support, mlm_eval, ...). | | |
| | `submit_job(yaml_path, hf_local? \| cluster_host?, ..., dry_run?)` | Submit a launcher YAML. Mode resolved from mutually-exclusive args. Returns `experiment_id` (Slurm) or PID (Docker) immediately; the actual job runs detached. Auto-runs `verify_setup` first by default (skippable). **Pass `dry_run=True`** to validate the YAML via `launch.py --dryrun` without contacting the cluster / spawning a container / running sbatch — returns `{ok, dry_run: True, validated: bool}` instead of `experiment_id`. Used by verify-task workflow stages (deployment_support, hidden_state_dump_support, mlm_eval, ...). | |
| # resolution, and arg parser. Same argv shape as live submit minus | ||
| # the --yes (no confirmation prompt to bypass for dry-run, since | ||
| # nothing is actually submitted). | ||
| argv = ["uv", "run", "launch.py", "--yaml", str(abs_yaml), "--dryrun"] |
There was a problem hiding this comment.
[IMPORTANT Algorithm] Argv omits --yes — every launcher doc and CLI example pairs --dryrun --yes (see tools/launcher/CLAUDE.md:28 and :93, tools/launcher/docs/contributing.md:24):
uv run launch.py --yaml <path> --dryrun --yes -v
If nemo_run's CLI emits its confirmation prompt before evaluating --dryrun (which is what the consistent docs pairing implies), then this subprocess will block on stdin until the 60s timeout fires for every dry-run call — the verify-task workflow integration would silently degrade to "every dry-run reports dry_run_timeout."
The 4 unit tests can't catch this — they monkeypatch.setattr(subprocess, "run", fake_run), so the actual nemo_run.cli.main() confirmation behavior is bypassed.
The PR description's argument ("no confirmation prompt to bypass for dry-run, since nothing is actually submitted") is a hypothesis; the launcher's own docs contradict it. Two ways to confirm:
- Run
uv run launch.py --yaml <fixture> --dryrun(no--yes) once locally and check whether it hangs at the prompt. - Just unconditionally append
--yeshere — costless if the hypothesis is right, fixes a 60s hang per call if it's wrong.
Recommend option 2.
| hidden_state_dump_support, mlm_eval, ...) that just need to confirm | ||
| a YAML compiles before declaring support is ready. | ||
|
|
||
| Returns ``{ok, dry_run: True, validated: bool, diagnostics: str, |
There was a problem hiding this comment.
[SUGGESTION] Docstring advertises diagnostics: str (plural), but every actual return uses diagnostic (singular) — bridge.py:738/770/813/832. Suggest:
| Returns ``{ok, dry_run: True, validated: bool, diagnostics: str, | |
| Returns ``{ok, dry_run: True, validated: bool, diagnostic?: str, | |
| argv: [...], stdout_tail: str}``. Never returns ``experiment_id`` | |
| or ``pid`` — there's nothing to track. |
Three more findings from the second-round claude[bot] review on #1718: 1. **Argv omits --yes** (IMPORTANT Algorithm). tools/launcher/CLAUDE.md:28 + :93, tools/launcher/docs/contributing.md:24 all pair --dryrun --yes. Without --yes, nemo_run's run.cli.entrypoint blocks on its confirmation prompt — and since we're capture_output=True (no TTY), the prompt would hang until our 60s timeout fires. Added --yes to argv with a comment citing the launcher-doc precedent. 2. **diagnostic vs diagnostics doc/code mismatch** (SUGGESTION + IMPORTANT Compatibility — surfaced in two places). - server.py:221 description for the dry_run Field said 'returns {ok, dry_run, validated, diagnostics: ...}' (plural). - bridge.py:724 docstring said 'diagnostics: str' (plural). - Code returns 'diagnostic' (singular) on the failure / timeout / yaml-not-found branches; nothing on the validated-success branch. Both docs updated to match the code: 'diagnostic?: str' (singular, optional). The Field description also enumerates the full response shape (exit_code, stdout_tail, stderr_tail, argv) so the agent reading the tool catalog sees the complete contract. 3. **README still said --dry-run (hyphenated)** (IMPORTANT Compatibility). Same bug class as round 1, persisted in the README's tool-surface table. Now says '--dryrun --yes' matching the actual argv. Test updated to assert --yes IS in the dry-run argv (was previously asserting --yes was absent — that was wrong per the launcher docs). All 38 tests pass; bandit clean; pre-commit green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
ChenhanYu
left a comment
There was a problem hiding this comment.
Round 2 fixes pushed in 66a6b56f7 — all three new findings addressed:
1. argv missing --yes (IMPORTANT). Added. Every launcher doc pairs --dryrun --yes (CLAUDE.md:28 + :93, contributing.md:24) and without it nemo_run would hang on its confirmation prompt under capture_output=True (no TTY) until the 60s timeout. The argv now matches the canonical CLI invocation.
2. diagnostic vs diagnostics doc/code mismatch. Code uses diagnostic (singular) — confirmed canonical, matches the pattern across the rest of the bridge module (every other failure-return uses diagnostic). Updated both doc sites to match:
server.py:221Field description now spells the full response shape:{ok, dry_run, validated, diagnostic?, exit_code, stdout_tail, stderr_tail, argv}bridge.py:724docstring matches.
3. README still said --dry-run (hyphenated). Fixed — now says launch.py --dryrun --yes.
Test updated: previously asserted --yes not in argv (which was wrong per launcher docs); now asserts --yes in argv. All 38 tests pass; bandit clean; pre-commit green.
Thanks for the second-round catch on the --yes flag — would have been a silent timeout hang in CI.
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/mcp/modelopt_mcp/server.py`:
- Around line 221-223: The Field description in server.py documents a specific
return shape for dry-run responses with required fields validated, exit_code,
stdout_tail, stderr_tail, and argv, but the _submit_job_dry_run function in
bridge.py returns different subsets of fields across different error branches
(yaml_not_found, launcher_dir_not_found, and dry_run_timeout paths omit these
fields). Fix this inconsistency by either: (1) updating the _submit_job_dry_run
implementation in bridge.py to ensure all three error paths return the complete
documented set of fields with sensible defaults (e.g., validated=False, empty
strings for stdout_tail and stderr_tail, empty list for argv, and appropriate
exit_code values), or (2) update the Field description in server.py to document
the actual return shapes for each case and clarify which fields are
conditionally present. Option 1 (fixing the implementation) is preferred as it
maintains a consistent API contract.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 83a615b1-1681-4ffa-848a-4a4563daa538
📒 Files selected for processing (4)
tools/mcp/README.mdtools/mcp/modelopt_mcp/bridge.pytools/mcp/modelopt_mcp/server.pytools/mcp/tests/test_bridge.py
🚧 Files skipped from review as they are similar to previous changes (3)
- tools/mcp/README.md
- tools/mcp/tests/test_bridge.py
- tools/mcp/modelopt_mcp/bridge.py
| "`{ok, dry_run: True, validated: bool, diagnostic?: str, " | ||
| "exit_code: int|None, stdout_tail: str, stderr_tail: str, " | ||
| "argv: list[str]}` with no `experiment_id`. Skips " |
There was a problem hiding this comment.
Documented return shape doesn't match bridge.py implementation for all dry-run branches.
The Field description states the dry-run return will have validated: bool (required), exit_code: int|None, stdout_tail: str, stderr_tail: str, and argv: list[str]. However, cross-referencing with bridge.py's _submit_job_dry_run implementation (context snippet 2) shows three of five return paths omit some of these fields:
- yaml_not_found (bridge.py:727-737): missing
validated,exit_code,stdout_tail,stderr_tail,argv - launcher_dir_not_found (bridge.py:775-784): missing
validated,exit_code,stdout_tail,stderr_tail,argv - dry_run_timeout (bridge.py:803-816): missing
validated
Client code expecting result["validated"] on all dry-run returns would raise KeyError on these branches. The implementation also returns an undocumented reason field in error cases.
Either:
- Fix the bridge.py implementation to always include the documented fields (e.g.,
validated=Falsefor error cases, empty strings for missing stdout/stderr, empty list for argv), or - Update this Field description to clarify that the documented shape applies only to the success/launcher-rejection cases and that error cases return a different subset
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/mcp/modelopt_mcp/server.py` around lines 221 - 223, The Field
description in server.py documents a specific return shape for dry-run responses
with required fields validated, exit_code, stdout_tail, stderr_tail, and argv,
but the _submit_job_dry_run function in bridge.py returns different subsets of
fields across different error branches (yaml_not_found, launcher_dir_not_found,
and dry_run_timeout paths omit these fields). Fix this inconsistency by either:
(1) updating the _submit_job_dry_run implementation in bridge.py to ensure all
three error paths return the complete documented set of fields with sensible
defaults (e.g., validated=False, empty strings for stdout_tail and stderr_tail,
empty list for argv, and appropriate exit_code values), or (2) update the Field
description in server.py to document the actual return shapes for each case and
clarify which fields are conditionally present. Option 1 (fixing the
implementation) is preferred as it maintains a consistent API contract.
|
/claude review |
| Returns ``{ok, dry_run: True, validated: bool, diagnostic?: str, | ||
| exit_code: int|None, stdout_tail: str, stderr_tail: str, | ||
| argv: list[str]}``. Never returns ``experiment_id`` or ``pid`` — | ||
| there's nothing to track. ``diagnostic`` is present only on the | ||
| failure / timeout branches (the validated-success branch omits | ||
| it since there's nothing to diagnose). |
There was a problem hiding this comment.
[SUGGESTION] The docstring + server.py:221 Field description both advertise validated: bool and argv: list[str] as if always present in the return shape, but the pre-flight ok=False branches don't include them:
yaml_not_found(line 735): novalidated, noargv, noexit_code/stdout_tail/stderr_taillauncher_dir_not_found(line 773): same omissionsdry_run_timeout(line 813): hasargv/exit_code/stdout_tail/stderr_tailbut novalidated
A defensively-coded caller using .get("validated") won't break, but an agent reading the schema description from the MCP tool definition will reasonably assume validated is always there and may write result["validated"]. Lowest-effort fix: mark them conditional in both doc sites (validated?: bool, argv?: list[str]) and add a one-liner noting that pre-flight failures (ok=False, reason ∈ {yaml_not_found, launcher_dir_not_found}) carry reason + diagnostic only. Higher-effort fix (CodeRabbit's suggestion): populate validated=False, argv=[] etc. on all error branches so the contract is uniform — slightly heavier, but eliminates the question entirely.
Non-blocking; just helpful for downstream agents.
There was a problem hiding this comment.
Claude review summary (round 3 — post 66a6b56)
Findings: 0 CRITICAL, 0 IMPORTANT, 1 SUGGESTION
The round-2 fixes landed cleanly:
- Flag spelling —
--dryrun(one word) now consistent acrossbridge.pyargv,server.pyField description, README, and the unit test assertion. --yesflag — added; would have silently hung on nemo_run's confirmation prompt undercapture_output=True(no TTY) until the 60s timeout otherwise.diagnosticvsdiagnostics— reconciled todiagnostic(singular) acrossserver.py:221Field description,bridge.py:724docstring, and README. Matches the convention used by every other failure-return in the module.
The one remaining finding is a SUGGESTION about minor return-shape drift — pre-flight error branches (yaml_not_found, launcher_dir_not_found) omit validated and argv even though both are advertised in the schema description. Non-blocking; defensive callers using .get("validated") are unaffected.
Risk assessment
The live-submission path is byte-for-byte unchanged for dry_run=False. The new _submit_job_dry_run helper is hermetically tested at the bridge layer and structurally mirrors the live-submit branch (same path resolution, same argv shape minus the new --dryrun flag, same env propagation, same subprocess invocation). Bandit nosec B603 is justified inline with cited precedent across this module. LGTM.
|
Summary
submit_job(yaml_path, ..., dry_run=True)validates a launcher YAML vialaunch.py --dry-run— exercises the YAML loader, factory resolution, and arg parser, with no cluster contact / no container spawn / no sbatch.Used by verify-task workflow stages (
pensieve-intern/workflows/spec_decoding_release/deployment_support.md,hidden_state_dump_support.md,training_support.md,synth_support.md;nemotron_quant_release/mlm_eval.md,mlm_ptq.md,mlm_qad.md) that today shell out touv run launch.py --yaml X --dry-runbecause the MCP catalog has no equivalent.Why
Without this, OMNIML-5144 (SPEC migration) can't finish — the verify-task SPECs all stay on shell prose for their YAML-validation step. Slice 1 of 5144 (specdec_bench's
cell.md, pensieve-intern MR !100) didn't needdry_runbecause cell.md submits real cluster jobs, not validates configs. The other workflows do.Tool surface
dry_runbool = Falsehf_local/cluster_hoststr?dry_run=True(pass one to validate executor-specific config, omit both for shape-only validation). Still mutually exclusive in live submission.skip_verifybool = Falsedry_run=True(verify_setup is meaningless without cluster contact).Returns
{ok, dry_run: True, validated: bool, exit_code, stdout_tail, stderr_tail, argv}instead ofexperiment_id.Note on the ok/validated split: when the launcher rejects the YAML (exit code non-zero), the tool still returns
ok: Truebecause the TOOL ran cleanly — onlyvalidated: False. Same shapeverify_setupalready uses:ok: Trueregardless of probe outcome; the probe outcome is the field the caller reads.Implementation
Clean fork at the top of
submit_job_impl— the dry_run branch lives in a dedicated_submit_job_dry_runhelper to keep the live-submission path uncluttered. ~120 lines added, no changes to existing behavior.Tests
4 new hermetic tests in
tests/test_bridge.py(total: 38 passing):test_submit_job_dry_run_yaml_validates— happy path: ok+validated,--dry-runin argv, no--yestest_submit_job_dry_run_yaml_invalid— launcher rejects YAML →ok=True, validated=False,stderr_tailsurfaces the errortest_submit_job_dry_run_yaml_not_found— yaml_path missing →yaml_not_foundcarryingdry_run: Truefor caller introspectiontest_submit_job_dry_run_skips_verify— bypassesverify_setupeven withskip_verify=FalseScope policy
dry_runis universal launcher environment tooling, thetools/mcp/SCOPE.mdtest ("would this tool exist whether or not workflow X existed?") passes trivially — every workflow that validates a YAML wants this. No workload-specific knobs added.Anchors
Validation
uv run pytest tests/)dry_run=False(default) is byte-for-byte equivalent to today's behaviorSummary by CodeRabbit
New Features
dry_runoption to thesubmit_jobtool to validate launcher YAML and referenced artifacts without contacting the cluster, spawning containers, or runningsbatch.exit_code, stdout/stderr tails, and the launcher argv.Documentation
submit_jobtool docs to describe the newdry_runbehavior and the dry-run return shape.Tests