
update slurm test for nano v3 #1389

Open

wedu-nvidia wants to merge 10 commits into main from wedu/update-slurm-test

Conversation

Collaborator

@wedu-nvidia wedu-nvidia commented Apr 21, 2026

Summary by CodeRabbit

  • Tests
    • Added an end-to-end SLURM evaluation workflow for Nano-30B covering no-tools, with-tools, formal-math, and agentic modes.
    • Added a results validation CLI that aggregates metrics, normalizes percent/fraction values, and enforces metric bounds.
    • Added checks for tool usage patterns and detection of timeout regressions across evaluation outputs.

@wedu-nvidia wedu-nvidia marked this pull request as draft April 21, 2026 19:12
@wedu-nvidia wedu-nvidia changed the title from Wedu/update slurm test to update slurm test for nano v3 on Apr 21, 2026
Contributor

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Adds two new CLI Python scripts: run_test.py to orchestrate SLURM evaluation jobs (asset download + four benchmark groups) and check_results.py to validate evaluation outputs by reading eval-results/*/metrics.json and output-rs*.jsonl, enforcing metric ranges, tool-usage checks, and timeout regression detection. (46 words)

Changes

SLURM Evaluation Orchestration: tests/slurm-tests/nano_30b_eval/run_test.py
New CLI that prepares assets, schedules an asset-download SLURM job, and submits chained evaluation experiments (no-tools, with-tools, formal-math, agentic) via nemo_skills.pipeline.cli.eval, wiring workspace paths, server args, and dependent job relationships.

Results Validation: tests/slurm-tests/nano_30b_eval/check_results.py
New CLI that loads per-benchmark eval-results/<bench>/metrics.json (with resolution for arena-hard-v2-*), normalizes percentages, asserts metric bounds, parses output-rs*.jsonl for tool usage counts and conversation/tool-message checks, and scans tool messages for configured timeout indicators; reports and accumulates soft assertions, then calls assert_all().
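
To make the described validation flow concrete, here is a minimal sketch of the soft-assert pattern, assuming simple implementations for soft_assert, assert_all, and normalize_percent; it is illustrative only and does not reproduce the PR's actual check_results.py.

```python
# Illustrative sketch, not the PR's check_results.py: accumulate soft-assert
# failures and report them together instead of stopping at the first bad metric.
_failures: list[str] = []

def soft_assert(condition: bool, message: str) -> None:
    # record the failure but keep checking the remaining metrics
    if not condition:
        _failures.append(message)

def assert_all() -> None:
    # raise once at the end, listing every accumulated failure
    assert not _failures, "Failed checks:\n" + "\n".join(_failures)

def normalize_percent(value: float) -> float:
    # treat fractions (0..1) and percentages (0..100) uniformly as percentages
    return value * 100.0 if value <= 1.0 else value

# example bound check in the style of the metric tables used by this test
value = normalize_percent(0.915)  # hypothetical aime25 score
soft_assert(88.0 <= value <= 94.0, f"aime25 symbolic_correct out of range: {value}")
assert_all()
```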

Sequence Diagram(s)

sequenceDiagram
    participant User as User (CLI)
    participant Run as run_test.py
    participant Scheduler as SLURM Scheduler
    participant Eval as nemo_skills.eval Jobs
    participant Storage as workspace / eval-results
    participant Check as check_results.py

    User->>Run: launch with --workspace --cluster --expname_prefix
    Run->>Scheduler: submit asset-download job
    Scheduler-->>Run: asset job scheduled
    Run->>Scheduler: submit eval jobs (no-tools, with-tools, formal-math, agentic) [depend on asset job]
    Scheduler->>Eval: run evaluation jobs
    Eval-->>Storage: write eval-results/<benchmark>/{metrics.json,output-rs*.jsonl}
    Scheduler->>Check: schedule/trigger check_results.py after eval jobs complete
    Check->>Storage: read metrics.json and output-rs*.jsonl
    Check-->>User: print per-metric results and pass/fail via assertions

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Kipok
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
Title Check (✅ Passed): The title 'update slurm test for nano v3' directly describes the main change: adding a new SLURM test suite for the Nano v3 model with benchmark evaluations.
Linked Issues Check (✅ Passed): Check skipped because no linked issues were found for this pull request.
Out of Scope Changes Check (✅ Passed): Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/slurm-tests/nano_30b_eval/check_results.py (1)

102-103: Use direct key access for expected output schema fields.

For expected keys (field, num_tool_calls, conversation, role, content), .get() defaults can mask schema breaks and weaken checker signal quality.

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

Also applies to: 131-134, 163-166
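
A tiny illustration of the guideline, using a hypothetical record shaped like the checker's expected output schema (not taken from the PR's data):

```python
# Hypothetical sample record in the expected shape.
sample = {"num_tool_calls": 2, "conversation": [{"role": "assistant", "content": "..."}]}

# .get() silently yields None when an expected field is missing,
# which can mask a schema break downstream.
weak = sample.get("num_tool_calls")

# Direct indexing fails immediately with a clear KeyError at the broken field.
strict = sample["num_tool_calls"]
first_message_role = sample["conversation"][0]["role"]
```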

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 102 - 103,
Replace uses of dict.get(...) with direct indexing for expected schema fields so
missing keys raise errors instead of silently returning None: update occurrences
where metrics[agg_key].get(field) (and similar .get(...) usages for keys
'field', 'num_tool_calls', 'conversation', 'role', 'content' found elsewhere in
the file) are used and change them to direct access like metrics[agg_key][field]
(and data['num_tool_calls'], data['conversation'], data['role'],
data['content']) to enforce the expected schema; keep existing variable names
(metrics, agg_key, field, field_label) and behavior otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 72-76: The code currently dereferences dict keys immediately
(e.g., in load_metrics_block and other places that access nested keys) which can
raise KeyError/TypeError before soft_assert/aggregate checks run; update each
place that does direct indexing (notably load_metrics_block, and the locations
that access nested keys at the other reported spots) to first soft_assert the
existence of the top-level and nested keys (using soft_assert(benchmark in data,
...), then soft_assert('subkey' in data[benchmark], ...) as needed) and only
after assertions pass perform the actual access, or use data.get('key') and
check for None via soft_assert before using the value; ensure you reference and
keep using soft_assert and assert_all for aggregated failures rather than
letting direct indexing raise exceptions.

In `@tests/slurm-tests/nano_30b_eval/run_test.py`:
- Line 19: The import shadows the Python builtin eval; change the import from
nemo_skills.pipeline.cli to alias eval (e.g., import eval as run_eval) and then
update every callsite that currently calls eval to use run_eval instead (there
are 16 callsites to change). Keep the other imported symbols (prepare_data,
run_cmd, wrap_arguments) unchanged and ensure all references to the original
eval symbol in functions and tests are replaced with run_eval to avoid builtin
shadowing and satisfy Ruff A004.
- Around line 547-559: The declared type for run_after in
nemo_skills/pipeline/eval.py (the eval()/run_exp entry) is List[str] but callers
pass a string; update the signature to accept str | list[str] | None (or
Union[str, List[str], None]) and/or normalize the value immediately in
eval()/run_exp by coercing a single str into [str] (and leaving None as-is)
before passing to add_task()/run_exp; reference the eval/run_exp parameter name
run_after and the downstream add_task utility in
nemo_skills/pipeline/utils/exp.py to ensure consistent handling.
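
For the first inline comment (run_test.py line 19), a minimal sketch of the suggested aliasing; the callsite arguments below are placeholders, not the script's real invocation:

```python
# Alias the pipeline entry point so it no longer shadows the builtin eval (Ruff A004).
from nemo_skills.pipeline.cli import eval as run_eval
from nemo_skills.pipeline.cli import prepare_data, run_cmd, wrap_arguments  # unchanged imports

def launch_one_benchmark(cluster: str, expname: str) -> None:
    # hypothetical callsite; the real run_test.py passes many more arguments
    run_eval(
        ctx=wrap_arguments(""),
        cluster=cluster,
        expname=expname,
        benchmarks="aime25",
        server_type="vllm",
    )
```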

---

Nitpick comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 102-103: Replace uses of dict.get(...) with direct indexing for
expected schema fields so missing keys raise errors instead of silently
returning None: update occurrences where metrics[agg_key].get(field) (and
similar .get(...) usages for keys 'field', 'num_tool_calls', 'conversation',
'role', 'content' found elsewhere in the file) are used and change them to
direct access like metrics[agg_key][field] (and data['num_tool_calls'],
data['conversation'], data['role'], data['content']) to enforce the expected
schema; keep existing variable names (metrics, agg_key, field, field_label) and
behavior otherwise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 72ba7fda-672b-4de8-b4c0-884bc01efc88

📥 Commits

Reviewing files that changed from the base of the PR and between 94478e5 and dd25eeb.

📒 Files selected for processing (2)
  • tests/slurm-tests/nano_30b_eval/check_results.py
  • tests/slurm-tests/nano_30b_eval/run_test.py

Comment thread tests/slurm-tests/nano_30b_eval/check_results.py
Comment thread tests/slurm-tests/nano_30b_eval/run_test.py Outdated
Comment on lines +547 to +559
setup_expname = setup(workspace=args.workspace, cluster=args.cluster, expname_prefix=args.expname_prefix)

no_tools_expnames = eval_no_tools(
    workspace=args.workspace,
    cluster=args.cluster,
    expname_prefix=args.expname_prefix,
    wandb_project=args.wandb_project,
    partition=args.partition,
    num_jobs=args.num_jobs,
    server_gpus=args.server_gpus,
    server_container=args.server_container,
    run_after=setup_expname,
)
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify how `run_after` is handled inside eval implementation.
rg -n -C3 '\brun_after\b' nemo_skills/pipeline/eval.py
ast-grep --pattern 'run_after: List[str] = typer.Option($$$)' nemo_skills/pipeline/eval.py

Repository: NVIDIA-NeMo/Skills

Length of output: 2862


🏁 Script executed:

# Check the setup function return type in the test file
head -30 tests/slurm-tests/nano_30b_eval/run_test.py | grep -A 20 "def setup"

# Look for run_exp function signature and how it handles run_after
rg -n "def run_exp" nemo_skills/pipeline/utils/exp.py | head -1

# Check how run_after is used in run_exp
rg -n -A 10 "run_after" nemo_skills/pipeline/utils/exp.py | head -40

# Check if there's any type conversion/normalization for run_after
rg -n "isinstance.*run_after\|list.*run_after\|if.*run_after" nemo_skills/pipeline/

Repository: NVIDIA-NeMo/Skills

Length of output: 1973


🏁 Script executed:

# Check the setup function return type in the test file
rg -n "def setup" tests/slurm-tests/nano_30b_eval/run_test.py -A 10

# Check eval_no_tools function signature - does it pass run_after to eval?
rg -n "def eval_no_tools" tests/slurm-tests/nano_30b_eval/run_test.py -A 30

# Verify what run_after parameter receives in eval.py and how it's forwarded
rg -n "run_after=" nemo_skills/pipeline/eval.py -B 2 -A 2 | head -50

Repository: NVIDIA-NeMo/Skills

Length of output: 3116


Update run_after type hint in nemo_skills/pipeline/eval.py or normalize it at entry.

The test passes setup_expname (a string) to eval() as run_after, but eval.py line 238 declares run_after: List[str]. While this works because downstream functions like add_task() normalize strings to lists (lines 496–498 in nemo_skills/pipeline/utils/exp.py), the type mismatch violates the declared interface.

Either update the type hint to run_after: str | list[str] | None to match the actual accepted types, or normalize at the entry point before passing to run_exp().
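
A minimal sketch of the second option (normalizing at the entry point); the helper name is hypothetical:

```python
# Hypothetical helper: coerce a single experiment name into a list so both
# str and list[str] callers satisfy the declared interface.
def normalize_run_after(run_after: str | list[str] | None) -> list[str] | None:
    if run_after is None or isinstance(run_after, list):
        return run_after
    return [run_after]

assert normalize_run_after("setup-expname") == ["setup-expname"]
assert normalize_run_after(["a", "b"]) == ["a", "b"]
assert normalize_run_after(None) is None
```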

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/run_test.py` around lines 547 - 559, The
declared type for run_after in nemo_skills/pipeline/eval.py (the eval()/run_exp
entry) is List[str] but callers pass a string; update the signature to accept
str | list[str] | None (or Union[str, List[str], None]) and/or normalize the
value immediately in eval()/run_exp by coercing a single str into [str] (and
leaving None as-is) before passing to add_task()/run_exp; reference the
eval/run_exp parameter name run_after and the downstream add_task utility in
nemo_skills/pipeline/utils/exp.py to ensure consistent handling.

@wedu-nvidia wedu-nvidia marked this pull request as ready for review April 22, 2026 03:08
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
tests/slurm-tests/nano_30b_eval/check_results.py (1)

72-75: ⚠️ Potential issue | 🟠 Major

Guard soft-asserted keys before indexing to keep aggregated failure reporting intact.

At Line 75 and Line 221–222, direct indexing can still raise KeyError before assert_all() runs, which breaks the intended soft-assert flow.

Proposed fix
 def load_metrics_block(metrics_path: Path, benchmark: str):
     data = load_json(metrics_path)
-    soft_assert(benchmark in data, f"Missing benchmark {benchmark} in {metrics_path}")
-    return data[benchmark]
+    if benchmark not in data:
+        soft_assert(False, f"Missing benchmark {benchmark} in {metrics_path}")
+        return None
+    return data[benchmark]
@@
 def check_metric_group(
     eval_dir: Path, metric_config: dict[str, tuple[str, str | tuple[str, ...], tuple[float, float]]]
 ):
     for benchmark, (agg_key, field, (lo, hi)) in metric_config.items():
         metrics_path, metrics, benchmark_label = resolve_metrics_entry(eval_dir, benchmark)
+        if metrics is None:
+            continue
         soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
         if agg_key not in metrics:
             continue
@@
 def check_formal_math(eval_dir: Path):
     for label, (benchmark, agg_key, field, (lo, hi)) in FORMAL_MATH_METRICS.items():
         metrics_path = eval_dir / "eval-results" / benchmark / "metrics.json"
         metrics = load_metrics_block(metrics_path, benchmark)
-        soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
-        soft_assert(field in metrics[agg_key], f"Missing field {field} in {metrics_path}")
+        if metrics is None:
+            continue
+        if agg_key not in metrics:
+            soft_assert(False, f"Missing aggregation key {agg_key} in {metrics_path}")
+            continue
+        if field not in metrics[agg_key]:
+            soft_assert(False, f"Missing field {field} in {metrics_path}")
+            continue
         value = normalize_percent(float(metrics[agg_key][field]))

Also applies to: 216-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 72 - 75, The
code currently indexes into metric dictionaries directly which can raise
KeyError and bypass the soft-assert aggregation; in functions like
load_metrics_block and the code that accesses benchmarks (the blocks referencing
data[benchmark] and later data["summaries"]/data["metrics"] access), first call
soft_assert(key in dict, ...) for each required key (e.g., soft_assert(benchmark
in data, ...); soft_assert("summaries" in data, ...); soft_assert("metrics" in
summary, ...)) and only then perform the dict indexing, or use dict.get(...)
after the soft_assert to avoid a KeyError before assert_all() is called. Ensure
you preserve existing error messages and continue to call assert_all() at the
end of the test flow.
🧹 Nitpick comments (1)
tests/slurm-tests/nano_30b_eval/run_test.py (1)

67-70: Use iterable unpacking for cleaner list composition.

This matches Ruff RUF005 and keeps the block simpler.

Proposed fix
     if enable_tools:
-        parts = [
-            "--enable-auto-tool-choice",
-            "--tool-call-parser qwen3_coder",
-        ] + parts
+        parts = [
+            "--enable-auto-tool-choice",
+            "--tool-call-parser qwen3_coder",
+            *parts,
+        ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/run_test.py` around lines 67 - 70, Replace
the list concatenation that prepends flags to the existing variable parts with
iterable unpacking to simplify composition: update the assignment that currently
builds parts by doing ["--enable-auto-tool-choice", "--tool-call-parser
qwen3_coder",] + parts so it uses iterable unpacking with the existing parts
variable (refer to the parts assignment in run_test.py) to produce the same list
more cleanly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 158-160: The division samples_with_tools / total_samples can still
raise ZeroDivisionError even though soft_assert was called; update the logic
around the tool_fraction calculation in check_results.py to guard against
total_samples == 0 (e.g., check total_samples before dividing, set tool_fraction
to 0 or skip downstream processing/return early) and ensure any downstream uses
of tool_fraction handle the fallback; modify the block that contains
soft_assert, total_samples, samples_with_tools and the tool_fraction assignment
to perform the safe check and avoid the division when total_samples is zero.
- Around line 178-180: The loop over bench_dir.glob("output-rs*.jsonl") in
check_timeouts currently treats zero matched files as zero timeouts; update
check_timeouts to first collect matches (e.g.,
list(sorted(bench_dir.glob("output-rs*.jsonl")))) and assert that the list is
non-empty for each benchmark, raising or failing the test with a clear message
referencing the benchmark (bench_dir) when no output-rs*.jsonl files are found
so missing eval outputs cannot be silently ignored.

---

Duplicate comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 72-75: The code currently indexes into metric dictionaries
directly which can raise KeyError and bypass the soft-assert aggregation; in
functions like load_metrics_block and the code that accesses benchmarks (the
blocks referencing data[benchmark] and later data["summaries"]/data["metrics"]
access), first call soft_assert(key in dict, ...) for each required key (e.g.,
soft_assert(benchmark in data, ...); soft_assert("summaries" in data, ...);
soft_assert("metrics" in summary, ...)) and only then perform the dict indexing,
or use dict.get(...) after the soft_assert to avoid a KeyError before
assert_all() is called. Ensure you preserve existing error messages and continue
to call assert_all() at the end of the test flow.

---

Nitpick comments:
In `@tests/slurm-tests/nano_30b_eval/run_test.py`:
- Around line 67-70: Replace the list concatenation that prepends flags to the
existing variable parts with iterable unpacking to simplify composition: update
the assignment that currently builds parts by doing
["--enable-auto-tool-choice", "--tool-call-parser qwen3_coder",] + parts so it
uses iterable unpacking with the existing parts variable (refer to the parts
assignment in run_test.py) to produce the same list more cleanly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 47a70dac-c809-4584-924e-7061c8a04833

📥 Commits

Reviewing files that changed from the base of the PR and between dd25eeb and 2cdab17.

📒 Files selected for processing (2)
  • tests/slurm-tests/nano_30b_eval/check_results.py
  • tests/slurm-tests/nano_30b_eval/run_test.py

Comment on lines +158 to +160
    soft_assert(total_samples > 0, "No samples found in with_tools outputs")
    tool_fraction = samples_with_tools / total_samples
    print(
Contributor

⚠️ Potential issue | 🟠 Major

Prevent division by zero when tool outputs are missing.

At Line 159, samples_with_tools / total_samples can raise ZeroDivisionError even after soft_assert fails.

Proposed fix
     soft_assert(total_samples > 0, "No samples found in with_tools outputs")
+    if total_samples == 0:
+        return
     tool_fraction = samples_with_tools / total_samples
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 158 - 160, The
division samples_with_tools / total_samples can still raise ZeroDivisionError
even though soft_assert was called; update the logic around the tool_fraction
calculation in check_results.py to guard against total_samples == 0 (e.g., check
total_samples before dividing, set tool_fraction to 0 or skip downstream
processing/return early) and ensure any downstream uses of tool_fraction handle
the fallback; modify the block that contains soft_assert, total_samples,
samples_with_tools and the tool_fraction assignment to perform the safe check
and avoid the division when total_samples is zero.

Comment on lines +178 to +180
        for output_path in sorted(bench_dir.glob("output-rs*.jsonl")):
            file_timeouts = 0
            with output_path.open("rt", encoding="utf-8") as fin:
Contributor

⚠️ Potential issue | 🟠 Major

Assert timeout input files exist per benchmark.

check_timeouts currently allows a benchmark with no output-rs*.jsonl files to pass as 0 timeouts, which can mask missing eval outputs.

Proposed fix
     for benchmark in TOOL_BENCHMARKS:
         bench_dir = eval_dir / "eval-results" / benchmark
         bench_timeouts = 0
+        output_files = sorted(bench_dir.glob("output-rs*.jsonl"))
+        soft_assert(len(output_files) > 0, f"No output files found in {bench_dir}")
+        if not output_files:
+            continue
 
-        for output_path in sorted(bench_dir.glob("output-rs*.jsonl")):
+        for output_path in output_files:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 178 - 180, The
loop over bench_dir.glob("output-rs*.jsonl") in check_timeouts currently treats
zero matched files as zero timeouts; update check_timeouts to first collect
matches (e.g., list(sorted(bench_dir.glob("output-rs*.jsonl")))) and assert that
the list is non-empty for each benchmark, raising or failing the test with a
clear message referencing the benchmark (bench_dir) when no output-rs*.jsonl
files are found so missing eval outputs cannot be silently ignored.

@wedu-nvidia
Collaborator Author

wedu-nvidia commented Apr 22, 2026

@Kipok @gwarmstrong
I created a SLURM test suite for Nano v3 and included most of the benchmarks listed on the Hugging Face model page: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

For a few tasks, such as Terminal Bench and TauBench V2, I could not find corresponding support in NeMo-Skills. I also skipped RULER for now because it requires downloading and preparing the data on the cluster first, and AA-LCR seems to be the higher-priority long-context benchmark at the moment.

Regarding the ranges, I am not sure they are reasonable for all of the benchmarks listed above. Feel free to leave any comments.

Signed-off-by: Wei Du <wedu@nvidia.com>
@wedu-nvidia wedu-nvidia force-pushed the wedu/update-slurm-test branch from 2cdab17 to 6bdefa5 on April 23, 2026 00:07
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (3)
tests/slurm-tests/nano_30b_eval/check_results.py (3)

158-160: ⚠️ Potential issue | 🟠 Major

Avoid divide-by-zero in tool-usage summary.

Line 159 divides even when total_samples == 0; soft_assert alone does not stop execution.

Suggested fix
     soft_assert(total_samples > 0, "No samples found in with_tools outputs")
+    if total_samples == 0:
+        return
     tool_fraction = samples_with_tools / total_samples
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 158 - 160, The
code computes tool_fraction = samples_with_tools / total_samples after calling
soft_assert(total_samples > 0, ...), but soft_assert does not abort execution so
a divide-by-zero can occur; update the logic around soft_assert/total_samples
(referencing soft_assert, total_samples, samples_with_tools, tool_fraction, and
the subsequent print) to explicitly guard against total_samples == 0 — either
return/exit when total_samples is zero or set tool_fraction to a safe default
(e.g., 0) and adjust the print/output to reflect no samples instead of
performing the division.

72-76: ⚠️ Potential issue | 🟠 Major

Guard missing metric keys before dereferencing to keep aggregated failures working.

Line 75 and Line 222 can still raise before assert_all() when required keys are absent, which breaks soft-assert aggregation.

Suggested fix
 def load_metrics_block(metrics_path: Path, benchmark: str):
     data = load_json(metrics_path)
-    soft_assert(benchmark in data, f"Missing benchmark {benchmark} in {metrics_path}")
-    return data[benchmark]
+    if benchmark not in data:
+        soft_assert(False, f"Missing benchmark {benchmark} in {metrics_path}")
+        return None
+    return data[benchmark]
@@
     for benchmark, (agg_key, field, (lo, hi)) in metric_config.items():
         metrics_path, metrics, benchmark_label = resolve_metrics_entry(eval_dir, benchmark)
+        if metrics is None:
+            continue
         soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
@@
 def check_formal_math(eval_dir: Path):
     for label, (benchmark, agg_key, field, (lo, hi)) in FORMAL_MATH_METRICS.items():
         metrics_path = eval_dir / "eval-results" / benchmark / "metrics.json"
         metrics = load_metrics_block(metrics_path, benchmark)
-        soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
-        soft_assert(field in metrics[agg_key], f"Missing field {field} in {metrics_path}")
+        if metrics is None:
+            continue
+        soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
+        if agg_key not in metrics:
+            continue
+        soft_assert(field in metrics[agg_key], f"Missing field {field} in {metrics_path}")
+        if field not in metrics[agg_key]:
+            continue
         value = normalize_percent(float(metrics[agg_key][field]))

Also applies to: 96-100, 216-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 72 - 76, The
code in load_metrics_block (and similar places that directly index into loaded
metric dicts) dereferences expected keys and can raise KeyError before
soft-assert aggregation runs; update load_metrics_block and the other
metric-access sites (the blocks around the code that reads benchmark keys and
specific metric keys at ~lines 96-100 and 216-223) to check for key existence
using soft_assert/soft_assert_present (or call soft_assert with a message)
before accessing data[benchmark] or data[benchmark][<metric>] so missing keys
produce soft assertion records rather than exceptions, and return a safe default
or None when keys are absent to allow assert_all() to aggregate failures.

174-180: ⚠️ Potential issue | 🟠 Major

Fail when tool benchmark outputs are missing, not as “0 timeouts”.

If no output-rs*.jsonl files exist for a benchmark, current logic treats it as zero timeouts and can mask missing eval artifacts.

Suggested fix
     for benchmark in TOOL_BENCHMARKS:
         bench_dir = eval_dir / "eval-results" / benchmark
         bench_timeouts = 0
+        output_files = sorted(bench_dir.glob("output-rs*.jsonl"))
+        soft_assert(len(output_files) > 0, f"No output files found in {bench_dir}")
+        if not output_files:
+            continue
 
-        for output_path in sorted(bench_dir.glob("output-rs*.jsonl")):
+        for output_path in output_files:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 174 - 180, The
loop over TOOL_BENCHMARKS currently assumes presence of output-rs*.jsonl files
and treats absence as "0 timeouts"; change the logic to detect when
sorted(bench_dir.glob("output-rs*.jsonl")) yields no files and treat that as a
missing-artifact error rather than zero timeouts. Specifically, inside the scan
of each benchmark (using bench_dir and output_path), if no output files are
found set a missing flag or raise/log an error for that benchmark (increment a
missing-artifact counter or fail the run) instead of leaving bench_timeouts at
0, and ensure downstream aggregation uses that missing indicator. Ensure
references to TOOL_BENCHMARKS, bench_dir, bench_timeouts and file_timeouts are
updated so callers can distinguish "no outputs" from "0 timeouts."
🧹 Nitpick comments (1)
tests/slurm-tests/nano_30b_eval/run_test.py (1)

100-355: Refactor repeated run_eval blocks into a table-driven loop.

The no-tools/with-tools sections duplicate a large call pattern, which increases drift risk when changing shared args or benchmark options. A small benchmark spec table + loop would make this much safer to maintain.

As per coding guidelines: "Keep code simple and elegant; reuse/extend existing functionality when possible, minimize conditional checks, use self-explanatory code over comments, avoid complicated type interfaces with unions, and keep naming consistent with existing conventions."

Also applies to: 358-444
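
As a rough illustration of the table-driven shape suggested here; spec field names and run_eval arguments are taken loosely from this review and are not a drop-in replacement for the actual script:

```python
# Illustrative only: one spec entry per benchmark, one loop over run_eval.
from nemo_skills.pipeline.cli import eval as run_eval, wrap_arguments

NO_TOOLS_PARAMS = "++inference.tokens_to_generate=120000 "  # from the PR's constants

BENCHMARK_SPECS = [
    {"exp_suffix": "no-tools-aime25", "benchmarks": "aime25"},
    {"exp_suffix": "no-tools-gpqa", "benchmarks": "gpqa"},
    {"exp_suffix": "no-tools-mmlu-pro", "benchmarks": "mmlu-pro", "num_jobs": 2},
]

def eval_no_tools(cluster, expname_prefix, run_after, **shared):
    expnames = []
    for spec in BENCHMARK_SPECS:
        expname = f"{expname_prefix}-{spec['exp_suffix']}"
        run_eval(
            ctx=wrap_arguments(NO_TOOLS_PARAMS + spec.get("ctx_suffix", "")),
            cluster=cluster,
            expname=expname,
            benchmarks=spec["benchmarks"],
            num_jobs=spec.get("num_jobs", 1),
            run_after=run_after,
            **shared,  # model, server_type, server_args, output_dir, partition, ...
        )
        expnames.append(expname)
    return expnames
```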

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/run_test.py` around lines 100 - 355, The
eval_no_tools function has many repeated run_eval calls; refactor by creating a
table (list of dicts) of benchmark specs (each entry: key "exp_suffix",
"benchmarks", optional overrides like "ctx_suffix", "num_jobs", "split",
"num_chunks", "judge_model", "judge_server_type", "judge_server_gpus",
"judge_server_container", "extra_judge_args", "wandb_name_override") and loop
over it to call run_eval once per spec, composing expname from expname_prefix +
exp_suffix and building ctx via wrap_arguments(NO_TOOLS_PARAMS +
spec.get("ctx_suffix","")), while passing the shared parameters
(model=get_local_model_path(workspace), server_type="vllm",
server_gpus=server_gpus, server_args=server_args,
server_container=server_container, output_dir=output_dir, partition=partition,
run_after=run_after, wandb_project=wandb_project) and applying any per-spec
overrides (num_jobs, split, benchmarks, judge_* fields, extra_judge_args,
num_chunks, wandb_name) so behavior remains identical to the original run_eval
calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 158-160: The code computes tool_fraction = samples_with_tools /
total_samples after calling soft_assert(total_samples > 0, ...), but soft_assert
does not abort execution so a divide-by-zero can occur; update the logic around
soft_assert/total_samples (referencing soft_assert, total_samples,
samples_with_tools, tool_fraction, and the subsequent print) to explicitly guard
against total_samples == 0 — either return/exit when total_samples is zero or
set tool_fraction to a safe default (e.g., 0) and adjust the print/output to
reflect no samples instead of performing the division.
- Around line 72-76: The code in load_metrics_block (and similar places that
directly index into loaded metric dicts) dereferences expected keys and can
raise KeyError before soft-assert aggregation runs; update load_metrics_block
and the other metric-access sites (the blocks around the code that reads
benchmark keys and specific metric keys at ~lines 96-100 and 216-223) to check
for key existence using soft_assert/soft_assert_present (or call soft_assert
with a message) before accessing data[benchmark] or data[benchmark][<metric>] so
missing keys produce soft assertion records rather than exceptions, and return a
safe default or None when keys are absent to allow assert_all() to aggregate
failures.
- Around line 174-180: The loop over TOOL_BENCHMARKS currently assumes presence
of output-rs*.jsonl files and treats absence as "0 timeouts"; change the logic
to detect when sorted(bench_dir.glob("output-rs*.jsonl")) yields no files and
treat that as a missing-artifact error rather than zero timeouts. Specifically,
inside the scan of each benchmark (using bench_dir and output_path), if no
output files are found set a missing flag or raise/log an error for that
benchmark (increment a missing-artifact counter or fail the run) instead of
leaving bench_timeouts at 0, and ensure downstream aggregation uses that missing
indicator. Ensure references to TOOL_BENCHMARKS, bench_dir, bench_timeouts and
file_timeouts are updated so callers can distinguish "no outputs" from "0
timeouts."

---

Nitpick comments:
In `@tests/slurm-tests/nano_30b_eval/run_test.py`:
- Around line 100-355: The eval_no_tools function has many repeated run_eval
calls; refactor by creating a table (list of dicts) of benchmark specs (each
entry: key "exp_suffix", "benchmarks", optional overrides like "ctx_suffix",
"num_jobs", "split", "num_chunks", "judge_model", "judge_server_type",
"judge_server_gpus", "judge_server_container", "extra_judge_args",
"wandb_name_override") and loop over it to call run_eval once per spec,
composing expname from expname_prefix + exp_suffix and building ctx via
wrap_arguments(NO_TOOLS_PARAMS + spec.get("ctx_suffix","")), while passing the
shared parameters (model=get_local_model_path(workspace), server_type="vllm",
server_gpus=server_gpus, server_args=server_args,
server_container=server_container, output_dir=output_dir, partition=partition,
run_after=run_after, wandb_project=wandb_project) and applying any per-spec
overrides (num_jobs, split, benchmarks, judge_* fields, extra_judge_args,
num_chunks, wandb_name) so behavior remains identical to the original run_eval
calls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 707ba520-00be-45e5-9b84-edf9d718b620

📥 Commits

Reviewing files that changed from the base of the PR and between 2cdab17 and 6bdefa5.

📒 Files selected for processing (2)
  • tests/slurm-tests/nano_30b_eval/check_results.py
  • tests/slurm-tests/nano_30b_eval/run_test.py

NO_TOOLS_METRICS = {
    "aime25": ("pass@1[avg-of-4]", "symbolic_correct", (88.0, 94.0)),
    "gpqa": ("pass@1[avg-of-4]", "symbolic_correct", (69.0, 76.0)),
    "mmlu-pro": ("pass@1", "symbolic_correct", (74.0, 82.0)),
Collaborator

for large datasets like this, I think we can use max_samples=256 or something like this to make it faster
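
A possible shape for that suggestion; the ++max_samples override name and the value are assumptions used only to show where the cap would be passed, not verified against nemo_skills:

```python
from nemo_skills.pipeline.cli import wrap_arguments

# Hypothetical: cap evaluated samples for a large benchmark such as mmlu-pro.
NO_TOOLS_PARAMS = "++inference.tokens_to_generate=120000 "
mmlu_pro_ctx = wrap_arguments(NO_TOOLS_PARAMS + "++max_samples=256")
```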

JUDGE_MODEL = "openai/gpt-oss-120b"
JUDGE_MODEL_DIRNAME = JUDGE_MODEL.split("/")[-1]
REASONING_PARSER_FILENAME = "nano_v3_reasoning_parser.py"
DEFAULT_SERVER_CONTAINER = (
Collaborator

let's not have references to internal infra, this can be specified in your cluster config and by default we just use what's in there

)

NO_TOOLS_PARAMS = (
"++inference.tokens_to_generate=120000 "
Collaborator

instead of setting tokens to generate, let's set max-model-len in server args, going to be more efficient
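
A sketch of what that could look like in the server-args helper; the helper's shape and the example limit are assumptions, while --max-model-len itself is a standard vLLM flag:

```python
# Hypothetical variant of build_server_args: configure the context limit once
# on the vLLM server instead of setting ++inference.tokens_to_generate per benchmark.
def build_server_args(enable_tools: bool = False) -> str:
    parts = ["--max-model-len 131072"]  # example value, not from the PR
    if enable_tools:
        # tool-calling flags quoted elsewhere in this review
        parts = ["--enable-auto-tool-choice", "--tool-call-parser qwen3_coder", *parts]
    return " ".join(parts)

print(build_server_args(enable_tools=True))
```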

)

FORMAL_MATH_PARAMS = (
"++inference.tokens_to_generate=38912 "
Collaborator

this one is an exception since it uses a lower tokens-to-generate budget, so here we can set it explicitly

    return f"{workspace}/{MODEL_DIRNAME}"


def setup(workspace, cluster, expname_prefix):
Collaborator

we shouldn't need this, you can just use model names and it will put them in hf cache

):
    output_dir = f"{workspace}/no_tools"
    server_args = build_server_args(
        parser_path=f"{workspace}/nano_v3_parser/{REASONING_PARSER_FILENAME}",
Collaborator

latest vllm accepts just nemotron3 for the parser, or something like this; maybe we use that for simplicity

    expnames = []

    expname = f"{expname_prefix}-no-tools-aime25"
    run_eval(
Collaborator

we should group more benchmarks together for efficiency. A single 8-node server can likely run more in parallel.
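
A rough sketch of the grouping idea; it assumes the benchmarks argument accepts a comma-separated list and reuses the keyword arguments quoted earlier in this PR, so treat it as illustrative only:

```python
from nemo_skills.pipeline.cli import eval as run_eval, wrap_arguments

NO_TOOLS_PARAMS = "++inference.tokens_to_generate=120000 "  # from the PR's constants

def eval_no_tools_grouped(workspace, cluster, expname_prefix, server_gpus,
                          server_args, server_container, partition, run_after,
                          wandb_project):
    # One server, several benchmarks: a single eval job instead of one per benchmark.
    expname = f"{expname_prefix}-no-tools-grouped"
    run_eval(
        ctx=wrap_arguments(NO_TOOLS_PARAMS),
        cluster=cluster,
        expname=expname,
        model=f"{workspace}/model",  # placeholder path
        server_type="vllm",
        server_gpus=server_gpus,
        server_args=server_args,
        server_container=server_container,
        benchmarks="aime25,gpqa,mmlu-pro",  # grouped instead of separate calls
        output_dir=f"{workspace}/no_tools",
        partition=partition,
        run_after=run_after,
        wandb_project=wandb_project,
    )
    return [expname]
```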
