
update slurm test for nano v3 #1389

Open

wedu-nvidia wants to merge 10 commits into main from wedu/update-slurm-test

Conversation

Collaborator

@wedu-nvidia wedu-nvidia commented Apr 21, 2026

Summary by CodeRabbit

  • Tests
    • Added an end-to-end SLURM evaluation workflow for Nano-30B covering no-tools, with-tools, formal-math, and agentic modes.
    • Added a results validation CLI that aggregates metrics, normalizes percent/fraction values, and enforces metric bounds.
    • Added checks for tool usage patterns and detection of timeout regressions across evaluation outputs.

@wedu-nvidia wedu-nvidia marked this pull request as draft April 21, 2026 19:12
@wedu-nvidia wedu-nvidia changed the title from Wedu/update slurm test to update slurm test for nano v3 on Apr 21, 2026
Contributor

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Adds two new CLI Python scripts: run_test.py to orchestrate SLURM evaluation jobs (asset download + four benchmark groups) and check_results.py to validate evaluation outputs by reading eval-results/*/metrics.json and output-rs*.jsonl, enforcing metric ranges, tool-usage checks, and timeout regression detection. (46 words)

Changes

SLURM Evaluation Orchestration: tests/slurm-tests/nano_30b_eval/run_test.py
New CLI that prepares assets, schedules an asset-download SLURM job, and submits chained evaluation experiments (no-tools, with-tools, formal-math, agentic) via nemo_skills.pipeline.cli.eval, wiring workspace paths, server args, and dependent job relationships.

Results Validation: tests/slurm-tests/nano_30b_eval/check_results.py
New CLI that loads per-benchmark eval-results/<bench>/metrics.json (with resolution for arena-hard-v2-*), normalizes percentages, asserts metric bounds, parses output-rs*.jsonl for tool usage counts and conversation/tool-message checks, and scans tool messages for configured timeout indicators; reports and accumulates soft assertions, then calls assert_all().
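
To make the described validation flow concrete, here is a minimal sketch of the soft-assert pattern, assuming simple implementations for soft_assert, assert_all, and normalize_percent; it is illustrative only and does not reproduce the PR's actual check_results.py.

```python
# Illustrative sketch, not the PR's check_results.py: accumulate soft-assert
# failures and report them together instead of stopping at the first bad metric.
_failures: list[str] = []

def soft_assert(condition: bool, message: str) -> None:
    # record the failure but keep checking the remaining metrics
    if not condition:
        _failures.append(message)

def assert_all() -> None:
    # raise once at the end, listing every accumulated failure
    assert not _failures, "Failed checks:\n" + "\n".join(_failures)

def normalize_percent(value: float) -> float:
    # treat fractions (0..1) and percentages (0..100) uniformly as percentages
    return value * 100.0 if value <= 1.0 else value

# example bound check in the style of the metric tables used by this test
value = normalize_percent(0.915)  # hypothetical aime25 score
soft_assert(88.0 <= value <= 94.0, f"aime25 symbolic_correct out of range: {value}")
assert_all()
```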

Sequence Diagram(s)

sequenceDiagram
    participant User as User (CLI)
    participant Run as run_test.py
    participant Scheduler as SLURM Scheduler
    participant Eval as nemo_skills.eval Jobs
    participant Storage as workspace / eval-results
    participant Check as check_results.py

    User->>Run: launch with --workspace --cluster --expname_prefix
    Run->>Scheduler: submit asset-download job
    Scheduler-->>Run: asset job scheduled
    Run->>Scheduler: submit eval jobs (no-tools, with-tools, formal-math, agentic) [depend on asset job]
    Scheduler->>Eval: run evaluation jobs
    Eval-->>Storage: write eval-results/<benchmark>/{metrics.json,output-rs*.jsonl}
    Scheduler->>Check: schedule/trigger check_results.py after eval jobs complete
    Check->>Storage: read metrics.json and output-rs*.jsonl
    Check-->>User: print per-metric results and pass/fail via assertions

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Kipok
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
Title Check (✅ Passed): The title 'update slurm test for nano v3' directly describes the main change: adding a new SLURM test suite for the Nano v3 model with benchmark evaluations.
Linked Issues Check (✅ Passed): Check skipped because no linked issues were found for this pull request.
Out of Scope Changes Check (✅ Passed): Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/slurm-tests/nano_30b_eval/check_results.py (1)

102-103: Use direct key access for expected output schema fields.

For expected keys (field, num_tool_calls, conversation, role, content), .get() defaults can mask schema breaks and weaken checker signal quality.

As per coding guidelines, "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

Also applies to: 131-134, 163-166
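
A tiny illustration of the guideline, using a hypothetical record shaped like the checker's expected output schema (not taken from the PR's data):

```python
# Hypothetical sample record in the expected shape.
sample = {"num_tool_calls": 2, "conversation": [{"role": "assistant", "content": "..."}]}

# .get() silently yields None when an expected field is missing,
# which can mask a schema break downstream.
weak = sample.get("num_tool_calls")

# Direct indexing fails immediately with a clear KeyError at the broken field.
strict = sample["num_tool_calls"]
first_message_role = sample["conversation"][0]["role"]
```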

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 102 - 103,
Replace uses of dict.get(...) with direct indexing for expected schema fields so
missing keys raise errors instead of silently returning None: update occurrences
where metrics[agg_key].get(field) (and similar .get(...) usages for keys
'field', 'num_tool_calls', 'conversation', 'role', 'content' found elsewhere in
the file) are used and change them to direct access like metrics[agg_key][field]
(and data['num_tool_calls'], data['conversation'], data['role'],
data['content']) to enforce the expected schema; keep existing variable names
(metrics, agg_key, field, field_label) and behavior otherwise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 72-76: The code currently dereferences dict keys immediately
(e.g., in load_metrics_block and other places that access nested keys) which can
raise KeyError/TypeError before soft_assert/aggregate checks run; update each
place that does direct indexing (notably load_metrics_block, and the locations
that access nested keys at the other reported spots) to first soft_assert the
existence of the top-level and nested keys (using soft_assert(benchmark in data,
...), then soft_assert('subkey' in data[benchmark], ...) as needed) and only
after assertions pass perform the actual access, or use data.get('key') and
check for None via soft_assert before using the value; ensure you reference and
keep using soft_assert and assert_all for aggregated failures rather than
letting direct indexing raise exceptions.

In `@tests/slurm-tests/nano_30b_eval/run_test.py`:
- Line 19: The import shadows the Python builtin eval; change the import from
nemo_skills.pipeline.cli to alias eval (e.g., import eval as run_eval) and then
update every callsite that currently calls eval to use run_eval instead (there
are 16 callsites to change). Keep the other imported symbols (prepare_data,
run_cmd, wrap_arguments) unchanged and ensure all references to the original
eval symbol in functions and tests are replaced with run_eval to avoid builtin
shadowing and satisfy Ruff A004.
- Around line 547-559: The declared type for run_after in
nemo_skills/pipeline/eval.py (the eval()/run_exp entry) is List[str] but callers
pass a string; update the signature to accept str | list[str] | None (or
Union[str, List[str], None]) and/or normalize the value immediately in
eval()/run_exp by coercing a single str into [str] (and leaving None as-is)
before passing to add_task()/run_exp; reference the eval/run_exp parameter name
run_after and the downstream add_task utility in
nemo_skills/pipeline/utils/exp.py to ensure consistent handling.
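
For the first inline comment (run_test.py line 19), a minimal sketch of the suggested aliasing; the callsite arguments below are placeholders, not the script's real invocation:

```python
# Alias the pipeline entry point so it no longer shadows the builtin eval (Ruff A004).
from nemo_skills.pipeline.cli import eval as run_eval
from nemo_skills.pipeline.cli import prepare_data, run_cmd, wrap_arguments  # unchanged imports

def launch_one_benchmark(cluster: str, expname: str) -> None:
    # hypothetical callsite; the real run_test.py passes many more arguments
    run_eval(
        ctx=wrap_arguments(""),
        cluster=cluster,
        expname=expname,
        benchmarks="aime25",
        server_type="vllm",
    )
```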

---

Nitpick comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 102-103: Replace uses of dict.get(...) with direct indexing for
expected schema fields so missing keys raise errors instead of silently
returning None: update occurrences where metrics[agg_key].get(field) (and
similar .get(...) usages for keys 'field', 'num_tool_calls', 'conversation',
'role', 'content' found elsewhere in the file) are used and change them to
direct access like metrics[agg_key][field] (and data['num_tool_calls'],
data['conversation'], data['role'], data['content']) to enforce the expected
schema; keep existing variable names (metrics, agg_key, field, field_label) and
behavior otherwise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 72ba7fda-672b-4de8-b4c0-884bc01efc88

📥 Commits

Reviewing files that changed from the base of the PR and between 94478e5 and dd25eeb.

📒 Files selected for processing (2)
  • tests/slurm-tests/nano_30b_eval/check_results.py
  • tests/slurm-tests/nano_30b_eval/run_test.py

Comment thread tests/slurm-tests/nano_30b_eval/check_results.py
Comment thread tests/slurm-tests/nano_30b_eval/run_test.py Outdated
Comment on lines +547 to +559
setup_expname = setup(workspace=args.workspace, cluster=args.cluster, expname_prefix=args.expname_prefix)

no_tools_expnames = eval_no_tools(
    workspace=args.workspace,
    cluster=args.cluster,
    expname_prefix=args.expname_prefix,
    wandb_project=args.wandb_project,
    partition=args.partition,
    num_jobs=args.num_jobs,
    server_gpus=args.server_gpus,
    server_container=args.server_container,
    run_after=setup_expname,
)
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify how `run_after` is handled inside eval implementation.
rg -n -C3 '\brun_after\b' nemo_skills/pipeline/eval.py
ast-grep --pattern 'run_after: List[str] = typer.Option($$$)' nemo_skills/pipeline/eval.py

Repository: NVIDIA-NeMo/Skills

Length of output: 2862


🏁 Script executed:

# Check the setup function return type in the test file
head -30 tests/slurm-tests/nano_30b_eval/run_test.py | grep -A 20 "def setup"

# Look for run_exp function signature and how it handles run_after
rg -n "def run_exp" nemo_skills/pipeline/utils/exp.py | head -1

# Check how run_after is used in run_exp
rg -n -A 10 "run_after" nemo_skills/pipeline/utils/exp.py | head -40

# Check if there's any type conversion/normalization for run_after
rg -n "isinstance.*run_after\|list.*run_after\|if.*run_after" nemo_skills/pipeline/

Repository: NVIDIA-NeMo/Skills

Length of output: 1973


🏁 Script executed:

# Check the setup function return type in the test file
rg -n "def setup" tests/slurm-tests/nano_30b_eval/run_test.py -A 10

# Check eval_no_tools function signature - does it pass run_after to eval?
rg -n "def eval_no_tools" tests/slurm-tests/nano_30b_eval/run_test.py -A 30

# Verify what run_after parameter receives in eval.py and how it's forwarded
rg -n "run_after=" nemo_skills/pipeline/eval.py -B 2 -A 2 | head -50

Repository: NVIDIA-NeMo/Skills

Length of output: 3116


Update run_after type hint in nemo_skills/pipeline/eval.py or normalize it at entry.

The test passes setup_expname (a string) to eval() as run_after, but eval.py line 238 declares run_after: List[str]. While this works because downstream functions like add_task() normalize strings to lists (lines 496–498 in nemo_skills/pipeline/utils/exp.py), the type mismatch violates the declared interface.

Either update the type hint to run_after: str | list[str] | None to match the actual accepted types, or normalize at the entry point before passing to run_exp().
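
A minimal sketch of the second option (normalizing at the entry point); the helper name is hypothetical:

```python
# Hypothetical helper: coerce a single experiment name into a list so both
# str and list[str] callers satisfy the declared interface.
def normalize_run_after(run_after: str | list[str] | None) -> list[str] | None:
    if run_after is None or isinstance(run_after, list):
        return run_after
    return [run_after]

assert normalize_run_after("setup-expname") == ["setup-expname"]
assert normalize_run_after(["a", "b"]) == ["a", "b"]
assert normalize_run_after(None) is None
```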

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/run_test.py` around lines 547 - 559, The
declared type for run_after in nemo_skills/pipeline/eval.py (the eval()/run_exp
entry) is List[str] but callers pass a string; update the signature to accept
str | list[str] | None (or Union[str, List[str], None]) and/or normalize the
value immediately in eval()/run_exp by coercing a single str into [str] (and
leaving None as-is) before passing to add_task()/run_exp; reference the
eval/run_exp parameter name run_after and the downstream add_task utility in
nemo_skills/pipeline/utils/exp.py to ensure consistent handling.

@wedu-nvidia wedu-nvidia marked this pull request as ready for review April 22, 2026 03:08
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
tests/slurm-tests/nano_30b_eval/check_results.py (1)

72-75: ⚠️ Potential issue | 🟠 Major

Guard soft-asserted keys before indexing to keep aggregated failure reporting intact.

At Line 75 and Line 221–222, direct indexing can still raise KeyError before assert_all() runs, which breaks the intended soft-assert flow.

Proposed fix
 def load_metrics_block(metrics_path: Path, benchmark: str):
     data = load_json(metrics_path)
-    soft_assert(benchmark in data, f"Missing benchmark {benchmark} in {metrics_path}")
-    return data[benchmark]
+    if benchmark not in data:
+        soft_assert(False, f"Missing benchmark {benchmark} in {metrics_path}")
+        return None
+    return data[benchmark]
@@
 def check_metric_group(
     eval_dir: Path, metric_config: dict[str, tuple[str, str | tuple[str, ...], tuple[float, float]]]
 ):
     for benchmark, (agg_key, field, (lo, hi)) in metric_config.items():
         metrics_path, metrics, benchmark_label = resolve_metrics_entry(eval_dir, benchmark)
+        if metrics is None:
+            continue
         soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
         if agg_key not in metrics:
             continue
@@
 def check_formal_math(eval_dir: Path):
     for label, (benchmark, agg_key, field, (lo, hi)) in FORMAL_MATH_METRICS.items():
         metrics_path = eval_dir / "eval-results" / benchmark / "metrics.json"
         metrics = load_metrics_block(metrics_path, benchmark)
-        soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
-        soft_assert(field in metrics[agg_key], f"Missing field {field} in {metrics_path}")
+        if metrics is None:
+            continue
+        if agg_key not in metrics:
+            soft_assert(False, f"Missing aggregation key {agg_key} in {metrics_path}")
+            continue
+        if field not in metrics[agg_key]:
+            soft_assert(False, f"Missing field {field} in {metrics_path}")
+            continue
         value = normalize_percent(float(metrics[agg_key][field]))

Also applies to: 216-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 72 - 75, The
code currently indexes into metric dictionaries directly which can raise
KeyError and bypass the soft-assert aggregation; in functions like
load_metrics_block and the code that accesses benchmarks (the blocks referencing
data[benchmark] and later data["summaries"]/data["metrics"] access), first call
soft_assert(key in dict, ...) for each required key (e.g., soft_assert(benchmark
in data, ...); soft_assert("summaries" in data, ...); soft_assert("metrics" in
summary, ...)) and only then perform the dict indexing, or use dict.get(...)
after the soft_assert to avoid a KeyError before assert_all() is called. Ensure
you preserve existing error messages and continue to call assert_all() at the
end of the test flow.
🧹 Nitpick comments (1)
tests/slurm-tests/nano_30b_eval/run_test.py (1)

67-70: Use iterable unpacking for cleaner list composition.

This matches Ruff RUF005 and keeps the block simpler.

Proposed fix
     if enable_tools:
-        parts = [
-            "--enable-auto-tool-choice",
-            "--tool-call-parser qwen3_coder",
-        ] + parts
+        parts = [
+            "--enable-auto-tool-choice",
+            "--tool-call-parser qwen3_coder",
+            *parts,
+        ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/run_test.py` around lines 67 - 70, Replace
the list concatenation that prepends flags to the existing variable parts with
iterable unpacking to simplify composition: update the assignment that currently
builds parts by doing ["--enable-auto-tool-choice", "--tool-call-parser
qwen3_coder",] + parts so it uses iterable unpacking with the existing parts
variable (refer to the parts assignment in run_test.py) to produce the same list
more cleanly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 158-160: The division samples_with_tools / total_samples can still
raise ZeroDivisionError even though soft_assert was called; update the logic
around the tool_fraction calculation in check_results.py to guard against
total_samples == 0 (e.g., check total_samples before dividing, set tool_fraction
to 0 or skip downstream processing/return early) and ensure any downstream uses
of tool_fraction handle the fallback; modify the block that contains
soft_assert, total_samples, samples_with_tools and the tool_fraction assignment
to perform the safe check and avoid the division when total_samples is zero.
- Around line 178-180: The loop over bench_dir.glob("output-rs*.jsonl") in
check_timeouts currently treats zero matched files as zero timeouts; update
check_timeouts to first collect matches (e.g.,
list(sorted(bench_dir.glob("output-rs*.jsonl")))) and assert that the list is
non-empty for each benchmark, raising or failing the test with a clear message
referencing the benchmark (bench_dir) when no output-rs*.jsonl files are found
so missing eval outputs cannot be silently ignored.

---

Duplicate comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 72-75: The code currently indexes into metric dictionaries
directly which can raise KeyError and bypass the soft-assert aggregation; in
functions like load_metrics_block and the code that accesses benchmarks (the
blocks referencing data[benchmark] and later data["summaries"]/data["metrics"]
access), first call soft_assert(key in dict, ...) for each required key (e.g.,
soft_assert(benchmark in data, ...); soft_assert("summaries" in data, ...);
soft_assert("metrics" in summary, ...)) and only then perform the dict indexing,
or use dict.get(...) after the soft_assert to avoid a KeyError before
assert_all() is called. Ensure you preserve existing error messages and continue
to call assert_all() at the end of the test flow.

---

Nitpick comments:
In `@tests/slurm-tests/nano_30b_eval/run_test.py`:
- Around line 67-70: Replace the list concatenation that prepends flags to the
existing variable parts with iterable unpacking to simplify composition: update
the assignment that currently builds parts by doing
["--enable-auto-tool-choice", "--tool-call-parser qwen3_coder",] + parts so it
uses iterable unpacking with the existing parts variable (refer to the parts
assignment in run_test.py) to produce the same list more cleanly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 47a70dac-c809-4584-924e-7061c8a04833

📥 Commits

Reviewing files that changed from the base of the PR and between dd25eeb and 2cdab17.

📒 Files selected for processing (2)
  • tests/slurm-tests/nano_30b_eval/check_results.py
  • tests/slurm-tests/nano_30b_eval/run_test.py

Comment on lines +158 to +160
    soft_assert(total_samples > 0, "No samples found in with_tools outputs")
    tool_fraction = samples_with_tools / total_samples
    print(
Contributor

⚠️ Potential issue | 🟠 Major

Prevent division by zero when tool outputs are missing.

At Line 159, samples_with_tools / total_samples can raise ZeroDivisionError even after soft_assert fails.

Proposed fix
     soft_assert(total_samples > 0, "No samples found in with_tools outputs")
+    if total_samples == 0:
+        return
     tool_fraction = samples_with_tools / total_samples
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 158 - 160, The
division samples_with_tools / total_samples can still raise ZeroDivisionError
even though soft_assert was called; update the logic around the tool_fraction
calculation in check_results.py to guard against total_samples == 0 (e.g., check
total_samples before dividing, set tool_fraction to 0 or skip downstream
processing/return early) and ensure any downstream uses of tool_fraction handle
the fallback; modify the block that contains soft_assert, total_samples,
samples_with_tools and the tool_fraction assignment to perform the safe check
and avoid the division when total_samples is zero.

Comment on lines +178 to +180
        for output_path in sorted(bench_dir.glob("output-rs*.jsonl")):
            file_timeouts = 0
            with output_path.open("rt", encoding="utf-8") as fin:
Contributor

⚠️ Potential issue | 🟠 Major

Assert timeout input files exist per benchmark.

check_timeouts currently allows a benchmark with no output-rs*.jsonl files to pass as 0 timeouts, which can mask missing eval outputs.

Proposed fix
     for benchmark in TOOL_BENCHMARKS:
         bench_dir = eval_dir / "eval-results" / benchmark
         bench_timeouts = 0
+        output_files = sorted(bench_dir.glob("output-rs*.jsonl"))
+        soft_assert(len(output_files) > 0, f"No output files found in {bench_dir}")
+        if not output_files:
+            continue
 
-        for output_path in sorted(bench_dir.glob("output-rs*.jsonl")):
+        for output_path in output_files:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 178 - 180, The
loop over bench_dir.glob("output-rs*.jsonl") in check_timeouts currently treats
zero matched files as zero timeouts; update check_timeouts to first collect
matches (e.g., list(sorted(bench_dir.glob("output-rs*.jsonl")))) and assert that
the list is non-empty for each benchmark, raising or failing the test with a
clear message referencing the benchmark (bench_dir) when no output-rs*.jsonl
files are found so missing eval outputs cannot be silently ignored.

@wedu-nvidia
Collaborator Author

wedu-nvidia commented Apr 22, 2026

@Kipok @gwarmstrong
I created a SLURM test suite for Nano v3 and included most of the benchmarks listed on the Hugging Face model page: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

For a few tasks, such as Terminal Bench and TauBench V2, I could not find corresponding support in NeMo-Skills. I also skipped RULER for now because it requires downloading and preparing the data on the cluster first, and AA-LCR seems to be the higher-priority long-context benchmark at the moment.

Regarding the ranges, I am not sure they are reasonable for all of the benchmarks listed above. Feel free to leave any comments.

Signed-off-by: Wei Du <wedu@nvidia.com>
@wedu-nvidia wedu-nvidia force-pushed the wedu/update-slurm-test branch from 2cdab17 to 6bdefa5 on April 23, 2026 00:07
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (3)
tests/slurm-tests/nano_30b_eval/check_results.py (3)

158-160: ⚠️ Potential issue | 🟠 Major

Avoid divide-by-zero in tool-usage summary.

Line 159 divides even when total_samples == 0; soft_assert alone does not stop execution.

Suggested fix
     soft_assert(total_samples > 0, "No samples found in with_tools outputs")
+    if total_samples == 0:
+        return
     tool_fraction = samples_with_tools / total_samples
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 158 - 160, The
code computes tool_fraction = samples_with_tools / total_samples after calling
soft_assert(total_samples > 0, ...), but soft_assert does not abort execution so
a divide-by-zero can occur; update the logic around soft_assert/total_samples
(referencing soft_assert, total_samples, samples_with_tools, tool_fraction, and
the subsequent print) to explicitly guard against total_samples == 0 — either
return/exit when total_samples is zero or set tool_fraction to a safe default
(e.g., 0) and adjust the print/output to reflect no samples instead of
performing the division.

72-76: ⚠️ Potential issue | 🟠 Major

Guard missing metric keys before dereferencing to keep aggregated failures working.

Line 75 and Line 222 can still raise before assert_all() when required keys are absent, which breaks soft-assert aggregation.

Suggested fix
 def load_metrics_block(metrics_path: Path, benchmark: str):
     data = load_json(metrics_path)
-    soft_assert(benchmark in data, f"Missing benchmark {benchmark} in {metrics_path}")
-    return data[benchmark]
+    if benchmark not in data:
+        soft_assert(False, f"Missing benchmark {benchmark} in {metrics_path}")
+        return None
+    return data[benchmark]
@@
     for benchmark, (agg_key, field, (lo, hi)) in metric_config.items():
         metrics_path, metrics, benchmark_label = resolve_metrics_entry(eval_dir, benchmark)
+        if metrics is None:
+            continue
         soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
@@
 def check_formal_math(eval_dir: Path):
     for label, (benchmark, agg_key, field, (lo, hi)) in FORMAL_MATH_METRICS.items():
         metrics_path = eval_dir / "eval-results" / benchmark / "metrics.json"
         metrics = load_metrics_block(metrics_path, benchmark)
-        soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
-        soft_assert(field in metrics[agg_key], f"Missing field {field} in {metrics_path}")
+        if metrics is None:
+            continue
+        soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
+        if agg_key not in metrics:
+            continue
+        soft_assert(field in metrics[agg_key], f"Missing field {field} in {metrics_path}")
+        if field not in metrics[agg_key]:
+            continue
         value = normalize_percent(float(metrics[agg_key][field]))

Also applies to: 96-100, 216-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 72 - 76, The
code in load_metrics_block (and similar places that directly index into loaded
metric dicts) dereferences expected keys and can raise KeyError before
soft-assert aggregation runs; update load_metrics_block and the other
metric-access sites (the blocks around the code that reads benchmark keys and
specific metric keys at ~lines 96-100 and 216-223) to check for key existence
using soft_assert/soft_assert_present (or call soft_assert with a message)
before accessing data[benchmark] or data[benchmark][<metric>] so missing keys
produce soft assertion records rather than exceptions, and return a safe default
or None when keys are absent to allow assert_all() to aggregate failures.

174-180: ⚠️ Potential issue | 🟠 Major

Fail when tool benchmark outputs are missing, not as “0 timeouts”.

If no output-rs*.jsonl files exist for a benchmark, current logic treats it as zero timeouts and can mask missing eval artifacts.

Suggested fix
     for benchmark in TOOL_BENCHMARKS:
         bench_dir = eval_dir / "eval-results" / benchmark
         bench_timeouts = 0
+        output_files = sorted(bench_dir.glob("output-rs*.jsonl"))
+        soft_assert(len(output_files) > 0, f"No output files found in {bench_dir}")
+        if not output_files:
+            continue
 
-        for output_path in sorted(bench_dir.glob("output-rs*.jsonl")):
+        for output_path in output_files:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 174 - 180, The
loop over TOOL_BENCHMARKS currently assumes presence of output-rs*.jsonl files
and treats absence as "0 timeouts"; change the logic to detect when
sorted(bench_dir.glob("output-rs*.jsonl")) yields no files and treat that as a
missing-artifact error rather than zero timeouts. Specifically, inside the scan
of each benchmark (using bench_dir and output_path), if no output files are
found set a missing flag or raise/log an error for that benchmark (increment a
missing-artifact counter or fail the run) instead of leaving bench_timeouts at
0, and ensure downstream aggregation uses that missing indicator. Ensure
references to TOOL_BENCHMARKS, bench_dir, bench_timeouts and file_timeouts are
updated so callers can distinguish "no outputs" from "0 timeouts."
🧹 Nitpick comments (1)
tests/slurm-tests/nano_30b_eval/run_test.py (1)

100-355: Refactor repeated run_eval blocks into a table-driven loop.

The no-tools/with-tools sections duplicate a large call pattern, which increases drift risk when changing shared args or benchmark options. A small benchmark spec table + loop would make this much safer to maintain.

As per coding guidelines: "Keep code simple and elegant; reuse/extend existing functionality when possible, minimize conditional checks, use self-explanatory code over comments, avoid complicated type interfaces with unions, and keep naming consistent with existing conventions."

Also applies to: 358-444
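
As a rough illustration of the table-driven shape suggested here; spec field names and run_eval arguments are taken loosely from this review and are not a drop-in replacement for the actual script:

```python
# Illustrative only: one spec entry per benchmark, one loop over run_eval.
from nemo_skills.pipeline.cli import eval as run_eval, wrap_arguments

NO_TOOLS_PARAMS = "++inference.tokens_to_generate=120000 "  # from the PR's constants

BENCHMARK_SPECS = [
    {"exp_suffix": "no-tools-aime25", "benchmarks": "aime25"},
    {"exp_suffix": "no-tools-gpqa", "benchmarks": "gpqa"},
    {"exp_suffix": "no-tools-mmlu-pro", "benchmarks": "mmlu-pro", "num_jobs": 2},
]

def eval_no_tools(cluster, expname_prefix, run_after, **shared):
    expnames = []
    for spec in BENCHMARK_SPECS:
        expname = f"{expname_prefix}-{spec['exp_suffix']}"
        run_eval(
            ctx=wrap_arguments(NO_TOOLS_PARAMS + spec.get("ctx_suffix", "")),
            cluster=cluster,
            expname=expname,
            benchmarks=spec["benchmarks"],
            num_jobs=spec.get("num_jobs", 1),
            run_after=run_after,
            **shared,  # model, server_type, server_args, output_dir, partition, ...
        )
        expnames.append(expname)
    return expnames
```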

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_eval/run_test.py` around lines 100 - 355, The
eval_no_tools function has many repeated run_eval calls; refactor by creating a
table (list of dicts) of benchmark specs (each entry: key "exp_suffix",
"benchmarks", optional overrides like "ctx_suffix", "num_jobs", "split",
"num_chunks", "judge_model", "judge_server_type", "judge_server_gpus",
"judge_server_container", "extra_judge_args", "wandb_name_override") and loop
over it to call run_eval once per spec, composing expname from expname_prefix +
exp_suffix and building ctx via wrap_arguments(NO_TOOLS_PARAMS +
spec.get("ctx_suffix","")), while passing the shared parameters
(model=get_local_model_path(workspace), server_type="vllm",
server_gpus=server_gpus, server_args=server_args,
server_container=server_container, output_dir=output_dir, partition=partition,
run_after=run_after, wandb_project=wandb_project) and applying any per-spec
overrides (num_jobs, split, benchmarks, judge_* fields, extra_judge_args,
num_chunks, wandb_name) so behavior remains identical to the original run_eval
calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 158-160: The code computes tool_fraction = samples_with_tools /
total_samples after calling soft_assert(total_samples > 0, ...), but soft_assert
does not abort execution so a divide-by-zero can occur; update the logic around
soft_assert/total_samples (referencing soft_assert, total_samples,
samples_with_tools, tool_fraction, and the subsequent print) to explicitly guard
against total_samples == 0 — either return/exit when total_samples is zero or
set tool_fraction to a safe default (e.g., 0) and adjust the print/output to
reflect no samples instead of performing the division.
- Around line 72-76: The code in load_metrics_block (and similar places that
directly index into loaded metric dicts) dereferences expected keys and can
raise KeyError before soft-assert aggregation runs; update load_metrics_block
and the other metric-access sites (the blocks around the code that reads
benchmark keys and specific metric keys at ~lines 96-100 and 216-223) to check
for key existence using soft_assert/soft_assert_present (or call soft_assert
with a message) before accessing data[benchmark] or data[benchmark][<metric>] so
missing keys produce soft assertion records rather than exceptions, and return a
safe default or None when keys are absent to allow assert_all() to aggregate
failures.
- Around line 174-180: The loop over TOOL_BENCHMARKS currently assumes presence
of output-rs*.jsonl files and treats absence as "0 timeouts"; change the logic
to detect when sorted(bench_dir.glob("output-rs*.jsonl")) yields no files and
treat that as a missing-artifact error rather than zero timeouts. Specifically,
inside the scan of each benchmark (using bench_dir and output_path), if no
output files are found set a missing flag or raise/log an error for that
benchmark (increment a missing-artifact counter or fail the run) instead of
leaving bench_timeouts at 0, and ensure downstream aggregation uses that missing
indicator. Ensure references to TOOL_BENCHMARKS, bench_dir, bench_timeouts and
file_timeouts are updated so callers can distinguish "no outputs" from "0
timeouts."

---

Nitpick comments:
In `@tests/slurm-tests/nano_30b_eval/run_test.py`:
- Around line 100-355: The eval_no_tools function has many repeated run_eval
calls; refactor by creating a table (list of dicts) of benchmark specs (each
entry: key "exp_suffix", "benchmarks", optional overrides like "ctx_suffix",
"num_jobs", "split", "num_chunks", "judge_model", "judge_server_type",
"judge_server_gpus", "judge_server_container", "extra_judge_args",
"wandb_name_override") and loop over it to call run_eval once per spec,
composing expname from expname_prefix + exp_suffix and building ctx via
wrap_arguments(NO_TOOLS_PARAMS + spec.get("ctx_suffix","")), while passing the
shared parameters (model=get_local_model_path(workspace), server_type="vllm",
server_gpus=server_gpus, server_args=server_args,
server_container=server_container, output_dir=output_dir, partition=partition,
run_after=run_after, wandb_project=wandb_project) and applying any per-spec
overrides (num_jobs, split, benchmarks, judge_* fields, extra_judge_args,
num_chunks, wandb_name) so behavior remains identical to the original run_eval
calls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 707ba520-00be-45e5-9b84-edf9d718b620

📥 Commits

Reviewing files that changed from the base of the PR and between 2cdab17 and 6bdefa5.

📒 Files selected for processing (2)
  • tests/slurm-tests/nano_30b_eval/check_results.py
  • tests/slurm-tests/nano_30b_eval/run_test.py

NO_TOOLS_METRICS = {
    "aime25": ("pass@1[avg-of-4]", "symbolic_correct", (88.0, 94.0)),
    "gpqa": ("pass@1[avg-of-4]", "symbolic_correct", (69.0, 76.0)),
    "mmlu-pro": ("pass@1", "symbolic_correct", (74.0, 82.0)),
Collaborator

for large datasets like this, I think we can use max_samples=256 or something like this to make it faster
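
A possible shape for that suggestion; the ++max_samples override name and the value are assumptions used only to show where the cap would be passed, not verified against nemo_skills:

```python
from nemo_skills.pipeline.cli import wrap_arguments

# Hypothetical: cap evaluated samples for a large benchmark such as mmlu-pro.
NO_TOOLS_PARAMS = "++inference.tokens_to_generate=120000 "
mmlu_pro_ctx = wrap_arguments(NO_TOOLS_PARAMS + "++max_samples=256")
```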

JUDGE_MODEL = "openai/gpt-oss-120b"
JUDGE_MODEL_DIRNAME = JUDGE_MODEL.split("/")[-1]
REASONING_PARSER_FILENAME = "nano_v3_reasoning_parser.py"
DEFAULT_SERVER_CONTAINER = (
Collaborator

let's not have references to internal infra, this can be specified in your cluster config and by default we just use what's in there

)

NO_TOOLS_PARAMS = (
"++inference.tokens_to_generate=120000 "
Collaborator

instead of setting tokens to generate, let's set max-model-len in server args, going to be more efficient
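
A sketch of what that could look like in the server-args helper; the helper's shape and the example limit are assumptions, while --max-model-len itself is a standard vLLM flag:

```python
# Hypothetical variant of build_server_args: configure the context limit once
# on the vLLM server instead of setting ++inference.tokens_to_generate per benchmark.
def build_server_args(enable_tools: bool = False) -> str:
    parts = ["--max-model-len 131072"]  # example value, not from the PR
    if enable_tools:
        # tool-calling flags quoted elsewhere in this review
        parts = ["--enable-auto-tool-choice", "--tool-call-parser qwen3_coder", *parts]
    return " ".join(parts)

print(build_server_args(enable_tools=True))
```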

)

FORMAL_MATH_PARAMS = (
"++inference.tokens_to_generate=38912 "
Collaborator

this one is an exception since it uses a lower tokens-to-generate budget, so here we can set it explicitly

    return f"{workspace}/{MODEL_DIRNAME}"


def setup(workspace, cluster, expname_prefix):
Collaborator

we shouldn't need this, you can just use model names and it will put them in hf cache

):
    output_dir = f"{workspace}/no_tools"
    server_args = build_server_args(
        parser_path=f"{workspace}/nano_v3_parser/{REASONING_PARSER_FILENAME}",
Collaborator

latest vllm accepts just nemotron3 for the parser, or something like this; maybe we use that for simplicity

    expnames = []

    expname = f"{expname_prefix}-no-tools-aime25"
    run_eval(
Collaborator

we should group more benchmarks together for efficiency. A single 8-node server can likely run more in parallel.
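
A rough sketch of the grouping idea; it assumes the benchmarks argument accepts a comma-separated list and reuses the keyword arguments quoted earlier in this PR, so treat it as illustrative only:

```python
from nemo_skills.pipeline.cli import eval as run_eval, wrap_arguments

NO_TOOLS_PARAMS = "++inference.tokens_to_generate=120000 "  # from the PR's constants

def eval_no_tools_grouped(workspace, cluster, expname_prefix, server_gpus,
                          server_args, server_container, partition, run_after,
                          wandb_project):
    # One server, several benchmarks: a single eval job instead of one per benchmark.
    expname = f"{expname_prefix}-no-tools-grouped"
    run_eval(
        ctx=wrap_arguments(NO_TOOLS_PARAMS),
        cluster=cluster,
        expname=expname,
        model=f"{workspace}/model",  # placeholder path
        server_type="vllm",
        server_gpus=server_gpus,
        server_args=server_args,
        server_container=server_container,
        benchmarks="aime25,gpqa,mmlu-pro",  # grouped instead of separate calls
        output_dir=f"{workspace}/no_tools",
        partition=partition,
        run_after=run_after,
        wandb_project=wandb_project,
    )
    return [expname]
```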
