update slurm test for nano v3 #1389
Conversation
📝 Walkthrough
Adds two new CLI Python scripts.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User (CLI)
    participant Run as run_test.py
    participant Scheduler as SLURM Scheduler
    participant Eval as nemo_skills.eval Jobs
    participant Storage as workspace / eval-results
    participant Check as check_results.py
    User->>Run: launch with --workspace --cluster --expname_prefix
    Run->>Scheduler: submit asset-download job
    Scheduler-->>Run: asset job scheduled
    Run->>Scheduler: submit eval jobs (no-tools, with-tools, formal-math, agentic) [depend on asset job]
    Scheduler->>Eval: run evaluation jobs
    Eval-->>Storage: write eval-results/<benchmark>/{metrics.json,output-rs*.jsonl}
    Scheduler->>Check: schedule/trigger check_results.py after eval jobs complete
    Check->>Storage: read metrics.json and output-rs*.jsonl
    Check-->>User: print per-metric results and pass/fail via assertions
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/slurm-tests/nano_30b_eval/check_results.py (1)
102-103: Use direct key access for expected output schema fields.
For expected keys (`field`, `num_tool_calls`, `conversation`, `role`, `content`), `.get()` defaults can mask schema breaks and weaken checker signal quality.
As per coding guidelines, "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data."
Also applies to: 131-134, 163-166
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 102 - 103, Replace uses of dict.get(...) with direct indexing for expected schema fields so missing keys raise errors instead of silently returning None: update occurrences where metrics[agg_key].get(field) (and similar .get(...) usages for keys 'field', 'num_tool_calls', 'conversation', 'role', 'content' found elsewhere in the file) are used and change them to direct access like metrics[agg_key][field] (and data['num_tool_calls'], data['conversation'], data['role'], data['content']) to enforce the expected schema; keep existing variable names (metrics, agg_key, field, field_label) and behavior otherwise.
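A minimal sketch of the suggested pattern (the dictionary shape and names below are illustrative, not copied from the actual checker):

```python
# Hypothetical illustration of the guideline, not the actual checker code.
metrics = {"pass@1": {"symbolic_correct": 91.2}}
agg_key, field = "pass@1", "symbolic_correct"

# .get() silently returns None when the schema breaks:
value = metrics[agg_key].get(field)

# Direct indexing fails loudly with a KeyError, which is what the checker wants:
value = metrics[agg_key][field]
```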
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 72-76: The code currently dereferences dict keys immediately
(e.g., in load_metrics_block and other places that access nested keys) which can
raise KeyError/TypeError before soft_assert/aggregate checks run; update each
place that does direct indexing (notably load_metrics_block, and the locations
that access nested keys at the other reported spots) to first soft_assert the
existence of the top-level and nested keys (using soft_assert(benchmark in data,
...), then soft_assert('subkey' in data[benchmark], ...) as needed) and only
after assertions pass perform the actual access, or use data.get('key') and
check for None via soft_assert before using the value; ensure you reference and
keep using soft_assert and assert_all for aggregated failures rather than
letting direct indexing raise exceptions.
In `@tests/slurm-tests/nano_30b_eval/run_test.py`:
- Line 19: The import shadows the Python builtin eval; change the import from
nemo_skills.pipeline.cli to alias eval (e.g., import eval as run_eval) and then
update every callsite that currently calls eval to use run_eval instead (there
are 16 callsites to change). Keep the other imported symbols (prepare_data,
run_cmd, wrap_arguments) unchanged and ensure all references to the original
eval symbol in functions and tests are replaced with run_eval to avoid builtin
shadowing and satisfy Ruff A004.
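A sketch of the rename (assuming the script imports these symbols from `nemo_skills.pipeline.cli`, as the comment states; the alias name is the reviewer's suggestion):

```python
# Before: `eval` shadows the Python builtin (Ruff A004).
# from nemo_skills.pipeline.cli import eval, prepare_data, run_cmd, wrap_arguments

# After: alias the import and switch every callsite to run_eval(...).
from nemo_skills.pipeline.cli import eval as run_eval
from nemo_skills.pipeline.cli import prepare_data, run_cmd, wrap_arguments
```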
- Around line 547-559: The declared type for run_after in
nemo_skills/pipeline/eval.py (the eval()/run_exp entry) is List[str] but callers
pass a string; update the signature to accept str | list[str] | None (or
Union[str, List[str], None]) and/or normalize the value immediately in
eval()/run_exp by coercing a single str into [str] (and leaving None as-is)
before passing to add_task()/run_exp; reference the eval/run_exp parameter name
run_after and the downstream add_task utility in
nemo_skills/pipeline/utils/exp.py to ensure consistent handling.
---
Nitpick comments:
In `@tests/slurm-tests/nano_30b_eval/check_results.py`:
- Around line 102-103: Replace uses of dict.get(...) with direct indexing for
expected schema fields so missing keys raise errors instead of silently
returning None: update occurrences where metrics[agg_key].get(field) (and
similar .get(...) usages for keys 'field', 'num_tool_calls', 'conversation',
'role', 'content' found elsewhere in the file) are used and change them to
direct access like metrics[agg_key][field] (and data['num_tool_calls'],
data['conversation'], data['role'], data['content']) to enforce the expected
schema; keep existing variable names (metrics, agg_key, field, field_label) and
behavior otherwise.
📒 Files selected for processing (2)
tests/slurm-tests/nano_30b_eval/check_results.py
tests/slurm-tests/nano_30b_eval/run_test.py
```python
setup_expname = setup(workspace=args.workspace, cluster=args.cluster, expname_prefix=args.expname_prefix)

no_tools_expnames = eval_no_tools(
    workspace=args.workspace,
    cluster=args.cluster,
    expname_prefix=args.expname_prefix,
    wandb_project=args.wandb_project,
    partition=args.partition,
    num_jobs=args.num_jobs,
    server_gpus=args.server_gpus,
    server_container=args.server_container,
    run_after=setup_expname,
)
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify how `run_after` is handled inside eval implementation.
rg -n -C3 '\brun_after\b' nemo_skills/pipeline/eval.py
ast-grep --pattern 'run_after: List[str] = typer.Option($$$)' nemo_skills/pipeline/eval.py
```
Repository: NVIDIA-NeMo/Skills
Length of output: 2862
🏁 Script executed:
```bash
# Check the setup function return type in the test file
head -30 tests/slurm-tests/nano_30b_eval/run_test.py | grep -A 20 "def setup"
# Look for run_exp function signature and how it handles run_after
rg -n "def run_exp" nemo_skills/pipeline/utils/exp.py | head -1
# Check how run_after is used in run_exp
rg -n -A 10 "run_after" nemo_skills/pipeline/utils/exp.py | head -40
# Check if there's any type conversion/normalization for run_after
rg -n "isinstance.*run_after\|list.*run_after\|if.*run_after" nemo_skills/pipeline/
```
Repository: NVIDIA-NeMo/Skills
Length of output: 1973
🏁 Script executed:
# Check the setup function return type in the test file
rg -n "def setup" tests/slurm-tests/nano_30b_eval/run_test.py -A 10
# Check eval_no_tools function signature - does it pass run_after to eval?
rg -n "def eval_no_tools" tests/slurm-tests/nano_30b_eval/run_test.py -A 30
# Verify what run_after parameter receives in eval.py and how it's forwarded
rg -n "run_after=" nemo_skills/pipeline/eval.py -B 2 -A 2 | head -50Repository: NVIDIA-NeMo/Skills
Length of output: 3116
Update run_after type hint in nemo_skills/pipeline/eval.py or normalize it at entry.
The test passes setup_expname (a string) to eval() as run_after, but eval.py line 238 declares run_after: List[str]. While this works because downstream functions like add_task() normalize strings to lists (lines 496–498 in nemo_skills/pipeline/utils/exp.py), the type mismatch violates the declared interface.
Either update the type hint to run_after: str | list[str] | None to match the actual accepted types, or normalize at the entry point before passing to run_exp().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/slurm-tests/nano_30b_eval/run_test.py` around lines 547 - 559, The
declared type for run_after in nemo_skills/pipeline/eval.py (the eval()/run_exp
entry) is List[str] but callers pass a string; update the signature to accept
str | list[str] | None (or Union[str, List[str], None]) and/or normalize the
value immediately in eval()/run_exp by coercing a single str into [str] (and
leaving None as-is) before passing to add_task()/run_exp; reference the
eval/run_exp parameter name run_after and the downstream add_task utility in
nemo_skills/pipeline/utils/exp.py to ensure consistent handling.
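A sketch of the normalization variant (the helper name is made up; the parameter name `run_after` comes from the finding above):

```python
from typing import List, Optional, Union

def normalize_run_after(run_after: Union[str, List[str], None]) -> Optional[List[str]]:
    """Coerce a single experiment name into a one-element list; keep None and lists as-is."""
    if run_after is None or isinstance(run_after, list):
        return run_after
    return [run_after]

# Inside eval()/run_exp, before forwarding to add_task():
# run_after = normalize_run_after(run_after)
```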
Actionable comments posted: 2
♻️ Duplicate comments (1)
tests/slurm-tests/nano_30b_eval/check_results.py (1)
72-75: ⚠️ Potential issue | 🟠 Major
Guard soft-asserted keys before indexing to keep aggregated failure reporting intact.
At Line 75 and Line 221–222, direct indexing can still raise `KeyError` before `assert_all()` runs, which breaks the intended soft-assert flow.
Proposed fix
```diff
 def load_metrics_block(metrics_path: Path, benchmark: str):
     data = load_json(metrics_path)
-    soft_assert(benchmark in data, f"Missing benchmark {benchmark} in {metrics_path}")
-    return data[benchmark]
+    if benchmark not in data:
+        soft_assert(False, f"Missing benchmark {benchmark} in {metrics_path}")
+        return None
+    return data[benchmark]
@@
 def check_metric_group(
     eval_dir: Path, metric_config: dict[str, tuple[str, str | tuple[str, ...], tuple[float, float]]]
 ):
     for benchmark, (agg_key, field, (lo, hi)) in metric_config.items():
         metrics_path, metrics, benchmark_label = resolve_metrics_entry(eval_dir, benchmark)
+        if metrics is None:
+            continue
         soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
         if agg_key not in metrics:
             continue
@@
 def check_formal_math(eval_dir: Path):
     for label, (benchmark, agg_key, field, (lo, hi)) in FORMAL_MATH_METRICS.items():
         metrics_path = eval_dir / "eval-results" / benchmark / "metrics.json"
         metrics = load_metrics_block(metrics_path, benchmark)
-        soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
-        soft_assert(field in metrics[agg_key], f"Missing field {field} in {metrics_path}")
+        if metrics is None:
+            continue
+        if agg_key not in metrics:
+            soft_assert(False, f"Missing aggregation key {agg_key} in {metrics_path}")
+            continue
+        if field not in metrics[agg_key]:
+            soft_assert(False, f"Missing field {field} in {metrics_path}")
+            continue
         value = normalize_percent(float(metrics[agg_key][field]))
```
Also applies to: 216-223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 72 - 75, The code currently indexes into metric dictionaries directly which can raise KeyError and bypass the soft-assert aggregation; in functions like load_metrics_block and the code that accesses benchmarks (the blocks referencing data[benchmark] and later data["summaries"]/data["metrics"] access), first call soft_assert(key in dict, ...) for each required key (e.g., soft_assert(benchmark in data, ...); soft_assert("summaries" in data, ...); soft_assert("metrics" in summary, ...)) and only then perform the dict indexing, or use dict.get(...) after the soft_assert to avoid a KeyError before assert_all() is called. Ensure you preserve existing error messages and continue to call assert_all() at the end of the test flow.
🧹 Nitpick comments (1)
tests/slurm-tests/nano_30b_eval/run_test.py (1)
67-70: Use iterable unpacking for cleaner list composition.
This matches Ruff RUF005 and keeps the block simpler.
Proposed fix
```diff
 if enable_tools:
-    parts = [
-        "--enable-auto-tool-choice",
-        "--tool-call-parser qwen3_coder",
-    ] + parts
+    parts = [
+        "--enable-auto-tool-choice",
+        "--tool-call-parser qwen3_coder",
+        *parts,
+    ]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/slurm-tests/nano_30b_eval/run_test.py` around lines 67 - 70, Replace the list concatenation that prepends flags to the existing variable parts with iterable unpacking to simplify composition: update the assignment that currently builds parts by doing ["--enable-auto-tool-choice", "--tool-call-parser qwen3_coder",] + parts so it uses iterable unpacking with the existing parts variable (refer to the parts assignment in run_test.py) to produce the same list more cleanly.
```python
soft_assert(total_samples > 0, "No samples found in with_tools outputs")
tool_fraction = samples_with_tools / total_samples
print(
```
Prevent division by zero when tool outputs are missing.
At Line 159, samples_with_tools / total_samples can raise ZeroDivisionError even after soft_assert fails.
Proposed fix
```diff
 soft_assert(total_samples > 0, "No samples found in with_tools outputs")
+if total_samples == 0:
+    return
 tool_fraction = samples_with_tools / total_samples
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 158 - 160, The
division samples_with_tools / total_samples can still raise ZeroDivisionError
even though soft_assert was called; update the logic around the tool_fraction
calculation in check_results.py to guard against total_samples == 0 (e.g., check
total_samples before dividing, set tool_fraction to 0 or skip downstream
processing/return early) and ensure any downstream uses of tool_fraction handle
the fallback; modify the block that contains soft_assert, total_samples,
samples_with_tools and the tool_fraction assignment to perform the safe check
and avoid the division when total_samples is zero.
```python
for output_path in sorted(bench_dir.glob("output-rs*.jsonl")):
    file_timeouts = 0
    with output_path.open("rt", encoding="utf-8") as fin:
```
Assert timeout input files exist per benchmark.
check_timeouts currently allows a benchmark with no output-rs*.jsonl files to pass as 0 timeouts, which can mask missing eval outputs.
Proposed fix
```diff
 for benchmark in TOOL_BENCHMARKS:
     bench_dir = eval_dir / "eval-results" / benchmark
     bench_timeouts = 0
+    output_files = sorted(bench_dir.glob("output-rs*.jsonl"))
+    soft_assert(len(output_files) > 0, f"No output files found in {bench_dir}")
+    if not output_files:
+        continue
-    for output_path in sorted(bench_dir.glob("output-rs*.jsonl")):
+    for output_path in output_files:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 178 - 180, The
loop over bench_dir.glob("output-rs*.jsonl") in check_timeouts currently treats
zero matched files as zero timeouts; update check_timeouts to first collect
matches (e.g., list(sorted(bench_dir.glob("output-rs*.jsonl")))) and assert that
the list is non-empty for each benchmark, raising or failing the test with a
clear message referencing the benchmark (bench_dir) when no output-rs*.jsonl
files are found so missing eval outputs cannot be silently ignored.
@Kipok @gwarmstrong For a few tasks, such as Terminal Bench and TauBench V2, I could not find corresponding support in NeMo-Skills. I also skipped RULER for now because it requires downloading and preparing the data on the cluster first, and it seems that AA-LCR is currently the higher-priority long-context benchmark. Regarding the ranges, I am not sure they are reasonable for all of the benchmarks listed above.
Signed-off-by: Wei Du <wedu@nvidia.com>
Force-pushed from 2cdab17 to 6bdefa5
♻️ Duplicate comments (3)
tests/slurm-tests/nano_30b_eval/check_results.py (3)
158-160: ⚠️ Potential issue | 🟠 Major
Avoid divide-by-zero in tool-usage summary.
Line 159 divides even when `total_samples == 0`; `soft_assert` alone does not stop execution.
Suggested fix
```diff
 soft_assert(total_samples > 0, "No samples found in with_tools outputs")
+if total_samples == 0:
+    return
 tool_fraction = samples_with_tools / total_samples
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 158 - 160, The code computes tool_fraction = samples_with_tools / total_samples after calling soft_assert(total_samples > 0, ...), but soft_assert does not abort execution so a divide-by-zero can occur; update the logic around soft_assert/total_samples (referencing soft_assert, total_samples, samples_with_tools, tool_fraction, and the subsequent print) to explicitly guard against total_samples == 0 — either return/exit when total_samples is zero or set tool_fraction to a safe default (e.g., 0) and adjust the print/output to reflect no samples instead of performing the division.
72-76: ⚠️ Potential issue | 🟠 Major
Guard missing metric keys before dereferencing to keep aggregated failures working.
Line 75 and Line 222 can still raise before `assert_all()` when required keys are absent, which breaks soft-assert aggregation.
Suggested fix
```diff
 def load_metrics_block(metrics_path: Path, benchmark: str):
     data = load_json(metrics_path)
-    soft_assert(benchmark in data, f"Missing benchmark {benchmark} in {metrics_path}")
-    return data[benchmark]
+    if benchmark not in data:
+        soft_assert(False, f"Missing benchmark {benchmark} in {metrics_path}")
+        return None
+    return data[benchmark]
@@
     for benchmark, (agg_key, field, (lo, hi)) in metric_config.items():
         metrics_path, metrics, benchmark_label = resolve_metrics_entry(eval_dir, benchmark)
+        if metrics is None:
+            continue
         soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
@@
 def check_formal_math(eval_dir: Path):
     for label, (benchmark, agg_key, field, (lo, hi)) in FORMAL_MATH_METRICS.items():
         metrics_path = eval_dir / "eval-results" / benchmark / "metrics.json"
         metrics = load_metrics_block(metrics_path, benchmark)
-        soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
-        soft_assert(field in metrics[agg_key], f"Missing field {field} in {metrics_path}")
+        if metrics is None:
+            continue
+        soft_assert(agg_key in metrics, f"Missing aggregation key {agg_key} in {metrics_path}")
+        if agg_key not in metrics:
+            continue
+        soft_assert(field in metrics[agg_key], f"Missing field {field} in {metrics_path}")
+        if field not in metrics[agg_key]:
+            continue
         value = normalize_percent(float(metrics[agg_key][field]))
```
Also applies to: 96-100, 216-223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 72 - 76, The code in load_metrics_block (and similar places that directly index into loaded metric dicts) dereferences expected keys and can raise KeyError before soft-assert aggregation runs; update load_metrics_block and the other metric-access sites (the blocks around the code that reads benchmark keys and specific metric keys at ~lines 96-100 and 216-223) to check for key existence using soft_assert/soft_assert_present (or call soft_assert with a message) before accessing data[benchmark] or data[benchmark][<metric>] so missing keys produce soft assertion records rather than exceptions, and return a safe default or None when keys are absent to allow assert_all() to aggregate failures.
174-180: ⚠️ Potential issue | 🟠 Major
Fail when tool benchmark outputs are missing, not as "0 timeouts".
If no `output-rs*.jsonl` files exist for a benchmark, current logic treats it as zero timeouts and can mask missing eval artifacts.
Suggested fix
```diff
 for benchmark in TOOL_BENCHMARKS:
     bench_dir = eval_dir / "eval-results" / benchmark
     bench_timeouts = 0
+    output_files = sorted(bench_dir.glob("output-rs*.jsonl"))
+    soft_assert(len(output_files) > 0, f"No output files found in {bench_dir}")
+    if not output_files:
+        continue
-    for output_path in sorted(bench_dir.glob("output-rs*.jsonl")):
+    for output_path in output_files:
```
Verify each finding against the current code and only fix it if needed. In `@tests/slurm-tests/nano_30b_eval/check_results.py` around lines 174 - 180, The loop over TOOL_BENCHMARKS currently assumes presence of output-rs*.jsonl files and treats absence as "0 timeouts"; change the logic to detect when sorted(bench_dir.glob("output-rs*.jsonl")) yields no files and treat that as a missing-artifact error rather than zero timeouts. Specifically, inside the scan of each benchmark (using bench_dir and output_path), if no output files are found set a missing flag or raise/log an error for that benchmark (increment a missing-artifact counter or fail the run) instead of leaving bench_timeouts at 0, and ensure downstream aggregation uses that missing indicator. Ensure references to TOOL_BENCHMARKS, bench_dir, bench_timeouts and file_timeouts are updated so callers can distinguish "no outputs" from "0 timeouts."
🧹 Nitpick comments (1)
tests/slurm-tests/nano_30b_eval/run_test.py (1)
100-355: Refactor repeated `run_eval` blocks into a table-driven loop.
The no-tools/with-tools sections duplicate a large call pattern, which increases drift risk when changing shared args or benchmark options. A small benchmark spec table + loop would make this much safer to maintain.
As per coding guidelines: "Keep code simple and elegant; reuse/extend existing functionality when possible, minimize conditional checks, use self-explanatory code over comments, avoid complicated type interfaces with unions, and keep naming consistent with existing conventions."
Also applies to: 358-444
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/slurm-tests/nano_30b_eval/run_test.py` around lines 100 - 355, The eval_no_tools function has many repeated run_eval calls; refactor by creating a table (list of dicts) of benchmark specs (each entry: key "exp_suffix", "benchmarks", optional overrides like "ctx_suffix", "num_jobs", "split", "num_chunks", "judge_model", "judge_server_type", "judge_server_gpus", "judge_server_container", "extra_judge_args", "wandb_name_override") and loop over it to call run_eval once per spec, composing expname from expname_prefix + exp_suffix and building ctx via wrap_arguments(NO_TOOLS_PARAMS + spec.get("ctx_suffix","")), while passing the shared parameters (model=get_local_model_path(workspace), server_type="vllm", server_gpus=server_gpus, server_args=server_args, server_container=server_container, output_dir=output_dir, partition=partition, run_after=run_after, wandb_project=wandb_project) and applying any per-spec overrides (num_jobs, split, benchmarks, judge_* fields, extra_judge_args, num_chunks, wandb_name) so behavior remains identical to the original run_eval calls.
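A rough sketch of the table-driven shape being suggested (the spec keys, benchmark names, and `run_eval` keyword arguments are illustrative, reconstructed from the prompt above rather than from the actual run_test.py):

```python
# Hypothetical benchmark spec table; per-spec overrides win over shared defaults.
BENCHMARK_SPECS = [
    {"exp_suffix": "no-tools-aime25", "benchmarks": "aime25:4"},
    {"exp_suffix": "no-tools-gpqa", "benchmarks": "gpqa:4"},
    {"exp_suffix": "no-tools-mmlu-pro", "benchmarks": "mmlu-pro"},
]

def eval_no_tools(workspace, cluster, expname_prefix, **shared_eval_kwargs):
    expnames = []
    for spec in BENCHMARK_SPECS:
        expname = f"{expname_prefix}-{spec['exp_suffix']}"
        kwargs = {**shared_eval_kwargs, **spec.get("overrides", {})}
        run_eval(
            ctx=wrap_arguments(NO_TOOLS_PARAMS + spec.get("ctx_suffix", "")),
            cluster=cluster,
            expname=expname,
            benchmarks=spec["benchmarks"],
            output_dir=f"{workspace}/no_tools",
            **kwargs,
        )
        expnames.append(expname)
    return expnames
```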
```python
NO_TOOLS_METRICS = {
    "aime25": ("pass@1[avg-of-4]", "symbolic_correct", (88.0, 94.0)),
    "gpqa": ("pass@1[avg-of-4]", "symbolic_correct", (69.0, 76.0)),
    "mmlu-pro": ("pass@1", "symbolic_correct", (74.0, 82.0)),
```
for large datasets like this, I think we can use max_samples=256 or something like this to make it faster
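If that lands as a generation override, it could look roughly like this; treating `++max_samples` as the relevant NeMo-Skills option is an assumption on my part:

```python
# Hypothetical: cap large benchmarks at 256 samples to shorten the run.
ctx = wrap_arguments(NO_TOOLS_PARAMS + " ++max_samples=256 ")
```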
```python
JUDGE_MODEL = "openai/gpt-oss-120b"
JUDGE_MODEL_DIRNAME = JUDGE_MODEL.split("/")[-1]
REASONING_PARSER_FILENAME = "nano_v3_reasoning_parser.py"
DEFAULT_SERVER_CONTAINER = (
```
let's not have references to internal infra, this can be specified in your cluster config and by default we just use what's in there
```python
)


NO_TOOLS_PARAMS = (
    "++inference.tokens_to_generate=120000 "
```
instead of setting tokens to generate, let's set max-model-len in server args, going to be more efficient
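A sketch of what that could look like; the flag is vLLM's `--max-model-len`, but the concrete value and where it gets appended are assumptions:

```python
# Hypothetical: bound context length on the vLLM server instead of setting
# ++inference.tokens_to_generate per request.
server_args = "--max-model-len 131072"
```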
```python
)


FORMAL_MATH_PARAMS = (
    "++inference.tokens_to_generate=38912 "
```
this one is an exception if it's using lower tokens to generate, so here we can set it explicitly
```python
    return f"{workspace}/{MODEL_DIRNAME}"


def setup(workspace, cluster, expname_prefix):
```
we shouldn't need this, you can just use model names and it will put them in hf cache
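A sketch of skipping the pre-download (assuming the pipeline accepts a Hugging Face model ID directly and the serving backend pulls it into the HF cache; the ID below is a placeholder, not the real model name):

```python
# Hypothetical: point eval directly at a HF model name instead of a workspace path.
run_eval(
    ctx=wrap_arguments(NO_TOOLS_PARAMS),
    cluster=cluster,
    expname=f"{expname_prefix}-no-tools-aime25",
    model="nvidia/<nano-v3-model-id>",  # placeholder
    server_type="vllm",
    benchmarks="aime25:4",
    output_dir=f"{workspace}/no_tools",
)
```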
```python
):
    output_dir = f"{workspace}/no_tools"
    server_args = build_server_args(
        parser_path=f"{workspace}/nano_v3_parser/{REASONING_PARSER_FILENAME}",
```
latest vllm accepts just nemotron3 for parser or something like this, maybe we use that for simplicity
```python
    expnames = []

    expname = f"{expname_prefix}-no-tools-aime25"
    run_eval(
```
we should group more benchmarks together for efficiency. A single 8-node server can likely run more in parallel
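A sketch of the grouping idea (the comma-separated `benchmarks` string follows my understanding of the NeMo-Skills CLI; the exact grouping and sample counts are illustrative):

```python
# Hypothetical: one 8-GPU server submission covering several benchmarks at once.
run_eval(
    ctx=wrap_arguments(NO_TOOLS_PARAMS),
    cluster=cluster,
    expname=f"{expname_prefix}-no-tools-reasoning",
    benchmarks="aime25:4,gpqa:4,mmlu-pro",
    server_type="vllm",
    server_gpus=8,
    output_dir=f"{workspace}/no_tools",
)
```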
Summary by CodeRabbit