Add RASB-26H1 benchmark integration #1402
Conversation
RASB (Real Agent Scaffolds Bench) evaluates LLMs on complex agent scaffolding tasks. This adds support for the 26H1 snapshot with 193 environments and 5,731 test samples.

Features:
- Docker-based evaluation matching the original RASB benchmark
- Support for OpenAI, Anthropic, and other compatible APIs
- Aggregate metrics: mean, median, Q1, Q3, std across environments
- Per-environment, per-type, per-judgment breakdowns

Files:
- nemo_skills/dataset/rasb-26h1/: Dataset module and preparation
- nemo_skills/inference/eval/rasb.py: Generation with Docker orchestration
- nemo_skills/inference/eval/rasb_container/: Container evaluation files
- nemo_skills/evaluation/metrics/rasb_metrics.py: Metrics aggregation
- nemo_skills/evaluation/evaluator/rasb.py: Result passthrough evaluator
- docs/evaluation/rasb.md: Documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Peter Belcak <pbelcak@nvidia.com>
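The aggregate metrics described above (mean, median, Q1, Q3, std across environments) can be sketched with the standard library; the function name and the list-of-percentages input are illustrative, not the PR's actual API:

```python
import statistics

def aggregate_pass_rates(per_env_pass_rates):
    """Aggregate per-environment pass rates (in %) into summary statistics.

    `per_env_pass_rates` is a hypothetical list of floats, one per environment;
    this is a sketch of the kind of aggregation the PR description mentions,
    not the code in rasb_metrics.py.
    """
    # statistics.quantiles with n=4 returns [Q1, median, Q3] (exclusive method)
    q = statistics.quantiles(per_env_pass_rates, n=4)
    return {
        "mean": statistics.mean(per_env_pass_rates),
        "median": statistics.median(per_env_pass_rates),
        "q1": q[0],
        "q3": q[2],
        "std": statistics.stdev(per_env_pass_rates) if len(per_env_pass_rates) > 1 else 0.0,
    }
```

Note that `statistics.quantiles` requires at least two data points, so a real implementation would also need to handle the single-environment case.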
📝 Walkthrough

Adds RASB 26H1: docs and MkDocs nav, a dataset package and prepare CLI, a Docker-based Hydra generation task that builds/runs per-environment containers, in-container evaluation/judging/LM adapters, evaluator/metrics integration, and result aggregation/output plumbing.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Orchestrator as Orchestration (rasb.py)
    participant Docker as Docker (Image Build)
    participant Container as Container (evaluate.py)
    participant LM as LM Endpoint (OpenAI/Anthropic)
    participant Judge as Judge (judge.py)
    participant Metrics as Metrics Collector
    Orchestrator->>Docker: build base image per environment
    Orchestrator->>Docker: build overlay (evaluate.py, judge.py, lm.py, callable.py, .env)
    Orchestrator->>Container: run container with mounted inputs/results
    Container->>LM: invoke callable with messages
    LM-->>Container: model response (may include tool calls)
    alt tool call present
        Container->>Container: execute tool locally
        Container->>LM: feed tool result back to callable
        LM-->>Container: updated response
    end
    Container->>Container: parse output (json/regex/tool/face-value)
    alt judgment == requirements
        Container->>Judge: send prompt + requirements (committee)
        Judge->>LM: query judge model(s)
        LM-->>Judge: judge responses
        Judge-->>Container: aggregated requirements judgment
    else judgment == exact
        Container->>Container: normalize & compare expected
    end
    Container->>Orchestrator: write results.json
    Orchestrator->>Metrics: map results to Skills-format, aggregate breakdowns
    Metrics-->>Orchestrator: pass rates & per-env stats
```
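The container's tool-call flow in the sequence diagram above can be illustrated with a minimal loop. All names here (`call_model`, `tools`, the message shapes) are illustrative stand-ins, not the actual RASB container API:

```python
def run_agent_loop(call_model, tools, user_message, max_turns=8):
    """Minimal sketch of the in-container flow: call the model, execute any
    requested tool locally, and feed the tool result back until the model
    returns a final answer.

    `call_model` is a hypothetical callable returning a dict with either a
    final "content" string or a "tool_call" request ({"name": ..., "args": ...});
    `tools` maps tool names to plain Python callables.
    """
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = call_model(messages)
        tool_call = response.get("tool_call")
        if tool_call is None:
            # Final answer, ready for parsing/judging downstream
            return response["content"]
        # Execute the tool locally and append its result for the next model call
        result = tools[tool_call["name"]](**tool_call["args"])
        messages.append({"role": "assistant", "content": "", "tool_call": tool_call})
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent loop did not terminate within max_turns")
```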
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Actionable comments posted: 4
🧹 Nitpick comments (9)
nemo_skills/dataset/rasb-26h1/__init__.py (1)
27-39: Consider adding RASB to slurm benchmark coverage. Given the Docker orchestration + custom evaluator/metrics flow, adding at least a small `rasb-26h1` slurm test path would help catch integration regressions early. Based on learnings, when enabling a new modality or adding complicated benchmark evaluation/metrics logic, the dataset should be considered for slurm tests for comprehensive evaluation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/rasb-26h1/__init__.py` around lines 27 - 39, Add the new rasb-26h1 dataset to the project's Slurm benchmark test coverage by updating the Slurm/CI test matrix to include a lightweight slurm job for this dataset (use a short/smoke config) so integration regressions for Docker orchestration and the custom evaluator/metrics are exercised; reference the dataset name rasb-26h1 and ensure the test invokes the dataset's GENERATION_MODULE ("nemo_skills.inference.eval.rasb"), honors REQUIRES_DATA_DIR=True (or uses a small synthetic data fixture), and validates METRICS_TYPE="rasb" and GENERATION_ARGS="++eval_type=rasb" on the EVAL_SPLIT="test" path. Ensure the job is marked small/optional to avoid long runs and include any necessary Docker permissions so the container orchestration path executes during the Slurm run.

nemo_skills/dataset/rasb-26h1/prepare.py (2)
166-167: Remove extraneous f-string prefixes. These f-strings don't contain any placeholders.
♻️ Proposed fix
```diff
- print(f"\nRASB 26H1 Preparation Summary")
- print(f"{'=' * 40}")
+ print("\nRASB 26H1 Preparation Summary")
+ print("=" * 40)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/rasb-26h1/prepare.py` around lines 166 - 167, The two print statements that use f-strings (the lines calling print(f"\nRASB 26H1 Preparation Summary") and print(f"{'=' * 40}")) have no interpolations; remove the unnecessary f prefixes so they are plain string literals (print("\nRASB 26H1 Preparation Summary") and print("=" * 40)) to avoid misleading f-string usage in prepare.py.
185-187: Remove extraneous f-string prefix.
♻️ Proposed fix
```diff
- print(f"\nSamples by environment type:")
+ print("\nSamples by environment type:")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/rasb-26h1/prepare.py` around lines 185 - 187, The print statement uses an unnecessary f-string prefix: change print(f"\nSamples by environment type:") to a regular string print("\nSamples by environment type:") while leaving the subsequent loop and its f-string (for env_type, count in sorted(type_counts.items(), key=lambda x: -x[1]): print(f" {env_type}: {count}")) intact; this removes the extraneous f prefix without affecting the interpolated prints.

nemo_skills/evaluation/metrics/rasb_metrics.py (1)
193-198: Rename unused loop variable. The `env_id` variable is not used in the loop body.
♻️ Proposed fix
```diff
  # Collect per-environment pass rates
  pass_rates = []
- for env_id, counter in self.by_env_id.items():
+ for _env_id, counter in self.by_env_id.items():
      if counter["total"] > 0:
          rate = 100.0 * counter["correct"] / counter["total"]
          pass_rates.append(rate)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/rasb_metrics.py` around lines 193 - 198, The loop over self.by_env_id.items() declares an unused variable env_id; change the loop to ignore the unused key (e.g., use _ or _env_id) or iterate over self.by_env_id.values() so only counter is used; update the for statement where pass_rates is built (the loop that currently reads "for env_id, counter in self.by_env_id.items():") to avoid the unused env_id symbol.

nemo_skills/inference/eval/rasb_container/evaluate.py (2)
85-88: Consider logging the exception in harness fallback. The silent `pass` on exception swallows potentially useful diagnostic information when `apply_inputs` fails.
🔧 Proposed fix
```diff
  try:
      result = _harness_apply_inputs(system_template, user_template, fields, input_mode)
      if isinstance(result, tuple) and len(result) == 2:
          sys_out, usr_out = result
          # Only accept plain text prompts; reject multimodal content blocks
          if isinstance(sys_out, str) and isinstance(usr_out, str):
              return result
          log.warning("Harness returned non-string prompt content (multimodal?), falling through to built-in")
- except Exception:
-     pass  # Fall through to built-in handling
+ except Exception as e:
+     log.debug("Harness apply_inputs failed, using built-in: %s", e)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/evaluate.py` around lines 85 - 88, The except block that currently swallows errors when calling apply_inputs should log the exception so failures are visible; replace the bare "pass" with a log.exception (or log.error with exc_info=True) call that includes context like "apply_inputs failed, falling back to built-in handling" and the exception details; update the except in the same try that surrounds apply_inputs/return result and keep the subsequent fallback behavior unchanged so diagnostics are preserved without altering control flow.
96-97: Minor: Use `next(iter(...))` for single element extraction.
♻️ Proposed fix
```diff
  elif input_mode == "direct_user_message":
-     return system_template, str(list(fields.values())[0]) if fields else ""
+     return system_template, str(next(iter(fields.values()))) if fields else ""
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/evaluate.py` around lines 96 - 97, The return currently constructs a list to grab the first value from fields (str(list(fields.values())[0])), which is inefficient; replace that extraction with str(next(iter(fields.values()))) in the branch handling input_mode == "direct_user_message" (in evaluate.py) while keeping the existing guard (if fields else "") so next(iter(...)) is only called when fields is non-empty.

nemo_skills/inference/eval/rasb.py (3)
240-252: Prefix unused unpacked variables with underscore. The `image` and `logs` variables from `images.build()` are not used.
♻️ Proposed fix
```diff
  try:
-     image, logs = self.docker_client.images.build(
+     _image, _logs = self.docker_client.images.build(
          path=str(env_path),
          tag=base_tag,
          rm=True,
          forcerm=True,
          timeout=self.cfg.docker_build_timeout,
      )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 240 - 252, The variables returned by docker_client.images.build (image, logs) are assigned but unused; in the build block inside rasb.py (the call to self.docker_client.images.build with tag=base_tag) rename the unpacked variables to unused-prefixed names (e.g., _image, _logs or _, _logs) to comply with the unused-variable convention and avoid linter warnings while keeping the same behavior in the try/except that logs success with LOG.info and handles docker.errors.BuildError.
271-284: Prefix unused unpacked variables with underscore. Same issue in overlay image build.
♻️ Proposed fix
```diff
  try:
-     image, logs = self.docker_client.images.build(
+     _image, _logs = self.docker_client.images.build(
          path=str(build_context),
          dockerfile="Dockerfile.bench",
          tag=bench_tag,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 271 - 284, The tuple returned by self.docker_client.images.build is being unpacked into image, logs but those variables are unused; change the unpacking to use underscore-prefixed names (e.g., _image, _logs or _, _) when calling self.docker_client.images.build inside the try block in rasb.py so linters don’t flag unused variables and intent is clear; keep the rest of the try/except (docker.errors.BuildError handling and LOG/error messages) unchanged.
590-595: Consider logging the container removal failure. Silent exception swallowing during cleanup hides potential issues.
♻️ Proposed fix
```diff
  finally:
      if container and not self.cfg.keep_containers:
          try:
              container.remove(force=True)
-         except Exception:
-             pass
+         except Exception as e:
+             LOG.debug("Failed to remove container: %s", e)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 590 - 595, The cleanup block currently swallows exceptions when removing the Docker container (variable container) if not self.cfg.keep_containers; change the empty except to log the failure instead of silencing it—use the class logger (e.g., self.logger) to record the exception and context (container id/name and that container.remove(force=True) failed), using logger.exception or logger.error(..., exc_info=True) so the stack trace is preserved, but keep the behavior of not re-raising the error so cleanup continues.
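Several of the comments above recommend logging rather than silently swallowing exceptions during container cleanup. A self-contained sketch of that pattern (the function, its arguments, and the duck-typed `container` object are illustrative, not the PR's code):

```python
import logging

LOG = logging.getLogger(__name__)

def cleanup(container, keep_containers=False):
    """Remove a container unless asked to keep it; log (but don't re-raise)
    removal failures so cleanup of other resources can continue.

    `container` is any object exposing remove(force=...); returns True only
    when removal actually succeeded.
    """
    if container is None or keep_containers:
        return False
    try:
        container.remove(force=True)
        return True
    except Exception as e:
        # Visible in debug logs instead of vanishing in a bare `pass`
        LOG.debug("Failed to remove container: %s", e)
        return False
```

The key point from the review: the exception is still suppressed so cleanup continues, but the failure leaves a trace.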
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/rasb-26h1/README.md`:
- Line 27: Update the typo in the README heading "### NVIDIA Inference API
(Ouickstart Example)" by changing "Ouickstart" to "Quickstart" so the heading
reads "### NVIDIA Inference API (Quickstart Example)"; locate and edit that
exact heading text to correct the spelling.
In `@nemo_skills/evaluation/evaluator/rasb.py`:
- Around line 62-71: The code silently defaults missing RASB schema fields by
using rasb_result.get(...) which can hide schema regressions; change access to
required fields to direct key access (e.g., rasb_result["passed"],
rasb_result["judgment_type"], rasb_result["judgment_details"],
rasb_result["parsed_output"], rasb_result["error"]) so the code raises a
KeyError on missing data, and only use .get() with sensible defaults for truly
optional fields; update any other occurrences (e.g., the later block around
lines 90-93) to follow the same pattern and add a small comment noting which
keys are required vs optional.
In `@nemo_skills/inference/eval/rasb_container/lm.py`:
- Around line 218-226: The string concatenation has wrong operator precedence in
the reasoning branch; change the expression inside the loop over response.output
(where item.type == "reasoning" and variable ret is built) to group the ternary
so the newline is appended to the whole result — e.g. use "Reasoning: " +
(item.content if item.content else "Empty.") + "\n" — ensuring item.content is
correctly used when present and "Empty." is used otherwise.
In `@nemo_skills/inference/eval/rasb.py`:
- Around line 746-750: total_samples can be zero causing a division-by-zero in
the LOG.info percent calculation; after computing total_samples and correct
(from results and existing_results) add a guard that if total_samples == 0 set
the percentage to 0.0 (or log a "no samples" message) and avoid dividing, then
call LOG.info using that safe percentage; update the block that computes
total_samples, correct and the LOG.info call in rasb.py to use this check so the
code never performs 100*correct/total_samples when total_samples is zero.
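To make the lm.py precedence issue above concrete: a conditional expression binds looser than `+`, so the prefix silently disappears in the empty case. A minimal demonstration (both functions are illustrative, not the module's code):

```python
def describe(content):
    # Buggy: parses as ("Reasoning: " + content) if content else "Empty.",
    # so an empty `content` yields just "Empty." with no prefix.
    return "Reasoning: " + content if content else "Empty."

def describe_fixed(content):
    # Fixed: parenthesize the ternary so the prefix and trailing
    # newline always apply, as the review suggests.
    return "Reasoning: " + (content if content else "Empty.") + "\n"
```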
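The division-by-zero guard suggested for rasb.py above amounts to the following pattern (function and variable names mirror the review's wording but are illustrative):

```python
def pass_rate_percent(correct, total_samples):
    """Safe percentage for progress logging: return 0.0 when there are no
    samples instead of raising ZeroDivisionError."""
    if total_samples == 0:
        return 0.0
    return 100.0 * correct / total_samples
```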
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d4b97314-7e69-4176-96fa-914946d045e6
📒 Files selected for processing (14)
- docs/evaluation/rasb.md
- mkdocs.yml
- nemo_skills/dataset/rasb-26h1/README.md
- nemo_skills/dataset/rasb-26h1/__init__.py
- nemo_skills/dataset/rasb-26h1/prepare.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/rasb.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/rasb_metrics.py
- nemo_skills/inference/eval/rasb.py
- nemo_skills/inference/eval/rasb_container/__init__.py
- nemo_skills/inference/eval/rasb_container/evaluate.py
- nemo_skills/inference/eval/rasb_container/judge.py
- nemo_skills/inference/eval/rasb_container/lm.py
Force-pushed from 535f043 to b66172d
Add docstrings to improve coverage from 68% to 100%:
- Module docstring describing the purpose
- OpenAILM methods: _defaults, query, aquery, query_messages, aquery_messages
- AnthropicLM methods: query, aquery, query_messages, aquery_messages
- Fake classes __init__ methods for OpenAI compatibility layer

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Peter Belcak <pbelcak@nvidia.com>
Actionable comments posted: 2
🧹 Nitpick comments (4)
nemo_skills/evaluation/metrics/rasb_metrics.py (1)
60-69: Call `super().reset()` instead of manually initializing base class fields. The `reset()` method manually initializes fields that `BaseMetrics.reset()` already handles. This creates maintenance burden if `BaseMetrics` changes.
♻️ Proposed fix
```diff
  def reset(self):
      """Reset all counters."""
-     self.total = 0
-     self.correct = 0
-     self.avg_tokens = 0
-     self.max_k = 1
-     self.min_start_time = float("inf")
-     self.max_end_time = float("-inf")
-     self.eval_dict = {"pass@1": {}}
-     self.all_scores = defaultdict(list)
+     super().reset()
+     self.correct = 0
+     self.max_k = 1  # RASB uses single predictions
+     self.eval_dict = {"pass@1": {}}  # Override default structure
      # RASB-specific counters
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/rasb_metrics.py` around lines 60 - 69, The reset() method in rasb_metrics.py is reinitializing fields that BaseMetrics.reset() already handles; modify the reset method to call super().reset() at the start (or replace the manual initializations) and then only set Rasb-specific fields (e.g., keep any fields unique to RasbMetrics such as eval_dict or all_scores if needed), ensuring you remove the duplicated assignments to total, correct, avg_tokens, max_k, min_start_time, and max_end_time so the class uses BaseMetrics.reset() for base state.

nemo_skills/inference/eval/rasb.py (2)
241-247: Unused variables from Docker build. The `image` and `logs` variables from the build call are never used.
✏️ Proposed fix
```diff
- image, logs = self.docker_client.images.build(
+ _image, _logs = self.docker_client.images.build(
      path=str(env_path),
      tag=base_tag,
      rm=True,
      forcerm=True,
      timeout=self.cfg.docker_build_timeout,
  )
```

Same applies at line 272.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 241 - 247, The docker_client.images.build(...) calls assign to image and logs but those variables are unused; update both places to discard the unused return values (e.g., assign to _ or _image, _logs) or call the method without capturing its return, ensuring the call still passes the same kwargs (timeout=self.cfg.docker_build_timeout, tag=base_tag, path=str(env_path), rm=True, forcerm=True) so behavior is unchanged; locate the builds by the docker_client.images.build(...) expressions in this module to apply the change.
590-595: Consider logging container removal failures. The silent exception pass during cleanup could hide Docker issues.
♻️ Proposed fix
```diff
  finally:
      if container and not self.cfg.keep_containers:
          try:
              container.remove(force=True)
-         except Exception:
-             pass
+         except Exception as e:
+             LOG.debug(f"[{env_id}] Failed to remove container: {e}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 590 - 595, The finally block swallows exceptions when removing the Docker container (container.remove) which can hide cleanup failures; change the except-pass to log the failure including the container identifier and exception details (e.g., catch Exception as e and call self.logger.warning or logging.exception with container.id or container.name and the error), while still honoring self.cfg.keep_containers so removal only happens when appropriate.

nemo_skills/inference/eval/rasb_container/evaluate.py (1)
87-88: Silent exception pass may hide real configuration issues. The bare `except Exception: pass` silently swallows all errors from `harness.apply_inputs`, which could mask legitimate configuration issues.
♻️ Proposed fix - log a debug message on failure
```diff
  try:
      result = _harness_apply_inputs(system_template, user_template, fields, input_mode)
      if isinstance(result, tuple) and len(result) == 2:
          sys_out, usr_out = result
          # Only accept plain text prompts; reject multimodal content blocks
          if isinstance(sys_out, str) and isinstance(usr_out, str):
              return result
          log.warning("Harness returned non-string prompt content (multimodal?), falling through to built-in")
- except Exception:
-     pass  # Fall through to built-in handling
+ except Exception as e:
+     log.debug("Harness apply_inputs failed, using built-in: %s", e)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/evaluate.py` around lines 87 - 88, The bare except around harness.apply_inputs swallows errors; replace the silent pass with logging of the exception so configuration issues aren't hidden — catch Exception as e in the except block around harness.apply_inputs and call the module's logger (or logging.getLogger(__name__)) to log a debug-level message including a short context string and the exception (e.g., logger.debug("harness.apply_inputs failed, continuing to built-in handling", exc_info=e)), keeping the fall-through behavior intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/rasb-26h1/prepare.py`:
- Around line 166-167: The two print statements using f-strings that have no
placeholders should be regular strings: remove the leading "f" from the print
calls that output "RASB 26H1 Preparation Summary" and the line of "=" characters
(and similarly the later print at line 185 with no interpolation) so they are
plain string literals; locate the print(...) calls in
nemo_skills/dataset/rasb-26h1/prepare.py and drop the unnecessary f prefix from
those print statements.
In `@nemo_skills/evaluation/metrics/rasb_metrics.py`:
- Around line 193-198: The loop in the pass rate collection uses an unused
variable env_id; rename it to _env_id in the loop header (for _env_id, counter
in self.by_env_id.items()) so linters don’t flag an unused variable — update the
loop that builds pass_rates in the method in rasb_metrics.py accordingly
(reference: self.by_env_id and pass_rates).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a6412717-4a26-4031-9dd1-733fbc177e57
📒 Files selected for processing (14)
- docs/evaluation/rasb.md
- mkdocs.yml
- nemo_skills/dataset/rasb-26h1/README.md
- nemo_skills/dataset/rasb-26h1/__init__.py
- nemo_skills/dataset/rasb-26h1/prepare.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/rasb.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/rasb_metrics.py
- nemo_skills/inference/eval/rasb.py
- nemo_skills/inference/eval/rasb_container/__init__.py
- nemo_skills/inference/eval/rasb_container/evaluate.py
- nemo_skills/inference/eval/rasb_container/judge.py
- nemo_skills/inference/eval/rasb_container/lm.py
✅ Files skipped from review due to trivial changes (6)
- mkdocs.yml
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/inference/eval/rasb_container/__init__.py
- nemo_skills/dataset/rasb-26h1/__init__.py
- docs/evaluation/rasb.md
- nemo_skills/inference/eval/rasb_container/judge.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/evaluation/evaluator/__init__.py
Actionable comments posted: 4
🧹 Nitpick comments (5)
nemo_skills/inference/eval/rasb_container/lm.py (1)
105-109: Inconsistent indentation in method bodies. The indentation of `model=self.model` and subsequent lines is inconsistent with the rest of the codebase.
✏️ Proposed fix for lines 105-109
```diff
  try:
-     response = self.client.chat.completions.create(  # type: ignore
-             model=self.model,
-             messages=messages,  # type: ignore
-             **kwargs,
-         )
+     response = self.client.chat.completions.create(  # type: ignore
+         model=self.model,
+         messages=messages,  # type: ignore
+         **kwargs,
+     )
```

Also applies to: 119-123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/lm.py` around lines 105 - 109, The call to self.client.chat.completions.create in lm.py has inconsistent hanging indentation: align the named arguments (model=self.model, messages=messages, **kwargs) under the opening parenthesis or use a consistent 4-space hanging indent so they line up with the first argument; update the block where response is assigned (the call in the method that uses self.client.chat.completions.create) and the other occurrence around the similar call later (the second create call at lines ~119-123) so both use the same indentation style.nemo_skills/inference/eval/rasb.py (2)
241-247: Prefix unused variables with underscore. The `image` and `logs` variables from `docker.images.build()` are never used.
✏️ Proposed fix
```diff
- image, logs = self.docker_client.images.build(
+ _image, _logs = self.docker_client.images.build(
      path=str(env_path),
      tag=base_tag,
      rm=True,
      forcerm=True,
      timeout=self.cfg.docker_build_timeout,
  )
```

Same at line 272:
```diff
- image, logs = self.docker_client.images.build(
+ _image, _logs = self.docker_client.images.build(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 241 - 247, The variables returned from docker_client.images.build (image and logs) are unused; update both occurrences (the docker_client.images.build calls in rasb.py) to prefix unused variables with an underscore (e.g., _image, _logs) so linters won’t flag them and intent is clear; locate the docker_client.images.build invocation(s) in the method and rename the returned variables accordingly for both places (the one shown and the one around line 272).
590-595: Log the exception when container removal fails.

Silently swallowing exceptions in the `finally` block can hide important debugging information.

✏️ Proposed fix
```diff
     finally:
         if container and not self.cfg.keep_containers:
             try:
                 container.remove(force=True)
-            except Exception:
-                pass
+            except Exception as e:
+                LOG.debug(f"Failed to remove container: {e}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 590 - 595, The finally block currently swallows errors when removing the container; update the container cleanup in rasb.py to catch the Exception from container.remove(force=True) and log it instead of passing — use the existing logger (e.g., self.logger or process logger available in the class) to emit an error/debug message that includes context (container id/name) and the exception details, and keep honoring self.cfg.keep_containers as before.

nemo_skills/inference/eval/rasb_container/evaluate.py (2)
96-97: Use `next(iter(...))` instead of a single-element slice.

✏️ Proposed fix
```diff
     elif input_mode == "direct_user_message":
-        return system_template, str(list(fields.values())[0]) if fields else ""
+        return system_template, str(next(iter(fields.values()))) if fields else ""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/evaluate.py` around lines 96 - 97, The branch handling input_mode == "direct_user_message" currently extracts the first value using list(fields.values())[0]; change it to use next(iter(fields.values())) to avoid creating a temporary list and improve efficiency; keep the existing empty-check behavior (i.e., return system_template and the first field as a string if fields truthy, else empty string) and update the expression in the evaluate.py branch for input_mode accordingly (reference: the input_mode == "direct_user_message" branch and the fields.values() usage).
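For reference, the two forms return the same value for a non-empty dict, but `next(iter(...))` avoids materializing a throwaway list. The sample `fields` dict below is made up for illustration:

```python
# Hypothetical fields dict, standing in for the parsed sample fields.
fields = {"question": "What is 2+2?", "context": "arithmetic"}

first_via_list = str(list(fields.values())[0])    # builds a full list first
first_via_iter = str(next(iter(fields.values())))  # stops after one item
```

Both expressions yield the first inserted value, since dicts preserve insertion order in Python 3.7+.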
40-43: Imports from container-local modules may fail at container build time.

These imports (`callable`, `tools`, `judge`) assume the modules exist in the container's working directory. If the overlay image build fails to copy these files correctly, the error message won't be immediately clear.

Consider adding a try-except with a more descriptive error message to aid debugging container setup issues.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/evaluate.py` around lines 40 - 43, Wrap the container-local imports (from callable import call; from tools import TOOLS, execute_tool; from judge import judge_requirements) in a try/except ImportError block inside evaluate.py and raise or log a new ImportError with a clear, actionable message that these local modules could not be found in the container (e.g., instructing to verify overlay image copy paths and that callable.py, tools.py, judge.py are present), preserving the original exception as context; this will make failures during container build/runtime easier to diagnose.
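A sketch of the suggested guarded import, written as a helper so it can be exercised outside the container. The function name and error wording are hypothetical, and `importlib` stands in for the plain `from callable import call` style imports in evaluate.py:

```python
import importlib

def import_container_modules(names=("callable", "tools", "judge")):
    """Import container-local modules, raising an actionable error when one is missing."""
    modules = {}
    name = None
    try:
        for name in names:
            modules[name] = importlib.import_module(name)
    except ImportError as exc:
        # Preserve the original exception as context while pointing the
        # reader at the likely cause: a broken overlay-image copy step.
        raise ImportError(
            f"Container-local module '{name}' was not found next to evaluate.py. "
            "Verify that the overlay image build copied callable.py, tools.py, "
            "and judge.py into the container working directory."
        ) from exc
    return modules
```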
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/rasb-26h1/prepare.py`:
- Line 156: The bug is that create_entry is called with data_dir as base_dir
which causes env_path.relative_to(base_dir) to raise ValueError when
--data_source makes environments_dir different from data_dir; fix by passing the
actual environments directory (environments_dir) or the correct base path that
contains env_path to create_entry instead of data_dir (i.e., change the call
entry = create_entry(env_path, input_file, metadata, data_dir) to use
environments_dir or compute a base_dir = environments_dir if env_path is under
it), or alternatively update create_entry to accept a None/default and compute
base_dir from env_path to avoid relative_to errors.
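The failure mode is easy to reproduce in isolation: `Path.relative_to` raises `ValueError` whenever the path does not live under the given base. The paths below are made up for illustration:

```python
from pathlib import Path

# Hypothetical paths: --data_source points environments somewhere
# other than the default data_dir.
env_path = Path("/custom/source/environments/env_001")
data_dir = Path("/workspace/data")
environments_dir = Path("/custom/source/environments")

# Wrong base: env_path is not under data_dir, so relative_to raises.
try:
    env_path.relative_to(data_dir)
    raised_value_error = False
except ValueError:
    raised_value_error = True

# Correct base: the directory that actually contains env_path.
rel = env_path.relative_to(environments_dir)
```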
In `@nemo_skills/inference/eval/rasb_container/evaluate.py`:
- Around line 382-384: The current loop in evaluate.py silently skips sample
files with no expected_output (the check using "if not expected" and printing
"Skipping {sf.name}") which removes them from the results and breaks alignment
with rasb.py; modify the code so that instead of only printing and continuing it
appends a result entry for that sample (e.g., a dict/object with sample
identifier sf.name, status like "skipped_no_expected" and a clear
message/reason) into the same results collection used for evaluated samples, and
ensure any downstream consumer expects/handles this status; keep the print/log
but also create and push the placeholder/skipped result so rasb.py can correlate
inputs to outputs.
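A minimal sketch of the suggested behavior; the field names (`sample`, `status`, `reason`) and the helper callables are illustrative, not the actual result schema:

```python
def evaluate_samples(sample_files, load_expected, run_eval):
    """Evaluate samples, emitting a placeholder result for those with no expected output.

    `load_expected` and `run_eval` are hypothetical callables standing in for
    the container's real loading and evaluation logic.
    """
    results = []
    for sf in sample_files:
        expected = load_expected(sf)
        if not expected:
            print(f"Skipping {sf['name']}")
            # Record the skip instead of silently dropping the sample, so the
            # orchestrator can still align inputs to outputs one-to-one.
            results.append({
                "sample": sf["name"],
                "status": "skipped_no_expected",
                "reason": "sample file has no expected_output",
            })
            continue
        results.append({
            "sample": sf["name"],
            "status": "evaluated",
            "result": run_eval(sf, expected),
        })
    return results
```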
In `@nemo_skills/inference/eval/rasb_container/lm.py`:
- Around line 63-71: The _defaults method currently merges instance-level
self.kwargs after the caller's kwargs which allows defaults to overwrite
caller-provided values; change the merge order so instance defaults are applied
first and caller kwargs take precedence by merging self.kwargs into kwargs
before returning (e.g., use a merge that places self.kwargs first), keeping the
existing checks for self.max_tokens and self.temperature so explicit caller
values for those keys still win.
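The fix comes down to dict merge order: in `{**a, **b}` the right-hand dict wins, so instance defaults must be spread first. A minimal sketch with hypothetical attribute names modeled on the comment:

```python
class LMClientSketch:
    """Illustrative stand-in for the lm.py client; only the default merging is shown."""

    def __init__(self, max_tokens=None, temperature=None, **kwargs):
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.kwargs = kwargs  # instance-level defaults

    def _defaults(self, **kwargs):
        # Instance defaults first, then caller kwargs, so the caller wins.
        merged = {**self.kwargs, **kwargs}
        # setdefault keeps explicit caller values for these keys intact.
        if self.max_tokens is not None:
            merged.setdefault("max_tokens", self.max_tokens)
        if self.temperature is not None:
            merged.setdefault("temperature", self.temperature)
        return merged
```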
In `@nemo_skills/inference/eval/rasb.py`:
- Around line 731-734: The current code appends each result to output_file
(variables: output_file, results) which risks partial duplicates if the process
fails mid-write; instead, finalize computation of results in memory then write
atomically: write all results to a temporary file (e.g., temp_output), fsync and
close it, then atomically replace/move it to output_file (or append in a
controlled atomic step) so partial writes cannot produce duplicated/partial
entries on restart; alternatively implement a per-sample write-tracking
mechanism that coordinates with skip_filled to record the highest written sample
index before writing so restarts can skip partially written samples (use unique
symbols results, output_file, skip_filled to locate the logic).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b19a67a5-42bc-4d40-b4d0-48c7cdf218a1
📒 Files selected for processing (14)
- docs/evaluation/rasb.md
- mkdocs.yml
- nemo_skills/dataset/rasb-26h1/README.md
- nemo_skills/dataset/rasb-26h1/__init__.py
- nemo_skills/dataset/rasb-26h1/prepare.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/rasb.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/rasb_metrics.py
- nemo_skills/inference/eval/rasb.py
- nemo_skills/inference/eval/rasb_container/__init__.py
- nemo_skills/inference/eval/rasb_container/evaluate.py
- nemo_skills/inference/eval/rasb_container/judge.py
- nemo_skills/inference/eval/rasb_container/lm.py
✅ Files skipped from review due to trivial changes (6)
- mkdocs.yml
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/evaluator/rasb.py
- nemo_skills/dataset/rasb-26h1/__init__.py
- docs/evaluation/rasb.md
- nemo_skills/inference/eval/rasb_container/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/inference/eval/rasb_container/judge.py
@pbelcak could you please recreate from a branch, so that our tests can run? Sent you an invite.

@titu1994 @wasiahmad @ludwig-n could you please review, as this looks related to your work?
RASB (Real Agent Scaffolds Bench) evaluates LLMs on complex agent scaffolding tasks scraped from open-source AI agent repositories. This adds support for the 26H1 evaluation snapshot with 193 environments and 5,731 test samples.
See the following NVIDIA tech report for details: internal link.
Features:
- Docker-based evaluation matching the original RASB benchmark
- Support for OpenAI, Anthropic, and other compatible APIs
- Aggregate metrics: mean, median, Q1, Q3, std across environments
- Per-environment, per-type, per-judgment breakdowns

Files:
- nemo_skills/dataset/rasb-26h1/: dataset module and preparation
- nemo_skills/inference/eval/rasb.py: generation with Docker orchestration
- nemo_skills/inference/eval/rasb_container/: container evaluation files
- nemo_skills/evaluation/metrics/rasb_metrics.py: metrics aggregation
- nemo_skills/evaluation/evaluator/rasb.py: result passthrough evaluator
- docs/evaluation/rasb.md: documentation
Summary by CodeRabbit

New Features
- RASB 26H1 benchmark integration: dataset preparation, Docker-based per-environment generation and evaluation, and evaluator/metrics aggregation covering 193 environments and 5,731 test samples.

Documentation
- New evaluation guide at docs/evaluation/rasb.md plus a corresponding MkDocs navigation entry.