Add RASB-26H1 benchmark integration #1402
Conversation
RASB (Real Agent Scaffolds Bench) evaluates LLMs on complex agent scaffolding tasks. This adds support for the 26H1 snapshot with 193 environments and 5,731 test samples.

Features:
- Docker-based evaluation matching the original RASB benchmark
- Support for OpenAI, Anthropic, and other compatible APIs
- Aggregate metrics: mean, median, Q1, Q3, std across environments
- Per-environment, per-type, per-judgment breakdowns

Files:
- nemo_skills/dataset/rasb-26h1/: Dataset module and preparation
- nemo_skills/inference/eval/rasb.py: Generation with Docker orchestration
- nemo_skills/inference/eval/rasb_container/: Container evaluation files
- nemo_skills/evaluation/metrics/rasb_metrics.py: Metrics aggregation
- nemo_skills/evaluation/evaluator/rasb.py: Result passthrough evaluator
- docs/evaluation/rasb.md: Documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Peter Belcak <pbelcak@nvidia.com>
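The aggregate metrics described above (mean, median, Q1, Q3, std across environments) can be sketched with the standard library; the function name and the list-of-percentages input are illustrative, not the PR's actual API:

```python
import statistics

def aggregate_pass_rates(per_env_pass_rates):
    """Aggregate per-environment pass rates (in %) into summary statistics.

    `per_env_pass_rates` is a hypothetical list of floats, one per environment;
    this is a sketch of the kind of aggregation the PR description mentions,
    not the code in rasb_metrics.py.
    """
    # statistics.quantiles with n=4 returns [Q1, median, Q3] (exclusive method)
    q = statistics.quantiles(per_env_pass_rates, n=4)
    return {
        "mean": statistics.mean(per_env_pass_rates),
        "median": statistics.median(per_env_pass_rates),
        "q1": q[0],
        "q3": q[2],
        "std": statistics.stdev(per_env_pass_rates) if len(per_env_pass_rates) > 1 else 0.0,
    }
```

Note that `statistics.quantiles` requires at least two data points, so a real implementation would also need to handle the single-environment case.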
📝 Walkthrough

Adds RASB 26H1: docs and MkDocs nav, a dataset package and prepare CLI, a Docker-based Hydra generation task that builds/runs per-environment containers, in-container evaluation/judging/LM adapters, evaluator/metrics integration, and result aggregation/output plumbing.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Orchestrator as Orchestration (rasb.py)
    participant Docker as Docker (Image Build)
    participant Container as Container (evaluate.py)
    participant LM as LM Endpoint (OpenAI/Anthropic)
    participant Judge as Judge (judge.py)
    participant Metrics as Metrics Collector
    Orchestrator->>Docker: build base image per environment
    Orchestrator->>Docker: build overlay (evaluate.py, judge.py, lm.py, callable.py, .env)
    Orchestrator->>Container: run container with mounted inputs/results
    Container->>LM: invoke callable with messages
    LM-->>Container: model response (may include tool calls)
    alt tool call present
        Container->>Container: execute tool locally
        Container->>LM: feed tool result back to callable
        LM-->>Container: updated response
    end
    Container->>Container: parse output (json/regex/tool/face-value)
    alt judgment == requirements
        Container->>Judge: send prompt + requirements (committee)
        Judge->>LM: query judge model(s)
        LM-->>Judge: judge responses
        Judge-->>Container: aggregated requirements judgment
    else judgment == exact
        Container->>Container: normalize & compare expected
    end
    Container->>Orchestrator: write results.json
    Orchestrator->>Metrics: map results to Skills-format, aggregate breakdowns
    Metrics-->>Orchestrator: pass rates & per-env stats
```
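The container's tool-call flow in the sequence diagram above can be illustrated with a minimal loop. All names here (`call_model`, `tools`, the message shapes) are illustrative stand-ins, not the actual RASB container API:

```python
def run_agent_loop(call_model, tools, user_message, max_turns=8):
    """Minimal sketch of the in-container flow: call the model, execute any
    requested tool locally, and feed the tool result back until the model
    returns a final answer.

    `call_model` is a hypothetical callable returning a dict with either a
    final "content" string or a "tool_call" request ({"name": ..., "args": ...});
    `tools` maps tool names to plain Python callables.
    """
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = call_model(messages)
        tool_call = response.get("tool_call")
        if tool_call is None:
            # Final answer, ready for parsing/judging downstream
            return response["content"]
        # Execute the tool locally and append its result for the next model call
        result = tools[tool_call["name"]](**tool_call["args"])
        messages.append({"role": "assistant", "content": "", "tool_call": tool_call})
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent loop did not terminate within max_turns")
```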
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Actionable comments posted: 4
🧹 Nitpick comments (9)
nemo_skills/dataset/rasb-26h1/__init__.py (1)
27-39: Consider adding RASB to slurm benchmark coverage. Given the Docker orchestration + custom evaluator/metrics flow, adding at least a small `rasb-26h1` slurm test path would help catch integration regressions early. Based on learnings, when enabling a new modality or adding complicated benchmark evaluation/metrics logic, the dataset should be considered for slurm tests for comprehensive evaluation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/rasb-26h1/__init__.py` around lines 27 - 39, Add the new rasb-26h1 dataset to the project's Slurm benchmark test coverage by updating the Slurm/CI test matrix to include a lightweight slurm job for this dataset (use a short/smoke config) so integration regressions for Docker orchestration and the custom evaluator/metrics are exercised; reference the dataset name rasb-26h1 and ensure the test invokes the dataset's GENERATION_MODULE ("nemo_skills.inference.eval.rasb"), honors REQUIRES_DATA_DIR=True (or uses a small synthetic data fixture), and validates METRICS_TYPE="rasb" and GENERATION_ARGS="++eval_type=rasb" on the EVAL_SPLIT="test" path. Ensure the job is marked small/optional to avoid long runs and include any necessary Docker permissions so the container orchestration path executes during the Slurm run.

nemo_skills/dataset/rasb-26h1/prepare.py (2)
166-167: Remove extraneous f-string prefixes. These f-strings don't contain any placeholders.
♻️ Proposed fix
```diff
- print(f"\nRASB 26H1 Preparation Summary")
- print(f"{'=' * 40}")
+ print("\nRASB 26H1 Preparation Summary")
+ print("=" * 40)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/rasb-26h1/prepare.py` around lines 166 - 167, The two print statements that use f-strings (the lines calling print(f"\nRASB 26H1 Preparation Summary") and print(f"{'=' * 40}")) have no interpolations; remove the unnecessary f prefixes so they are plain string literals (print("\nRASB 26H1 Preparation Summary") and print("=" * 40)) to avoid misleading f-string usage in prepare.py.
185-187: Remove extraneous f-string prefix.
♻️ Proposed fix
```diff
- print(f"\nSamples by environment type:")
+ print("\nSamples by environment type:")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/rasb-26h1/prepare.py` around lines 185 - 187, The print statement uses an unnecessary f-string prefix: change print(f"\nSamples by environment type:") to a regular string print("\nSamples by environment type:") while leaving the subsequent loop and its f-string (for env_type, count in sorted(type_counts.items(), key=lambda x: -x[1]): print(f" {env_type}: {count}")) intact; this removes the extraneous f prefix without affecting the interpolated prints.

nemo_skills/evaluation/metrics/rasb_metrics.py (1)
193-198: Rename unused loop variable. The `env_id` variable is not used in the loop body.
♻️ Proposed fix
```diff
  # Collect per-environment pass rates
  pass_rates = []
- for env_id, counter in self.by_env_id.items():
+ for _env_id, counter in self.by_env_id.items():
      if counter["total"] > 0:
          rate = 100.0 * counter["correct"] / counter["total"]
          pass_rates.append(rate)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/rasb_metrics.py` around lines 193 - 198, The loop over self.by_env_id.items() declares an unused variable env_id; change the loop to ignore the unused key (e.g., use _ or _env_id) or iterate over self.by_env_id.values() so only counter is used; update the for statement where pass_rates is built (the loop that currently reads "for env_id, counter in self.by_env_id.items():") to avoid the unused env_id symbol.

nemo_skills/inference/eval/rasb_container/evaluate.py (2)
85-88: Consider logging the exception in harness fallback. The silent `pass` on exception swallows potentially useful diagnostic information when `apply_inputs` fails.
🔧 Proposed fix
```diff
  try:
      result = _harness_apply_inputs(system_template, user_template, fields, input_mode)
      if isinstance(result, tuple) and len(result) == 2:
          sys_out, usr_out = result
          # Only accept plain text prompts; reject multimodal content blocks
          if isinstance(sys_out, str) and isinstance(usr_out, str):
              return result
          log.warning("Harness returned non-string prompt content (multimodal?), falling through to built-in")
- except Exception:
-     pass  # Fall through to built-in handling
+ except Exception as e:
+     log.debug("Harness apply_inputs failed, using built-in: %s", e)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/evaluate.py` around lines 85 - 88, The except block that currently swallows errors when calling apply_inputs should log the exception so failures are visible; replace the bare "pass" with a log.exception (or log.error with exc_info=True) call that includes context like "apply_inputs failed, falling back to built-in handling" and the exception details; update the except in the same try that surrounds apply_inputs/return result and keep the subsequent fallback behavior unchanged so diagnostics are preserved without altering control flow.
96-97: Minor: Use `next(iter(...))` for single element extraction.
♻️ Proposed fix
```diff
  elif input_mode == "direct_user_message":
-     return system_template, str(list(fields.values())[0]) if fields else ""
+     return system_template, str(next(iter(fields.values()))) if fields else ""
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/evaluate.py` around lines 96 - 97, The return currently constructs a list to grab the first value from fields (str(list(fields.values())[0])), which is inefficient; replace that extraction with str(next(iter(fields.values()))) in the branch handling input_mode == "direct_user_message" (in evaluate.py) while keeping the existing guard (if fields else "") so next(iter(...)) is only called when fields is non-empty.

nemo_skills/inference/eval/rasb.py (3)
240-252: Prefix unused unpacked variables with underscore. The `image` and `logs` variables from `images.build()` are not used.
♻️ Proposed fix
```diff
  try:
-     image, logs = self.docker_client.images.build(
+     _image, _logs = self.docker_client.images.build(
          path=str(env_path),
          tag=base_tag,
          rm=True,
          forcerm=True,
          timeout=self.cfg.docker_build_timeout,
      )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 240 - 252, The variables returned by docker_client.images.build (image, logs) are assigned but unused; in the build block inside rasb.py (the call to self.docker_client.images.build with tag=base_tag) rename the unpacked variables to unused-prefixed names (e.g., _image, _logs or _, _logs) to comply with the unused-variable convention and avoid linter warnings while keeping the same behavior in the try/except that logs success with LOG.info and handles docker.errors.BuildError.
271-284: Prefix unused unpacked variables with underscore. Same issue in overlay image build.
♻️ Proposed fix
```diff
  try:
-     image, logs = self.docker_client.images.build(
+     _image, _logs = self.docker_client.images.build(
          path=str(build_context),
          dockerfile="Dockerfile.bench",
          tag=bench_tag,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 271 - 284, The tuple returned by self.docker_client.images.build is being unpacked into image, logs but those variables are unused; change the unpacking to use underscore-prefixed names (e.g., _image, _logs or _, _) when calling self.docker_client.images.build inside the try block in rasb.py so linters don’t flag unused variables and intent is clear; keep the rest of the try/except (docker.errors.BuildError handling and LOG/error messages) unchanged.
590-595: Consider logging the container removal failure. Silent exception swallowing during cleanup hides potential issues.
♻️ Proposed fix
```diff
  finally:
      if container and not self.cfg.keep_containers:
          try:
              container.remove(force=True)
-         except Exception:
-             pass
+         except Exception as e:
+             LOG.debug("Failed to remove container: %s", e)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 590 - 595, The cleanup block currently swallows exceptions when removing the Docker container (variable container) if not self.cfg.keep_containers; change the empty except to log the failure instead of silencing it—use the class logger (e.g., self.logger) to record the exception and context (container id/name and that container.remove(force=True) failed), using logger.exception or logger.error(..., exc_info=True) so the stack trace is preserved, but keep the behavior of not re-raising the error so cleanup continues.
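Several of the comments above recommend logging rather than silently swallowing exceptions during container cleanup. A self-contained sketch of that pattern (the function, its arguments, and the duck-typed `container` object are illustrative, not the PR's code):

```python
import logging

LOG = logging.getLogger(__name__)

def cleanup(container, keep_containers=False):
    """Remove a container unless asked to keep it; log (but don't re-raise)
    removal failures so cleanup of other resources can continue.

    `container` is any object exposing remove(force=...); returns True only
    when removal actually succeeded.
    """
    if container is None or keep_containers:
        return False
    try:
        container.remove(force=True)
        return True
    except Exception as e:
        # Visible in debug logs instead of vanishing in a bare `pass`
        LOG.debug("Failed to remove container: %s", e)
        return False
```

The key point from the review: the exception is still suppressed so cleanup continues, but the failure leaves a trace.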
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/rasb-26h1/README.md`:
- Line 27: Update the typo in the README heading "### NVIDIA Inference API
(Ouickstart Example)" by changing "Ouickstart" to "Quickstart" so the heading
reads "### NVIDIA Inference API (Quickstart Example)"; locate and edit that
exact heading text to correct the spelling.
In `@nemo_skills/evaluation/evaluator/rasb.py`:
- Around line 62-71: The code silently defaults missing RASB schema fields by
using rasb_result.get(...) which can hide schema regressions; change access to
required fields to direct key access (e.g., rasb_result["passed"],
rasb_result["judgment_type"], rasb_result["judgment_details"],
rasb_result["parsed_output"], rasb_result["error"]) so the code raises a
KeyError on missing data, and only use .get() with sensible defaults for truly
optional fields; update any other occurrences (e.g., the later block around
lines 90-93) to follow the same pattern and add a small comment noting which
keys are required vs optional.
In `@nemo_skills/inference/eval/rasb_container/lm.py`:
- Around line 218-226: The string concatenation has wrong operator precedence in
the reasoning branch; change the expression inside the loop over response.output
(where item.type == "reasoning" and variable ret is built) to group the ternary
so the newline is appended to the whole result — e.g. use "Reasoning: " +
(item.content if item.content else "Empty.") + "\n" — ensuring item.content is
correctly used when present and "Empty." is used otherwise.
In `@nemo_skills/inference/eval/rasb.py`:
- Around line 746-750: total_samples can be zero causing a division-by-zero in
the LOG.info percent calculation; after computing total_samples and correct
(from results and existing_results) add a guard that if total_samples == 0 set
the percentage to 0.0 (or log a "no samples" message) and avoid dividing, then
call LOG.info using that safe percentage; update the block that computes
total_samples, correct and the LOG.info call in rasb.py to use this check so the
code never performs 100*correct/total_samples when total_samples is zero.
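To make the lm.py precedence issue above concrete: a conditional expression binds looser than `+`, so the prefix silently disappears in the empty case. A minimal demonstration (both functions are illustrative, not the module's code):

```python
def describe(content):
    # Buggy: parses as ("Reasoning: " + content) if content else "Empty.",
    # so an empty `content` yields just "Empty." with no prefix.
    return "Reasoning: " + content if content else "Empty."

def describe_fixed(content):
    # Fixed: parenthesize the ternary so the prefix and trailing
    # newline always apply, as the review suggests.
    return "Reasoning: " + (content if content else "Empty.") + "\n"
```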
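The division-by-zero guard suggested for rasb.py above amounts to the following pattern (function and variable names mirror the review's wording but are illustrative):

```python
def pass_rate_percent(correct, total_samples):
    """Safe percentage for progress logging: return 0.0 when there are no
    samples instead of raising ZeroDivisionError."""
    if total_samples == 0:
        return 0.0
    return 100.0 * correct / total_samples
```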
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d4b97314-7e69-4176-96fa-914946d045e6
📒 Files selected for processing (14)
- docs/evaluation/rasb.md
- mkdocs.yml
- nemo_skills/dataset/rasb-26h1/README.md
- nemo_skills/dataset/rasb-26h1/__init__.py
- nemo_skills/dataset/rasb-26h1/prepare.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/rasb.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/rasb_metrics.py
- nemo_skills/inference/eval/rasb.py
- nemo_skills/inference/eval/rasb_container/__init__.py
- nemo_skills/inference/eval/rasb_container/evaluate.py
- nemo_skills/inference/eval/rasb_container/judge.py
- nemo_skills/inference/eval/rasb_container/lm.py
Force-pushed from 535f043 to b66172d
Add docstrings to improve coverage from 68% to 100%:
- Module docstring describing the purpose
- OpenAILM methods: _defaults, query, aquery, query_messages, aquery_messages
- AnthropicLM methods: query, aquery, query_messages, aquery_messages
- Fake classes __init__ methods for OpenAI compatibility layer

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Peter Belcak <pbelcak@nvidia.com>
Actionable comments posted: 2
🧹 Nitpick comments (4)
nemo_skills/evaluation/metrics/rasb_metrics.py (1)
60-69: Call `super().reset()` instead of manually initializing base class fields. The `reset()` method manually initializes fields that `BaseMetrics.reset()` already handles. This creates maintenance burden if `BaseMetrics` changes.
♻️ Proposed fix
```diff
  def reset(self):
      """Reset all counters."""
-     self.total = 0
-     self.correct = 0
-     self.avg_tokens = 0
-     self.max_k = 1
-     self.min_start_time = float("inf")
-     self.max_end_time = float("-inf")
-     self.eval_dict = {"pass@1": {}}
-     self.all_scores = defaultdict(list)
+     super().reset()
+     self.correct = 0
+     self.max_k = 1  # RASB uses single predictions
+     self.eval_dict = {"pass@1": {}}  # Override default structure
      # RASB-specific counters
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/rasb_metrics.py` around lines 60 - 69, The reset() method in rasb_metrics.py is reinitializing fields that BaseMetrics.reset() already handles; modify the reset method to call super().reset() at the start (or replace the manual initializations) and then only set Rasb-specific fields (e.g., keep any fields unique to RasbMetrics such as eval_dict or all_scores if needed), ensuring you remove the duplicated assignments to total, correct, avg_tokens, max_k, min_start_time, and max_end_time so the class uses BaseMetrics.reset() for base state.

nemo_skills/inference/eval/rasb.py (2)
241-247: Unused variables from Docker build. The `image` and `logs` variables from the build call are never used.
✏️ Proposed fix
```diff
- image, logs = self.docker_client.images.build(
+ _image, _logs = self.docker_client.images.build(
      path=str(env_path),
      tag=base_tag,
      rm=True,
      forcerm=True,
      timeout=self.cfg.docker_build_timeout,
  )
```

Same applies at line 272.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 241 - 247, The docker_client.images.build(...) calls assign to image and logs but those variables are unused; update both places to discard the unused return values (e.g., assign to _ or _image, _logs) or call the method without capturing its return, ensuring the call still passes the same kwargs (timeout=self.cfg.docker_build_timeout, tag=base_tag, path=str(env_path), rm=True, forcerm=True) so behavior is unchanged; locate the builds by the docker_client.images.build(...) expressions in this module to apply the change.
590-595: Consider logging container removal failures. The silent exception pass during cleanup could hide Docker issues.
♻️ Proposed fix
```diff
  finally:
      if container and not self.cfg.keep_containers:
          try:
              container.remove(force=True)
-         except Exception:
-             pass
+         except Exception as e:
+             LOG.debug(f"[{env_id}] Failed to remove container: {e}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 590 - 595, The finally block swallows exceptions when removing the Docker container (container.remove) which can hide cleanup failures; change the except-pass to log the failure including the container identifier and exception details (e.g., catch Exception as e and call self.logger.warning or logging.exception with container.id or container.name and the error), while still honoring self.cfg.keep_containers so removal only happens when appropriate.

nemo_skills/inference/eval/rasb_container/evaluate.py (1)
87-88: Silent exception pass may hide real configuration issues. The bare `except Exception: pass` silently swallows all errors from `harness.apply_inputs`, which could mask legitimate configuration issues.
♻️ Proposed fix - log a debug message on failure
```diff
  try:
      result = _harness_apply_inputs(system_template, user_template, fields, input_mode)
      if isinstance(result, tuple) and len(result) == 2:
          sys_out, usr_out = result
          # Only accept plain text prompts; reject multimodal content blocks
          if isinstance(sys_out, str) and isinstance(usr_out, str):
              return result
          log.warning("Harness returned non-string prompt content (multimodal?), falling through to built-in")
- except Exception:
-     pass  # Fall through to built-in handling
+ except Exception as e:
+     log.debug("Harness apply_inputs failed, using built-in: %s", e)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/evaluate.py` around lines 87 - 88, The bare except around harness.apply_inputs swallows errors; replace the silent pass with logging of the exception so configuration issues aren't hidden — catch Exception as e in the except block around harness.apply_inputs and call the module's logger (or logging.getLogger(__name__)) to log a debug-level message including a short context string and the exception (e.g., logger.debug("harness.apply_inputs failed, continuing to built-in handling", exc_info=e)), keeping the fall-through behavior intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/rasb-26h1/prepare.py`:
- Around line 166-167: The two print statements using f-strings that have no
placeholders should be regular strings: remove the leading "f" from the print
calls that output "RASB 26H1 Preparation Summary" and the line of "=" characters
(and similarly the later print at line 185 with no interpolation) so they are
plain string literals; locate the print(...) calls in
nemo_skills/dataset/rasb-26h1/prepare.py and drop the unnecessary f prefix from
those print statements.
In `@nemo_skills/evaluation/metrics/rasb_metrics.py`:
- Around line 193-198: The loop in the pass rate collection uses an unused
variable env_id; rename it to _env_id in the loop header (for _env_id, counter
in self.by_env_id.items()) so linters don’t flag an unused variable — update the
loop that builds pass_rates in the method in rasb_metrics.py accordingly
(reference: self.by_env_id and pass_rates).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a6412717-4a26-4031-9dd1-733fbc177e57
📒 Files selected for processing (14)
- docs/evaluation/rasb.md
- mkdocs.yml
- nemo_skills/dataset/rasb-26h1/README.md
- nemo_skills/dataset/rasb-26h1/__init__.py
- nemo_skills/dataset/rasb-26h1/prepare.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/rasb.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/rasb_metrics.py
- nemo_skills/inference/eval/rasb.py
- nemo_skills/inference/eval/rasb_container/__init__.py
- nemo_skills/inference/eval/rasb_container/evaluate.py
- nemo_skills/inference/eval/rasb_container/judge.py
- nemo_skills/inference/eval/rasb_container/lm.py
✅ Files skipped from review due to trivial changes (6)
- mkdocs.yml
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/inference/eval/rasb_container/__init__.py
- nemo_skills/dataset/rasb-26h1/__init__.py
- docs/evaluation/rasb.md
- nemo_skills/inference/eval/rasb_container/judge.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/evaluation/evaluator/__init__.py
Actionable comments posted: 4
🧹 Nitpick comments (5)
nemo_skills/inference/eval/rasb_container/lm.py (1)
105-109: Inconsistent indentation in method bodies. The indentation of `model=self.model` and subsequent lines is inconsistent with the rest of the codebase.
✏️ Proposed fix for lines 105-109
```diff
  try:
-     response = self.client.chat.completions.create(  # type: ignore
-             model=self.model,
-             messages=messages,  # type: ignore
-             **kwargs,
-         )
+     response = self.client.chat.completions.create(  # type: ignore
+         model=self.model,
+         messages=messages,  # type: ignore
+         **kwargs,
+     )
```

Also applies to: 119-123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/lm.py` around lines 105 - 109, The call to self.client.chat.completions.create in lm.py has inconsistent hanging indentation: align the named arguments (model=self.model, messages=messages, **kwargs) under the opening parenthesis or use a consistent 4-space hanging indent so they line up with the first argument; update the block where response is assigned (the call in the method that uses self.client.chat.completions.create) and the other occurrence around the similar call later (the second create call at lines ~119-123) so both use the same indentation style.nemo_skills/inference/eval/rasb.py (2)
241-247: Prefix unused variables with underscore. The `image` and `logs` variables from `docker.images.build()` are never used.
✏️ Proposed fix
```diff
- image, logs = self.docker_client.images.build(
+ _image, _logs = self.docker_client.images.build(
      path=str(env_path),
      tag=base_tag,
      rm=True,
      forcerm=True,
      timeout=self.cfg.docker_build_timeout,
  )
```

Same at line 272:
```diff
- image, logs = self.docker_client.images.build(
+ _image, _logs = self.docker_client.images.build(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 241 - 247, The variables returned from docker_client.images.build (image and logs) are unused; update both occurrences (the docker_client.images.build calls in rasb.py) to prefix unused variables with an underscore (e.g., _image, _logs) so linters won’t flag them and intent is clear; locate the docker_client.images.build invocation(s) in the method and rename the returned variables accordingly for both places (the one shown and the one around line 272).
590-595: Log the exception when container removal fails.

Silently swallowing exceptions in the `finally` block can hide important debugging information.

✏️ Proposed fix
```diff
     finally:
         if container and not self.cfg.keep_containers:
             try:
                 container.remove(force=True)
-            except Exception:
-                pass
+            except Exception as e:
+                LOG.debug(f"Failed to remove container: {e}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb.py` around lines 590 - 595, The finally block currently swallows errors when removing the container; update the container cleanup in rasb.py to catch the Exception from container.remove(force=True) and log it instead of passing — use the existing logger (e.g., self.logger or process logger available in the class) to emit an error/debug message that includes context (container id/name) and the exception details, and keep honoring self.cfg.keep_containers as before.

nemo_skills/inference/eval/rasb_container/evaluate.py (2)
96-97: Use `next(iter(...))` instead of a single-element slice.

✏️ Proposed fix
```diff
     elif input_mode == "direct_user_message":
-        return system_template, str(list(fields.values())[0]) if fields else ""
+        return system_template, str(next(iter(fields.values()))) if fields else ""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/evaluate.py` around lines 96 - 97, The branch handling input_mode == "direct_user_message" currently extracts the first value using list(fields.values())[0]; change it to use next(iter(fields.values())) to avoid creating a temporary list and improve efficiency; keep the existing empty-check behavior (i.e., return system_template and the first field as a string if fields truthy, else empty string) and update the expression in the evaluate.py branch for input_mode accordingly (reference: the input_mode == "direct_user_message" branch and the fields.values() usage).
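For reference, the two forms return the same value for a non-empty dict, but `next(iter(...))` avoids materializing a throwaway list. The sample `fields` dict below is made up for illustration:

```python
# Hypothetical fields dict, standing in for the parsed sample fields.
fields = {"question": "What is 2+2?", "context": "arithmetic"}

first_via_list = str(list(fields.values())[0])    # builds a full list first
first_via_iter = str(next(iter(fields.values())))  # stops after one item
```

Both expressions yield the first inserted value, since dicts preserve insertion order in Python 3.7+.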
40-43: Imports from container-local modules may fail at container build time.

These imports (`callable`, `tools`, `judge`) assume the modules exist in the container's working directory. If the overlay image build fails to copy these files correctly, the error message won't be immediately clear.

Consider adding a try-except with a more descriptive error message to aid debugging container setup issues.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/rasb_container/evaluate.py` around lines 40 - 43, Wrap the container-local imports (from callable import call; from tools import TOOLS, execute_tool; from judge import judge_requirements) in a try/except ImportError block inside evaluate.py and raise or log a new ImportError with a clear, actionable message that these local modules could not be found in the container (e.g., instructing to verify overlay image copy paths and that callable.py, tools.py, judge.py are present), preserving the original exception as context; this will make failures during container build/runtime easier to diagnose.
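A sketch of the suggested guarded import, written as a helper so it can be exercised outside the container. The function name and error wording are hypothetical, and `importlib` stands in for the plain `from callable import call` style imports in evaluate.py:

```python
import importlib

def import_container_modules(names=("callable", "tools", "judge")):
    """Import container-local modules, raising an actionable error when one is missing."""
    modules = {}
    name = None
    try:
        for name in names:
            modules[name] = importlib.import_module(name)
    except ImportError as exc:
        # Preserve the original exception as context while pointing the
        # reader at the likely cause: a broken overlay-image copy step.
        raise ImportError(
            f"Container-local module '{name}' was not found next to evaluate.py. "
            "Verify that the overlay image build copied callable.py, tools.py, "
            "and judge.py into the container working directory."
        ) from exc
    return modules
```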
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/rasb-26h1/prepare.py`:
- Line 156: The bug is that create_entry is called with data_dir as base_dir
which causes env_path.relative_to(base_dir) to raise ValueError when
--data_source makes environments_dir different from data_dir; fix by passing the
actual environments directory (environments_dir) or the correct base path that
contains env_path to create_entry instead of data_dir (i.e., change the call
entry = create_entry(env_path, input_file, metadata, data_dir) to use
environments_dir or compute a base_dir = environments_dir if env_path is under
it), or alternatively update create_entry to accept a None/default and compute
base_dir from env_path to avoid relative_to errors.
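The failure mode is easy to reproduce in isolation: `Path.relative_to` raises `ValueError` whenever the path does not live under the given base. The paths below are made up for illustration:

```python
from pathlib import Path

# Hypothetical paths: --data_source points environments somewhere
# other than the default data_dir.
env_path = Path("/custom/source/environments/env_001")
data_dir = Path("/workspace/data")
environments_dir = Path("/custom/source/environments")

# Wrong base: env_path is not under data_dir, so relative_to raises.
try:
    env_path.relative_to(data_dir)
    raised_value_error = False
except ValueError:
    raised_value_error = True

# Correct base: the directory that actually contains env_path.
rel = env_path.relative_to(environments_dir)
```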
In `@nemo_skills/inference/eval/rasb_container/evaluate.py`:
- Around line 382-384: The current loop in evaluate.py silently skips sample
files with no expected_output (the check using "if not expected" and printing
"Skipping {sf.name}") which removes them from the results and breaks alignment
with rasb.py; modify the code so that instead of only printing and continuing it
appends a result entry for that sample (e.g., a dict/object with sample
identifier sf.name, status like "skipped_no_expected" and a clear
message/reason) into the same results collection used for evaluated samples, and
ensure any downstream consumer expects/handles this status; keep the print/log
but also create and push the placeholder/skipped result so rasb.py can correlate
inputs to outputs.
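A minimal sketch of the suggested behavior; the field names (`sample`, `status`, `reason`) and the helper callables are illustrative, not the actual result schema:

```python
def evaluate_samples(sample_files, load_expected, run_eval):
    """Evaluate samples, emitting a placeholder result for those with no expected output.

    `load_expected` and `run_eval` are hypothetical callables standing in for
    the container's real loading and evaluation logic.
    """
    results = []
    for sf in sample_files:
        expected = load_expected(sf)
        if not expected:
            print(f"Skipping {sf['name']}")
            # Record the skip instead of silently dropping the sample, so the
            # orchestrator can still align inputs to outputs one-to-one.
            results.append({
                "sample": sf["name"],
                "status": "skipped_no_expected",
                "reason": "sample file has no expected_output",
            })
            continue
        results.append({
            "sample": sf["name"],
            "status": "evaluated",
            "result": run_eval(sf, expected),
        })
    return results
```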
In `@nemo_skills/inference/eval/rasb_container/lm.py`:
- Around line 63-71: The _defaults method currently merges instance-level
self.kwargs after the caller's kwargs which allows defaults to overwrite
caller-provided values; change the merge order so instance defaults are applied
first and caller kwargs take precedence by merging self.kwargs into kwargs
before returning (e.g., use a merge that places self.kwargs first), keeping the
existing checks for self.max_tokens and self.temperature so explicit caller
values for those keys still win.
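The fix comes down to dict merge order: in `{**a, **b}` the right-hand dict wins, so instance defaults must be spread first. A minimal sketch with hypothetical attribute names modeled on the comment:

```python
class LMClientSketch:
    """Illustrative stand-in for the lm.py client; only the default merging is shown."""

    def __init__(self, max_tokens=None, temperature=None, **kwargs):
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.kwargs = kwargs  # instance-level defaults

    def _defaults(self, **kwargs):
        # Instance defaults first, then caller kwargs, so the caller wins.
        merged = {**self.kwargs, **kwargs}
        # setdefault keeps explicit caller values for these keys intact.
        if self.max_tokens is not None:
            merged.setdefault("max_tokens", self.max_tokens)
        if self.temperature is not None:
            merged.setdefault("temperature", self.temperature)
        return merged
```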
In `@nemo_skills/inference/eval/rasb.py`:
- Around line 731-734: The current code appends each result to output_file
(variables: output_file, results) which risks partial duplicates if the process
fails mid-write; instead, finalize computation of results in memory then write
atomically: write all results to a temporary file (e.g., temp_output), fsync and
close it, then atomically replace/move it to output_file (or append in a
controlled atomic step) so partial writes cannot produce duplicated/partial
entries on restart; alternatively implement a per-sample write-tracking
mechanism that coordinates with skip_filled to record the highest written sample
index before writing so restarts can skip partially written samples (use unique
symbols results, output_file, skip_filled to locate the logic).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b19a67a5-42bc-4d40-b4d0-48c7cdf218a1
📒 Files selected for processing (14)
- docs/evaluation/rasb.md
- mkdocs.yml
- nemo_skills/dataset/rasb-26h1/README.md
- nemo_skills/dataset/rasb-26h1/__init__.py
- nemo_skills/dataset/rasb-26h1/prepare.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/rasb.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/rasb_metrics.py
- nemo_skills/inference/eval/rasb.py
- nemo_skills/inference/eval/rasb_container/__init__.py
- nemo_skills/inference/eval/rasb_container/evaluate.py
- nemo_skills/inference/eval/rasb_container/judge.py
- nemo_skills/inference/eval/rasb_container/lm.py
✅ Files skipped from review due to trivial changes (6)
- mkdocs.yml
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/evaluator/rasb.py
- nemo_skills/dataset/rasb-26h1/__init__.py
- docs/evaluation/rasb.md
- nemo_skills/inference/eval/rasb_container/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/inference/eval/rasb_container/judge.py
@pbelcak could you please recreate from a branch, so that our tests can run? Sent you an invite.

@titu1994 @wasiahmad @ludwig-n could you please review, as this looks related to your work?
RASB (Real Agent Scaffolds Bench) evaluates LLMs on complex agent scaffolding tasks scraped from open-source AI agent repositories. This adds support for the 26H1 evaluation snapshot with 193 environments and 5,731 test samples.
See the following NVIDIA tech report for details: internal link.
Features:
- Docker-based evaluation matching the original RASB benchmark
- Support for OpenAI, Anthropic, and other compatible APIs
- Aggregate metrics: mean, median, Q1, Q3, std across environments
- Per-environment, per-type, per-judgment breakdowns

Files:
- nemo_skills/dataset/rasb-26h1/: dataset module and preparation
- nemo_skills/inference/eval/rasb.py: generation with Docker orchestration
- nemo_skills/inference/eval/rasb_container/: container evaluation files
- nemo_skills/evaluation/metrics/rasb_metrics.py: metrics aggregation
- nemo_skills/evaluation/evaluator/rasb.py: result passthrough evaluator
- docs/evaluation/rasb.md: documentation
Summary by CodeRabbit

New Features
- RASB 26H1 benchmark integration: dataset preparation, Docker-based per-environment generation and evaluation, and evaluator/metrics aggregation covering 193 environments and 5,731 test samples.

Documentation
- New evaluation guide at docs/evaluation/rasb.md plus a corresponding MkDocs navigation entry.