
RASB (Real Agent Scaffolds Bench)

Overview

RASB evaluates how well LLMs follow complex agent scaffolding rather than solving problems autonomously. It tests models as "bearers of scaffolds": following extensive system prompts, using tools correctly, and producing outputs in specified formats.

The benchmark is released as semi-annual (twice-yearly) snapshots:

  • 26H1: the first 2026 snapshot, with 193 environments and 5,731 verified synthetic test samples drawn from 63 real agent repositories

Architecture

The Skills integration follows the original RASB benchmark architecture for reproducibility:

  1. Base image: Built from each environment's Dockerfile
  2. Overlay image: Adds evaluation infrastructure (evaluate.py, judge.py, lm.py, callable)
  3. Container execution: Runs evaluate.py which processes all synth_*.json samples
  4. Results collection: Reads results/results.json from the container

The container logic is identical to the original RASB benchmark. Only orchestration and results aggregation are adapted for Skills.

Benchmark Definition

Environment Types

RASB covers six environment types based on the primary task:

| Type | Samples | Description |
| --- | --- | --- |
| generation | 2,463 | Content generation (text, code, structured data) |
| evaluation | 1,075 | Judging, scoring, or comparing outputs |
| retrieval | 860 | Finding relevant information from context |
| extraction | 720 | Extracting structured data from unstructured input |
| coding | 330 | Code generation with execution validation |
| codebase | 283 | Working with multi-file code repositories |

Data Preparation

RASB requires Docker for evaluation. Each environment contains a Dockerfile that builds the execution container.

Prerequisites

  1. Docker must be installed and running
  2. Access to RASB 26H1 environment data (contact the RASB maintainers or refer to the RASB technical report for data access)

Setup

Once you have obtained the RASB 26H1 data, link the environments to the dataset directory:

ln -s /path/to/rasb-26h1/26h1 nemo_skills/dataset/rasb-26h1/26h1

Prepare the benchmark data:

ns prepare_data rasb-26h1

Or specify a custom data source:

ns prepare_data rasb-26h1 --data_source=/path/to/rasb-26h1/26h1

This creates test.jsonl with pointers to each environment and input file.
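Each line of test.jsonl is a JSON object pointing at one environment and input file. The exact schema is produced by the preparation step, so the field names in this single-line illustration are hypothetical:

{"environment": "26h1/<env_id>", "input_file": "synth_0001.json"}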

Evaluation

Quickstart Example

ns eval \
  --benchmarks rasb-26h1 \
  --server_type openai \
  --model azure/anthropic/claude-opus-4-5 \
  --server_address https://inference-api.nvidia.com \
  --output_dir /workspace/rasb-eval

Supported Endpoints and Models

RASB supports multiple API types and model providers:

| Server Type | Compatible Models | Example Endpoint |
| --- | --- | --- |
| openai | GPT-4, GPT-4o, Claude (via proxy), Gemini (via proxy) | OpenAI-compatible APIs |
| anthropic | Claude models | Anthropic SDK-compatible APIs |

Supported endpoint types:

  • OpenAI-compatible APIs (completions and responses endpoints)
  • Anthropic SDK-compatible APIs
  • Local model servers with compatible APIs
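For example, the quickstart run against an Anthropic SDK-compatible endpoint only swaps the server flags; the model name and address below are placeholders:

ns eval \
  --benchmarks rasb-26h1 \
  --server_type anthropic \
  --model claude-opus-4-5 \
  --server_address https://api.anthropic.com \
  --output_dir /workspace/rasb-eval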

Controlling Evaluation Scope

The ++max_samples parameter controls how many samples (and thus environments) to evaluate. Samples are ordered by environment in test.jsonl, with each environment containing ~30 samples (range: 20-30).

| max_samples | Environments | Use case |
| --- | --- | --- |
| 30 | ~1 | Quick smoke test |
| 300 | ~10 | Development testing |
| 3000 | ~100 | Partial benchmark |
| -1 | all 193 | Full benchmark (5,731 samples) |

Formula: To evaluate N environments, set max_samples to approximately N × 30.
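For example, a development run over roughly 20 environments:

ns eval ... ++max_samples=600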

Docker Configuration

RASB runs each environment in an isolated Docker container (processing all samples for that environment). Key configuration options:

| Parameter | Default | Description |
| --- | --- | --- |
| ++docker_timeout | 1800 | Timeout per environment in seconds (30 min) |
| ++docker_memory_limit | 4g | Memory limit per container |
| ++max_concurrent_containers | 2 | Number of containers run in parallel |
| ++keep_containers | False | Keep containers after evaluation for debugging |
| ++rebuild_images | False | Force a rebuild of the Docker images |
| ++docker_build_timeout | 600 | Image build timeout in seconds (10 min) |

Example with Docker Settings

ns eval \
  --benchmarks rasb-26h1 \
  --server_type openai \
  --model gpt-4o \
  --output_dir /workspace/rasb-eval \
  ++docker_timeout=3600 \
  ++max_concurrent_containers=4 \
  ++docker_memory_limit=8g

Judgment Types

RASB uses two judgment types (evaluated inside the container):

| Type | Description |
| --- | --- |
| exact | Exact match comparison (normalized JSON or string) |
| requirements | LLM judge committee evaluates against a requirements list |

The requirements-based judgment uses a committee of LLM judges for open-ended tasks where exact matching isn't appropriate.
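As a rough illustration of what "normalized JSON" matching means (the actual comparison is implemented by judge.py inside the container; this is only a sketch), two outputs count as equal once key order and whitespace are normalized away:

# Sketch of a normalized-JSON comparison, not the actual judge.py logic
diff <(python -m json.tool --sort-keys expected.json) \
     <(python -m json.tool --sort-keys output.json)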

Output Parsing Strategies

RASB uses four output parsing strategies defined per environment:

| Strategy | Description |
| --- | --- |
| face_value | Use model output as-is |
| json_parse | Parse output as JSON (handles markdown fences) |
| regex_extraction | Extract output using regex patterns from metadata |
| tool_call_result | Extract from tool call arguments |
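For instance, a json_parse environment has to tolerate a response wrapped in markdown code fences. Conceptually, the parser does something like the following sketch before decoding (the real implementation lives in the container):

# Sketch only: drop fence lines, then parse the remainder as JSON
sed '/^```/d' output.txt | python -m json.tool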

Benchmark Results

Results from the RASB 26H1 technical report (pass rates in %):

| Model | Overall | Mean | Std | Median | Q1 | Q3 |
| --- | --- | --- | --- | --- | --- | --- |
| Claude 4.5 Opus | 85.8 | 85.5 | 17.2 | 90.0 | 80.0 | 96.7 |
| Claude 4.6 Opus | 83.0 | 82.8 | 19.1 | 90.0 | 76.7 | 96.7 |
| Claude 4.6 Sonnet | 78.4 | 78.0 | 23.5 | 86.7 | 70.0 | 96.0 |
| Claude 4.7 Opus | 78.6 | 78.3 | 23.1 | 86.7 | 66.7 | 96.7 |
| Claude 4.5 Sonnet | 77.0 | 76.7 | 22.2 | 83.3 | 66.7 | 93.3 |
| Claude 4.5 Haiku | 71.1 | 70.9 | 24.9 | 76.7 | 60.0 | 90.0 |
| Gemini 3 Flash | 71.1 | 70.8 | 23.7 | 76.7 | 56.7 | 90.0 |
| GPT-5.1 | 70.9 | 70.6 | 28.0 | 76.7 | 60.0 | 90.0 |
| GPT-5.3 Codex | 70.4 | 70.0 | 28.1 | 76.7 | 56.7 | 92.0 |
| Gemini 3.1 Pro | 70.4 | 70.1 | 25.1 | 73.3 | 52.0 | 93.3 |
| GPT-5.3 | 68.9 | 68.4 | 29.0 | 76.7 | 53.3 | 90.0 |
| Gemini 2.5 Flash | 66.7 | 66.4 | 24.8 | 70.0 | 46.7 | 84.0 |
| Gemini 3.1 Flash Lite | 64.8 | 64.5 | 27.7 | 66.7 | 46.7 | 86.7 |
| Nemotron Super | 63.3 | 62.9 | 25.9 | 63.3 | 43.3 | 83.3 |
| Nemotron Nano | 55.6 | 55.2 | 28.0 | 53.3 | 33.3 | 80.0 |

Overall is the sample-weighted pass rate across all 5,731 samples. Mean, Std, Median, Q1, and Q3 are computed over per-environment pass rates (193 environments).

Metrics

RASB reports pass rates aggregated by:

  • Overall: Total correct / total samples
  • By environment type: Pass rate per type (generation, evaluation, etc.)
  • By output parsing: Pass rate by parsing strategy
  • By judgment type: Pass rate for exact vs requirements judgments
  • By tool usage: Pass rate for samples with/without tools
  • By repository: Pass rate per source repository

Aggregate Statistics

RASB also computes statistics across all environments:

| Metric | Description |
| --- | --- |
| mean_pass_rate | Average pass rate across environments |
| median_pass_rate | Median pass rate (50th percentile) |
| q1_pass_rate | First quartile (25th percentile) |
| q3_pass_rate | Third quartile (75th percentile) |
| std_pass_rate | Standard deviation of pass rates |
| overall_pass_rate | Total correct / total samples |
| num_environments | Number of environments evaluated |

The mean treats all environments equally regardless of sample count, while the overall rate weights by sample count. The quartiles help identify performance distribution across environments.
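For example, given one environment scoring 27/30 (90%) and another scoring 10/20 (50%), the mean pass rate is (90 + 50) / 2 = 70%, while the overall pass rate is (27 + 10) / (30 + 20) = 74%.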

Example Metrics Output

{
  "pass@1": {
    "pass_rate": 72.5,
    "pass_rate_generation": 78.2,
    "pass_rate_evaluation": 65.1,
    "pass_rate_retrieval": 71.8,
    "pass_rate_judgment_exact": 85.3,
    "pass_rate_judgment_requirements": 58.7,
    "pass_rate_with_tools": 68.4,
    "pass_rate_no_tools": 74.1,
    "errors": 15,
    "container_errors": 2,
    "mean_pass_rate": 71.3,
    "median_pass_rate": 73.5,
    "q1_pass_rate": 58.2,
    "q3_pass_rate": 85.0,
    "std_pass_rate": 18.7,
    "overall_pass_rate": 72.5,
    "num_environments": 193
  }
}

Execution Flow

  1. Group samples: Samples grouped by environment
  2. Build base image: Docker image built from environment's Dockerfile (cached)
  3. Build overlay image: Adds evaluate.py, judge.py, lm.py, callable, .env
  4. Run container: Container executes evaluate.py processing all synth_*.json files
  5. Collect results: Results read from results/results.json
  6. Map to samples: Container results mapped back to individual sample entries
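In plain Docker terms, the per-environment flow is roughly equivalent to the commands below; the image tags and overlay Dockerfile name are illustrative, since the orchestration performs these steps automatically:

# Illustrative sketch of steps 2-5 (names are not the real orchestration's)
docker build -t rasb-base:env_001 nemo_skills/dataset/rasb-26h1/26h1/env_001/
docker build -t rasb-overlay:env_001 -f overlay.Dockerfile .
docker run --rm --network host rasb-overlay:env_001 python evaluate.py
# results/results.json is then read back for aggregation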

Troubleshooting

Container Build Failures

Check the environment's Dockerfile and requirements:

cd nemo_skills/dataset/rasb-26h1/26h1/<environment>/
docker build -t test .

Debugging Container Execution

Enable keep_containers to inspect failed containers:

ns eval ... ++keep_containers=True

Then inspect the per-environment logs and results:

# Check the results directory
ls /workspace/rasb-eval/environments/<env_id>/
cat /workspace/rasb-eval/environments/<env_id>/container_stdout.log
cat /workspace/rasb-eval/environments/<env_id>/results.json
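If ++keep_containers=True was set, the stopped containers themselves can also be examined with standard Docker commands:

# List all containers (including exited ones), then read one container's output
docker ps -a
docker logs <container_id>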

Network Issues

By default, containers use host network mode to access the LLM server. If your server is not on localhost, ensure the container can reach it.
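Host network mode corresponds to Docker's --network host flag, so a server bound to localhost on the host is directly reachable from inside the container. For a remote server, a quick reachability check from a throwaway container can rule out networking problems (the endpoint shown is the one from the quickstart):

# Verify that a container can reach the inference endpoint
docker run --rm --network host curlimages/curl -s https://inference-api.nvidia.com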

API Credentials

The container needs API credentials. Set the appropriate environment variable before running:

# For OpenAI
export OPENAI_API_KEY=your-openai-key

# For Anthropic
export ANTHROPIC_API_KEY=your-anthropic-key

# For NVIDIA Inference API
export NVAPI_KEY=your-nvidia-key

Ensure the endpoint URL and model name match your API provider. Connection errors inside Docker containers often indicate mismatched endpoint/model configuration.