RASB evaluates how well LLMs follow complex agent scaffolding rather than solving problems autonomously. It tests models as "bearers of scaffolds" - following extensive system prompts, using tools correctly, and producing outputs in specified formats.
The benchmark is released in semi-annual snapshots:
- 26H1: First 2026 snapshot with 193 environments and 5,731 verified synthetic test samples from 63 real agent repositories
The Skills integration follows the original RASB benchmark architecture for reproducibility:
- Base image: Built from each environment's Dockerfile
- Overlay image: Adds evaluation infrastructure (`evaluate.py`, `judge.py`, `lm.py`, `callable`)
- Container execution: Runs `evaluate.py`, which processes all `synth_*.json` samples
- Results collection: Reads `results/results.json` from the container
The container logic is identical to the original RASB benchmark. Only orchestration and results aggregation are adapted for Skills.
- Dataset: `nemo_skills/dataset/rasb-26h1/__init__.py`
- Generation module: `nemo_skills/inference/eval/rasb.py`
- Container files: `nemo_skills/inference/eval/rasb_container/`
- Evaluator: `nemo_skills/evaluation/evaluator/rasb.py`
RASB covers six environment types based on the primary task:
| Type | Samples | Description |
|---|---|---|
| generation | 2,463 | Content generation (text, code, structured data) |
| evaluation | 1,075 | Judging, scoring, or comparing outputs |
| retrieval | 860 | Finding relevant information from context |
| extraction | 720 | Extracting structured data from unstructured input |
| coding | 330 | Code generation with execution validation |
| codebase | 283 | Working with multi-file code repositories |
RASB requires Docker for evaluation. Each environment contains a Dockerfile that builds the execution container.
- Docker must be installed and running
- Access to RASB 26H1 environment data (contact the RASB maintainers or refer to the RASB technical report for data access)
Once you have obtained the RASB 26H1 data, link the environments to the dataset directory:
```bash
ln -s /path/to/rasb-26h1/26h1 nemo_skills/dataset/rasb-26h1/26h1
```

Prepare the benchmark data:

```bash
ns prepare_data rasb-26h1
```

Or specify a custom data source:

```bash
ns prepare_data rasb-26h1 --data_source=/path/to/rasb-26h1/26h1
```

This creates `test.jsonl` with pointers to each environment and input file.
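To sanity-check the prepared data, you can count samples and environments directly. The snippet below is a minimal sketch; the `environment` field name is an assumption about the `test.jsonl` schema, not guaranteed:

```python
import json
from collections import Counter

# Count samples and environments in the prepared test.jsonl.
# NOTE: "environment" is a hypothetical field name; adjust to the actual schema.
with open("nemo_skills/dataset/rasb-26h1/test.jsonl") as f:
    samples = [json.loads(line) for line in f]

print(f"{len(samples)} samples")  # expect 5,731 for the full 26H1 snapshot
env_counts = Counter(s.get("environment", "?") for s in samples)
print(f"{len(env_counts)} environments")  # expect 193
```

With the data in place, run the evaluation: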
```bash
ns eval \
--benchmarks rasb-26h1 \
--server_type openai \
--model azure/anthropic/claude-opus-4-5 \
--server_address https://inference-api.nvidia.com \
--output_dir /workspace/rasb-eval
```

RASB supports multiple API types and model providers:
| Server Type | Compatible Models | Example Endpoint |
|---|---|---|
| `openai` | GPT-4, GPT-4o, Claude (via proxy), Gemini (via proxy) | OpenAI-compatible APIs |
| `anthropic` | Claude models | Anthropic SDK-compatible APIs |
Supported endpoint types:
- OpenAI-compatible APIs (completions and responses endpoints)
- Anthropic SDK-compatible APIs
- Local model servers with compatible APIs
The `++max_samples` parameter controls how many samples (and thus environments) to evaluate. Samples are ordered by environment in `test.jsonl`, with each environment containing ~30 samples (range: 20-30).
| max_samples | Environments | Use case |
|---|---|---|
| `30` | ~1 | Quick smoke test |
| `300` | ~10 | Development testing |
| `3000` | ~100 | Partial benchmark |
| `-1` | all 193 | Full benchmark (5,731 samples) |
Formula: To evaluate N environments, set `++max_samples` to approximately N × 30.
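A trivial helper makes the formula explicit (a sketch; environments hold 20-30 samples, so the resulting environment count is approximate):

```python
def max_samples_for(num_envs: int, samples_per_env: int = 30) -> int:
    """Approximate ++max_samples value needed to cover num_envs environments."""
    return num_envs * samples_per_env

print(max_samples_for(10))  # 300, i.e. roughly 10 environments
```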
RASB runs each environment in an isolated Docker container (processing all samples for that environment). Key configuration options:
| Parameter | Default | Description |
|---|---|---|
| `++docker_timeout` | 1800 | Timeout per environment in seconds (30 min) |
| `++docker_memory_limit` | 4g | Memory limit per container |
| `++max_concurrent_containers` | 2 | Parallel container execution |
| `++keep_containers` | False | Keep containers for debugging |
| `++rebuild_images` | False | Force rebuild Docker images |
| `++docker_build_timeout` | 600 | Image build timeout in seconds (10 min) |
```bash
ns eval \
--benchmarks rasb-26h1 \
--server_type openai \
--model gpt-4o \
--output_dir /workspace/rasb-eval \
++docker_timeout=3600 \
++max_concurrent_containers=4 \
++docker_memory_limit=8g
```

RASB uses two judgment types (evaluated inside the container):
| Type | Description |
|---|---|
| `exact` | Exact match comparison (normalized JSON or string) |
| `requirements` | LLM judge committee evaluates against a requirements list |
The requirements-based judgment uses a committee of LLM judges for open-ended tasks where exact matching isn't appropriate.
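To make the idea concrete, here is a minimal sketch of committee-style judging, not RASB's actual implementation: each judge votes on whether the output satisfies every requirement, and the majority decides. `judges` holds hypothetical stand-ins for LLM calls:

```python
from typing import Callable

def committee_judge(
    output: str,
    requirements: list[str],
    judges: list[Callable[[str, str], bool]],  # hypothetical LLM-judge callables
) -> bool:
    """Pass iff a majority of judges finds every requirement satisfied."""
    votes = sum(
        1 for judge in judges if all(judge(output, req) for req in requirements)
    )
    return votes > len(judges) / 2
```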
RASB uses four output parsing strategies defined per environment:
| Strategy | Description |
|---|---|
| `face_value` | Use model output as-is |
| `json_parse` | Parse output as JSON (handles markdown fences) |
| `regex_extraction` | Extract output using regex patterns from metadata |
| `tool_call_result` | Extract from tool call arguments |
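As a rough illustration of what fence-tolerant JSON parsing involves (not RASB's exact code), a `json_parse`-style strategy has to strip an optional markdown fence before decoding:

```python
import json
import re

def json_parse(text: str):
    """Parse model output as JSON, tolerating a surrounding markdown code fence."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

print(json_parse('```json\n{"answer": 42}\n```'))  # {'answer': 42}
```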
Results from the RASB 26H1 technical report (pass rates in %):
| Model | Overall | Mean | Std | Median | Q1 | Q3 |
|---|---|---|---|---|---|---|
| Claude 4.5 Opus | 85.8 | 85.5 | 17.2 | 90.0 | 80.0 | 96.7 |
| Claude 4.6 Opus | 83.0 | 82.8 | 19.1 | 90.0 | 76.7 | 96.7 |
| Claude 4.6 Sonnet | 78.4 | 78.0 | 23.5 | 86.7 | 70.0 | 96.0 |
| Claude 4.7 Opus | 78.6 | 78.3 | 23.1 | 86.7 | 66.7 | 96.7 |
| Claude 4.5 Sonnet | 77.0 | 76.7 | 22.2 | 83.3 | 66.7 | 93.3 |
| Claude 4.5 Haiku | 71.1 | 70.9 | 24.9 | 76.7 | 60.0 | 90.0 |
| Gemini 3 Flash | 71.1 | 70.8 | 23.7 | 76.7 | 56.7 | 90.0 |
| GPT-5.1 | 70.9 | 70.6 | 28.0 | 76.7 | 60.0 | 90.0 |
| GPT-5.3 Codex | 70.4 | 70.0 | 28.1 | 76.7 | 56.7 | 92.0 |
| Gemini 3.1 Pro | 70.4 | 70.1 | 25.1 | 73.3 | 52.0 | 93.3 |
| GPT-5.3 | 68.9 | 68.4 | 29.0 | 76.7 | 53.3 | 90.0 |
| Gemini 2.5 Flash | 66.7 | 66.4 | 24.8 | 70.0 | 46.7 | 84.0 |
| Gemini 3.1 Flash Lite | 64.8 | 64.5 | 27.7 | 66.7 | 46.7 | 86.7 |
| Nemotron Super | 63.3 | 62.9 | 25.9 | 63.3 | 43.3 | 83.3 |
| Nemotron Nano | 55.6 | 55.2 | 28.0 | 53.3 | 33.3 | 80.0 |
Overall is the sample-weighted pass rate across all 5,731 samples. Mean, Std, Median, Q1, and Q3 are computed over per-environment pass rates (193 environments).
RASB reports pass rates aggregated by:
- Overall: Total correct / total samples
- By environment type: Pass rate per type (generation, evaluation, etc.)
- By output parsing: Pass rate by parsing strategy
- By judgment type: Pass rate for exact vs requirements judgments
- By tool usage: Pass rate for samples with/without tools
- By repository: Pass rate per source repository
RASB also computes statistics across all environments:
| Metric | Description |
|---|---|
| `mean_pass_rate` | Average pass rate across environments |
| `median_pass_rate` | Median pass rate (50th percentile) |
| `q1_pass_rate` | First quartile (25th percentile) |
| `q3_pass_rate` | Third quartile (75th percentile) |
| `std_pass_rate` | Standard deviation of pass rates |
| `overall_pass_rate` | Total correct / total samples |
| `num_environments` | Number of environments evaluated |
The mean treats all environments equally regardless of sample count, while the overall rate weights by sample count. The quartiles help identify performance distribution across environments.
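The difference is easy to see on made-up counts, as in this short sketch:

```python
# Hypothetical per-environment (num_correct, num_samples) counts.
envs = [(27, 30), (10, 20), (29, 30)]

# mean_pass_rate: macro average, every environment weighs the same.
mean_pass_rate = sum(c / n for c, n in envs) / len(envs) * 100

# overall_pass_rate: micro average, every sample weighs the same.
overall_pass_rate = sum(c for c, _ in envs) / sum(n for _, n in envs) * 100

print(f"mean={mean_pass_rate:.1f}%, overall={overall_pass_rate:.1f}%")
# mean=78.9%, overall=82.5% -> the small weak environment drags the mean down
```

An example of the reported metrics structure: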
```json
{
  "pass@1": {
    "pass_rate": 72.5,
    "pass_rate_generation": 78.2,
    "pass_rate_evaluation": 65.1,
    "pass_rate_retrieval": 71.8,
    "pass_rate_judgment_exact": 85.3,
    "pass_rate_judgment_requirements": 58.7,
    "pass_rate_with_tools": 68.4,
    "pass_rate_no_tools": 74.1,
    "errors": 15,
    "container_errors": 2,
    "mean_pass_rate": 71.3,
    "median_pass_rate": 73.5,
    "q1_pass_rate": 58.2,
    "q3_pass_rate": 85.0,
    "std_pass_rate": 18.7,
    "overall_pass_rate": 72.5,
    "num_environments": 193
  }
}
```

- Group samples: Samples grouped by environment
- Build base image: Docker image built from the environment's Dockerfile (cached)
- Build overlay image: Adds `evaluate.py`, `judge.py`, `lm.py`, `callable`, `.env`
- Run container: Container executes `evaluate.py`, processing all `synth_*.json` files
- Collect results: Results read from `results/results.json`
- Map to samples: Container results mapped back to individual sample entries
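The loop can be pictured roughly as follows. This is a sketch of the flow above using plain `docker` CLI calls, not the actual Skills orchestration code; image tags, build arguments, and the shared results mount are illustrative assumptions:

```python
import subprocess
from pathlib import Path

def run_environment(env_dir: Path, timeout: int = 1800) -> str:
    """Build base + overlay images for one environment, run it, read results."""
    base_tag = f"rasb-base-{env_dir.name}"
    overlay_tag = f"rasb-overlay-{env_dir.name}"
    # 1-2. Base image from the environment's Dockerfile, then the overlay that
    # layers the evaluation harness on top (assumes an `ARG BASE` in the
    # overlay Dockerfile).
    subprocess.run(["docker", "build", "-t", base_tag, str(env_dir)], check=True)
    subprocess.run(
        ["docker", "build", "-t", overlay_tag,
         "--build-arg", f"BASE={base_tag}", "rasb_container/"],
        check=True,
    )
    # 3. Run evaluate.py inside the container; host networking lets it reach
    # the LLM server, and the mount exposes results/ back to the host.
    subprocess.run(
        ["docker", "run", "--rm", "--network", "host",
         "-v", f"{env_dir}/results:/workspace/results", overlay_tag,
         "python", "evaluate.py"],
        check=True, timeout=timeout,
    )
    # 4. Collect results/results.json for mapping back to samples.
    return (env_dir / "results" / "results.json").read_text()
```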
Check the environment's Dockerfile and requirements:
```bash
cd nemo_skills/dataset/rasb-26h1/26h1/<environment>/
docker build -t test .
```

Enable `keep_containers` to inspect failed containers:

```bash
ns eval ... ++keep_containers=True
```

Then inspect the container logs:

```bash
# Check the results directory
ls /workspace/rasb-eval/environments/<env_id>/
cat /workspace/rasb-eval/environments/<env_id>/container_stdout.log
cat /workspace/rasb-eval/environments/<env_id>/results.json
```

By default, containers use host network mode to access the LLM server. If your server is not on localhost, ensure the container can reach it.
The container needs API credentials. Set the appropriate environment variable before running:
```bash
# For OpenAI
export OPENAI_API_KEY=your-openai-key

# For Anthropic
export ANTHROPIC_API_KEY=your-anthropic-key

# For NVIDIA Inference API
export NVAPI_KEY=your-nvidia-key
```

Ensure the endpoint URL and model name match your API provider. Connection errors inside Docker containers often indicate mismatched endpoint/model configuration.
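A quick way to rule out endpoint or credential problems before a long run is to query the server directly from the same host. A minimal sketch for an OpenAI-compatible endpoint (the `/v1/models` route is a common convention; the base URL below is an example, swap in your own):

```python
import os
import urllib.request

# Assumes an OpenAI-compatible server and OPENAI_API_KEY already exported.
base_url = os.environ.get("SERVER_ADDRESS", "https://inference-api.nvidia.com")
req = urllib.request.Request(
    f"{base_url}/v1/models",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    print(resp.status, resp.read()[:200])  # HTTP 200 plus a model list = reachable
```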