vla-evaluation-harness


Benchmarks
Models (official)
Models (dexbotic)
Models (starVLA)
Models (🤗 LeRobot)

_{✓ reproduced | ◇ integrated, awaiting first reproduction | · planned}

One framework to evaluate any VLA model on any robot simulation benchmark.

Latest News

[2026/07] v0.4.0 released. Recording on by default, pinned reproducible Docker rebuilds, DuoBench, and the LeRobot bridge below.
[2026/07] 🤗 LeRobot bridge: serve any single-obs-step LeRobot PreTrainedPolicy (π₀ / π₀.₅, GR00T N1.7, X-VLA, MolmoAct2, and the FastWAM / VLA-JEPA / LingBot-VA world models) with one config, pinned at v0.6.0. Four checkpoints reproduce their published LIBERO scores: π₀.₅ 100%, GR00T N1.7 99%, MolmoAct2 97%, VLA-JEPA 96%.
[2026/06] v0.3.0 released. SQLite recording + vla-eval merge, wandb/trackio tracking, and a watchdog for wedged benchmarks.
[2026/05] v0.2.0 released. 18 benchmarks x 13 model servers, the largest open VLA evaluation matrix. Browse configs/ to get started.
[2026/05] Leaderboard rebuilt: 2,456 models x 18 benchmarks, schema-validated pipeline, updated monthly.
[2026/04] v0.1.0 released. 6 VLA models reproduced within 2pp of published scores.
[2026/04] Batch parallel eval: 2,000 LIBERO episodes in 18 min on 1x H100 (details).

Why vla-evaluation-harness?


Batch Parallel Evaluation	Episode sharding + batched GPU inference → 47× throughput (2 000 LIBERO episodes in 18 min on 1× H100). Details
Zero Setup	Benchmarks in Docker and model servers as single-file uv scripts, avoiding dependency conflicts.
AI-Assisted Integration	Built-in Claude Code skills for adding benchmarks and model servers help scaffold new integrations in minutes, not hours.
Leaderboard	The largest unified VLA comparison: 2,456 models × 18 benchmarks, aggregated from 2,087 papers.

Motivation

VLA models are evaluated on LIBERO, CALVIN, SimplerEnv, ManiSkill, and others, but each benchmark has its own dependencies, observation format, and evaluation protocol. In practice, every research team ends up maintaining private eval forks per benchmark. Results diverge. Bug fixes don't propagate. No one tests under real-time conditions where the environment keeps moving during inference.

vla-evaluation-harness integrates the model once, integrates the benchmark once, and the full cross-evaluation matrix fills itself.

How: our abstraction layer fully decouples models from benchmarks.

Benchmarks run inside Docker, giving exact reproducibility without dependency conflicts.
Model servers are standalone uv scripts with inline dependency declarations and no manual setup.

See Architecture for how the pieces connect.

Installation

pip install vla-eval

Or from source (pinned to the latest stable release):

git clone --branch v0.4.0 https://github.com/allenai/vla-evaluation-harness.git
cd vla-evaluation-harness
uv sync --python 3.11 --all-extras --dev

Quick Start

Two terminals: one for the model server (GPU), one for the benchmark client.

# Terminal 1: model server (runs on host with GPU)
vla-eval serve --config configs/model_servers/db_cogact/libero.yaml

# Terminal 2: run evaluation (benchmark runs in Docker by default).
# Wait for the model server to finish loading first; ``GET /health`` returning HTTP 200 is the ready signal.
vla-eval run --config configs/benchmarks/libero/smoke_test.yaml

Results are saved to results/ as JSON. The benchmark runs inside Docker by default; pass --no-docker for local development.

For full evaluation (10 tasks x 50 episodes):

vla-eval run --config configs/benchmarks/libero/spatial.yaml

Other benchmarks and models follow the same pattern. Pick a benchmark and a compatible model server from configs/:

# SimplerEnv + X-VLA
vla-eval serve --config configs/model_servers/xvla/simpler_widowx.yaml
vla-eval run --config configs/benchmarks/simpler/widowx_vm.yaml

# CALVIN + DB-CogACT
vla-eval serve --config configs/model_servers/db_cogact/calvin.yaml
vla-eval run --config configs/benchmarks/calvin/eval.yaml

# LIBERO + π₀.₅ via 🤗 LeRobot (works for any LeRobot PreTrainedPolicy checkpoint)
vla-eval serve --config configs/model_servers/lerobot/pi05_libero.yaml
vla-eval run --config configs/benchmarks/libero/object.yaml

Each benchmark and model server directory has a README with setup details, supported configs, and Docker image info. See Reproduction Reports for verified scores.

Need faster runs? See Batch Parallel Evaluation for up to 47x throughput.

Batch Parallel Evaluation

A full evaluation takes hours sequentially. Two layers of parallelism bring this down to minutes:

Episode sharding splits (task, episode) pairs across N independent processes (RFC-0006). Each shard connects to the same model server, where a BatchPredictModelServer batches their inference requests into a single forward pass. The two axes multiply together.

Episode Sharding (environment parallelism)

# Option A: use the helper script (launches all shards + auto-merges)
./scripts/run_sharded.sh -c configs/benchmarks/libero/spatial.yaml -n 50

# Option B: manual launch
vla-eval run -c configs/benchmarks/libero/spatial.yaml --shard-id 0 --num-shards 4 &
vla-eval run -c configs/benchmarks/libero/spatial.yaml --shard-id 1 --num-shards 4 &
# ... (each shard is a separate process)
wait
vla-eval merge -c configs/benchmarks/libero/spatial.yaml -o results/libero_spatial.json

Each shard gets a deterministic slice via round-robin. Results merge with episode-level deduplication; if a shard fails, re-run only that shard.

Batch Model Server (GPU parallelism)

Enable batching in the model server config by setting max_batch_size > 1:

args:
  max_batch_size: 16    # max observations per GPU forward pass (>1 enables batching)
  max_wait_time: 0.05   # seconds to wait before dispatching a partial batch

Tuning & Combined Effect

We tune parallelism via a demand/supply methodology: demand λ(N) measures environment throughput as a function of shards, supply μ(B) measures model throughput as a function of batch size. The operating point satisfies λ(N) < 80% · μ(B*) to prevent queue buildup.

Sharding and batching multiply together (DB-CogACT 7B, LIBERO Spatial, 1× H100-80GB):

	Sequential	Batch Parallel (50 shards, B=16)
Wall-clock	~14 h	~18 min
Throughput	~11 obs/s	~486 obs/s

2 000 episodes, 47× faster. The included benchmarking tools (experiments/bench_demand.py, experiments/bench_supply.py) measure λ and μ for any model + benchmark combination. See the Tuning Guide for worked examples and max_wait_time derivation.

Docker Images

All benchmark environments are packaged as standalone Docker images, based on base unless noted.

Image	Size	Benchmark	Python	Base
`base`	3.3 GB	—	—	`nvidia/cuda:12.1.1-runtime-ubuntu22.04`
`rlbench` 🔒	4.7 GB	RLBench	3.8	`base`
`simpler`	4.9 GB	SimplerEnv	3.10	`base`
`duobench`	5.6 GB	DuoBench	3.11	`base`
`libero`	6.0 GB	LIBERO	3.8	`base`
`libero-pro`	6.2 GB	LIBERO-Pro	3.8	`base`
`robocerebra`	6.4 GB	RoboCerebra	3.8	`base`
`calvin`	9.6 GB	CALVIN	3.8	`base`
`maniskill2`	9.8 GB	ManiSkill2	3.10	`base`
`kinetix`	10.0 GB	Kinetix	3.11	`base`
`mikasa-robo`	10.1 GB	MIKASA-Robo	3.10	`base`
`libero-mem`	11.3 GB	LIBERO-Mem	3.8	`base`
`libero-plus`	14.8 GB	LIBERO-Plus	3.8	`base`
`robomme`	17.0 GB	RoboMME	3.11	`base`
`vlabench`	17.7 GB	VLABench	3.10	`base`
`behavior1k` 🔒	23.6 GB	BEHAVIOR-1K	3.10	`base`
`robotwin`	28.6 GB	RoboTwin 2.0	3.10	`base`
`molmospaces`	31.4 GB	MolmoSpaces-Bench	3.11	`base`
`robocasa`	35.6 GB	RoboCasa	3.11	`base`
`robodojo` 🔒	36.3 GB	RoboDojo	3.11	upstream Isaac Sim 5.1

_{🔒 = build-locally only; the Dockerfile gates the build behind a licence opt-in (docker/build.sh <name> --accept-license <name>) and the image isn't published to ghcr.io.}

Pull (recommended):

docker pull ghcr.io/allenai/vla-evaluation-harness/libero:latest

Build locally (see docker/build.sh):

docker/build.sh                                           # build all (gated images skipped)
docker/build.sh libero                                    # build one
docker/build.sh behavior1k --accept-license behavior1k    # build a gated image

Observability

Two systems capture eval data. Both key on eval_id so recordings and tracker runs stay linked.

Recording

Benchmark entries persist episode results and step rows by default, with videos off. Add a top-level recording: key only when you want to override that policy; use --record-video or set record_video: true when you want per-episode mp4s.

benchmarks:
  - benchmark: ...
    recording:
      record_video: true
      video_fps: 10

Recording writes <output_dir>/recording-<eval_id>.sqlite: per-step rows, episode results, and eval metadata. vla-eval merge materializes per-episode JSONL + aggregate JSON from the DB. Single-shard runs auto-merge; sharded runs call vla-eval merge once after all shards exit. --no-save skips recording entirely.

When filename_stem is omitted, per-episode artifacts use a benchmark-scoped path: {benchmark_safe_name}/task{task_idx:04d}_ep{episode_id:04d}_{status}. Custom stems can still reference serializable task fields such as {name} plus {status}.

Tracking (wandb / trackio)

Mirror aggregate metrics to a remote dashboard:

tracking:
  report_to: wandb           # "wandb" | ["wandb", "trackio"] | "all" | "none"

The harness injects id=<eval_id> + resume="allow" so live (vla-eval run) and merge (vla-eval merge) paths converge on the same run. All other settings, including project, entity, and API key, come from the backend's native env vars (WANDB_*, TRACKIO_*). See the W&B env reference for details. Install the backend yourself: pip install wandb / pip install trackio.

Under sharding, aggregate emission defers to vla-eval merge; per-episode tracking is live-path only.

Documentation

Document	Description
Architecture	Component descriptions, protocol, episode flow, configuration
Contributing	Dev setup, adding benchmarks/models, PR workflow
Reproduction Reports	Per-model evaluation results and reproducibility verdicts
RFCs	Design proposals with rationale and status tracking
Design Philosophy	Freshness, Convenience, Layered Abstraction, Quality, Reproducibility, Openness

Contributing

See CONTRIBUTING.md for dev setup and PR workflow.

PRs for any 🔜 item in the support matrix are welcome.

Citation

If you find this work useful, please cite:

@article{choi2026vlaeval,
  title={vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models},
  author={Choi, Suhwan and Lee, Yunsung and Park, Yubeen and Kim, Chris Dongjoo and Krishna, Ranjay and Fox, Dieter and Yu, Youngjae},
  journal={arXiv preprint arXiv:2603.13966},
  year={2026}
}

License

Apache 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 283 Commits
.claude/skills		.claude/skills
.github		.github
configs		configs
docker		docker
docs		docs
experiments		experiments
leaderboard		leaderboard
scripts		scripts
src/vla_eval		src/vla_eval
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
.python-version		.python-version
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

vla-evaluation-harness

Latest News

Why vla-evaluation-harness?

Motivation

Installation

Quick Start

Batch Parallel Evaluation

Episode Sharding (environment parallelism)

Batch Model Server (GPU parallelism)

Tuning & Combined Effect

Docker Images

Observability

Recording

Tracking (wandb / trackio)

Documentation

Contributing

Citation

License

About

Uh oh!

Releases 7

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

vla-evaluation-harness

Latest News

Why vla-evaluation-harness?

Motivation

Installation

Quick Start

Batch Parallel Evaluation

Episode Sharding (environment parallelism)

Batch Model Server (GPU parallelism)

Tuning & Combined Effect

Docker Images

Observability

Recording

Tracking (wandb / trackio)

Documentation

Contributing

Citation

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 7

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages