# Awesome VLA Benchmarks
A curated list of benchmarks, evaluation frameworks, and datasets for Vision-Language-Action (VLA) models in robotics.
VLA models take visual observations and language instructions as input, and output robot actions. This list catalogs the benchmarks used to evaluate them.
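That input-to-output contract can be made concrete with a minimal, framework-free sketch. Every name below (`Observation`, `ToyVLAPolicy`, the 7-dim action) is a hypothetical illustration of the common interface, not any particular model's API:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    image: list          # H x W x C camera frame (toy stand-in for pixels)
    instruction: str     # natural-language command, e.g. "pick up the red block"

class ToyVLAPolicy:
    """Stand-in for the common VLA contract: (image, instruction) -> action.
    Real models replace this with a VLM backbone plus an action head."""
    ACTION_DIM = 7  # e.g. 6-DoF end-effector delta + 1 gripper channel

    def predict_action(self, obs: Observation) -> list:
        # A real VLA conditions on both modalities; this placeholder
        # only returns a correctly shaped zero action.
        assert obs.image and isinstance(obs.instruction, str)
        return [0.0] * self.ACTION_DIM

policy = ToyVLAPolicy()
obs = Observation(image=[[[0, 0, 0]]], instruction="pick up the red block")
action = policy.predict_action(obs)   # 7 floats: the robot command
```

The benchmarks below differ mainly in what sits behind `Observation` (simulator vs. real camera) and how `action` is scored (success detectors, human raters, or proxy metrics).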
Contributions welcome! Please read the contributing guidelines before submitting a pull request.
## VLA Models

A chronological list of published Vision-Language-Action models.

- **VLM Backbone**: the pretrained vision-language model the VLA was built on (or "from-scratch" if none).
- **Action Head**: how continuous robot actions are produced (discrete tokens, diffusion, flow matching, etc.).
- **Open**: ✓ if weights/code are publicly released; ◐ if partial (code only / weights restricted); ✗ if closed.
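As a reference point for the "discrete action tokens" entries in the table: RT-2- and OpenVLA-style models clip each continuous action dimension to a normalized range and uniformly bin it, so one action becomes a few integer tokens a language model can emit. A minimal sketch, where the ±1 range and 256 bins are illustrative defaults rather than any specific model's configuration:

```python
N_BINS = 256           # illustrative; per-dimension bin count
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def tokenize(action):
    """Map each continuous action dimension to a bin index in [0, N_BINS - 1]."""
    tokens = []
    for a in action:
        a = min(max(a, LOW), HIGH)                    # clip to range
        idx = int((a - LOW) / (HIGH - LOW) * N_BINS)  # uniform binning
        tokens.append(min(idx, N_BINS - 1))           # handle a == HIGH
    return tokens

def detokenize(tokens):
    """Invert tokenization: each token maps back to its bin center."""
    width = (HIGH - LOW) / N_BINS
    return [LOW + (t + 0.5) * width for t in tokens]

action = [0.1, -0.5, 0.0, 0.99, -1.0, 0.3, 1.0]
recovered = detokenize(tokenize(action))
half_bin = (HIGH - LOW) / N_BINS / 2
# Round-trip error is bounded by half a bin width (~0.0039 here).
assert all(abs(a - r) <= half_bin for a, r in zip(action, recovered))
```

Diffusion and flow-matching heads avoid this quantization by sampling continuous actions directly, at the cost of an iterative denoising/integration step at inference time.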
| Model | Date | Org | VLM Backbone | Action Head | Params | Open | Links |
|---|---|---|---|---|---|---|---|
| RT-1 | 2022-12 | Google | EfficientNet + Universal Sentence Encoder | Discrete action tokens (Transformer) | 35M | ✓ | Paper / Code |
| PaLM-E | 2023-03 | Google | PaLM + ViT | LLM-driven planning (text actions) | up to 562B | ✗ | Paper / Site |
| RT-2 | 2023-07 | Google DeepMind | PaLI-X / PaLM-E | Discrete action tokens (co-fine-tuned with web data) | 5B / 55B | ✗ | Paper / Site |
| RT-2-X / RT-X | 2023-10 | Open X-Embodiment collab. | PaLI-X | Discrete action tokens, cross-embodiment | 55B | ✗ | Paper / Site |
| RoboFlamingo | 2023-11 | ByteDance / Berkeley | OpenFlamingo | LSTM action head | ~3B | ✓ | Paper / Code |
| 3D-VLA | 2024-03 | UMass / MIT | 3D-LLM | Generative 3D goal + action | — | ✓ | Paper / Code |
| Octo | 2024-05 | UC Berkeley / Stanford | Transformer (from-scratch) | Diffusion head | 27M / 93M | ✓ | Paper / Code |
| OpenVLA | 2024-06 | Stanford / UC Berkeley | Llama-2-7B + DINOv2 + SigLIP | Discrete action tokens (autoregressive) | 7B | ✓ | Paper / Code |
| TinyVLA | 2024-09 | Midea / ECNU | Small VLM (Pythia-based) | Diffusion head | <1B | ✓ | Paper / Site |
| RDT-1B | 2024-10 | Tsinghua AIR | SigLIP + T5-XXL | Diffusion Transformer | 1B | ✓ | Paper / Code |
| π0 (Pi-Zero) | 2024-10 | Physical Intelligence | PaliGemma | Flow-matching action expert | 3B | ✓ | Paper / Code |
| CogACT | 2024-11 | Microsoft Research Asia | OpenVLA-style (DINOv2 + SigLIP + Llama-2) | DiT action expert (decoupled cognition/action) | 7B+ | ✓ | Paper / Code |
| π0-FAST | 2025-01 | Physical Intelligence | PaliGemma | FAST (DCT) action tokens | 3B | ✓ | Paper / Site |
| SpatialVLA | 2025-01 | Shanghai AI Lab et al. | PaliGemma2 | Ego3D position-aware action tokens | 4B | ✓ | Paper / Code |
| DexVLA | 2025-02 | Midea | Qwen2-VL | Diffusion action expert (dexterous) | 1B+ | ✓ | Paper / Site |
| Magma | 2025-02 | Microsoft | LLaVA-style | Set-of-marks + action traces | 8B | ✓ | Paper / Code |
| Helix | 2025-02 | Figure AI | S2 (VLM, ~7B) + S1 (80M visuomotor) | Dual-system; S1 runs at 200 Hz | ~7B (S2) | ✗ | Site |
| Hi Robot | 2025-02 | Physical Intelligence | π0 backbone + high-level VLM | Hierarchical (instruction → action) | 3B | ✗ | Paper / Site |
| OpenVLA-OFT | 2025-02 | Stanford | OpenVLA | Parallel decoding + continuous actions + L1 regression | 7B | ✓ | Paper / Code |
| GR00T N1 | 2025-03 | NVIDIA | Eagle-2 VLM | DiT action head (System 1+2 design) | 2B | ✓ | Paper / Code |
| Gemini Robotics | 2025-03 | Google DeepMind | Gemini 2.0 | Action decoder (closed) | — | ✗ | Paper / Site |
| GO-1 | 2025-03 | AgiBot | InternVL backbone | Latent planner + action expert (ViLLA) | — | ◐ | Site / Code |
| π0.5 | 2025-04 | Physical Intelligence | π0 + open-world co-training | Flow matching; generalizes to unseen homes | 3B | ✗ | Paper / Site |
| NORA | 2025-04 | SUTD | Qwen2.5-VL | FAST tokens | 3B | ✓ | Paper / Code |
| SmolVLA | 2025-06 | Hugging Face | SmolVLM-2 | Flow-matching action expert | 450M | ✓ | Paper / Code |
| GR00T N1.5 | 2025-06 | NVIDIA | Eagle-2 VLM | DiT action head (improved post-training) | 2B | ✓ | Code |
| WorldVLA | 2025-06 | Alibaba DAMO | Chameleon | Unified world model + action autoregression | 7B | ✓ | Paper / Code |
| Gemini Robotics On-Device | 2025-06 | Google DeepMind | Gemini Nano family | On-device action decoder | — | ✗ | Site |
| MolmoAct | 2025-08 | Allen AI (AI2) | Molmo VLM | Action reasoning + chunked action tokens | 7B | ✓ | Paper / Code |
## Simulation Benchmarks - Manipulation
| Benchmark | Year | Simulator | Tasks | Key Focus | Links |
|---|---|---|---|---|---|
| CALVIN | 2022 | PyBullet | 34 tasks, 4 envs | Long-horizon language-conditioned manipulation | Paper / Code |
| LIBERO | 2023 | robosuite | 130 tasks, 4 suites | Lifelong learning, knowledge transfer | Paper / Code |
| RLBench | 2020 | CoppeliaSim | 100 tasks | Vision-guided manipulation (RL, IL, few-shot) | Paper / Code |
| PerAct2 | 2024 | CoppeliaSim | 18 bimanual tasks | Bimanual 6-DoF coordination | Paper / Code |
| Meta-World | 2019 | MuJoCo | 50 tasks | Multi-task / meta-RL | Paper / Code |
| ManiSkill3 | 2024 | SAPIEN | 12 domains | GPU-parallel, high-throughput simulation (30K+ FPS) | Paper / Code |
| ManiSkill-HAB | 2024 | SAPIEN | Home rearrangement | Low-level home manipulation | Paper / Site |
| robosuite | 2020 | MuJoCo | 9 tasks, 10 robots | Modular manipulation framework | Paper / Code |
| RoboMimic | 2021 | MuJoCo | 5 sim + 3 real tasks | IL from human demonstrations | Paper / Code |
| VIMA | 2023 | PyBullet | 17 task types, 600K+ trajs | Multimodal prompt-conditioned manipulation | Paper / Code |
| Ravens / CLIPort | 2020/22 | PyBullet | 10 tasks | Transporter networks / language-conditioned rearrangement | Paper / Code |
| ARNOLD | 2023 | Isaac Sim | 8 tasks, 40 objects | Continuous states in realistic 3D scenes | Paper / Code |
| COLOSSEUM | 2024 | CoppeliaSim | 20 tasks × 14 perturbations | Systematic generalization testing | Paper / Code |
| VLABench | 2024 | — | 100 categories, 2000+ objects | Long-horizon reasoning | Paper / Code |
| GemBench | 2024 | CoppeliaSim | 7 primitives × 4 levels | Graded generalization levels | Paper / Code |
| ClevrSkills | 2024 | ManiSkill2 | 33 tasks, 330K trajs | Compositional reasoning | Paper |
| LoHoRavens | 2023 | PyBullet | 10 tasks | Long-horizon without step-by-step instructions | Paper / Code |
| BEHAVIOR-1K | 2022 | OmniGibson | 1000 activities | Full household activities | Paper / Code |
| RoboCasa | 2024 | robosuite | 100-365 tasks | Kitchen tasks, generalist robots | Paper / Code |
| GenManip | 2025 | — | 200 scenarios | LLM-driven instruction generalization | Paper / Code |
| Franka Kitchen | 2019 | MuJoCo | 4 subtasks | Multi-task offline RL | Paper |
| FurnitureBench | 2023 | Isaac Gym | 8 IKEA-style tasks | Long-horizon furniture assembly | Paper / Code |
| BiGym | 2024 | MuJoCo | 40 tasks | Bimanual mobile manipulation | Paper / Code |
| RoboTwin | 2024 | — | 50 tasks, 5 embodiments | Dual-arm with generative digital twins | Paper / Code |
| DexArt | 2023 | SAPIEN | Multiple | Dexterous articulated-object manipulation | Paper / Code |
| Bi-DexHands | 2022 | Isaac Gym | Thousands | Bimanual dexterous manipulation | Paper / Code |
| DOMINO | 2026 | — | 35 tasks, 110K+ trajs | Dynamic manipulation generalization | Paper / Code |
| LiLo-VLA (LIBERO-Long++ / Ultra-Long) | 2026 | robosuite | 21 tasks | Compositional long-horizon manipulation with object-centric linking | Paper |
| InstructVLA | 2026 | — | Instruction-tuning suite | Instruction tuning from understanding to manipulation (ICLR 2026) | Code |
## Simulation Benchmarks - Embodied AI / Navigation
| Benchmark | Year | Simulator | Tasks | Key Focus | Links |
|---|---|---|---|---|---|
| AI2-THOR / ManipulaTHOR | 2017 | Unity | 120+ rooms | Navigation + manipulation | Paper / Code |
| Habitat 2.0 | 2021 | Habitat Sim | Thousands of envs | Navigation + rearrangement | Paper / Site |
| EmbodiedBench | 2025 | Multi-env | 1,128 instances | MLLM-based embodied agents | Paper / Code |
## Simulation Benchmarks - Humanoid

| Benchmark | Year | Simulator | Tasks | Key Focus | Links |
|---|---|---|---|---|---|
| HumanoidBench | 2024 | MuJoCo | 27 (15 manip + 12 loco) | Whole-body locomotion & manipulation | Paper / Code |
| LeVERB | 2025 | Isaac Lab | 150+ tasks, 10 categories | Vision-language humanoid whole-body control | Paper |
| Ego Humanoid Manipulation | 2025 | Isaac Lab | 12 tasks | Egocentric-vision humanoid manipulation | Code |
| HumanoidGen (HGen-Bench) | 2025 | SAPIEN | 20 tasks | LLM-driven bimanual dexterous task generation | Paper / Code |
| Humanoid Everyday | 2025 | Real-world | 260 tasks, 10.3K trajs | Large-scale real humanoid manipulation | Paper / Data |
| OmniH2O | 2024 | Isaac Gym | 6 tasks | Human-to-humanoid teleoperation & autonomy | Paper / Code |
| SIMPLE (Psi-0) | 2026 | MuJoCo + Isaac Sim | 6+ loco-manip tasks | Open humanoid VLA benchmarking simulator | Paper / Code |
| Mimicking-Bench | 2024 | — | 6 tasks, 23K sequences | Human-to-humanoid scene interaction | Paper / Site |
## Real-World Datasets & Benchmarks
| Benchmark | Year | Embodiment | Scale | Key Focus | Links |
|---|---|---|---|---|---|
| Open X-Embodiment | 2023 | 22 robots | 1M+ trajs, 527 skills | Cross-embodiment transfer | Paper / Code |
| BridgeData V2 | 2023 | WidowX 250 | 60K trajs, 24 envs | Multi-task, cross-environment | Paper / Site |
| DROID | 2024 | 18 Frankas | 76K trajs, 564 scenes | In-the-wild manipulation | Paper / Code |
| RoboMIND | 2025 | 4 embodiments | 107K trajs, 479 tasks | Multi-embodiment with failure data | Paper / Site |
| AgiBot World | 2025 | Dual-arm | 1M+ trajs, 217 tasks | Bimanual at scale (4000 m² facility) | Paper / Code |
| RoboSet | 2023 | Franka | 7.5K trajs, 38 tasks | Kitchen multi-task | Paper / Site |
| Language-Table | 2023 | Custom | 600K trajs | Open-vocabulary pushing/rearrangement | Paper / Code |
| FMB | 2024 | Franka | 22.5K demos | Functional manipulation (grasp, assemble) | Paper / Site |
| LHManip | 2023 | Real robot | 200 episodes, 20 tasks | Long-horizon in cluttered scenes | Paper / Code |
| ALOHA / Mobile ALOHA | 2023 | Custom bimanual | 7-50 tasks | Bimanual (mobile) manipulation | Paper / Site |
| RoboVQA | 2024 | 3 embodiments | 829K pairs | VQA for robot reasoning | Paper / Site |
| MUTEX | 2023 | Franka | 100 sim + 50 real | 6-modality task specification | Paper |
## Real-to-Sim & Scalable Real-Robot Evaluation

| Benchmark | Year | Approach | Key Focus | Links |
|---|---|---|---|---|
| SimplerEnv (SIMPLER) | 2024 | Sim-as-real proxy | Evaluate real-world policies in sim | Paper / Code |
| REALM | 2025 | Real-validated sim | 15 perturbation factors, p<0.001 correlation | Paper / Code |
| RobotArena Infinity | 2025 | Real-to-sim translation | VLM scoring + human preferences | Paper / Site |
| RoboArena | 2025 | Distributed real eval | Crowd-sourced ELO-style rankings | Paper |
| RoboChallenge | 2025 | Remote real robots | 30 tasks, fleet of 10 machines | Paper / Site |
## VLA-Specific Evaluation Frameworks
| Framework | Year | Type | Key Focus | Links |
|---|---|---|---|---|
| vla-eval | 2026 | Unified harness | 17 benchmarks, 500+ models, Docker-based | Code |
| VLA-Arena | 2025 | Systematic eval | 170 tasks, 4 dimensions × 3 difficulty levels | Paper / Code |
| LADEV | 2024 | Language-driven eval | Auto-generated scenes from NL descriptions | Paper |
| ManipBench | 2025 | MCQ-based | VLM reasoning for low-level manipulation | Paper / Site |
| RoboBench | 2025 | MCQ/VQA-based | MLLM as embodied brain, 5 cognitive dims | Paper / Site |
| Eval-Actions + AutoEval | 2026 | Automated eval | Trustworthy evaluation protocol for robotic manipulation | Paper |
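Despite their different scopes, most rollout-based harnesses above share the same inner loop: run the policy in an environment for N episodes per task and report the success rate. A framework-agnostic sketch, where the `env.reset()`/`env.step()` signatures are hypothetical minimal interfaces rather than any listed framework's API:

```python
import random

def evaluate(policy, env, n_episodes=50, max_steps=200):
    """Success rate of `policy` in `env`.

    Assumed toy interfaces: env.reset() -> obs,
    env.step(action) -> (obs, done, success), policy(obs) -> action.
    """
    successes = 0
    for _ in range(n_episodes):
        obs = env.reset()
        for _ in range(max_steps):
            action = policy(obs)
            obs, done, success = env.step(action)
            if done:
                successes += success  # bool counts as 0/1
                break
    return successes / n_episodes

# Toy environment to exercise the loop: every episode ends immediately
# and succeeds with probability 0.5.
class CoinFlipEnv:
    def reset(self):
        return 0.0
    def step(self, action):
        return 0.0, True, random.random() < 0.5

rate = evaluate(lambda obs: 0.0, CoinFlipEnv(), n_episodes=200)
```

What differentiates the frameworks is everything around this loop: scene randomization, success detection, perturbation schedules, and how results are aggregated across tasks and seeds.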
## Robustness & Safety Benchmarks
| Benchmark | Year | Extends | Key Focus | Links |
|---|---|---|---|---|
| LIBERO-PRO | 2025 | LIBERO | Robustness under 4-dim perturbations | Paper / Code |
| LIBERO-Plus | 2025 | LIBERO | 7-dim × 5-level robustness analysis | Paper / Code |
| LIBERO-X | 2026 | LIBERO | Hierarchical robustness litmus test | — |
| LIBERO-Para | 2026 | LIBERO | Paraphrase robustness (22-52% degradation) | — |
| SimX-OR | 2025 | Plug-in | Observational robustness (blur, noise, etc.) | Paper / Code |
| Eva-VLA | 2025 | LIBERO | Adversarial physical variations | Paper |
| VLA-Risk | 2025 | Multiple | Safety/risk across 296 scenarios, 3 dims (object/action/space) × 2 modalities | OpenReview |
| RoboMME | 2026 | Custom | Memory-augmented VLA evaluation | Code |
| Safety-CHORES / SafeVLA | 2025 | AI2-THOR / CHORES | 5 cost categories (corner, blind_spot, fragile, critical, danger) on long-horizon nav+manip; safe RL via CMDP (NeurIPS 2025 Spotlight) | Paper / Code / Site |
| RoboCasa-Safety (via OmniGuide) | 2026 | RoboCasa | Safety-rate protocol (no collision with static furniture) + 3D SDF guidance | Paper / Site |
| Linguistic Red-Team | 2026 | Multiple | Diversity-aware adversarial instructions (SR 93% → 5.85%) | Paper |
| VLSA / AEGIS | 2026 | Plug-in | Plug-and-play CBF safety-constraint layer with theoretical guarantees | Paper |
## Platforms & Toolkits

| Platform | Year | Key Focus | Links |
|---|---|---|---|
| RoboVerse | 2025 | Cross-simulator unified platform (MetaSim) | Paper / Code |
| STAR-Gen | 2025 | Generalization taxonomy (visual, semantic, behavioral) | Paper / Site |
## Surveys

- "Vision-Language-Action Models for Robotics: A Review" - Site
- "Pure Vision Language Action (VLA) Models: A Comprehensive Survey" - Paper
- "A Survey on Vision-Language-Action Models for Embodied AI" - Paper
- "A Survey on Efficient Vision-Language-Action Models" - Paper / Site
- "A Survey on Vision-Language-Action Models: An Action Tokenization Perspective" - Paper
- "Benchmarking the Generality of Vision-Language-Action Models" - Paper
## Contributing

Please see CONTRIBUTING.md for guidelines.