Awesome VLA Benchmarks

A curated list of benchmarks, evaluation frameworks, and datasets for Vision-Language-Action (VLA) models in robotics.

VLA models take visual observations and language instructions as input, and output robot actions. This list catalogs the benchmarks used to evaluate them.
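
For readers new to the area, here is a minimal, hypothetical sketch (not the API of any model or benchmark listed below) of that input/output contract and of the closed-loop rollout that most benchmarks in this list score by success rate:

```python
# Hypothetical interface illustrating the VLA input/output contract; the class
# and environment names here are placeholders, not any benchmark's real API.
import numpy as np


class VLAPolicy:
    """Maps a visual observation plus a language instruction to a robot action."""

    def predict_action(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A concrete model would encode the image and instruction with its VLM
        # backbone, then decode an action (e.g. a 7-DoF end-effector command)
        # with its action head.
        raise NotImplementedError


def rollout(policy: VLAPolicy, env, instruction: str, max_steps: int = 200) -> bool:
    """Closed-loop evaluation: query the policy every step, report task success."""
    obs = env.reset()
    for _ in range(max_steps):
        action = policy.predict_action(obs["image"], instruction)
        obs, done, success = env.step(action)  # env API is assumed, not standard
        if done:
            return success
    return False
```

Benchmarks differ mainly in what the environment provides (simulator, tasks, perturbations) and in how success is defined; the policy-side contract stays roughly the same.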

Contributions welcome! Please read the contributing guidelines before submitting a pull request.


Table of Contents

  • Awesome VLA Models
  • Simulation Benchmarks - Manipulation
  • Simulation Benchmarks - Embodied AI / Navigation
  • Humanoid Benchmarks
  • Real-World Datasets & Benchmarks
  • Sim-to-Real Evaluation
  • VLA-Specific Evaluation Frameworks
  • Robustness & Safety Benchmarks
  • Unified Platforms
  • Survey Papers
  • Related Awesome Lists
  • Contributing
  • License

Awesome VLA Models

A chronological list of published Vision-Language-Action models.

  • VLM Backbone: the pretrained vision-language model the VLA was built on (or "from-scratch" if none).
  • Action Head: how continuous robot actions are produced (discrete action tokens, diffusion, flow matching, etc.); a minimal sketch of the discrete-token approach follows the table below.
  • Open: ✓ if weights/code are publicly released; ◐ if partial (code only / weights restricted); ✗ if closed.
| Model | Date | Org | VLM Backbone | Action Head | Params | Open | Links |
|---|---|---|---|---|---|---|---|
| RT-1 | 2022-12 | Google | EfficientNet + Universal Sentence Encoder | Discrete action tokens (Transformer) | 35M | | Paper / Code |
| PaLM-E | 2023-03 | Google | PaLM + ViT | LLM-driven planning (text actions) | up to 562B | | Paper / Site |
| RT-2 | 2023-07 | Google DeepMind | PaLI-X / PaLM-E | Discrete action tokens (co-fine-tuned with web data) | 5B / 55B | | Paper / Site |
| RT-2-X / RT-X | 2023-10 | Open X-Embodiment collab | PaLI-X | Discrete action tokens, cross-embodiment | 55B | | Paper / Site |
| RoboFlamingo | 2023-11 | ByteDance / Berkeley | OpenFlamingo | LSTM action head | ~3B | | Paper / Code |
| 3D-VLA | 2024-03 | UMass / MIT | 3D-LLM | Generative 3D goal + action | | | Paper / Code |
| Octo | 2024-05 | UC Berkeley / Stanford | Transformer (from-scratch) | Diffusion head | 27M / 93M | | Paper / Code |
| OpenVLA | 2024-06 | Stanford / UC Berkeley | Llama-2-7B + DINOv2 + SigLIP | Discrete action tokens (autoregressive) | 7B | | Paper / Code |
| TinyVLA | 2024-09 | Midea / ECNU | Small VLM (Pythia-based) | Diffusion head | <1B | | Paper / Site |
| RDT-1B | 2024-10 | Tsinghua AIR | SigLIP + T5-XXL | Diffusion Transformer | 1B | | Paper / Code |
| π0 (Pi-Zero) | 2024-10 | Physical Intelligence | PaliGemma | Flow-matching action expert | 3B | | Paper / Code |
| CogACT | 2024-11 | Microsoft Research Asia | OpenVLA-style (DINOv2+SigLIP+Llama2) | DiT action expert (decoupled cognition/action) | 7B+ | | Paper / Code |
| π0-FAST | 2025-01 | Physical Intelligence | PaliGemma | FAST (DCT) action tokens | 3B | | Paper / Site |
| SpatialVLA | 2025-01 | Shanghai AI Lab et al. | PaliGemma2 | Ego3D position-aware action tokens | 4B | | Paper / Code |
| DexVLA | 2025-02 | Midea | Qwen2-VL | Diffusion action expert (dexterous) | 1B+ | | Paper / Site |
| Magma | 2025-02 | Microsoft | LLaVA-style | Set-of-marks + action traces | 8B | | Paper / Code |
| Helix | 2025-02 | Figure AI | S2 (VLM, ~7B) + S1 (80M visuomotor) | Dual-system, S1 runs at 200Hz | ~7B (S2) | | Site |
| Hi Robot | 2025-02 | Physical Intelligence | π0 backbone + high-level VLM | Hierarchical (instruction → action) | 3B | | Paper / Site |
| OpenVLA-OFT | 2025-02 | Stanford | OpenVLA | Parallel decoding + continuous actions + L1 regression | 7B | | Paper / Code |
| GR00T N1 | 2025-03 | NVIDIA | Eagle-2 VLM | DiT action head (System 1+2 design) | 2B | | Paper / Code |
| Gemini Robotics | 2025-03 | Google DeepMind | Gemini 2.0 | Action decoder (closed) | | | Paper / Site |
| GO-1 | 2025-03 | AgiBot | InternVL backbone | Latent planner + action expert (ViLLA) | | | Site / Code |
| π0.5 | 2025-04 | Physical Intelligence | π0 + open-world co-training | Flow matching, generalizes to unseen homes | 3B | | Paper / Site |
| NORA | 2025-04 | SUTD | Qwen2.5-VL | FAST tokens | 3B | | Paper / Code |
| SmolVLA | 2025-06 | Hugging Face | SmolVLM-2 | Flow matching action expert | 450M | | Paper / Code |
| GR00T N1.5 | 2025-06 | NVIDIA | Eagle-2 VLM | DiT action head (improved post-training) | 2B | | Code |
| WorldVLA | 2025-06 | Alibaba DAMO | Chameleon | Unified world-model + action autoregression | 7B | | Paper / Code |
| Gemini Robotics On-Device | 2025-06 | Google DeepMind | Gemini Nano family | On-device action decoder | | | Site |
| MolmoAct | 2025-08 | Allen AI (AI2) | Molmo VLM | Action reasoning + chunked action tokens | 7B | | Paper / Code |
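
As a concrete example of the "discrete action tokens" head used by RT-1, RT-2, and OpenVLA, each continuous action dimension is quantized into a fixed number of bins so the VLM can emit actions as ordinary tokens. The sketch below is illustrative only: 256 bins per dimension is a commonly reported choice, but the normalization range and binning details vary by model.

```python
# Illustrative discretization for "discrete action token" heads: each
# continuous action dimension is binned, and the bin indices are emitted as
# tokens by the language model. Bin count and action range are illustrative
# defaults, not any specific model's exact values.
import numpy as np

NUM_BINS = 256          # commonly 256 bins per action dimension
LOW, HIGH = -1.0, 1.0   # actions assumed pre-normalized to [-1, 1]


def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map a continuous action vector to per-dimension bin indices (tokens)."""
    clipped = np.clip(action, LOW, HIGH)
    bins = (clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)
    return np.round(bins).astype(np.int64)


def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Invert the binning: recover the (quantized) continuous action."""
    return tokens.astype(np.float64) / (NUM_BINS - 1) * (HIGH - LOW) + LOW


# Example: a 7-DoF action (xyz delta, rpy delta, gripper) round-trips with
# quantization error of at most half a bin width.
a = np.array([0.02, -0.10, 0.00, 0.05, 0.00, -0.25, 1.00])
half_bin = (HIGH - LOW) / (2 * (NUM_BINS - 1))
assert np.allclose(tokens_to_action(action_to_tokens(a)), a, atol=half_bin + 1e-9)
```

Diffusion and flow-matching heads (e.g. Octo, π0, SmolVLA) instead predict continuous actions directly, avoiding this quantization step.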

Simulation Benchmarks - Manipulation

| Benchmark | Year | Simulator | Tasks | Key Focus | Links |
|---|---|---|---|---|---|
| CALVIN | 2022 | PyBullet | 34 tasks, 4 envs | Long-horizon language-conditioned manipulation | Paper / Code |
| LIBERO | 2023 | robosuite | 130 tasks, 4 suites | Lifelong learning, knowledge transfer | Paper / Code |
| RLBench | 2020 | CoppeliaSim | 100 tasks | Vision-guided manipulation (RL, IL, few-shot) | Paper / Code |
| PerAct2 | 2024 | CoppeliaSim | 18 bimanual tasks | Bimanual 6-DoF coordination | Paper / Code |
| Meta-World | 2019 | MuJoCo | 50 tasks | Multi-task / meta RL | Paper / Code |
| ManiSkill3 | 2024 | SAPIEN | 12 domains | GPU-parallel, fastest sim (30K+ FPS) | Paper / Code |
| ManiSkill-HAB | 2024 | SAPIEN | Home rearrangement | Low-level home manipulation | Paper / Site |
| robosuite | 2020 | MuJoCo | 9 tasks, 10 robots | Modular manipulation framework | Paper / Code |
| RoboMimic | 2021 | MuJoCo | 5 sim + 3 real tasks | IL from human demonstrations | Paper / Code |
| VIMA | 2023 | PyBullet | 17 task types, 600K+ trajs | Multimodal prompt-conditioned | Paper / Code |
| Ravens / CLIPort | 2020/22 | PyBullet | 10 tasks | Transporter / language rearrangement | Paper / Code |
| ARNOLD | 2023 | Isaac Sim | 8 tasks, 40 objects | Continuous states in realistic 3D scenes | Paper / Code |
| COLOSSEUM | 2024 | CoppeliaSim | 20 tasks x 14 perturbations | Systematic generalization testing | Paper / Code |
| VLABench | 2024 | - | 100 categories, 2000+ objects | Long-horizon reasoning | Paper / Code |
| GemBench | 2024 | CoppeliaSim | 7 primitives x 4 levels | Generalization levels | Paper / Code |
| ClevrSkills | 2024 | ManiSkill2 | 33 tasks, 330K trajs | Compositional reasoning | Paper |
| LoHoRavens | 2023 | PyBullet | 10 tasks | Long-horizon without step-by-step instructions | Paper / Code |
| BEHAVIOR-1K | 2022 | OmniGibson | 1000 activities | Full household activities | Paper / Code |
| RoboCasa | 2024 | robosuite | 100-365 tasks | Kitchen tasks, generalist robots | Paper / Code |
| GenManip | 2025 | - | 200 scenarios | LLM-driven instruction generalization | Paper / Code |
| Franka Kitchen | 2019 | MuJoCo | 4 subtasks | Multi-task offline RL | Paper |
| FurnitureBench | 2023 | Isaac Gym | 8 IKEA-style tasks | Long-horizon furniture assembly | Paper / Code |
| BiGym | 2024 | MuJoCo | 40 tasks | Bimanual mobile manipulation | Paper / Code |
| RoboTwin | 2024 | - | 50 tasks, 5 embodiments | Dual-arm with generative digital twins | Paper / Code |
| DexArt | 2023 | SAPIEN | Multiple | Dexterous articulated object manipulation | Paper / Code |
| Bi-DexHands | 2022 | Isaac Gym | Thousands | Bimanual dexterous manipulation | Paper / Code |
| DOMINO | 2026 | - | 35 tasks, 110K+ trajs | Dynamic manipulation generalization | Paper / Code |
| LiLo-VLA (LIBERO-Long++ / Ultra-Long) | 2026 | robosuite | 21 tasks | Compositional long-horizon manipulation with object-centric linking | Paper |
| InstructVLA | 2026 | - | Instruction-tuning suite | Instruction tuning from understanding to manipulation (ICLR 2026) | Code |

Simulation Benchmarks - Embodied AI / Navigation

| Benchmark | Year | Simulator | Tasks | Key Focus | Links |
|---|---|---|---|---|---|
| AI2-THOR / ManipulaTHOR | 2017 | Unity | 120+ rooms | Navigation + manipulation | Paper / Code |
| Habitat 2.0 | 2021 | Habitat Sim | Thousands of envs | Navigation + rearrangement | Paper / Site |
| EmbodiedBench | 2025 | Multi-env | 1,128 instances | MLLM-based embodied agents | Paper / Code |

Humanoid Benchmarks

| Benchmark | Year | Simulator | Tasks | Key Focus | Links |
|---|---|---|---|---|---|
| HumanoidBench | 2024 | MuJoCo | 27 (15 manip + 12 loco) | Whole-body locomotion & manipulation | Paper / Code |
| LeVERB | 2025 | Isaac Lab | 150+ tasks, 10 categories | Vision-language humanoid whole-body control | Paper |
| Ego Humanoid Manipulation | 2025 | Isaac Lab | 12 tasks | Egocentric vision humanoid manipulation | Code |
| HumanoidGen (HGen-Bench) | 2025 | SAPIEN | 20 tasks | LLM-driven bimanual dexterous task generation | Paper / Code |
| Humanoid Everyday | 2025 | Real-world | 260 tasks, 10.3K trajs | Large-scale real humanoid manipulation | Paper / Data |
| OmniH2O | 2024 | Isaac Gym | 6 tasks | Human-to-humanoid teleoperation & autonomy | Paper / Code |
| SIMPLE (Psi-0) | 2026 | MuJoCo + Isaac Sim | 6+ loco-manip tasks | Open humanoid VLA benchmarking simulator | Paper / Code |
| Mimicking-Bench | 2024 | - | 6 tasks, 23K sequences | Human-to-humanoid scene interaction | Paper / Site |

Real-World Datasets & Benchmarks

| Benchmark | Year | Embodiment | Scale | Key Focus | Links |
|---|---|---|---|---|---|
| Open X-Embodiment | 2023 | 22 robots | 1M+ trajs, 527 skills | Cross-embodiment transfer | Paper / Code |
| BridgeData V2 | 2023 | WidowX 250 | 60K trajs, 24 envs | Multi-task, cross-environment | Paper / Site |
| DROID | 2024 | 18 Frankas | 76K trajs, 564 scenes | In-the-wild manipulation | Paper / Code |
| RoboMIND | 2025 | 4 embodiments | 107K trajs, 479 tasks | Multi-embodiment with failure data | Paper / Site |
| AgiBot World | 2025 | Dual-arm | 1M+ trajs, 217 tasks | Bimanual at scale (4000 m² facility) | Paper / Code |
| RoboSet | 2023 | Franka | 7.5K trajs, 38 tasks | Kitchen multi-task | Paper / Site |
| Language-Table | 2023 | Custom | 600K trajs | Open-vocabulary pushing/rearrangement | Paper / Code |
| FMB | 2024 | Franka | 22.5K demos | Functional manipulation (grasp, assemble) | Paper / Site |
| LHManip | 2023 | Real robot | 200 episodes, 20 tasks | Long-horizon in cluttered scenes | Paper / Code |
| ALOHA / Mobile ALOHA | 2023 | Custom bimanual | 7-50 tasks | Bimanual (mobile) manipulation | Paper / Site |
| RoboVQA | 2024 | 3 embodiments | 829K pairs | VQA for robot reasoning | Paper / Site |
| MUTEX | 2023 | Franka | 100 sim + 50 real | 6-modality task specification | Paper |

Sim-to-Real Evaluation

| Benchmark | Year | Approach | Key Focus | Links |
|---|---|---|---|---|
| SimplerEnv (SIMPLER) | 2024 | Sim-as-real proxy | Evaluate real-world policies in sim | Paper / Code |
| REALM | 2025 | Real-validated sim | 15 perturbation factors, p<0.001 correlation | Paper / Code |
| RobotArena Infinity | 2025 | Real-to-sim translation | VLM scoring + human preferences | Paper / Site |
| RoboArena | 2025 | Distributed real eval | Crowd-sourced ELO-style rankings | Paper |
| RoboChallenge | 2025 | Remote real robots | 30 tasks, fleet of 10 machines | Paper / Site |

VLA-Specific Evaluation Frameworks

| Framework | Year | Type | Key Focus | Links |
|---|---|---|---|---|
| vla-eval | 2026 | Unified harness | 17 benchmarks, 500+ models, Docker-based | Code |
| VLA-Arena | 2025 | Systematic eval | 170 tasks, 4 dimensions x 3 difficulty levels | Paper / Code |
| LADEV | 2024 | Language-driven eval | Auto-generated scenes from NL descriptions | Paper |
| ManipBench | 2025 | MCQ-based | VLM reasoning for low-level manipulation | Paper / Site |
| RoboBench | 2025 | MCQ/VQA-based | MLLM as embodied brain, 5 cognitive dims | Paper / Site |
| Eval-Actions + AutoEval | 2026 | Automated eval | Trustworthy evaluation protocol for robotic manipulation | Paper |

Robustness & Safety Benchmarks

| Benchmark | Year | Extends | Key Focus | Links |
|---|---|---|---|---|
| LIBERO-PRO | 2025 | LIBERO | Robustness under 4-dim perturbations | Paper / Code |
| LIBERO-Plus | 2025 | LIBERO | 7-dim x 5-level robustness analysis | Paper / Code |
| LIBERO-X | 2026 | LIBERO | Hierarchical robustness litmus test | - |
| LIBERO-Para | 2026 | LIBERO | Paraphrase robustness (22-52% degradation) | - |
| SimX-OR | 2025 | Plug-in | Observational robustness (blur, noise, etc.) | Paper / Code |
| Eva-VLA | 2025 | LIBERO | Adversarial physical variations | Paper |
| VLA-Risk | 2025 | Multiple | Safety/risk across 296 scenarios, 3 dims (object/action/space) x 2 modalities | OpenReview |
| RoboMME | 2026 | Custom | Memory-augmented VLA evaluation | Code |
| Safety-CHORES / SafeVLA | 2025 | AI2-THOR / CHORES | 5 cost categories (corner, blind_spot, fragile, critical, danger) on long-horizon nav+manip; safe RL via CMDP (NeurIPS 2025 Spotlight) | Paper / Code / Site |
| RoboCasa-Safety (via OmniGuide) | 2026 | RoboCasa | Safety-rate protocol (no collision with static furniture) + 3D SDF guidance | Paper / Site |
| Linguistic Red-Team | 2026 | Multiple | Diversity-aware adversarial instructions (SR 93% → 5.85%) | Paper |
| VLSA / AEGIS | 2026 | Plug-in | Plug-and-play CBF safety-constraint layer with theoretical guarantees | Paper |

Unified Platforms

| Platform | Year | Key Focus | Links |
|---|---|---|---|
| RoboVerse | 2025 | Cross-simulator unified platform (MetaSim) | Paper / Code |
| STAR-Gen | 2025 | Generalization taxonomy (visual, semantic, behavioral) | Paper / Site |

Survey Papers

  • "Vision-Language-Action Models for Robotics: A Review" - Site
  • "Pure Vision Language Action (VLA) Models: A Comprehensive Survey" - Paper
  • "A Survey on Vision-Language-Action Models for Embodied AI" - Paper
  • "A Survey on Efficient Vision-Language-Action Models" - Paper / Site
  • "A Survey on Vision-Language-Action Models: An Action Tokenization Perspective" - Paper
  • "Benchmarking the Generality of Vision-Language-Action Models" - Paper

Related Awesome Lists


Contributing

Please see CONTRIBUTING.md for guidelines.

License

CC0
