awesome-foundation-models-for-robotics

Curated database of foundation models for robotics

Rules & Legend

  • These are my personal notes, so mistakes are possible. If your work is missing, please don't be offended; just open an issue or PR.
  • AI NOW HELPS ME ADD PAPERS. MISTAKES HAPPEN. PLEASE DOUBLE-CHECK ALL INFO.
  • Included models: fundamental works, open-weight/open-source works, works I saw on X, YouTube, or LinkedIn, works I trained, and works I tried to train but couldn't.
  • "Actions" covers chunked, single-step, end-effector, and joint-space actions. Unfortunately, I cannot track the exact action type for every work. Also, most models can be adapted to other modalities.

Modality Legend:

  • I: Image | Vid: Video | L: Language/Text | A: Actions
  • P: Proprioception | T: Tactile | D: Depth | G: Goal | S: State/Sensors | M: Memory | F: Force
  • A': Future Actions | I': Future Images | I_plan: Image-Space Plan | Vp: Viewpoint
  • Val: Value / Expected Reward | Prog: Progress Tracking

NotebookLM - if you want to listen to this repo

Notebook: Link

Main list 👇

🚀 2026 Models

LDA-1B

I, L → A (Image, Language → Actions)

$\pi_{0.7}$ (Pi 0.7)

I, L → A (Image, Language → Actions)

ManiDreams

S → A (State → Actions)

ForceVLA2

I, L → A (Image, Language → Actions)

LiLo-VLA

I, L → A (Image, Language → Actions)

JALA

I, L → A (Image, Language → Actions)

  • Paper: Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild
  • Notes:
    • Released Feb 2026.
    • Pretraining framework that learns Jointly-Aligned Latent Actions (JALA).
    • Learns a predictive action embedding aligned with both inverse dynamics and real actions.
    • Scales with UniHand-Mix, a 7.5M video corpus (>2,000 hours).
    • Significantly improves downstream robot manipulation performance.

Self-Correcting VLA (SC-VLA)

I, L → A (Image, Language → Actions)

  • Paper: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
  • Code: Kisaragi0/SC-VLA
  • Notes:
    • Released Feb 2026.
    • Achieves self-improvement by intrinsically guiding action refinement through sparse imagination.
    • Integrates auxiliary predictive heads to forecast current task progress and future trajectory trends.
    • Introduces online action refinement together with progress-dependent dense reward reshaping (see the sketch below).
    • Yields highest task throughput with 16% fewer steps and 9% higher success rate than baselines.
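A minimal sketch of one plausible form of a progress-dependent dense reward; both the reward form and the `progress_head` predictor below are assumptions, not SC-VLA's exact formulation:

```python
import numpy as np

def progress_dense_reward(progress_head, frames):
    """Per-step dense reward as the predicted gain in task progress.

    progress_head(frame) -> float in [0, 1] is a hypothetical predictor
    standing in for the model's auxiliary progress head.
    """
    progress = np.array([progress_head(f) for f in frames])
    # Reward each transition by how much predicted progress it adds; candidate
    # refinements that move the task forward accumulate more reward.
    return np.diff(progress)
```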

HALO

I, L → A (Image, Language → Actions)

  • Paper: HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning
  • Notes:
    • Released Feb 2026.
    • Unified VLA model for Embodied Multimodal Chain-of-Thought (EM-CoT) reasoning.
    • Mixture-of-Transformers (MoT) architecture decoupling semantic reasoning, visual foresight, and action prediction.
    • Surpasses the baseline policy $\pi_0$ by 34.1% on the RoboTwin benchmark.
    • Demonstrates strong generalization under aggressive unseen environmental randomization.

AutoHorizon

I, L → A (Image, Language → Actions)

  • Website: hatchetproject.github.io/autohorizon
  • Paper: VLA Knows Its Limits
  • Notes:
    • Released Feb 2026.
    • Test-time method that dynamically estimates the execution horizon for each predicted action chunk.
    • Analyzes self-attention weights in flow-based VLAs.
    • Finds that intra-chunk actions attend invariantly to vision-language tokens.
    • Incurs negligible computational overhead and generalizes across diverse tasks and flow-based models.

TOPReward

Vid, L → Val (Video, Language → Value)

VLANeXt

I, L → A (Image, Language → Actions)

RoboGene

I, L → A (Image, Language → Actions)

DreamZero

I, L → A, Vid (Image, Language → Actions, Video)

  • Website: dreamzero0.github.io
  • Paper: World Action Models are Zero-shot Policies
  • Code: dreamzero0/dreamzero
  • Weights: Hugging Face
  • Notes:
    • Released Feb 2026.
    • World Action Model (WAM) that jointly predicts actions and videos.
    • Achieves strong zero-shot generalization to new tasks and environments (over 2x improvement vs VLAs).
    • Demonstrates efficient cross-embodiment transfer (adapts to new robot with 30 mins of play data).
    • Enables real-time closed-loop control at 7Hz via model and system optimizations (DreamZero-Flash).

FUTURE-VLA

I, L → A, I' (Image, Language → Actions, Future Images)

  • Paper: FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution
  • Notes:
    • Released Feb 2026.
    • Unified architecture reformulating long-horizon control and future forecasting as a monolithic sequence-generation task.
    • Leverages Temporally Adaptive Compression for high spatiotemporal information density.
    • Performs Latent-Space Autoregression to align actionable dynamics with reviewable visual look-aheads.
    • Enables prediction-guided Human-In-the-Loop mechanisms.
    • Achieves 99.2% success on LIBERO.

DM0

I, L → A (Image, Language → Actions)

  • Paper: DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI
  • Code: Dexmal/dexbotic
  • Notes:
    • Released Feb 2026.
    • Embodied-Native VLA framework designed for Physical AI.
    • Unifies embodied manipulation and navigation by learning from heterogeneous data sources.
    • Builds a flow-matching action expert atop the VLM.
    • Uses Embodied Spatial Scaffolding for spatial CoT reasoning.
    • Achieves SOTA performance on RoboChallenge benchmark.

RynnBrain

I, L → A (Image, Language → Actions)

APEX

I, P → A (Image, Proprioception → Actions)

  • Website: apex-humanoid.github.io
  • Paper: APEX: Learning Adaptive High-Platform Traversal for Humanoid Robots
  • Notes:
    • Released Feb 2026.
    • System for perceptive, climbing-based high-platform traversal for humanoids.
    • Composes terrain-conditioned behaviors (climb-up, climb-down, walk, crawl).
    • Uses a generalized ratchet progress reward for learning contact-rich maneuvers (see the sketch below).
    • Demonstrates zero-shot sim-to-real traversal of 0.8-meter platforms on the Unitree G1.
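A minimal sketch of a ratchet-style progress reward, assuming the common "reward only new best progress" formulation; the exact APEX reward and its progress measure may differ:

```python
class RatchetProgressReward:
    """Reward only new best progress, so backsliding is never rewarded twice."""

    def __init__(self):
        self.best_progress = 0.0

    def reset(self):
        self.best_progress = 0.0

    def __call__(self, progress: float) -> float:
        # progress: normalized task-completion measure in [0, 1], e.g. height
        # gained onto the platform (hypothetical choice of measure).
        gain = max(progress - self.best_progress, 0.0)
        self.best_progress = max(self.best_progress, progress)
        return gain
```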

RISE

I, L → A (Image, Language → Actions)

  • Website: opendrivelab.com/kai0-rl
  • Paper: RISE: Self-Improving Robot Policy with Compositional World Model
  • Notes:
    • Released Feb 2026.
    • Scalable framework for robotic reinforcement learning via imagination.
    • Compositional World Model: predicts multi-view future via controllable dynamics model and evaluates outcomes.
    • Enables continuous self-improvement in imaginary space without costly physical interaction.
    • Achieves +35-45% improvement on real-world manipulation tasks.

ContactGaussian-WM

Vid → I', Physics (Video → Future Images, Physics)

  • Paper: ContactGaussian-WM: Learning Physics-Grounded World Model from Videos
  • Notes:
    • Released Feb 2026.
    • Differentiable physics-grounded rigid-body world model.
    • Uses a unified Gaussian representation for visual appearance and collision geometry.
    • Learns physical laws directly from sparse and contact-rich video data.
    • Outperforms SOTA in learning complex scenarios and robust generalization.

VISTA

I, L → A (Image, Language → Actions)

  • Website: vista-wm.github.io
  • Paper: Scaling World Model for Hierarchical Manipulation Policies
  • Notes:
    • Released Feb 2026.
    • Hierarchical VLA framework using a world model for visual subgoal decomposition.
    • High-level world model divides tasks into subtask sequences with synthesized goal images.
    • Synthesized images provide visually and physically grounded details for the low-level policy (see the sketch below).
    • Boosts performance in novel scenarios from 14% to 69% with world model guidance.
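A rough sketch of the hierarchical control flow described above; `plan_subgoals`, `act`, the Gym-style `env.step`, and the `subgoal_reached` flag are hypothetical stand-ins, not VISTA's actual API:

```python
def run_hierarchical_episode(world_model, low_level_policy, env, instruction,
                             max_steps_per_subgoal=50):
    obs = env.reset()
    # High level: split the task into subtasks, each paired with a synthesized
    # goal image showing what finishing that subtask should look like.
    goal_images = world_model.plan_subgoals(obs, instruction)
    for goal_image in goal_images:
        for _ in range(max_steps_per_subgoal):
            # Low level: goal-image-conditioned visuomotor policy.
            action = low_level_policy.act(obs, goal_image)
            obs, reward, done, info = env.step(action)
            if info.get("subgoal_reached", False) or done:
                break
    return obs
```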

Say, Dream, and Act

I, L → A, Vid (Image, Language → Actions, Video)

LAP

I, L → A (Image, Language → Actions)

LocoVLM

I, L → A (Image, Language → Actions)

ST4VLA

I, L → A (Image, Language → Actions)

DreamDojo

Vid, A → Vid' (Video, Actions → Future Video) [World Model]

EgoActor

I, L → A (Image, Language → Actions)

GeneralVLA

I, L → A (Image, Language → Actions)

SCALE

I, L → A (Image, Language → Actions)

DADP

I → A (Image → Actions)

SD-VLA

I, L → A (Image, Language → Actions)

VLS

I, L → A (Image, Language → Actions)

DynamicVLA

I, L → A (Image, Language → Actions)

DeFM

D → S (Depth → Representations)

  • Paper: DeFM: Learning Foundation Representations from Depth for Robotics
  • Code: leggedrobotics/defm
  • Notes:
    • Released Jan 2026.
    • Self-supervised foundation model trained on 60M depth images.
    • Uses DINO-style self-distillation to learn metric-aware representations.
    • Introduces a three-channel input normalization strategy to preserve metric depth.
    • Distilled into compact models (as small as 3M params) for efficient policy learning.
    • Achieves SOTA on depth-based navigation, locomotion, and manipulation benchmarks.

SAM2Act & SAM2Act+

I, P → A (Image, Proprioception → Actions)

LingBot-VLA

I, L → A (Image, Language → Actions)

  • Website: technology.robbyant.com/lingbot-vla
  • Paper: A Pragmatic VLA Foundation Model
  • Code: robbyant/lingbot-vla
  • Notes:
    • Released Jan 2026.
    • Pre-trained on 20,000 hours of real-world multi-embodiment robot data (9 dual-arm configurations).
    • Achieves clear superiority on 100 real-world tasks across 3 platforms.
    • Empirically validates Scaling Laws for VLAs: performance scales with data volume without saturation.
    • Highly efficient training throughput.

Cosmos Policy

I, P, L → A, I', Val (Image, Proprioception, Language → Actions, Future Images, Value)

  • Website: research.nvidia.com/labs/dir/cosmos-policy
  • Paper: Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
  • Notes:
    • Released Jan 2026.
    • Adapts Cosmos-Predict2 (video generation model) into a robot policy via single-stage post-training.
    • No architectural modifications to the base video model; actions are encoded as latent frames.
    • Generates future state images and values (expected rewards) alongside actions, enabling test-time planning (see the sketch below).
    • Achieves state-of-the-art performance on LIBERO (98.5%) and RoboCasa (67.1%).
    • Can learn from experience (policy rollout data) to refine its world model.
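A minimal sketch of how the jointly predicted values could drive test-time planning; the `policy.sample` interface returning actions, imagined frames, and a value per candidate is an assumption for illustration, not the released Cosmos Policy API:

```python
def plan_by_value_ranking(policy, observation, instruction, num_samples=8):
    # Sample several candidate action chunks together with the model's own
    # imagined outcomes and expected rewards.
    candidates = [policy.sample(observation, instruction)
                  for _ in range(num_samples)]
    # Execute the chunk the model itself scores highest.
    best = max(candidates, key=lambda c: c.value)
    return best.actions
```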

EgoWM

I, A → I' (Image, Actions → Future Images)

  • Website: egowm.github.io
  • Paper: Walk through Paintings: Egocentric World Models from Internet Priors
  • Code: miccooper9/egowm
  • Notes:
    • Released Jan 2026.
    • Transforms pretrained video diffusion models into action-conditioned world models.
    • Injects motor commands through lightweight conditioning layers (see the sketch below).
    • Scales across embodiments, from 3-DoF mobile robots to 25-DoF humanoids.
    • Introduces the Structural Consistency Score (SCS) to measure physical correctness.
    • Generalizes to unseen environments, including paintings ("Walk through Paintings").
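One common way to realize such lightweight conditioning is a FiLM-style adapter; the sketch below shows that pattern as an assumption for illustration, not necessarily EgoWM's exact layer:

```python
import torch
import torch.nn as nn

class ActionConditioningAdapter(nn.Module):
    """FiLM-style adapter injecting a low-dimensional motor command into the
    intermediate features of a pretrained video diffusion backbone."""

    def __init__(self, action_dim: int, feature_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Sequential(
            nn.Linear(action_dim, feature_dim),
            nn.SiLU(),
            nn.Linear(feature_dim, 2 * feature_dim),
        )

    def forward(self, features: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # features: (B, T, C) intermediate diffusion features
        # action:   (B, action_dim) motor command for the predicted step
        scale, shift = self.to_scale_shift(action).chunk(2, dim=-1)
        return features * (1.0 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```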

BayesianVLA

I, L → A (Image, Language → Actions)

  • Paper: LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
  • Notes:
    • Released Jan 2026.
    • Addresses "Information Collapse", where goal-driven VLA training learns to ignore the language instruction.
    • The collapse arises because instructions in existing datasets are often highly predictable from the visual observations alone.
    • Proposes a Bayesian decomposition framework with learnable Latent Action Queries.
    • Maximizes conditional Pointwise Mutual Information (PMI) between actions and instructions (see the formula below).
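For reference, the standard definition of the conditional pointwise mutual information between an action $a$ and an instruction $\ell$ given an observation $o$ (the exact LangForce objective may differ):

$$
\mathrm{PMI}(a;\ell \mid o) \;=\; \log \frac{p(a \mid \ell, o)}{p(a \mid o)} \;=\; \log p(a \mid \ell, o) \;-\; \log p(a \mid o)
$$

A policy whose actions are already fully determined by the observation has $\mathrm{PMI} \approx 0$, so maximizing this quantity pushes actions to actually depend on the instruction.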

TIDAL

I, L → A (Image, Language → Actions)

  • Paper: TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control
  • Notes:
    • Released Jan 2026.
    • Addresses high inference latency in large VLA models which causes execution blind spots.
    • Proposes a hierarchical framework: a low-frequency macro-intent loop caches semantic embeddings, while a high-frequency micro-control loop interleaves single-step flow integration (see the sketch below).
    • Enables ~9 Hz control on edge hardware (vs ~2.4 Hz baselines).
    • Uses a temporally misaligned training strategy to learn predictive compensation.
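A minimal sketch of the two-rate loop described above, assuming hypothetical `vlm_backbone`, `flow_expert`, and `robot` interfaces (not TIDAL's actual API):

```python
import time

def two_rate_control_loop(vlm_backbone, flow_expert, robot,
                          macro_period_s=0.4, micro_period_s=0.11):
    cached_embedding = None
    last_macro = 0.0
    action = robot.rest_action()
    while robot.running():
        now = time.monotonic()
        obs = robot.observe()
        # Low-frequency macro-intent loop: expensive VLM forward pass, cached.
        if cached_embedding is None or now - last_macro >= macro_period_s:
            cached_embedding = vlm_backbone.encode(obs)
            last_macro = now
        # High-frequency micro-control loop: one cheap flow-integration step
        # conditioned on the latest cached semantic embedding.
        action = flow_expert.single_step(action, obs, cached_embedding)
        robot.apply(action)
        time.sleep(micro_period_s)
```

In a real system the macro loop would run asynchronously so the VLM forward pass never blocks the fast loop; it is inlined here only for brevity, with periods chosen to roughly match the reported ~9 Hz control rate.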

HumanoidVLM

I, L → A (Image, Language → Actions)

TwinBrainVLA

I, P, L → A (Image, Proprioception, Language → Actions)

DroneVLA

I, L → A (Image, Language → Actions)

  • Paper: DroneVLA: VLA based Aerial Manipulation
  • Notes:
    • Released Jan 2026.
    • Applies VLA models to autonomous aerial manipulation with a custom drone.
    • Integrates Grounding DINO as a separate module for object localization and dynamic planning within the pipeline.
    • Uses a human-centric controller for safe handovers.

UniAct

I, L → A (Image, Language → Actions)

  • Website: 2toinf.github.io/UniAct
  • Paper: Universal Actions for Enhanced Embodied Foundation Models
  • Code: 2toinf/UniAct
  • Notes:
    • Released Jan 2026.
    • Operates in a Universal Action Space constructed as a vector-quantized (VQ) codebook.
    • Learns universal actions capturing generic atomic behaviors shared across robots.
    • Uses streamlined heterogeneous decoders to translate universal actions into embodiment-specific commands (see the sketch below).
    • 0.5B model outperforms significantly larger models (14x larger).
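A minimal sketch of the codebook-plus-decoders structure described above; the codebook size, decoder shapes, and embodiment names are illustrative assumptions, not UniAct's released configuration:

```python
import torch
import torch.nn as nn

class UniversalActionSpace(nn.Module):
    """Vector-quantized universal actions with per-embodiment decoder heads."""

    def __init__(self, num_codes=256, code_dim=128, embodiments=None):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        embodiments = embodiments or {"franka_arm": 7, "mobile_base": 3}
        # One lightweight decoder per embodiment, mapping a universal code to
        # that robot's native command vector.
        self.decoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                nn.Linear(256, act_dim))
            for name, act_dim in embodiments.items()
        })

    def quantize(self, latent: torch.Tensor) -> torch.Tensor:
        # Snap a continuous latent action to its nearest codebook entry
        # (straight-through gradient estimation omitted for brevity).
        dists = torch.cdist(latent, self.codebook.weight)   # (B, num_codes)
        return self.codebook(dists.argmin(dim=-1))

    def forward(self, latent: torch.Tensor, embodiment: str) -> torch.Tensor:
        return self.decoders[embodiment](self.quantize(latent))
```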

ActiveVLA

I, L, D → A, Vp (Image, Language, Depth → Actions, Viewpoint)

  • Paper: ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation
  • Notes:
    • Released Jan 2026.
    • Injects active perception into VLA models to address limitations of static, end-effector-centric views.
    • Adopts a coarse-to-fine paradigm: first localizes critical 3D regions, then optimizes active perception.
    • Uses Active View Selection to choose viewpoints that maximize task relevance/diversity and minimize occlusion.
    • Applies Active 3D Zoom-in to enhance resolution in key areas for fine-grained manipulation.
    • Outperforms baselines on simulation benchmarks and transfers to real-world tasks.

PointWorld

I, D, A → S (Image, Depth, Actions → 3D Point Flow)

  • Website: point-world.github.io
  • Paper: PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
  • Code: huangwl18/PointWorld
  • Notes:
    • Released Jan 2026.
    • Large pre-trained 3D world model forecasting future states from single RGB-D images.
    • Represents actions and state changes as 3D point flows (per-pixel displacements in 3D space), enabling geometry-aware predictions.
    • Unifies state and action in a shared 3D space, facilitating cross-embodiment learning.
    • Trained on ~2M trajectories and 500 hours of real and simulated data.
    • Enables diverse zero-shot manipulation skills (pushing, tool use) via MPC (see the sketch below).
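A rough sketch of sampling-based MPC on top of a point-flow world model; `world_model.predict_flow` and `action_sampler` are hypothetical interfaces, not PointWorld's actual API:

```python
import numpy as np

def point_flow_mpc(world_model, points, goal_points, action_sampler,
                   num_candidates=64, horizon=5):
    """Pick the first action of the candidate plan whose predicted point flow
    brings the scene points closest to the goal configuration."""
    best_cost, best_plan = np.inf, None
    for _ in range(num_candidates):
        plan = [action_sampler() for _ in range(horizon)]
        rollout = points.copy()                      # (N, 3) scene points
        for action in plan:
            # predict_flow returns per-point 3D displacements for this action.
            rollout = rollout + world_model.predict_flow(rollout, action)
        cost = np.linalg.norm(rollout - goal_points, axis=-1).mean()
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_plan[0]   # execute the first action, then replan (receding horizon)
```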

VLM4VLA

I, L → A (Image, Language → Actions)

1X World Model (1XWM)

I, L → Vid, A (Image, Language → Video, Actions)

  • Website: 1x.tech/ai
  • Notes:
    • Released Jan 2026.
    • Video-pretrained world model serving as NEO's cognitive core.
    • Derives robot actions from text-conditioned video generation (14B parameter backbone).
    • Uses a two-stage process: generates future video frames (World Model), then extracts actions via an Inverse Dynamics Model (IDM); see the sketch below.
    • Trained on web-scale video, 900 hours of egocentric human video, and fine-tuned on 70 hours of robot data.
    • Explicitly functions as a World Model, predicting/hallucinating outcomes before execution.
    • Generalizes to novel objects and tasks without teleoperation data.
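A rough sketch of the two-stage derivation described above; `world_model.generate` and `idm.infer` are hypothetical interfaces, not 1X's internal API:

```python
def wm_plus_idm_step(world_model, idm, observation, instruction, num_frames=8):
    # Stage 1: the world model imagines what successful execution looks like
    # as a short, text-conditioned future video.
    future_frames = world_model.generate(observation, instruction,
                                         num_frames=num_frames)
    frames = [observation] + list(future_frames)
    # Stage 2: the IDM recovers the action connecting each adjacent frame pair.
    actions = [idm.infer(prev, nxt)
               for prev, nxt in zip(frames[:-1], frames[1:])]
    return actions
```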

Nvidia Isaac GR00T N1.6

I, P, L → A (Image, Proprioception, Language → Actions)

Rho-alpha (ρα)

I, L, T → A (Image, Language, Tactile → Actions)

  • Website: microsoft.com/en-us/research/story/advancing-ai-for-the-physical-world/
  • Notes:
    • Released Jan 2026.
    • The first robotics model derived from Microsoft's Phi series.
    • VLA+ Model: Integrates tactile sensing directly into the decision-making process.
    • Uses a split architecture: a VLM for high-level reasoning and a specialized action expert for high-frequency control.
    • Trained using physical demonstrations and simulation (Isaac Sim).

$\pi^*_{0.6}$ (Pi-Star 0.6)

I, P, L → A (Image, Proprioception, Language → Actions)

  • Website: physicalintelligence.company/blog/pistar06
  • Notes:
    • Released early 2026.
    • Introduces Reinforcement Learning (RL) to the VLA training pipeline.
    • Allows the model to learn from experience, significantly improving success rates and throughput on real-world tasks.
    • Personal Note: I tried to train this locally but couldn't get the RL pipeline to converge due to limited VRAM scaling on my setup.

📆 2025 Models

Dream-VLA

I, L → A (Image, Language → Actions)

VLA-Motion

I, L → A (Image, Language → Actions)

FASTerVLA

I, L → A (Image, Language → Actions)

ManualVLA

I, L → A (Image, Language → Actions)

GR-RL

I, L → A (Image, Language → Actions)

  • Website: seed.bytedance.com/gr_rl
  • Paper: GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation
  • Notes:
    • Released Dec 2025.
    • Turns a generalist VLA policy into a specialist for long-horizon dexterous manipulation.
    • Uses a multi-stage training pipeline (filtering, augmentation, online RL).
    • The online RL component learns a latent space noise predictor to align the policy with deployment behaviors.
    • Can autonomously lace up a shoe (83.3% success rate), requiring millimeter-level precision.

SONIC

S, P → A (State, Proprioception → Actions)

  • Website: nvlabs.github.io/SONIC
  • Paper: SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control
  • Code: huggingface/trl (Related)
  • Notes:
    • Released Nov 2025.
    • Addresses the scarcity of diverse human motion data by extracting an expansive motion dataset (OmniHuman) covering diverse skills and realistic movements.
    • Introduces SONIC, a large-scale neural tracking policy demonstrating natural humanoid motions with up to 10.7x lower tracking error.
    • Validates zero-shot deployment in real-world scenarios for expressive humanoid movements.

EveryDayVLA

I, L → A (Image, Language → Actions)

XR-1

I, L → A (Image, Language → Actions)

Unified Diffusion VLA

I, L → A, I' (Image, Language → Actions, Future Images)

RL-100

I, P → A (Image, Proprioception → Actions)

ManiAgent

I, L → A (Image, Language → Actions)

X-VLA

I, L → A (Image, Language → Actions)

IntentionVLA

I, L → A (Image, Language → Actions)

Gemini Robotics 1.5 & ER 1.5

I, Vid, L → A, Val (Image, Video, Language → Actions, Reasoning/Value)

CLAP

I, L → A (Image, Language → Actions)

  • Paper: CLAP: A Closed-Loop Diffusion Transformer Action Foundation Model for Robotic Manipulation
  • Notes:
    • Presented at IROS 2025 (October).
    • Componentized VLA architecture with a specialized action module and a critic module.
    • Uses diffusion action transformers for modeling continuous temporal actions.
    • The critic module enables closed-loop inference by refining actions based on feedback.
    • Outperforms methods that use simple action quantization, handling complex, high-precision tasks and generalizing to unseen objects.

MLA

I, P, T, L → A (Image, Proprioception, Tactile, Language → Actions)

SARM

Vid, L → Prog (Video, Language → Progress)

MotoVLA

I, L → A (Image, Language → Actions)

Behavior Foundation Model (BFM)

G, P → A (Goal/Objectives, Proprioception → Actions)

  • Paper: Behavior Foundation Model for Humanoid Robots
  • Notes:
    • Released Sep 2025.
    • Generative model pretrained on large-scale behavioral datasets for humanoid robots.
    • Models the distribution of full-body behavioral trajectories conditioned on goals and proprioception.
    • Integrates masked online distillation with a conditional VAE (CVAE).
    • Enables flexible operation across diverse control modes (velocity, motion tracking, teleop) and generalizes robustly.

NavFoM

I, L → A (Image, Language → Actions)

  • Website: pku-epic.github.io/NavFoM-Web
  • Paper: Embodied Navigation Foundation Model
  • Notes:
    • Released Sep 2025.
    • Cross-embodiment and cross-task navigation foundation model.
    • Trained on 8 million navigation samples (quadrupeds, drones, wheeled robots, vehicles).
    • Unified architecture handling diverse camera setups and temporal horizons.

FLOWER

I, L → A (Image, Language → Actions)

ManiFlow

I, L → A (Image, Language → Actions)

UnifoLM-WMA-0

I, A → I', A' (Image, Actions → Future Images, Future Actions)

Discrete Diffusion VLA

I, L → A (Image, Language → Actions)

Long-VLA

I, L → A (Image, Language → Actions)

Embodied-R1

I, L → A (Image, Language → Actions)

DeepFleet

P, G → A (Proprioception, Goal → Actions)

MolmoAct

I, L → D, I_plan, A (Image, Language → Depth Tokens, Image-Space Plan, Actions)

RICL

I, L → A (Image, Language → Actions)

Digit's Motor Cortex

G, P → A (Goal/Objectives, Proprioception → Actions)

  • Website: agilityrobotics.com/content/training-a-whole-body-control-foundation-model
  • Notes:
    • Released Aug 2025.
    • A whole-body control foundation model trained purely in simulation (Isaac Sim).
    • Uses a small LSTM (<1M params) to handle balance, locomotion, and disturbance recovery (see the sketch below).
    • Functions as a "motor cortex," taking end-effector objectives and handling the low-level dynamics.
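An order-of-magnitude sketch of such a compact recurrent controller; the dimensions and interface below are illustrative guesses (roughly 0.1M parameters), not Agility's actual design:

```python
import torch
import torch.nn as nn

class MotorCortexPolicy(nn.Module):
    """Tiny LSTM whole-body controller: proprioception plus end-effector
    objectives in, joint-space action targets out."""

    def __init__(self, proprio_dim=60, objective_dim=12, hidden_dim=128,
                 num_joints=20):
        super().__init__()
        self.lstm = nn.LSTM(proprio_dim + objective_dim, hidden_dim,
                            batch_first=True)
        self.head = nn.Linear(hidden_dim, num_joints)

    def forward(self, proprio, objectives, state=None):
        # proprio:    (B, T, proprio_dim) joint positions/velocities, IMU, etc.
        # objectives: (B, T, objective_dim) end-effector / velocity targets
        x = torch.cat([proprio, objectives], dim=-1)
        hidden, state = self.lstm(x, state)
        return self.head(hidden), state   # joint-space action targets
```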

InstructVLA

I, L → A (Image, Language → Actions)

GR-3

I, P, L → A (Image, Proprioception, Language → Actions)

  • Paper: GR-3 Technical Report
  • Notes:
    • Released Jul 2025.
    • Trained on three diverse data types: internet-scale vision-language data, human hand tracking data, and robot trajectories.
    • The architecture is a VLM + DiT, similar to other leading models.
    • Employs compliance control during teleoperation, which is beneficial for contact-rich tasks.
    • Showed that it can learn new tasks from only 10 human trajectory demonstrations.

Large Behavior Model (LBM)

I, P, L → A (Image, Proprioception, Language → Actions)

Unified VLA

I, L → A (Image, Language → Actions)

  • Paper: Unified Vision-Language-Action Model
  • Notes:
    • Released Jun 2025.
    • Autoregressively models vision, language, and actions as a single interleaved stream of discrete tokens.
    • Incorporates world modeling during post-training to capture causal dynamics.
    • Achieves strong results on CALVIN and LIBERO benchmarks.

RoboMonkey

I, L → A (Image, Language → Actions)

ControlVLA

I, L → A (Image, Language → Actions)

V-JEPA 2 & V-JEPA 2-AC

Vid → S, A (Video → Embeddings, Actions)

Waymo Motion FM

S, M → A (State, Map → Trajectory)

Genie Envisioner

I, L → Vid, A (Image, Language → Video, Actions)

Fast-in-Slow (FiS)

I, L → A (Image, Language → Actions)

Feel the Force (FTF)

I, T → A (Image, Tactile → Actions)

  • Website: feel-the-force-ftf.github.io
  • Paper: Feel the Force: Contact-Driven Learning from Humans
  • Notes:
    • Released Jun 2025.
    • A robot learning system that models human tactile behavior to learn force-sensitive manipulation.
    • Uses a tactile glove to collect human demonstrations with precise contact forces.
    • Achieves robust force-aware control by continuously predicting the forces needed for manipulation.

Agentic Robot

I, L → A (Image, Language → Actions)

TrackVLA

I, L → A (Image, Language → Actions)

3DLLM-Mem

I, L, M → A (Image, Language, Memory → Actions)

UniSkill

I, Vid → A (Image, Video → Actions)

UniVLA

I, L → A (Image, Language → Actions)

  • Paper: UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
  • Notes:
    • Released May 2025.
    • Learns task-centric action representations from videos using a latent action model (within DINO feature space).
    • Can leverage data from arbitrary embodiments and perspectives without explicit action labels.
    • Allows deploying generalist policies to various robots via efficient latent action decoding.

OpenHelix

I, L → A (Image, Language → Actions)

RobotxR1

I, L → A (Image, Language → Actions)

GraspVLA

I, L → A (Image, Language → Actions)

Uncertainty-Aware RWM (RWM-U)

I, P, A → I', S (Image, Proprioception, Actions → Future Images, Uncertainty)

Nvidia Isaac GR00T N1.5

I, P, L → A (Image, Proprioception, Language → Actions)

HybridVLA

I, L → A (Image, Language → Actions)

Magma

I, Vid, L → A (Image, Video, Language → Actions)

DexVLA

I, L → A (Image, Language → Actions)

Nvidia Cosmos

Vid, L, A → Vid, L (Video, Language, Control → Video, Language/Reasoning)

$\pi_{0.5}$ (Pi 0.5)

I, P, L → A (Image, Proprioception, Language → Actions)

  • Website: physicalintelligence.company/blog/pi05
  • Notes:
    • Released Early 2025.
    • An evolution of π0 focused on open-world generalization.
    • Capable of controlling mobile manipulators to perform tasks in entirely unseen environments like kitchens and bedrooms.

SmolVLA

I, Vid, L → A (Image, Video, Language → Actions)

  • Website: smolvla.net
  • Blog: huggingface.co/blog/smolvla
  • Notes:
    • Released Early 2025.
    • A compact (~450M parameter) Vision-Language-Action model designed for efficiency.
    • Optimized for running on consumer-grade GPUs and edge devices.
    • Trained on the LeRobot community datasets.
    • Personal Note: Tried to train this and successfully ran inference on a consumer GPU. Very fast and lightweight.

Genie 3

I, L → Vid (Image, Language → Interactive World Video)

  • Website: deepmind.google/blog/genie-3-a-new-frontier-for-world-models/
  • Notes:
    • Released Early 2025.
    • A general-purpose world model capable of generating interactive environments at 24fps.
    • Used to train embodied agents (like SIMA) in rich, simulated worlds.
    • Maintains environmental consistency over long horizons (minutes) and allows promptable world events.

RDT-2

I, L → A (Image, Language → Actions)

  • Website: rdt-robotics.github.io/rdt2/
  • Code: thu-ml/RDT2
  • Weights: Hugging Face
  • Notes:
    • Released Early 2025.
    • The sequel to RDT-1B, designed for zero-shot cross-embodiment generalization.
    • RDT2-VQ: A 7B VLA adapted from Qwen2.5-VL-7B, using Residual VQ for action tokenization.
    • RDT2-FM: Uses a Flow-Matching action expert for lower latency control.
    • Trained on 10,000+ hours of human manipulation videos across 100+ scenes (UMI data).

ELLMER

I, L, F → A (Image, Language, Force → Actions)

LiReN

I, G → A (Image, Goal → Actions)

LAC-WM

I, A → I' (Image, Actions → Predicted Image)

Lift3D Policy

I, P → A (Image, Proprioception → Actions)

3DS-VLA

I, L, D → A (Image, Language, Depth → Actions)

DYNA-1

I, L → A (Image, Language → Actions)

  • Website: dyna.co
  • Notes:
    • Released Early 2025.
    • Production-ready foundation model built for autonomy at scale.
    • Achieved >99% success rate in 24-hour non-stop operation.
    • Deployed in commercial settings like hotels and gyms.

🏛️ 2024 & Older

$\pi_0$ (Pi 0)

I, P, L → A (Image, Proprioception, Language → Actions)

OpenVLA

I, L → A (Image, Language → Actions)

RoboCat (2023)

I, P, G → A (Image, Proprioception, Goal Image → Actions)


🤖 Noteworthy Benchmarks / Auxiliary Frameworks

KinDER

  • Website: kinder-site
  • Paper: KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
  • Code: Princeton-Robot-Planning-and-Learning/kindergarden
  • Notes:
    • Released Apr 2026.
    • A physical reasoning benchmark for robot learning and planning.
    • Comprises 25 procedurally generated environments testing spatial relations, nonprehensile multi-object manipulation, tool use, combinatorial geometric constraints, and dynamic constraints.
    • Includes a Gymnasium-compatible Python library with parameterized skills and demonstrations.
    • Provides a standardized evaluation suite with 13 baselines spanning TAMP, imitation learning, RL, and foundation-model-based approaches.

RBench & RoVid-X

  • Paper: Rethinking Video Generation Model for the Embodied World
  • Notes:
    • Released Jan 2026.
    • Introduces RBench, a comprehensive robotics benchmark for video generation.
    • Presents RoVid-X, a large-scale high-quality robotic dataset for training video generation models.
    • Evaluation results on 25 video models show high agreement with human assessments.

CRISP


SafeDec: Constrained Decoding for Robotics Foundation Models


Risk-Guided Diffusion


Adapt3R: Adaptive 3D Scene Representation for Domain Transfer


Towards Safe Robot Foundation Models

  • Paper: Towards Safe Robot Foundation Models
  • Notes:
    • Released Mar 2025.
    • Introduces a safety layer to constrain the action space of any generalist policy.
    • Uses ATACOM, a safe reinforcement learning algorithm, to create a safe action space and ensure safe state transitions (a heavily simplified version of the idea is sketched below).
    • Facilitates deployment in safety-critical scenarios without requiring specific safety fine-tuning.
    • Demonstrated effectiveness in avoiding collisions in dynamic environments (e.g., air hockey).
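A heavily simplified sketch of the underlying idea of restricting actions to a constraint-aware space; this is a generic first-order null-space safety filter, not the ATACOM algorithm itself, and the `constraint`/`constraint_jacobian` callables are hypothetical:

```python
import numpy as np

def safe_action(q, a_raw, constraint, constraint_jacobian, beta=1.0):
    """Filter a raw policy action so it (to first order) respects c(q) <= 0.

    constraint(q)          -> (m,) constraint values, safe when <= 0
    constraint_jacobian(q) -> (m, n) Jacobian dc/dq
    """
    c = constraint(q)
    Jc = constraint_jacobian(q)
    Jc_pinv = np.linalg.pinv(Jc)
    # Keep only the action component tangent to the constraint manifold ...
    tangent = (np.eye(q.shape[0]) - Jc_pinv @ Jc) @ a_raw
    # ... and add a corrective term that pushes back whenever a constraint is violated.
    correction = -beta * Jc_pinv @ np.maximum(c, 0.0)
    return tangent + correction
```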

📚 Influential Posts & Videos

Vision-Language-Action Models and the Search for a Generalist Robot Policy

  • Link: Substack Post by Chris Paxton
  • Notes:
    • A general overview of VLAs in the real world, with an excellent section on common failures.
    • Full of great insights and references.

Where's RobotGPT?

  • Link: YouTube Video by Dieter Fox
  • Notes:
    • This talk exists in many video forms; it's best to find the most recent version.
    • Focuses on the current state of robotics models and what is needed to achieve LLM-level general intelligence in robots.
