Curated database of foundation models for robotics
- These are my personal notes, so mistakes can happen. Please don't be offended if your work is not here; just open an issue or PR.
- AI NOW HELPS ME ADD PAPERS. MISTAKES HAPPEN. PLEASE DOUBLE-CHECK ALL INFO.
- Included models: foundational works, open-weight/open-source works, works I saw on X, YouTube, or LinkedIn, works I trained, and works I tried to train but couldn't.
- "Actions" covers chunked, single-step, end-effector, and joint actions; unfortunately I cannot track this distinction for every work. Also, most models can be adapted to other modalities.
Modality Legend:
I: Image | Vid: Video | L: Language/Text | A: Actions | P: Proprioception | T: Tactile | D: Depth | G: Goal | S: State/Sensors | M: Memory | F: Force | A': Future Actions | I': Future Images | I_plan: Image-Space Plan | Vp: Viewpoint | Val: Value / Expected Reward | Prog: Progress Tracking
Notebook: Link
I, L → A (Image, Language → Actions)
- Website: pku-epic.github.io/LDA
- Paper: LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion
- Code: PKU-EPIC/LDA
- Notes:
- Released Feb 2026.
- Jointly learns dynamics, policy, and visual forecasting.
- Assembled EI-30k, an embodied interaction dataset comprising over 30k hours of trajectories.
- Uses a structured DINO latent space for scalable dynamics learning.
- Employs a multi-modal diffusion transformer to handle asynchronous vision and action streams.
- Outperforms prior methods (like $\pi_{0.5}$) on contact-rich, dexterous, and long-horizon tasks.
I, L → A (Image, Language → Actions)
- Website: pi.website/pi07
- Paper: $\pi_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
- Notes:
- Released Apr 2026.
- A steerable generalist robot foundation model.
- Trained with diverse prompts that contain task description, detailed language, generated subgoal images, and episode metadata.
- Exhibits compositional generalization, recombining skills from various tasks to solve new problems.
S → A (State → Actions)
- Website: rice-robotpi-lab.github.io/ManiDreams
- Paper: ManiDreams: An Open-Source Library for Robust Object Manipulation via Uncertainty-aware Task-specific Intuitive Physics
- Code: Rice-RobotPI-Lab/ManiDreams
- Notes:
- Released Mar 2026.
- A modular framework for uncertainty-aware manipulation planning over intuitive physics models.
- Maintains a time-varying constraint (cage) around target objects, sampling and evaluating candidate actions via parallel forward simulation.
- Supports simulation-based and learning-based backends.
I, L → A (Image, Language → Actions)
- Website: sites.google.com/view/force-vla2/home
- Paper: ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
- Notes:
- Released Mar 2026.
- End-to-end vision-language-action framework for hybrid force-position control.
- Introduces force-based prompts to the VLM and uses a Cross-Scale Mixture-of-Experts (MoE) in the action expert to fuse task concepts with real-time interaction forces.
- Outperforms pi0 and pi0.5 in contact-rich manipulation tasks.
I, L → A (Image, Language → Actions)
- Website: yy-gx.github.io/LiLo-VLA
- Paper: LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
- Notes:
- Released Feb 2026.
- Linked Local VLA framework for compositional long-horizon manipulation.
- Decouples transport (global motion) from interaction (object-centric VLA).
- Zero-shot generalization to novel long-horizon tasks.
- Introduces LIBERO-Long++ and Ultra-Long benchmarks.
I, L → A (Image, Language → Actions)
- Paper: Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild
- Notes:
- Released Feb 2026.
- Pretraining framework that learns Jointly-Aligned Latent Actions (JALA).
- Learns a predictive action embedding aligned with both inverse dynamics and real actions.
- Scales with UniHand-Mix, a 7.5M video corpus (>2,000 hours).
- Significantly improves downstream robot manipulation performance.
I, L → A (Image, Language → Actions)
- Paper: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
- Code: Kisaragi0/SC-VLA
- Notes:
- Released Feb 2026.
- Achieves self-improvement by intrinsically guiding action refinement through sparse imagination.
- Integrates auxiliary predictive heads to forecast current task progress and future trajectory trends.
- Introduces online action refinement to reshape progress-dependent dense rewards.
- Yields highest task throughput with 16% fewer steps and 9% higher success rate than baselines.
I, L → A (Image, Language → Actions)
- Paper: HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning
- Notes:
- Released Feb 2026.
- Unified VLA model for Embodied Multimodal Chain-of-Thought (EM-CoT) reasoning.
- Mixture-of-Transformers (MoT) architecture decoupling semantic reasoning, visual foresight, and action prediction.
- Surpasses baseline policy pi_0 by 34.1% on RoboTwin benchmark.
- Demonstrates strong generalization under aggressive unseen environmental randomization.
I, L → A (Image, Language → Actions)
- Website: hatchetproject.github.io/autohorizon
- Paper: VLA Knows Its Limits
- Notes:
- Released Feb 2026.
- Test-time method that dynamically estimates the execution horizon for each predicted action chunk.
- Analyzes self-attention weights in flow-based VLAs.
- Finds that intra-chunk actions attend invariantly to vision-language tokens.
- Incurs negligible computational overhead and generalizes across diverse tasks and flow-based models.
Vid, L → Val (Video, Language → Value)
- Website: topreward.github.io/webpage
- Paper: TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
- Code: TOPReward/TOPReward
- Notes:
- Released Feb 2026.
- A zero-shot progress estimator that interprets pretrained video VLM token likelihoods as temporal value functions.
- Avoids relying on numerical output, leveraging token probabilities (e.g., the token "True") for instruction satisfaction.
- Enables success detection and reward-aligned behavior cloning.
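A minimal sketch of the core idea above: read the probability the video VLM assigns to a success-indicating token (e.g., "True") and use it as a scalar progress/reward signal. The `vlm` callable, prompt wording, and token lookup are assumptions for illustration, not the paper's API.

```python
import torch

def progress_from_token_prob(vlm, frames, instruction, token_id_true):
    """Hypothetical sketch: query a video VLM with a yes/no success question and
    interpret the probability mass on the "True" token as task progress/reward,
    instead of parsing a numerical answer."""
    logits = vlm(frames, f"Has the robot completed: {instruction}? Answer True or False.")
    probs = torch.softmax(logits, dim=-1)      # distribution over the vocabulary
    return probs[token_id_true].item()         # in [0, 1]; higher = closer to success
```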
I, L → A (Image, Language → Actions)
- Website: dravenalg.github.io/VLANeXt
- Paper: VLANeXt: Recipes for Building Strong VLA Models
- Code: DravenALG/VLANeXt
- Notes:
- Released Feb 2026.
- Systematically explores the VLA design space under a unified framework to distill 12 key findings.
- Introduces VLANeXt, a simple yet effective model that outperforms prior state-of-the-art on LIBERO and LIBERO-plus benchmarks.
- Demonstrates strong generalization in real-world experiments.
I, L → A (Image, Language → Actions)
- Website: robogene-boost-vla.github.io
- Paper: RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation
- Notes:
- Released Feb 2026.
- Agentic framework for automated generation of diverse manipulation tasks.
- Integrates diversity-driven sampling, self-reflection, and human-in-the-loop refinement.
- Collected 18k trajectories.
- VLA models pre-trained with RoboGene achieve higher success rates and superior generalization.
I, L → A, Vid (Image, Language → Actions, Video)
- Website: dreamzero0.github.io
- Paper: World Action Models are Zero-shot Policies
- Code: dreamzero0/dreamzero
- Weights: Hugging Face
- Notes:
- Released Feb 2026.
- World Action Model (WAM) that jointly predicts actions and videos.
- Achieves strong zero-shot generalization to new tasks and environments (over 2x improvement vs VLAs).
- Demonstrates efficient cross-embodiment transfer (adapts to new robot with 30 mins of play data).
- Enables real-time closed-loop control at 7Hz via model and system optimizations (DreamZero-Flash).
I, L → A, I' (Image, Language → Actions, Future Images)
- Paper: FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution
- Notes:
- Released Feb 2026.
- Unified architecture reformulating long-horizon control and future forecasting as a monolithic sequence-generation task.
- Leverages Temporally Adaptive Compression for high spatiotemporal information density.
- Performs Latent-Space Autoregression to align actionable dynamics with reviewable visual look-aheads.
- Enables prediction-guided Human-In-the-Loop mechanisms.
- Achieves 99.2% success on LIBERO.
I, L → A (Image, Language → Actions)
- Paper: DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI
- Code: Dexmal/dexbotic
- Notes:
- Released Feb 2026.
- Embodied-Native VLA framework designed for Physical AI.
- Unifies embodied manipulation and navigation by learning from heterogeneous data sources.
- Builds a flow-matching action expert atop the VLM.
- Uses Embodied Spatial Scaffolding for spatial CoT reasoning.
- Achieves SOTA performance on RoboChallenge benchmark.
I, L → A (Image, Language → Actions)
- Website: alibaba-damo-academy.github.io/RynnBrain.github.io
- Paper: RynnBrain: Open Embodied Foundation Models
- Notes:
- Released Feb 2026.
- Open-source spatiotemporal foundation model for embodied intelligence.
- Strengthens egocentric understanding, localization, reasoning, and physics-aware planning.
- Family includes 2B, 8B, and 30B (MoE) models.
- Outperforms existing embodied foundation models on 20 benchmarks.
I, P → A (Image, Proprioception → Actions)
- Website: apex-humanoid.github.io
- Paper: APEX: Learning Adaptive High-Platform Traversal for Humanoid Robots
- Notes:
- Released Feb 2026.
- System for perceptive, climbing-based high-platform traversal for humanoids.
- Composes terrain-conditioned behaviors (climb-up, climb-down, walk, crawl).
- Uses a generalized ratchet progress reward for learning contact-rich maneuvers.
- Demonstrates zero-shot sim-to-real traversal of 0.8 meter platforms on Unitree G1.
I, L → A (Image, Language → Actions)
- Website: opendrivelab.com/kai0-rl
- Paper: RISE: Self-Improving Robot Policy with Compositional World Model
- Notes:
- Released Feb 2026.
- Scalable framework for robotic reinforcement learning via imagination.
- Compositional World Model: predicts multi-view future via controllable dynamics model and evaluates outcomes.
- Enables continuous self-improvement in imaginary space without costly physical interaction.
- Achieves +35-45% improvement on real-world manipulation tasks.
Vid → I', Physics (Video → Future Images, Physics)
- Paper: ContactGaussian-WM: Learning Physics-Grounded World Model from Videos
- Notes:
- Released Feb 2026.
- Differentiable physics-grounded rigid-body world model.
- Uses a unified Gaussian representation for visual appearance and collision geometry.
- Learns physical laws directly from sparse and contact-rich video data.
- Outperforms SOTA in learning complex scenarios and robust generalization.
I, L → A (Image, Language → Actions)
- Website: vista-wm.github.io
- Paper: Scaling World Model for Hierarchical Manipulation Policies
- Notes:
- Released Feb 2026.
- Hierarchical VLA framework using a world model for visual subgoal decomposition.
- High-level world model divides tasks into subtask sequences with synthesized goal images.
- Synthesized images provide visually and physically grounded details for the low-level policy.
- Boosts performance in novel scenarios from 14% to 69% with world model guidance.
I, L → A, Vid (Image, Language → Actions, Video)
- Paper: Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation
- Notes:
- Released Feb 2026.
- Framework for fast and predictive video-conditioned action.
- Uses adversarial distillation for fast, few-step video generation ("Dreaming").
- Action model leverages both generated videos and real observations to correct spatial errors.
- Produces spatially accurate video predictions supporting precise manipulation.
I, L → A (Image, Language → Actions)
- Website: lap-vla.github.io
- Paper: LAP: Language-Action Pre-Training Enables Zero-shot Cross-Embodiment Transfer
- Notes:
- Released Feb 2026.
- Language-Action Pre-training: represents low-level robot actions directly in natural language.
- Aligns action supervision with the pre-trained VLM's input-output distribution.
- LAP-3B achieves >50% average zero-shot success on novel robots without fine-tuning.
- Unifies action prediction and VQA in a shared language-action format.
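A minimal sketch of what "actions in natural language" could look like: a low-level end-effector delta is rendered as plain text so action supervision matches the VLM's native input/output distribution. The phrasing and units here are assumptions, not the paper's exact format.

```python
def action_to_language(delta_pos, delta_rot_deg, gripper_open):
    """Hypothetical sketch: serialize an end-effector action as a natural-language string."""
    return (f"move x {delta_pos[0]:+.3f} m, y {delta_pos[1]:+.3f} m, z {delta_pos[2]:+.3f} m; "
            f"rotate roll {delta_rot_deg[0]:+.1f}, pitch {delta_rot_deg[1]:+.1f}, "
            f"yaw {delta_rot_deg[2]:+.1f} deg; "
            f"gripper {'open' if gripper_open else 'close'}")

# e.g. action_to_language((0.02, 0.0, -0.05), (0.0, 0.0, 15.0), True)
```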
I, L → A (Image, Language → Actions)
- Website: locovlm.github.io
- Paper: LocoVLM: Grounding Vision and Language for Adapting Versatile Legged Locomotion Policies
- Notes:
- Released Feb 2026.
- Integrates high-level commonsense reasoning from foundation models into legged locomotion.
- Uses a VLM to extract environmental semantics and ground them in a skill database.
- Trains a style-conditioned policy for diverse locomotion skills.
- Achieves 87% instruction-following accuracy.
I, L → A (Image, Language → Actions)
- Website: internrobotics.github.io/internvla-m1.github.io
- Paper: ST4VLA: Spatially Guided Training for Vision-Language-Action Models
- Notes:
- Released Feb 2026.
- Dual-system VLA framework leveraging Spatial Guided Training.
- Stage 1: Spatial grounding pre-training (point, box, trajectory prediction).
- Stage 2: Spatially guided action post-training with spatial prompting.
- Substantial improvements on Google Robot and WidowX Robot tasks.
Vid, A → Vid' (Video, Actions → Future Video) [World Model]
- Website: dreamdojo-world.github.io
- Paper: DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
- Code: NVIDIA/DreamDojo
- Weights: nvidia/DreamDojo
- Notes:
- Released Feb 2026 by NVIDIA.
- Foundation world model learning diverse interactions and dexterous controls from 44k hours of egocentric human videos (`DreamDojo-HV` dataset).
- Introduces continuous latent actions as a hardware-agnostic proxy to extract control signals from unlabelled human video.
- Distillation pipeline accelerates autoregressive generation to real-time 10.81 FPS, enabling live teleoperation, policy evaluation, and model-based planning.
I, L → A (Image, Language → Actions)
- Paper: EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models
- Notes:
- Released Feb 2026.
- Unified and scalable VLM grounding high-level instructions into precise, spatially aware humanoid actions.
- Predicts locomotion primitives (walk, turn), head movements, and manipulation commands.
- Leverages broad supervision from real-world demos, spatial reasoning QA, and simulated demos.
- Inference under 1s with 4B and 8B parameter models.
I, L → A (Image, Language → Actions)
- Paper: GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning
- Website: aigeeksgroup.github.io/GeneralVLA
- Code: AIGeeksGroup/GeneralVLA
- Notes:
- Released Feb 2026.
- Hierarchical VLA model enabling zero-shot manipulation without real-world robotic data collection.
- High-level ASM (Affordance Segmentation Module) perceives image keypoint affordances.
- Mid-level 3DAgent carries out task understanding and trajectory planning.
- Low-level 3D-aware control policy executes precise manipulation.
I, L → A (Image, Language → Actions)
- Paper: SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models
- Notes:
- Released Feb 2026.
- Inference strategy that jointly modulates visual perception and action based on 'self-uncertainty'.
- Inspired by Active Inference theory.
- Requires no additional training, no verifier, and only a single forward pass.
- Broadens exploration in perception and action under high uncertainty.
I → A (Image → Actions)
- Paper: DADP: Domain Adaptive Diffusion Policy
- Website: outsider86.github.io/DomainAdaptiveDiffusionPolicy
- Notes:
- Released Feb 2026.
- Achieves robust adaptation through unsupervised disentanglement and domain-aware diffusion injection.
- Introduces Lagged Context Dynamical Prediction to filter out transient properties.
- Integrates learned domain representations directly into the generative process.
I, L → A (Image, Language → Actions)
- Paper: Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement
- Notes:
- Released Feb 2026.
- Disentangles visual inputs into multi-level static and dynamic tokens.
- Retains a single copy of static tokens (e.g., background) to significantly reduce context length.
- Reuses KV cache of static tokens via a lightweight recache gate.
- Delivers 2.26x inference speedup and improves long-horizon task performance.
I, L → A (Image, Language → Actions)
- Paper: VLS: Steering Pretrained Robot Policies via Vision-Language Models
- Website: vision-language-steering.github.io/webpage
- Notes:
- Released Feb 2026.
- Training-free framework for inference-time adaptation of frozen generative robot policies (diffusion or flow-matching).
- Steers sampling process using VLMs to synthesize trajectory-differentiable reward functions.
- Addresses failures near obstacles, on shifted surfaces, or with mild clutter.
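A minimal sketch of inference-time steering of a frozen generative policy, assuming a flow-matching sampler and a differentiable trajectory reward (as the VLM-synthesized reward described above would provide). Function names and the guidance rule are placeholders, not the paper's implementation.

```python
import torch

def steer_flow_policy(velocity_fn, reward_fn, obs, horizon, action_dim,
                      steps=10, guidance_scale=0.1):
    """Hypothetical sketch: Euler-integrate a frozen flow-matching policy and nudge
    each step along the gradient of a differentiable trajectory reward."""
    x = torch.randn(horizon, action_dim)              # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        with torch.no_grad():
            v = velocity_fn(obs, x, t)                # frozen policy's velocity field
        x = x.detach().requires_grad_(True)
        r = reward_fn(x)                              # scalar reward on the trajectory
        grad = torch.autograd.grad(r, x)[0]
        x = (x + dt * v + guidance_scale * grad).detach()
    return x
```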
I, L → A (Image, Language → Actions)
- Website: infinitescript.com/project/dynamic-vla
- Paper: DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
- Code: hzxie/DynamicVLA
- Notes:
- Released Jan 2026.
- Enables open-ended dynamic object manipulation by pairing a compact 0.4B VLM with low-latency Continuous Inference.
- Uses Latent-Aware Action Streaming to remove pauses and ensure seamless action transitions.
- Introduces the Dynamic Object Manipulation (DOM) benchmark with 2.8K scenes and 206 objects.
- Outperforms Pi0.5, SmolVLA, and VLASH in dynamic tasks.
D → S (Depth → Representations)
- Paper: DeFM: Learning Foundation Representations from Depth for Robotics
- Code: leggedrobotics/defm
- Notes:
- Released Jan 2026.
- Self-supervised foundation model trained on 60M depth images.
- Uses DINO-style self-distillation to learn metric-aware representations.
- Introduces a three-channel input normalization strategy to preserve metric depth.
- Distilled into compact models (as small as 3M params) for efficient policy learning.
- Achieves SOTA on depth-based navigation, locomotion, and manipulation benchmarks.
I, P → A (Image, Proprioception → Actions)
- Website: sam2act.github.io
- Paper: SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation
- Code: sam2act/sam2act
- Notes:
- Released Jan 2026.
- Integrates the SAM2 visual foundation model with a memory architecture for robotic manipulation.
- SAM2Act+ incorporates a memory bank and encoder for episodic recall, enabling spatial memory-dependent tasks.
- Achieves state-of-the-art performance on RLBench (86.8%) and robust generalization on The Colosseum.
I, L → A (Image, Language → Actions)
- Website: technology.robbyant.com/lingbot-vla
- Paper: A Pragmatic VLA Foundation Model
- Code: robbyant/lingbot-vla
- Notes:
- Released Jan 2026.
- Pre-trained on 20,000 hours of real-world multi-embodiment robot data (9 dual-arm configurations).
- Achieves clear superiority on 100 real-world tasks across 3 platforms.
- Empirically validates Scaling Laws for VLAs: performance scales with data volume without saturation.
- Highly efficient training throughput.
I, P, L → A, I', Val (Image, Proprioception, Language → Actions, Future Images, Value)
- Website: research.nvidia.com/labs/dir/cosmos-policy
- Paper: Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
- Notes:
- Released Jan 2026.
- Adapts `Cosmos-Predict2` (video generation model) into a robot policy via single-stage post-training.
- No architectural modifications to the base video model; actions are encoded as latent frames.
- Generates future state images and values (expected rewards) alongside actions, enabling test-time planning.
- Achieves state-of-the-art performance on LIBERO (98.5%) and RoboCasa (67.1%).
- Can learn from experience (policy rollout data) to refine its world model.
I, A → I' (Image, Actions → Future Images)
- Website: egowm.github.io
- Paper: Walk through Paintings: Egocentric World Models from Internet Priors
- Code: miccooper9/egowm
- Notes:
- Released Jan 2026.
- Transforms pretrained video diffusion models into action-conditioned world models.
- Injects motor commands through lightweight conditioning layers.
- Scales across embodiments, from 3-DoF mobile robots to 25-DoF humanoids.
- Introduces the Structural Consistency Score (SCS) to measure physical correctness.
- Generalizes to unseen environments, including paintings ("Walk through Paintings").
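A minimal sketch of "injecting motor commands through lightweight conditioning layers": a FiLM-style modulation of a pretrained video model's hidden features by the action vector. The module sizes and injection points are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ActionConditioning(nn.Module):
    """Hypothetical sketch: FiLM-style layer that conditions hidden features of a
    pretrained video-diffusion block on an action/motor command."""
    def __init__(self, action_dim: int, hidden_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Sequential(
            nn.Linear(action_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 2 * hidden_dim),
        )

    def forward(self, hidden: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, tokens, hidden_dim); action: (batch, action_dim)
        scale, shift = self.to_scale_shift(action).chunk(2, dim=-1)
        return hidden * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```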
I, L → A (Image, Language → Actions)
- Paper: LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
- Notes:
- Released Jan 2026.
- Addresses "Information Collapse" in goal-driven datasets where language is ignored.
- This collapse occurs because language instructions in existing datasets are often highly predictable from visual observations alone, causing the model to ignore language.
- Proposes a Bayesian decomposition framework with learnable Latent Action Queries.
- Maximizes conditional Pointwise Mutual Information (PMI) between actions and instructions.
I, L → A (Image, Language → Actions)
- Paper: TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control
- Notes:
- Released Jan 2026.
- Addresses high inference latency in large VLA models which causes execution blind spots.
- Proposes a hierarchical framework: low-frequency macro-intent loop caches semantic embeddings, high-frequency micro-control loop interleaves single-step flow integration.
- Enables ~9 Hz control on edge hardware (vs ~2.4 Hz baselines).
- Uses a temporally misaligned training strategy to learn predictive compensation.
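A minimal sketch of the two-rate control structure described above: a slow loop refreshes a cached semantic embedding from the large VLA backbone, while a fast loop turns that cache plus the latest observation into commands. All function names and timings are placeholders, not the paper's code.

```python
import time

def dual_rate_control(vla_backbone, micro_policy, get_observation, send_command,
                      macro_period_s=0.4, micro_period_s=0.11):
    """Hypothetical sketch of a low-frequency macro-intent loop interleaved with a
    high-frequency micro-control loop."""
    cached_embedding = None
    last_macro = 0.0
    while True:
        now = time.monotonic()
        obs = get_observation()
        if cached_embedding is None or now - last_macro >= macro_period_s:
            cached_embedding = vla_backbone(obs)       # slow semantic update (~2-3 Hz)
            last_macro = now
        action = micro_policy(cached_embedding, obs)   # fast control step (~9 Hz)
        send_command(action)
        time.sleep(micro_period_s)                     # crude pacing for the sketch
```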
I, L → A (Image, Language → Actions)
- Paper: HumanoidVLM: Vision-Language-Guided Impedance Control for Contact-Rich Humanoid Manipulation
- Notes:
- Released Jan 2026.
- Enables humanoids (Unitree G1) to select task-appropriate impedance parameters from egocentric vision.
- Combines a VLM for semantic inference with a FAISS-based RAG module which retrieves experimentally validated stiffness-damping pairs for compliant manipulation.
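A minimal sketch of the FAISS-based retrieval step: embed a semantic task description and look up the nearest stored stiffness-damping pairs. The embeddings, database contents, and averaging rule are assumptions for illustration.

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Hypothetical knowledge base: embeddings of contact/task descriptions paired with
# experimentally validated (stiffness, damping) settings.
descriptions_emb = np.random.rand(100, 512).astype("float32")   # placeholder embeddings
impedance_params = np.random.rand(100, 2).astype("float32")      # [stiffness, damping]

index = faiss.IndexFlatL2(512)
index.add(descriptions_emb)

def retrieve_impedance(query_embedding: np.ndarray, k: int = 3) -> np.ndarray:
    """Return impedance parameters averaged over the k nearest stored experiences."""
    _, idx = index.search(query_embedding.reshape(1, -1).astype("float32"), k)
    return impedance_params[idx[0]].mean(axis=0)
```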
I, P, L → A (Image, Proprioception, Language → Actions)
- Paper: TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers
- Notes:
- Released Jan 2026.
- Resolves the tension between general semantic understanding and fine-grained motor skills.
- Features an Asymmetric Mixture-of-Transformers (AsyMoT) where the "Right Brain" (proprioception) can dynamically query the frozen "Left Brain" (VLM) for semantic knowledge, rather than just using standard fine-tuning.
- Uses a Flow-Matching Action Expert for precise control.
I, L → A (Image, Language → Actions)
- Paper: DroneVLA: VLA based Aerial Manipulation
- Notes:
- Released Jan 2026.
- Applies VLA models to autonomous aerial manipulation with a custom drone.
- Integrates Grounding DINO as a separate module for object localization and dynamic planning within the pipeline.
- Uses a human-centric controller for safe handovers.
I, L → A (Image, Language → Actions)
- Website: 2toinf.github.io/UniAct
- Paper: Universal Actions for Enhanced Embodied Foundation Models
- Code: 2toinf/UniAct
- Notes:
- Released Jan 2026.
- Operates in a Universal Action Space constructed as a vector-quantized (VQ) codebook.
- Learns universal actions capturing generic atomic behaviors shared across robots.
- Uses streamlined heterogeneous decoders to translate universal actions into embodiment-specific commands.
- 0.5B model outperforms significantly larger models (14x larger).
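A minimal sketch of a vector-quantized universal action space with per-embodiment decoders, in the spirit of the entry above. Codebook size, decoder shapes, and embodiment names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class UniversalActionSpace(nn.Module):
    """Hypothetical sketch: quantize a latent to a shared codebook of atomic behaviours,
    then decode the selected code into embodiment-specific commands."""
    def __init__(self, codebook_size=256, code_dim=64, embodiments=None):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.decoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, dof))
            for name, dof in (embodiments or {}).items()
        })

    def forward(self, latent: torch.Tensor, embodiment: str) -> torch.Tensor:
        dists = torch.cdist(latent, self.codebook.weight)   # (batch, codebook_size)
        code = self.codebook(dists.argmin(dim=-1))           # nearest universal action code
        return self.decoders[embodiment](code)               # embodiment-specific command

# Usage sketch: UniversalActionSpace(embodiments={"franka_7dof": 7, "ur5_6dof": 6})
```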
I, L, D → A, Vp (Image, Language, Depth → Actions, Viewpoint)
- Paper: ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation
- Notes:
- Released Jan 2026.
- Injects active perception into VLA models to address limitations of static, end-effector-centric views.
- Adopts a coarse-to-fine paradigm: first localizes critical 3D regions, then optimizes active perception.
- Uses Active View Selection to choose viewpoints that maximize task relevance/diversity and minimize occlusion.
- Applies Active 3D Zoom-in to enhance resolution in key areas for fine-grained manipulation.
- Outperforms baselines on simulation benchmarks and transfers to real-world tasks.
I, D, A → S (Image, Depth, Actions → 3D Point Flow)
- Website: point-world.github.io
- Paper: PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
- Code: huangwl18/PointWorld
- Notes:
- Released Jan 2026.
- Large pre-trained 3D world model forecasting future states from single RGB-D images.
- Represents actions and state changes as 3D point flows (per-pixel displacements in 3D space), enabling geometry-aware predictions.
- Unifies state and action in a shared 3D space, facilitating cross-embodiment learning.
- Trained on ~2M trajectories and 500 hours of real and simulated data.
- Enables diverse zero-shot manipulation skills (pushing, tool use) via MPC.
I, L → A (Image, Language → Actions)
- Website: cladernyjorn.github.io/VLM4VLA.github.io
- Paper: VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
- Code: CladernyJorn/VLM4VLA
- Notes:
- Released Jan 2026.
- Unified training and evaluation framework (VLM4VLA) for studying VLM backbones in VLAs.
- Reveals that VLM general capabilities (VQA) are poor predictors of downstream VLA performance.
- Identifies the vision encoder as the primary bottleneck; fine-tuning it is crucial (freezing it leads to degradation).
- Finds that fine-tuning on auxiliary embodied tasks (e.g., embodied QA, visual pointing) does not guarantee better control performance.
I, L → Vid, A (Image, Language → Video, Actions)
- Website: 1x.tech/ai
- Notes:
- Released Jan 2026.
- Video-pretrained world model serving as NEO's cognitive core.
- Derives robot actions from text-conditioned video generation (14B parameter backbone).
- Uses a two-stage process: generates future video frames (World Model), then extracts actions via an Inverse Dynamics Model (IDM).
- Trained on web-scale video, 900 hours of egocentric human video, and fine-tuned on 70 hours of robot data.
- Explicitly functions as a World Model, predicting/hallucinating outcomes before execution.
- Generalizes to novel objects and tasks without teleoperation data.
I, P, L → A (Image, Proprioception, Language → Actions)
- Website: developer.nvidia.com/isaac/gr00t
- Research Page: research.nvidia.com/labs/gear/gr00t-n1_6/
- Code: NVIDIA/Isaac-GR00T
- Weights: Hugging Face
- Notes:
- Released Jan 2026.
- Reasoning VLA model for generalist humanoid robots.
- Integrates NVIDIA Cosmos Reason for high-level reasoning and contextual understanding.
- Unlocks full-body control for simultaneous moving and manipulation.
I, L, T → A (Image, Language, Tactile → Actions)
- Website: microsoft.com/en-us/research/story/advancing-ai-for-the-physical-world/
- Notes:
- Released Jan 2026.
- The first robotics model derived from Microsoft's Phi series.
- VLA+ Model: Integrates tactile sensing directly into the decision-making process.
- Uses a split architecture: a VLM for high-level reasoning and a specialized action expert for high-frequency control.
- Trained using physical demonstrations and simulation (Isaac Sim).
I, P, L → A (Image, Proprioception, Language → Actions)
- Website: physicalintelligence.company/blog/pistar06
- Notes:
- Released early 2026.
- Introduces Reinforcement Learning (RL) to the VLA training pipeline.
- Allows the model to learn from experience, significantly improving success rates and throughput on real-world tasks.
- Personal Note: I tried to train this locally but couldn't get the RL pipeline to converge due to limited VRAM scaling on my setup.
I, L → A (Image, Language → Actions)
- Paper: Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
- Notes:
- Released Dec 2025.
- Diffusion LLM-based VLA (dVLA) developed through continuous pre-training on open robotic datasets.
- Natively bidirectional diffusion backbone is inherently suited for action chunking and parallel generation.
- Demonstrates superior performance on VLA tasks compared to autoregressive baselines.
I, L → A (Image, Language → Actions)
- Website: vla-motion.github.io
- Paper: Robotic VLA Benefits from Joint Learning with Motion Image Diffusion
- Notes:
- Released Dec 2025.
- Enhances VLAs with motion reasoning by jointly training with a motion image diffusion head (optical flow).
- The motion head acts as an auxiliary task, improving the shared representation.
- Improves success rates on LIBERO (97.5%) and real-world tasks (23% gain).
- No additional inference latency.
I, L → A (Image, Language → Actions)
- Paper: FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization
- Notes:
- Released Dec 2025.
- Builds on the FAST tokenizer with block-wise autoregressive decoding and a lightweight action expert.
- Uses a learnable action tokenizer (FASTerVQ) that encodes action chunks as single-channel images.
- Achieves faster inference and higher task performance compared to diffusion VLAs.
I, L → A (Image, Language → Actions)
- Paper: ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
- Notes:
- Released Dec 2025.
- Unified VLA framework with Mixture-of-Transformers (MoT).
- Generates intermediate "manuals" (images, position prompts, textual instructions) via a planning expert.
- Uses a Manual Chain-of-Thought (ManualCoT) reasoning process.
- Achieves 32% higher success rate on long-horizon tasks like LEGO assembly.
I, L → A (Image, Language → Actions)
- Website: seed.bytedance.com/gr_rl
- Paper: GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation
- Notes:
- Released Dec 2025.
- Turns a generalist VLA policy into a specialist for long-horizon dexterous manipulation.
- Uses a multi-stage training pipeline (filtering, augmentation, online RL).
- The online RL component learns a latent space noise predictor to align the policy with deployment behaviors.
- Can autonomously lace up a shoe (83.3% success rate), requiring millimeter-level precision.
S, P → A (State, Proprioception → Actions)
- Website: nvlabs.github.io/SONIC
- Paper: SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control
- Code: huggingface/trl (Related)
- Notes:
- Released Nov 2025.
- Addresses the challenge of diverse human motion data scarcity by extracting an expansive motion dataset (OmniHuman) containing diverse skills and realistic movements.
- Introduces SONIC, a large-scale neural tracking policy demonstrating natural humanoid motions with up to 10.7x lower tracking error.
- Validates zero-shot deployment in real-world scenarios for expressive humanoid movements.
I, L → A (Image, Language → Actions)
- Website: everydayvla.github.io
- Paper: EveryDayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation
- Notes:
- Released Nov 2025.
- Aims to democratize robotic manipulation with affordable hardware ($300 6-DOF arm).
- Unified model jointly outputting discrete and continuous actions.
- Features an adaptive-horizon ensemble to monitor motion uncertainty and trigger on-the-fly re-planning.
- Matches SOTA on LIBERO benchmark.
I, L → A (Image, Language → Actions)
- Paper: XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations
- Notes:
- Released Nov 2025.
- Introduces Unified Vision-Motion Codes (UVMC), a discrete latent representation for visual dynamics and robotic motion.
- Uses a dual-branch VQ-VAE to jointly encode vision and motion.
- Demonstrates strong cross-task and cross-embodiment generalization in real-world experiments.
I, L → A, I' (Image, Language → Actions, Future Images)
- Paper: Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process
- Notes:
- Released Nov 2025.
- Jointly understands, generates future images, and acts using a synchronous denoising process.
- Integrates multiple modalities into a single denoising trajectory (JD3P).
- Achieves 4x faster inference than autoregressive methods on benchmarks like CALVIN and LIBERO.
I, P → A (Image, Proprioception → Actions)
- Website: lei-kun.github.io/RL-100
- Paper: RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
- Code: Lei-Kun/Uni-o4
- Notes:
- Released Oct 2025.
- Tackles the efficiency issues of traditional deep RL in real-world environments.
- Introduces an algorithmic framework built on a cross-modal transformer backbone designed to learn directly on real robots.
- Achieves rapid acquisition of complex manipulation skills (within 1-2 hours) with high success rates across 100+ tasks in a single day.
I, L → A (Image, Language → Actions)
- Paper: ManiAgent: An Agentic Framework for General Robotic Manipulation
- Notes:
- Released Oct 2025.
- Agentic architecture for general manipulation tasks.
- Uses multi-agent communication for perception, sub-task decomposition, and action generation.
- Achieves 95.8% success rate on real-world pick-and-place tasks.
I, L → A (Image, Language → Actions)
- Paper: X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
- Notes:
- Released Oct 2025.
- Uses "soft prompts" (learnable embeddings) to adapt to different robot embodiments and datasets.
- Treats each hardware setup as a distinct "task" guided by these prompts.
- Built on a flow-matching-based VLA architecture.
I, L → A (Image, Language → Actions)
- Paper: IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction
- Notes:
- Released Oct 2025.
- Focuses on implicit human intention reasoning for complex interactions.
- Uses a curriculum training paradigm combining intention inference, spatial grounding, and embodied reasoning.
- Significantly outperforms baselines on out-of-distribution intention tasks.
I, Vid, L → A, Val (Image, Video, Language → Actions, Reasoning/Value)
- Website: deepmind.google/models/gemini-robotics/
- Paper: Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer
- Notes:
- Released Oct 2025.
- A dual-model system: VLA for low-level control and Embodied Reasoning (ER) for high-level planning.
- Interleaves actions with a natural language "thinking" process to decompose complex tasks.
- Demonstrates motion transfer, allowing policies to adapt across different robot embodiments (e.g., Aloha to Apollo).
I, L → A (Image, Language → Actions)
- Paper: CLAP: A Closed-Loop Diffusion Transformer Action Foundation Model for Robotic Manipulation
- Notes:
- Presented at IROS 2025 (October).
- Componentized VLA architecture with a specialized action module and a critic module.
- Uses diffusion action transformers for modeling continuous temporal actions.
- The critic module enables closed-loop inference by refining actions based on feedback.
- Outperforms methods that use simple action quantization, handling complex, high-precision tasks and generalizing to unseen objects.
I, P, T, L → A (Image, Proprioception, Tactile, Language → Actions)
- Website: sites.google.com/view/open-mla
- Paper: MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation
- Notes:
- Released Sep 2025.
- Integrates 2D visual, 3D geometric, and tactile cues.
- Repurposes the LLM itself as a perception module (encoder-free alignment).
- Predicts future multisensory objectives to facilitate physical world modeling.
Vid, L → Prog (Video, Language → Progress)
- Website: qianzhong-chen.github.io/sarm.github.io
- Paper: SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
- Code: Qianzhong-Chen/SARM
- Notes:
- Released Sep 2025.
- Stage-Aware Reward Modeling framework for long-horizon robot manipulation.
- Jointly predicts the high-level task stage and fine-grained progress within each stage from video frames.
- Uses natural language subtask annotations to derive consistent progress labels.
- Enables Reward-Aligned Behavior Cloning (RA-BC), weighting training samples based on predicted progress.
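A minimal sketch of Reward-Aligned Behavior Cloning as described above: weight each sample's imitation loss by the progress gain the reward model predicts for it. The exact weighting scheme (softmax over progress deltas) is an assumption.

```python
import torch

def reward_aligned_bc_loss(pred_actions, target_actions, progress, prev_progress,
                           temperature=0.1):
    """Hypothetical sketch: behaviour cloning where samples that advance predicted
    task progress more receive larger training weight."""
    per_sample = ((pred_actions - target_actions) ** 2).mean(dim=-1)   # plain BC loss
    progress_gain = (progress - prev_progress).clamp(min=0.0)          # predicted Δprogress
    weights = torch.softmax(progress_gain / temperature, dim=0) * len(progress_gain)
    return (weights.detach() * per_sample).mean()
```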
I, L → A (Image, Language → Actions)
- Website: motovla.github.io
- Paper: Generalist Robot Manipulation beyond Action Labeled Data
- Notes:
- Released Sep 2025.
- Leverages motion data (without explicit action labels) to train generalist policies.
- Introduces a Motion Tokenizer to learn discrete motion representations.
- Enables scaling up training data by utilizing large-scale video datasets.
G, P → A (Goal/Objectives, Proprioception → Actions)
- Paper: Behavior Foundation Model for Humanoid Robots
- Notes:
- Released Sep 2025.
- Generative model pretrained on large-scale behavioral datasets for humanoid robots.
- Models the distribution of full-body behavioral trajectories conditioned on goals and proprioception.
- Integrates masked online distillation with CVAE.
- Enables flexible operation across diverse control modes (velocity, motion tracking, teleop) and generalizes robustly.
I, L → A (Image, Language → Actions)
- Website: pku-epic.github.io/NavFoM-Web
- Paper: Embodied Navigation Foundation Model
- Notes:
- Released Sep 2025.
- Cross-embodiment and cross-task navigation foundation model.
- Trained on 8 million navigation samples (quadrupeds, drones, wheeled robots, vehicles).
- Unified architecture handling diverse camera setups and temporal horizons.
I, L → A (Image, Language → Actions)
- Paper: FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies
- Notes:
- Released Sep 2025.
- Proposes Vision-Language-Flow (VLF) models to make generalist policies more efficient.
- Achieves 3x faster inference speed compared to diffusion-based VLAs.
- Demonstrates strong performance on CALVIN and real-world tasks.
I, L → A (Image, Language → Actions)
- Website: maniflow-policy.github.io
- Paper: ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training
- Notes:
- Released Sep 2025.
- Uses consistency-based flow matching for efficient action generation.
- Trained on large-scale open-source datasets (Open-X).
- Outperforms OpenVLA and other baselines in simulation and real-world experiments.
I, A → I', A' (Image, Actions → Future Images, Future Actions)
- Website: unigen-x.github.io/unifolm-world-model-action.github.io
- Code: unitreerobotics/unifolm-world-model-action
- Notes:
- Released Sep 2025.
- Unitree's open-source world-model-action architecture for general-purpose robot learning.
- Functions as both a Simulation Engine (generating synthetic data) and Policy Enhancement (predicting future interactions).
- Trained on Unitree's open-source datasets and fine-tuned on Open-X.
I, L → A (Image, Language → Actions)
- Paper: Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
- Code: Liang-ZX/DiscreteDiffusionVLA
- Notes:
- Released Aug 2025.
- Discretizes continuous action spaces and uses discrete diffusion for action decoding.
- Unified transformer framework compatible with standard VLM token interfaces.
- Achieves 96.3% success rate on LIBERO and outperforms continuous diffusion baselines.
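A minimal sketch of the action-discretization step that such token-based decoders rely on: map continuous actions into per-dimension bin indices and recover bin-centre values at decode time. Bin count and ranges are assumptions.

```python
import numpy as np

def discretize_actions(actions, low, high, num_bins=256):
    """Hypothetical sketch: map continuous actions to per-dimension bin indices (tokens)."""
    actions = np.clip(actions, low, high)
    norm = (actions - low) / (high - low)                     # normalize to [0, 1]
    return np.minimum((norm * num_bins).astype(int), num_bins - 1)

def undiscretize_actions(tokens, low, high, num_bins=256):
    """Recover bin-centre continuous actions from token indices."""
    return low + (tokens + 0.5) / num_bins * (high - low)
```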
I, L → A (Image, Language → Actions)
- Website: long-vla.github.io
- Paper: Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation
- Notes:
- Released Aug 2025.
- Addresses the limitation of current VLAs in long-horizon tasks.
- Incorporates a hierarchical planning mechanism within the VLA framework.
- Significantly improves success rates on multi-stage manipulation tasks.
I, L → A (Image, Language → Actions)
- Website: embodied-r1.github.io
- Paper: Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
- Code: pickxiguapi/Embodied-R1
- Notes:
- Released Aug 2025.
- 3B Vision-Language Model designed for embodied reasoning and "pointing".
- Uses "pointing" as a unified intermediate representation (similar concept to Molmo).
- Trained with Reinforced Fine-tuning (RFT) with multi-task reward design.
- Demonstrates robust zero-shot generalization (e.g., 56.2% success in SIMPLEREnv).
P, G → A (Proprioception, Goal → Actions)
- Website: amazon.science/blog/amazon-builds-first-foundation-model-for-multirobot-coordination
- Paper: DeepFleet: Multi-Agent Foundation Models for Mobile Robots
- Notes:
- Released Aug 2025.
- A suite of foundation models for coordinating large-scale mobile robot fleets.
- Trained on fleet movement data from hundreds of thousands of robots in Amazon warehouses.
- Explores four architectures, with Robot-Centric (RC) and Graph-Floor (GF) showing the most promise for scaling.
- Enables proactive planning to avoid congestion and deadlocks in complex multi-agent environments.
I, L → D, I_plan, A (Image, Language → Depth Tokens, Image-Space Plan, Actions)
- Website: allenai.org/blog/molmoact
- Paper: MolmoAct: Action Reasoning Models that can Reason in Space
- Weights: Hugging Face
- Notes:
- Released Aug 2025.
- A very interesting and large model with a unique reasoning process.
- It first estimates depth tokens, then plans a trajectory in the image space (independent of the robot's body), and finally generates the actions.
- Because the image trace can be modified by a user, the resulting actions are steerable.
I, L → A (Image, Language → Actions)
- Website: ricl-vla.github.io
- Paper: RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models
- Notes:
- Released Aug 2025.
- Enables VLA models to adapt to new tasks via in-context learning (ICL).
- Uses a retrieval-based mechanism to fetch relevant demonstrations.
- Avoids the need for expensive fine-tuning for every new task.
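A minimal sketch of the retrieval step behind in-context adaptation: score stored demonstrations by embedding similarity to the current observation and return the top-k as the in-context prefix. The embedding source and prompt format are assumptions.

```python
import numpy as np

def retrieve_context(query_embedding, demo_embeddings, demos, k=5):
    """Hypothetical sketch: cosine-similarity retrieval of the k most relevant
    demonstrations to condition a frozen VLA in context."""
    q = query_embedding / np.linalg.norm(query_embedding)
    d = demo_embeddings / np.linalg.norm(demo_embeddings, axis=1, keepdims=True)
    scores = d @ q                                 # cosine similarity per demo
    top = np.argsort(-scores)[:k]
    return [demos[i] for i in top]
```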
G, P → A (Goal/Objectives, Proprioception → Actions)
- Website: agilityrobotics.com/content/training-a-whole-body-control-foundation-model
- Notes:
- Released Aug 2025.
- A whole-body control foundation model trained purely in simulation (Isaac Sim).
- Uses a small LSTM (<1M params) to handle balance, locomotion, and disturbance recovery.
- Functions as a "motor cortex," taking end-effector objectives and handling the low-level dynamics.
I, L → A (Image, Language → Actions)
- Paper: InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
- Notes:
- Released Jul 2025.
- Two-stage pipeline: pretrains an action expert/latent interface, then instruction-tunes a VLM.
- Uses an MoE-adapted VLM to switch between textual reasoning and latent action generation.
- Focuses on preserving multimodal reasoning while adding precise manipulation capabilities.
I, P, L → A (Image, Proprioception, Language → Actions)
- Paper: GR-3 Technical Report
- Notes:
- Released Jul 2025.
- Trained on three diverse data types: internet-scale vision-language data, human hand tracking data, and robot trajectories.
- The architecture is a VLM + DiT, similar to other leading models.
- Employs compliance control during teleoperation, which is beneficial for contact-rich tasks.
- Showed that it can learn new tasks from only 10 human trajectory demonstrations.
I, P, L → A (Image, Proprioception, Language → Actions)
- Website: toyotaresearchinstitute.github.io/lbm1/
- Paper: A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation
- Notes:
- Released Jul 2025.
- Uses a Diffusion Transformer (DiT) with Image and Text Encoders.
- Demonstrated for complex bimanual manipulation tasks.
- Has been implemented on a Boston Dynamics humanoid robot.
I, L → A (Image, Language → Actions)
- Paper: Unified Vision-Language-Action Model
- Notes:
- Released Jun 2025.
- Autoregressively models vision, language, and actions as a single interleaved stream of discrete tokens.
- Incorporates world modeling during post-training to capture causal dynamics.
- Achieves strong results on CALVIN and LIBERO benchmarks.
I, L → A (Image, Language → Actions)
- Website: robomonkey-vla.github.io
- Paper: RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models
- Notes:
- Released Jun 2025.
- Focuses on test-time compute scaling for VLAs.
- Uses a learned verifier (value function) to sample and select the best actions during inference.
- Demonstrates that scaling test-time compute can rival training-time scaling.
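A minimal sketch of test-time sampling and verification: draw several candidate action chunks from the policy and keep the one the learned verifier scores highest. `policy_sample` and `verifier` are placeholder callables, not the paper's API.

```python
import numpy as np

def sample_and_verify(policy_sample, verifier, observation, instruction, num_samples=16):
    """Hypothetical sketch: best-of-N action selection with a learned value verifier."""
    candidates = [policy_sample(observation, instruction) for _ in range(num_samples)]
    scores = np.array([verifier(observation, instruction, a) for a in candidates])
    return candidates[int(scores.argmax())]
```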
I, L → A (Image, Language → Actions)
- Paper: ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models
- Notes:
- Released Jun 2025.
- Adapts pre-trained VLAs to new objects and tasks using few-shot learning.
- Employs object-centric representations via a ControlNet-style adapter to preserve pre-trained knowledge.
- Achieves efficient adaptation with minimal data.
Vid → S, A (Video → Embeddings, Actions)
- Website: ai.meta.com/vjepa/
- Paper: V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
- Code: facebookresearch/vjepa2
- Notes:
- Released Jun 2025.
- A spatially capable vision encoder trained entirely with self-supervision.
- Capable of next-state prediction, functioning as a world model.
- The V-JEPA 2-AC version is post-trained with an "action-conditioned probe" to generate robot actions.
S, M → A (State, Map → Trajectory)
- Website: waymo.com/research/scaling-laws-of-motion-forecasting-and-planning
- Paper: Scaling Laws of Motion Forecasting and Planning -- Technical Report
- Notes:
- Released Jun 2025.
- Demonstrates that motion forecasting and planning models follow scaling laws similar to LLMs.
- Trained on a massive dataset of 500,000 hours of driving data.
- Uses an encoder-decoder autoregressive transformer architecture.
- Shows that increasing compute and data predictably improves both open-loop and closed-loop performance.
I, L → Vid, A (Image, Language → Video, Actions)
- Paper: Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
- Notes:
- Released Jun 2025 (Presented at RSS 2025 as Best Systems Paper finalist).
- Unified platform collapsing robot sensing, policy learning, and evaluation into a single closed-loop video generative world model.
- Trained on ~3,000 hours of video-language paired data (AgiBot-World-Beta).
I, L → A (Image, Language → Actions)
- Paper: Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning
- Notes:
- Released Jun 2025.
- Dual-system VLA embedding a fast execution module (System 1) within a slow reasoning VLM (System 2).
- System 1 shares parameters with System 2 but operates at higher frequency.
- Uses a dual-aware co-training strategy to jointly fine-tune both systems.
- Addresses the trade-off between reasoning capability and execution speed.
I, T → A (Image, Tactile → Actions)
- Website: feel-the-force-ftf.github.io
- Paper: Feel the Force: Contact-Driven Learning from Humans
- Notes:
- Released Jun 2025.
- A robot learning system that models human tactile behavior to learn force-sensitive manipulation.
- Uses a tactile glove to collect human demonstrations with precise contact forces.
- Achieves robust force-aware control by continuously predicting the forces needed for manipulation.
I, L → A (Image, Language → Actions)
- Website: agentic-robot.github.io
- Paper: Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents
- Notes:
- Released May 2025.
- A brain-inspired framework that uses a Large Reasoning Model (LRM) to decompose tasks into subgoals (Standardized Action Procedure).
- Features a VLA executor for low-level control and a temporal verifier for error recovery.
- Achieves state-of-the-art performance on the LIBERO benchmark.
I, L → A (Image, Language → Actions)
- Website: pku-epic.github.io/TrackVLA-web
- Paper: TrackVLA: Embodied Visual Tracking in the Wild
- Notes:
- Released May 2025.
- Integrates visual tracking capabilities into a VLA architecture.
- Enables robots to track and interact with moving targets in dynamic environments.
- Trained on a diverse dataset of tracking scenarios.
I, L, M → A (Image, Language, Memory → Actions)
- Website: 3dllm-mem.github.io
- Paper: 3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model
- Notes:
- Released May 2025.
- Introduces a dynamic memory management system for Embodied 3D Large Language Models.
- Uses working memory tokens to selectively attend to episodic memory, enabling long-term spatial-temporal reasoning.
- Outperforms strong baselines by 16.5% on challenging in-the-wild embodied tasks (3DMem-Bench).
I, Vid → A (Image, Video → Actions)
- Website: kimhanjung.github.io/UniSkill
- Paper: UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations
- Notes:
- Released May 2025.
- Learns skill representations from large-scale human videos.
- Uses Inverse Skill Dynamics (ISD) to extract motion patterns and Forward Skill Dynamics (FSD) for future prediction.
- Transfers these skills to robot embodiments using a cross-embodiment interface.
- Enables learning from observing humans without explicit teleoperation data.
I, L → A (Image, Language → Actions)
- Paper: UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
- Notes:
- Released May 2025.
- Learns task-centric action representations from videos using a latent action model (within DINO feature space).
- Can leverage data from arbitrary embodiments and perspectives without explicit action labels.
- Allows deploying generalist policies to various robots via efficient latent action decoding.
I, L → A (Image, Language → Actions)
- Website: openhelix-robot.github.io
- Paper: OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation
- Notes:
- Released May 2025.
- Open-source Dual-System VLA (Vision-Language-Action) model.
- Provides systematic empirical evaluations on dual-system architectures (System 1 for fast execution, System 2 for reasoning).
- Highlights a "prompt tuning" paradigm: adding a new `<ACT>` token and only training the `lm-head` preserves generalization.
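A minimal sketch of that prompt-tuning recipe, assuming a Hugging Face-style causal VLM; the checkpoint name is a placeholder, the head's parameter name varies by model, and in practice the new token's input embedding would typically also stay trainable.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some/vlm-backbone")   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("some/vlm-backbone")

# Add the new action token and grow the embedding matrix accordingly.
tokenizer.add_tokens(["<ACT>"])
model.resize_token_embeddings(len(tokenizer))

# Freeze everything except the language-model head.
for name, param in model.named_parameters():
    param.requires_grad = "lm_head" in name
```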
I, L → A (Image, Language → Actions)
- Paper: RobotxR1: Enabling Embodied Robotic Intelligence on Large Language Models through Closed-Loop Reinforcement Learning
- Notes:
- Released May 2025.
- Extends R1-Zero approach to robotics via closed-loop Reinforcement Learning.
- Enables small-scale LLMs (e.g., Qwen2.5-3B) to achieve effective reasoning and control.
- Demonstrated on autonomous driving tasks.
I, L → A (Image, Language → Actions)
- Website: pku-epic.github.io/GraspVLA-web
- Paper: GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
- Notes:
- Released May 2025.
- A specialist foundation model for grasping.
- Pre-trained on a massive synthetic dataset (billion-scale) of grasping actions.
- Demonstrates zero-shot transfer to real-world objects.
I, P, A → I', S (Image, Proprioception, Actions → Future Images, Uncertainty)
- Website: sites.google.com/view/uncertainty-aware-rwm
- Paper: Uncertainty-Aware Robotic World Model Makes Offline Model-Based Reinforcement Learning Work on Real Robots
- Code: leggedrobotics/robotic_world_model_lite
- Notes:
- Released Apr 2025.
- Extends Robotic World Model (RWM) with ensemble-based epistemic uncertainty estimation.
- Enables fully offline model-based reinforcement learning (MBRL) on real robots by penalizing high-risk imagined transitions (MOPO-PPO).
- Evaluated on real quadruped and humanoid robots for manipulation and locomotion.
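A minimal sketch of the ensemble-disagreement penalty used in this style of offline MBRL: roll an ensemble of dynamics models one step, treat their disagreement as epistemic uncertainty, and subtract it from the imagined reward. The penalty form is a MOPO-style assumption, not the paper's exact rule.

```python
import torch

def uncertainty_penalized_reward(ensemble, state, action, reward_fn, penalty_coef=1.0):
    """Hypothetical sketch: penalize imagined transitions by ensemble disagreement."""
    with torch.no_grad():
        preds = torch.stack([m(state, action) for m in ensemble])   # (E, batch, state_dim)
    mean_pred = preds.mean(dim=0)
    disagreement = preds.std(dim=0).mean(dim=-1)                    # epistemic proxy
    reward = reward_fn(mean_pred)
    return reward - penalty_coef * disagreement, mean_pred
```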
I, P, L → A (Image, Proprioception, Language → Actions)
- Website: developer.nvidia.com/isaac/gr00t
- Paper: GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
- Code: NVIDIA/Isaac-GR00T
- Notes:
- Released Mar 2025.
- Combines a Vision-Language Model (VLM) with a Diffusion Transformer (DiT).
- Personal Note: A very nice codebase that is highly compatible with `lerobot`. I found the client/server inference utilities quite handy to experiment with.
I, L → A (Image, Language → Actions)
- Website: hybrid-vla.github.io
- Paper: HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
- Code: PKU-HMI-Lab/Hybrid-VLA
- Notes:
- Released Mar 2025.
- Unified framework integrating autoregressive reasoning and diffusion-based action prediction.
- Uses a collaborative action ensemble mechanism to fuse predictions from both paradigms.
- Outperforms previous SOTA VLA methods by 14% in simulation and 19% in real-world tasks.
I, Vid, L → A (Image, Video, Language → Actions)
- Website: microsoft.github.io/Magma
- Paper: Magma: A Foundation Model for Multimodal AI Agents
- Code: microsoft/Magma
- Notes:
- Released Feb 2025.
- Multimodal foundation model for agentic tasks in digital and physical worlds.
- Uses Set-of-Mark (SoM) for action grounding and Trace-of-Mark (ToM) for action planning.
- State-of-the-art on UI navigation and robotic manipulation.
I, L → A (Image, Language → Actions)
- Website: dex-vla.github.io
- Paper: DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
- Notes:
- Released Feb 2025.
- Combines a VLM for high-level reasoning with a diffusion expert for low-level control.
- The diffusion expert is "plug-in", allowing modular upgrades.
- Focused on dexterous manipulation tasks.
Vid, L, A → Vid, L (Video, Language, Control → Video, Language/Reasoning)
- Website: nvidia.com/en-us/ai/cosmos
- Paper: Cosmos World Foundation Model Platform for Physical AI
- Code: nvidia-cosmos
- Notes:
- Released Jan 2025.
- A comprehensive world foundation model platform for Physical AI.
- Includes `cosmos-predict` (video generation), `cosmos-transfer` (control-to-video), and `cosmos-reason` (reasoning VLM).
- Models are open-weight and designed for robotics and autonomous vehicle simulation.
I, P, L → A (Image, Proprioception, Language → Actions)
- Website: physicalintelligence.company/blog/pi05
- Notes:
- Released Early 2025.
- An evolution of π0 focused on open-world generalization.
- Capable of controlling mobile manipulators to perform tasks in entirely unseen environments like kitchens and bedrooms.
I, Vid, L → A (Image, Video, Language → Actions)
- Website: smolvla.net
- Blog: huggingface.co/blog/smolvla
- Notes:
- Released Early 2025.
- A compact (~450M parameter) Vision-Language-Action model designed for efficiency.
- Optimized for running on consumer-grade GPUs and edge devices.
- Trained on the LeRobot community datasets.
- Personal Note: Tried to train this and successfully ran inference on a consumer GPU. Very fast and lightweight.
I, L → Vid (Image, Language → Interactive World Video)
- Website: deepmind.google/blog/genie-3-a-new-frontier-for-world-models/
- Notes:
- Released Early 2025.
- A general-purpose world model capable of generating interactive environments at 24fps.
- Used to train embodied agents (like SIMA) in rich, simulated worlds.
- Maintains environmental consistency over long horizons (minutes) and allows promptable world events.
I, L → A (Image, Language → Actions)
- Website: rdt-robotics.github.io/rdt2/
- Code: thu-ml/RDT2
- Weights: Hugging Face
- Notes:
- Released Early 2025.
- The sequel to RDT-1B, designed for zero-shot cross-embodiment generalization.
- RDT2-VQ: A 7B VLA adapted from Qwen2.5-VL-7B, using Residual VQ for action tokenization.
- RDT2-FM: Uses a Flow-Matching action expert for lower latency control.
- Trained on 10,000+ hours of human manipulation videos across 100+ scenes (UMI data).
I, L, F → A (Image, Language, Force → Actions)
- Paper: Embodied large language models enable robots to complete complex tasks in unpredictable environments
- Code: ruaridhmon/ELLMER
- Notes:
- Released Early 2025.
- Embodied Large-Language-Model-Enabled Robot framework.
- Uses GPT-4 and Retrieval-Augmented Generation (RAG) to extract relevant code examples from a knowledge base.
- Generates action plans that incorporate real-time force and visual feedback to adapt to unpredictable environments.
- Enables robots to complete long-horizon tasks like coffee making.
I, G → A (Image, Goal → Actions)
- Website: kylestach.github.io/lifelong-nav-rl
- Paper: Lifelong Autonomous Improvement of Navigation Foundation Models in the Wild
- Code: kylestach/lifelong-nav-rl
- Weights: Hugging Face
- Notes:
- Released Early 2025.
- The first navigation foundation model capable of autonomous fine-tuning in the wild.
- Combines offline RL pretraining with online RL for continuous improvement.
- Robust to new environments and embodiments.
I, A → I' (Image, Actions → Predicted Image)
- Paper: Latent Action Robot Foundation World Models for Cross-Embodiment Adaptation
- Notes:
- Released Early 2025.
- Learns a unified latent action space to handle diverse robot embodiments.
- Achieves significant performance improvements (up to 46.7%) over models with explicit motion labels.
- Enables efficient cross-embodiment learning and generalization.
I, P → A (Image, Proprioception → Actions)
- Paper: Lift3D Policy: Lifting 2D Foundation Models for Robust 3D Robotic Manipulation
- Code: PKU-HMI-Lab/LIFT3D
- Notes:
- Released Early 2025.
- Lifts 2D foundation models to construct robust 3D manipulation policies.
- Uses a task-aware masked autoencoder to enhance implicit 3D representations.
- Establishes positional mapping between 3D points and 2D model embeddings.
I, L, D → A (Image, Language, Depth → Actions)
- Paper: 3DS-VLA: A 3D Spatial-Aware Vision Language Action Model for Robust Multi-Task Manipulation
- Notes:
- Released Early 2025.
- Enhances 2D VLAs with explicit 3D spatial awareness.
- Uses a 2D-to-3D positional alignment mechanism to encode spatial observations.
- Outperforms state-of-the-art 2D and 3D policies on RLBench and real-world tasks.
I, L → A (Image, Language → Actions)
- Website: dyna.co
- Notes:
- Released Early 2025.
- Production-ready foundation model built for autonomy at scale.
- Achieved >99% success rate in 24-hour non-stop operation.
- Deployed in commercial settings like hotels and gyms.
I, P, L → A (Image, Proprioception, Language → Actions)
- Website: physicalintelligence.company/blog/pi0
- Paper: π0: A vision-language-action flow model for general robot control
- Code: Physical-Intelligence/openpi
- Weights: Hugging Face
- Notes:
- Released Oct 2024.
- Showcased in incredible bimanual and mobile robot demonstrations.
- Architecture consists of a pretrained Vision-Language Model (VLM) combined with an action expert.
- The pretrained VLM used is Paligemma.
- Personal Note: I attempted to train this locally on my hardware but couldn't get the flow-matching loss to converge properly. However, it showcased incredible dexterous zero-shot capabilities out-of-the-box.
I, L → A (Image, Language → Actions)
- Website: openvla.github.io
- Paper: OpenVLA: An Open-Source Vision-Language-Action Model
- Code: openvla/openvla
- Weights: Hugging Face
- Notes:
- Released Jun 2024.
- Considered a fundamental work in open-source Vision-Language-Action models.
- Built with a Llama transformer backbone. Uses SigLIP + DINO for its vision component.
- Personal Note: I fine-tuned this 7B model locally. It was very VRAM-hungry but provided an excellent baseline!
I, P, G → A (Image, Proprioception, Goal Image → Actions)
- Website: deepmind.google/discover/blog/robocat-a-self-improving-robotic-agent/
- Paper: RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
- Notes:
- Released Jun 2023.
- A multi-task, multi-embodiment generalist agent based on a decision transformer architecture (Gato).
- Demonstrates a self-improvement loop: a trained model is fine-tuned for a new task, generates more data for that task, and this new data is used to train the next, more capable version of the generalist agent.
- Can adapt to new tasks, objects, and even entirely new robot embodiments (e.g., KUKA arm) with only 100-1000 demonstration examples.
- Website: kinder-site
- Paper: KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
- Code: Princeton-Robot-Planning-and-Learning/kindergarden
- Notes:
- Released Apr 2026.
- A physical reasoning benchmark for robot learning and planning.
- Comprises 25 procedurally generated environments testing spatial relations, nonprehensile multi-object manipulation, tool use, combinatorial geometric constraints, and dynamic constraints.
- Includes a Gymnasium-compatible Python library with parameterized skills and demonstrations.
- Provides a standardized evaluation suite with 13 baselines spanning TAMP, imitation learning, RL, and foundation-model-based approaches.
- Paper: Rethinking Video Generation Model for the Embodied World
- Notes:
- Released Jan 2026.
- Introduces RBench, a comprehensive robotics benchmark for video generation.
- Presents RoVid-X, a large-scale high-quality robotic dataset for training video generation models.
- Evaluation results on 25 video models show high agreement with human assessments.
- Website: utiasDSL.github.io/crisp_controllers
- Paper: CRISP -- Compliant ROS2 Controllers for Learning-Based Manipulation Policies and Teleoperation
- Notes:
- Released Sep 2025.
- A lightweight C++ implementation of compliant Cartesian and joint-space controllers for the ROS2 control standard.
- Designed for seamless integration with high-level learning-based policies as well as teleoperation.
- Website: constrained-robot-fms.github.io
- Paper: Constrained Decoding for Robotics Foundation Models
- Notes:
- Released Sep 2025.
- A constrained decoding framework for autoregressive robot foundation models.
- Enforces task-specific safety rules (Signal Temporal Logic) at inference time without retraining.
- Compatible with state-of-the-art policies like SPOC and PoliFormer.
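A minimal sketch of the constrained-decoding idea: before sampling the next action token, mask every candidate whose resulting state would violate the safety specification. The `is_safe` callable stands in for a Signal Temporal Logic monitor; names and shapes are assumptions.

```python
import torch

def constrained_decode_step(logits, candidate_tokens, state, is_safe):
    """Hypothetical sketch: mask unsafe next tokens to -inf, then sample as usual."""
    mask = torch.tensor([0.0 if is_safe(state, t) else float("-inf")
                         for t in candidate_tokens])
    return torch.distributions.Categorical(logits=logits + mask).sample()
```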
- Paper: Risk-Guided Diffusion: Toward Deploying Robot Foundation Models in Space, Where Failure Is Not An Option
- Notes:
- Released Jun 2025.
- Proposes a risk-guided diffusion framework fusing a fast "System-1" with a slow, physics-based "System-2".
- Addresses safety for deploying foundation models in space exploration.
- Reduces failure rates by up to 4x while matching goal-reaching performance.
- Paper: Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning
- Code: pairlab/Adapt3R
- Notes:
- Released Mar 2025.
- Focuses on RGB-D based, viewpoint-invariant learning for imitation across domain gaps.
- Provides a well-presented analysis of the limitations of current methods.
- Paper: Towards Safe Robot Foundation Models
- Notes:
- Released Mar 2025.
- Introduces a safety layer to constrain the action space of any generalist policy.
- Uses ATACOM, a safe reinforcement learning algorithm, to create a safe action space and ensure safe state transitions.
- Facilitates deployment in safety-critical scenarios without requiring specific safety fine-tuning.
- Demonstrated effectiveness in avoiding collisions in dynamic environments (e.g., air hockey).
- Link: Substack Post by Chris Paxton
- Notes:
- A general overview of VLAs in the real world, with an excellent section on common failures.
- Full of great insights and references.
- Link: YouTube Video by Dieter Fox
- Notes:
- This talk exists in many video forms; it's best to find the most recent version.
- Focuses on the current state of robotics models and what is needed to achieve LLM-level general intelligence in robots.