A Systematic Review on Multimodal Language Models and Spatial Intelligence for Human-Robot Collaboration

🚀 Empowering Natural Human-Robot Collaboration through Multimodal Language Models and Spatial Intelligence

📌 Title: Empowering Natural Human-Robot Collaboration through Multimodal Language Models and Spatial Intelligence: Pathways and Perspectives
🧠 Authors: Duidi Wu, Pai Zheng, Qianyou Zhao, Shuo Zhang, Jin Qi*, Jie Hu*, Guo-Niu Zhu, Lihui Wang
🏫 Affiliations: SJTU, PolyU, FDU, KTH
📄 PDF
📝 Journal: Robotics and Computer-Integrated Manufacturing (RCIM)
📮 Contact: [email protected]


🌟 Overview

This is the first systematic review that integrates:

  • 🤝 Human-Robot Collaboration (HRC)
  • 🧠 Embodied Intelligence
  • 🌐 Multimodal Large Language Models (MLLMs)
  • 🗺️ Spatial Intelligence

We explore how MLLMs + Embodiment can empower robots to see, think, and act like humans in open, dynamic environments — enabling seamless and proactive HRC.

HRC + MLLM architecture diagram

🔍 Motivation

Why now?
Industry 5.0 calls for human-centric smart manufacturing. With the rise of MLLMs (like GPT-4V, Gemini, LLaVA), we have a unique opportunity to:

  • Bridge the gap between human intent and robot execution.
  • Enable spatially-aware, low-cost, multi-skill learning.
  • Move beyond “cooperation” to collaboration and coevolution.

❓ Research Questions (RQs)

  1. RQ1: How can MLLMs and embodiment improve seamless HRC?
  2. RQ2: How can spatial skills be trained efficiently?
  3. RQ3: What are the remaining challenges and future trends?

🌈 Highlights

  • 📚 200+ recent works reviewed
  • 🧩 Unified perspective for HRC × Embodied AI × Spatial Intelligence
  • 🚀 Open challenges and design pathways for future human-centered systems

🧭 Content Roadmap

Algorithm axis (content roadmap diagram)

🔁 1. Perception–Cognition–Actuation Loop

  • Visual, language, and motor signals are fused for complete situational awareness.
  • From human intention recognition → task reasoning → physical execution (see the sketch below).
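
To make the loop concrete, here is a minimal Python sketch; the `Observation` container and the `perceive` / `reason` / `act` stubs are illustrative placeholders, not components of any surveyed method.

```python
# Minimal sketch of a perception-cognition-actuation loop for HRC.
# All components are hypothetical placeholders, not APIs from the surveyed papers.
from dataclasses import dataclass

@dataclass
class Observation:
    rgb: list            # camera image(s)
    instruction: str     # natural-language input from the human
    proprioception: list # joint positions / gripper state

def perceive(obs: Observation) -> dict:
    """Perception: fuse visual, language, and motor signals into a scene state."""
    return {"objects": ["cup", "table"], "intent": obs.instruction}

def reason(state: dict) -> list:
    """Cognition: decompose the inferred human intent into subtasks."""
    return [f"locate {state['objects'][0]}", "grasp", "hand over to human"]

def act(subtask: str) -> None:
    """Actuation: map each subtask to low-level robot commands."""
    print(f"executing: {subtask}")

obs = Observation(rgb=[], instruction="please pass me the cup", proprioception=[0.0] * 7)
for step in reason(perceive(obs)):
    act(step)
```

In a full system, each stage would be backed by a vision foundation model, an MLLM planner, and a low-level controller, respectively.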

📊 1.1 Robot Affordance and Value Learning

| Category | Method | VFM | LLM/VLM | Benchmark/Data | Tasks | Links |
| --- | --- | --- | --- | --- | --- | --- |
| Skill Affordance | CoPa | Owl-ViT, SAM | GPT-4V | VoxPoser | Everyday manipulation tasks | 📄 Paper 💻 Code |
| | CLIPort | Transporter | CLIP | Ravens | Language-conditioned tasks | 📄 Paper 💻 Code |
| | SayCan | - | 540B PaLM | Everyday Robots | Long-horizon tasks | 📄 Paper 💻 Code |
| | Voltron | ViT | DistilBERT | Franka Kitchen | 5 robotics applications | 📄 Paper 💻 Code |
| Keypoint Affordance | MOKA | GroundedSAM | GPT-4V | Octo, VoxPoser | Table-top manipulation, unseen objects | 📄 Paper 💻 Code |
| | ReKep | DINOv2, SAM | GPT-4o | VoxPoser | In-the-wild bimanual manipulation | 📄 Paper 💻 Code |
| | KALIE | | CogVLM, GPT-4V | MOKA, VoxPoser | Diverse unseen objects | 📄 Paper 💻 Code |
| Spatial Affordance | VoxPoser | OWL-ViT, SAM | GPT-4 | RLBench | Manipulation tasks | 📄 Paper 💻 Code |
| | RAM | DINOv2 / CLIP | Text-embedding-3, GPT-4V | DROID | 3D contact planning | 📄 Paper 💻 Code |
| | RoboPoint | CLIP, ViT-L/14 | Vicuna-13B | WHERE2PLACE | Language-conditioned 3D actions | 📄 Paper 💻 Code |
| Human Affordance | HRP | DINO, CLIP | - | Ego4D | Human-hand-object interaction | 📄 Paper 💻 Code |
| | HULC++ | - | GPT-3, MiniLM-L3-v2 | CALVIN | Long-horizon manipulation | 📄 Paper 💻 Code |
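
To illustrate the spatial-affordance idea behind methods like VoxPoser (a value map over the workspace that a planner can follow), here is a toy NumPy sketch; the grid size, Gaussian shaping, and weights are assumptions for illustration, not any published implementation.

```python
# Toy 3D value map: high value near a language-specified target,
# low value near an obstacle. Grid resolution and shaping are assumptions.
import numpy as np

GRID = 32
xs = np.linspace(0.0, 1.0, GRID)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")

def value_map(target, obstacle, sigma=0.1):
    """Compose an attraction term (target) and a repulsion term (obstacle)."""
    attract = np.exp(-((X - target[0])**2 + (Y - target[1])**2 + (Z - target[2])**2) / (2 * sigma**2))
    repel = np.exp(-((X - obstacle[0])**2 + (Y - obstacle[1])**2 + (Z - obstacle[2])**2) / (2 * sigma**2))
    return attract - 2.0 * repel

V = value_map(target=(0.8, 0.5, 0.2), obstacle=(0.5, 0.5, 0.2))
best = np.unravel_index(np.argmax(V), V.shape)
print("highest-value voxel (grid indices):", best)
```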

📊 1.2 High-level Step-by-step Task Planning and Executable Code Generation

| Category | Method | VFM | LLM/VLM | Benchmark | Robot | Tasks | Links |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Subtask Planning | PaLM-E | - | PaLM | Language-Table | Everyday Robot | Visually-grounded dialogue | 📄 Paper 💻 Code |
| | Pg-vlm | OWL-ViT | GPT-4, PG-InstructBLIP | PHYSOBJECTS | Franka Panda | Table-top manipulation | 📄 Paper 💻 Code |
| | ViLA | OWL-ViT | Llama2-70B, GPT-4V | Ravens | Franka Panda | Long-horizon planning | 📄 Paper 💻 Code |
| | SayCan | ViLD | 540B PaLM | Everyday Robots | Everyday Robot | Long-horizon tasks | 📄 Paper 💻 Code |
| | GD | OWL-ViT | InstructGPT, PaLM | Ravens, CLIPort | Everyday Robot | Rearrangement, mobile manipulation | 📄 Paper 💻 Code |
| | Text2Motion | - | Text-davinci-003 | TableEnv | - | Long-horizon manipulation | 📄 Paper 💻 Code |
| Code Generation | Instruct2Act | SAM | Text-davinci-003 | VIMABench | - | Manipulation & reasoning | 📄 Paper 💻 Code |
| | Inner Monologue | MDETR | InstructGPT | Ravens, CLIPort | UR5e, ERobot | Mobile rearrangement | 📄 Paper 💻 Code |
| | CaP | ViLD, MDETR | GPT-3, Codex | HumanEval | UR5e | Table-top & mobile manipulation | 📄 Paper 💻 Code |
| | ProgPrompt | ViLD | GPT-3 | Virtual Home | Panda | Household table-top tasks | 📄 Paper 💻 Code |
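
The code-generation entries above share a common pattern: expose robot skills as a small API and prompt an LLM to compose them into executable code. Below is a minimal sketch of that pattern; `call_llm`, the prompt, and the `pick`/`place` primitives are hypothetical placeholders rather than any specific paper's interface.

```python
# Sketch of LLM-driven task planning via code generation.
# `call_llm` is a stub for any chat-completion API; the skill primitives
# (pick, place) and the prompt are illustrative assumptions.
SKILL_API = """
def pick(obj: str): ...
def place(obj: str, location: str): ...
"""

def call_llm(prompt: str) -> str:
    # Stub: in practice this queries GPT-4(V), PaLM, etc.
    return "pick('screwdriver')\nplace('screwdriver', 'toolbox')"

def plan_to_code(instruction: str) -> str:
    prompt = (
        "You control a robot through the following Python skills:\n"
        f"{SKILL_API}\n"
        f"Write code that accomplishes: {instruction}\n"
    )
    return call_llm(prompt)

print(plan_to_code("tidy the screwdriver into the toolbox"))
```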

📊 1.3 Robot Learning from Demonstration

| Model | Structure | Problem | Benchmark | Input | Output | Links |
| --- | --- | --- | --- | --- | --- | --- |
| SeeDo | SAM2 + GPT-4o | Code generation | CaP | Human demo videos | Executable code | 📄 Paper 💻 Code |
| OKAMI | GPT-4V + SLAHMR | Humanoid manipulation | ORION | Human video | Manipulation policy | 📄 Paper 💻 Code |
| R3M | ResNet50 + DistilBERT | Visual representation | Ego4D | Image, proprioception | Action vector | 📄 Paper 💻 Code |
| R+X | DINO + Gemini | Skill retrieval | R3M | RGB-D observation | 6-DoF action | 📄 Paper 💻 Code |
| RT-Trajectory | PaLM-E | Trajectory generalization | RT-1 | Drawings, videos | Trajectory tokens | 📄 Paper 💻 Code |
| Gen2Act | Gemini + VideoPoet | Behavior cloning | Vid2robot | Instruction, observation | Trajectory | 📄 Paper 💻 Code |
| EgoMimic | ACT-based | End-to-end imitation | ACT | Hand pose, proprioception | SE(3) pose prediction | 📄 Paper 💻 Code |

📊 1.4 Language-Conditioned Behavior Cloning

| Method | Policy Type | Input State | Action Output | Core Structure | Links |
| --- | --- | --- | --- | --- | --- |
| PlayLMP | GCBC | Observation, proprioception | 8-DoF action | Seq2Seq CVAE | 📄 Paper 💻 Code |
| MCIL | GCBC | Observation + instruction | 8-DoF action | TransferLangLfP | 📄 Paper 💻 Code |
| BC-Z | End-to-end BC | Image + task embedding | 7-DoF action | ResNet18 + FiLM + FC | 📄 Paper 💻 Code |
| Language Table | LCBC | Language instruction | 2D point | LAVA | 📄 Paper 💻 Code |
| CALVIN | LH-MTLC | Multi-modal input | Cartesian or joint | Seq2Seq CVAE | 📄 Paper 💻 Code |
| HULC | LCBC | Static image + language | 7-DoF action | Seq2Seq CVAE | 📄 Paper 💻 Code |
| HULC++ | LCBC | Static image + language | 7-DoF action | HULC + VAPO | 📄 Paper 💻 Code |
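
A minimal PyTorch sketch of the language-conditioned behavior cloning objective shared by these methods (regress demonstrated actions from observation and instruction features); the feature dimensions and the MLP policy are placeholder assumptions, real systems use pretrained image and language encoders.

```python
# Minimal language-conditioned behavior cloning step in PyTorch.
# Encoders and dimensions are placeholders for illustration only.
import torch
import torch.nn as nn

OBS_DIM, LANG_DIM, ACT_DIM = 64, 32, 7   # assumed feature sizes

policy = nn.Sequential(
    nn.Linear(OBS_DIM + LANG_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, ACT_DIM),             # e.g. a 7-DoF end-effector action
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_step(obs_feat, lang_feat, expert_action):
    """One imitation step: regress the demonstrated action."""
    pred = policy(torch.cat([obs_feat, lang_feat], dim=-1))
    loss = nn.functional.mse_loss(pred, expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch standing in for (image, instruction, action) demonstrations.
loss = bc_step(torch.randn(8, OBS_DIM), torch.randn(8, LANG_DIM), torch.randn(8, ACT_DIM))
print(f"BC loss: {loss:.4f}")
```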

📊 1.5 Language-Enhanced Reinforcement Learning

| Method | Policy Type | Challenge | MLLM | Role | Environment | Links |
| --- | --- | --- | --- | --- | --- | --- |
| Di Palo | BC | Sparse reward | FLAN-T5, CLIP | Subgoal generation | MuJoCo | 📄 Paper 💻 Code |
| L2R | MJPC | Reward optimization | GPT-4 | Reward function design | MuJoCo | 📄 Paper 💻 Code |
| VLM-RM | DQN, SAC | Zero-shot rewards | CLIP | Reward computation | - | 📄 Paper 💻 Code |
| Song et al. | PPO | Self-refinement | GPT-4 | Reward designer | Isaac Sim | 📄 Paper 💻 Code |
| Eureka | PPO | Human-level reward | GPT-4 | Zero-shot reward | Isaac Gym | 📄 Paper 💻 Code |
| LIV | BC | Goal-conditioned reward | CLIP | Multimodal value learning | MetaWorld | 📄 Paper 💻 Code |
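
As a sketch of the "MLLM as reward" idea (e.g., the CLIP-based reward computation listed above), the snippet below scores an observation against a language goal with CLIP similarity; the Hugging Face checkpoint and the raw-similarity reward are assumptions, not a faithful reproduction of any listed method.

```python
# Sketch of a VLM-derived reward: cosine similarity between the current
# camera frame and a language goal description.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def language_reward(frame: Image.Image, goal: str) -> float:
    """Higher when the observation looks like the described goal state."""
    inputs = processor(text=[goal], images=frame, return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# Example: reward an RL agent for reaching "the drawer is open".
frame = Image.new("RGB", (224, 224))
print(language_reward(frame, "the drawer is open"))
```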

📊 1.6 Language-Guided Diffusion Policies

| Model | Structure | Problem | Input | Output | Robot | Links |
| --- | --- | --- | --- | --- | --- | --- |
| Diffusion Policy | DDPM | Action generation | Observation, proprioception | Action sequence | UR5, Panda | 📄 Paper 💻 Code |
| 3DDA | CLIP + 3D Diffuser | 3D conditional planning | Instruction + 3D scene | Trajectory | Franka | 📄 Paper 💻 Code |
| PoCo | Diffusion Policy | Heterogeneous policy | RGB, pointcloud, language | Trajectory | Franka | 📄 Paper 💻 Code |
| MDT | CLIP + Voltron + Perceiver | Core diffusion policy | Observation + goal | Action chunk | Franka | 📄 Paper 💻 Code |
| Octo | T5-base | Action chunk diffusion | Observation + instruction | Action chunk | 9 robots | 📄 Paper 💻 Code |
| RDT-1B | SigLIP + T5-XXL | Scaled policy learning | Visuo-lingo-motor data | Denoised action chunk | ALOHA robot | 📄 Paper 💻 Code |
| 𝜋0 | PaliGemma VLM | Diffusion policy | Visuo-lingo-motor data | Consecutive action chunk | 7 robots | 📄 Paper 💻 Code |
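
For intuition on how these policies act, here is a toy DDPM-style reverse-diffusion sampler that denoises an action chunk conditioned on an observation embedding; the network, noise schedule, and dimensions are illustrative stand-ins, not a real diffusion policy implementation.

```python
# Toy reverse-diffusion sampling of an action chunk conditioned on an
# observation embedding. All sizes and the denoiser are placeholders.
import torch
import torch.nn as nn

HORIZON, ACT_DIM, OBS_DIM, STEPS = 16, 7, 64, 50
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

denoiser = nn.Sequential(  # predicts the noise added to the action sequence
    nn.Linear(HORIZON * ACT_DIM + OBS_DIM + 1, 256),
    nn.ReLU(),
    nn.Linear(256, HORIZON * ACT_DIM),
)

@torch.no_grad()
def sample_actions(obs_emb: torch.Tensor) -> torch.Tensor:
    x = torch.randn(1, HORIZON * ACT_DIM)          # start from pure noise
    for t in reversed(range(STEPS)):
        t_feat = torch.full((1, 1), float(t) / STEPS)
        eps = denoiser(torch.cat([x, obs_emb, t_feat], dim=-1))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])   # DDPM posterior mean
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x.view(HORIZON, ACT_DIM)                 # an action chunk

actions = sample_actions(torch.randn(1, OBS_DIM))
print(actions.shape)  # torch.Size([16, 7])
```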

🚀 2. Advancing Visuo-lingo-motor Fusion

fusion

  • Visuo/Motor:
    "vision → action" : $\pi(a \mid o)$ — the robot outputs actions from visual observation alone, as in Diffusion Policy.
    "state → action" : $\pi(a \mid s)$ — for example, Decision Transformer predicts the action at the next time step from states.

  • Visuo-Motor:
    "vision + state → action" : $\pi(a \mid o, s)$ — observations and proprioception (e.g., joint positions) are integrated to output actions, as in ACT.

  • Visuo-Lingo:
    "vision + language → action" : $\pi(a \mid o, l)$ — also known as a language-conditioned visuomotor policy, this paradigm predicts actions from observations and instructions.
    Most VLAs follow this structure, such as OpenVLA and RT-2.

  • Visuo-Lingo-Motor:
    "vision + state + language → action" : $\pi(a \mid s, o, l)$ — robots holistically integrate visual, linguistic, and physical inputs, as in 𝜋0, Octo, and GenSim2 (a type-level sketch of these four interfaces follows below).

🧠 3. Pathway for General Intelligence

  • This pathway toward human-like intelligence spans from imitation and reinforcement learning, to out-of-the-box or instruction-tuned vision-language-action models (VLAs), and further to diffusion policies and world models, paving the way toward general embodied intelligence.

Pathway flowchart

| Category | Method | VFM / LLM / VLM | Benchmark/Data | Input / Tasks / Output | Links |
| --- | --- | --- | --- | --- | --- |
| Robotic Transformers (RT) | RT-1 | FiLM EfficientNet | RT-1 | Observation and instructions; Output: 11D actions | 📄 Paper 💻 Code |
| | RT-2 | ViT, PaLM-E, PaLI-X | RT-1 | Observation and instructions; Output: 7D action tokens | 📄 Paper 💻 Code |
| | MOO | Owl-ViT, FiLM EfficientNet | RT-1 | Images and language instructions; Output: 7D action tokens | 📄 Paper 💻 Code |
| | Q-Transformer | FiLM EfficientNet | Manual dataset | Observation and instructions; Output: Q-value of action | 📄 Paper 💻 Code |
| | RT-H | ViT, PaLI-X | Kitchen dataset | Image and task tokens, action query; Output: Action token | 📄 Paper 💻 Code |
| Vision-Language-Action (VLA) | Bi-VLA | Qwen-VL | - | Observation and user request; Output: Executable code | 📄 Paper 💻 Code |
| | OpenVLA | SigLIP, DinoV2, Prismatic-7B | OXE, BridgeData V2 | Observation and instructions; Output: 7D action tokens | 📄 Paper 💻 Code |
| | TinyVLA | Pythia | MetaWorld | Observation and instructions; Output: 6D action | 📄 Paper 💻 Code |
| | LLaRA | GPT-4, LLaVA-1.5-7B | VIMA, inBC, D-inBC | Observation, task, and previous actions; Output: Textual actions | 📄 Paper 💻 Code |
| | RoboPoint | CLIP, Vicuna-v1.5 | WHERE2PLACE | Observation and instructions; Output: 3D action points | 📄 Paper 💻 Code |
| | RoboFlamingo | LLaMA, GPT-4, OpenFlamingo | CALVIN | Task and 2 camera views; Output: 7D action tokens | 📄 Paper 💻 Code |
| | RoboUniView | ViT, UVFormer | CALVIN | Task and multi-camera views; Output: 7D action tokens | 📄 Paper 💻 Code |
| | RoboMamba | CLIP, Mamba | LLaVA 1.5, RoboVQA | Image and language question; Output: 6-DoF EEF poses | 📄 Paper 💻 Code |
| Out-of-box Usage | CoPa | Owl-ViT, SAM, GPT-4V | Real-world data | Observation and instructions; Output: 6-DoF end-effector poses | 📄 Paper 💻 Code |
| | VoxPoser | Owl-ViT, SAM, GPT-4 | RLBench | Observation and instructions; Output: Sequence of 6-DoF waypoints | 📄 Paper 💻 Code |
| | ReKep | DINOv2, SAM, GPT-4o | VoxPoser | Observation and instructions; Output: Sequence of 6-DoF poses | 📄 Paper 💻 Code |
| | MA | GPT-4V, Qwen-VL | RLBench | Task goal and multi-view images; Output: 6-DoF EEF poses | 📄 Paper 💻 Code |
| | Open6DOR | GroundedSAM, GPT-4V | Synthetic dataset | Observation and instructions; Output: Robot motion trajectory | 📄 Paper 💻 Code |
| World Model | 3D-VLA | Flan-T5XL, BLIP2 | OXE, RH20T | Interaction token with 3D scene; Output: Image, pointcloud, action | 📄 Paper 💻 Code |
| | GR-1 | ViT, CLIP | RT-1, HULC, R3M, CALVIN | Instructions, video frame, robot state; Output: Images, action trajectories | 📄 Paper 💻 Code |
| | GR-2 | VQGAN, cVAE | GR-1, RT-1, HULC, RoboFlamingo | Instructions, video frame, robot state; Output: Images, action trajectories | 📄 Paper 💻 Code |
| | RoboDreamer | T5-XXL | UniPi, AVDC, RLBench | Language and multimodal instructions; Output: Video and actions | 📄 Paper 💻 Code |
| | EVA | CLIP, Vicuna-v1.5, ChatUniVi | EVA-Bench | Observation and instructions; Output: Videos, text responses | 📄 Paper 💻 Code |
| | PIVOT-R | CLIP, LLaVA | BC-Z, Gato, RT-1, Octo, GR-1 | Instructions, observation, robot state; Output: Waypoint image, EEF action | 📄 Paper 💻 Code |
| | DINO-WM | DINOv2 | Dreamerv3, AVDC | Current and goal observation; Output: Action sequence | 📄 Paper 💻 Code |
| | WHALE | ST-transformer | OXE, Meta-World | Observation and action subsequences; Output: Observation predictions | 📄 Paper 💻 Code |
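
Token-based VLAs such as RT-2 and OpenVLA emit discretized action tokens that must be mapped back to continuous commands. The sketch below shows that (de)tokenization step; the 256-bin count and the [-1, 1] action bounds are common choices but are assumptions here, not the exact scheme of any listed model.

```python
# Minimal action (de)tokenizer of the kind used by token-based VLAs:
# each continuous action dimension is discretized into uniform bins.
import numpy as np

NUM_BINS, LOW, HIGH = 256, -1.0, 1.0   # assumed bin count and action bounds

def tokenize(action: np.ndarray) -> np.ndarray:
    """Continuous action -> integer tokens in [0, NUM_BINS)."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).astype(int)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Integer tokens emitted by the VLM head -> continuous action."""
    return LOW + tokens.astype(float) / (NUM_BINS - 1) * (HIGH - LOW)

action = np.array([0.1, -0.3, 0.0, 0.5, -1.0, 1.0, 0.25])  # e.g. 7-DoF delta pose + gripper
tokens = tokenize(action)
print(tokens)
print(np.allclose(detokenize(tokens), action, atol=(HIGH - LOW) / (NUM_BINS - 1)))
```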

🔬 4. From Generalist to Specialist

  • Most existing efforts focus on daily life tasks with limited application scope.
  • A clear research gap remains in diverse domains, especially production and manufacturing.
  • Some initial studies have explored industrial scenarios:
    • PoCo: tasks like Hammer and Wrench.
    • Isaac: actions like Gear Insertion and Screw.
    • Robosuite: tasks like Nut Assembly and Peg-In-Hole.
  • It is important to develop specialist ("small") models for specific domains, alongside generalist large models for everyday tasks.

Pathway flowchart

🛠️ 5. Platforms & Benchmarks

| Type | Name | Focus Area | Key Features / Environment | Link | Key Publication |
| --- | --- | --- | --- | --- | --- |
| Dataset | Open X-Embodiment (OpenX) | General Manipulation | Aggregates 20+ datasets, cross-embodiment/task/environment, >1M trajectories | 💻 Project | 📄 Paper |
| | DROID | Real-world Manipulation | Large-scale human-collected data (500+ tasks, 26k hours) | 💻 Project | 📄 Paper |
| | BEHAVIOR-1K | Household Activities | 1000 simulated human household activities | 💻 Project | 📄 Paper |
| Simulator | MuJoCo | Physics Engine | Popular physics engine for robotics and RL | 💻 Website | - |
| | PyBullet | Physics Engine | Open-source physics engine, used for CALVIN, etc. | 💻 Website | - |
| | Isaac Sim / Orbit | High-fidelity Robot Simulation | NVIDIA Omniverse-based, physically realistic | 💻 Isaac-sim, Orbit | - |
| | Habitat Sim | Embodied AI Navigation | Flexible, high-performance 3D simulator | 💻 Project | 📄 Paper |
| | ManiSkill | Generalizable Manipulation Skills | Large-scale manipulation benchmark based on SAPIEN | 💻 Project | 📄 Paper |
| Benchmark | Meta-World | Multi-task / Meta-RL Manipulation | 50 Sawyer arm manipulation tasks, MuJoCo | 💻 Project | 📄 Paper |
| | RLBench | Robot Learning Manipulation | 100+ manipulation tasks, CoppeliaSim (V-REP) | 💻 Project | 📄 Paper |
| | CALVIN | Long-Horizon Manipulation | Long-horizon tasks with language conditioning, Franka arm, PyBullet simulation | 💻 Project | 📄 Paper |
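
Most of these benchmarks expose (or can be wrapped in) a Gym/Gymnasium-style reset/step interface, so a rollout loop looks roughly the same everywhere. The sketch below uses a stock Gymnasium environment as a stand-in; the environment id and random policy are placeholders, since each benchmark ships its own registration and wrappers.

```python
# Generic rollout loop against a Gymnasium-style environment wrapper.
import gymnasium as gym

env = gym.make("Pendulum-v1")        # stand-in for a manipulation benchmark env
obs, info = env.reset(seed=0)

total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()          # replace with a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += float(reward)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
print(f"episode return with a random policy: {total_reward:.2f}")
```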

🧩 Citation

If you find this work helpful, please consider citing:

@article{WU2026103064,
      title = {Empowering natural human–robot collaboration through multimodal language models and spatial intelligence: Pathways and perspectives},
      journal = {Robotics and Computer-Integrated Manufacturing},
      volume = {97},
      pages = {103064},
      year = {2026},
      issn = {0736-5845},
      doi = {https://doi.org/10.1016/j.rcim.2025.103064},
      url = {https://www.sciencedirect.com/science/article/pii/S0736584525001188},
      author = {Duidi Wu and Pai Zheng and Qianyou Zhao and Shuo Zhang and Jin Qi and Jie Hu and Guo-Niu Zhu and Lihui Wang},
}
