🚀 Empowering Natural Human-Robot Collaboration through Multimodal Language Models and Spatial Intelligence
📌 Title: Empowering Natural Human-Robot Collaboration through Multimodal Language Models and Spatial Intelligence: Pathways and Perspectives
🧠 Authors: Duidi Wu, Pai Zheng, Qianyou Zhao, Shuo Zhang, Jin Qi*, Jie Hu*, Guo-Niu Zhu, Lihui Wang
🏫 Affiliations: SJTU, PolyU, FDU, KTH
📄 PDF
📝 Journal: Robotics and Computer-Integrated Manufacturing (RCIM)
📮 Contact: [email protected]
This is the first systematic review that integrates:
- 🤝 Human-Robot Collaboration (HRC)
- 🧠 Embodied Intelligence
- 🌐 Multimodal Large Language Models (MLLMs)
- 🗺️ Spatial Intelligence
We explore how MLLMs + Embodiment can empower robots to see, think, and act like humans in open, dynamic environments — enabling seamless and proactive HRC.
Why now?
Industry 5.0 calls for human-centric smart manufacturing. With the rise of MLLMs (like GPT-4V, Gemini, LLaVA), we have a unique opportunity to:
- Bridge the gap between human intent and robot execution.
- Enable spatially-aware, low-cost, multi-skill learning.
- Move beyond “cooperation” to collaboration and coevolution.
Research questions:
- RQ1: How can MLLMs and embodiment enable seamless HRC?
- RQ2: How can spatial skills be trained efficiently?
- RQ3: What are the remaining challenges and future trends?
- 📚 200+ recent works reviewed
- 🧩 Unified perspective for HRC × Embodied AI × Spatial Intelligence
- 🚀 Open challenges and design pathways for future human-centered systems
- Visual + language + motor signals for complete situational awareness.
- From human intention recognition → task reasoning → physical execution.
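The two bullets above describe a perception → intention → reasoning → execution flow. The snippet below is a minimal, hypothetical sketch of that loop; all names (`Observation`, `recognize_intention`, `plan_subtasks`, `execute`) are illustrative placeholders rather than APIs from the paper, and in a real system the intention-recognition and planning steps would be backed by an MLLM instead of keyword matching.

```python
# Hypothetical sketch of a human-intention -> task-reasoning -> execution loop.
# Names are illustrative placeholders, not an API from the surveyed works.
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    rgb: List[List[float]]        # camera image (placeholder)
    speech_text: str              # transcribed operator utterance
    joint_positions: List[float]  # robot proprioception


def recognize_intention(obs: Observation) -> str:
    """Fuse visual and language cues into a high-level human intention.
    An MLLM would do this in practice; keyword matching is a stand-in."""
    return "handover_tool" if "hand me" in obs.speech_text.lower() else "idle"


def plan_subtasks(intention: str) -> List[str]:
    """Task reasoning: decompose the intention into executable subtasks."""
    plans = {"handover_tool": ["locate_tool", "grasp_tool", "move_to_human", "release"]}
    return plans.get(intention, [])


def execute(subtasks: List[str]) -> None:
    """Physical execution: each subtask maps to a low-level skill or policy."""
    for step in subtasks:
        print(f"executing skill: {step}")


if __name__ == "__main__":
    obs = Observation(rgb=[[0.0]], speech_text="Could you hand me the wrench?",
                      joint_positions=[0.0] * 7)
    execute(plan_subtasks(recognize_intention(obs)))
```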
| Category | Method | VFM | LLM/VLM | Benchmark/Data | Tasks | Links |
|---|---|---|---|---|---|---|
| Skill Affordance | CoPa | Owl-ViT, SAM | GPT-4V | VoxPoser | Everyday manipulation tasks | 📄 Paper 💻 Code |
| | CLIPort | Transporter | CLIP | Ravens | Language-conditioned tasks | 📄 Paper 💻 Code |
| | SayCan | - | 540B PaLM | Everyday Robots | Long-horizon tasks | 📄 Paper 💻 Code |
| | Voltron | ViT | DistilBERT | Franka Kitchen | 5 robotics applications | 📄 Paper 💻 Code |
| Keypoint Affordance | MOKA | GroundedSAM | GPT-4V | Octo, VoxPoser | Table-top manipulation, unseen objects | 📄 Paper 💻 Code |
| | ReKep | DINOv2, SAM | GPT-4o | VoxPoser | In-the-wild bimanual manipulation | 📄 Paper 💻 Code |
| | KALIE | - | CogVLM, GPT-4V | MOKA, VoxPoser | Diverse unseen objects | 📄 Paper 💻 Code |
| Spatial Affordance | VoxPoser | OWL-ViT, SAM | GPT-4 | RLBench | Manipulation tasks | 📄 Paper 💻 Code |
| | RAM | DINOv2 / CLIP | Text-embedding-3, GPT-4V | DROID | 3D contact planning | 📄 Paper 💻 Code |
| | RoboPoint | CLIP, ViT-L/14 | Vicuna-13B | WHERE2PLACE | Language-conditioned 3D actions | 📄 Paper 💻 Code |
| Human Affordance | HRP | DINO, CLIP | - | Ego4D | Human-hand-object interaction | 📄 Paper 💻 Code |
| | HULC++ | - | GPT-3, MiniLM-L3-v2 | CALVIN | Long-horizon manipulation | 📄 Paper 💻 Code |
| Category | Method | VFM | LLM/VLM | Benchmark | Robot | Tasks | Links |
|---|---|---|---|---|---|---|---|
| Subtask Planning | PaLM-E | - | PaLM | Language-Table | Everyday Robot | Visually-grounded dialogue | 📄 Paper 💻 Code |
| | Pg-vlm | OWL-ViT | GPT-4, PG-InstructBLIP | PHYSOBJECTS | Franka Panda | Table-top manipulation | 📄 Paper 💻 Code |
| | ViLA | OWL-ViT | Llama2-70B, GPT-4V | Ravens | Franka Panda | Long-horizon planning | 📄 Paper 💻 Code |
| | SayCan | ViLD | 540B PaLM | Everyday Robots | Everyday Robot | Long-horizon tasks | 📄 Paper 💻 Code |
| | GD | OWL-ViT | InstructGPT, PaLM | Ravens, CLIPort | Everyday Robot | Rearrangement, mobile manipulation | 📄 Paper 💻 Code |
| | Text2Motion | - | Text-davinci-003 | TableEnv | - | Long-horizon manipulation | 📄 Paper 💻 Code |
| Code Generation | Instruct2Act | SAM | Text-davinci-003 | VIMABench | - | Manipulation & reasoning | 📄 Paper 💻 Code |
| | Inner Monologue | MDETR | InstructGPT | Ravens, CLIPort | UR5e, ERobot | Mobile rearrangement | 📄 Paper 💻 Code |
| | CaP | ViLD, MDETR | GPT-3, Codex | HumanEval | UR5e | Table-top & mobile manipulation | 📄 Paper 💻 Code |
| | ProgPrompt | ViLD | GPT-3 | VirtualHome | Panda | Household table-top tasks | 📄 Paper 💻 Code |
| Model | Structure | Problem | Benchmark | Input | Output | Links |
|---|---|---|---|---|---|---|
| SeeDo | SAM2 + GPT-4o | Code Generation | CaP | Human demo videos | Executable code | 📄 Paper 💻 Code |
| OKAMI | GPT-4V + SLAHMR | Humanoid manipulation | ORION | Human video | Manipulation policy | 📄 Paper 💻 Code |
| R3M | ResNet50 + DistilBERT | Visual Representation | Ego4D | Image, proprioception | Action vector | 📄 Paper 💻 Code |
| R+X | DINO + Gemini | Skill retrieval | R3M | RGB-D observation | 6-DoF action | 📄 Paper 💻 Code |
| RT-Trajectory | PaLM-E | Trajectory generalization | RT-1 | Drawings, videos | Trajectory tokens | 📄 Paper 💻 Code |
| Gen2Act | Gemini + VideoPoet | Behavior cloning | Vid2robot | Instruction, observation | Trajectory | 📄 Paper 💻 Code |
| EgoMimic | ACT-based | End-to-end imitation | ACT | Hand pose, proprioception | SE(3) pose prediction | 📄 Paper 💻 Code |
| Method | Policy Type | Input State | Action Output | Core Structure | Links |
|---|---|---|---|---|---|
| PlayLMP | GCBC | Observation, proprioception | 8-DoF action | Seq2Seq CVAE | 📄 Paper 💻 Code |
| MCIL | GCBC | Observation + instruction | 8-DoF action | TransferLangLfP | 📄 Paper 💻 Code |
| BC-Z | End-to-end BC | Image + task embedding | 7-DoF action | ResNet18 + FiLM + FC | 📄 Paper 💻 Code |
| Language Table | LCBC | Language instruction | 2D point | LAVA | 📄 Paper 💻 Code |
| CALVIN | LH-MTLC | Multi-modal input | Cartesian or joint | Seq2Seq CVAE | 📄 Paper 💻 Code |
| HULC | LCBC | Static image + language | 7-DoF action | Seq2Seq CVAE | 📄 Paper 💻 Code |
| HULC++ | LCBC | Static image + language | 7-DoF action | HULC + VAPO | 📄 Paper 💻 Code |
| Method | Policy Type | Challenge | MLLM | Role | Environment | Links |
|---|---|---|---|---|---|---|
| Di Palo | BC | Sparse-reward | FLAN-T5, CLIP | Subgoal generation | MuJoCo | 📄 Paper 💻 Code |
| L2R | MJPC | Reward optimization | GPT-4 | Reward function design | MuJoCo | 📄 Paper 💻 Code |
| VLM-RM | DQN, SAC | Zero-shot rewards | CLIP | Reward computation | - | 📄 Paper 💻 Code |
| Song et al. | PPO | Self-refinement | GPT-4 | Reward designer | Isaac Sim | 📄 Paper 💻 Code |
| Eureka | PPO | Human-level reward | GPT-4 | Zero-shot reward | Isaac Gym | 📄 Paper 💻 Code |
| LIV | BC | Goal-conditioned reward | CLIP | Multimodal value learning | MetaWorld | 📄 Paper 💻 Code |
| Model | Structure | Problem | Input | Output | Robot | Links |
|---|---|---|---|---|---|---|
| Diffusion Policy | DDPM | Action generation | Observation, proprioception | Action sequence | UR5, Panda | 📄 Paper 💻 Code |
| 3DDA | CLIP + 3D Diffuser | 3D conditional planning | Instruction + 3D scene | Trajectory | Franka | 📄 Paper 💻 Code |
| PoCo | Diffusion Policy | Heterogeneous policy | RGB, pointcloud, language | Trajectory | Franka | 📄 Paper 💻 Code |
| MDT | CLIP + Voltron + Perceiver | Core diffusion policy | Observation + goal | Action chunk | Franka | 📄 Paper 💻 Code |
| Octo | T5-base | Action chunk diffusion | Obs + instruction | Action chunk | 9 robots | 📄 Paper 💻 Code |
| RDT-1B | SigLIP + T5-XXL | Scaled policy learning | Visuo-lingo-motor data | Denoised action chunk | ALOHA robot | 📄 Paper 💻 Code |
| 𝜋0 | PaliGemma VLM | Diffusion policy | Visuo-lingo-motor data | Consecutive action chunk | 7 robots | 📄 Paper 💻 Code |
- Visuo/Motor: "vision → action", $\pi(a|o)$: robots output actions based on visual observation, as in Diffusion Policy; or "state → action", $\pi(a|s)$: for example, Decision Transformer uses states to predict the action at the next time step.
- Visuo-Motor: "vision + state → action": observations and proprioception (e.g., joint positions) are integrated to output actions, as in ACT.
- Visuo-Lingo: "vision + language → action", $\pi(a|o,l)$: also known as a language-conditioned visuomotor policy, this paradigm predicts actions from observations and instructions. Most VLAs follow this structure, such as OpenVLA and RT-2.
- Visuo-Lingo-Motor: "vision + state + language → action", $\pi(a|s,o,l)$: robots holistically integrate visual, linguistic, and physical inputs, as in 𝜋0, Octo, and GenSim2 (see the sketch after this list).
- This pathway toward human-like intelligence spans from imitation and reinforcement learning, to out-of-the-box or instruction-tuned vision-language-action (VLA) models, and further to diffusion policies and world models, paving the way toward general embodied intelligence.
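The four conditioning paradigms listed above can be summarized as function signatures. The sketch below is a minimal illustration with assumed names and a placeholder 7-DoF action; the cited systems (Diffusion Policy, ACT, OpenVLA, RT-2, 𝜋0, Octo) realize these interfaces with learned networks rather than the stubs shown here.

```python
# Illustrative signatures for the four policy-conditioning paradigms.
# Names and the 7-DoF action placeholder are assumptions for clarity only.
from typing import List, Sequence

Action = List[float]  # e.g., a 7-DoF end-effector command


def pi_visuo(o: Sequence[float]) -> Action:
    """Visuo/Motor: pi(a | o), action from visual observation only."""
    return [0.0] * 7


def pi_state(s: Sequence[float]) -> Action:
    """State-based: pi(a | s), action from low-dimensional state (cf. Decision Transformer)."""
    return [0.0] * 7


def pi_visuo_motor(o: Sequence[float], s: Sequence[float]) -> Action:
    """Visuo-Motor: pi(a | o, s), observation fused with proprioception (cf. ACT)."""
    return [0.0] * 7


def pi_visuo_lingo(o: Sequence[float], l: str) -> Action:
    """Visuo-Lingo: pi(a | o, l), language-conditioned visuomotor policy (most VLAs)."""
    return [0.0] * 7


def pi_visuo_lingo_motor(o: Sequence[float], s: Sequence[float], l: str) -> Action:
    """Visuo-Lingo-Motor: pi(a | s, o, l), integrating vision, language, and proprioception."""
    return [0.0] * 7
```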
| Category | Method | VFM / LLM / VLM | Benchmark/Data | Input / Tasks / Output | Links |
|---|---|---|---|---|---|
| Robotic Transformers (RT) | RT-1 | FiLM EfficientNet | RT-1 | Observation and instructions; Output: 11D actions | 📄 Paper 💻 Code |
| | RT-2 | ViT, PaLM-E, PaLI-X | RT-1 | Observation and instructions; Output: 7D action tokens | 📄 Paper 💻 Code |
| | MOO | Owl-ViT, FiLM EfficientNet | RT-1 | Images and language instructions; Output: 7D action tokens | 📄 Paper 💻 Code |
| | Q-Transformer | FiLM EfficientNet | Manual dataset | Observation and instructions; Output: Q-value of action | 📄 Paper 💻 Code |
| | RT-H | ViT, PaLI-X | Kitchen dataset | Image and task tokens, action query; Output: Action token | 📄 Paper 💻 Code |
| Vision-Language-Action (VLA) | Bi-VLA | Qwen-VL | - | Observation and user request; Output: Executable code | 📄 Paper 💻 Code |
| | OpenVLA | SigLIP, DinoV2, Prismatic-7B | OXE, BridgeData V2 | Observation and instructions; Output: 7D action tokens | 📄 Paper 💻 Code |
| | TinyVLA | Pythia | MetaWorld | Observation and instructions; Output: 6D action | 📄 Paper 💻 Code |
| | LLaRA | GPT-4, LLaVA-1.5-7B | VIMA, inBC, D-inBC | Observation, task, and previous actions; Output: Textual actions | 📄 Paper 💻 Code |
| | RoboPoint | CLIP, Vicuna-v1.5 | WHERE2PLACE | Observation and instructions; Output: 3D action points | 📄 Paper 💻 Code |
| | Roboflamingo | LLaMA, GPT-4, OpenFlamingo | CALVIN | Task and 2 camera views; Output: 7D action tokens | 📄 Paper 💻 Code |
| | RoboUniView | ViT, UVFormer | CALVIN | Task and multi-camera views; Output: 7D action tokens | 📄 Paper 💻 Code |
| | RoboMamba | CLIP, Mamba | LLaVA 1.5, RoboVQA | Image and language question; Output: 6-DoF EEF poses | 📄 Paper 💻 Code |
| Out-of-the-box Usage | CoPa | Owl-ViT, SAM, GPT-4V | Real-world data | Observation and instructions; Output: 6-DoF end-effector poses | 📄 Paper 💻 Code |
| | VoxPoser | Owl-ViT, SAM, GPT-4 | RLBench | Observation and instructions; Output: Sequence of 6-DoF waypoints | 📄 Paper 💻 Code |
| | ReKep | DINOv2, SAM, GPT-4o | VoxPoser | Observation and instructions; Output: Sequence of 6-DoF poses | 📄 Paper 💻 Code |
| | MA | GPT-4V, Qwen-VL | RLBench | Task goal and multi-view images; Output: 6-DoF EEF poses | 📄 Paper 💻 Code |
| | Open6DOR | GroundedSAM, GPT-4V | Synthetic dataset | Observation and instructions; Output: Robot motion trajectory | 📄 Paper 💻 Code |
| World Model | 3D-VLA | Flan, T5XL, BLIP2 | OXE, RH20T | Interaction token with 3D scene; Output: Image, pointcloud, action | 📄 Paper 💻 Code |
| | GR-1 | ViT, CLIP | RT-1, HULC, R3M, CALVIN | Instructions, video frame, robot state; Output: Images, action trajectories | 📄 Paper 💻 Code |
| | GR-2 | VQGAN, cVAE | GR-1, RT-1, HULC, RoboFlamingo | Instructions, video frame, robot state; Output: Images, action trajectories | 📄 Paper 💻 Code |
| | RoboDreamer | T5-XXL | UniPi, AVDC, RLBench | Language and multimodal instructions; Output: Video and actions | 📄 Paper 💻 Code |
| | EVA | CLIP, Vicuna-v1.5, ChatUniVi | EVA-Bench | Observation and instructions; Output: Videos, text responses | 📄 Paper 💻 Code |
| | PIVOT-R | CLIP, LLAVA | BC-Z, Gato, RT-1, Octo, GR-1 | Instructions, observation, robot state; Output: Waypoint image, EEF action | 📄 Paper 💻 Code |
| | DINO-WM | DINOv2 | Dreamerv3, AVDC | Current and goal observation; Output: Action sequence | 📄 Paper 💻 Code |
| | WHALE | ST-transformer | OXE, Meta-World | Observation and action subsequences; Output: Observation predictions | 📄 Paper 💻 Code |
- Most existing efforts focus on daily life tasks with limited application scope.
- A clear research gap remains in diverse domains, especially production and manufacturing.
- Some initial studies have begun to explore industrial scenarios.
- It is important to develop specialist ("small") models for specific domains, alongside generalist large models for everyday tasks.
| Type | Name | Focus Area | Key Features / Environment | Link | Key Publication |
|---|---|---|---|---|---|
| Dataset | Open X-Embodiment (OpenX) | General Manipulation | Aggregates 20+ datasets, cross-embodiment/task/environment, >1M trajectories | 💻 Project | 📄 Paper |
| | DROID | Real-world Manipulation | Large-scale human-collected data (500+ tasks, 26k hours) | 💻 Project | 📄 Paper |
| | BEHAVIOR-1K | Household Activities | 1000 simulated human household activities | 💻 Project | 📄 Paper |
| Simulator | MuJoCo | Physics Engine | Popular physics engine for robotics and RL | 💻 Website | - |
| | PyBullet | Physics Engine | Open-source physics engine, used for CALVIN, etc. | 💻 Website | - |
| | Isaac Sim / Orbit | High-fidelity Robot Simulation | NVIDIA Omniverse-based, physically realistic | 💻 Isaac-sim, Orbit | - |
| | Habitat Sim | Embodied AI Navigation | Flexible, high-performance 3D simulator | 💻 Project | 📄 Paper |
| | ManiSkill | Generalizable Manipulation Skills | Large-scale manipulation benchmark based on SAPIEN | 💻 Project | 📄 Paper |
| Benchmark | Meta-World | Multi-task / Meta RL Manipulation | 50 Sawyer arm manipulation tasks, MuJoCo | 💻 Project | 📄 Paper |
| | RLBench | Robot Learning Manipulation | 100+ manipulation tasks, CoppeliaSim (V-REP) | 💻 Project | 📄 Paper |
| | CALVIN | Long-Horizon Manipulation | Long-horizon tasks with language conditioning, Franka arm, PyBullet simulation | 💻 Project | 📄 Paper |
If you find this work helpful, please consider citing:
@article{WU2026103064,
title = {Empowering natural human–robot collaboration through multimodal language models and spatial intelligence: Pathways and perspectives},
journal = {Robotics and Computer-Integrated Manufacturing},
volume = {97},
pages = {103064},
year = {2026},
issn = {0736-5845},
doi = {10.1016/j.rcim.2025.103064},
url = {https://www.sciencedirect.com/science/article/pii/S0736584525001188},
author = {Duidi Wu and Pai Zheng and Qianyou Zhao and Shuo Zhang and Jin Qi and Jie Hu and Guo-Niu Zhu and Lihui Wang},
}



