
showlab/Dream.exe


Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Rui Zhao1,*, Kaiming Yang1,*, Jifeng Zhu1,†, Siyang Chen1,†, Ziqi Wang1, Weijia Wu1, Kevin Qinghong Lin2, Heng Wang3, Mike Zheng Shou1,‡

1Show Lab, National University of Singapore   2University of Oxford   3Tencent

*Equal contribution   †Equal contribution (second authors)   ‡Corresponding author


Important

🚧 Code, benchmark data, and evaluation tools will be open-sourced here. Stay tuned! Please ⭐ star and watch this repository to be notified when the release lands.


📖 Overview

Can a video generation model's dream of manipulation actually be executed by a robot?

Dream.exe answers this by taking generated videos out of the screen and into a physics simulator. Instead of judging a video only by how good it looks, we convert the motion it depicts into a robot trajectory, execute it, and measure whether the task actually succeeds. Execution success then becomes a grounding signal that purely visual metrics cannot offer.

What's inside:

  • 🎬 Video-to-execution pipeline. From a single scene image and task prompt, we generate a manipulation video, lift it into a 3D robot trajectory, and roll it out in simulation.
  • 🧪 101-task benchmark. Manually curated from RoboCasa and stratified into three levels of physical complexity, scored on visual quality, trajectory fidelity, and execution success.
  • 🤖 8 models evaluated. Frontier closed-source, open-source, and robot-specific video generators under one unified protocol.
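
Because the code is not yet released, the following is only a minimal sketch of how a generate, lift, and roll-out loop of this shape could be wired together. Every name in it (`Task`, `evaluate`, and the three stage callables) is a hypothetical placeholder rather than the Dream.exe API. The stages are passed in as plain functions so that any video generator, trajectory lifter, or simulator could be substituted.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# All names below are hypothetical placeholders, not the Dream.exe API.
Video = Sequence[object]       # a sequence of frames
Trajectory = Sequence[tuple]   # e.g. 3D end-effector waypoints


@dataclass(frozen=True)
class Task:
    scene_image: object  # the single initial scene image
    prompt: str          # the natural-language task prompt


def evaluate(
    tasks: Sequence[Task],
    generate_video: Callable[[object, str], Video],
    lift_to_trajectory: Callable[[Video], Trajectory],
    rollout: Callable[[Task, Trajectory], bool],
) -> float:
    """Return the fraction of tasks whose rolled-out trajectory succeeds."""
    if not tasks:
        raise ValueError("need at least one task")
    successes = 0
    for task in tasks:
        video = generate_video(task.scene_image, task.prompt)  # stage 1: generate
        trajectory = lift_to_trajectory(video)                 # stage 2: lift to 3D
        successes += bool(rollout(task, trajectory))           # stage 3: execute
    return successes / len(tasks)
```

In this sketch, `rollout` is assumed to return a boolean success flag from the simulator, so the only quantity reported is an execution success rate; visual quality and trajectory fidelity would need separate scoring functions.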

Key findings:

  • ✅ Generative priors from internet-scale data already encode meaningful physical knowledge. Several models achieve measurable execution success with no robot-specific supervision.
  • ⚠️ Visual quality is a poor predictor of executability. Physical-plausibility scores barely correlate with task success.
  • 🧗 Long-horizon tasks remain hard. Multi-stage manipulation exposes the limits of current models.

🧪 Benchmark Task Suite

Overview of the Dream.exe task suite. Left: representative scenes and task prompts from each difficulty level. Top right: distribution of 101 tasks across the three levels. Bottom right: camera viewpoints are deliberately diversified across scenes to improve generalization coverage.

The tasks are stratified into three levels of increasing physical complexity:

  • Level 1: Single-object manipulation. Geometrically consistent end-effector motion with correct grasp/release timing.
  • Level 2: Multi-object interaction. Reasoning about object-to-object relationships and placement.
  • Level 3: Multi-stage composite tasks. Maintaining physical coherence across a long task horizon with correctly sequenced sub-goals.
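
Since results are naturally reported per level, a small aggregation helper can make the stratification concrete. This is an illustrative sketch only: the `Level` enum and `success_by_level` function are hypothetical names, and the input is assumed to be one `(level, succeeded)` pair per executed task.

```python
from collections import defaultdict
from enum import IntEnum
from typing import Dict, Iterable, Tuple


class Level(IntEnum):
    """Hypothetical encoding of the three difficulty levels."""
    SINGLE_OBJECT = 1  # Level 1: single-object manipulation
    MULTI_OBJECT = 2   # Level 2: multi-object interaction
    MULTI_STAGE = 3    # Level 3: multi-stage composite tasks


def success_by_level(results: Iterable[Tuple[Level, bool]]) -> Dict[Level, float]:
    """Return the success rate for each level that appears in `results`."""
    counts = defaultdict(lambda: [0, 0])  # level -> [successes, attempts]
    for level, succeeded in results:
        counts[level][0] += bool(succeeded)
        counts[level][1] += 1
    return {level: s / n for level, (s, n) in sorted(counts.items())}
```

Levels with no recorded attempts are omitted from the result rather than reported as zero, so a missing level is not confused with a level on which every task failed.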

📌 Citation

If you find our work useful, please consider citing:

@article{zhao2026dreamexe,
  title   = {Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?},
  author  = {Zhao, Rui and Yang, Kaiming and Zhu, Jifeng and Chen, Siyang and Wang, Ziqi and Wu, Weijia and Lin, Kevin Qinghong and Wang, Heng and Shou, Mike Zheng},
  year    = {2026}
}
