Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Rui Zhao¹*, Kaiming Yang¹*, Jifeng Zhu¹†, Siyang Chen¹†, Ziqi Wang¹, Weijia Wu¹, Kevin Qinghong Lin², Heng Wang³, Mike Zheng Shou¹‡

¹Show Lab, National University of Singapore ²University of Oxford ³Tencent

*Equal contribution †Equal contribution (second authors) ‡Corresponding author

Important

🚧 Code, benchmark data, and evaluation tools will be open-sourced here. Stay tuned! Please ⭐ star and watch this repository to be notified when the release lands.

📖 Overview

Can a video generation model's dream of manipulation actually be executed by a robot?

Dream.exe answers this by taking generated videos off the screen and into a physics simulator. Instead of judging a video only by how good it looks, we convert the motion it depicts into a robot trajectory, execute it, and measure whether the task actually succeeds. Execution success then serves as a grounding signal that purely visual metrics cannot offer.

What's inside:

🎬 Video-to-execution pipeline. From a single scene image and task prompt, we generate a manipulation video, lift it into a 3D robot trajectory, and roll it out in simulation.
🧪 101-task benchmark. Manually curated from RoboCasa and stratified into three levels of physical complexity, scored on visual quality, trajectory fidelity, and execution success.
🤖 8 models evaluated. Frontier closed-source, open-source, and robot-specific video generators under one unified protocol.
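Since the code is not yet released, the following is only a minimal sketch of how the three pipeline stages described above could be chained. Every name here (`run_episode`, `EpisodeResult`, the `Waypoint` layout, and the three stage callables) is a hypothetical placeholder, not the project's API; the stub stages exist only so the sketch runs end to end.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical type aliases: the real data representations are not
# specified in this README.
Video = List[Any]                              # a sequence of frames
Waypoint = Tuple[float, float, float, float]   # x, y, z, gripper opening


@dataclass
class EpisodeResult:
    task_id: str
    num_waypoints: int
    success: bool


def run_episode(
    task_id: str,
    scene_image: Any,
    prompt: str,
    generate_video: Callable[[Any, str], Video],
    lift_to_trajectory: Callable[[Video], List[Waypoint]],
    rollout: Callable[[List[Waypoint]], bool],
) -> EpisodeResult:
    """Chain the three stages: generate, lift to 3D, execute in simulation."""
    video = generate_video(scene_image, prompt)   # scene image + prompt -> video
    trajectory = lift_to_trajectory(video)        # video -> robot trajectory
    success = rollout(trajectory)                 # trajectory -> task success
    return EpisodeResult(task_id, len(trajectory), success)


# Stub stages (assumptions for illustration only, not real models or simulators).
def fake_generate(scene_image: Any, prompt: str) -> Video:
    return [scene_image] * 3


def fake_lift(video: Video) -> List[Waypoint]:
    return [(0.0, 0.0, 0.1 * i, 1.0) for i in range(len(video))]


def fake_rollout(trajectory: List[Waypoint]) -> bool:
    return len(trajectory) > 0


result = run_episode(
    "demo_task", "frame0", "pick up the mug",
    fake_generate, fake_lift, fake_rollout,
)
```

Passing the stages in as callables keeps the sketch agnostic to which video generator, lifting method, or simulator is plugged in, which mirrors the idea of evaluating many models under one protocol.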

Key findings:

✅ Generative priors from internet-scale data already encode meaningful physical knowledge. Several models achieve measurable execution success with no robot-specific supervision.
⚠️ Visual quality is a poor predictor of executability. Physical-plausibility scores barely correlate with task success.
🧗 Long-horizon tasks remain hard. Multi-stage manipulation exposes the limits of current models.
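One way to check a claim like the second finding is to correlate a per-video visual score with the binary execution outcome. The sketch below computes a plain Pearson coefficient, which for a 0/1 outcome equals the point-biserial correlation. This is an assumption about how such an analysis could be done, not the statistic used by the authors, and the sample values are made-up toy numbers rather than benchmark results.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy, made-up values for illustration only (not results from the benchmark).
visual_scores = [0.9, 0.8, 0.7, 0.6]   # hypothetical physical-plausibility scores
task_success = [1, 0, 1, 0]            # hypothetical execution outcomes (1 = success)

r = pearson(visual_scores, task_success)
print(f"correlation between visual score and success: {r:.3f}")
```

A coefficient near zero would indicate that the visual score carries little information about whether the rollout succeeds.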

🧪 Benchmark Task Suite

Overview of the Dream.exe task suite. Left: representative scenes and task prompts from each difficulty level. Top right: distribution of 101 tasks across the three levels. Bottom right: camera viewpoints are deliberately diversified across scenes to improve generalization coverage.

The tasks are stratified into three levels of increasing physical complexity:

Level 1: Single-object manipulation. Requires geometrically consistent end-effector motion with correct grasp/release timing.
Level 2: Multi-object interaction. Requires reasoning about object-to-object relationships and placement.
Level 3: Multi-stage composite tasks. Requires maintaining physical coherence across a long task horizon with correctly sequenced sub-goals.
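Because the tasks are stratified by level, execution success is naturally reported per level. The snippet below is a minimal sketch of that aggregation; `TaskResult` and `success_rate_by_level` are hypothetical names, and the sample records are placeholders rather than benchmark data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    level: int      # 1, 2, or 3, following the stratification above
    success: bool


def success_rate_by_level(results: Iterable[TaskResult]) -> Dict[int, float]:
    """Return the fraction of successful rollouts for each level present."""
    totals: Dict[int, int] = defaultdict(int)
    wins: Dict[int, int] = defaultdict(int)
    for r in results:
        if r.level not in (1, 2, 3):
            raise ValueError(f"unknown level {r.level} for task {r.task_id}")
        totals[r.level] += 1
        wins[r.level] += int(r.success)
    return {level: wins[level] / totals[level] for level in sorted(totals)}


# Placeholder records for illustration only.
sample = [
    TaskResult("task_a", 1, True),
    TaskResult("task_b", 1, False),
    TaskResult("task_c", 3, True),
]
rates = success_rate_by_level(sample)
```

Levels with no recorded tasks are simply absent from the returned dictionary, so a caller can distinguish "no data" from a 0% success rate.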

📌 Citation

If you find our work useful, please consider citing:

@article{zhao2026dreamexe,
  title   = {Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?},
  author  = {Zhao, Rui and Yang, Kaiming and Zhu, Jifeng and Chen, Siyang and Wang, Ziqi and Wu, Weijia and Lin, Kevin Qinghong and Wang, Heng and Shou, Mike Zheng},
  year    = {2026}
}
