Rui Zhao1,*, Kaiming Yang1,*, Jifeng Zhu1,†, Siyang Chen1,†, Ziqi Wang1, Weijia Wu1, Kevin Qinghong Lin2, Heng Wang3, Mike Zheng Shou1,‡
1Show Lab, National University of Singapore   2University of Oxford   3Tencent
*Equal contribution   †Equal contribution (second authors)   ‡Corresponding author
Important
🚧 Code, benchmark data, and evaluation tools will be open-sourced here. Stay tuned! Please ⭐ star and watch this repository to be notified when the release lands.
Can a video generation model's dream of manipulation actually be executed by a robot?
Dream.exe answers this by taking generated videos out of the screen and into a physics simulator. Instead of judging a video only by how good it looks, we convert the motion it depicts into a robot trajectory, execute it, and measure whether the task actually succeeds. Execution success then becomes a grounding signal that purely visual metrics cannot offer.
What's inside:
- 🎬 Video-to-execution pipeline. From a single scene image and task prompt, we generate a manipulation video, lift it into a 3D robot trajectory, and roll it out in simulation.
- 🧪 101-task benchmark. Manually curated from RoboCasa and stratified into three levels of physical complexity, scored on visual quality, trajectory fidelity, and execution success.
- 🤖 8 models evaluated. Frontier closed-source, open-source, and robot-specific video generators under one unified protocol.
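Since the code is not yet released, the following is only a rough sketch of how the three pipeline stages listed above could be chained together. Every name in it (`Task`, `Result`, `evaluate`, and the three stage callables) is a hypothetical placeholder, not the project's API; the real video generator, trajectory lifter, and simulator rollout would be plugged in where the callables are passed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical waypoint layout: end-effector x, y, z plus gripper opening.
Waypoint = Tuple[float, float, float, float]


@dataclass
class Task:
    scene_image: str  # path to the single scene image
    prompt: str       # natural-language task prompt


@dataclass
class Result:
    task: Task
    num_waypoints: int
    success: bool


def evaluate(
    tasks: Sequence[Task],
    generate_video: Callable[[str, str], list],
    lift_to_trajectory: Callable[[list], List[Waypoint]],
    rollout: Callable[[Task, List[Waypoint]], bool],
) -> List[Result]:
    """Run image + prompt -> video -> 3D trajectory -> simulated rollout."""
    results = []
    for task in tasks:
        frames = generate_video(task.scene_image, task.prompt)  # stage 1
        trajectory = lift_to_trajectory(frames)                 # stage 2
        success = rollout(task, trajectory)                     # stage 3
        results.append(Result(task, len(trajectory), success))
    return results
```

Passing the stages as callables keeps the loop identical for every video model, which is one simple way to get the "one unified protocol" property: only `generate_video` changes between models.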
Key findings:
- ✅ Generative priors from internet-scale data already encode meaningful physical knowledge. Several models achieve measurable execution success with no robot-specific supervision.
- ⚠️ Visual quality is a poor predictor of executability. Physical-plausibility scores barely correlate with task success.
- 🚧 Long-horizon tasks remain hard. Multi-stage manipulation exposes the limits of current models.
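The second finding is a statement about correlation between a per-video score and a binary success outcome. This README does not say which statistic is used; as an assumption, the sketch below uses the Pearson coefficient, which reduces to the point-biserial correlation when one variable is 0/1. The function name and any inputs you feed it are illustrative only.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; with ys in {0, 1} this is the point-biserial r."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either variable is constant
    return cov / math.sqrt(var_x * var_y)
```

Calling `pearson(plausibility_scores, successes)` with one entry per generated video, where `successes` holds 1 for an executed task and 0 otherwise, gives a value near zero when the visual score carries little information about executability.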
Overview of the Dream.exe task suite. Left: representative scenes and task prompts from each difficulty level. Top right: distribution of 101 tasks across the three levels. Bottom right: camera viewpoints are deliberately diversified across scenes to improve generalization coverage.
The tasks are stratified into three levels of increasing physical complexity:
- Level 1, Single-object manipulation. Geometrically consistent end-effector motion with correct grasp/release timing.
- Level 2, Multi-object interaction. Reasoning about object-to-object relationships and placement.
- Level 3, Multi-stage composite tasks. Maintaining physical coherence across a long task horizon with correctly sequenced sub-goals.
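Because tasks are stratified by level, results are naturally reported per level. The snippet below is a minimal sketch of that aggregation, assuming each rollout is recorded as a hypothetical `(level, success)` pair; it is not the benchmark's released evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Level names as listed above.
LEVEL_NAMES = {
    1: "Single-object manipulation",
    2: "Multi-object interaction",
    3: "Multi-stage composite tasks",
}


def success_rate_by_level(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Return the fraction of successful rollouts for each level present."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for level, success in results:
        if level not in LEVEL_NAMES:
            raise ValueError(f"unknown level: {level}")
        totals[level] += 1
        wins[level] += int(success)
    return {level: wins[level] / totals[level] for level in sorted(totals)}
```

Reporting the three rates separately, rather than one pooled number, makes it visible when a model handles Level 1 well but degrades on the longer Level 3 tasks.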
If you find our work useful, please consider citing:
@article{zhao2026dreamexe,
title = {Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?},
author = {Zhao, Rui and Yang, Kaiming and Zhu, Jifeng and Chen, Siyang and Wang, Ziqi and Wu, Weijia and Lin, Kevin Qinghong and Wang, Heng and Shou, Mike Zheng},
year = {2026}
}