Sihan Yang1*, Runsen Xu1,2*‡, Yiman Xie1,3, Sizhe Yang1,2, Mo Li1,4, Jingli Lin1,5, Chenming Zhu1,6, Xiaochen Chen7, Haodong Duan1, Xiangyu Yue1,2, Dahua Lin1,2, Tai Wang1†, Jiangmiao Pang1†
1Shanghai AI Laboratory, 2The Chinese University of Hong Kong, 3Zhejiang University, 4Tsinghua University, 5Shanghai Jiao Tong University, 6The University of Hong Kong, 7Beijing Normal University
*Equal Contribution ‡Project Lead †Corresponding Author
🌐 Homepage | 🤗 Dataset | 📑 Paper | 📖 arXiv
🔥[2025-05-30]: We released our paper, benchmark, and evaluation code.
Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and evaluate 34 open-source and proprietary MLLMs, observing a wide performance gap: the strongest open-source model attains roughly 30% accuracy and OpenAI’s o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes: (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering insights for advancing multi-image spatial intelligence.
MMSI-Bench systematically categorizes multi-image spatial reasoning tasks into ten basic types and one multi-step reasoning category, covering three fundamental spatial elements: camera (the agent), object (entities in the environment), and region (semantic areas like rooms). The six positional relationship categories include camera-camera, camera-object, camera-region, object-object, object-region, and region-region. In addition, there are two types of attribute reasoning (measurement and appearance), two types of motion reasoning (camera motion and object motion), and a multi-step reasoning category for more complex tasks. Each question requires information from multiple images, aiming to comprehensively evaluate a model’s ability to understand and reason about spatial relationships, attributes, and movements across images.
```python
from datasets import load_dataset

# Download MMSI-Bench from the Hugging Face Hub and show its splits.
dataset = load_dataset("RunsenXu/MMSI-Bench")
print(dataset)
```
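Each item is a multi-image multiple-choice question. Below is a minimal sketch of how to inspect one sample; the field names accessed here (`question`, `answer`) are assumptions for illustration, so print the column names to see the actual schema.

```python
from datasets import load_dataset

# A minimal sketch of inspecting one sample; the field names accessed below
# ("question", "answer") are assumptions, so check the printed column names.
dataset = load_dataset("RunsenXu/MMSI-Bench")
split = dataset[list(dataset.keys())[0]]   # use whichever split is available
print(split.column_names)                  # the actual schema

sample = split[0]
print(sample.get("question"))              # assumed field name
print(sample.get("answer"))                # assumed field name
```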
Please refer to the VLMEvalKit evaluation guidelines.
```bash
# API model
python run.py --model Seed1.5-VL --data MMSI_Bench

# Hugging Face model
python run.py --model Qwen2.5-VL-7B-Instruct --data MMSI_Bench
```
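VLMEvalKit handles answer extraction and scoring for MMSI-Bench internally; the snippet below is only a standalone sketch of how multiple-choice accuracy can be computed from raw model responses, not the toolkit's actual implementation. The `extract_choice` helper and its regex are hypothetical.

```python
import re
from typing import List, Optional

def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter (A-D) found in a response."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None

def accuracy(responses: List[str], answers: List[str]) -> float:
    """Fraction of responses whose extracted letter matches the ground truth."""
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers) if answers else 0.0

# Toy example: two responses scored against ground-truth letters.
print(accuracy(["The answer is B.", "C"], ["B", "D"]))  # 0.5
```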
Model | Avg. (%) | Type |
---|---|---|
🥇 Human Level | 97.2 | Baseline |
🥈 o3 | 41.0 | Proprietary |
🥉 GPT-4.5 | 40.3 | Proprietary |
Gemini-2.5-Pro--Thinking | 37.0 | Proprietary |
Gemini-2.5-Pro | 36.9 | Proprietary |
Doubao-1.5-pro | 33.0 | Proprietary |
GPT-4.1 | 30.9 | Proprietary |
Qwen2.5-VL-72B | 30.7 | Open-source |
NVILA-15B | 30.5 | Open-source |
GPT-4o | 30.3 | Proprietary |
Claude-3.7-Sonnet--Thinking | 30.2 | Proprietary |
Seed1.5-VL | 29.7 | Proprietary |
InternVL2.5-2B | 29.0 | Open-source |
InternVL2.5-8B | 28.7 | Open-source |
DeepSeek-VL2-Small | 28.6 | Open-source |
InternVL3-78B | 28.5 | Open-source |
InternVL2.5-78B | 28.5 | Open-source |
LLaVA-OneVision-72B | 28.4 | Open-source |
NVILA-8B | 28.1 | Open-source |
InternVL2.5-26B | 28.0 | Open-source |
DeepSeek-VL2 | 27.1 | Open-source |
InternVL3-1B | 27.0 | Open-source |
InternVL3-9B | 26.7 | Open-source |
Qwen2.5-VL-3B | 26.5 | Open-source |
InternVL2.5-4B | 26.3 | Open-source |
InternVL2.5-1B | 26.1 | Open-source |
Qwen2.5-VL-7B | 25.9 | Open-source |
InternVL3-8B | 25.7 | Open-source |
Llama-3.2-11B-Vision | 25.4 | Open-source |
InternVL3-2B | 25.3 | Open-source |
🃏 Random Guessing | 25.0 | Baseline |
LLaVA-OneVision-7B | 24.5 | Open-source |
DeepSeek-VL2-Tiny | 24.0 | Open-source |
Blind GPT-4o | 22.7 | Baseline |
If you find our work and this codebase helpful, please consider starring this repo 🌟 and citing:
```bibtex
@article{yang2025mmsi,
  title={MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence},
  author={Yang, Sihan and Xu, Runsen and Xie, Yiman and Yang, Sizhe and Li, Mo and Lin, Jingli and Zhu, Chenming and Chen, Xiaochen and Duan, Haodong and Yue, Xiangyu and Lin, Dahua and Wang, Tai and Pang, Jiangmiao},
  journal={arXiv preprint arXiv:2505.23764},
  year={2025}
}
```
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
MMSI-Bench makes use of data from existing image datasets: ScanNet, nuScenes, Matterport3D, Ego4D, AgiBot-World, DTU, DAVIS-2017, and Waymo. We thank these teams for their open-source contributions.
- Sihan Yang: [email protected]
- Runsen Xu: [email protected]