aim-uofa/Active-o3

ACTIVE-O3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning

1Zhejiang University, 2Ant Group

📄 Paper | 🌐 Project Page | 💾 Model Weights

🚀 Overview

SegAgent Framework

📖 Description

We propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO and designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 on both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving as well as fine-grained interactive segmentation. Experimental results demonstrate that ACTIVE-O3 significantly enhances active perception capabilities compared to Qwen2.5-VL-CoT. For example, Figure 1 shows zero-shot reasoning on the V* benchmark, where ACTIVE-O3 successfully identifies the number on the traffic light by zooming in on the relevant region, while Qwen2.5-VL fails to do so. Moreover, across all downstream tasks, ACTIVE-O3 consistently improves performance under fixed computational budgets. We hope this work provides a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
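For readers unfamiliar with GRPO, the core idea is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than by a learned value function. The snippet below is a minimal sketch of that group-relative normalization only. It is not taken from this repository: the function name, the `eps` constant, the choice of population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar reward per response sampled for a single
    prompt. `eps` (an assumed value) avoids division by zero when every
    response in the group received the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt.
rewards = [0.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group mean receive a positive advantage and the rest a negative one; a group whose responses all earn the same reward contributes no learning signal.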

🚩 Plan

  • Release the weights.
  • Release the inference demo.
  • Release the dataset.
  • Release the training scripts.
  • Release the evaluation scripts.

πŸ› οΈ Getting Started

πŸ“ Set up Environment

# build environment
conda create -n activeo3 python=3.10
conda activate activeo3

# install packages
pip install torch==2.5.1 torchvision==0.20.1
pip install flash-attn --no-build-isolation
pip install transformers==4.51.3
# quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "qwen-omni-utils[decord]"
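After installation, it can help to confirm that the pinned versions above are the ones actually present in the environment. The following is a small optional sketch, not part of this repository: the `PINNED` mapping simply mirrors the pins listed above, and `find_mismatches` is a hypothetical helper name. It relies only on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Versions copied from the install commands above.
PINNED = {"torch": "2.5.1", "torchvision": "0.20.1", "transformers": "4.51.3"}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: installed_version_or_None} for every unmet pin."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu124" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    for name, found in find_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {found or 'not installed'}")
```

Running the script prints nothing when all three pins are satisfied.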

πŸ” demo

# run demo
python demo/activeo3_demo_vstar.py

🎫 License

For academic usage, this project is licensed under the 2-clause BSD License. For commercial inquiries, please contact Chunhua Shen.

πŸ–ŠοΈ Citation

If you find this work helpful for your research, please cite:

@article{zhu2025active,
  title={Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO},
  author={Zhu, Muzhi and Zhong, Hao and Zhao, Canyu and Du, Zongze and Huang, Zheng and Liu, Mingyu and Chen, Hao and Zou, Cheng and Chen, Jingdong and Yang, Ming and others},
  journal={arXiv preprint arXiv:2505.21457},
  year={2025}
}
