Interactive crowd counting dataset and an MLLM-driven coarse-to-fine counting agent.
Runling Long1, Yunlong Wang1, Jia Wan1*, Xiang Deng1, Xinting Zhu2, Weili Guan1, Antoni B. Chan2, Liqiang Nie1
1 Harbin Institute of Technology, Shenzhen
2 City University of Hong Kong
* Corresponding author
- Paper: Paper Link
- Hugging Face Dataset: Dataset
- Code Repository: GitHub
- Updates
- Introduction
- Highlights
- Framework
- Project Structure
- Installation
- Usage
- Citation
- Acknowledgement
- License
- [04/2026] Initial release
This is the official implementation for Embodied Crowd Counting.
Occlusion is one of the fundamental challenges in crowd counting. Various data-driven approaches have been developed to address this issue, but their effectiveness remains limited, mainly because most existing crowd counting datasets on which these methods are trained rely on passive cameras, which restricts the ability to fully sense the environment. Recently, embodied navigation methods have shown significant potential for precise object detection in interactive scenes. These methods use active camera settings and thus hold promise for addressing this fundamental issue in crowd counting. However, most of them are designed for indoor navigation, and their performance on complex object distributions in large-scale scenes, such as crowds, is unknown. In addition, most existing embodied navigation datasets consist of indoor scenes with limited scale and object quantity, preventing their use for dense crowd analysis. Motivated by this, we propose a novel task, Embodied Crowd Counting (ECC), in which an agent actively explores a large-scale scene to count the people in it. We then build an interactive simulator, the Embodied Crowd Counting Dataset (ECCD), which supports large-scale scenes and large object quantities; crowds are generated from a prior probability distribution that approximates realistic crowd distributions. On top of this, we propose a zero-shot navigation method (ZECC) as a baseline. It combines an MLLM-driven coarse-to-fine navigation mechanism that enables active Z-axis exploration with a normal-line-based crowd distribution analysis method for fine-grained counting. Experimental results show that the proposed method achieves the best trade-off between counting accuracy and navigation cost.
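To make the crowd-generation idea concrete, the sketch below samples person positions on the ground plane from a mixture-of-Gaussians prior, a common way to approximate the clustered structure of real crowds. This is only an illustration of the concept, not the actual distribution or code used by ECCD, and every parameter in it is an assumption.

import numpy as np

def sample_crowd(n_people, centers, spreads, bounds, rng=None):
    """Illustrative only: sample 2D positions from a Gaussian-mixture prior, clipped to the scene bounds."""
    rng = np.random.default_rng(rng)
    k = rng.integers(0, len(centers), size=n_people)          # pick a cluster per person
    pos = np.array([rng.normal(centers[i], spreads[i]) for i in k])
    return np.clip(pos, bounds[0], bounds[1])                 # keep positions inside the scene

# Example: 500 people, three clusters in a 200 m x 200 m scene (all values illustrative).
positions = sample_crowd(
    n_people=500,
    centers=[(50, 50), (120, 80), (160, 160)],
    spreads=[10.0, 15.0, 8.0],
    bounds=((0, 0), (200, 200)),
    rng=0,
)
print(positions.shape)  # (500, 2)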
We present the dataset and the method implementation on this page.
- We provide the full pipeline of our method.
- The dataset is released to support further studies.
Figure 1. The proposed framework. First, ATE is proposed to estimate the global crowd distribution efficiently. Then, NLBN is proposed to generate fine observation points, alleviating crowd overlap. The final result is generated by aggregating all fine detections.
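The loop in Figure 1 can be summarized with the hypothetical sketch below. The helper names (coarse_explore, plan_fine_viewpoints, move_to, detect_people) are placeholders rather than this repository's API; the actual implementation lives in Methods/OurMethod.py and Explore/OurExplore.py.

# Hypothetical sketch of the coarse-to-fine loop in Figure 1.
# All method names are placeholders, not the repo's API.
def embodied_count(agent, scene):
    # Coarse stage (ATE): fly high and estimate the global crowd distribution.
    coarse_map = agent.coarse_explore(scene)

    # Fine stage (NLBN): choose observation points along the crowd's normal
    # lines so that overlapping people are separated in the image plane.
    viewpoints = agent.plan_fine_viewpoints(coarse_map)

    total = 0
    for vp in viewpoints:
        agent.move_to(vp)
        total += len(agent.detect_people())  # fine-grained detections at this view
    return total  # final count aggregated over all fine observations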
.
├── Config.yml
├── Main.py
├── Configs/
│ ├── CountConfig.yml
│ ├── FBEConfig.yml
│ ├── FBEWithDGConfig.yml
│ ├── OurMethodConfig.yml
│ └── settings.json
├── Agent/
│ └── Prompts.py
├── assests/
├── Count/
│ └── Count.py
├── Dataset/
├── Drone/
│ └── Control.py
├── Explore/
│ ├── DensityGuided.py
│ ├── DroneLift.py
│ ├── Explore.py
│ ├── Frontier.py
│ ├── OurExplore.py
│ ├── path_3D.py
│ └── Target.py
├── Log/
├── Methods/
│ ├── FBE.py
│ ├── FBEWithDG.py
│ ├── OurMethod.py
│ └── count.py
├── Others/
│ ├── DensityGuided/
│ ├── IntuitionMap/
│ └── ValueMap/
├── Perception/
│ ├── GeneralizedLoss.py
│ ├── GPT.py
│ └── GroundingDINO.py
├── Point_cloud/
│ ├── Map_element.py
│ └── Point_cloud.py
├── Record/
├── Simulator/
│ └── Simulator.py
├── utils/
│ ├── flight.py
│ ├── logger.py
│ ├── saver.py
│ └── video.py
├── Vision_models/
│ ├── GeneralizedLoss/
│ └── GroundingDINO/
├── requirements.txt
├── README.md
└── LICENSE
Note that this project currently supports Windows only.
git clone https://github.com/iLearn-Lab/NeurIPS25-Embodied-Crowd-Counting.git
cd NeurIPS25-Embodied-Crowd-Counting
pip install -r requirements.txt

cd Vision_models/GroundingDINO
pip install -e .

cd Vision_models/GroundingDINO
mkdir weights
cd weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
cd ..

Download the pretrained model from the Google Drive link in Vision_models/GeneralizedLoss/README.md and place it in Vision_models/GeneralizedLoss.
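Optionally, you can check that the GroundingDINO checkpoint loads before running the pipeline. This is a minimal sketch using the upstream GroundingDINO inference helper; the config path assumes the default layout of the GroundingDINO package and may need adjusting for this repository.

# Sanity check: load the GroundingDINO checkpoint downloaded above.
# Run from Vision_models/GroundingDINO.
from groundingdino.util.inference import load_model

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # model config shipped with GroundingDINO
    "weights/groundingdino_swint_ogc.pth",              # checkpoint from the wget step
    device="cpu",                                       # use "cuda" if a GPU is available
)
print("GroundingDINO checkpoint loaded:", type(model).__name__)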
pip install -U huggingface_hub
huggingface-cli download iLearn-Lab/NeurIPS25-Embodied-Crowd-Counting `
--repo-type dataset `
--local-dir .\Dataset `
--local-dir-use-symlinks False

The Dataset/ directory should look like:
Dataset/
├── CITY/...
├── FACADES/...
├── HARBOUR/...
├── NEIGHBOUR/...
├── PARKING/...
└── STADIUM/...
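If you prefer to fetch the dataset from Python instead of the CLI, huggingface_hub offers an equivalent call:

# Python equivalent of the huggingface-cli download command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="iLearn-Lab/NeurIPS25-Embodied-Crowd-Counting",
    repo_type="dataset",
    local_dir="Dataset",  # same layout as shown above: CITY/, FACADES/, ...
)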
- Configure the dataset path in Config.yml (a quick sanity-check sketch follows these steps).
- Configure the selected method parameters and counting parameters in Configs/.
- Run Main.py.
- Check results in Record/.
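Before launching Main.py, verifying that the configured dataset path actually contains the six scene folders can save a failed run. This is a sketch only: the key name dataset_path is an assumption, so substitute whatever field Config.yml actually uses.

# Pre-flight check (sketch): verify the dataset path in Config.yml points at
# the six scene folders. "dataset_path" is a placeholder key name.
import yaml
from pathlib import Path

with open("Config.yml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

dataset_root = Path(cfg["dataset_path"])  # hypothetical key; check Config.yml
scenes = ["CITY", "FACADES", "HARBOUR", "NEIGHBOUR", "PARKING", "STADIUM"]
missing = [s for s in scenes if not (dataset_root / s).is_dir()]
print("All scenes found." if not missing else f"Missing scenes: {missing}")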
@article{long2025embodied,
title={Embodied Crowd Counting},
author={Long, Runling and Wang, Yunlong and Wan, Jia and Deng, Xiang and Zhu, Xinting and Guan, Weili and Chan, Antoni B and Nie, Liqiang},
journal={arXiv preprint arXiv:2503.08367},
year={2025}
}

- Thanks to our supervisor and collaborators for valuable support.
This project is released under the Apache License 2.0.