Shaoan Wang
Jiazhao Zhang
Minghan Li
Jiahang Liu
Anqi Li
Kui Wu
Fangwei Zhong
Junzhi Yu
Zhizheng Zhang
He Wang
Peking University
Galbot
Beihang University
Beijing Normal University
Beijing Academy of Artificial Intelligence
TrackVLA is a vision-language-action model capable of simultaneous object recognition and visual tracking, trained on a dataset of 1.7 million samples. It demonstrates robust tracking, long-horizon tracking, and cross-domain generalization in diverse, challenging environments.
- [25/10/16]: We are thrilled to announce that Uni-NaVid (a VLA model with discrete actions) has been officially open-sourced and now fully supports testing on the EVT-Bench!
- [25/07/02]: The EVT-Bench is now available.
- Preparing conda env

First, you need to install conda. Once conda is installed, create a new environment:

```bash
conda create -n evt_bench python=3.9 cmake=3.14.0
conda activate evt_bench
```
- Conda install habitat-sim

You need to install habitat-sim v0.3.1:

```bash
conda install habitat-sim==0.3.1 withbullet -c conda-forge -c aihabitat
```
- Clone the repo

```bash
git clone https://github.com/wsakobe/TrackVLA.git
cd TrackVLA
```
- Install habitat-lab

```bash
pip install -e habitat-lab
```
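If you want to confirm that the editable install succeeded, a quick import check like the following can help (this snippet is illustrative and not part of the repo):

```python
# Illustrative sanity check: confirm the editable habitat-lab install is importable.
import habitat

print("habitat imported from:", habitat.__file__)
print("version:", getattr(habitat, "__version__", "unknown"))
```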
- Prepare datasets

Download the Habitat-Matterport 3D (HM3D) dataset from here and Matterport3D (MP3D) from here.
Then move the datasets to `data/scene_datasets`. The structure of the dataset is outlined as follows:

```
data/
└── scene_datasets/
    ├── hm3d/
    │   ├── train/
    │   │   └── ...
    │   ├── val/
    │   │   └── ...
    │   └── minival
    │       └── ...
    └── mp3d/
        ├── 1LXtFkjw3qL
        │   └── ...
        └── ...
```

Next, run the following command to obtain data for the humanoid avatars:

```bash
python download_humanoid_data.py
```

If the download fails, please download humanoids.zip from this link and then manually unzip it to the `data/` directory.
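Before launching an evaluation, it can be handy to sanity-check that everything ended up in the expected places. The snippet below is a minimal, hypothetical helper (not part of the repo); the paths simply mirror the layout shown above, and the `data/humanoids` directory name is an assumption about where the humanoid data is unpacked:

```python
# Hypothetical helper: verify the dataset layout described above before running an evaluation.
from pathlib import Path

EXPECTED_DIRS = [
    "data/scene_datasets/hm3d/train",
    "data/scene_datasets/hm3d/val",
    "data/scene_datasets/hm3d/minival",
    "data/scene_datasets/mp3d",
    "data/humanoids",  # assumed location of the unpacked humanoid avatar data
]

missing = [d for d in EXPECTED_DIRS if not Path(d).is_dir()]
if missing:
    print("Missing dataset directories:")
    for d in missing:
        print("  -", d)
else:
    print("All expected dataset directories are present.")
```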
Run the baseline evaluation script with:

```bash
bash eval_baseline.sh
```
Results will be saved in the specified SAVE_PATH, which will include a log directory and a video directory. To monitor the results during the evaluation process, run:

```bash
watch -n 1 python analyze_results.py --path YOUR_RESULTS_PATH
```
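If `watch` is not available on your system, a small Python loop can provide the same periodic refresh. This is only a convenience sketch; it simply re-runs the repo's analyze_results.py once per second, and YOUR_RESULTS_PATH is the same placeholder as in the command above:

```python
# Convenience sketch: poll analyze_results.py once per second, similar to `watch -n 1`.
# Stop with Ctrl+C.
import subprocess
import time

RESULTS_PATH = "YOUR_RESULTS_PATH"  # placeholder: the SAVE_PATH used for the evaluation run

while True:
    subprocess.run(["python", "analyze_results.py", "--path", RESULTS_PATH], check=False)
    time.sleep(1)
```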
To stop the evaluation, use:

```bash
bash kill_eval.sh
```
First, you need to clone the original repo of Uni-NaVid:

```bash
git clone https://github.com/jzhzhang/Uni-NaVid
```
Then, link the main code folder of Uni-NaVid into EVT-Bench:

```bash
ln -s /path/to/Uni-NaVid/uninavid /path/to/TrackVLA/uninavid
```
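To confirm the link was created correctly, a quick check such as the following can be used (illustrative only; run it from the TrackVLA repo root):

```python
# Illustrative check: confirm the uninavid symlink inside the TrackVLA repo resolves.
from pathlib import Path

link = Path("uninavid")  # run from the TrackVLA repo root
print("is symlink:", link.is_symlink())
print("target exists:", link.exists())
if link.is_symlink():
    print("resolves to:", link.resolve())
```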
Modify eval_uninavid.sh to set the weight path of Uni-NaVid, and run the script with:

```bash
bash eval_uninavid.sh
```
Results will be saved in the specified SAVE_PATH, which will include a log directory and a video directory. To monitor the results during the evaluation process, run:

```bash
watch -n 1 python analyze_results.py --path YOUR_RESULTS_PATH
```
To stop the evaluation, use:

```bash
bash kill_eval.sh
```
Results of Uni-NaVid on EVT-Bench (SR: success rate, TR: tracking rate, CR: collision rate):
| Evaluation Benchmark | SR | TR | CR |
|---|---|---|---|
| Uni-NaVid EVT-Bench STT | 53.3 | 67.2 | 12.6 |
| Uni-NaVid EVT-Bench DT | 31.9 | 50.1 | 21.3 |
| Uni-NaVid EVT-Bench AT | 15.8 | 41.5 | 26.5 |
- Release the arXiv paper in May 2025.
- Release the EVT-Bench (Embodied Visual Tracking Benchmark).
- Release the checkpoint and inference code of Uni-NaVid.
For any questions, please feel free to email [email protected]. We will respond as soon as possible.
If you find our work helpful, please consider citing it as follows:
@article{wang2025trackvla,
title={Trackvla: Embodied visual tracking in the wild},
author={Wang, Shaoan and Zhang, Jiazhao and Li, Minghan and Liu, Jiahang and Li, Anqi and Wu, Kui and Zhong, Fangwei and Yu, Junzhi and Zhang, Zhizheng and Wang, He},
journal={arXiv preprint arXiv:2505.23189},
year={2025}
}
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
