[CoRL 2025] Repository for "TrackVLA: Embodied Visual Tracking in the Wild"


TrackVLA: Embodied Visual Tracking in the Wild

Shaoan Wang, Jiazhao Zhang, Minghan Li, Jiahang Liu, Anqi Li,
Kui Wu, Fangwei Zhong, Junzhi Yu, Zhizheng Zhang, He Wang

Peking University · Galbot · Beihang University · Beijing Normal University · Beijing Academy of Artificial Intelligence

Project | arXiv | Video

🏡 About

TrackVLA is a vision-language-action model capable of simultaneous object recognition and visual tracking, trained on a dataset of 1.7 million samples. It demonstrates robust tracking, long-horizon tracking, and cross-domain generalization across diverse challenging environments.


📢 News

  • [25/07/02]: The EVT-Bench is now available.

💡 Installation

  1. Prepare the conda environment

    First, install conda. Once conda is installed, create a new environment:

    conda create -n evt_bench python=3.9 cmake=3.14.0
    conda activate evt_bench
  2. Install habitat-sim via conda

    You need habitat-sim v0.3.1:

    conda install habitat-sim==0.3.1 withbullet -c conda-forge -c aihabitat
    
  3. Clone the repo

    git clone https://github.com/wsakobe/TrackVLA.git
    cd TrackVLA
    
  4. Install habitat-lab

    pip install -e habitat-lab
    
  5. Prepare datasets

    Download the Habitat-Matterport 3D (HM3D) dataset from here and the Matterport3D (MP3D) dataset from here.

    Then move the datasets into data/scene_datasets. The expected directory structure is:

    data/
    └── scene_datasets/
        ├── hm3d/
        │   ├── train/
        │   │   └── ...
        │   ├── val/
        │   │   └── ...
        │   └── minival/
        │       └── ...
        └── mp3d/
            ├── 1LXtFkjw3qL/
            │   └── ...
            └── ...
    

    Next, run the following script to download the humanoid avatar data:

    python download_humanoid_data.py
    

    If the script fails to download the data, please download humanoids.zip from this link and manually unzip it into the data/ directory.
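
As a final sanity check, here is a minimal sketch of getting the scene data into place and verifying the installation. The ~/Downloads/* paths are hypothetical, and symlinking (rather than moving) the datasets is just one option; adjust to wherever you extracted the downloads.

    # hypothetical locations of the extracted HM3D / MP3D downloads
    mkdir -p data/scene_datasets
    ln -s ~/Downloads/hm3d data/scene_datasets/hm3d
    ln -s ~/Downloads/mp3d data/scene_datasets/mp3d

    # verify that habitat-sim imports and the dataset layout is in place
    python -c "import habitat_sim; print('habitat-sim OK')"
    ls data/scene_datasets/hm3d data/scene_datasets/mp3d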

🧪 Evaluation

Run the script with:

bash eval.sh

Results will be saved in the specified SAVE_PATH, which will include a log directory and a video directory. To monitor the results during the evaluation process, run:

watch -n 1 python analyze_results.py --path YOUR_RESULTS_PATH
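
The results directory then looks roughly like this (the exact subdirectory names are an assumption based on the description above):

    YOUR_RESULTS_PATH/
    ├── log/     # per-episode evaluation logs
    └── video/   # recorded tracking videos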

To stop the evaluation, use:

bash kill_eval.sh

📝 TODO List

  • Release the arXiv paper in May 2025.
  • Release the EVT-Bench (Embodied Visual Tracking Benchmark).
  • Release the checkpoint and code of TrackVLA.

✉️ Contact

For any questions, please feel free to email [email protected]. We will respond as soon as possible.

🔗 Citation

If you find our work helpful, please consider citing it as follows:

@article{wang2025trackvla,
  title={TrackVLA: Embodied Visual Tracking in the Wild},
  author={Wang, Shaoan and Zhang, Jiazhao and Li, Minghan and Liu, Jiahang and Li, Anqi and Wu, Kui and Zhong, Fangwei and Yu, Junzhi and Zhang, Zhizheng and Wang, He},
  journal={arXiv preprint arXiv:2505.23189},
  year={2025}
}

📄 License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
