[CoRL 2025] Repository for "TrackVLA: Embodied Visual Tracking in the Wild"


TrackVLA: Embodied Visual Tracking in the Wild

Shaoan Wang, Jiazhao Zhang, Minghan Li, Jiahang Liu, Anqi Li,
Kui Wu, Fangwei Zhong, Junzhi Yu, Zhizheng Zhang, He Wang

Peking University · Galbot · Beihang University · Beijing Normal University · Beijing Academy of Artificial Intelligence

Project | arXiv | Video

🏡 About

TrackVLA is a vision-language-action model capable of simultaneous object recognition and visual tracking, trained on a dataset of 1.7 million samples. It demonstrates robust tracking, long-horizon tracking, and cross-domain generalization across diverse challenging environments.


📢 News

  • [25/07/02]: The EVT-Bench is now available.

💡 Installation

  1. Prepare the conda environment

    First, install conda. Once conda is installed, create a new environment:

    conda create -n evt_bench python=3.9 cmake=3.14.0
    conda activate evt_bench
  2. Install habitat-sim via conda

    You need habitat-sim v0.3.1:

    conda install habitat-sim==0.3.1 withbullet -c conda-forge -c aihabitat
    
  3. Clone the repo

    git clone https://github.com/wsakobe/TrackVLA.git
    cd TrackVLA
    
  4. Install habitat-lab

    pip install -e habitat-lab
    
  5. Prepare datasets

    Download the Habitat-Matterport 3D (HM3D) dataset from here and the Matterport3D (MP3D) dataset from here.

    Then move the datasets into data/scene_datasets. The expected directory structure is:

    data/
    └── scene_datasets/
        ├── hm3d/
        │   ├── train/
        │   │   └── ...
        │   ├── val/
        │   │   └── ...
        │   └── minival/
        │       └── ...
        └── mp3d/
            ├── 1LXtFkjw3qL/
            │   └── ...
            └── ...
    

    Next, run the following script to download the humanoid avatar data:

    python download_humanoid_data.py
    

    If the script fails to download the data, please download humanoids.zip from this link and manually unzip it into the data/ directory.
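
As a final sanity check, here is a minimal sketch of getting the scene data into place and verifying the installation. The ~/Downloads/* paths are hypothetical, and symlinking (rather than moving) the datasets is just one option; adjust to wherever you extracted the downloads.

    # hypothetical locations of the extracted HM3D / MP3D downloads
    mkdir -p data/scene_datasets
    ln -s ~/Downloads/hm3d data/scene_datasets/hm3d
    ln -s ~/Downloads/mp3d data/scene_datasets/mp3d

    # verify that habitat-sim imports and the dataset layout is in place
    python -c "import habitat_sim; print('habitat-sim OK')"
    ls data/scene_datasets/hm3d data/scene_datasets/mp3d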

🧪 Evaluation

Run the script with:

bash eval.sh

Results will be saved in the specified SAVE_PATH, which will include a log directory and a video directory. To monitor the results during the evaluation process, run:

watch -n 1 python analyze_results.py --path YOUR_RESULTS_PATH
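
The results directory then looks roughly like this (the exact subdirectory names are an assumption based on the description above):

    YOUR_RESULTS_PATH/
    ├── log/     # per-episode evaluation logs
    └── video/   # recorded tracking videos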

To stop the evaluation, use:

bash kill_eval.sh

📝 TODO List

  • Release the arXiv paper in May 2025.
  • Release the EVT-Bench (Embodied Visual Tracking Benchmark).
  • Release the checkpoint and code of TrackVLA.

✉️ Contact

For any questions, please feel free to email [email protected]. We will respond as soon as possible.

🔗 Citation

If you find our work helpful, please consider citing it as follows:

@article{wang2025trackvla,
  title={TrackVLA: Embodied Visual Tracking in the Wild},
  author={Wang, Shaoan and Zhang, Jiazhao and Li, Minghan and Liu, Jiahang and Li, Anqi and Wu, Kui and Zhong, Fangwei and Yu, Junzhi and Zhang, Zhizheng and Wang, He},
  journal={arXiv preprint arXiv:2505.23189},
  year={2025}
}

📄 License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
