TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation


Official implementation of the ICRA 2026 paper "TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation".

⚠️ IMPORTANT: Code cleaning and preprocessed data release are in progress. Full release coming soon!

For details, please visit our project page.

*Figure: TagaVLM framework overview.*

Results on R2R (Val Unseen)

| Method | Backbone | NE ↓ | OSR ↑ | SR ↑ | SPL ↑ |
|---|---|---|---|---|---|
| NavCoT | LLaMA2-7B | 6.26 | 48.11 | 40.23 | 36.64 |
| MapGPT | GPT-4V | 5.62 | 57.9 | 47.7 | 38.1 |
| TagaVLM-0.5B (Ours) | Qwen2-0.5B | 5.57 | 55.09 | 45.72 | 41.91 |
| TagaVLM-7B (Ours) | Qwen2-7B | 4.97 | 60.2 | 51.09 | 47.18 |

Installation

```shell
git clone https://github.com/APEX-BJUT/Taga-VLM.git
cd Taga-VLM

conda create -n tagavlm python=3.9 -y
conda activate tagavlm
pip install --upgrade pip
pip install -e ".[train]"
```

Install the patched transformers (required for STAR-Att):

```shell
cd transformers-4.40.0 && pip install -e . && cd ..
```

Additional pinned dependencies: accelerate==0.28.0, numpy<=2.0.
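These pins can be applied in one step (same versions as listed above):

```shell
pip install "accelerate==0.28.0" "numpy<=2.0"
```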

Flash-Attention 2: Download the prebuilt `.whl` matching your CUDA, PyTorch, and Python versions from the Flash-Attention releases page (select an `abiFALSE` variant), then:

```shell
pip install flash_attn-*.whl
```

Matterport3D Simulator: Follow the build instructions in the Matterport3DSimulator repository.

Data Preparation

Download model weights and data from HuggingFace and place them as:

```
Taga-VLM
├── data
│   ├── mp3d_data
│   ├── view_images_bgr_from_mattersim.h5
│   ├── view_images_hm3d
│   ├── view_images_hm3d_pano
│   └── anno
├── model_zoo
│   ├── TagaVLM-qwen2-7b
│   └── TagaVLM-qwen2-0.5b
```
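As a sanity check before training, a small script like the following (a hypothetical helper, not part of the repo) can confirm the layout is in place:

```python
from pathlib import Path

# Paths taken from the layout above; `root` is the repo checkout directory.
EXPECTED = [
    "data/mp3d_data",
    "data/view_images_bgr_from_mattersim.h5",
    "data/view_images_hm3d",
    "data/view_images_hm3d_pano",
    "data/anno",
    "model_zoo/TagaVLM-qwen2-7b",
    "model_zoo/TagaVLM-qwen2-0.5b",
]

def missing_paths(root="."):
    """Return the expected entries that are absent under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All data and model paths found.")
```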

Training & Evaluation

```shell
# Training
bash scripts/train/finetune_TagaVLM.sh
```

```shell
# Evaluation on R2R
cd map_nav_src && bash run_r2r.sh
```

Note: Before evaluation, make sure the dtype is `torch.float16` at lines 325 and 327 of `Taga-VLM/llava/model/llava_arch.py`. For the 0.5B model, add `"vocab_size": 151936` and `"tie_word_embeddings": true` to its `config.json` after training.
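The `config.json` edit for the 0.5B model can be scripted. This is a minimal sketch; the commented path is illustrative (adjust it to your trained checkpoint directory):

```python
import json

def patch_05b_config(config_path):
    """Add the two fields the 0.5B checkpoint needs after training."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["vocab_size"] = 151936         # full Qwen2 vocabulary size
    cfg["tie_word_embeddings"] = True  # share input/output embedding weights
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

# Example (path is an assumption, not a file shipped with the repo):
# patch_05b_config("model_zoo/TagaVLM-qwen2-0.5b/config.json")
```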

Citation

@inproceedings{liu2026tagavlm,
  title     = {TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation},
  author    = {Liu, Jiaxing and Zhang, Zexi and Li, Xiaoyan and Wang, Boyue and Hu, Yongli and Yin, Baocai},
  booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026}
}

Acknowledgement

This project builds upon LLaVA-NeXT and VLN-DUET. We thank the authors for open-sourcing their code.
