Official implementation of the ICRA 2026 paper "TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation".
⚠️ IMPORTANT: Code cleaning and preprocessed data release are in progress. Full release coming soon!
For details, please visit our project page.
| Method | Backbone | NE ↓ | OSR ↑ | SR ↑ | SPL ↑ |
|---|---|---|---|---|---|
| NavCoT | LLaMA2-7B | 6.26 | 48.11 | 40.23 | 36.64 |
| MapGPT | GPT-4V | 5.62 | 57.90 | 47.70 | 38.10 |
| TagaVLM-0.5B (Ours) | Qwen2-0.5B | 5.57 | 55.09 | 45.72 | 41.91 |
| TagaVLM-7B (Ours) | Qwen2-7B | 4.97 | 60.20 | 51.09 | 47.18 |
```shell
git clone https://github.com/APEX-BJUT/Taga-VLM.git
cd Taga-VLM
conda create -n tagavlm python=3.9 -y
conda activate tagavlm
pip install --upgrade pip
pip install -e ".[train]"
```

Install the patched transformers (required for STAR-Att):

```shell
cd transformers-4.40.0 && pip install -e . && cd ..
```

Additional pinned dependencies: `accelerate==0.28.0`, `numpy<=2.0`.

Flash-Attention 2: download the prebuilt `.whl` for your CUDA/Python version from the Flash-Attention releases page (select the `abiFALSE` variant), then:

```shell
pip install flash_attn-*.whl
```

Matterport3D Simulator: follow the Matterport3DSimulator installation instructions.
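Because the pinned versions above are easy to drift from, a quick sanity check can help. The sketch below is illustrative and not part of the repo; the `check_pins` helper and the way pins are encoded are my own assumptions:

```python
def parse_version(v):
    # Convert "4.40.0" -> (4, 40, 0) so versions compare as tuples;
    # only plain numeric dot-separated versions are handled.
    return tuple(int(part) for part in v.split("."))

def check_pins(installed, pins):
    """Return human-readable mismatches between installed versions and pins.

    installed: dict of package -> version string
    pins: dict of package -> (op, version) with op in {"==", "<="}
    """
    problems = []
    for pkg, (op, want) in pins.items():
        have = installed.get(pkg)
        if have is None:
            problems.append(f"{pkg}: not installed")
        elif op == "==" and parse_version(have) != parse_version(want):
            problems.append(f"{pkg}: {have} != {want}")
        elif op == "<=" and parse_version(have) > parse_version(want):
            problems.append(f"{pkg}: {have} > {want}")
    return problems

# Pins taken from the installation notes above.
PINS = {
    "transformers": ("==", "4.40.0"),
    "accelerate": ("==", "0.28.0"),
    "numpy": ("<=", "2.0"),
}
```

You could feed it versions gathered via `importlib.metadata.version` and fail fast before launching a long training run.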
Download the model weights and data from HuggingFace and place them as follows:

```
Taga-VLM
├── data
│   ├── mp3d_data
│   ├── view_images_bgr_from_mattersim.h5
│   ├── view_images_hm3d
│   ├── view_images_hm3d_pano
│   └── anno
└── model_zoo
    ├── TagaVLM-qwen2-7b
    └── TagaVLM-qwen2-0.5b
```
```shell
# Training
bash scripts/train/finetune_TagaVLM.sh

# Evaluation on R2R
cd map_nav_src && bash run_r2r.sh
```

Note: Make sure the dtype is `torch.float16` on lines 325 and 327 of `Taga-VLM/llava/model/llava_arch.py` before evaluation. For the 0.5B model, add `"vocab_size": 151936` and `"tie_word_embeddings": true` to its `config.json` after training.
```bibtex
@inproceedings{liu2026tagavlm,
  title     = {TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation},
  author    = {Liu, Jiaxing and Zhang, Zexi and Li, Xiaoyan and Wang, Boyue and Hu, Yongli and Yin, Baocai},
  booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026}
}
```

This project builds upon LLaVA-NeXT and VLN-DUET. We thank the authors for open-sourcing their code.
