CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
CombatVLA surpasses GPT-4o and Qwen2.5-VL in combat understanding, runs 50× faster than Cradle and VARP frameworks, and achieves a higher task success rate than human players.
- [2025/11/17] Released the action execution framework.
- [2025/06/26] CombatVLA is accepted to ICCV 2025!
Recent advances in Vision-Language-Action (VLA) models have significantly expanded the capabilities of embodied AI. However, real-time decision-making in complex 3D environments remains extremely challenging — requiring high-resolution perception, tactical reasoning, and sub-second reaction times.
To address these challenges, we introduce CombatVLA, an efficient 3B Vision-Language-Action model tailored for combat tasks in 3D action role-playing games (ARPGs). CombatVLA is trained on large-scale video–action pairs collected using an action tracker, with a compact Action-of-Thought (AoT) training paradigm.
CombatVLA integrates seamlessly into an optimized action execution framework and supports efficient inference through our truncated AoT strategy. Experiments show that CombatVLA:
- Outperforms all existing models in combat understanding
- Runs 50× faster than prior frameworks (Cradle, VARP)
- Surpasses human players in task success rate
git clone https://github.com/ChenVoid/CombatVLA.git
cd CombatVLA

OS: Windows 10/11 (capable of running Black Myth: Wukong)
conda create -n framework python=3.9
conda activate framework
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
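Optionally, you can confirm that the CUDA build of PyTorch installed correctly before proceeding. This is a minimal check, assuming an NVIDIA GPU and the cu126 wheel installed above:

# Optional sanity check for the PyTorch install above.
import torch
print(torch.__version__)          # the cu126 wheel typically reports a "+cu126" suffix
print(torch.cuda.is_available())  # expected to print True on a machine with an NVIDIA GPU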
# Download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_lg

Download VideoSubFinder from https://sourceforge.net/projects/videosubfinder/ and extract the files into the res/tool/subfinder folder. We have already created the folder for you and included a test.srt, which is a required dummy file that will not affect results.
The file structure should be like this:
├── res
├── tool
├── subfinder
├── VideoSubFinderWXW.exe
├── test.srt
├── ...
Then please use res/tool/general.clg to overwrite the res/tool/subfinder/settings/general.cfg file.
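If you prefer to do this step programmatically, a one-line equivalent (assuming you run it from the repository root) is:

# Equivalent to the manual overwrite above; run from the repository root.
import shutil
shutil.copyfile("res/tool/general.clg", "res/tool/subfinder/settings/general.cfg")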
Deploy CombatVLA or your fine-tuned VLM on a cloud server (e.g., with vLLM) and expose an OpenAI-compatible API.
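Once the server is up, you can verify that the endpoint responds before wiring it into the framework. This is a minimal sketch that assumes an OpenAI-compatible server (for example, one started with vLLM) is listening at the placeholder URL below:

# Quick connectivity check against the OpenAI-compatible endpoint (URL and key are placeholders).
from openai import OpenAI
client = OpenAI(base_url="https://<your-server-ip>:8000/v1", api_key="your_api_key")
print([m.id for m in client.models.list()])  # should list the deployed CombatVLA / VLM model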
Edit call_api.py to drive CombatVLA or your fine-tuned VLM:
API_URL="https://<your-server-ip>:8000/v1"
API_KEY="your_api_key"

Then run:

python runner.py
This launches the efficient game control framework powered by CombatVLA.
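For reference, below is a minimal sketch of the kind of request call_api.py issues against the endpoint. The model id, prompt, and screenshot path are placeholders; the actual prompts and action schema used by the framework live in the repository.

# Illustrative only: send one game frame to the OpenAI-compatible endpoint and read the predicted action.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://<your-server-ip>:8000/v1", api_key="your_api_key")

with open("frame.png", "rb") as f:  # placeholder screenshot captured from the game
    frame_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="CombatVLA-3B",  # placeholder model id; use whatever name your server registers
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Given the current combat frame, output the next action."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{frame_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)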
@InProceedings{Chen_2025_ICCV,
author = {Chen, Peng and Bu, Pi and Wang, Yingyao and Wang, Xinyi and Wang, Ziming and Guo, Jie and Zhao, Yingxiu and Zhu, Qi and Song, Jun and Yang, Siran and Wang, Jiamang and Zheng, Bo},
title = {CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {10919-10928}
}

We would like to thank the contributors to Cradle for their valuable open research contributions.

