EGM: Efficient Visual Grounding Language Models

This repository releases the official implementation of EGM: Efficient Visual Grounding Language Models.

Abstract

Visual grounding is an essential capability that allows Visual Language Models (VLMs) to understand the real physical world. Previous state-of-the-art grounding VLMs usually have large model sizes, making them heavy to deploy and slow at inference. However, we notice that the sizes of visual encoders are nearly the same for small and large VLMs; the major difference lies in the sizes of the language models. Small VLMs fall behind larger VLMs in grounding because of the gap in language understanding capability rather than in visual information handling. To close this gap, we introduce Efficient visual Grounding language Models (EGM): generating many mid-quality tokens from a small model to match the performance of a large VLM that produces few high-quality but expensive tokens. This method is deployment-friendly and yields better end-to-end latency: on the RefCOCO benchmark, our EGM-Qwen3-VL-8B reaches 91.4 IoU at an average latency of 737 ms (5.9x faster), while Qwen3-VL-235B demands 4,320 ms to reach 90.5 IoU. To validate the generality of our approach, we further set up a new amodal grounding setting that requires the model to predict both the visible and occluded parts of objects. Experiments show our method consistently improves both the vanilla and amodal grounding capabilities of small models to match or outperform larger models, thereby improving the efficiency of visual grounding.

See our website for more details: https://nvlabs.github.io/EGM


Motivation of the EGM method

Previous state-of-the-art grounding VLMs usually have large model sizes, making them heavy for deployment and slow for inference. In contrast, our EGM scales test-time inference instead; by outputting more tokens with a smaller model to bridge the understanding gap, we achieve on-par performance with significantly better efficiency.

Performance of the EGM method

Our EGM models (2B/4B/8B) greatly improve the efficiency of visual grounding. For example, EGM-Qwen3-VL-8B outperforms both the state-of-the-art Qwen3-VL-235B-Instruct and Qwen3-VL-235B-Thinking models in accuracy, while being 5.9× and 18.9× faster in GPU latency, respectively.

The table below presents detailed results for EGM-Qwen3-VL-4B and EGM-Qwen3-VL-8B, comparing them against the baseline models as well as the Qwen3-VL-235B Instruct and Thinking variants:

We provide RL model checkpoints for both EGM-Qwen3-VL-4B-v1 and EGM-Qwen3-VL-8B-v1. You may download these checkpoints and evaluate them following the instructions in the Installation and Evaluation sections below.

Installation

Run the following commands to create the environment:

git clone https://github.com/NVlabs/EGM.git
cd EGM
conda create -n EGM python=3.11.13
conda activate EGM
pip install -r requirement.txt

Note: If you also need to install the CUDA toolkit on your machine, run the following commands instead to create the environment:

git clone https://github.com/NVlabs/EGM.git
cd EGM
conda create -n EGM -y -c nvidia/label/cuda-12.9.0 -c nvidia -c conda-forge python=3.11.13 cuda-toolkit=12.9
conda activate EGM
pip install -r requirement.txt

Evaluation

You can directly download our model and test datasets for evaluation with the commands below:

# Models
hf download nvidia/EGM-8B --local-dir ./models/EGM-8B
hf download nvidia/EGM-4B --local-dir ./models/EGM-4B

# Datasets
hf download Antoinegg1/EGM_Datasets --local-dir ./data/EGM_Datasets --repo-type dataset
cat ./data/EGM_Datasets/coco.tar.part_* > ./data/EGM_Datasets/coco.tar
tar -xvf ./data/EGM_Datasets/coco.tar -C ./data/
tar -xvf ./data/EGM_Datasets/coco_flip.tar -C ./data/
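
The cat step above works because the COCO archive is shipped as fixed-size byte chunks; concatenating the parts in lexicographic order restores the original tar byte-for-byte. A minimal, self-contained sketch of this split-and-reassemble pattern (using temporary demo files, not the real dataset):

```shell
# Demo of the split-archive pattern: split a tar into parts,
# reassemble with cat, and verify the bytes match the original.
tmp=$(mktemp -d)
echo "hello" > "$tmp/sample.txt"
tar -cf "$tmp/demo.tar" -C "$tmp" sample.txt
split -b 1024 "$tmp/demo.tar" "$tmp/demo.tar.part_"
cat "$tmp"/demo.tar.part_* > "$tmp/demo_joined.tar"
cmp "$tmp/demo.tar" "$tmp/demo_joined.tar" && echo "parts reassembled correctly"
```

Because the parts carry sortable suffixes, the shell glob expands them in the correct order, so a plain cat is all that is needed before extraction.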

To evaluate the model, install sglang with pip install sglang==0.5.5 and use the command provided below.

Note: The RefCOCO benchmark consists of eight distinct JSONL files, so you must run the evaluation script once for each file to obtain the complete benchmark results.

The following example demonstrates evaluation using refcoco+_testA.jsonl:

# In the EGM folder
export BASE_DIR=$(pwd) 
export MODEL_PATH="${BASE_DIR}/models/EGM-8B"
export DATA_JSON="${BASE_DIR}/data/EGM_Datasets/metadata/eval/refcoco+_testA.jsonl"
export OUTPUT_DIR="${BASE_DIR}/result/"
export BASE_IMG_DIR="${BASE_DIR}"

cd verl
bash scripts/sglang_infer.sh
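
Since the eight evaluation runs only differ in DATA_JSON, they can be driven by a loop. The split names below are assumed from the standard RefCOCO/RefCOCO+/RefCOCOg splits and the loop is written as a dry run; drop the echo to actually launch each evaluation:

```shell
# Dry run: print the command for each of the eight assumed RefCOCO eval splits.
# Remove the `echo` (keeping the rest of the line) to launch the runs for real.
BASE_DIR=$(pwd)
SPLITS="refcoco_val refcoco_testA refcoco_testB \
refcoco+_val refcoco+_testA refcoco+_testB \
refcocog_val refcocog_test"
for split in $SPLITS; do
  DATA_JSON="${BASE_DIR}/data/EGM_Datasets/metadata/eval/${split}.jsonl"
  echo "DATA_JSON=${DATA_JSON} bash scripts/sglang_infer.sh"
done
```

Check the file names under data/EGM_Datasets/metadata/eval/ against this list before running, as the repository's naming may differ.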

We also support evaluation with vLLM:

# In the EGM folder
export BASE_DIR=$(pwd) 
export MODEL_PATH="${BASE_DIR}/models/EGM-8B"
export DATA_JSON="${BASE_DIR}/data/EGM_Datasets/metadata/eval/refcoco+_testA.jsonl"
export OUTPUT_DIR="${BASE_DIR}/result/"
export BASE_IMG_DIR="${BASE_DIR}"

cd verl
bash scripts/vllm_infer.sh

Dataset and Models

You can download the base model and our datasets from HuggingFace using the commands below:

pip install -U huggingface_hub
hf download Qwen/Qwen3-VL-8B-Thinking --local-dir ./models/Qwen3-VL-8B-Thinking

# If you have already downloaded and processed this data in the previous Evaluation section, you can skip the commands below.
hf download Antoinegg1/EGM_Datasets --local-dir ./data/EGM_Datasets --repo-type dataset
cat ./data/EGM_Datasets/coco.tar.part_* > ./data/EGM_Datasets/coco.tar
tar -xvf ./data/EGM_Datasets/coco.tar -C ./data/
tar -xvf ./data/EGM_Datasets/coco_flip.tar -C ./data/

If you wish to directly use our pre-trained EGM-Qwen3-VL-8B-SFT model for development, use the following command to download it. Afterward, please refer to the RL Training section for training instructions.

hf download nvidia/EGM-8B-SFT --local-dir ./models/EGM-8B-SFT

Note: If you cannot connect to Hugging Face, try using the HF Mirror:

export HF_ENDPOINT=https://hf-mirror.com

SFT Training

To train the EGM-Qwen3-VL-8B-SFT model, execute the following commands within the /EGM directory:

export BASE_DIR=$(pwd)
export REFCOCO_ANNOTATION_PATH="${BASE_DIR}/data/EGM_Datasets/vanilla_grounding_reasoning_training_dataset_cot_subset.jsonl"
export REFCOCO_DATA_PATH="${BASE_DIR}/data/"
export OUTPUT_DIR="${BASE_DIR}/models/EGM-8B-SFT"
export WANDB_BASE_URL=${YOUR_WANDB_BASE_URL}   
export WANDB_API_KEY=${YOUR_WANDB_API_KEY} 
export WANDB_PROJECT="EGM-SFT"

cd sft/qwen-vl-finetune
bash scripts/sft_qwen3_8b_grounding.sh

The finetuned model is saved in ${BASE_DIR}/models/EGM-8B-SFT. In this repository, we provide a 1k subset, vanilla_grounding_reasoning_training_dataset_cot_subset.jsonl, solely as an example for SFT. For further details on full SFT data construction, please refer to sft/README.md.

RL Training

Reinforcement Learning is conducted based on the SFT checkpoint. The default configuration utilizes 8 GPUs. You may customize the distributed training settings via the trainer.nnodes and trainer.n_gpus_per_node arguments.
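
For example, assuming scripts/grounding_qwen.sh forwards extra arguments to the verl trainer as key=value overrides (check the script before relying on this), a 2-node x 8-GPU run could be requested like this:

```shell
# Hypothetical invocation: override the distributed layout with
# verl-style key=value arguments appended to the launch command.
cd verl
bash scripts/grounding_qwen.sh trainer.nnodes=2 trainer.n_gpus_per_node=8
```

If the script hard-codes these values instead, edit them directly in scripts/grounding_qwen.sh.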

We provide the necessary training and validation sets in Parquet format for EGM-Qwen3-VL-8B-v1. Please use the following code to replace the relative image paths within the Parquet files:

cd ../../
# Now at the `EGM` folder 

export BASE_DIR=$(pwd)

# prepare the train data
python ./verl/scripts/replace_img_dir.py \
  --parquet_path ./data/EGM_Datasets/processed_rl_data/train_grounding.parquet  \
  --base_img_root ${BASE_DIR}/data/

# prepare the val data
python ./verl/scripts/replace_img_dir.py \
  --parquet_path ./data/EGM_Datasets/processed_rl_data/val_grounding.parquet  \
  --base_img_root ${BASE_DIR}/data/

Alternatively, you may construct the datasets manually using verl/examples/data_preprocess/grounding_all.sh and verl/examples/data_preprocess/grounding_val.sh.

To initiate training, execute the script below from within the /EGM directory:

# Configure Weights & Biases (W&B) for experiment tracking
export WANDB_BASE_URL=${YOUR_WANDB_BASE_URL}   
export WANDB_API_KEY=${YOUR_WANDB_API_KEY} 


export BASE_DIR=$(pwd)
export MODEL_PATH="${BASE_DIR}/models/EGM-8B-SFT"
export OUTPUT_DIR="${BASE_DIR}/checkpoint/"
export DATA_DIR="${BASE_DIR}/data/EGM_Datasets/processed_rl_data/"

cd verl
bash scripts/grounding_qwen.sh

ray stop -f

Citation

@article{zhan2026EGM,
    author = {Zhan, Guanqi and Li, Changye and Liu, Zhijian and Lu, Yao and Wu, Yi and Han, Song and Zhu, Ligeng},
    title = {EGM: Efficient Visual Grounding Language Models},
    journal = {arXiv preprint},
    year = {2026}
}

Acknowledgment

This repository benefits from Qwen3VL, InternVL, verl, and verl-internvl.

Thanks for their wonderful work and their efforts to further promote LLM research.

