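Official implementation of the paper [MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing](https://arxiv.org/abs/2503.24219).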
Our framework is designed to retain object detection capabilities while providing users with essential information to simplify query formulation for their object of interest.
We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for the VG task, we fine-tune an open-set object detector on referring expression data, framing this as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. Our task-aware architecture then processes this graph to perform the VG task. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring object localization. On the OPT-RSVG and DIOR-RSVG datasets, our model raises meanIoU over the strongest baseline in the tables below (LPVA) from 66.20 to 73.18 and from 72.35 to 77.73, respectively, while retaining classical OD capabilities.
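The soft selection step mentioned above can be illustrated with a minimal sketch. It assumes that soft selection means a softmax over per-proposal scores followed by a probability-weighted average of the proposal boxes; the mechanism used in the paper may differ, and `soft_select`, `boxes`, and `scores` are hypothetical names that do not come from this repository.

```python
import numpy as np


def soft_select(boxes: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Return the probability-weighted average of proposal boxes.

    boxes:  (N, 4) proposals as (x1, y1, x2, y2).
    scores: (N,) unnormalized scores from a reasoning module (assumed input).
    """
    # Softmax over proposals, shifted by the max for numerical stability.
    shifted = scores - scores.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # Expected box under the proposal distribution.
    return probs @ boxes


# Two overlapping proposals share the probability mass; the third is negligible.
boxes = np.array([[10.0, 10.0, 50.0, 50.0],
                  [12.0, 8.0, 52.0, 48.0],
                  [200.0, 200.0, 240.0, 240.0]])
scores = np.array([4.0, 4.0, -20.0])
print(soft_select(boxes, scores))  # close to the mean of the first two boxes
```

Unlike a hard argmax, a weighted average of this kind stays differentiable with respect to the scores, which is one common reason to prefer soft selection during training.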
* OPT-RSVG
Methods | Venue | Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
---|---|---|---|---|---|---|---|---|---|---|
NMTree | ICCV'19 | ResNet-101 | BiLSTM | 69.28 | 64.17 | 55.22 | 40.31 | 12.90 | 60.12 | 69.85 |
Ref-NMS | AAAI'21 | ResNet-101 | Bi-GRU | 70.59 | 65.61 | 58.01 | 41.36 | 14.58 | 60.42 | 70.72 |
LBYL-Net | CVPR'21 | DarkNet-53 | BERT | 70.22 | 65.39 | 58.65 | 37.54 | 9.46 | 60.57 | 70.28 |
TransVG | CVPR'21 | ResNet-50 | BERT | 69.96 | 64.17 | 54.68 | 38.01 | 12.75 | 59.80 | 69.31 |
VLTVG | CVPR'22 | ResNet-101 | BERT | 73.50 | 68.31 | 59.93 | 43.45 | 15.31 | 62.84 | 73.80 |
MGVLF | TGRS'23 | ResNet-50 | BERT | 72.19 | 66.86 | 58.02 | 42.51 | 15.30 | 61.51 | 71.80 |
LPVA | TGRS'24 | ResNet-50 | BERT | 78.03 | 73.32 | 62.22 | 49.60 | 25.61 | 66.20 | 76.30 |
MB-ORES (Ours) | - | Swin-T | BERT | 83.81 | 81.54 | 76.40 | 63.82 | 36.01 | 73.18 | 79.29 |
* DIOR-RSVG
Methods | Venue | Visual Encoder | Language Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | meanIoU | cumIoU |
---|---|---|---|---|---|---|---|---|---|---|
ReSC | ECCV'20 | DarkNet-53 | BERT | 72.71 | 68.92 | 63.01 | 53.70 | 33.37 | 64.24 | 68.10 |
LBYL-Net | CVPR'21 | DarkNet-53 | BERT | 73.78 | 69.22 | 65.56 | 47.89 | 15.69 | 65.92 | 76.37 |
TransVG | CVPR'21 | ResNet-50 | BERT | 72.41 | 67.38 | 60.05 | 49.10 | 27.84 | 63.56 | 76.27 |
QRNet | CVPR'22 | Swin | BERT | 75.84 | 70.82 | 62.27 | 49.63 | 25.69 | 66.80 | 83.02 |
EarthGPT | TGRS'24 | ViT | Llama-2 | 76.65 | 71.93 | 66.52 | 56.53 | 37.63 | 69.34 | 81.54 |
GeoGround | - | CLIP-ViT | Vicuna 1.5 | 77.73 | - | - | - | - | - | - |
VLTVG | CVPR'22 | ResNet-101 | BERT | 75.79 | 72.22 | 66.33 | 55.17 | 33.11 | 66.32 | 77.85 |
MGVLF | TGRS'23 | ResNet-50 | BERT | 76.78 | 72.68 | 66.74 | 56.42 | 35.07 | 68.04 | 78.41 |
LPVA | TGRS'24 | ResNet-50 | BERT | 82.27 | 77.44 | 72.25 | 60.98 | 39.55 | 72.35 | 85.11 |
MB-ORES (Ours) | - | Swin-T | BERT | 85.65 | 83.89 | 80.87 | 73.00 | 54.39 | 77.73 | 83.06 |
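The metric columns in the two tables above can be computed from paired predicted and ground-truth boxes. The sketch below assumes the definitions commonly used in the remote sensing visual grounding literature: the precision columns give the percentage of samples whose IoU reaches a threshold, meanIoU is the average per-sample IoU, and cumulative IoU (the last column) is the total intersection area divided by the total union area. The function names are hypothetical, and the evaluation code in this repository may differ in details such as whether the threshold comparison is strict.

```python
import numpy as np


def box_intersection_union(pred: np.ndarray, gt: np.ndarray):
    """Intersection and union areas for paired (x1, y1, x2, y2) boxes of shape (N, 4)."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 2], gt[:, 2])
    y2 = np.minimum(pred[:, 3], gt[:, 3])
    # Clip at zero so non-overlapping pairs contribute no intersection.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_pred = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter, area_pred + area_gt - inter


def grounding_metrics(pred, gt, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return precision at each IoU threshold, meanIoU, and cumIoU as percentages.

    Assumes every box has positive area, so no union is zero.
    """
    inter, union = box_intersection_union(pred, gt)
    iou = inter / union
    metrics = {f"Pr@{t}": 100.0 * float(np.mean(iou >= t)) for t in thresholds}
    metrics["meanIoU"] = 100.0 * float(iou.mean())
    metrics["cumIoU"] = 100.0 * float(inter.sum() / union.sum())
    return metrics


# One exact match (IoU = 1) and one half-shifted box (IoU = 1/3).
pred = np.array([[0.0, 0.0, 10.0, 10.0], [0.0, 0.0, 10.0, 10.0]])
gt = np.array([[0.0, 0.0, 10.0, 10.0], [5.0, 0.0, 15.0, 10.0]])
print(grounding_metrics(pred, gt))
```

Because cumulative IoU pools areas before dividing, large objects weigh more heavily in it than in meanIoU, so the two columns can rank methods differently.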
* Ablation Study
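Our multi-branch design substantially improves VG performance: removing the multi-branch network (rows marked ×) lowers meanIoU by roughly 3 to 4 points on DIOR-RSVG and 6 to 7 points on OPT-RSVG.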
# Heads / # Layers | Multi-Branch | Object Reasoner | #Params. | DIOR-RSVG meanIoU | DIOR-RSVG cumIoU | OPT-RSVG meanIoU | OPT-RSVG cumIoU |
---|---|---|---|---|---|---|---|
(h, l/k) | (4,1) | (4,3) | 6.38M | 77.18 | 81.67 | 72.15 | 78.27 |
(h, l/k) | (4,1) | (8,6) | 11.13M | 77.26 | 81.71 | 72.37 | 78.31 |
(h, l/k) | (4,3) | (4,3) | 7.97M | 77.73 | 83.06 | 72.73 | 78.60 |
(h, l/k) | (4,3) | (8,6) | 12.70M | 77.72 | 82.42 | 73.18 | 79.29 |
(h, l/k) | × | (4,3) | 5.13M | 73.50 | 77.94 | 66.04 | 72.54 |
(h, l/k) | × | (8,6) | 9.87M | 73.93 | 78.43 | 66.38 | 73.26 |
- Referring Expression Comprehension (REC)
Visualization of referring objects for multiple referring expression queries per image.
- Object Detection (OD)
- Unification of REC and OD
Simultaneous object detection and referring object localization.
If you find this work useful in your research, please cite:
@article{radouane2025mboresmultibranchobjectreasoner,
title={MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing},
author={Karim Radouane and Hanane Azzag and Mustapha Lebbah},
year={2025},
eprint={2503.24219},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.24219},
}