# CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation

Official PyTorch implementation of the IEEE TCSVT 2024 paper "CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation".

Mingzhu Xu<sup>1</sup>, Tianxiang Xiao<sup>1</sup>, Yutong Liu<sup>1</sup>, Haoyu Tang<sup>1*</sup>, Yupeng Hu<sup>1</sup>, Liqiang Nie<sup>2</sup>

<sup>1</sup> Shandong University &nbsp; <sup>2</sup> Harbin Institute of Technology (Shenzhen) &nbsp; <sup>*</sup> Corresponding author

- Paper: [IEEE Xplore](https://doi.org/10.1109/TCSVT.2024.3508752)
- Code: [GitHub](https://github.com/iLearn-Lab/CMIRNet)
## Table of Contents

- [Introduction](#introduction)
- [Highlights](#highlights)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Checkpoints / Models](#checkpoints--models)
- [Dataset / Benchmark](#dataset--benchmark)
- [Usage](#usage)
- [Citation](#citation)
- [License](#license)
## Introduction

This project is the official implementation of the paper "CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation".

CMIRNet addresses the challenge of cross-modal alignment in referring image segmentation:

- **Problem addressed:** how to perform deep interaction and logical reasoning between visual features and natural-language descriptions.
- **Core idea:** a Cross-Modal Interactive Reasoning Network (CMIRNet) that leverages graph neural networks (GNNs) to strengthen semantic correlations across modalities (see the sketch after this list).
- **This repository provides:** complete training and testing code supporting both ResNet and Swin Transformer backbones.
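The paper's actual reasoning module is defined by the code in this repository; as a rough illustration of the general idea, the minimal sketch below builds a fully connected bipartite graph between visual-region nodes and word nodes and propagates language messages to the visual side with one attention step. All names (`CrossModalGraphLayer`, dimensions) are hypothetical and not taken from this repo.

```python
import torch
import torch.nn as nn

class CrossModalGraphLayer(nn.Module):
    """Hypothetical sketch: one message-passing step between visual-region
    nodes and word nodes. This is NOT the CMIRNet module, only an
    illustration of cross-modal graph reasoning in general."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from visual nodes
        self.k = nn.Linear(dim, dim)   # keys from word nodes
        self.v = nn.Linear(dim, dim)   # messages carried by word nodes
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, dim) visual region features; txt: (B, L, dim) word features
        attn = torch.softmax(
            self.q(vis) @ self.k(txt).transpose(1, 2) / vis.size(-1) ** 0.5,
            dim=-1,
        )                              # (B, N, L) edge weights of the graph
        messages = attn @ self.v(txt)  # aggregate language messages per region
        return self.norm(vis + messages)  # residual update of visual nodes

# Toy usage with random features
layer = CrossModalGraphLayer(dim=256)
vis = torch.randn(2, 196, 256)  # e.g. a 14x14 feature map, flattened
txt = torch.randn(2, 20, 256)   # e.g. 20 word embeddings
out = layer(vis, txt)           # (2, 196, 256)
```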
## Highlights

- Proposes a cross-modal interactive reasoning mechanism to strengthen alignment between vision and language.
- Supports multiple backbone architectures (ResNet-50/101 and Swin Transformer Base/Large).
- Achieves strong performance on standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg, RefCLEF).
## Project Structure

```
.
├── data/               # Images and annotation data
├── train_resnet.py     # Training script (ResNet backbone)
├── train_swin.py       # Training script (Swin Transformer backbone)
├── test_resnet.py      # Testing script (ResNet backbone)
├── test_swin.py        # Testing script (Swin Transformer backbone)
├── README.md
└── requirements.txt
```
## Installation

```bash
git clone https://github.com/iLearn-Lab/CMIRNet.git
cd CMIRNet
pip install -r requirements.txt
```
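Before launching training, it can help to confirm that PyTorch actually sees your GPU. A quick check, assuming a CUDA build of PyTorch was installed by `requirements.txt`:

```python
import torch

# Sanity-check the environment before training
print(torch.__version__)                  # installed PyTorch version
print(torch.cuda.is_available())          # True if a CUDA device is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the GPU behind cuda:0
```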
## Checkpoints / Models

Please download the pretrained classification weights to initialize the model:

- Download: Baidu Drive (password: `td6n`). A hedged example of loading such weights follows below.
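How the training scripts consume these weights is defined by the code itself; as a generic illustration, loading a classification checkpoint into a backbone usually looks like the following. The path and key handling here are assumptions, not this repo's exact logic.

```python
import torch
import torchvision

# Hypothetical example: initialize a ResNet-50 backbone from downloaded weights.
backbone = torchvision.models.resnet50()
state = torch.load("pretrained/resnet50.pth", map_location="cpu")  # assumed path
# Checkpoints sometimes wrap the weights under a key such as "state_dict".
if isinstance(state, dict) and "state_dict" in state:
    state = state["state_dict"]
missing, unexpected = backbone.load_state_dict(state, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```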
## Dataset / Benchmark

- Images: download the 2014 Train images from COCO and extract them to `./data/images/` (a quick layout check follows this list).
- Referring expressions: download RefCOCO, RefCOCO+, RefCOCOg, and RefCLEF from the official site.
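A quick way to confirm the data landed where the scripts expect it; the exact layout under `./data/` is an assumption based on the project structure above:

```python
from pathlib import Path

# Hypothetical sanity check: verify the expected data layout before training.
images = Path("data") / "images"
print(f"{images} exists: {images.is_dir()}")
if images.is_dir():
    n = sum(1 for _ in images.glob("**/*.jpg"))
    print(f"found {n} JPEG images")  # COCO train2014 contains 82,783 images
```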
## Usage

### Training

```bash
# ResNet
python train_resnet.py --model_id cmirnet_refcoco_res --device cuda:0
python train_resnet.py --model_id cmirnet_refcocop_res --device cuda:0 --dataset refcoco+
python train_resnet.py --model_id cmirnet_refcocog_res --device cuda:0 --dataset refcocog --splitBy umd

# Swin
python train_swin.py --model_id cmirnet_refcoco_swin --device cuda:0
python train_swin.py --model_id cmirnet_refcocop_swin --device cuda:0 --dataset refcoco+
python train_swin.py --model_id cmirnet_refcocog_swin --device cuda:0 --dataset refcocog --splitBy umd
```
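If you want to launch the three ResNet runs back-to-back, a small driver script can shell out to the commands above. The flags are exactly those listed; everything else here is just convenience scaffolding:

```python
import subprocess

# Convenience wrapper around the ResNet training commands listed above.
runs = [
    ["--model_id", "cmirnet_refcoco_res"],
    ["--model_id", "cmirnet_refcocop_res", "--dataset", "refcoco+"],
    ["--model_id", "cmirnet_refcocog_res", "--dataset", "refcocog", "--splitBy", "umd"],
]
for extra in runs:
    cmd = ["python", "train_resnet.py", "--device", "cuda:0", *extra]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop the loop if any run fails
```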
### Testing

Please ensure that the `--resume` argument points to the correct checkpoint path:

```bash
# ResNet
python test_resnet.py --device cuda:0 --resume path/to/weights

# Swin (note: include --window12)
python test_swin.py --device cuda:0 --resume path/to/weights --window12
```
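Referring-segmentation benchmarks are conventionally reported with overall IoU (oIoU) and mean IoU (mIoU). The snippet below shows the standard computation on binary masks; it is the conventional metric definition, not code taken from this repository's test scripts:

```python
import torch

def iou_stats(pred: torch.Tensor, gt: torch.Tensor):
    """pred, gt: boolean masks of shape (N, H, W) for N test samples."""
    inter = (pred & gt).flatten(1).sum(dim=1).float()  # per-sample intersection
    union = (pred | gt).flatten(1).sum(dim=1).float()  # per-sample union
    per_sample = inter / union.clamp(min=1)            # IoU of each sample
    overall = inter.sum() / union.sum().clamp(min=1)   # oIoU: pixel-weighted
    return overall.item(), per_sample.mean().item()    # (oIoU, mIoU)

pred = torch.rand(4, 64, 64) > 0.5  # stand-in predictions
gt = torch.rand(4, 64, 64) > 0.5    # stand-in ground truth
oiou, miou = iou_stats(pred, gt)
print(f"oIoU={oiou:.4f}, mIoU={miou:.4f}")
```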
## Citation

If you use this code or method in your research, please cite our paper:

```bibtex
@ARTICLE{CMIRNet2025TCSVT,
  author={Xu, Mingzhu and Xiao, Tianxiang and Liu, Yutong and Tang, Haoyu and Hu, Yupeng and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation},
  year={2025},
  volume={35},
  number={4},
  pages={3234-3249},
  doi={10.1109/TCSVT.2024.3508752}
}
```
## License

This project is released under the Apache License 2.0.
