A curated collection of papers, datasets, and resources for Composed Image Retrieval (CIR), serving as the companion repository for our survey:
📄 A Comprehensive Survey on Composed Image Retrieval
Xuemeng Song, Haoqiang Lin, Haokun Wen, Bohan Hou, Mingzhu Xu, Liqiang Nie
ACM Transactions on Information Systems (TOIS), Vol. 44, No. 1, Article 19, 2025
- Systematic analysis of 120+ seminal papers (2017–2025)
- Unified evaluation spanning ~10 benchmark datasets
- Continuously updated — PRs and issues are welcome!
- 1. Attribute-based CIR
- 2. Supervised CIR
- 3. Few-shot CIR
- 4. Zero-shot CIR
- 5. Semi-supervised CIR
- 6. Conversational CIR
- 7. Composed Video Retrieval (CoVR)
- 8. Sketch-based CIR
- 9. Others
- 10. Dataset statistics
- [1] [CVPR'23] | FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-training. [Paper]
- [1] [ICCV'21] | Learning Attribute-driven Disentangled Representations for Interactive Fashion Retrieval. [Paper]
- [2] [ICCV'21] | Face Image Retrieval with Attribute Manipulation. [Paper]
- [1] [SIGIR'20] | Generative Attribute Manipulation Scheme for Flexible Fashion Search. [Paper]
- [1] [CVPR'18] | Learning Attribute Representations with Localization for Flexible Fashion Search. [Paper]
- [2] [WACV'18] | Efficient Multi-Attribute Similarity Learning Towards Attribute-based Fashion Search. [Paper]
- [1] [CVPR'17] | Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search. [Paper]
- [1] [Arxiv'25] | DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval. [Paper]
- [2] [Arxiv'25] | TMCIR: Token Merge Benefits Composed Image Retrieval. [Paper]
- [3] [Arxiv'25] | FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval. [Paper]
- [4] [Arxiv'24] | VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval. [Paper]
- [5] [Arxiv'23] | Relieving Triplet Ambiguity: Consensus Network for Language-Guided Image Retrieval. [Paper]
- [6] [Arxiv'23] | Ranking-aware Uncertainty for Text-guided Image Retrieval. [Paper]
- [7] [Arxiv'23] | Learning with Multi-modal Gradient Attention for Explainable Composed Image Retrieval. [Paper]
- [1] [TMM'26] | Joint Attribute Graph Reasoning and Aggregation for Composed Image Retrieval. [Paper]
- [2] [AAAI'26] | INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval. [Paper]
- [3] [ACL'26 Findings] | CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval. [Paper]
- [1] [AAAI'25] | Leveraging Large Vision-Language Model as User Intent-aware Encoder for Composed Image Retrieval. [Paper]
- [2] [AAAI'25] | ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval. [Paper]
- [3] [CVPR'25] | CCIN: Compositional Conflict Identification and Neutralization for Composed Image Retrieval. [Paper]
- [4] [CVPR'25] | ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval. [Paper]
- [5] [CVPR'25] | CoLLM: A Large Language Model for Composed Image Retrieval. [Paper]
- [6] [CVPR'25] | Learning with Noisy Triplet Correspondence for Composed Image Retrieval.
- [7] [ICASSP'25] | MEDIAN: Adaptive Intermediate-grained Aggregation Network for Composed Image Retrieval. [Paper]
- [8] [ICASSP'25] | PAIR: Complementarity-guided Disentanglement for Composed Image Retrieval. [Paper]
- [9] [ICASSP'25] | NCL-CIR: Noise-aware Contrastive Learning for Composed Image Retrieval. [Paper]
- [10] [ICCV'25] | Multi-Schema Proximity Network for Composed Image Retrieval. [Paper]
- [11] [ICCV'25] | MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval. [Paper]
- [12] [ACL'25] | Modeling Uncertainty in Composed Image Retrieval via Probabilistic Embeddings. [Paper]
- [13] [ACM MM'25] | OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval.
- [14] [NeurIPS'25] | Towards Robust Uncertainty Calibration for Composed Image Retrieval. [Paper]
- [15] [TMM'25] | Scale Up Composed Image Retrieval Learning via Modification Text Generation. [Paper]
- [16] [AAAI'25] | VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering. [Paper]
- [17] [ICML'25] | QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval. [Paper]
- [18] [IJCNN'25] | HADFF: Composed Image Retrieval Based on Hybrid-Attention and Dynamic Feature Fusion. [Paper]
- [1] [WACV'24] | Bi-directional Training for Composed Image Retrieval via Text Prompt Learning. [Paper]
- [2] [TOMM'24] | Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval. [Paper]
- [3] [TOMM'24] | SPIRIT: Style-guided Patch Interaction for Fashion Image Retrieval with Text Feedback. [Paper]
- [4] [TPAMI'24] | Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval. [Paper]
- [5] [AAAI'24] | Dynamic Weighted Combiner for Mixed-Modal Image Retrieval. [Paper]
- [6] [AAAI'24] | Data Roaming and Quality Assessment for Composed Image Retrieval. [Paper]
- [7] [AAAI'24] | FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval. [Paper]
- [8] [AAAI'24] | Decomposing Semantic Shifts for Composed Image Retrieval. [Paper]
- [9] [SIGIR'24] | Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. [Paper]
- [10] [SIGIR'24] | CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval. [Paper]
- [11] [CVPR'24] | SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining. [Paper]
- [12] [ICLR'24] | Sentence-level Prompts Benefit Composed Image Retrieval. [Paper]
- [13] [ICLR'24] | Composed Image Retrieval with Text Feedback via Multi-Grained Uncertainty Regularization. [Paper]
- [14] [TMLR'24] | Candidate Set Re-ranking for Composed Image Retrieval with Dual Multimodal Encoder. [Paper]
- [15] [TCSVT'24] | Set of Diverse Queries with Uncertainty Regularization for Composed Image Retrieval. [Paper]
- [16] [ICMR'24] | CLIP-ProbCR: CLIP-based Probability embedding Combination Retrieval. [Paper]
- [17] [TMM'24] | Align and Retrieve: Composition and Decomposition Learning in Image Retrieval with Text Feedback. [Paper]
- [18] [KBS'24] | Collaborative Group: Composed Image Retrieval via Consensus Learning From Noisy Annotations. [Paper]
- [19] [TIP'24] | Multimodal Composition Example Mining for Composed Query Image Retrieval. [Paper]
- [20] [TOIS'24] | LLM-enhanced Composed Image Retrieval: An Intent Uncertainty-aware Linguistic-Visual Dual Channel Matching Model. [Paper]
- [21] [ACM MM'24] | Semantic Distillation from Neighborhood for Composed Image Retrieval. [Paper]
- [22] [ACM MM'24] | Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives. [Paper]
- [23] [NeurIPS'24] | Easy Regional Contrastive Learning of Expressive Fashion Representations. [Paper]
- [24] [ACL'24] | UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation. [Paper]
- [1] [TOMM'23] | AMC: Adaptive Multi-expert Collaborative Network for Text-guided Image Retrieval. [Paper]
- [2] [TMM'23] | Multi-Modal Transformer With Global-Local Alignment for Composed Query Image Retrieval. [Paper]
- [3] [WACV'23] | Fashion Image Retrieval with Text Feedback by Additive Attention Compositional Learning. [Paper]
- [4] [ICMR'23] | Dual-Path Semantic Construction Network for Composed Query-Based Image Retrieval. [Paper]
- [5] [TCSVT'23] | Multi-Grained Attention Network With Mutual Exclusion for Composed Query-Based Image Retrieval. [Paper]
- [6] [TOMM'23] | Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features. [Paper]
- [7] [ICME'23] | Visual-Linguistic Alignment and Composition for Image Retrieval with Text Feedback. [Paper]
- [8] [TIP'23] | Composed Image Retrieval via Cross Relation Network With Hierarchical Aggregation Transformer. [Paper]
- [9] [ACM MM'23] | Target-Guided Composed Image Retrieval. [Paper]
- [10] [CVPR'23] | FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks. [Paper]
- [11] [ICCVW'23] | ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion. [Paper]
- [12] [NeurIPSW'23] | NEUCORE: Neural Concept Reasoning for Composed Image Retrieval. [Paper]
- [13] [NeurIPSW'23] | Benchmarking Robustness of Text-Image Composed Retrieval. [Paper]
- [14] [MMW'23] | Fashion-GPT: Integrating LLMs with Fashion Retrieval System. [Paper]
- [1] [ICLR'22] | ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity. [Paper][Arxiv]
- [2] [TOMM'22] | Tell, Imagine, and Search: End-to-end Learning for Composing Text and Image to Image Retrieval. [Paper]
- [3] [TIP'22] | Geometry Sensitive Cross-Modal Reasoning for Composed Query Based Image Retrieval. [Paper]
- [4] [TIP'22] | Composed Image Retrieval via Explicit Erasure and Replenishment With Semantic Alignment. [Paper]
- [5] [WACV'22] | SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval. [Paper]
- [6] [SIGIR'22] | Progressive Learning for Image Retrieval with Hybrid-Modality Queries. [Paper]
- [7] [CVPR'22] | Effective Conditioned and Composed Image Retrieval Combining CLIP-based Features. [Paper]
- [8] [CVPR'22] | FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback. [Paper]
- [9] [TMM'22] | Enhance Composed Image Retrieval via Multi-Level Collaborative Localization and Semantic Activeness Perception. [Paper]
- [10] [TMM'22] | Adversarial and Isotropic Gradient Augmentation for Image Retrieval With Text Feedback. [Paper]
- [11] [EMNLP'22] | FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning. [Paper]
- [12] [ECCV'22] | FashionViL: Fashion-Focused Vision-and-Language Representation Learning. [Paper]
- [1] [ICCV'21] | Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models. [Paper][Arxiv]
- [2] [CVPR'21] | CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback. [Paper]
- [3] [WACV'21] | Compositional Learning of Image-Text Query for Image Retrieval. [Paper]
- [4] [ACM MM'21] | Heterogeneous Feature Fusion and Cross-modal Alignment for Composed Image Retrieval. [Paper]
- [5] [ACM MM'21] | Cross-modal Joint Prediction and Alignment for Composed Query Image Retrieval. [Paper]
- [6] [ACM MM'21] | Image Retrieval with Text Feedback by Deep Hierarchical Attention Mutual Information Maximization. [Paper]
- [7] [AAAI'21] | Dual Compositional Learning in Interactive Image Retrieval. [Paper]
- [8] [SIGIR'21] | Comprehensive Linguistic-Visual Composition Network for Image Retrieval. [Paper]
- [1] [CVPR'20] | Image Search With Text Feedback by Visiolinguistic Attention Learning. [Paper]
- [2] [ACM MM'20] | Joint Attribute Manipulation and Modality Alignment Learning for Composing Text and Image to Image Retrieval. [Paper]
- [3] [ECCV'20] | Learning Joint Visual Semantic Matching Embeddings for Language-Guided Retrieval. [Paper]
- [4] [CVPR'20] | Composed Query Image Retrieval Using Locally Bounded Features. [Paper]
- [1] [Arxiv'24] | Pseudo Triplet Guided Few-shot Composed Image Retrieval. [Paper]
- [1] [AAAI'23] | Few-Shot Composition Learning for Image Retrieval with Prompt Tuning. [Paper]
- [1] [Arxiv'25] | MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval. [Paper]
- [2] [Arxiv'25] | Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval. [Paper]
- [3] [Arxiv'25] | From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval. [Paper]
- [4] [Arxiv'25] | Data-Efficient Generalization for Zero-shot Composed Image Retrieval. [Paper]
- [5] [Arxiv'25] | SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval. [Paper]
- [6] [Arxiv'24] | MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval. [Paper]
- [7] [Arxiv'24] | Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs. [Paper]
- [8] [Arxiv'24] | Training-free Zero-shot Composed Image Retrieval with Local Concept Re-ranking. [Paper]
- [9] [Arxiv'24] | HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels. [Paper]
- [10] [Arxiv'24] | Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity. [Paper]
- [11] [Arxiv'24] | Composed Image Retrieval for Training-Free Domain Conversion. [Paper]
- [12] [Arxiv'24] | Compositional Image Retrieval via Instruction-Aware Contrastive Learning. [Paper]
- [13] [Arxiv'24] | MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval. [Paper]
- [14] [Arxiv'24] | Denoise-I2W: Mapping Images to Denoising Words for Accurate Zero-Shot Composed Image Retrieval. [Paper]
- [15] [Arxiv'23] | Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval. [Paper]
- [1] [AAAI'26] | Modality and Task Adaptation for Enhanced Zero-shot Composed Image Retrieval. [Paper]
- [2] [AAAI'26] | Duplex Rewards Optimization for Test-Time Composed Image Retrieval. [Paper]
- [3] [WACV'26] | PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval. [Paper]
- [1] [COLING'25] | MLLM-I2W: Harnessing Multimodal Large Language Model for Zero-Shot Composed Image Retrieval. [Paper]
- [2] [CVPR'25] | Generative Zero-Shot Composed Image Retrieval. [Paper]
- [3] [CVPR'25] | Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval. [Paper]
- [4] [CVPR'25] | Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval. [Paper]
- [5] [CVPR'25] | Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy. [Paper]
- [6] [WACV'25] | Composed Image Retrieval for Training-Free Domain Conversion. [Paper]
- [7] [SIGIR'25] | Rethinking Pseudo Word Learning in Zero-Shot Composed Image Retrieval: From an Object-Aware Perspective. [Paper]
- [8] [ICCV'25] | CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval. [Paper]
- [9] [ICCV'25] | Hierarchy-Aware Pseudo Word Learning with Text Adaptation for Zero-Shot Composed Image Retrieval. [Paper]
- [10] [ICCV'25] | An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval. [Paper]
- [11] [ICCV'25] | Zero-Shot Composed Image Retrieval via Dual-Stream Instruction-Aware Distillation. [Paper]
- [12] [KDD'25] | Generative Thinking, Corrective Action: User-Friendly Composed Image Retrieval via Automatic Multi-Agent Collaboration. [Paper]
- [13] [NeurIPS'25] | Instance-Level Composed Image Retrieval. [Paper]
- [14] [TPAMI'25] | iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval. [Paper]
- [15] [IJCNN'25] | Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data. [Paper]
- [1] [AAAI'24] | Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval. [Paper]
- [2] [ICLR'24] | Vision-by-Language for Training-Free Compositional Image Retrieval. [Paper]
- [3] [CVPR'24] | LinCIR: Language-only Training of Zero-shot Composed Image Retrieval. [Paper]
- [4] [CVPR'24] | Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval. [Paper]
- [5] [SIGIR'24] | Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval. [Paper]
- [6] [SIGIR'24] | LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval. [Paper]
- [7] [ICML'24] | Improve Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning. [Paper]
- [8] [ICML'24] | MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions. [Paper]
- [9] [TMLR'24] | CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion. [Paper]
- [10] [ACM MM'24] | Semantic Editing Increment Benefits Zero-Shot Composed Image Retrieval. [Paper]
- [11] [ECCV'24] | Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval. [Paper]
- [12] [ACML'24] | Prompting Vision-Language Fusion for Zero-Shot Composed Image Retrieval. [Paper]
- [1] [ICCV'23] | Zero-shot Composed Image Retrieval with Textual Inversion. [Paper]
- [2] [CVPR'23] | Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval. [Paper]
- [3] [BMVC'23] | Zero-shot Composed Text-Image Retrieval. [Paper]
- [1] [CVPR'24] | Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval. [Paper]
- [1] [Arxiv'24] | Leveraging Large Language Models for Multimodal Search. [Paper]
- [1] [ICLR'25] | MAI: A Multi-Turn Aggregation-Iteration Model for Composed Image Retrieval. [Paper]
- [2] [WWW'25] | ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning. [Paper] [Code]
- [1] [ICCV'23] | FashionNTM: Multi-turn Fashion Image Retrieval via Cascaded Memory. [Paper]
- [2] [ACM MM'23] | Conversational Composed Retrieval with Iterative Sequence Refinement. [Paper]
- [3] [MMW'23] | Fashion-GPT: Integrating LLMs with Fashion Retrieval System. [Paper]
- [1] [SIGIR'21] | Conversational Fashion Image Retrieval via Multiturn Natural Language Feedback. [Paper]
- [1] [NeurIPS'18] | Dialog-based Interactive Image Retrieval. [Paper]
- [1] [AAAI'26] | ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval. [Paper]
- [2] [ICLR'26] | OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text. [Paper]
- [1] [CVPR'25] | Localizing Events in Videos with Multimodal Queries. [Paper]
- [2] [ICLR'25] | Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval. [Paper]
- [3] [ACM MM'25] | HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval.
- [4] [ICCV'25] | Beyond Simple Edits: Composed Video Retrieval with Dense Modifications.
- [5] [NeurIPS'25] | From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos. [Paper]
- [1] [AAAI'24] | CoVR: Learning Composed Video Retrieval from Web Video Captions. [Paper]
- [2] [CVPR'24] | Composed Video Retrieval via Enriched Context and Discriminative Embeddings. [Paper]
- [3] [TPAMI'24] | CoVR-2: Automatic Data Construction for Composed Video Retrieval. [Paper]
- [4] [ECCV'24] | EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval. [Paper]
- [1] [AAAI'24] | Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions. [Paper]
- [2] [CVPR'24] | You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval. [Paper]
- [1] [NeurIPS'25] | Automatic Synthetic Data and Fine-grained Adaptive Feature Alignment for Composed Person Retrieval. [Paper]
- [2] [Arxiv'24] | Word4Per: Zero-shot Composed Person Retrieval. [Paper]
- [1] [ICCV'25] | AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs. [Paper]
- [1] [IJCAI'25] | PatternCIR Benchmark and TisCIR: Advancing Zero-Shot Composed Image Retrieval in Remote Sensing. [Paper]
- [2] [TGRS'25] | Language-Empowered Conversion for Remote Sensing Image Retrieval With Text Feedback. [Paper]
- [3] [IGARSS'24] | Composed Image Retrieval for Remote Sensing. [Paper]
- [4] [TGRS'24] | Scene Graph-Aware Hierarchical Fusion Network for Remote Sensing Image Retrieval With Text Feedback. [Paper]
- 🎯 [TOIS'25] | A Comprehensive Survey on Composed Image Retrieval. [Paper]
- [2] [Arxiv'25] | Composed Multi-modal Retrieval: A Survey of Approaches and Applications. [Paper]
- [3] [Applied Intelligence'25] | Composed Image Retrieval: A Survey on Recent Research and Development. [Paper]
- [4] [TOMM'25] | A Survey on Composed Image Retrieval. [Paper]
- [5] [Arxiv'24] | A Survey of Multimodal Composite Editing and Retrieval. [Paper]
- [1] [SIGIR'26] | A Sketch+Text Composed Image Retrieval Dataset for Thangka. [Paper]
- [2] [Arxiv'26] | FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data. [Paper]
- [3] [CVPRW'25] | good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval. [Paper]
- [4] [Pattern Recognition'25] | ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval. [Paper]
- [5] [Arxiv'24] | EUFCC-CIR: a Composed Image Retrieval Dataset for GLAM Collections. [Paper]
| Dataset | Modalities | Image Scale | Triplet Scale | Type | Link | Domain |
|---|---|---|---|---|---|---|
| FashionIQ | Image+Text | ~77.7K | ~30.1K | Human Annotated | Link | Fashion |
| Refined-FashionIQ | Image+Text | ~8.1K | ~5.5K | Automatic Refined | Link | Fashion |
| Shoes | Image+Text | ~14.7K | ~10.8K | Human Annotated | Link | Fashion |
| Fashion200K | Image+Text | ~200K | -- | -- | Link | Fashion |
| MIT | Image+Text | ~6K | -- | -- | Link | Open-domain |
| CIRR | Image+Text | ~21.6K | ~36.6K | Human Annotated | Link | Open-domain |
| Refined-CIRR | Image+Text | ~2.2K | ~3.9K | Automatic Refined | Link | Open-domain |
| CIRCO | Image+Text | ~12.3K | ~1.0K | Human Annotated | Link | Open-domain |
| CSS | Image+Text | ~1.0K | ~32K | Generated | Link | Open-domain |
| LaSCo | Image+Text | ~121.5K | ~389.3K | Generated | Link | Open-domain |
| MT-CIR | Image+Text | ~423K | ~17.7M | Generated | Link | Open-domain |
| SynthTriplets18M | Image+Text | -- | ~18M | Generated | Link | Open-domain |
| WebVid-CoVR | Video+Text | ~130.8K | ~1.6M | Generated | Link | Video |
| ITCPR | Image+Text | ~20K | ~12.2K | Human Annotated | Link | Person |
| Airplane, Tennis, and WHIRT | Image+Text | ~7.7K | ~8.7K | Human Annotated | - | Remote Sensing |
| PatternCom | Image+Text | ~30K | ~21K | Human Annotated | Link | Remote Sensing |
| FS-COCO | Sketch+Image+Text | ~10K | ~10K | Human Annotated | Link | Sketch |
| SketchyCOCO | Sketch+Image+Text | ~14K | ~14K | Automatic matching | Link | Sketch |
| CSTBIR | Sketch+Image+Text | ~108K | ~2M | Automatic matching | Link | Sketch |
| ImageNet-R | Image+Cart+Toy+Orig+Sculpt | ~17.9K | -- | -- | Link | Domain Conversion |
| Mini-DomainNet | Clip+Paint+Image+Sketch | ~137K | -- | -- | Link | Domain Conversion |
| NICO++ | Aut+Dim+Grass+Out+Rock+Water | ~80K | -- | -- | Link | Domain Conversion |
| LTLL | Today+Archive | 488 | -- | -- | Link | Instance Level Domain Conversion |
| i-CIR | Image+Text | ~750K | ~13.5K | Human Annotated | Link | Art, Landmark, Fictional, Mobility, Fashion, Product, Household, Tech |
If you find this repository helpful, please consider citing our survey and giving this repo a ⭐.
@article{cirsurvey,
title = {A Comprehensive Survey on Composed Image Retrieval},
author = {Song, Xuemeng and Lin, Haoqiang and Wen, Haokun and Hou, Bohan and Xu, Mingzhu and Nie, Liqiang},
journal = {ACM Transactions on Information Systems},
volume = {44},
number = {1},
pages = {19:1--19:54},
year = {2026},
publisher = {ACM}
}