This repository contains the code for COCO-Facet, a benchmark for attribute-focused text-to-image retrieval (the "facets" of the images). The benchmark can be downloaded here. Please place the downloaded JSON files in the "benchmark" folder for evaluation.
The annotations are drawn from MSCOCO 2017, COCO-Stuff, Visual7W, and VisDial, all of which describe COCO images. Since these datasets reindex the images, we recommend downloading the images from MSCOCO_val2017, VisDial_val2018, and Visual7W.
conda create -n facet python=3.10
pip install -r VLM2Vec/requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
Please first modify the dataset path and the Hugging Face model path in the scripts. Then you can start the evaluation inside the "VLM2Vec" folder.
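For orientation, each evaluation script embeds the queries and the candidate images with the chosen model and ranks the images by embedding similarity. Below is a minimal, hypothetical sketch of that retrieval loop with CLIP-ViT-L/14-336px through Hugging Face transformers; it is not the repository's evaluation code, and the image paths and query text are placeholders.

```python
# A minimal sketch (not the repository's eval code): embed attribute-focused
# queries and candidate images with CLIP-ViT-L/14-336px, rank by cosine
# similarity, and take the top-K indices for Recall@K. Paths/queries are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

queries = ["a photo where the umbrella is red"]    # attribute-focused query (placeholder)
image_paths = ["images/000000000139.jpg"]          # candidate pool (placeholder paths)

with torch.no_grad():
    text_inputs = processor(text=queries, return_tensors="pt", padding=True)
    q = model.get_text_features(**text_inputs)
    images = [Image.open(p).convert("RGB") for p in image_paths]
    image_inputs = processor(images=images, return_tensors="pt")
    v = model.get_image_features(**image_inputs)

# Cosine similarity between every query and every candidate image.
q = q / q.norm(dim=-1, keepdim=True)
v = v / v.norm(dim=-1, keepdim=True)
scores = q @ v.T                                    # (num_queries, num_images)
ranking = scores.argsort(dim=-1, descending=True)   # top-K indices give Recall@K
```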
For CLIP-ViT-L/14-336px:
sh eval_b.sh
For VLM2Vec without any attribute-specific prompt:
sh eval_d.sh
For VLM2Vec with GPT prompts:
sh eval_f.sh
We also include the human-written prompts in eval_f.py.
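To illustrate what an attribute-specific prompt does here, the sketch below simply prepends a prompt to the query text before it is embedded. The prompt wording and the embed_text helper are hypothetical; the prompts actually used are the ones in eval_f.py.

```python
# Hypothetical illustration of prompting a query before embedding it.
# The prompt text and embed_text() are placeholders; the real prompts live in eval_f.py.
def build_prompted_query(query: str, attribute_prompt: str) -> str:
    """Prepend an attribute-focused instruction to a retrieval query."""
    return f"{attribute_prompt} Query: {query}"

prompt = "Pay attention to the color of the objects mentioned in the query."  # placeholder
prompted_query = build_prompted_query("a photo where the umbrella is red", prompt)
# embedding = embed_text(prompted_query)  # embed with VLM2Vec or any other text encoder
```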
For text-based retrieval:
sh eval_t_detailed.sh
For VLM2Vec with GPT-chosen prompts at test time:
sh eval_e.sh
We have attached the GPT responses under output/outputs_e, which can be reused.
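The sketch below shows one way such test-time prompt selection can be requested from GPT; the candidate prompts, the model name, and the instruction wording are assumptions for illustration only, and the cached responses in output/outputs_e let you skip this API call.

```python
# A hedged sketch of choosing a prompt with GPT at test time. Candidate prompts,
# model name, and instruction wording are assumptions, not the repository's exact setup.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

candidate_prompts = [
    "Focus on object colors.",
    "Focus on object counts.",
    "Focus on spatial relations between objects.",
]  # hypothetical attribute-specific prompts

query = "a photo where the umbrella is red"
instruction = (
    "Given the retrieval query below, reply with only the number of the prompt "
    "that best highlights the attribute the query asks about.\n"
    + "\n".join(f"{i}: {p}" for i, p in enumerate(candidate_prompts))
    + f"\nQuery: {query}"
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": instruction}],
)
# Sketch only: no error handling if the reply is not a bare index.
chosen_prompt = candidate_prompts[int(response.choices[0].message.content.strip())]
```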
For VLM2Vec with linearly approximated promptable embeddings:
sh eval_a.sh
Note that the embeddings produced by "eval_f.sh" and "eval_d.sh" are needed to derive the matrix W.
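As a rough sketch of how such a linear map can be derived, assume the unprompted query embeddings (from "eval_d.sh") and the prompted ones (from "eval_f.sh") are stacked into matrices; W can then be fit by ordinary least squares. The .npy paths below are hypothetical, and the exact objective used by eval_a.sh may differ.

```python
# A minimal sketch of fitting a linear map W from unprompted to prompted query
# embeddings by least squares. File paths are hypothetical placeholders.
import numpy as np

X = np.load("output/embeddings_d.npy")   # (N, d) unprompted embeddings from eval_d.sh (assumed path)
Y = np.load("output/embeddings_f.npy")   # (N, d) prompted embeddings from eval_f.sh (assumed path)

# Solve min_W ||X W - Y||_F^2 in closed form.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # W has shape (d, d)

# At test time, approximate a promptable embedding for a new query embedding x:
# y_hat = x @ W
```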
We include the collators for other MLLM-based universal multimodal embedders in VLM2Vec/src/collator.py.
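For readers unfamiliar with the pattern, a collator turns a list of raw (text, image) examples into a padded tensor batch for the embedder. The generic sketch below is not the code in VLM2Vec/src/collator.py; the processor and field names are placeholders.

```python
# A generic collator sketch (not VLM2Vec/src/collator.py): batch (text, image)
# examples for an MLLM-based embedder. Processor and field names are placeholders.
from dataclasses import dataclass

@dataclass
class SimpleMultimodalCollator:
    processor: object  # e.g., a Hugging Face AutoProcessor for the chosen MLLM

    def __call__(self, examples):
        texts = [ex["text"] for ex in examples]
        images = [ex["image"] for ex in examples]
        # The processor tokenizes the texts, preprocesses the images, and pads
        # everything so the batch can be fed to the embedding model.
        return self.processor(text=texts, images=images,
                              padding=True, return_tensors="pt")
```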
We provide the dataset construction process as .ipynb notebooks in the "construction" folder.
This code is mainly based on the VLM2Vec repository.
If you find our code, data, or paper useful, please cite:
@article{li2025highlighting,
  title={Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval},
  author={Li, Siting and Gao, Xiang and Du, Simon Shaolei},
  journal={arXiv preprint arXiv:2505.15877},
  year={2025}
}