This application leverages CLIP embeddings of the COCO dataset to enable semantic image search using natural language queries. Users can enter a text query, and the app retrieves the most relevant images from the dataset.
For each search, the application generates a filtered subset of the COCO dataset, including:

- The selected images (displayed in a gallery)
- Their corresponding COCO-style annotations

Users can then download the results as a ZIP file, which contains:

- An `images/` directory with the retrieved images
- An `annotations/annotations.json` file containing the annotations
This setup allows researchers and developers to quickly extract custom subsets of the COCO dataset for experimentation, fine-tuning, or other computer vision tasks based on semantic content rather than predefined categories. All necessary files are included in the repository, so you don’t need to download the COCO dataset manually; the application fetches the images directly from their URLs on the internet.
demo.mp4
- `data/captions_val2014.json` — COCO captions/annotations for the val2014 split (used to map filenames to COCO URLs).
- `embeddings/` — binary `.npy` files produced by the embedding script: `image_coco_validation_2014_embeddings.npy`, `image_coco_validation_2014_coco_urls.npy`, and `image_coco_validation_2014_filenames.npy`.
- `models/` — possibly-split CLIP model parts (named `clip_model_part_aa`, ...). `src/utils.py` contains helpers to merge these into a single in-memory model.
- `src/` — application source code:
  - `app.py` — Gradio UI to enter a text query, show a gallery of results, and allow downloading selected images as a ZIP.
  - `clip_embed_images.py` — script to compute image embeddings for a local copy of COCO val2014 and save embeddings + COCO URLs.
  - `clip_search_images.py` — search routine that loads precomputed embeddings and runs text-to-image similarity using a CLIP model.
  - `utils.py` — helpers to assemble/load the split model parts and other small utilities.
- Compute image embeddings (offline): `src/clip_embed_images.py` loads a CLIP model via `utils.load_split_model()`, processes local image files (val2014), and computes normalized image embeddings. It saves two files under `embeddings/`: a NumPy array of embeddings and a matching list of COCO image URLs.
- Query-time search (online): `src/clip_search_images.py` loads the saved embeddings (`embeddings/*.npy`) and the CLIP text encoder. Given a text query, it encodes the query, computes cosine similarities with the stored image embeddings, and returns the top-k image URLs and scores.
- UI: `src/app.py` provides a Gradio interface. When a user submits a query, the backend calls the search function and displays the top-k images in a gallery. Selected images can be downloaded as a ZIP.
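The offline/online split above can be sketched with NumPy alone. The function and file names below are illustrative, not the repository's actual API; the point is the convention: embeddings are L2-normalized once at index time, so query-time cosine similarity reduces to a single matrix-vector product.

```python
import numpy as np

def save_index(embeddings: np.ndarray, urls: list, out_dir: str = "embeddings"):
    """Save the index: row i of the embedding matrix corresponds to urls[i].
    Embeddings are L2-normalized up front so that cosine similarity at
    query time is just a dot product."""
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    np.save(f"{out_dir}/image_embeddings.npy", embeddings)
    np.save(f"{out_dir}/image_urls.npy", np.array(urls))

def search(text_feature: np.ndarray, embeddings: np.ndarray, urls: list, k: int = 5):
    """Return the top-k (url, score) pairs for one encoded text query."""
    text_feature = text_feature / np.linalg.norm(text_feature)  # normalize query
    sims = embeddings @ text_feature        # (N,) cosine similarities
    top = np.argsort(-sims)[:k]             # indices of the k highest scores
    return [(urls[i], float(sims[i])) for i in top]
```

Because the embedding matrix and URL array are saved in parallel, index `i` in one always refers to index `i` in the other; the search step never needs the original image files.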
Why the split-model code: the `models/` directory may contain a large pretrained CLIP model file that was split into multiple parts for storage/transfer. `src/utils.py` shows a simple approach that concatenates those parts in memory and loads the PyTorch state dict.
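A minimal sketch of that concatenation approach (the `clip_model_part_*` naming follows the description above; the exact helper in `src/utils.py` may differ):

```python
import glob
import io

def merge_split_parts(pattern: str = "models/clip_model_part_*") -> io.BytesIO:
    """Concatenate split model parts into a single in-memory buffer.
    Parts are sorted by name, so the aa/ab/... suffix order determines
    the byte order of the merged file."""
    buf = io.BytesIO()
    for part in sorted(glob.glob(pattern)):
        with open(part, "rb") as f:
            buf.write(f.read())
    buf.seek(0)  # rewind so the consumer reads from the start
    return buf
```

The returned buffer can then be passed straight to `torch.load(...)` to recover the state dict without ever writing a merged checkpoint to disk.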
Before using the app, install the requirements:

```bash
pip install -r requirements.txt
```

To run the app:

```bash
python src/app.py
```

Open the local URL printed by Gradio in your browser, enter a query, and inspect the gallery.
- Model loading: `src/utils.py::load_split_model()` concatenates files named with the prefix `clip_model_part_` inside `models/` into a bytes buffer and then loads the PyTorch state dict from that buffer. This is intentionally memory-resident; be careful with very large models on low-RAM machines.
- Embeddings: `src/clip_embed_images.py` normalizes embeddings after encoding (`embedding /= embedding.norm(dim=-1, keepdim=True)`); search applies the same normalization to text features, so cosine similarity can be computed via a dot product.
- Search uses NumPy arrays for fast similarity computation: the text feature (1 × D) is multiplied with `image_embeddings.T`, yielding a vector of similarities.
- UI: `src/app.py` fetches images from their original COCO URLs at display/download time rather than storing local copies.
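The ZIP layout described earlier (`images/` plus `annotations/annotations.json`) can be assembled in memory with the standard library. This is a sketch, not the app's actual code; the `fetch` hook is a hypothetical injection point so the packaging logic can be exercised without network access.

```python
import io
import json
import urllib.request
import zipfile

def build_result_zip(urls, annotations, fetch=None):
    """Package retrieved images and their annotations into a ZIP:
    images/<filename> for each URL, plus annotations/annotations.json.
    `fetch` maps a URL to raw bytes; by default it downloads over HTTP."""
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read()
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for url in urls:
            filename = url.rsplit("/", 1)[-1]  # last path segment of the URL
            zf.writestr(f"images/{filename}", fetch(url))
        zf.writestr("annotations/annotations.json", json.dumps(annotations))
    return buf.getvalue()
```

Building the archive in a `BytesIO` buffer mirrors the app's approach of fetching images at download time instead of keeping local copies.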
- If the `models/` folder contains split parts but `utils.load_split_model()` fails, ensure the parts are sorted and named consistently (`clip_model_part_aa`, `clip_model_part_ab`, ...).
- For quick experimentation without local models, you can try an installed CLIP checkpoint via `open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_something')` — adjust the code accordingly.
- If running on CPU, embedding computation will be slow; prefer a GPU when available.
- `src/clip_embed_images.py` — compute and save embeddings (offline).
- `src/clip_search_images.py` — run query-time search using embeddings.
- `src/utils.py` — model assembly helper for split files.
- `src/app.py` — Gradio frontend wiring.
- Add argument parsing to `src/clip_embed_images.py` for input/output paths and batch sizes.
- Add unit tests for `utils.load_split_model()` and `clip_search_images.search_images_by_url()`.