Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective. This project contains code to train CLIP on the MS-COCO Captions dataset. It also includes an implementation of SigLIP, which uses a sigmoid loss as the training objective.
The MS-COCO Captions dataset contains over 100,000 (image, text) pairs. Download the dataset and update config.yaml
with the paths to the image folder and the annotations file. To download the dataset:
# create the dataset directory under data/
$ mkdir -p data/mscoco
# download images
$ wget http://images.cocodataset.org/zips/train2017.zip -O data/mscoco/train2017.zip
$ unzip data/mscoco/train2017.zip -d data/mscoco
# download annotations
$ wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip -O data/mscoco/annotations_trainval2017.zip
$ unzip data/mscoco/annotations_trainval2017.zip -d data/mscoco
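For orientation, the sketch below shows roughly how the extracted files relate to each other. It uses pycocotools (an assumption; the project may load the annotations differently), and the paths follow the commands above.

```python
# Illustrative only: how the downloaded annotations map images to captions.
from pycocotools.coco import COCO

coco = COCO("data/mscoco/annotations/captions_train2017.json")

img_id = coco.getImgIds()[0]                       # pick the first image id
img_info = coco.loadImgs(img_id)[0]                # metadata, incl. file_name
ann_ids = coco.getAnnIds(imgIds=img_id)            # caption annotations for it
captions = [ann["caption"] for ann in coco.loadAnns(ann_ids)]

print("data/mscoco/train2017/" + img_info["file_name"])
print(captions)                                    # ~5 reference captions per image
```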
Note: The input text is tokenized with the DistilBERT tokenizer from Hugging Face. You can set the desired context size and enable shuffling between the multiple captions of an image in the config file. Input images are resized to (224, 224), the expected input size of the ResNet50 backbone.
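A hedged sketch of that preprocessing is shown below; the tokenizer checkpoint, context size, and transforms are illustrative assumptions rather than the exact values in config.yaml.

```python
# Illustrative preprocessing sketch; the real transforms live in the project's
# dataset code, and the context size comes from config.yaml.
from PIL import Image
from torchvision import transforms
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
context_size = 77  # example value; set via config.yaml

tokens = tokenizer(
    "a dog playing fetch in the park",
    padding="max_length",
    truncation=True,
    max_length=context_size,
    return_tensors="pt",
)  # use tokens["input_ids"] and tokens["attention_mask"]

image_transform = transforms.Compose([
    transforms.Resize((224, 224)),  # ResNet50 input size
    transforms.ToTensor(),
])
image = image_transform(Image.open("data/mscoco/train2017/example.jpg"))  # placeholder path
```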
The CLIP model consists of two encoders:
- Image Encoder: ResNet50 (backbone) + 2 Linear Layers (projection).
- Text Encoder: DistilBERT (backbone) + 2 Linear Layers (projection).
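As a rough sketch of that architecture (the projection dimensions, activation, and pooling choice are assumptions, not necessarily the project's exact code):

```python
import torch.nn as nn
from torchvision.models import resnet50
from transformers import DistilBertModel

class ImageEncoder(nn.Module):
    """ResNet50 backbone + 2-layer projection head (dimensions are assumed)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V2")
        backbone.fc = nn.Identity()              # keep the 2048-d pooled features
        self.backbone = backbone
        self.proj = nn.Sequential(
            nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, image):
        return self.proj(self.backbone(image))

class TextEncoder(nn.Module):
    """DistilBERT backbone + 2-layer projection head (dimensions are assumed)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.backbone = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.proj = nn.Sequential(
            nn.Linear(768, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, text, text_mask):
        out = self.backbone(input_ids=text, attention_mask=text_mask)
        cls = out.last_hidden_state[:, 0]        # embedding at the [CLS] position
        return self.proj(cls)
```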
CLIP supports the following methods:
model.encode_image(image: Tensor)
Given a batch of images, returns the image features encoded by the image encoder of the CLIP model.
model.encode_text(text: Tensor, text_mask: Tensor)
Given a batch of text tokens and associated masks, returns the text features encoded by the text encoder of the CLIP model.
model.generate_similarity_matrix(image: Tensor, text: Tensor, text_mask: Tensor)
Given a batch of images and a batch of text tokens and masks, returns a matrix of scaled cosine similarities between the corresponding image and text features.
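Put together, generate_similarity_matrix amounts to something like the following; the normalization and the learned temperature (logit_scale) are assumptions based on the description above, not code taken from the project.

```python
# Sketch of "scaled cosine similarities" between image and text features.
import torch.nn.functional as F

def generate_similarity_matrix(model, image, text, text_mask, logit_scale):
    img_feat = F.normalize(model.encode_image(image), dim=-1)           # (B, D)
    txt_feat = F.normalize(model.encode_text(text, text_mask), dim=-1)  # (B, D)
    return logit_scale.exp() * img_feat @ txt_feat.t()                  # (B, B)
```

Entry (i, j) of the result compares image i with caption j, so matching pairs sit on the diagonal.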
Update the training hyperparameters in the config.yaml file. The SigLIP paper introduced a sigmoid-based contrastive objective that performs better than softmax baselines, particularly at small batch sizes. To use the sigmoid loss, change loss under the algorithm config from clip to siglip.
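For reference, the two objectives differ roughly as follows. This is a sketch of the standard formulations from the papers; the bias term and scale handling are assumptions rather than details taken from this repository.

```python
import torch
import torch.nn.functional as F

def clip_loss(logits):
    """Softmax (InfoNCE) objective over a (B, B) similarity matrix."""
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img = F.cross_entropy(logits, targets)      # image -> text direction
    loss_txt = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_img + loss_txt) / 2

def siglip_loss(logits, bias=0.0):
    """Sigmoid objective: each (image, text) pair is an independent binary
    classification; only the diagonal pairs are positives."""
    batch = logits.size(0)
    labels = 2 * torch.eye(batch, device=logits.device) - 1  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * (logits + bias)).sum() / batch
```

Because the sigmoid loss does not normalize over the whole batch, it degrades more gracefully at small batch sizes, which is the behaviour reported in the SigLIP paper.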
- Install the dependencies from the requirements file. Make sure to create a virtual/conda environment before running this command.
# create new env clip_env
conda create -n clip_env python=3.11
# activate clip_env
conda activate clip_env
# install other dependencies
pip install -r requirements.txt
- Run main.py, which starts the training script.
# navigate to the src folder
cd src
# run the main file
python main.py
Any kind of enhancement or contribution is welcome, for example:
- zero-shot classifier
- support for loggers
[1] CLIP Paper: Learning Transferable Visual Models From Natural Language Supervision
[2] SigLIP Paper: Sigmoid Loss for Language Image Pre-Training