
CLIP: Contrastive Language-Image Pretraining

Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective. This project contains code to train CLIP on the MS-COCO Captions dataset. It also includes an implementation of SigLIP, which uses a sigmoid loss as the training objective.

CLIP Training

Data

The MS-COCO Captions dataset contains over 100,000 (image, text) pairs. Download the dataset and update config.yaml with the paths to the image folder and the annotations file. To download the dataset:

# create the dataset directory
$ mkdir -p data/mscoco

# download images
$ wget http://images.cocodataset.org/zips/train2017.zip -O data/mscoco/train2017.zip
$ unzip data/mscoco/train2017.zip -d data/mscoco

# download annotations
$ wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip -O data/mscoco/annotations_trainval2017.zip
$ unzip data/mscoco/annotations_trainval2017.zip -d data/mscoco

Note: The input text is tokenized using the DistilBertTokenizer from HuggingFace. You can set the desired context size and enable shuffling (between multiple captions) in the config file. Input images are resized to (224, 224), the input size of the ResNet50 model.
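
A minimal preprocessing sketch of the above, assuming the transformers, torchvision, and Pillow packages; the context size, caption string, and image path are placeholders, not values taken from this repo:

import torch
from transformers import DistilBertTokenizer
from torchvision import transforms
from PIL import Image

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
context_size = 64  # placeholder: use the context size set in config.yaml

# tokenize a caption into fixed-length token ids plus an attention mask
encoded = tokenizer(
    "a dog runs across the beach",
    padding="max_length",
    truncation=True,
    max_length=context_size,
    return_tensors="pt",
)
text, text_mask = encoded["input_ids"], encoded["attention_mask"]

# resize the image to the (224, 224) input expected by ResNet50
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
image = image_transform(Image.open("path/to/image.jpg").convert("RGB"))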

Model

The CLIP model consists of two encoders:

  1. Image Encoder: ResNet50 (backbone) + 2 Linear Layers (projection).

  2. Text Encoder: DistilBERT (backbone) + 2 Linear Layers (projection).
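
A rough, illustrative sketch of these two encoders is shown below, assuming torchvision's ResNet50 and HuggingFace's DistilBERT; the embedding dimension and projection sizes are assumptions, not necessarily the repo's exact values.

import torch.nn as nn
from torchvision.models import resnet50
from transformers import DistilBertModel

class ImageEncoder(nn.Module):
    """ResNet50 backbone followed by a 2-layer linear projection head."""
    def __init__(self, embed_dim=256):  # embed_dim is an assumed value
        super().__init__()
        backbone = resnet50(weights="DEFAULT")
        backbone.fc = nn.Identity()  # drop the classification head -> 2048-d features
        self.backbone = backbone
        self.projection = nn.Sequential(
            nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, image):
        return self.projection(self.backbone(image))

class TextEncoder(nn.Module):
    """DistilBERT backbone followed by a 2-layer linear projection head."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.backbone = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.projection = nn.Sequential(
            nn.Linear(768, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, text, text_mask):
        hidden = self.backbone(input_ids=text, attention_mask=text_mask).last_hidden_state
        return self.projection(hidden[:, 0])  # use the first ([CLS]) token embedding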

CLIP supports the following methods:

model.encode_image(image: Tensor)

Given a batch of images, returns the image features encoded by the image encoder of the CLIP model.

model.encode_text(text: Tensor, text_mask: Tensor)

Given a batch of text tokens and associated masks, returns the text features encoded by the text encoder of the CLIP model.

model.generate_similarity_matrix(image: Tensor, text: Tensor, text_mask: Tensor)

Given a batch of images and a batch of text tokens and masks, returns a matrix of scaled cosine similarities between the corresponding image and text features.
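
A usage sketch combining the three methods; the import path, CLIP() constructor, and batch shapes are assumptions about the interface, not verbatim from the repo:

import torch
from model import CLIP  # assumed import path; adjust to the repo layout

batch_size, context_size = 8, 64
image = torch.randn(batch_size, 3, 224, 224)                 # resized images
text = torch.randint(0, 30522, (batch_size, context_size))   # DistilBERT token ids
text_mask = torch.ones(batch_size, context_size, dtype=torch.long)

model = CLIP()  # assumed default constructor

image_features = model.encode_image(image)           # (batch, embed_dim)
text_features = model.encode_text(text, text_mask)   # (batch, embed_dim)

# (batch, batch) matrix of scaled cosine similarities; the diagonal holds
# the similarities of the matched (image, text) pairs
logits = model.generate_similarity_matrix(image, text, text_mask)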

Training

Update the training hyperparameters in the config.yaml file. The SigLIP paper introduces a sigmoid-based contrastive objective that outperforms softmax baselines, particularly at small batch sizes. To use the sigmoid loss, change loss under the algorithm config from clip to siglip.
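
For context, here is a minimal sketch of the two objectives applied to a (batch, batch) similarity matrix whose diagonal holds the matched pairs. The fixed bias value, and the assumption that the temperature is already folded into the logits, are illustrative choices rather than the repo's exact parameterization.

import torch
import torch.nn.functional as F

def clip_loss(logits):
    """Symmetric softmax (InfoNCE) loss: each image must pick its own caption and vice versa."""
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def siglip_loss(logits, bias=-10.0):
    """Sigmoid loss: every (image, text) cell is an independent binary classification,
    +1 on the diagonal (matched pairs), -1 elsewhere. The bias is a learnable scalar in
    the SigLIP paper; a fixed value is used here for illustration."""
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1  # +1 diagonal, -1 off-diagonal
    return -F.logsigmoid(labels * (logits + bias)).sum() / n

Both losses assume the matched (image, text) pairs lie on the diagonal of the similarity matrix returned by generate_similarity_matrix.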

Running Code

  1. Install dependencies from the requirements file. Make sure to create a virtual/conda environment before running this command.
# create a new env named clip_env
conda create -n clip_env python=3.11

# activate clip_env
conda activate clip_env

# install other dependencies
pip install -r requirements.txt
  2. Run main.py, which starts the training script.
# navigate to the src folder
cd src

# run the main file
python main.py

To-Dos

Any kind of enhancement or contribution is welcome.

  • zero-shot classifier
  • support for loggers

References

[1] CLIP Paper: Learning Transferable Visual Models From Natural Language Supervision

[2] SigLIP Paper: Sigmoid Loss for Language Image Pre-Training
