This repository demonstrates contrastive learning for visual-text representation using Vision Transformers (ViT) and OpenAI's CLIP. Contrastive learning is a self-supervised approach that learns representations by bringing similar pairs (image-text) closer in embedding space while pushing dissimilar pairs apart.
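The pull-together, push-apart objective described above can be sketched as a symmetric cross-entropy over a batch similarity matrix. This is a minimal NumPy illustration, not the code from the notebooks: the function names, the toy data, and the temperature of 0.07 are assumptions chosen for the example.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (N, D) arrays; row i of each forms a matching pair."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity matrix
    diag = np.arange(len(logits))
    # Matching pairs sit on the diagonal; every other entry acts as a negative.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy data (assumed): aligned pairs versus the same pairs deliberately mismatched.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = text + 0.01 * rng.normal(size=(4, 8))
shuffled = aligned[[1, 2, 3, 0]]
loss_aligned = symmetric_contrastive_loss(aligned, text)
loss_shuffled = symmetric_contrastive_loss(shuffled, text)
print(loss_aligned, loss_shuffled)
```

The loss is low when each image embedding is closest to its own text embedding and rises when the pairs are mismatched, which is the behavior the training signal relies on.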
The repository contains three main modules:
- ViT – Implementation and fine-tuning of Vision Transformer models for image representation.
- CLIP – Zero-shot image-text modeling using OpenAI's CLIP, along with preprocessing and embedding extraction.
- Interacting with CLIP – Utilities for text-to-image retrieval, similarity computation, and visualization.
Note: This module is adapted from OpenAI’s CLIP GitHub repository.
Contrastive-Learning/
│
├── ViT.ipynb # Vision Transformer model and fine-tuning examples
├── CLIP.ipynb # Using CLIP for encoding and zero-shot classification
├── Interacting_with_CLIP.ipynb # Image-text retrieval, similarity scoring, and visualization (adapted from OpenAI)
├── README.md # Project documentation
- Vision Transformer (ViT)
  - Load pre-trained ViT models (ViT-B/32) for image classification.
  - Fine-tune the last transformer blocks using contrastive loss.
  - Freeze and unfreeze layers selectively for efficient training.
- CLIP
  - Encode images and text into a shared embedding space.
  - Perform zero-shot classification using text prompts.
  - Compute cosine similarity between images and text for retrieval tasks.
- Interacting with CLIP
  - Text-to-image search with real or simulated datasets.
  - Visualization of images along with predicted labels.
  - Utilities for working in Google Colab or local environments.
Adapted from OpenAI CLIP GitHub.
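The selective freezing listed under the ViT features could look roughly like the sketch below. To keep it self-contained, it uses a hypothetical `TinyViT` stand-in that exposes its layers as `model.blocks`, which is assumed here to mirror the layout of timm's Vision Transformer models. The helper name, the sizes, and the learning rate are illustrative assumptions, not values taken from the notebook.

```python
import torch
from torch import nn

class TinyViT(nn.Module):
    """Hypothetical stand-in mimicking a ViT with a `blocks` container."""
    def __init__(self, dim=16, depth=4, num_heads=2):
        super().__init__()
        self.patch_embed = nn.Linear(48, dim)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, num_heads, dim_feedforward=32,
                                       batch_first=True)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim)

    def forward(self, patches):
        x = self.patch_embed(patches)          # (batch, num_patches, dim)
        for block in self.blocks:
            x = block(x)
        return self.norm(x).mean(dim=1)        # pooled image embedding

def freeze_all_but_last_blocks(model, num_trainable_blocks):
    """Freeze every parameter, then re-enable the last N transformer blocks."""
    for param in model.parameters():
        param.requires_grad = False
    blocks = list(model.blocks)
    for block in blocks[len(blocks) - num_trainable_blocks:]:
        for param in block.parameters():
            param.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

model = TinyViT()
trainable = freeze_all_but_last_blocks(model, num_trainable_blocks=2)
# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
print(len(trainable), "trainable parameter tensors")
```

Passing only the unfrozen parameters to the optimizer avoids keeping optimizer state for frozen weights, which is where much of the training-cost saving comes from.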
- Clone the repository:
git clone https://github.com/yourusername/contrastive-learning.git
cd contrastive-learning
- Create and activate a Python environment (optional but recommended):
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
- Install dependencies:
pip install -r requirements.txt
Recommended packages in requirements.txt:
torch>=2.0
torchvision
timm
clip @ git+https://github.com/openai/CLIP.git
scikit-image
matplotlib
Pillow
numpy
requests
- Load a pre-trained ViT model.
- Freeze early layers and fine-tune selected transformer blocks.
- Compute image embeddings for contrastive learning.
- Load CLIP model and preprocess images and text.
- Perform zero-shot classification or feature extraction.
- Normalize embeddings and compute cosine similarity.
- Text-to-image retrieval example with sample datasets.
- Visualize results with matplotlib.
- Test fine-tuned CLIP models with custom image-text pairs.
Adapted from OpenAI’s official CLIP repository.
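The text-to-image retrieval step listed above amounts to ranking image embeddings by cosine similarity to a text query. The sketch below shows that ranking with NumPy on toy vectors. The `rank_images` helper and the three-dimensional embeddings are assumptions made for illustration; in practice the arrays would come from `model.encode_image` and `model.encode_text`.

```python
import numpy as np

def rank_images(text_embedding, image_embeddings, top_k=3):
    """Return (indices, scores) of the top_k images most similar to the query."""
    # Normalize both sides so the dot product equals cosine similarity.
    text = text_embedding / np.linalg.norm(text_embedding)
    images = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = images @ text                 # shape: (num_images,)
    order = np.argsort(-scores)[:top_k]    # highest similarity first
    return order, scores[order]

# Toy stand-ins (assumed) for precomputed image embeddings and one text query.
image_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([0.9, 0.1, 0.0])

indices, scores = rank_images(query, image_embeddings, top_k=2)
print(indices, scores)
```

With these toy vectors the query ranks image 0 first and image 2 second. For a large gallery, the image embeddings can be normalized once and cached so that each query costs a single matrix-vector product.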
# Zero-shot classification of a single image with CLIP
import clip
import torch
from PIL import Image

# Use the GPU when available; CLIP also runs on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a cat", "a dog"]
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)  # add a batch dimension
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# L2-normalize so the dot product equals cosine similarity.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

similarity = (image_features @ text_features.T).squeeze(0)
predicted_class = similarity.argmax().item()
print("Predicted class:", labels[predicted_class])

Contributions are welcome!
- Fork the repository
- Create a new branch for your feature or bugfix
- Submit a pull request with a clear description
This project is licensed under the MIT License – see the LICENSE file for details.