Contrastive Learning with Vision Transformers and CLIP

Overview

This repository demonstrates contrastive learning of joint image-text representations using Vision Transformers (ViT) and OpenAI's CLIP. Contrastive learning is a self-supervised approach that learns representations by pulling matching pairs (here, an image and its text description) closer together in embedding space while pushing mismatched pairs apart.
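
To make the pull-together, push-apart idea concrete, here is a minimal sketch of a symmetric contrastive loss in the style CLIP uses. The function name `clip_contrastive_loss`, the temperature of 0.07, and the random stand-in embeddings are assumptions for illustration, not code taken from the notebooks.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix."""
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = similarity of image i and text j, scaled by temperature.
    logits = image_emb @ text_emb.T / temperature

    # Matching pairs sit on the diagonal: image i belongs with text i.
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2


torch.manual_seed(0)
images = torch.randn(4, 512)  # stand-in for a batch of image embeddings
texts = torch.randn(4, 512)   # stand-in for the paired text embeddings

loss = clip_contrastive_loss(images, texts)           # unrelated pairs
aligned_loss = clip_contrastive_loss(images, images)  # perfectly aligned pairs
print(f"random pairs: {loss.item():.3f}, aligned pairs: {aligned_loss.item():.6f}")
```

The loss is lowest when each image is most similar to its own text and dissimilar to every other text in the batch, which is why the aligned case scores far below the random one.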

The repository contains three main modules:

ViT – Implementation and fine-tuning of Vision Transformer models for image representation.
CLIP – Zero-shot image-text modeling using OpenAI's CLIP, along with preprocessing and embedding extraction.
Interacting with CLIP – Utilities for text-to-image retrieval, similarity computation, and visualization.

Note: The Interacting with CLIP module is adapted from OpenAI’s CLIP GitHub repository.

Repository Structure

Contrastive-Learning/
│
├── ViT.ipynb                   # Vision Transformer model and fine-tuning examples
├── CLIP.ipynb                  # Using CLIP for encoding and zero-shot classification
├── Interacting_with_CLIP.ipynb # Image-text retrieval, similarity scoring, and visualization (adapted from OpenAI)
└── README.md                   # Project documentation

Features

Vision Transformer (ViT)
- Load pre-trained ViT models (ViT-B/32) for image classification.
- Fine-tune the last transformer blocks using contrastive loss.
- Freeze and unfreeze layers selectively for efficient training.

CLIP
- Encode images and text into a shared embedding space.
- Perform zero-shot classification using text prompts.
- Compute cosine similarity between images and text for retrieval tasks.

Interacting with CLIP
- Text-to-image search with real or simulated datasets.
- Visualization of images along with predicted labels.
- Utilities for working in Google Colab or local environments.
- Adapted from the OpenAI CLIP GitHub repository.

Installation

Clone the repository:

git clone https://github.com/yourusername/contrastive-learning.git
cd contrastive-learning

Create and activate a Python environment (optional but recommended):

python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

Install dependencies:

pip install -r requirements.txt

Recommended packages for requirements.txt:

torch>=2.0
torchvision
timm
clip @ git+https://github.com/openai/CLIP.git
scikit-image
matplotlib
Pillow
numpy
requests

Usage

1. ViT.ipynb

- Load a pre-trained ViT model.
- Freeze early layers and fine-tune selected transformer blocks.
- Compute image embeddings for contrastive learning.
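
The freeze-then-fine-tune step above can be sketched as follows. This is an illustration under assumptions, not the notebook's code: the helper `unfreeze_last_blocks` and the `TinyViT` stand-in are hypothetical, and the sketch only assumes the model keeps its transformer layers in a `blocks` attribute, as timm's Vision Transformer models do.

```python
import torch.nn as nn


def unfreeze_last_blocks(model, num_trainable_blocks=2):
    """Freeze every parameter, then re-enable gradients for the last blocks."""
    for param in model.parameters():
        param.requires_grad = False

    blocks = list(model.blocks)
    # Slicing from len - n keeps the selection empty when n is 0.
    for block in blocks[len(blocks) - num_trainable_blocks:]:
        for param in block.parameters():
            param.requires_grad = True
    return model


class TinyViT(nn.Module):
    """Hypothetical stand-in that mimics the `.blocks` layout of a timm ViT."""

    def __init__(self, dim=32, depth=4):
        super().__init__()
        self.patch_embed = nn.Linear(48, dim)
        self.blocks = nn.Sequential(
            *[
                nn.TransformerEncoderLayer(
                    dim, nhead=4, dim_feedforward=64, batch_first=True
                )
                for _ in range(depth)
            ]
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Mean-pool the token sequence into one embedding per image.
        return self.norm(self.blocks(self.patch_embed(x))).mean(dim=1)


model = unfreeze_last_blocks(TinyViT(), num_trainable_blocks=2)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g. {trainable[0]}")
```

When building the optimizer, passing only the parameters with `requires_grad=True` keeps the frozen layers out of the update step.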

2. CLIP.ipynb

- Load the CLIP model and preprocess images and text.
- Perform zero-shot classification or feature extraction.
- Normalize embeddings and compute cosine similarity.
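
The normalization and similarity step can be sketched on precomputed features, without loading a model. The helper name `zero_shot_probs`, the scale of 100, and the synthetic features are assumptions made for this example; in practice the inputs would come from `model.encode_image` and `model.encode_text`.

```python
import torch


def zero_shot_probs(image_features, text_features, scale=100.0):
    """Return per-image probabilities over the candidate text prompts."""
    # Normalize out of place so the caller's tensors are left untouched.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Cosine similarity between every image and every prompt, then softmax.
    logits = scale * image_features @ text_features.T
    return logits.softmax(dim=-1)


torch.manual_seed(0)
text_features = torch.randn(3, 512)  # stand-in for three encoded prompts
# Two synthetic "images" built as noisy copies of prompts 2 and 0.
image_features = text_features[[2, 0]] + 0.1 * torch.randn(2, 512)

probs = zero_shot_probs(image_features, text_features)
predictions = probs.argmax(dim=-1)
print(predictions.tolist())
```

Scaling the similarities before the softmax sharpens the distribution; without it, cosine values in [-1, 1] would produce nearly uniform probabilities.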

3. Interacting_with_CLIP.ipynb

- Text-to-image retrieval example with sample datasets.
- Visualize results with matplotlib.
- Test fine-tuned CLIP models with custom image-text pairs.
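
As a rough illustration of the retrieval step, the sketch below ranks a gallery of image embeddings against one text query by cosine similarity. The helper `retrieve_top_k` and the synthetic embeddings are hypothetical; real embeddings would come from CLIP's image and text encoders.

```python
import torch
import torch.nn.functional as F


def retrieve_top_k(query_emb, gallery_emb, k=3):
    """Return indices and scores of the k gallery items closest to the query."""
    query_emb = F.normalize(query_emb, dim=-1)
    gallery_emb = F.normalize(gallery_emb, dim=-1)

    # One cosine similarity score per gallery item, shape (num_images,).
    scores = gallery_emb @ query_emb

    # Never ask for more results than the gallery holds.
    top = torch.topk(scores, k=min(k, scores.numel()))
    return top.indices.tolist(), top.values.tolist()


torch.manual_seed(0)
gallery = torch.randn(10, 512)                # stand-in for 10 image embeddings
query = gallery[7] + 0.1 * torch.randn(512)   # synthetic query close to image 7

indices, scores = retrieve_top_k(query, gallery, k=3)
print(indices, [round(s, 3) for s in scores])
```

The returned indices can then be used to look up the image files and plot them with matplotlib alongside their scores.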

This notebook is adapted from OpenAI’s official CLIP repository.

Example: Zero-Shot Image Classification

import clip
import torch
from PIL import Image

# Use the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate labels, defined once so the prompts and the printed result stay in sync.
labels = ["a cat", "a dog"]
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)
    predicted_class = similarity.argmax().item()

print("Predicted class:", labels[predicted_class])

Contribution

Contributions are welcome!

- Fork the repository.
- Create a new branch for your feature or bugfix.
- Submit a pull request with a clear description.

License

This project is licensed under the MIT License – see the LICENSE file for details.

MohammadrezaChv/Contrastive_Learning

Folders and files

Latest commit

History

Repository files navigation

Contrastive Learning with Vision Transformers and CLIP

Overview

Repository Structure

Features

Installation

Usage

1. ViT.ipynb

2. CLIP.ipynb

3. Interacting_with_CLIP.ipynb

Example: Zero-Shot Image Classification

Contribution

License

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages