This project implements an image reconstruction system using IJePA (Image-based Joint-Embedding Predictive Architecture) embeddings. The model learns to reconstruct images from their IJePA feature representations using a convolutional decoder network.
- IJePA-based feature extraction using the pretrained `facebook/ijepa_vith14_1k` model (see the embedding sketch after this list)
- Convolutional decoder with residual blocks for high-quality reconstruction
- Multi-component loss function (MSE, SSIM, LPIPS, TV, Cosine)
- Comprehensive evaluation metrics
- Visualization tools for reconstruction quality assessment
- Support for CIFAR-10 and Tiny ImageNet datasets
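As a rough illustration of the feature-extraction step, the snippet below pulls IJePA patch embeddings through Hugging Face transformers. It is a minimal sketch, not the project's `scripts/compute_embeddings.py`: the `AutoProcessor`/`AutoModel` usage, the example image path, and the reshape into a spatial grid (inferred from the decoder's expected input shape of (B, 1280, 16, 16)) are assumptions.

```python
# Minimal sketch: extract IJePA patch embeddings with Hugging Face transformers.
# The processor/model classes and the example image path are assumptions; the
# project's own preprocessing lives in scripts/compute_embeddings.py.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "facebook/ijepa_vith14_1k"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

image = Image.open("example.jpg").convert("RGB")   # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One embedding per image patch: (batch, num_patches, hidden_dim), e.g. (1, 256, 1280)
tokens = outputs.last_hidden_state

# Arrange the 256 patch tokens into the (B, 1280, 16, 16) grid the decoder consumes
grid = tokens.transpose(1, 2).reshape(tokens.size(0), -1, 16, 16)
print(grid.shape)
```

For ViT-H/14 at 224×224 input there are 16×16 = 256 patches of dimension 1280, which matches the decoder's input grid.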
Project structure:

```
ijepa-reconstruction/
├── src/
│   ├── models/        # Model architectures
│   ├── data/          # Data loading and preprocessing
│   ├── utils/         # Utilities for metrics and visualization
│   └── training/      # Training logic
├── scripts/           # Executable scripts
├── configs/           # Configuration files
├── cache/             # Cached embeddings (gitignored)
├── checkpoints/       # Model checkpoints (gitignored)
└── results/           # Output results (gitignored)
```
```bash
# Clone the repository
git clone https://github.com/aymen-000/i-jeap-emdedding-inversion-attack
cd i-jeap-emdedding-inversion-attack
# Install dependencies
pip install -r requirements.txt
```

First, compute and cache the IJePA embeddings for your dataset (optional, but it speeds up training):

```bash
# For CIFAR-10
python scripts/compute_embeddings.py \
--dataset cifar10 \
--train_subset 6000 \
--test_subset 2000 \
--batch_size 8 \
--cache_dir ./cache
# For Tiny ImageNet
python scripts/compute_embeddings.py \
--dataset tiny_imagenet \
--data_root /path/to/tiny-imagenet-200/train \
--train_subset 6000 \
--test_subset 2000 \
--batch_size 8 \
--cache_dir ./cache
```

Train the decoder to reconstruct images from embeddings:

```bash
python scripts/train.py \
--dataset cifar10 \
--train_subset 6000 \
--test_subset 2000 \
--batch_size 32 \
--epochs 100 \
--checkpoint_dir ./checkpoints \
--output_dir ./results
```

Compute reconstruction metrics on the test set:

```bash
python scripts/evaluate.py \
--dataset cifar10 \
--checkpoint ./checkpoints/cifar10_model_final.pth \
--embeddings_file ./cache/cifar10_test_embeddings.pt \
--output_csv ./results/metrics.csv \
--test_subset 2000 \
--train_size 6000 \
--num_epochs 100
```

Generate side-by-side comparisons of original and reconstructed images:

```bash
python scripts/visualize.py \
--dataset cifar10 \
--checkpoint ./checkpoints/cifar10_model_final.pth \
--embeddings_file ./cache/cifar10_test_embeddings.pt \
--num_samples 10 \
--output_dir ./results
```

The decoder uses a flexible progressive upsampling architecture that supports arbitrary output resolutions (a PyTorch sketch of the decoder follows the feature list below):

```
Input Embeddings (B, 1280, 16, 16)
           |
           v
┌─────────────────────┐
│ Upsampling Block 1  │
│   ConvTranspose2d   │
│    1280 → 640 ch    │
│    16×16 → 32×32    │
└──────────┬──────────┘
           |
           v
┌─────────────────────┐
│   Residual Block    │
│  Conv → BN → GELU   │
│      Conv → BN      │
│  + Skip Connection  │
└──────────┬──────────┘
           |
           v
┌─────────────────────┐
│ Upsampling Block 2  │
│   ConvTranspose2d   │
│    640 → 320 ch     │
│    32×32 → 64×64    │
└──────────┬──────────┘
           |
           v
┌─────────────────────┐
│   Residual Block    │
└──────────┬──────────┘
           |
           v
 (continue for log₂(output_size/16) iterations)
           |
           v
┌─────────────────────┐
│  Final Conv Layer   │
│   C → 3 channels    │
│     3×3 kernel      │
└──────────┬──────────┘
           |
           v
┌─────────────────────┐
│ Sigmoid Activation  │
└──────────┬──────────┘
           |
           v
Reconstructed Image (B, 3, output_size, output_size)
```

Key Features:
- Dynamic Resolution: Automatically calculates upsampling layers based on input/output size
- Progressive Channel Reduction: Each stage halves channels (1280→640→320→160...)
- Residual Connections: Improves gradient flow and reconstruction quality
- Fast GELU Activation: x * σ(1.702x) for efficient non-linear transformations
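Below is a minimal PyTorch sketch of a decoder following this recipe. It is illustrative only: kernel sizes, padding, and the channel floor are assumptions, and the project's actual decoder lives in `src/models/`.

```python
# Illustrative sketch of the progressive-upsampling decoder described above.
# Kernel sizes, padding, and the minimum channel count are assumptions; the
# project's actual decoder lives in src/models/.
import math
import torch
import torch.nn as nn


class FastGELU(nn.Module):
    """Fast GELU approximation: x * sigmoid(1.702 * x)."""
    def forward(self, x):
        return x * torch.sigmoid(1.702 * x)


class ResidualBlock(nn.Module):
    """Conv -> BN -> GELU -> Conv -> BN, with a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            FastGELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)


class ProgressiveDecoder(nn.Module):
    """Upsamples a (B, 1280, 16, 16) embedding grid to (B, 3, output_size, output_size)."""
    def __init__(self, in_channels: int = 1280, in_size: int = 16, output_size: int = 64):
        super().__init__()
        num_stages = int(math.log2(output_size // in_size))  # e.g. 64 / 16 -> 2 stages
        stages, ch = [], in_channels
        for _ in range(num_stages):
            out_ch = max(ch // 2, 32)  # halve channels each stage (floor of 32 is an assumption)
            stages += [
                nn.ConvTranspose2d(ch, out_ch, kernel_size=4, stride=2, padding=1),  # 2x upsample
                ResidualBlock(out_ch),
            ]
            ch = out_ch
        self.stages = nn.Sequential(*stages)
        self.head = nn.Sequential(nn.Conv2d(ch, 3, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.stages(x))


if __name__ == "__main__":
    decoder = ProgressiveDecoder(output_size=64)
    dummy_embeddings = torch.randn(2, 1280, 16, 16)  # stand-in for cached IJePA features
    print(decoder(dummy_embeddings).shape)  # torch.Size([2, 3, 64, 64])
```

With a 16×16 embedding grid, CIFAR-10 (32×32) needs one upsampling stage and Tiny ImageNet (64×64) needs two.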
The model uses a weighted combination of five loss components (a sketch of the combined loss follows the list):
- MSE Loss (weight: 1.0): Pixel-level reconstruction accuracy
- SSIM Loss (weight: 0.1): Structural similarity
- LPIPS Loss (weight: 0.1): Perceptual similarity (AlexNet-based)
- TV Loss (weight: 0.01): Total variation (encourages smoothness)
- Cosine Loss (weight: 0.01): Feature space similarity
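A minimal sketch of how these terms might be combined is shown below. It assumes the third-party `lpips` and `pytorch_msssim` packages, and the tensors fed to the cosine term (generic feature vectors here) are an assumption; the project's actual loss lives in `src/training/`.

```python
# Sketch of the weighted multi-component reconstruction loss. Assumes the
# third-party `lpips` and `pytorch_msssim` packages; what exactly the cosine
# term is computed on is an assumption (see src/training/ for the real loss).
import torch
import torch.nn.functional as F
import lpips                      # pip install lpips
from pytorch_msssim import ssim   # pip install pytorch-msssim

lpips_fn = lpips.LPIPS(net="alex")  # AlexNet-based perceptual loss


def total_variation(x: torch.Tensor) -> torch.Tensor:
    """Mean absolute difference between neighbouring pixels (encourages smoothness)."""
    dh = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().mean()
    dw = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().mean()
    return dh + dw


def reconstruction_loss(pred, target, pred_feats, target_feats,
                        w_mse=1.0, w_ssim=0.1, w_lpips=0.1, w_tv=0.01, w_cos=0.01):
    """Weighted sum of MSE, SSIM, LPIPS, TV, and cosine losses.

    `pred` / `target` are images in [0, 1]; `pred_feats` / `target_feats` are the
    feature vectors used for the cosine term (assumed here, e.g. IJePA embeddings).
    """
    mse = F.mse_loss(pred, target)
    ssim_loss = 1.0 - ssim(pred, target, data_range=1.0)
    lpips_loss = lpips_fn(pred * 2 - 1, target * 2 - 1).mean()  # LPIPS expects inputs in [-1, 1]
    tv = total_variation(pred)
    cos = 1.0 - F.cosine_similarity(pred_feats.flatten(1), target_feats.flatten(1)).mean()
    return w_mse * mse + w_ssim * ssim_loss + w_lpips * lpips_loss + w_tv * tv + w_cos * cos
```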
Example results on Tiny ImageNet (10000 train, 2000 test, 50 epochs):
| Metric | Value |
|---|---|
| MSE | 0.127 |
| Cosine Similarity | 0.7851 |
| LPIPS | 0.0805 |
| SSIM | 0.7026 |
This project builds on I-JEPA (Assran et al., 2023):

```bibtex
@article{assran2023self,
title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
journal={arXiv preprint arXiv:2301.08243},
year={2023}
}
```

This project is released under the MIT License.
Contributions are welcome! Please feel free to submit a Pull Request.
