This repository implements and compares two transformer-based face recognition pipelines:
- Swin Transformer – A hierarchical vision transformer built on shifted-window self-attention, used here for facial feature extraction.
- Vision Transformer (ViT) – A non-hierarchical transformer trained from scratch for face recognition.
Our goal is to compare the representational power of the Swin Transformer and a vanilla ViT on the Labelled Faces in the Wild (LFW) dataset, and to analyse the trade-offs in training complexity, inference speed, and recognition accuracy.
- Features
- Dataset
- Installation
- Usage
- Results
- Comparison & Discussion
- Pros & Cons
- Contributing
- License
- Contact
- End-to-end training pipelines for both Swin Transformer and ViT on LFW
- Modular PyTorch implementation with configurable hyperparameters
- Automated evaluation scripts computing accuracy, ROC curves, and confusion matrices (see the metrics sketch below)
- Jupyter notebooks demonstrating experiments and visualisations
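The evaluation scripts themselves are not reproduced in this README; as a rough illustration of the metrics they report, here is a minimal, self-contained sketch using scikit-learn. The arrays `y_true`, `y_pred`, and `y_score` are toy placeholders, not outputs of this repository.

```python
# Minimal sketch of the reported metrics (accuracy, ROC curve, confusion matrix).
# y_true / y_pred / y_score are toy placeholders, not real model outputs.
import numpy as np
from sklearn.metrics import accuracy_score, auc, confusion_matrix, roc_curve

y_true = np.array([0, 0, 1, 1])            # ground-truth labels (binary toy example)
y_pred = np.array([0, 1, 1, 1])            # hard predictions
y_score = np.array([0.1, 0.6, 0.8, 0.9])   # positive-class scores

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))

fpr, tpr, _ = roc_curve(y_true, y_score)
print("ROC AUC:", auc(fpr, tpr))
```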
We use the Labelled Faces in the Wild (LFW) dataset, available on Kaggle:
Labelled Faces in the Wild (LFW) Dataset
Download and unpack into:
data/lfw-deepfunneled/
└── lfw-deepfunneled/
    ├── ...
    │   ├── ...
    │   └── ...
    └── ...
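The training scripts' data-loading code is not shown here; one straightforward way to consume the folder layout above (one sub-directory per identity) is `torchvision.datasets.ImageFolder`. The 224×224 input size and ImageNet normalisation below are common Swin/ViT defaults, assumed rather than taken from the repository's configs.

```python
# Sketch: loading the unpacked LFW folder with torchvision (assumed approach,
# not necessarily identical to the repository's own data pipeline).
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),                       # typical Swin/ViT input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

dataset = datasets.ImageFolder("data/lfw-deepfunneled/lfw-deepfunneled", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
print(f"{len(dataset)} images, {len(dataset.classes)} identities")
```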
- Clone the repository
  `git clone https://github.com/muhammadhamzagova666/face-recognition-vit-and-swin.git`
  `cd face-recognition-vit-and-swin`
- Create & activate a virtual environment
  `python3 -m venv venv`
  `source venv/bin/activate`
- Install dependencies
  `pip install --upgrade pip`
  `pip install -r requirements.txt`
- Verify installation
  `python -c "import torch; import transformers; print('Setup OK')"`
# Ensure Kaggle CLI is configured with your API token
kaggle datasets download -d jessicali9530/lfw-dataset
unzip lfw-dataset.zip -d data/lfw-deepfunneled

Train the Swin Transformer:

python swin-lfw.py \
  --data-dir data/lfw-deepfunneled \
  --output-dir experiments/swin \
  --epochs 100 \
  --batch-size 32 \
  --learning-rate 1e-4

Train the ViT:

python vit-lfw.py \
  --data-dir data/lfw-deepfunneled \
  --output-dir experiments/vit \
  --epochs 100 \
  --batch-size 64 \
  --learning-rate 1e-4

| Model | Test Accuracy (%) | Test Loss |
|---|---|---|
| Swin | 85.12 | 1.4142 |
| ViT (scratch) | 96.66 | 0.2387 |
The trained models can be found at the following link:
Face Recognition Trained Models using ViT and Swin Transformers on LFW Dataset
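The serialisation format of the released checkpoints is not documented in this README. If they were exported with Hugging Face's `save_pretrained()`, loading one for single-image inference would look roughly like the sketch below; the checkpoint directory and image path are placeholders.

```python
# Hedged sketch: loading a checkpoint for inference, assuming the model was
# saved with save_pretrained(). The paths below are placeholders.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

ckpt_dir = "experiments/swin/checkpoint"       # placeholder checkpoint directory
processor = AutoImageProcessor.from_pretrained(ckpt_dir)
model = AutoModelForImageClassification.from_pretrained(ckpt_dir).eval()

image = Image.open("face.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("Predicted identity:", model.config.id2label[logits.argmax(-1).item()])
```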
Key takeaway: On our LFW test split, the ViT trained from scratch achieved higher accuracy (96.66% vs. 85.12%) and lower loss than the Swin model, while Swin converged in far fewer epochs and delivered faster inference thanks to its hierarchical design.
- Architectural Differences
  - Swin: window-based, multi-scale self-attention; strong locality inductive bias.
  - ViT: global self-attention on fixed-size patches; requires more data to generalize.
- Training Complexity
  - Swin converges in ~25 epochs vs. ViT’s ~80 epochs for comparable performance.
  - ViT demands larger batch sizes and more careful learning-rate scheduling.
- Inference Speed
  - Swin’s hierarchical tokens reduce per-layer attention complexity, yielding ~30% faster inference (see the benchmark sketch below).
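The ~30% figure comes from this repository's experiments. For a rough latency comparison on your own hardware, a self-contained sketch like the one below can be used; it builds both architectures from default Hugging Face configs with random weights, so it is indicative of the architecture families rather than the exact variants trained here.

```python
# Rough CPU latency comparison between a Swin and a ViT classifier, built from
# default configs with random weights (indicative only; not the trained models).
import time
import torch
from transformers import (SwinConfig, SwinForImageClassification,
                          ViTConfig, ViTForImageClassification)

def mean_latency(model, runs=20):
    model.eval()
    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        for _ in range(3):                      # warm-up passes
            model(pixel_values=x)
        start = time.perf_counter()
        for _ in range(runs):
            model(pixel_values=x)
    return (time.perf_counter() - start) / runs

swin = SwinForImageClassification(SwinConfig(num_labels=10))
vit = ViTForImageClassification(ViTConfig(num_labels=10))
print(f"Swin: {mean_latency(swin) * 1000:.1f} ms/image")
print(f"ViT : {mean_latency(vit) * 1000:.1f} ms/image")
```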
| Model | Pros | Cons |
|---|---|---|
| Swin | • Faster inference • Robust to scale variations | • More complex implementation • Slightly larger memory footprint • Lower test accuracy in our LFW runs |
| ViT | • Simpler architecture • Easier to customize for new modalities • Higher test accuracy in our LFW runs | • Slower convergence and inference |
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/my-feature`)
- Commit your changes (`git commit -m 'Add feature'`)
- Push to the branch (`git push origin feature/my-feature`)
- Open a pull request
Please ensure the code passes linting and all tests before merging.
This project is licensed under the MIT License.
For questions or feedback, please open an issue or contact:
- Maintainers:
- Project URL: Face Recognition using ViT and Swin Transformers