This repository implements and compares two transformer-based face recognition pipelines:
- Swin Transformer – A hierarchical vision transformer built on shifted-window self-attention, used here for facial feature extraction.
- Vision Transformer (ViT) – A non-hierarchical transformer trained from scratch for face recognition.
Our goal is to compare the representational power of the Swin Transformer and a vanilla ViT on the Labelled Faces in the Wild (LFW) dataset, and to analyse the trade-offs in training complexity, inference speed, and recognition accuracy.
- Features
- Dataset
- Installation
- Usage
- Results
- Comparison & Discussion
- Pros & Cons
- Contributing
- License
- Contact
- End-to-end training pipelines for both Swin Transformer and ViT on LFW
- Modular PyTorch implementation with configurable hyperparameters
- Automated evaluation scripts computing accuracy, ROC curves, and confusion matrices (see the metrics sketch below)
- Jupyter notebooks demonstrating experiments and visualisations
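The evaluation scripts themselves are not reproduced in this README; as a rough illustration of the metrics they report, here is a minimal, self-contained sketch using scikit-learn. The arrays `y_true`, `y_pred`, and `y_score` are toy placeholders, not outputs of this repository.

```python
# Minimal sketch of the reported metrics (accuracy, ROC curve, confusion matrix).
# y_true / y_pred / y_score are toy placeholders, not real model outputs.
import numpy as np
from sklearn.metrics import accuracy_score, auc, confusion_matrix, roc_curve

y_true = np.array([0, 0, 1, 1])            # ground-truth labels (binary toy example)
y_pred = np.array([0, 1, 1, 1])            # hard predictions
y_score = np.array([0.1, 0.6, 0.8, 0.9])   # positive-class scores

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))

fpr, tpr, _ = roc_curve(y_true, y_score)
print("ROC AUC:", auc(fpr, tpr))
```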
We use the Labelled Faces in the Wild (LFW) dataset, available on Kaggle:
Labelled Faces in the Wild (LFW) Dataset
Download and unpack into:
data/lfw-deepfunneled/
└── lfw-deepfunneled/
    ├── ...
    │   ├── ...
    │   └── ...
    └── ...
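The training scripts' data-loading code is not shown here; one straightforward way to consume the folder layout above (one sub-directory per identity) is `torchvision.datasets.ImageFolder`. The 224×224 input size and ImageNet normalisation below are common Swin/ViT defaults, assumed rather than taken from the repository's configs.

```python
# Sketch: loading the unpacked LFW folder with torchvision (assumed approach,
# not necessarily identical to the repository's own data pipeline).
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),                       # typical Swin/ViT input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

dataset = datasets.ImageFolder("data/lfw-deepfunneled/lfw-deepfunneled", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
print(f"{len(dataset)} images, {len(dataset.classes)} identities")
```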
- Clone the repository
  `git clone https://github.com/muhammadhamzagova666/face-recognition-vit-and-swin.git`
  `cd face-recognition-vit-and-swin`
- Create & activate a virtual environment
  `python3 -m venv venv`
  `source venv/bin/activate`
- Install dependencies
  `pip install --upgrade pip`
  `pip install -r requirements.txt`
- Verify installation
  `python -c "import torch; import transformers; print('Setup OK')"`
# Ensure Kaggle CLI is configured with your API token
kaggle datasets download -d jessicali9530/lfw-dataset
unzip lfw-dataset.zip -d data/lfw-deepfunneled

Train the Swin Transformer:

python swin-lfw.py \
  --data-dir data/lfw-deepfunneled \
  --output-dir experiments/swin \
  --epochs 100 \
  --batch-size 32 \
  --learning-rate 1e-4

Train the ViT:

python vit-lfw.py \
  --data-dir data/lfw-deepfunneled \
  --output-dir experiments/vit \
  --epochs 100 \
  --batch-size 64 \
  --learning-rate 1e-4

| Model | Test Accuracy (%) | Test Loss |
|---|---|---|
| Swin | 85.12 | 1.4142 |
| ViT (scratch) | 96.66 | 0.2387 |
The trained models can be found at the following link:
Face Recognition Trained Models using ViT and Swin Transformers on LFW Dataset
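The serialisation format of the released checkpoints is not documented in this README. If they were exported with Hugging Face's `save_pretrained()`, loading one for single-image inference would look roughly like the sketch below; the checkpoint directory and image path are placeholders.

```python
# Hedged sketch: loading a checkpoint for inference, assuming the model was
# saved with save_pretrained(). The paths below are placeholders.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

ckpt_dir = "experiments/swin/checkpoint"       # placeholder checkpoint directory
processor = AutoImageProcessor.from_pretrained(ckpt_dir)
model = AutoModelForImageClassification.from_pretrained(ckpt_dir).eval()

image = Image.open("face.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("Predicted identity:", model.config.id2label[logits.argmax(-1).item()])
```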
Key takeaway: On our LFW test split, the ViT trained from scratch achieved higher accuracy (96.66% vs. 85.12%) and lower loss than the Swin model, while Swin converged in far fewer epochs and delivered faster inference thanks to its hierarchical design.
- Architectural Differences
  - Swin: window-based, multi-scale self-attention; strong locality inductive bias.
  - ViT: global self-attention on fixed-size patches; requires more data to generalize.
- Training Complexity
  - Swin converges in ~25 epochs vs. ViT’s ~80 epochs for comparable performance.
  - ViT demands larger batch sizes and more careful learning-rate scheduling.
- Inference Speed
  - Swin’s hierarchical tokens reduce per-layer attention complexity, yielding ~30% faster inference (see the benchmark sketch below).
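The ~30% figure comes from this repository's experiments. For a rough latency comparison on your own hardware, a self-contained sketch like the one below can be used; it builds both architectures from default Hugging Face configs with random weights, so it is indicative of the architecture families rather than the exact variants trained here.

```python
# Rough CPU latency comparison between a Swin and a ViT classifier, built from
# default configs with random weights (indicative only; not the trained models).
import time
import torch
from transformers import (SwinConfig, SwinForImageClassification,
                          ViTConfig, ViTForImageClassification)

def mean_latency(model, runs=20):
    model.eval()
    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        for _ in range(3):                      # warm-up passes
            model(pixel_values=x)
        start = time.perf_counter()
        for _ in range(runs):
            model(pixel_values=x)
    return (time.perf_counter() - start) / runs

swin = SwinForImageClassification(SwinConfig(num_labels=10))
vit = ViTForImageClassification(ViTConfig(num_labels=10))
print(f"Swin: {mean_latency(swin) * 1000:.1f} ms/image")
print(f"ViT : {mean_latency(vit) * 1000:.1f} ms/image")
```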
| Model | Pros | Cons |
|---|---|---|
| Swin | • Faster inference • Robust to scale variations | • More complex implementation • Slightly larger memory footprint • Lower test accuracy in our LFW runs |
| ViT | • Simpler architecture • Easier to customize for new modalities • Higher test accuracy in our LFW runs | • Slower convergence and inference |
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/my-feature`)
- Commit your changes (`git commit -m 'Add feature'`)
- Push to the branch (`git push origin feature/my-feature`)
- Open a pull request
Please ensure the code passes linting and all tests before merging.
This project is licensed under the MIT License.
For questions or feedback, please open an issue or contact:
- Maintainers:
- Project URL: Face Recognition using ViT and Swin Transformers