Real-time sign language detection using the YOLOv5 object detection framework - a final-year B.Tech project comparing YOLOv5 with ANN and CNN approaches for sign language recognition.
Overview • Motivation • Why YOLOv5 • Tech Stack • Installation • Training • Detection • Comparison
A real-time sign language detection system built with the YOLOv5 (You Only Look Once, version 5) object detection framework. The model detects and classifies sign language gestures from images and live webcam feeds with bounding box localization and confidence scores.
This is the YOLOv5 implementation of a comparative study conducted as a 7th semester B.Tech final-year project. The goal: evaluate object detection (YOLOv5) against traditional image classification approaches (ANN and CNN) for sign language recognition.
Companion Repository: The ANN and CNN baseline implementations can be found in the Sign-Language-Detection-Using-ANN-CNN repository.
Sign language is the primary mode of communication for millions of hearing-impaired individuals worldwide. However, the communication gap between signers and non-signers remains a significant barrier to inclusion.
This project aims to:
- Bridge communication gaps - Enable real-time sign language interpretation
- Compare approaches - Benchmark object detection (YOLOv5) against classification (ANN/CNN)
- Real-world usability - Build a system that works via webcam in real-time
- Academic contribution - Provide empirical data on deep learning approaches for sign language recognition
YOLOv5 is a state-of-the-art object detection framework chosen for this project because:
| Advantage | Benefit |
|---|---|
| Real-time speed | Processes frames at 30+ FPS on modern GPUs |
| Localization | Provides bounding boxes, not just classification |
| High accuracy | State-of-the-art mAP on the COCO benchmark |
| Transfer learning | Pre-trained weights enable fast training with small datasets |
| Cross-platform | Exports to ONNX, TorchScript, CoreML, and TFLite |
| Easy to use | Well-documented training and inference pipeline |
How it differs from the ANN/CNN classification baselines:

| Aspect | ANN / CNN | YOLOv5 |
|---|---|---|
| Output | Single class label | Class + bounding box + confidence |
| Input | Pre-cropped sign image | Full scene with sign |
| Real-time | Requires pre-processing | End-to-end detection |
| Multi-sign | One sign at a time | Multiple signs simultaneously |
| Use case | Static image classification | Live video / real-world scenes |
The project is built on the following stack:

| Category | Technology | Purpose |
|---|---|---|
| Language | Python 3.7+ | Core implementation |
| Deep Learning | PyTorch | YOLOv5 framework backbone |
| Detection Model | YOLOv5 (Ultralytics) | Object detection architecture |
| Computer Vision | OpenCV | Webcam capture and image processing |
| Notebook | Jupyter | Interactive training and detection |
| Model Format | PyTorch `.pt` | Serialized trained weights |
| Visualization | Matplotlib, PIL | Display detection results |
```
Sign-Language-Detection-Using-YOLO-V5/
├── README.md
├── LICENSE
│
├── Sign Language Recognition YOLO v5/
│   ├── (YOLOV5)SignLanguageRecognition.ipynb   # Main Jupyter notebook
│   └── best.pt                                 # Trained YOLOv5 weights
│
├── Result SS/                                  # Detection result screenshots
│   ├── Screenshot 2022-04-29 142723.png
│   ├── Screenshot 2022-04-29 142751.png
│   ├── Screenshot 2022-04-29 142822.png
│   ├── Screenshot 2022-04-29 143438.png
│   └── webcan visualization.png                # Real-time webcam demo
│
└── code SS/                                    # Code walkthrough screenshots
    ├── s1.png
    ├── s2.png
    ├── s3.png
    ├── s4.png
    ├── s6.png
    └── s7.png
```
- Python 3.7+
- CUDA-capable GPU (recommended for training; CPU works for inference)
- Webcam (for real-time detection)
```bash
# Clone this repository
git clone https://github.com/zishnusarker/Sign-Language-Detection-Using-YOLO-V5.git
cd Sign-Language-Detection-Using-YOLO-V5

# Clone the YOLOv5 framework
git clone https://github.com/ultralytics/yolov5
cd yolov5

# Install YOLOv5 dependencies
pip install -r requirements.txt

# Install Jupyter (if not already installed)
pip install jupyter notebook
```

To open the notebook:

```bash
cd "Sign Language Recognition YOLO v5"
jupyter notebook "(YOLOV5)SignLanguageRecognition.ipynb"
```

The notebook walks through the complete training pipeline:
- Dataset Preparation - Organize images and labels in YOLO format (`train/images`, `train/labels`)
- Custom `data.yaml` - Define classes and dataset paths (an illustrative example follows this list)
- Transfer Learning - Start from pre-trained YOLOv5s/m/l weights
- Training Command - Run `train.py` with custom hyperparameters
- Evaluation - Monitor loss, mAP, precision, and recall
- Best Weights - Trained model saved as `best.pt`
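An illustrative `data.yaml` is sketched below; the paths and class names are placeholders, not this project's actual dataset configuration:

```yaml
# Illustrative only: adjust paths and classes to your dataset
train: ../dataset/train/images   # training images; YOLOv5 looks for labels in a sibling 'labels' folder
val: ../dataset/valid/images     # validation images

nc: 5                            # number of sign classes
names: ["hello", "yes", "no", "thanks", "iloveyou"]  # placeholder class names
```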
```bash
python train.py --img 640 --batch 16 --epochs 100 \
    --data sign_language.yaml \
    --weights yolov5s.pt \
    --name sign_language_yolov5
```

To run detection on a static image:

```bash
python detect.py --weights best.pt --img 640 --conf 0.25 --source path/to/image.jpg
```

To run detection on a live webcam feed:

```bash
python detect.py --weights best.pt --img 640 --conf 0.25 --source 0
```

The system will display bounding boxes around detected signs with class labels and confidence scores in real-time.
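Beyond `detect.py`, the trained weights can also be loaded programmatically through PyTorch Hub. A minimal sketch (the image path is illustrative):

```python
import torch

# Load the custom-trained checkpoint via YOLOv5's PyTorch Hub entry point
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
model.conf = 0.25  # confidence threshold, matching the detect.py examples above

results = model("path/to/image.jpg")  # run inference on a single image
results.print()  # per-class detections with confidence scores
results.show()   # render bounding boxes on the image
```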
See the Result SS/ folder for screenshots including:
- Static image detections
- Live webcam visualization demonstrating real-time inference
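For reference, a minimal OpenCV webcam loop, roughly what `detect.py --source 0` automates; this is a sketch rather than the repository's implementation, and it assumes `best.pt` is in the working directory:

```python
import cv2
import torch

# Load the trained weights once, then detect frame by frame
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
cap = cv2.VideoCapture(0)  # default webcam

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # OpenCV captures BGR; the model expects RGB
    results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    annotated = cv2.cvtColor(results.render()[0], cv2.COLOR_RGB2BGR)
    cv2.imshow("Sign Language Detection", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()
```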
This project is part of a comparative study. The full comparison is discussed across two repositories:
| Model | Repository | Approach |
|---|---|---|
| ANN | Sign-Language-Detection-Using-ANN-CNN | Fully-connected neural network on flattened pixels |
| CNN | Sign-Language-Detection-Using-ANN-CNN | Convolutional network with feature extraction |
| YOLOv5 | This repository | Object detection with localization |
- ANN: Simple baseline, struggles with spatial features
- CNN: Better at learning hierarchical features, good for static classification
- YOLOv5: Superior for real-time detection with localization - the clear winner for real-world deployment
The model successfully detects sign language gestures with:
- Real-time webcam inference
- Bounding box localization
- Class labels with confidence scores
- Multi-sign detection in a single frame
Check the Result SS/ folder for visual examples of the model in action.
**What is YOLOv5 and how does it work?**
YOLOv5 is a single-stage object detector that divides an input image into a grid and predicts bounding boxes, class probabilities, and confidence scores for each grid cell in a single forward pass. This makes it much faster than two-stage detectors (like Faster R-CNN) while maintaining competitive accuracy.
**Why use transfer learning?**
YOLOv5 models are pre-trained on the COCO dataset (80 classes, 330K images). By starting from these weights, the model already knows how to detect generic visual features (edges, textures, shapes). Fine-tuning on a smaller sign language dataset is much faster and more effective than training from scratch.
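For instance, the COCO-pre-trained starting point can be pulled directly from PyTorch Hub. A sketch of the idea, not part of this repository's notebook:

```python
import torch

# COCO-pre-trained YOLOv5s: the starting point for fine-tuning
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
print(model.names)  # the 80 COCO class names the backbone already knows
```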
**What's inside `best.pt`?**
The `best.pt` file contains a serialized PyTorch checkpoint with the trained model weights from the epoch that achieved the best validation mAP during training. It can be loaded directly with `torch.load()` or used with YOLOv5's `detect.py` script.
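A quick way to peek inside (assuming a local `best.pt`; exact keys vary across YOLOv5 versions):

```python
import torch

# Run from inside the yolov5/ checkout so the pickled model class resolves;
# newer PyTorch releases may also need weights_only=False
ckpt = torch.load("best.pt", map_location="cpu")
print(ckpt.keys())  # typically includes 'epoch', 'model', 'optimizer', ...
```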
**What is mAP (mean Average Precision)?**
mAP is the standard evaluation metric for object detection. It measures both classification accuracy and localization quality by averaging precision across all classes at various IoU (Intersection over Union) thresholds. Higher mAP = better detection.
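To make the IoU part concrete, here is a small sketch computing IoU for two axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two partially overlapping 10x10 boxes: intersection 25, union 175
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```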
- Expand dataset to cover more sign language alphabets (ASL, BSL, ISL)
- Deploy as a web app using Flask/Streamlit with webcam streaming
- Convert model to ONNX/TFLite for mobile deployment
- Add word-level and sentence-level sign detection (temporal models like LSTM + CNN)
- Integrate text-to-speech for detected signs
- Build a full accessibility application for hearing-impaired users
- Collect diverse dataset (different skin tones, lighting, backgrounds)
- Compare with YOLOv7, YOLOv8, and other modern detectors
- YOLOv5: [Ultralytics YOLOv5 Repository](https://github.com/ultralytics/yolov5)
- Original YOLO Paper: [You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/abs/1506.02640)
- PyTorch: [PyTorch Documentation](https://pytorch.org/docs/)
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ as a B.Tech 7th Semester Final Year Project

Breaking communication barriers with computer vision