SigniVision is an AI-powered web application that detects Sign Language (SL) signs in real time from a webcam feed and generates audio feedback with a state-of-the-art transformer-based text-to-speech (TTS) model.
This project bridges the communication gap between the hearing-impaired and the general public using deep learning models such as YOLOv5 (for hand sign detection) and VITS (for audio synthesis).
- 🖐️ Real-time detection of Indian Sign Language words
- 🔍 YOLOv5 custom-trained model for sign recognition
- 🔊 Transformer-based VITS model for generating speech
- 🧠 FastAPI backend with easy RESTful endpoint
- 🌐 React frontend with live webcam capture (see /frontend)
- 🧪 CORS-enabled backend to allow frontend integration
| Layer | Technology |
|---|---|
| Frontend | React, Tailwind CSS, Lucide Icons |
| Backend | FastAPI, Python, Torch, OpenCV |
| ML Model | YOLOv5 for detection, VITS for TTS |
| Utilities | PIL, NumPy, transformers, torch.hub |
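The detection side of the stack is a custom-trained YOLOv5 model shipped as model/best.pt (see the project tree below). A minimal sketch of how such custom weights are typically loaded and run via torch.hub is shown here; the repository's actual loading code in model_backend/main.py may differ, and the sample image path is only illustrative.

```python
import torch
from PIL import Image

# Load the custom-trained YOLOv5 weights via torch.hub.
# Assumes the weights file lives at model/best.pt, as in the project tree below.
model = torch.hub.load("ultralytics/yolov5", "custom", path="model/best.pt")
model.conf = 0.5  # confidence threshold for detections

# Run inference on a single frame (a PIL image or NumPy array both work).
img = Image.open("sample_sign.jpg")  # placeholder path
results = model(img)

# Each row is one detection: bounding box, confidence, class id, class name.
detections = results.pandas().xyxy[0]
if len(detections):
    top = detections.sort_values("confidence", ascending=False).iloc[0]
    print(f"Detected sign: {top['name']} ({top['confidence']:.2f})")
else:
    print("No sign detected.")
```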
```
SigniVision/
│
├── model_backend/       # FastAPI backend with YOLOv5 + VITS integration
│   ├── main.py
│   └── ...
│
├── model/               # YOLOv5 model directory
│   ├── best.pt
│   └── signlang.ipynb   # for model fine-tuning
│
├── frontend/            # React frontend with webcam and UI
│
├── f2/                  # training dataset
│
├── assets/              # Assets like images, PDFs, and demos
│   └── presentation.pdf
│
├── README.md
└── requirements.txt
```
```bash
git clone https://github.com/XML-project-2k25/SigniVision.git
cd SigniVision
cd model_backend
python3 -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
pip install -r requirements.txt
uvicorn main:app --reload
```

The API will be live at http://localhost:8000.
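For orientation, a stripped-down sketch of what a /predict/ endpoint in model_backend/main.py could look like is shown below. It is an illustration under assumptions (the weights path, the form-field name, and the response keys such as `detected` and `label` are placeholders), not the project's actual code; the real backend also runs the VITS step described further down.

```python
import io

import torch
from fastapi import FastAPI, File, UploadFile
from fastapi.middleware.cors import CORSMiddleware
from PIL import Image

app = FastAPI()

# CORS so the React dev server (http://localhost:5173) can call the API.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# Custom-trained YOLOv5 weights (path assumed from the project tree).
model = torch.hub.load("ultralytics/yolov5", "custom", path="../model/best.pt")

@app.post("/predict/")
async def predict(file: UploadFile = File(...)):
    # Decode the uploaded frame into a PIL image.
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")

    # Run YOLOv5 detection and keep the highest-confidence sign, if any.
    detections = model(image).pandas().xyxy[0]
    if detections.empty:
        return {"detected": False, "label": None}

    best = detections.sort_values("confidence", ascending=False).iloc[0]
    # In the full app, a Base64 WAV of the spoken label would be added here.
    return {
        "detected": True,
        "label": best["name"],
        "confidence": float(best["confidence"]),
    }
```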
Send a POST request to /predict/ with an image file:
```bash
curl -X POST http://localhost:8000/predict/ -F "[email protected]"
```
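The same request can be made from Python and the returned audio decoded for playback. In the sketch below, the multipart field name (`file`) and the response keys (`label`, `audio`) are illustrative assumptions about the API's contract, not a documented schema.

```python
import base64

import requests

# Post one captured frame to the backend.
# Field name "file" and keys "label"/"audio" are assumptions, not a documented schema.
with open("sample_sign.jpg", "rb") as f:  # placeholder image path
    resp = requests.post(
        "http://localhost:8000/predict/",
        files={"file": ("sample_sign.jpg", f, "image/jpeg")},
    )
resp.raise_for_status()
data = resp.json()

print("Detected sign:", data.get("label"))

# If the response carries Base64-encoded WAV audio, save it for playback.
if data.get("audio"):
    with open("speech.wav", "wb") as out:
        out.write(base64.b64decode(data["audio"]))
```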
```bash
cd frontend
npm install
npm run dev
```

The frontend will run at http://localhost:5173.
- The detected sign class name (e.g., "Thank You") is passed to the VitsTTS class.
- VITS generates speech using the pretrained model kakao-enterprise/vits-ljs (a minimal sketch of this step follows the list).
- The output audio is encoded as Base64 WAV and sent back via the API.
- The frontend decodes and plays the speech audio in the browser.
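A minimal sketch of that TTS step, using the Hugging Face transformers VITS implementation, could look like the following. The project's actual VitsTTS class may wrap this differently; the helper name and the 16-bit PCM conversion here are illustrative choices, not the repository's code.

```python
import base64
import io

import torch
from scipy.io import wavfile
from transformers import AutoTokenizer, VitsModel

# Pretrained English VITS model used by the project.
tokenizer = AutoTokenizer.from_pretrained("kakao-enterprise/vits-ljs")
model = VitsModel.from_pretrained("kakao-enterprise/vits-ljs")

def synthesize_base64_wav(text: str) -> str:
    """Turn a detected sign label (e.g. "Thank You") into a Base64 WAV string.

    Illustrative helper, not the project's VitsTTS class.
    """
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (1, num_samples)

    # Write a 16-bit PCM WAV into memory, then Base64-encode it for the API response.
    audio = waveform.squeeze(0).numpy()
    buffer = io.BytesIO()
    wavfile.write(buffer, model.config.sampling_rate, (audio * 32767).astype("int16"))
    return base64.b64encode(buffer.getvalue()).decode("ascii")

print(synthesize_base64_wav("Thank You")[:60], "...")
```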
- Video call feature
- Support for dynamic ISL gestures
- Multilingual audio output
- Mobile version using React Native
- Improved UI/UX with gesture history and chat overlay