A modular sign language detection system that uses MediaPipe and OpenCV for hand tracking, and PyTorch for ASL (American Sign Language) word classification. Built to be integrated into larger applications for accessibility.
The goal is a sign language model that can detect full sentences, designed for POV (point-of-view) camera use so that blind people can interact with sign language users.
- Real-time ASL Detection: Landmark-based classification using MediaPipe and PyTorch
- Context-Aware Dictionary: Maps letters to meaningful words (C → "Coffee", H → "Hello")
- REST API: FastAPI endpoints for file upload and base64 image processing
- Raspberry Pi Integration: Optimized for edge devices with semantic routing
- Hackathon Ready: Social interaction features for accessibility demonstrations
```
sign_language_detection/
├── api/
│   ├── app.py                     # FastAPI application with context support
│   ├── test_client.py             # API test client
│   └── README.md                  # API documentation
├── features/sign_language/
│   ├── hand_tracker.py            # MediaPipe hand detection
│   ├── word_classifier.py         # PyTorch classifier
│   ├── context_dictionary.py      # Context-aware word mapping
│   └── dataset.py                 # PyTorch Dataset class
├── scripts/
│   ├── preprocess_fast.py         # Fast preprocessing (200 samples/class)
│   ├── train_word_model.py        # Training script
│   └── evaluate_test.py           # Test set evaluation
├── data/
│   ├── asl_alphabet_train/        # Downloaded training images
│   ├── asl_alphabet_test/         # Downloaded test images
│   └── processed/                 # Preprocessed landmark data
│       └── train/
│           ├── landmarks.npz
│           └── label_mapping.npy
├── models/
│   ├── best_model.pth             # Trained ASL classifier (2MB)
│   └── final_model.pth            # Final checkpoint (673KB)
├── docs/
│   ├── DATASET.md                 # Dataset acquisition guide
│   └── CONTEXT_DICTIONARY.md      # Context dictionary documentation
├── examples/
│   ├── test_context_dictionary.py     # Context demo script
│   └── custom_context_dictionary.json # Example custom dictionary
├── Dockerfile                     # Production container
├── docker-compose.yml             # Docker deployment config
├── requirements.txt               # All dependencies (dev + production)
└── README.md                      # This file
```
Create and activate the conda environment (Python 3.11 required for MediaPipe compatibility):
```bash
conda create -n asl_env python=3.11 -y
conda activate asl_env
pip install -r requirements.txt
```

Download the ASL Alphabet dataset from Kaggle:

```bash
# Install Kaggle CLI and set up credentials first (see docs/DATASET.md)
kaggle datasets download -d grassknoted/asl-alphabet --unzip -p data/
```

Extract MediaPipe hand landmarks from images:

```bash
python scripts/preprocess_fast.py
```

Train the ASL word classifier:

```bash
python scripts/train_word_model.py
```

Evaluate the model on the test dataset:

```bash
python scripts/evaluate_test.py
```

Start the FastAPI server:

```bash
python -m uvicorn api.app:app --host 0.0.0.0 --port 8000
```

Then visit the interactive API docs at http://localhost:8000/docs.

Test the context-aware dictionary feature:

```bash
python examples/test_context_dictionary.py
```

See docs/CONTEXT_DICTIONARY.md for full documentation.
The context dictionary maps ASL letters to meaningful words AND full sentences for natural communication:
Full Sentences:
- C → "I need a coffee", "It's too cold", "Can I have some cake?"
- H → "Hello! How are you?", "Help me please", "I'm hungry"
- T → "Thank you so much", "I'm thirsty", "I'm very tired"
- W → "You're welcome", "I need water", "Please wait"
Quick Words:
- A → "Apple", "Again", "Attention", "Ahead"
- Y → "Yes, I agree", "You are right", "Your turn"
Example API response:

```json
{
  "predicted_sign": "C",
  "confidence": 0.95,
  "contextual_meaning": "I need a coffee",
  "alternative_contexts": [
    "It's too cold",
    "Can I have some cake?",
    "Cheese please"
  ]
}
```

Use Cases:
- Blind users hear "I need a coffee" instead of just "C" via text-to-speech
- Complete sentences provide full context without spelling
- Natural communication in care settings (hospital, home)
- Hackathon demos for realistic social interaction
See docs/CONTEXT_DICTIONARY.md for custom dictionaries and advanced usage.
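As a rough illustration, a custom dictionary could be generated as JSON from Python. The schema below (letters as keys, ordered phrase lists as values, first entry as the default meaning) is an assumption based on the examples above; see examples/custom_context_dictionary.json and docs/CONTEXT_DICTIONARY.md for the actual format.

```python
import json

# Hypothetical custom dictionary: each letter maps to an ordered list of phrases,
# where the first entry is the default contextual meaning and the rest are
# alternatives. The exact schema used by context_dictionary.py may differ.
custom_dictionary = {
    "C": ["I need a coffee", "It's too cold", "Can I have some cake?"],
    "H": ["Hello! How are you?", "Help me please", "I'm hungry"],
    "W": ["You're welcome", "I need water", "Please wait"],
}

with open("my_context_dictionary.json", "w") as f:
    json.dump(custom_dictionary, f, indent=2)
```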
| Metric | Value |
|---|---|
| Training Accuracy | 82.40% |
| Test Accuracy | 59.26% (16/27 correct) |
| API Prediction Confidence | 99.58% (on high-quality images) |
| Classes Supported | 28 (A-Z, del, space) |
| Model Size | 210 KB |
| Inference Time | ~50ms per image (CPU) |
Strong performers (>90% confidence):
- B, C, F, G, I, J, L, M, Q, W, X, Y
Detection challenges:
- Hand detection failures on: A, H, N, O, space
- Similar sign confusion: P↔Q, R↔G, T↔I, U↔R
- Uses MediaPipe Hands for real-time hand landmark detection
- Extracts 21 3D landmarks (x, y, z) per hand
- Optimized for single-hand detection
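A minimal sketch of this landmark extraction step, assuming MediaPipe's Python Hands solution; the repo's hand_tracker.py may structure this differently.

```python
import cv2
import mediapipe as mp
import numpy as np

# Sketch only: one hand, static images, default detection thresholds.
hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

def extract_landmarks(image_bgr):
    """Return a flat (63,) array of (x, y, z) for 21 hand landmarks, or None."""
    results = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None  # no hand detected (see "Detection challenges" above)
    landmarks = results.multi_hand_landmarks[0].landmark
    return np.array([[lm.x, lm.y, lm.z] for lm in landmarks], dtype=np.float32).flatten()
```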
- Simple feedforward neural network (PyTorch)
- Input: 63 features (21 landmarks × 3 coordinates)
- Hidden layers: 512 → 256 neurons with dropout
- Output: 28 classes (A-Z, del, space)
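A minimal sketch of such a classifier, using the layer sizes described above; the exact architecture and dropout rate in word_classifier.py may differ.

```python
import torch
import torch.nn as nn

class ASLWordClassifier(nn.Module):
    """Feedforward classifier: 63 landmark features -> 28 sign classes."""

    def __init__(self, num_classes: int = 28, dropout: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(63, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # raw logits; apply softmax to get confidence scores
```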
- PyTorch Dataset for loading preprocessed landmarks
- Handles label mapping and normalization
- Efficient loading from compressed .npz files
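A minimal sketch of such a Dataset, assuming the .npz file stores arrays under the keys "landmarks" and "labels" (the actual key names in dataset.py may differ).

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class LandmarkDataset(Dataset):
    """Loads preprocessed hand landmarks and labels from a compressed .npz file."""

    def __init__(self, npz_path: str):
        data = np.load(npz_path)
        # Key names are assumptions for this sketch.
        self.features = torch.tensor(data["landmarks"], dtype=torch.float32)
        self.labels = torch.tensor(data["labels"], dtype=torch.long)

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]
```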
- Production-ready REST API
- File upload and base64 prediction endpoints
- CORS enabled for Raspberry Pi integration
- Health checks and monitoring
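A rough sketch of the base64 endpoint's shape, assuming the request body carries an image_base64 field as in the Raspberry Pi example further below; api/app.py is the source of truth.

```python
import base64

import cv2
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Base64Image(BaseModel):
    image_base64: str

@app.post("/predict/base64")
def predict_base64(payload: Base64Image):
    # Decode the base64 JPEG into a BGR image
    img_bytes = base64.b64decode(payload.image_base64)
    frame = cv2.imdecode(np.frombuffer(img_bytes, np.uint8), cv2.IMREAD_COLOR)
    # ... run landmark extraction, the classifier, and the context lookup here ...
    return {"predicted_sign": "C", "confidence": 0.95}  # placeholder values
```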
```bash
# Start API server
python -m uvicorn api.app:app --reload --port 8000

# Test API
python api/test_client.py

# View interactive docs
open http://localhost:8000/docs
```

```bash
# Build and run with Docker Compose
docker-compose up -d

# View logs
docker-compose logs -f

# Stop service
docker-compose down
```

See api/README.md for complete deployment instructions.
- `GET /health` - Health check
- `GET /classes` - List supported classes
- `POST /predict` - Predict from image file upload
- `POST /predict/base64` - Predict from base64-encoded image (Raspberry Pi integration)
Full API documentation: http://localhost:8000/docs
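For a quick smoke test of these endpoints with the requests library (the multipart field name "file" and the image path are assumptions; api/test_client.py shows the exact request format):

```python
import requests

BASE_URL = "http://localhost:8000"

# Health check and supported classes
print(requests.get(f"{BASE_URL}/health").json())
print(requests.get(f"{BASE_URL}/classes").json())

# Predict from an image file upload (field name "file" assumed)
with open("path/to/sign_image.jpg", "rb") as f:  # replace with a real test image
    response = requests.post(f"{BASE_URL}/predict", files={"file": f})
print(response.json())
```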
The model recognizes 28 ASL signs:
- Letters: A-Z
- Special: del (delete), space
This Sign Language Detection module is part of the SenseAI accessibility system.
```
┌─────────────────┐
│   Web Camera    │
│ (Raspberry Pi)  │
└────────┬────────┘
         │
         ▼
┌─────────────────────────┐
│     Semantic Router     │
│  (Gemini + OpenRouter)  │
└───┬─────────┬─────────┬─┘
    │         │         │
    ▼         ▼         ▼
┌─────────┐ ┌─────────┐ ┌──────────────┐
│   TTS   │ │  Face   │ │Sign Language │
│   API   │ │   API   │ │     API      │
└─────────┘ └─────────┘ └──────────────┘
 (Eleven     (Custom)    (This Module)
  Labs)
```

All APIs are deployed on Digital Ocean.
Example client call from the Raspberry Pi:

```python
import base64

import cv2
import requests

# Capture a single frame and encode it as a base64 JPEG
cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()

_, buffer = cv2.imencode('.jpg', frame)
img_b64 = base64.b64encode(buffer).decode()

# Call the Sign Language API
response = requests.post(
    "http://your-api:8000/predict/base64",
    json={"image_base64": img_b64}
)
print(response.json())
```

Future improvements:
- Add temporal models (LSTM/Transformer) for sentence detection
- Add data augmentation for improved robustness
- Support for continuous sign language (not just fingerspelling)
- Mobile deployment (TFLite/CoreML conversion)
- Multi-hand support for two-handed signs
Source: Kaggle ASL Alphabet Dataset
- Training: 87,000 images (3,000 per class × 29 classes)
- Format: 200×200 RGB images
- Classes: 29 (A-Z + del + nothing + space)
- License: See Kaggle dataset page
For more details, see docs/DATASET.md.
- Keep feature modules independent for easy integration
- All data preprocessing outputs go to `data/processed/`
- Models are saved in `models/`
- Scripts are in `scripts/`

To add a new feature:
- Create a new module in `features/sign_language/`
- Import existing components as needed
- Update `conversational_asl.py` for integration
- Document it in this README
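A purely hypothetical sketch of how a new module might compose the existing components; the imported class and method names are assumptions, not the repo's actual API.

```python
# features/sign_language/my_new_feature.py (hypothetical)
# Class and method names below are illustrative only; check the actual modules.
from features.sign_language.hand_tracker import HandTracker
from features.sign_language.word_classifier import WordClassifier
from features.sign_language.context_dictionary import ContextDictionary


class SignToSpeechPipeline:
    """Example composition: frame -> landmarks -> letter -> contextual sentence."""

    def __init__(self):
        self.tracker = HandTracker()
        self.classifier = WordClassifier()
        self.dictionary = ContextDictionary()

    def process_frame(self, frame):
        landmarks = self.tracker.extract(frame)                   # assumed method
        if landmarks is None:
            return None
        letter, confidence = self.classifier.predict(landmarks)   # assumed method
        return self.dictionary.lookup(letter)                     # assumed method
```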
This project is for educational and accessibility purposes.
- MediaPipe team for excellent hand tracking
- Kaggle ASL Alphabet dataset contributors
- PyTorch community
Built with ❤️ for accessibility
Design notes from the original planning discussion:

The overall system is planned around three features: Text-to-Speech and Speech-to-Text using Eleven Labs, a people-recognition model, and this Sign Language Detection model. Each feature gets its own API endpoint, which is Dockerized and deployed on Digital Ocean.

A Semantic Router built on Gemini, exposed through an OpenRouter interface, inspects the camera frames and text and decides which API to call. The image and audio frames come from a web camera, and the semantic router runs on the Raspberry Pi.

Within the Sign Language feature, the model has been built, so the next step is an API endpoint that serves the optimized model and returns predictions.

Letter-level detection has been tested and works well. To support social interaction, predictions are mapped through a dictionary (for example, the letter C means "Coffee" and the letter A means "Apple"): when an image frame contains the hand gesture for C, the system looks up the dictionary and returns the corresponding word or sentence, depending on what the dictionary table contains. Since this is a hackathon project whose goal is to demo the device as a social-interaction aid for disabled people, this is an effective strategy.