MouthMap is an advanced deep learning project that interprets lip movements from video and transcribes them into text. Using a sophisticated architecture of 3D Convolutional Neural Networks (CNNs) and Bidirectional LSTMs with a CTC loss function, MouthMap pushes the boundaries of silent speech recognition.
This research-focused project combines computer vision and sequence modeling to support communication for people with speech impairments and to enable richer human-computer interaction.
- 🎥 Video-to-Text Transcription: Converts lip movements in videos directly into coherent text sentences
- 🧠 State-of-the-Art Architecture: Built on a powerful 3D CNN + Bi-LSTM model to capture complex spatio-temporal features
- ⚙️ End-to-End Pipeline: Complete workflow from video preprocessing and data loading to model training and inference
- 🚀 Real-time Potential: Engineered for efficiency, laying the groundwork for live transcription applications
- 🧩 Modular & Customizable: Well-structured code making it easy to adapt, extend, and experiment with
- 🌐 Live Web App: Interactive Streamlit application for testing and demonstration
| Use Case | Description |
|---|---|
| Accessibility | Empowering hearing-impaired individuals with advanced lip reading assistance |
| Smart Devices | Silent commands and natural interaction with smart devices |
| Security | Silent communication analysis in security applications |
Our interactive Streamlit app is now live! While accuracy with custom videos is continuously being improved, you can test its capabilities with the provided sample videos or upload your own.
Download 40th Epoch Model Checkpoint
Access our pre-trained model weights from Kaggle Models - ready for inference and fine-tuning.
```mermaid
graph TB
A[Video Input<br/>75 frames × 46×140] --> B[3D Conv Layer 1<br/>128 filters]
B --> C[3D Conv Layer 2<br/>256 filters]
C --> D[3D Conv Layer 3<br/>75 filters]
D --> E[Reshape<br/>Flatten spatial dims]
E --> F[Bidirectional LSTM 1<br/>128 units + 50% Dropout]
F --> G[Bidirectional LSTM 2<br/>128 units + 50% Dropout]
G --> H[Dense Layer<br/>41 units + Softmax]
H --> I[CTC Decoding<br/>Text Output]
style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#f3e5f5
style D fill:#f3e5f5
style E fill:#e8f5e8
style F fill:#fff3e0
style G fill:#fff3e0
style H fill:#fce4ec
style I fill:#e1f5fe
```
| Layer Type | Configuration | Purpose |
|---|---|---|
| Input Layer | (75, 46, 140, 1) | 75 frames of 46×140 grayscale video |
| 3D Conv + MaxPool | Filters: 128 → 256 → 75; Kernels: (3,3,3) | Hierarchical feature extraction from spatio-temporal data |
| Reshape Layer | Flattens spatial dimensions | Prepares data for sequential processing |
| Bidirectional LSTM | 2 layers × 128 units; 50% Dropout | Captures temporal dependencies in speech patterns |
| Dense Output | 41 units + Softmax | Character probability distribution for CTC decoding |
| CTC Loss | Connectionist Temporal Classification | Alignment-free sequence-to-sequence learning |
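The table above maps directly onto a Keras Sequential model. Below is a minimal sketch of that stack; the pooling sizes, padding, and activation choices are assumptions where the table does not specify them, so treat it as illustrative rather than the repo's exact `build_model()`.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(vocab_size: int = 41) -> tf.keras.Model:
    """Sketch of the 3D CNN + Bi-LSTM + CTC head described above (illustrative)."""
    return tf.keras.Sequential([
        layers.Input(shape=(75, 46, 140, 1)),      # 75 grayscale frames of 46x140

        layers.Conv3D(128, 3, padding='same', activation='relu'),
        layers.MaxPool3D((1, 2, 2)),               # pool spatial dims only, keep 75 time steps
        layers.Conv3D(256, 3, padding='same', activation='relu'),
        layers.MaxPool3D((1, 2, 2)),
        layers.Conv3D(75, 3, padding='same', activation='relu'),
        layers.MaxPool3D((1, 2, 2)),

        # Reshape: flatten the spatial dims of each frame -> (75, features)
        layers.Reshape((75, -1)),

        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Dropout(0.5),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Dropout(0.5),

        # One probability distribution over the character set per time step,
        # consumed by the CTC loss / decoder
        layers.Dense(vocab_size, activation='softmax'),
    ])
```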
```python
import gdown
import os
# Create data directory
if not os.path.exists('data'):
    os.makedirs('data')
# Download and extract dataset
url = 'https://drive.google.com/uc?id=1YlvpDLix3S-U8fd-gqRwPcWXAXm8JwjL'
output = 'data/data.zip'
gdown.download(url, output, quiet=False)
gdown.extractall(output, 'data/')
```

```
data/
├── s1/                     # Video files (.mpg)
│   ├── bbal6n.mpg          # Sample: "bin blue at l six now"
│   ├── pgue7s.mpg
│   └── ...
├── alignments/             # Text alignment files
│   └── s1/
│       ├── bbal6n.align    # Corresponding text alignments
│       ├── pgue7s.align
│       └── ...
└── preprocessed/           # Processed data (created during training)
    ├── frames/
    └── labels/
```
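The `.align` files pair each clip with its transcript. Here is a minimal parsing sketch, assuming the GRID-style format of one `start end word` triple per line with `sil` marking silence; the repo's `load_alignments()` may differ in detail.

```python
def load_alignments(path):
    """Read a GRID-style .align file and return the spoken words.

    Each line is expected to look like "<start> <end> <word>",
    e.g. "0 23750 sil" or "23750 29500 bin".
    """
    words = []
    with open(path, 'r') as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3 and parts[2] != 'sil':   # skip silence markers
                words.append(parts[2])
    return words

# Example (hypothetical path):
# load_alignments('data/alignments/s1/bbal6n.align')
# -> ['bin', 'blue', 'at', 'l', 'six', 'now']
```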
- Python 3.8+
- TensorFlow 2.x
- OpenCV (cv2)
- NumPy, Matplotlib
- FFmpeg (for video processing)
```bash
# 1. Clone the repository
git clone https://github.com/d-kavinraja/MouthMap.git
cd MouthMap
# 2. Create and activate virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Download dataset (run the Python script above)
python download_dataset.py
```

```python
import tensorflow as tf
import numpy as np
from your_project_utils import load_video, num_to_char, build_model, CTCLoss
# 1. Build and load the pre-trained model
model = build_model()
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
loss=CTCLoss
)
model.load_weights('./models/checkpoint.weights.h5')
# 2. Load and preprocess video
video_path = './data/s1/bbal6n.mpg'
frames = load_video(video_path)
video_tensor = tf.expand_dims(frames, axis=0) # Add batch dimension
# 3. Make prediction
yhat = model.predict(video_tensor)
# 4. Decode CTC output to text
decoded = tf.keras.backend.ctc_decode(yhat, input_length=[75], greedy=True)[0][0].numpy()
predicted_text = tf.strings.reduce_join(num_to_char(decoded)).numpy().decode('utf-8')
print(f"✅ Predicted Text: {predicted_text}")
# Example Output: "bin blue at l six now"
```
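The snippet above relies on a `load_video()` helper to turn an .mpg clip into the (75, 46, 140, 1) tensor the model expects. A minimal sketch follows, assuming a fixed mouth-region crop and per-video standardization; the crop coordinates are placeholders, not necessarily the repo's values.

```python
import cv2
import tensorflow as tf

def load_video(path):
    """Read a clip, convert to grayscale, crop the mouth region, and normalize."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        mouth = gray[190:236, 80:220]        # placeholder crop -> 46x140 pixels
        frames.append(mouth[..., None])      # add channel dimension
    cap.release()

    frames = tf.cast(tf.stack(frames), tf.float32)
    mean = tf.reduce_mean(frames)
    std = tf.math.reduce_std(frames)
    return (frames - mean) / std             # shape (75, 46, 140, 1) for GRID clips
```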
```python
# Configure training parameters
config = {
'batch_size': 2,
'epochs': 100,
'learning_rate': 0.0001,
'optimizer': 'adam'
}
# Prepare datasets
train_data = create_dataset('train')
val_data = create_dataset('validation')
# Define callbacks
callbacks = [
tf.keras.callbacks.ModelCheckpoint('./models/checkpoint.weights.h5'),
tf.keras.callbacks.LearningRateScheduler(scheduler),
ProduceExample(val_data)
]
# Train the model
history = model.fit(
train_data,
epochs=config['epochs'],
validation_data=val_data,
callbacks=callbacks
)
```
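The training snippet references a `scheduler` function and a `ProduceExample` callback that are defined elsewhere in the repo. A minimal sketch of what they might look like, assuming a simple exponential learning-rate decay and a callback that decodes one validation batch per epoch (both illustrative, not the repo's exact code):

```python
import tensorflow as tf

def scheduler(epoch, lr):
    """Hold the initial learning rate for 30 epochs, then decay exponentially."""
    return lr if epoch < 30 else lr * tf.math.exp(-0.1)

class ProduceExample(tf.keras.callbacks.Callback):
    """Decode one validation batch at the end of every epoch as a sanity check."""

    def __init__(self, dataset):
        super().__init__()
        self.dataset = dataset.as_numpy_iterator()

    def on_epoch_end(self, epoch, logs=None):
        frames, labels = next(self.dataset)
        yhat = self.model.predict(frames)
        decoded = tf.keras.backend.ctc_decode(
            yhat, input_length=[75] * len(yhat), greedy=True
        )[0][0].numpy()
        for pred, truth in zip(decoded, labels):
            pred = pred[pred != -1]  # drop CTC padding
            # num_to_char: the inverse character lookup used elsewhere in the repo
            print('Truth:     ', tf.strings.reduce_join(num_to_char(truth)).numpy().decode('utf-8'))
            print('Prediction:', tf.strings.reduce_join(num_to_char(pred)).numpy().decode('utf-8'))
```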
```bash
pip install streamlit
# Run the app
streamlit run app.py
# Access at http://localhost:8501
```

- Architecture: 3D CNN + Bidirectional LSTM
- Training Epochs: 100 (40th epoch checkpoint available)
- CTC Loss: Optimized for sequence alignment
- Vocabulary Size: 41 characters (a-z, 0-9, special characters)
- Input Shape: 75 frames × 46×140 pixels
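For reference, the `char_to_num` / `num_to_char` mappings used in inference are typically built with Keras `StringLookup` layers. The sketch below is illustrative; the exact character set may not match the repo's 41-entry vocabulary character for character.

```python
import tensorflow as tf

# Illustrative character set: lowercase letters, digits, light punctuation, space.
# The repo's actual vocabulary may differ; only the 41-unit output size is documented.
vocab = list("abcdefghijklmnopqrstuvwxyz0123456789'? ")

# Map characters to integer ids, and ids back to characters
char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")
num_to_char = tf.keras.layers.StringLookup(
    vocabulary=char_to_num.get_vocabulary(), oov_token="", invert=True
)

# Round-trip example
ids = char_to_num(tf.strings.unicode_split("bin blue at l six now", input_encoding="UTF-8"))
text = tf.strings.reduce_join(num_to_char(ids)).numpy().decode("utf-8")
print(text)  # -> "bin blue at l six now"
```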
| Original Text | Model Prediction | Accuracy |
|---|---|---|
| "bin blue at l six now" | "bin blue at l six now" | ✅ Perfect |
| "place green with t zero please" | "place green with t zero please" | ✅ Perfect |
| "set red by v four now" | "set red by v four now" | ✅ Perfect |
```
MouthMap/
├── models/              # Model architectures and weights
├── data/                # Dataset and preprocessing
├── utils/               # Utility functions
├── app.py               # Streamlit web application
├── train.py             # Training script
├── inference.py         # Inference utilities
├── requirements.txt     # Dependencies
└── README.md            # This file
```
- `load_video()`: Video preprocessing and frame extraction
- `load_alignments()`: Text alignment processing
- `build_model()`: Model architecture definition
- `CTCLoss()`: Custom CTC loss implementation
- `char_to_num` / `num_to_char`: Character encoding/decoding
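Of these, `CTCLoss` is typically a thin wrapper around Keras' built-in `ctc_batch_cost`. A minimal sketch of such a wrapper, assuming fixed 75-frame inputs; the repo's implementation may differ in detail.

```python
import tensorflow as tf

def CTCLoss(y_true, y_pred):
    """CTC loss for batches of fixed-length (75-step) output sequences.

    y_true: padded integer character labels, shape (batch, label_len)
    y_pred: per-frame character probabilities, shape (batch, 75, vocab_size)
    """
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")

    input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")

    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```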
We welcome contributions from the community! Here's how you can help:
- 🐛 Bug Reports: Found an issue? Create an issue
- 💡 Feature Requests: Have ideas for improvements?
- 🔧 Code Contributions: Submit PRs for bug fixes and features
- 📖 Documentation: Help improve our documentation
- 🧪 Testing: Test the system with different videos and datasets
```bash
# 1. Fork the repository
# 2. Create a feature branch
git checkout -b feature/amazing-feature
# 3. Make your changes and test
python -m pytest tests/
# 4. Commit your changes
git commit -m "Add amazing feature"
# 5. Push and create a Pull Request
git push origin feature/amazing-feature
```

```mermaid
timeline
    title MouthMap Development Roadmap
    ✅ Phase 1 : Core Model Development
        : 3D CNN + LSTM Architecture
        : CTC Loss Implementation
    ✅ Phase 2 : Web Application
        : Streamlit App Deployment
        : Model Export to Kaggle
    🔄 Phase 3 : Enhanced Accuracy
        : Attention Mechanisms
        : Data Augmentation
    📅 Phase 4 : Production Features
        : Real-time Processing
        : Mobile App Development
        : Multi-language Support
```
- Attention Mechanisms for improved sequence modeling
- Data Augmentation for better generalization
- Real-time Video Processing with WebRTC
- Mobile Application (iOS/Android)
- Multi-language Support (Spanish, French, German)
- Cloud API for scalable deployment
- Edge Device Optimization (Raspberry Pi, Jetson)
Built with ❤️ using cutting-edge technologies
Special thanks to the open-source community and researchers advancing lip reading technology.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use MouthMap in your research, please cite:
```bibtex
@software{mouthmap2024,
title={MouthMap: Deep Learning Lip Reading System},
author={Kavin Raja D},
year={2024},
url={https://github.com/d-kavinraja/MouthMap},
note={Streamlit App: https://mouthmap.streamlit.app/}
}
```

| Platform | Link | Purpose |
|---|---|---|
| 🌐 Live Demo | mouthmap.streamlit.app | Try the application |
| 🐛 Issues | GitHub Issues | Bug reports & feature requests |
| 💬 Discussions | GitHub Discussions | Community Q&A |
| 📥 Model | Kaggle Models | Pre-trained weights |
Questions or feedback? Feel free to reach out through GitHub issues or discussions!
⭐ Star this repository if you find it helpful!
Made with ❤️ by Team MouthMap
