Word-Level Lip Reading Using Visual Information

A comparative study of five deep learning models for Visual Speech Recognition (VSR), developed for the DSE312: Computer Vision course at IISER Bhopal. This project explores solutions to the "viseme ambiguity" problem using spatiotemporal (3D) convolutions, GRUs, and context-aware aggregation over frames (referred to as attention throughout this README).

Abstract & Problem Statement

Audio-based speech recognition degrades in noisy environments and cannot work at all when audio is unavailable. Lip reading offers an alternative but faces two core challenges:

Viseme Ambiguity: Many phonemes look nearly identical on the lips (e.g., /p/, /b/, /m/), making words like "pat", "bat", and "mat" hard to distinguish from video alone.
Spatiotemporal Complexity: The model must capture rapid, subtle changes in lip shape over time.
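
The viseme ambiguity described above can be illustrated with a small sketch. The phoneme-to-viseme grouping and the phoneme spellings below are simplified assumptions made for this example (viseme inventories differ between studies); they are not taken from the project code.

```python
# Illustrative phoneme-to-viseme grouping. This mapping is an assumption
# for the sketch, not a standard inventory and not the project's own.
PHONEME_TO_VISEME = {
    "p": "BILABIAL",
    "b": "BILABIAL",
    "m": "BILABIAL",
    "ae": "OPEN_VOWEL",
    "t": "ALVEOLAR",
}

# Simplified phoneme spellings of the three example words.
WORDS = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
}


def to_visemes(phonemes):
    """Map a phoneme sequence to the sequence of lip shapes a camera sees."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


viseme_sequences = {word: to_visemes(ph) for word, ph in WORDS.items()}
distinct_sequences = set(viseme_sequences.values())

for word, seq in viseme_sequences.items():
    print(word, "->", seq)
print("distinct viseme sequences:", len(distinct_sequences))
```

Under this grouping all three words collapse to the same viseme sequence, which is why a classifier needs cues beyond a single static lip shape, such as timing and motion, to tell them apart.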

Model Evolution & Results

We implemented and benchmarked five model configurations to evaluate the impact of 3D convolutions, color (RGB) input, and attention.

| Model | Architecture | Input | Accuracy | Key Takeaway |
|---|---|---|---|---|
| Model 1 | 2D CNN | Gray | Failed | Model was too simple; training did not converge. |
| Model 2 | 3D CNN + GRU | Gray | 62.52% | First working model: spatiotemporal (3D) features plus sequential (GRU) modeling. |
| Model 3 | 3D CNN + GRU | RGB | 67.97% | Color input provided more information and improved accuracy. |
| Model 4 | 3D CNN + GRU + Attention | RGB | 88.14% | Best model. Weighting the most relevant frames (attention) gave the highest accuracy. |
| Model 5 | 3D CNN + GRU + Attention | Gray (200) | 75.29% | Architecture scales well to larger vocabularies. |
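
To make the step-by-step gains explicit, the short sketch below turns the reported accuracies for Models 2 to 4 into percentage-point differences. Model 5 is left out because its input setting differs, so its accuracy is not directly comparable. The variable names are illustrative only.

```python
# Accuracies (percent) copied from the results table.
accuracy = {"Model 2": 62.52, "Model 3": 67.97, "Model 4": 88.14}

# Each step changes one thing relative to the previous model.
steps = [
    ("Model 2", "Model 3", "grayscale -> RGB input"),
    ("Model 3", "Model 4", "adding attention"),
]

gains = {}
for before, after, change in steps:
    # Difference in percentage points, rounded to the table's precision.
    gains[change] = round(accuracy[after] - accuracy[before], 2)
    print(f"{before} -> {after} ({change}): +{gains[change]:.2f} points")
```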

Performance Note: Our best model (Model 4) achieved a macro-averaged ROC AUC of 0.99, with several classes reaching an AUC of 1.00.
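
For readers unfamiliar with the metric, here is a minimal sketch of a macro-averaged one-vs-rest ROC AUC, written in plain Python. It uses the rank interpretation of AUC (the probability that a positive sample scores higher than a negative one). The toy labels and probabilities are made up for illustration; this is not the project's evaluation code and does not reproduce its results.

```python
def binary_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auc(y_true, y_prob, num_classes):
    """Average the one-vs-rest AUC over classes, each class weighted equally."""
    per_class = []
    for c in range(num_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in y_prob]
        per_class.append(binary_auc(labels, scores))
    return sum(per_class) / num_classes, per_class


# Toy example: 3 classes, 6 samples, one row of class probabilities each.
y_true = [0, 0, 1, 1, 2, 2]
y_prob = [
    [0.80, 0.10, 0.10],
    [0.50, 0.45, 0.05],
    [0.20, 0.70, 0.10],
    [0.50, 0.40, 0.10],
    [0.10, 0.20, 0.70],
    [0.20, 0.20, 0.60],
]

macro, per_class = macro_auc(y_true, y_prob, num_classes=3)
print("per-class AUC:", per_class)
print("macro-average AUC:", macro)
```

An AUC of 1.00 for a class means every sample of that class received a higher score for it than every sample of the other classes; it describes ranking quality rather than accuracy at a fixed threshold.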

Methodology

1. Preprocessing

Face Detection: Used dlib's facial landmark detector to locate the face in each frame.
Cropping: Extracted the mouth region from each video frame to remove irrelevant background.
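
The cropping step can be sketched as follows. This example assumes the common 68-point landmark layout, in which indices 48 to 67 outline the mouth; the README does not say which landmark model was used. The `mouth_bbox` and `crop` helpers, the padding ratio, and the synthetic landmarks are all hypothetical. The sketch uses plain Python lists so that it runs without dlib installed.

```python
# In the 68-point landmark layout, points 48..67 outline the mouth.
# (Assumption: the project may use a different landmark model.)
MOUTH_START, MOUTH_END = 48, 68


def mouth_bbox(landmarks, frame_w, frame_h, pad=0.2):
    """Return (left, top, right, bottom) around the mouth, clamped to the frame.

    landmarks: list of 68 (x, y) integer tuples. In a real pipeline these
    would come from a landmark predictor, e.g. (shape.part(i).x, shape.part(i).y).
    pad: hypothetical margin, as a fraction of the mouth width and height.
    """
    mouth = landmarks[MOUTH_START:MOUTH_END]
    xs = [x for x, _ in mouth]
    ys = [y for _, y in mouth]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    px = int(round((x1 - x0) * pad))
    py = int(round((y1 - y0) * pad))
    left = max(0, x0 - px)
    top = max(0, y0 - py)
    right = min(frame_w, x1 + px + 1)    # exclusive bound
    bottom = min(frame_h, y1 + py + 1)   # exclusive bound
    return left, top, right, bottom


def crop(frame, box):
    """Crop a frame stored as a list of pixel rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


# Synthetic data: a blank 100x100 frame and made-up landmark positions.
frame = [[0] * 100 for _ in range(100)]
landmarks = [(0, 0)] * 48 + [(40 + i, 60 if i % 2 == 0 else 70) for i in range(20)]

box = mouth_bbox(landmarks, frame_w=100, frame_h=100)
patch = crop(frame, box)
print("mouth box:", box)
print("patch size (h, w):", (len(patch), len(patch[0])))
```

Clamping the box to the frame keeps the crop valid when the mouth is near an image border; resizing every crop to a fixed size before batching would be a natural next step.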

2. Architectures Implemented

3D CNNs: Extract spatiotemporal features from the video volume, capturing motion across frames that per-frame 2D CNNs miss.
GRU (Gated Recurrent Unit): Processes the sequence of features produced by the CNN to model temporal dependencies.
Attention: A linear layer followed by a softmax assigns a weight to each GRU hidden state, letting the model focus on the frames most relevant for classification.
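
The attention step described above can be sketched in plain Python. Each GRU hidden state is scored by a linear layer, the scores are normalized with a softmax over time, and the hidden states are summed with those weights. The function names, the toy hidden states, and the scoring weights are assumptions for illustration; the actual implementation in models/ may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, w, b=0.0):
    """Weighted sum of hidden states.

    hidden_states: list of T vectors (one per time step), each of length D.
    w, b: parameters of a linear scoring layer mapping a D-vector to a scalar.
    Returns (context vector of length D, list of T attention weights).
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) + b for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(a * h[d] for a, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return context, weights


# Toy sequence of 3 time steps with 2-dimensional hidden states (made up).
hidden_states = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
w = [1.0, 1.0]  # hypothetical scoring weights; learned in a real model

context, weights = attention_pool(hidden_states, w)
print("attention weights:", weights)
print("context vector:", context)
```

In this toy input the second time step receives the largest weight because its score is highest; the resulting context vector is what a classification layer would consume in place of only the final hidden state.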

3. Datasets

The models were trained and evaluated using data subsets from:

MIRACL-VC1: 15 speakers, 10 words, 3000 instances.
Lip Reading in the Wild (LRW): 500 words, ~500k instances.
Lip Reading Sentences (LRS): Pre-training was performed on large sentence-level corpora.

Project Structure

Lip-Reading-Project/
├── dataloader/        # Custom dataset classes for MIRACL/LRW
├── finetune/
├── models/            # Implementations of 3D-CNN, GRU, and Attention
├── plots/             # ROC curves and loss graphs
├── utils/             # Helper functions (dlib preprocessing, etc.)
├── requirements.txt   # Python dependencies
├── train.py           # Main training script
└── test.py            # Evaluation script
