Speech To Text (STT) - Deep Learning Implementation

The README below was generated by AI.

A deep learning-based Speech-to-Text system implemented in PyTorch, built around a custom Residual CNN and Bidirectional GRU architecture. The implementation is based on the research paper "A customized residual neural network and bi-directional gated recurrent unit-based automatic speech recognition model", available on ScienceDirect.

🎯 Overview

This project converts speech audio into text using a neural network that combines:

Residual CNNs for feature extraction from mel-spectrograms
Bidirectional GRUs for sequence modeling
CTC Loss for training without alignment requirements

🏗️ Architecture

The model follows this pipeline:

Speech → Mel-spectrogram → CNN (1st layer) → n_cnn_layers Residual CNN layers →
Shape transformation → n_rnn_layers BiGRU layers → MLP → Text output
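
To make the pipeline concrete, the sketch below traces tensor shapes through each stage. It is an illustration under stated assumptions, not the notebook's code: the channel count (32), the 3x3 kernel with padding 1 in the first convolution, and the helper names `conv_out` and `trace_shapes` are hypothetical. Only `n_feats`, `stride`, `rnn_dim`, and `n_class` come from the parameters listed under Training.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(batch, time, n_feats=128, stride=2, rnn_dim=512,
                 n_class=29, channels=32):
    """Return (stage, shape) pairs for the pipeline.

    channels=32 and the 3x3 / padding=1 first convolution are assumptions.
    """
    feats = conv_out(n_feats, stride=stride)   # mel bins after the first CNN
    frames = conv_out(time, stride=stride)     # time steps after the first CNN
    return [
        ("mel-spectrogram", (batch, 1, n_feats, time)),
        ("first CNN", (batch, channels, feats, frames)),
        ("residual CNNs", (batch, channels, feats, frames)),  # shape-preserving
        ("shape transformation", (batch, frames, channels * feats)),
        ("BiGRUs", (batch, frames, 2 * rnn_dim)),  # forward + backward states
        ("MLP", (batch, frames, n_class)),
    ]


for stage, shape in trace_shapes(batch=10, time=800):
    print(f"{stage:>22}: {shape}")
```

The shape transformation step is the important one: the CNN output has separate channel and frequency axes, while a GRU expects one feature vector per time step, so the two axes are merged and time is moved to the sequence position.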

Key Components

Residual CNN Layers: Extract hierarchical features from mel-spectrograms
Bidirectional GRU Layers: Process temporal sequences in both directions
CTC Loss: Enables training without requiring exact alignment between input and output sequences
Text Preprocessing: Handles character-to-integer mapping and padding
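
As a rough illustration of the text preprocessing component, here is a minimal character-to-integer mapping with padding. The vocabulary is an assumption chosen to be consistent with `n_class = 29`: apostrophe, space, and a-z give 28 symbols, leaving index 28 for the CTC blank. The names `CharTextTransform` and `pad_labels` are hypothetical and may differ from what the notebook uses.

```python
class CharTextTransform:
    """Character-level mapping between text and integer labels.

    Assumed vocabulary: apostrophe, space, a-z (28 symbols), with
    index 28 left free for the CTC blank.
    """

    def __init__(self):
        chars = ["'", " "] + [chr(c) for c in range(ord("a"), ord("z") + 1)]
        self.char_to_int = {ch: i for i, ch in enumerate(chars)}
        self.int_to_char = {i: ch for ch, i in self.char_to_int.items()}
        self.blank = len(chars)  # 28

    def text_to_int(self, text):
        # Lowercase and collapse runs of whitespace into single spaces,
        # then drop any character outside the vocabulary.
        text = " ".join(text.lower().split())
        return [self.char_to_int[ch] for ch in text if ch in self.char_to_int]

    def int_to_text(self, labels):
        return "".join(self.int_to_char[i] for i in labels)


def pad_labels(batch, pad_value=0):
    """Right-pad label sequences to equal length and return true lengths.

    The true lengths are what tells CTC loss which entries are padding.
    """
    lengths = [len(seq) for seq in batch]
    width = max(lengths)
    padded = [seq + [pad_value] * (width - len(seq)) for seq in batch]
    return padded, lengths


transform = CharTextTransform()
labels = transform.text_to_int("Hello  World")
print(labels)
print(transform.int_to_text(labels))
```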

📊 Dataset

Training: LibriSpeech train-clean-100 dataset
Testing: LibriSpeech test-clean dataset
Audio Format: 16kHz sample rate, single channel
Text: Lowercase English text with space normalization

🚀 Features

Mel-spectrogram preprocessing with frequency and time masking for data augmentation
Custom text preprocessing with character-level tokenization
Evaluation with Word Error Rate (WER) and Character Error Rate (CER)
Model checkpointing for saving and loading trained models
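
For reference, WER and CER are both edit distances normalized by the length of the reference, computed over words and characters respectively. The sketch below uses a plain-Python Levenshtein distance so it runs without dependencies; the project lists the `Levenshtein` package in its requirements, and how the notebook calls it is not shown here. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits / reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


reference = "the cat sat"
hypothesis = "the cat sit"
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

One wrong character here costs a full word in WER (1 of 3 words) but only 1 of 11 characters in CER, which is why the two metrics are usually reported together.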

📋 Requirements

torch
torchaudio
numpy
matplotlib
tqdm
Levenshtein
kagglehub

🔧 Installation

Clone the repository:

git clone <repository-url>
cd STT

Install dependencies:

pip install torch torchaudio numpy matplotlib tqdm Levenshtein kagglehub

Download the LibriSpeech dataset and place it in the appropriate directory structure.

🎮 Usage

Training

The model can be trained using the provided Jupyter notebook. Key parameters include:

pipeline_params = {
    'batch_size': 10,
    'epochs': 1,
    'learning_rate': 5e-4,
    'n_cnn_layers': 3, 
    'n_rnn_layers': 5,
    'rnn_dim': 512,
    'n_class': 29,
    'n_feats': 128,
    'stride': 2,
    'dropout': 0.1
}
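
The following sketch shows how a CTC loss value could be computed for one batch with `torch.nn.CTCLoss`. It is not the notebook's training loop: random tensors stand in for the model output and the labels, and the blank index of 28, the frame count, and the label lengths are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, n_class = 10, 29
blank = 28                       # assumed blank index (last class)
frames, max_label_len = 50, 12   # illustrative sizes

# Stand-in for the model output: (batch, time, n_class) logits.
logits = torch.randn(batch_size, frames, n_class, requires_grad=True)

# nn.CTCLoss expects log-probabilities shaped (time, batch, n_class).
log_probs = F.log_softmax(logits, dim=2).transpose(0, 1)

# Padded integer labels plus the true length of each label sequence.
targets = torch.randint(0, blank, (batch_size, max_label_len))
target_lengths = torch.randint(5, max_label_len + 1, (batch_size,))
input_lengths = torch.full((batch_size,), frames, dtype=torch.long)

criterion = nn.CTCLoss(blank=blank, zero_infinity=True)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back to the logits

print(f"CTC loss: {loss.item():.4f}")
```

Note that `input_lengths` refers to the number of frames after the CNN's stride, not the number of frames in the original spectrogram.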

Inference

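import torch

# Select the device and load the trained weights
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SpeechRecognitionModel(...)  # same arguments as used for training
model.load_state_dict(torch.load('speech_recognition_model.pt', map_location=device))
model.to(device)
model.eval()  # disable dropout for inference

# Convert speech to text; speech_to_text, text_transform, and
# valid_audio_transforms are defined in the notebook
predicted_text = speech_to_text(audio_path, model, device, text_transform, valid_audio_transforms)
print(f"Predicted Text: {predicted_text}")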
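
A CTC-trained model emits one label per frame, so the output has to be collapsed into text. A common way to do this is greedy decoding: take the most likely label per frame, merge consecutive repeats, and drop blanks. Whether `speech_to_text` in the notebook decodes exactly this way is not stated here; the sketch below assumes it, along with a blank index of 28 and an apostrophe, space, a-z vocabulary. `greedy_ctc_decode` is a hypothetical name.

```python
def greedy_ctc_decode(frame_labels, int_to_char, blank=28):
    """Merge consecutive repeated labels, then drop blanks."""
    chars = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            chars.append(int_to_char[label])
        previous = label
    return "".join(chars)


# Assumed vocabulary: index 0 is the apostrophe, 1 is space, 2..27 are a-z.
vocab = "' abcdefghijklmnopqrstuvwxyz"
int_to_char = dict(enumerate(vocab))

# Per-frame argmax labels; 28 is the blank. The blank between the two
# runs of 13 ("l") is what allows a double letter to survive merging.
frames = [9, 9, 28, 6, 13, 13, 28, 13, 16, 16, 28]
print(greedy_ctc_decode(frames, int_to_char))  # hello
```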

Model Architecture

Input: Mel-spectrograms of shape (batch, 1, 128, time)
CNN Layers: 1 initial + 3 residual CNN layers
RNN Layers: 5 bidirectional GRU layers
Output: Character probabilities over time
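
The layer counts above can be sketched in PyTorch as follows. This is a simplified illustration and not the notebook's `SpeechRecognitionModel`: the channel count (32), the use of `BatchNorm2d` and GELU, the linear projection before the GRU, the single stacked `nn.GRU`, and the class names `ResidualCNN` and `SketchSTTModel` are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualCNN(nn.Module):
    """Two 3x3 convolutions with a skip connection; shape-preserving."""

    def __init__(self, channels, dropout):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm1 = nn.BatchNorm2d(channels)
        self.norm2 = nn.BatchNorm2d(channels)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        residual = x
        x = self.conv1(self.dropout(F.gelu(self.norm1(x))))
        x = self.conv2(self.dropout(F.gelu(self.norm2(x))))
        return x + residual


class SketchSTTModel(nn.Module):
    def __init__(self, n_cnn_layers=3, n_rnn_layers=5, rnn_dim=512,
                 n_class=29, n_feats=128, stride=2, dropout=0.1,
                 channels=32):  # channels=32 is an assumption
        super().__init__()
        feats = (n_feats + 2 - 3) // stride + 1  # mel bins after first conv
        self.cnn = nn.Conv2d(1, channels, 3, stride=stride, padding=1)
        self.res_cnn = nn.Sequential(
            *[ResidualCNN(channels, dropout) for _ in range(n_cnn_layers)])
        self.proj = nn.Linear(channels * feats, rnn_dim)
        self.rnn = nn.GRU(rnn_dim, rnn_dim, num_layers=n_rnn_layers,
                          batch_first=True, bidirectional=True,
                          dropout=dropout)
        self.mlp = nn.Sequential(
            nn.Linear(2 * rnn_dim, rnn_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(rnn_dim, n_class))

    def forward(self, x):                       # (batch, 1, n_feats, time)
        x = self.res_cnn(self.cnn(x))           # (batch, C, feats', time')
        b, c, f, t = x.shape
        x = x.reshape(b, c * f, t).transpose(1, 2)  # (batch, time', C * feats')
        x = self.proj(x)                        # (batch, time', rnn_dim)
        x, _ = self.rnn(x)                      # (batch, time', 2 * rnn_dim)
        return self.mlp(x)                      # (batch, time', n_class)


model = SketchSTTModel().eval()
with torch.no_grad():
    out = model(torch.randn(2, 1, 128, 100))
print(out.shape)  # torch.Size([2, 50, 29])
```

The output holds unnormalized scores per frame; applying `log_softmax` over the last dimension gives the character log-probabilities that CTC loss and decoding operate on.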

📁 File Structure

STT/
├── Simple Speech To Text.ipynb    # Main implementation notebook
└── README.md                      # This file

📄 License

This project is based on academic research. Please cite the original paper if you use this implementation:

"A customized residual neural network and bi-directional gated recurrent unit-based automatic speech recognition model" - ScienceDirect

🔗 References

⚠️ Notes

This implementation was originally developed on Kaggle
The model requires significant computational resources for training
Results may vary depending on hardware and dataset preprocessing
For production use, consider fine-tuning on domain-specific data
