dhruva4869/Speech-To-Text

Below README generated by AI.

Speech To Text (STT) - Deep Learning Implementation

A deep learning-based Speech-to-Text system implemented in PyTorch, built around a custom Residual CNN and Bidirectional GRU architecture. The implementation follows the research paper "A customized residual neural network and bi-directional gated recurrent unit-based automatic speech recognition model", available on ScienceDirect.

🎯 Overview

This project converts speech audio into text using a neural network that combines:

  • Residual CNNs for feature extraction from mel-spectrograms
  • Bidirectional GRUs for sequence modeling
  • CTC Loss for training without alignment requirements

🏗️ Architecture

The model follows this pipeline:

Speech → Mel-spectrogram → CNN (1st layer) → n_cnn layers of Residual CNNs → 
Shape transformation → n_rnn_layers of BiGRUs → MLP → Text output
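The pipeline above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the notebook's code: the class names `ResidualCNN` and `SketchSpeechModel`, the 32 convolution channels, the GELU activations, and the single stacked `nn.GRU` are all assumptions made to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualCNN(nn.Module):
    """Two 3x3 convolutions with a skip connection; the input shape is preserved."""

    def __init__(self, channels, n_feats, dropout):
        super().__init__()
        self.norm1 = nn.LayerNorm(n_feats)
        self.norm2 = nn.LayerNorm(n_feats)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def _apply_norm(norm, x):
        # LayerNorm normalizes the last axis, so move the feature axis there and back.
        return norm(x.transpose(2, 3)).transpose(2, 3)

    def forward(self, x):  # x: (batch, channels, n_feats, time)
        residual = x
        x = self.conv1(self.dropout(F.gelu(self._apply_norm(self.norm1, x))))
        x = self.conv2(self.dropout(F.gelu(self._apply_norm(self.norm2, x))))
        return x + residual


class SketchSpeechModel(nn.Module):
    """Hypothetical stand-in for the notebook's model; layer sizes are assumed."""

    def __init__(self, n_cnn_layers=3, n_rnn_layers=5, rnn_dim=512,
                 n_class=29, n_feats=128, stride=2, dropout=0.1, channels=32):
        super().__init__()
        feats_out = n_feats // stride  # the strided first convolution halves the feature axis
        self.cnn = nn.Conv2d(1, channels, 3, stride=stride, padding=1)
        self.res_cnn = nn.Sequential(
            *[ResidualCNN(channels, feats_out, dropout) for _ in range(n_cnn_layers)]
        )
        self.fc = nn.Linear(channels * feats_out, rnn_dim)
        self.rnn = nn.GRU(rnn_dim, rnn_dim, num_layers=n_rnn_layers,
                          batch_first=True, bidirectional=True, dropout=dropout)
        self.classifier = nn.Sequential(
            nn.Linear(2 * rnn_dim, rnn_dim),  # 2x because the GRU is bidirectional
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(rnn_dim, n_class),
        )

    def forward(self, x):  # x: (batch, 1, n_feats, time)
        x = self.res_cnn(self.cnn(x))  # (batch, channels, feats_out, time_out)
        batch, channels, feats, time = x.shape
        # Shape transformation: merge channels and features, put time second.
        x = x.reshape(batch, channels * feats, time).transpose(1, 2)
        x = self.fc(x)                 # (batch, time_out, rnn_dim)
        x, _ = self.rnn(x)             # (batch, time_out, 2 * rnn_dim)
        return self.classifier(x)      # (batch, time_out, n_class)
```

Feeding a tensor of shape `(2, 1, 128, 100)` through this sketch yields per-frame class scores of shape `(2, 50, 29)`, since the strided first convolution halves the time axis.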

Key Components

  1. Residual CNN Layers: Extract hierarchical features from mel-spectrograms
  2. Bidirectional GRU Layers: Process temporal sequences in both directions
  3. CTC Loss: Enables training without requiring exact alignment between input and output sequences
  4. Text Preprocessing: Handles character-to-integer mapping and padding
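To make the CTC component concrete, here is a small sketch of how `torch.nn.CTCLoss` is typically called. The random logits stand in for model output, and the blank index of 28 and the label values are assumptions chosen for illustration, not values confirmed by the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, time_steps, n_class = 2, 50, 29
blank = 28  # assumed: the last class index is reserved for the CTC blank

# Stand-in for model output: unnormalized scores of shape (batch, time, n_class).
logits = torch.randn(batch, time_steps, n_class)
# nn.CTCLoss expects log-probabilities shaped (time, batch, n_class).
log_probs = logits.log_softmax(dim=2).transpose(0, 1)

# Integer targets padded to a common width, plus each target's true length.
# Values beyond the true length are padding and are ignored by the loss.
targets = torch.tensor([[9, 10, 1, 21, 0, 0],
                        [15, 12, 2, 0, 0, 0]])
target_lengths = torch.tensor([4, 3])
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)

ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(f"CTC loss: {loss.item():.3f}")
```

No frame-level alignment is supplied anywhere: the loss only needs the output lengths and the target lengths, which is what makes training on unaligned transcripts possible.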

📊 Dataset

  • Training: LibriSpeech train-clean-100 dataset
  • Testing: LibriSpeech test-clean dataset
  • Audio Format: 16kHz sample rate, single channel
  • Text: Lowercase English text with space normalization
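The text handling described above (lowercasing, space normalization, character-to-integer mapping, and padding) might look like the sketch below. The vocabulary layout (apostrophe, space, then a-z, with index 28 left for the CTC blank) is an assumption that happens to match `n_class = 29`; the function names are hypothetical.

```python
import string

# Assumed vocabulary: apostrophe, space, then a-z (28 symbols).
# Index 28 is left free for the CTC blank, giving 29 classes in total.
VOCAB = ["'", " "] + list(string.ascii_lowercase)
CHAR_TO_INT = {ch: i for i, ch in enumerate(VOCAB)}
BLANK_INDEX = len(VOCAB)


def normalize(text):
    """Lowercase, drop out-of-vocabulary characters, then collapse whitespace."""
    kept = "".join(ch if ch in CHAR_TO_INT else " " for ch in text.lower())
    return " ".join(kept.split())


def text_to_int(text):
    """Map a transcript to a list of integer labels."""
    return [CHAR_TO_INT[ch] for ch in normalize(text)]


def int_to_text(labels):
    """Map integer labels back to a string."""
    return "".join(VOCAB[i] for i in labels)


def pad_batch(sequences, pad_value=0):
    """Right-pad label lists to a common width; also return the true lengths."""
    lengths = [len(seq) for seq in sequences]
    width = max(lengths, default=0)
    padded = [seq + [pad_value] * (width - len(seq)) for seq in sequences]
    return padded, lengths


labels = text_to_int("Hi  There")
print(labels)               # [9, 10, 1, 21, 9, 6, 19, 6]
print(int_to_text(labels))  # hi there
```

The true lengths returned by `pad_batch` are what a CTC loss needs in order to ignore the padding.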

🚀 Features

  • Mel-spectrogram preprocessing with frequency and time masking for data augmentation
  • Custom text preprocessing with character-level tokenization
  • Evaluation with Word Error Rate (WER) and Character Error Rate (CER)
  • Model checkpointing for saving and loading trained models
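As an illustration of the two metrics, WER and CER are both an edit distance divided by the reference length, computed over words and characters respectively. The project lists the `Levenshtein` package for this; the sketch below swaps in a plain-Python dynamic-programming edit distance so it runs without dependencies, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("the cat sat", "the cat sit"))  # one wrong word out of three
print(cer("hello", "hallo"))              # one wrong character out of five
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, because insertions count as errors.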

📋 Requirements

torch
torchaudio
numpy
matplotlib
tqdm
Levenshtein
kagglehub

🔧 Installation

  1. Clone the repository:
git clone <repository-url>
cd STT
  2. Install dependencies:
pip install torch torchaudio numpy matplotlib tqdm Levenshtein kagglehub
  3. Download the LibriSpeech dataset and place it in the appropriate directory structure.

🎮 Usage

Training

The model can be trained using the provided Jupyter notebook. Key parameters include:

pipeline_params = {
    'batch_size': 10,
    'epochs': 1,
    'learning_rate': 5e-4,
    'n_cnn_layers': 3, 
    'n_rnn_layers': 5,
    'rnn_dim': 512,
    'n_class': 29,
    'n_feats': 128,
    'stride': 2,
    'dropout': 0.1
}

Inference

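import torch

# Load the trained model (constructor arguments are elided here; use the
# same values the model was trained with)
model = SpeechRecognitionModel(...)
model.load_state_dict(torch.load('speech_recognition_model.pt', map_location=device))
model.to(device)
model.eval()  # disable dropout for inference

# Convert speech to text (device, audio_path and the transforms come from the notebook)
predicted_text = speech_to_text(audio_path, model, device, text_transform, valid_audio_transforms)
print(f"Predicted Text: {predicted_text}")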
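The body of `speech_to_text` is not shown here. One common way to turn per-frame CTC scores into text is greedy decoding: take the best class per frame, collapse consecutive repeats, and drop blanks. The sketch below shows that idea in plain Python; `greedy_ctc_decode` is a hypothetical helper, and whether the notebook decodes this way is an assumption.

```python
def greedy_ctc_decode(frame_scores, vocab, blank_index):
    """Decode per-frame class scores into text.

    frame_scores: list of frames, each a list with one score per class.
    vocab:        list mapping class index -> character (blank excluded).
    blank_index:  class index reserved for the CTC blank.
    """
    # Best class per frame (argmax over the class axis).
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]

    chars = []
    previous = None
    for index in best:
        # Emit a character only when the class changes and is not the blank.
        if index != previous and index != blank_index:
            chars.append(vocab[index])
        previous = index
    return "".join(chars)


# Toy example with a two-letter vocabulary and class 2 as the blank.
toy_vocab = ["a", "b"]
toy_path = [0, 0, 2, 0, 1, 1, 2]
toy_scores = [[1.0 if k == i else 0.0 for k in range(3)] for i in toy_path]
print(greedy_ctc_decode(toy_scores, toy_vocab, blank_index=2))  # aab
```

The blank between the two runs of `a` is what lets the decoder output a doubled letter; without it the repeats would collapse into a single character.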

Model Architecture

  • Input: Mel-spectrograms of shape (batch, 1, 128, time)
  • CNN Layers: 1 initial + 3 residual CNN layers
  • RNN Layers: 5 bidirectional GRU layers
  • Output: Character probabilities over time
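The shapes above can be traced with a little arithmetic. The sketch below assumes a mel-spectrogram hop length of 200 samples with centered frames (the torchaudio `MelSpectrogram` defaults), a first convolution with kernel 3, padding 1 and stride 2, and 32 convolution channels; none of these values are confirmed by this README, and `model_shapes` is a hypothetical helper.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def model_shapes(num_samples, batch=1, n_feats=128, stride=2, n_class=29,
                 hop_length=200, channels=32):
    """Trace tensor shapes from raw audio length to per-frame class scores."""
    frames = num_samples // hop_length + 1       # centered STFT framing (assumed)
    feats_out = conv_out(n_feats, stride=stride)  # feature axis after the first CNN
    time_out = conv_out(frames, stride=stride)    # time axis after the first CNN
    return {
        "input": (batch, 1, n_feats, frames),
        "after_cnn": (batch, channels, feats_out, time_out),
        "rnn_input_features": channels * feats_out,
        "output": (batch, time_out, n_class),
    }


# One second of 16 kHz audio.
for name, shape in model_shapes(16000).items():
    print(f"{name}: {shape}")
```

Under these assumptions, one second of audio produces 81 spectrogram frames and 41 output frames. CTC needs at least as many output frames as target characters, so the stride limits how short an utterance can be relative to its transcript.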

📁 File Structure

STT/
├── Simple Speech To Text.ipynb    # Main implementation notebook
└── README.md                      # This file

📄 License

This project is based on academic research. Please cite the original paper if you use this implementation:

"A customized residual neural network and bi-directional gated recurrent unit-based automatic speech recognition model" - ScienceDirect

⚠️ Notes

  • This implementation was originally developed on Kaggle
  • The model requires significant computational resources for training
  • Results may vary depending on hardware and dataset preprocessing
  • For production use, consider fine-tuning on domain-specific data

About

ResidualCNN + BiGRU + MLP implementation of Speech To Text
