The README below was generated by AI.
A deep learning-based Speech-to-Text system implemented in PyTorch, featuring a custom Residual CNN and Bidirectional GRU architecture. This implementation is based on the research paper "A customized residual neural network and bi-directional gated recurrent unit-based automatic speech recognition model", available on ScienceDirect.
This project converts speech audio into text using a neural network architecture that combines:
- Residual CNNs for feature extraction from mel-spectrograms
- Bidirectional GRUs for sequence modeling
- CTC Loss for training without alignment requirements
The model follows this pipeline:
Speech → Mel-spectrogram → CNN (1st layer) → n_cnn layers of Residual CNNs →
Shape transformation → n_rnn_layers of BiGRUs → MLP → Text output
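To make the shape transformation step concrete, the sketch below traces tensor shapes through the pipeline with plain arithmetic. It assumes the first CNN layer uses a 3x3 kernel with padding 1 and a stride of 2, produces 32 channels, and that the residual layers preserve shape. It also assumes each GRU direction has `rnn_dim` hidden units. None of these details are stated here, so treat the numbers as illustrative rather than as the notebook's exact values.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard convolution output-size formula along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(batch: int, n_feats: int, time: int,
                 channels: int = 32, rnn_dim: int = 512) -> dict:
    """Return the (assumed) tensor shape after each stage of the pipeline."""
    feats, frames = conv_out(n_feats), conv_out(time)
    return {
        "mel_spectrogram": (batch, 1, n_feats, time),
        # Assumption: the residual CNN layers keep this shape unchanged.
        "after_cnn": (batch, channels, feats, frames),
        # Channels and frequency bins are flattened into one feature axis.
        "after_reshape": (batch, frames, channels * feats),
        # Forward and backward hidden states are concatenated.
        "after_bigru": (batch, frames, 2 * rnn_dim),
    }


shapes = trace_shapes(batch=10, n_feats=128, time=1000)
for stage, shape in shapes.items():
    print(f"{stage:16s} {shape}")
```

Under these assumptions the stride of 2 halves both the frequency and time axes, so the recurrent layers see half as many time steps as the input spectrogram has frames.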
- Residual CNN Layers: Extract hierarchical features from mel-spectrograms
- Bidirectional GRU Layers: Process temporal sequences in both directions
- CTC Loss: Enables training without requiring exact alignment between input and output sequences
- Text Preprocessing: Handles character-to-integer mapping and padding
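As a rough illustration of the text preprocessing step, the following sketch maps characters to integers and pads label sequences to a common length. The vocabulary layout (index 0 for the CTC blank, then apostrophe, space, and a-z) is an assumption chosen to give 29 classes. The helper names `normalize`, `text_to_int`, `int_to_text`, and `pad_labels` are hypothetical and may not match the notebook.

```python
import re

# Assumed vocabulary: index 0 is reserved for the CTC blank, followed by
# apostrophe, space, and a-z (28 characters + blank = 29 classes).
BLANK = 0
CHARS = ["'", " "] + [chr(c) for c in range(ord("a"), ord("z") + 1)]
CHAR_TO_INT = {ch: i + 1 for i, ch in enumerate(CHARS)}
INT_TO_CHAR = {i: ch for ch, i in CHAR_TO_INT.items()}


def normalize(text: str) -> str:
    """Lowercase, replace out-of-vocabulary characters, collapse whitespace."""
    text = re.sub(r"[^a-z' ]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def text_to_int(text: str) -> list[int]:
    """Convert a transcript into a list of integer labels."""
    return [CHAR_TO_INT[ch] for ch in normalize(text)]


def int_to_text(labels: list[int]) -> str:
    """Convert integer labels back to text, skipping blank/padding entries."""
    return "".join(INT_TO_CHAR[i] for i in labels if i != BLANK)


def pad_labels(batch: list[list[int]], pad_value: int = BLANK):
    """Right-pad label sequences to equal length and return the true lengths."""
    lengths = [len(seq) for seq in batch]
    width = max(lengths, default=0)
    padded = [seq + [pad_value] * (width - len(seq)) for seq in batch]
    return padded, lengths


labels = [text_to_int("Hello  World"), text_to_int("Hi")]
padded, lengths = pad_labels(labels)
print(lengths)
print(int_to_text(padded[1]))
```

The unpadded lengths are returned alongside the padded batch because CTC loss needs the true target length of each sequence.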
- Training: LibriSpeech train-clean-100 dataset
- Testing: LibriSpeech test-clean dataset
- Audio Format: 16kHz sample rate, single channel
- Text: Lowercase English text with space normalization
- Mel-spectrogram preprocessing with frequency and time masking for data augmentation
- Custom text preprocessing with character-level tokenization
- Evaluation with Word Error Rate (WER) and Character Error Rate (CER)
- Model checkpointing for saving and loading trained models
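For reference, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below uses a small pure-Python Levenshtein routine as a stand-in so that it runs without dependencies. The project lists the `Levenshtein` package for this purpose, and the function names here are illustrative only.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits / reference characters."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits / reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

In this example one word is substituted and one is deleted, so two of the six reference words are wrong.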
torch
torchaudio
numpy
matplotlib
tqdm
Levenshtein
kagglehub
- Clone the repository:
git clone <repository-url>
cd STT
- Install dependencies:
pip install torch torchaudio numpy matplotlib tqdm Levenshtein kagglehub
- Download the LibriSpeech dataset and place it in the appropriate directory structure.
The model can be trained using the provided Jupyter notebook. Key parameters include:
pipeline_params = {
    'batch_size': 10,
    'epochs': 1,
    'learning_rate': 5e-4,
    'n_cnn_layers': 3,
    'n_rnn_layers': 5,
    'rnn_dim': 512,
    'n_class': 29,
    'n_feats': 128,
    'stride': 2,
    'dropout': 0.1
}

To transcribe audio with a trained model:

# Load the trained model
model = SpeechRecognitionModel(...)
model.load_state_dict(torch.load('speech_recognition_model.pt', map_location=device))
model.eval()  # disable dropout for inference

# Convert speech to text
predicted_text = speech_to_text(audio_path, model, device, text_transform, valid_audio_transforms)
print(f"Predicted Text: {predicted_text}")

- Input: Mel-spectrograms of shape (batch, 1, 128, time)
- CNN Layers: 1 initial + 3 residual CNN layers
- RNN Layers: 5 bidirectional GRU layers
- Output: Character probabilities over time
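One common way to turn per-frame character probabilities into text is greedy CTC decoding: take the most likely class at each time step, merge consecutive repeats, and drop blanks. The sketch below assumes this strategy and a blank index of 0. The notebook's actual decoder may differ, and `greedy_ctc_decode` is a hypothetical name.

```python
def greedy_ctc_decode(frame_scores, int_to_char, blank=0):
    """Pick the best class per frame, merge repeats, then drop blanks."""
    # Index of the highest score in each frame (argmax over classes).
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    chars, prev = [], None
    for idx in best:
        # Emit a character only when the class changes and is not the blank.
        if idx != prev and idx != blank:
            chars.append(int_to_char[idx])
        prev = idx
    return "".join(chars)


# Toy vocabulary and scores: class 0 is the blank, 1 is "h", 2 is "i".
vocab = {1: "h", 2: "i"}
frames = [
    [0.10, 0.80, 0.10],  # h
    [0.20, 0.70, 0.10],  # h again (repeat, merged)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.20, 0.70],  # i
    [0.10, 0.10, 0.80],  # i again (repeat, merged)
]
decoded = greedy_ctc_decode(frames, vocab)
print(decoded)  # prints "hi"
```

A blank between two identical characters keeps them separate, which is how CTC represents doubled letters such as the "ll" in "hello".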
STT/
├── Simple Speech To Text.ipynb # Main implementation notebook
└── README.md # This file
This project is based on academic research. Please cite the original paper if you use this implementation:
"A customized residual neural network and bi-directional gated recurrent unit-based automatic speech recognition model" - ScienceDirect
- This implementation was originally developed on Kaggle
- The model requires significant computational resources for training
- Results may vary depending on hardware and dataset preprocessing
- For production use, consider fine-tuning on domain-specific data