Effortlessly transform live audio into text with our fine-tuned Whisper and Wav2Vec2 models. Built for real-time speech recognition, this project brings low-latency transcription to life—perfect for conversations, streaming, or any dynamic audio environment.
- Capture live 16kHz mono audio with ease.
- Fine-tune models on live audio in real time.
- Switch seamlessly between Whisper and Wav2Vec2.
- Save and load models effortlessly to timestamped directories.
- English transcription, optimized for clarity.
Clone the repository and set up in minutes:
git clone https://github.com/bniladridas/speech-model.git
cd speech-model
pip install -r requirements.txt
For Linux, install audio dependencies:
sudo apt-get install libsndfile1
Fine-Tune Models
Train on live audio with a single command:
python main.py --model whisper
# or
python main.py --model wav2vec2
Press Ctrl+C to save your fine-tuned model automatically.
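The Ctrl+C autosave behaves roughly like the sketch below. This is a hedged illustration, not the project's exact code: `train_step` and `save_model` are hypothetical stand-ins for the real training and save routines in main.py.

```python
def run_with_autosave(train_step, save_model, max_steps=1000):
    """Run training steps; always save the model, even on Ctrl+C."""
    try:
        for step in range(max_steps):
            train_step(step)
    except KeyboardInterrupt:
        print("Interrupted -- saving fine-tuned model.")
    finally:
        save_model()  # runs on normal exit and on Ctrl+C alike
```

Pressing Ctrl+C raises `KeyboardInterrupt` inside the loop; the `finally` clause guarantees the save happens either way.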
Transcribe Audio
Test your model with real-time transcription:
python test_transcription.py --model whisper
# or
python test_transcription.py --model wav2vec2
Records 5 seconds of audio (customizable) and delivers instant text.
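The recording step can be sketched as follows; the constants and the `sounddevice` calls are an assumption based on the README's description of 5-second, 16kHz mono capture, not the script's actual code.

```python
SAMPLE_RATE = 16_000   # 16kHz mono, per the README
DURATION_S = 5         # default recording length (customizable)

def num_samples(duration_s: int = DURATION_S, rate: int = SAMPLE_RATE) -> int:
    """Total samples to capture for a recording of the given length."""
    return duration_s * rate

if __name__ == "__main__":
    # Requires a microphone; sounddevice is listed in requirements.txt.
    import sounddevice as sd

    audio = sd.rec(num_samples(), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until the recording finishes
    print(audio.shape)
```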
Save Your Work
Models are saved to:
models/speech_recognition_ai_fine_tune_[model]_[timestamp]
Customize the path:
export MODEL_SAVE_PATH="/your/path"
python main.py --model [whisper|wav2vec2]
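A minimal sketch of how the timestamped save directory could be built. Only the `MODEL_SAVE_PATH` override and the `speech_recognition_ai_fine_tune_[model]_[timestamp]` pattern come from the README; the exact timestamp format here is an assumption.

```python
import os
import time

def model_save_path(model_name: str) -> str:
    """Build a timestamped save directory, honoring MODEL_SAVE_PATH."""
    base = os.environ.get("MODEL_SAVE_PATH", "models")
    timestamp = time.strftime("%Y%m%d_%H%M%S")  # assumed format
    return os.path.join(
        base, f"speech_recognition_ai_fine_tune_{model_name}_{timestamp}"
    )
```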
- Python 3.8+
- PyTorch (2.0.1 recommended)
- Transformers (4.35.0 recommended)
- Sounddevice (0.4.6)
- Torchaudio (2.0.1)
A GPU accelerates fine-tuning. Full requirements are in requirements.txt.
- Task: Automatic Speech Recognition (ASR)
- Models:
  - Whisper (openai/whisper-small)
  - Wav2Vec2 (facebook/wav2vec2-base-960h)
- Fine-Tuning: Optimized on live 16kHz mono audio with Adam optimizer (learning rate 1e-5).
- Input: 16kHz mono audio
- Output: Precise English transcription
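One fine-tuning update with Adam at learning rate 1e-5 can be illustrated as below; a tiny stand-in module replaces the actual Whisper/Wav2Vec2 model so the sketch stays self-contained, and the feature/target tensors are dummy data.

```python
import torch

# Stand-in for the ASR model; the real pipeline fine-tunes
# Whisper or Wav2Vec2 on 16kHz mono audio features.
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # lr from the README

features = torch.randn(8, 16)   # batch of dummy audio features
targets = torch.randn(8, 4)     # dummy training targets

loss = torch.nn.functional.mse_loss(model(features), targets)
optimizer.zero_grad()
loss.backward()                 # backpropagate
optimizer.step()                # one Adam update
```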
Load models from Hugging Face:
from transformers import WhisperForConditionalGeneration, WhisperProcessor
model = WhisperForConditionalGeneration.from_pretrained("bniladridas/speech-recognition-ai-fine-tune")
processor = WhisperProcessor.from_pretrained("bniladridas/speech-recognition-ai-fine-tune")
- dataset.py: Audio recording and preprocessing
- train.py: Training pipeline
- test_transcription.py: Real-time transcription
- main.py: Core fine-tuning script
- requirements.txt: Dependencies
- README.md: This guide
Git Push Issues
For non-fast-forward errors:
git pull origin main --rebase
git push origin main
Stuck Processes
Find and stop processes:
ps aux | grep python
kill -9 <PID>
Large Files
Exclude models from Git:
echo "models/" >> .gitignore
git add .gitignore
git rm -r --cached models/
git commit -m "Update .gitignore"
No pre-existing dataset needed. Your live audio recordings fuel the fine-tuning process, making every model uniquely yours.
Licensed under the MIT License. Contribute, refine, and share on GitHub.