A fine-tuned SpeechT5 model for Urdu text-to-speech generation with voice cloning. It accepts input in both Urdu script and Roman Urdu, and lets you select the speaker voice used for synthesis.
- 🗣️ Urdu TTS: High-quality text-to-speech synthesis for Urdu language
- 🔊 Voice Cloning: Generate speech in the style of specific speakers (including Zia Mohiuddin's voice)
- 🌐 Dual Script Support: Works with both Urdu (نثر) and Roman Urdu (Urdu written in Latin script)
- 🎛️ Speaker Selection: Choose between different voice profiles
- 🚀 FastAPI Demo: Interactive web interface for testing the model
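Because the model accepts both scripts, it can help to know which one a given input uses before it is tokenized. The following is a minimal sketch, not code from this repository: the function name `detect_script` and its return labels are assumptions, and it only counts letters in the Arabic and basic Latin Unicode ranges.

```python
def detect_script(text: str) -> str:
    """Classify text as 'urdu', 'roman_urdu', 'mixed', or 'unknown'.

    Hypothetical helper: counts letters in Arabic-script Unicode blocks
    (which Urdu uses) versus ASCII Latin letters.
    """
    arabic_ranges = (
        (0x0600, 0x06FF),  # Arabic
        (0x0750, 0x077F),  # Arabic Supplement
        (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
        (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
    )
    arabic = 0
    latin = 0
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, spaces, and punctuation
        code = ord(ch)
        if any(lo <= code <= hi for lo, hi in arabic_ranges):
            arabic += 1
        elif ch.isascii():
            latin += 1
    if arabic and latin:
        return "mixed"
    if arabic:
        return "urdu"
    if latin:
        return "roman_urdu"
    return "unknown"
```

For example, `detect_script("یہ ایک مثال ہے")` yields `"urdu"` and `detect_script("Ye aik misaal hai")` yields `"roman_urdu"`.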
This implementation is based on SpeechT5, a state-of-the-art model for speech synthesis tasks. Key modifications include:
- Tokenization: Character-level tokenization specifically adapted for Urdu script
- Preprocessing: Updated tokenizer and processor to handle Urdu phonetics and pronunciation
- Architecture: Fine-tuned SpeechT5 architecture with multilingual capabilities
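To illustrate what character-level tokenization means here, the sketch below builds a vocabulary from the characters seen in a corpus and maps text to integer IDs. It is an illustration only and not the tokenizer shipped with this model: the special tokens (`<pad>`, `<unk>`, `</s>`) and the function names are assumptions.

```python
SPECIAL_TOKENS = ["<pad>", "<unk>", "</s>"]  # assumed special tokens


def build_vocab(texts):
    """Map every distinct character in `texts` to an integer ID."""
    chars = sorted({ch for text in texts for ch in text})
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + chars)}


def encode(text, vocab):
    """Convert text to IDs, one per character, followed by end-of-sequence."""
    unk_id = vocab["<unk>"]
    return [vocab.get(ch, unk_id) for ch in text] + [vocab["</s>"]]


def decode(ids, vocab):
    """Convert IDs back to text, dropping special tokens."""
    inverse = {idx: token for token, idx in vocab.items()}
    return "".join(
        inverse[i] for i in ids if inverse[i] not in SPECIAL_TOKENS
    )


vocab = build_vocab(["یہ ایک مثال ہے", "Ye aik misaal hai"])
ids = encode("یہ ایک", vocab)
```

Building the vocabulary from both Urdu-script and Roman Urdu text, as in the last lines, is one way a single tokenizer could cover both input scripts.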
The model was trained on a merged dataset comprising:
- xcollab tts 15k dataset: A comprehensive Urdu speech dataset with 15,000+ recordings
- Zia Mohiuddin Dataset: 350 high-quality recordings of the renowned Pakistani broadcaster
This combination enables both general Urdu TTS and voice cloning capabilities for specific speakers.
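One way to picture the merge is as a single list of utterances in which each record carries a speaker label. The sketch below is an assumption about how such a merge could look, not the project's actual data pipeline: the `Utterance` fields, the file paths, and the `"default"` label for the general dataset are hypothetical (the `"zia_mohiuddin"` label mirrors the speaker name used in the usage example).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    audio_path: str  # hypothetical path to a recording
    text: str        # transcript of the recording
    speaker: str     # label later used to pick a voice


def merge_datasets(general, target, seed=0):
    """Concatenate both datasets and shuffle them reproducibly."""
    merged = list(general) + list(target)
    random.Random(seed).shuffle(merged)
    return merged


# Placeholder records standing in for the two source datasets.
general = [Utterance(f"general/{i}.wav", "...", "default") for i in range(15_000)]
target = [Utterance(f"zia/{i}.wav", "...", "zia_mohiuddin") for i in range(350)]

merged = merge_datasets(general, target)
target_share = sum(u.speaker == "zia_mohiuddin" for u in merged) / len(merged)
```

With 350 recordings against 15,000, the target speaker makes up only about 2% of the merged set, which is worth keeping in mind when judging the voice cloning quality.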
Training details:
- Epochs: 50 (significant improvement observed after 40 epochs)
- Batch Size: 6-8 (smaller batch sizes yielded better results)
- Hardware: GPU-accelerated training
- Performance: Mid-level quality with potential for improvement through:
  - Longer training (100+ epochs recommended)
  - Larger model variants
  - Additional high-quality data
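To get a feel for what these settings imply, the sketch below estimates the number of optimizer steps for a given epoch count and batch size. It is back-of-the-envelope arithmetic, not the training script: the `TrainingConfig` class is hypothetical, and the example count of 15,350 assumes exactly 15,000 plus 350 recordings with no held-out validation split and no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    epochs: int = 50
    batch_size: int = 8
    # Assumption: 15,000 + 350 recordings, all used for training.
    num_examples: int = 15_350

    def steps_per_epoch(self) -> int:
        # The last batch of an epoch may be partially filled.
        return math.ceil(self.num_examples / self.batch_size)

    def total_steps(self) -> int:
        return self.epochs * self.steps_per_epoch()


for cfg in (
    TrainingConfig(batch_size=8),
    TrainingConfig(batch_size=6),
    TrainingConfig(epochs=100, batch_size=8),
):
    print(cfg.epochs, cfg.batch_size, cfg.steps_per_epoch(), cfg.total_steps())
```

Under these assumptions, 50 epochs at batch size 8 come to 95,950 steps, the same at batch size 6 to 127,950 steps, and the recommended 100 epochs at batch size 8 to 191,900 steps.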
Try the interactive demo at [your-demo-link-here] or run locally:
- Clone the repository
- Install dependencies: `pip install -r requirements.txt`
- Run the server: `python app.py`
- Open your browser to http://localhost:8000
The demo interface provides:
- Text input in Urdu or Roman Urdu
- Speaker selection dropdown
- Real-time audio generation
- Responsive web interface
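For the browser to play generated audio, the server has to return it in a format such as WAV. The sketch below shows one way to package a mono waveform as WAV bytes using only the standard library. It is not taken from `app.py`: the function name is hypothetical, and the 16 kHz sample rate is an assumption based on the rate the original SpeechT5 TTS checkpoints produce.

```python
import io
import struct
import wave

SAMPLE_RATE = 16_000  # assumption: matches the original SpeechT5 vocoder output


def waveform_to_wav_bytes(samples, sample_rate=SAMPLE_RATE):
    """Encode float samples in [-1.0, 1.0] as 16-bit mono WAV bytes."""
    # Clip out-of-range values, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()


wav_bytes = waveform_to_wav_bytes([0.0, 0.5, -0.5, 2.0])
```

A FastAPI route could then return these bytes in a `Response` with media type `audio/wav`, although the actual demo may handle audio differently.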
Installation:

    git clone https://github.com/your-username/urdu-tts-voice-cloning.git
    cd urdu-tts-voice-cloning
    pip install -r requirements.txt

Basic usage with Urdu text:

    from model import UrduTTS

    # Create the TTS wrapper
    tts = UrduTTS()

    # Synthesize Urdu-script text with a specific speaker
    audio = tts.generate_text_to_speech(
        text="یہ ایک مثال ہے",  # Urdu text
        speaker="zia_mohiuddin",  # Optional speaker selection
    )
    audio.save("output.wav")

Usage with Roman Urdu text:

    audio = tts.generate_text_to_speech(
        text="Ye aik misaal hai",  # Roman Urdu
        speaker="default",
    )

- Current model achieves mid-level quality with natural-sounding output
- Best results obtained with:
  - 40+ training epochs
  - Batch sizes of 6-8
  - Adequate GPU memory (recommended: 16GB+)
- Voice cloning works best with clear reference recordings
Planned improvements:
- Increase training epochs to 100+
- Experiment with larger SpeechT5 variants
- Expand dataset with more diverse speakers
- Implement Roman Urdu normalization
- Add prosody control features
- Optimize for real-time applications
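Roman Urdu has no standard spelling, so the same word can be written several ways. As a rough idea of what the planned normalization step could look like, the sketch below lowercases the input, strips punctuation, collapses whitespace, and maps spelling variants to one canonical form. Everything here is hypothetical: the function name and the tiny variant table are illustrative only, and a real table would need to be curated from data.

```python
import re

# Illustrative variant table (an assumption, not project data):
# maps alternative spellings to one canonical spelling.
SPELLING_VARIANTS = {
    "ek": "aik",
    "hay": "hai",
    "misal": "misaal",
}


def normalize_roman_urdu(text: str) -> str:
    """Lowercase, drop punctuation, collapse spaces, unify spellings."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    words = cleaned.split()  # split() also collapses repeated whitespace
    return " ".join(SPELLING_VARIANTS.get(word, word) for word in words)
```

With this table, `normalize_roman_urdu("Ye  EK misal hay!")` gives `"ye aik misaal hai"`, so differently spelled inputs reach the tokenizer in the same form.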
Technology stack:
- Model: SpeechT5 (fine-tuned)
- Backend: FastAPI
- Frontend: HTML/CSS/JavaScript
- Audio Processing: Librosa, SoundFile
- ML Framework: PyTorch, Transformers
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments:
- Original SpeechT5 model by Microsoft Research
- xcollab tts dataset contributors
- Zia Mohiuddin dataset providers
