asghar-rizvi/-Urdu-Text-to-Speech-with-Voice-Cloning

Urdu Text-to-Speech with Voice Cloning using SpeechT5


A fine-tuned SpeechT5 model for Urdu text-to-speech generation with voice cloning capabilities. The model accepts text in both Urdu script and Roman Urdu, and lets you select a speaker for personalized speech synthesis.

Features

  • 🗣️ Urdu TTS: High-quality text-to-speech synthesis for Urdu language
  • 🔊 Voice Cloning: Generate speech in the style of specific speakers (including Zia Mohiuddin's voice)
  • 🌐 Dual Script Support: Works with both Urdu (نثر) and Roman Urdu (Urdu written in Latin script)
  • 🎛️ Speaker Selection: Choose between different voice profiles
  • 🚀 FastAPI Demo: Interactive web interface for testing the model

Model Details

This implementation is based on SpeechT5, a state-of-the-art model for speech synthesis tasks. Key modifications include:

  • Tokenization: Character-level tokenization specifically adapted for Urdu script
  • Preprocessing: Updated tokenizer and processor to handle Urdu phonetics and pronunciation
  • Architecture: Fine-tuned SpeechT5 architecture with multilingual capabilities
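The repository's tokenizer code is not shown here, so the following is only a minimal sketch of what character-level tokenization could look like. The function names (`build_vocab`, `encode`, `decode`) and the special tokens are hypothetical, and the vocabulary is built from the two example sentences used later in this README rather than from the real training data.

```python
# Hypothetical character-level tokenizer sketch; not the repository's actual code.
SPECIALS = ["<pad>", "<unk>"]  # assumed special tokens


def build_vocab(texts):
    """Map every distinct character in the corpus to an integer id."""
    chars = sorted({ch for text in texts for ch in text})
    return {token: idx for idx, token in enumerate(SPECIALS + chars)}


def encode(text, vocab):
    """Turn text into one id per character; unseen characters map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text]


def decode(ids, vocab):
    """Invert encode(), dropping special tokens."""
    inverse = {idx: token for token, idx in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in SPECIALS)


corpus = ["یہ ایک مثال ہے", "Ye aik misaal hai"]  # Urdu script and Roman Urdu
vocab = build_vocab(corpus)
ids = encode("یہ ایک مثال ہے", vocab)
```

Because each character is its own token, one vocabulary can cover Urdu script and Latin letters together, which is one plausible way to support both scripts.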

Dataset

The model was trained on a merged dataset comprising:

  1. xcollab tts 15k dataset: A comprehensive Urdu speech dataset with 15,000+ recordings
  2. Zia Mohiuddin Dataset: 350 high-quality recordings of the renowned Pakistani broadcaster

This combination enables both general Urdu TTS and voice cloning capabilities for specific speakers.
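As a rough illustration of the merge, the sketch below tags each record with a speaker label before concatenating the two sets. The record layout, the file names, and the choice to label all general recordings as `default` are assumptions made for this example; the actual preprocessing may differ.

```python
from collections import Counter


def tag_speaker(records, speaker):
    """Return copies of the records with a speaker label attached."""
    return [{**record, "speaker": speaker} for record in records]


def merge_datasets(*labeled_sets):
    """Concatenate already-labeled record lists into one training list."""
    merged = []
    for records in labeled_sets:
        merged.extend(records)
    return merged


# Hypothetical records; real entries would point at the actual audio files.
general = [
    {"audio": "general_0001.wav", "text": "یہ ایک مثال ہے"},
    {"audio": "general_0002.wav", "text": "Ye aik misaal hai"},
]
zia = [{"audio": "zia_0001.wav", "text": "یہ ایک مثال ہے"}]

merged = merge_datasets(
    tag_speaker(general, "default"),
    tag_speaker(zia, "zia_mohiuddin"),
)
speaker_counts = Counter(record["speaker"] for record in merged)
```

Keeping the speaker label on every record is what would let training pair each utterance with the right speaker embedding.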

Training

  • Epochs: 50 (significant improvement observed after 40 epochs)
  • Batch Size: 6-8 (smaller batch sizes yielded better results)
  • Hardware: GPU-accelerated training
  • Performance: Mid-level quality with potential for improvement through:
    • Longer training (100+ epochs recommended)
    • Larger model variants
    • Additional high-quality data
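To get a feel for what these settings mean in optimizer steps, here is a small back-of-the-envelope calculation. It assumes roughly 15,350 training examples (15,000 plus 350, with no held-out split), one optimizer step per batch, and no gradient accumulation; none of these details are confirmed by the training notes above.

```python
import math


def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a full run, counting a final partial batch per epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


NUM_EXAMPLES = 15_000 + 350  # assumption: both datasets, no validation split

for batch_size in (6, 8):
    for epochs in (50, 100):
        steps = total_steps(NUM_EXAMPLES, batch_size, epochs)
        print(f"batch={batch_size} epochs={epochs} steps={steps}")
```

Under these assumptions, a 50-epoch run at batch size 8 is about 96,000 steps, and doubling to 100 epochs doubles that, which is worth weighing against available GPU time.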

Demo

Try the interactive demo at [your-demo-link-here] or run locally:

  1. Clone the repository
  2. Install dependencies: pip install -r requirements.txt
  3. Run the server: python app.py
  4. Open your browser to http://localhost:8000

Demo Features:

  • Text input in Urdu or Roman Urdu
  • Speaker selection dropdown
  • Real-time audio generation
  • Responsive web interface
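The contents of `app.py` are not reproduced here, so the following is only a sketch of how a FastAPI endpoint for this demo might be wired. The `/synthesize` route, the request fields, and the `synthesize` callback are hypothetical; the speaker names come from the usage examples in this README, and 16 kHz is the sample rate of the standard SpeechT5 vocoder. FastAPI and Pydantic are imported lazily inside `create_app` so the helpers can be used without them.

```python
import io
import struct
import wave

SPEAKERS = ("default", "zia_mohiuddin")  # names taken from the usage examples
SAMPLE_RATE = 16_000  # SpeechT5's HiFi-GAN vocoder produces 16 kHz audio


def validate_request(text, speaker):
    """Reject empty text and unknown speakers before running the model."""
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    return text, speaker


def floats_to_wav_bytes(samples, sample_rate=SAMPLE_RATE):
    """Encode float samples in [-1, 1] as a mono 16-bit PCM WAV file in memory."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


def create_app(synthesize):
    """Build the app; `synthesize(text, speaker)` is a hypothetical model callback."""
    from fastapi import FastAPI, HTTPException
    from fastapi.responses import Response
    from pydantic import BaseModel

    class SynthesisRequest(BaseModel):
        text: str
        speaker: str = "default"

    app = FastAPI()

    @app.post("/synthesize")
    def synthesize_endpoint(request: SynthesisRequest):
        try:
            text, speaker = validate_request(request.text, request.speaker)
        except ValueError as exc:
            raise HTTPException(status_code=422, detail=str(exc))
        samples = synthesize(text, speaker)
        return Response(content=floats_to_wav_bytes(samples), media_type="audio/wav")

    return app
```

Returning the WAV bytes directly lets a browser front end play the result in an audio element without writing files on the server.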

Installation

git clone https://github.com/your-username/urdu-tts-voice-cloning.git
cd urdu-tts-voice-cloning
pip install -r requirements.txt

Usage

Python API

from model import UrduTTS

tts = UrduTTS()
audio = tts.generate_text_to_speech(
    text="یہ ایک مثال ہے",  # Urdu text
    speaker="zia_mohiuddin"  # Optional speaker selection
)
audio.save("output.wav")
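The internals of `UrduTTS` are not shown in this README. For readers who want to load a fine-tuned checkpoint directly, the sketch below uses the SpeechT5 classes from Hugging Face Transformers. The `checkpoint` path, the use of the stock `microsoft/speecht5_hifigan` vocoder, and the helper names are assumptions; the speaker embedding is expected to be a 512-dimensional x-vector, which is what SpeechT5 takes by default.

```python
XVECTOR_DIM = 512  # default SpeechT5 speaker embedding size


def as_speaker_batch(embedding):
    """Validate a speaker embedding and wrap it as a batch of one."""
    values = [float(v) for v in embedding]
    if len(values) != XVECTOR_DIM:
        raise ValueError(f"expected {XVECTOR_DIM} values, got {len(values)}")
    return [values]


def synthesize(text, embedding, checkpoint):
    """Generate a waveform; `checkpoint` is a hypothetical fine-tuned model path."""
    # Heavy dependencies are imported lazily so the helper above stays lightweight.
    import torch
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)
    # Assumption: the stock vocoder is reused rather than a fine-tuned one.
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text=text, return_tensors="pt")
    speaker = torch.tensor(as_speaker_batch(embedding))  # shape (1, 512)
    with torch.no_grad():
        speech = model.generate_speech(inputs["input_ids"], speaker, vocoder=vocoder)
    return speech.numpy()  # 1-D float waveform at 16 kHz
```

The returned array could then be written to disk with SoundFile, for example `soundfile.write("output.wav", waveform, samplerate=16000)`.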

Roman Urdu Support

# Reuses the `tts` instance created in the Python API example
audio = tts.generate_text_to_speech(
    text="Ye aik misaal hai",  # Roman Urdu
    speaker="default"
)
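This README does not say whether the model needs to be told which script the input uses. If an application wants to know, for instance to route Roman Urdu through a future normalization step, a simple character-count heuristic such as the hypothetical `detect_script` below could serve as a starting point.

```python
def detect_script(text):
    """Guess whether text is Urdu script or Roman Urdu by counting letters."""
    # Urdu letters fall inside the Unicode Arabic block (U+0600 to U+06FF).
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if arabic == 0 and latin == 0:
        return "unknown"
    return "urdu" if arabic >= latin else "roman_urdu"


for sample in ("یہ ایک مثال ہے", "Ye aik misaal hai"):
    print(sample, detect_script(sample))
```

Mixed-script input is resolved by majority here, which is a deliberate simplification.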

Performance Notes

  • The current model achieves mid-level quality: output sounds natural, but there is room for improvement
  • Best results obtained with:
    • 40+ training epochs
    • Batch sizes of 6-8
    • Adequate GPU memory (recommended: 16GB+)
  • Voice cloning works best with clear reference recordings

Future Improvements

  • Increase training epochs to 100+
  • Experiment with larger SpeechT5 variants
  • Expand dataset with more diverse speakers
  • Implement Roman Urdu normalization
  • Add prosody control features
  • Optimize for real-time applications

Tech Stack

  • Model: SpeechT5 (fine-tuned)
  • Backend: FastAPI
  • Frontend: HTML/CSS/JavaScript
  • Audio Processing: Librosa, SoundFile
  • ML Framework: PyTorch, Transformers

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

  • Original SpeechT5 model by Microsoft Research
  • xcollab tts dataset contributors
  • Zia Mohiuddin dataset providers
