A Python-based implementation for training Tacotron 2 Text-to-Speech (TTS) models. This repository provides a comprehensive framework for preparing your dataset, configuring training parameters, and running the training process with multi-language support.
- Features
- Requirements
- Setup
- Dataset Preparation
- Configuration (`config.yaml`)
- Usage
- Key Configuration Parameters (⚠️ Needs Attention!)
- Common Issues and Solutions
- Text Cleaners and Language Support
- Output Files
- Training Workflow
- Advanced Features
- License
- Citations
## Features

- **Complete Tacotron 2 Training Pipeline**: Implements the core Tacotron 2 architecture for TTS.
- **Multi-Language Support**: Works with English, Turkish, Spanish, Arabic, and many other languages through customizable text cleaners and symbol sets.
- **Dataset Preprocessing**: Includes utilities to:
  - Generate Mel spectrograms from `.wav` files.
  - Create file lists (`list.txt`) from metadata (`metadata.csv`).
  - Check dataset integrity and file existence.
- **Flexible Configuration**: Uses a `config.yaml` file for easy management of paths, hyperparameters, and training settings.
- **Advanced Training Features**:
  - **Checkpointing**: Saves model checkpoints periodically and allows resuming training.
  - **Warm Start**: Initializes training from a pre-trained model's weights.
  - **Validation**: Performs validation runs at specified intervals.
  - **Mixed-Precision Training (FP16)**: Optional, for faster training and reduced memory usage.
  - **Distributed Training**: Supports multi-GPU training using `torch.distributed`.
- **Comprehensive Logging**: Monitor training through the console, log files (`training.log`), and TensorBoard.
## Requirements

- Python 3.7+
- PyTorch (>= 1.7 recommended; check CUDA compatibility if using a GPU)
- CUDA Toolkit & cuDNN (for GPU acceleration)
- Additional Python packages:

  ```bash
  pip install torch torchvision torchaudio tensorboard matplotlib tensorflow numpy PyYAML tqdm num2words librosa inflect scipy
  ```

  (Adjust the PyTorch installation command based on your system and CUDA version; see the PyTorch Official Website.)
## Setup

- Install Git LFS (for downloading pretrained models):

  ```bash
  git lfs install
  ```

- Set up a virtual environment (recommended):

  ```bash
  python -m venv venv
  venv/Scripts/activate     # Windows
  source venv/bin/activate  # Linux/macOS
  ```

- Clone the repository:

  ```bash
  git clone https://github.com/gokhaneraslan/tacotron2-tts-training.git
  cd tacotron2-tts-training
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Dataset Preparation

If you're unsure how to prepare your dataset, please check the tts-dataset-generator repository, which provides tools and instructions for creating properly formatted TTS datasets.

The training script expects your dataset in a specific format:
```
/base_project_path/
└── dataset_name/
    ├── metadata.csv
    ├── list.txt           (generated automatically)
    └── wavs/
        ├── segment_1.wav
        ├── segment_2.wav
        ├── segment_1.npy  (generated if configured)
        ├── segment_2.npy  (generated if configured)
        └── ...
```
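As a quick sanity check before training, a small script along these lines can verify the layout. This is a hedged sketch; `check_layout` is a hypothetical helper, not part of this repository:

```python
from pathlib import Path

def check_layout(root: Path) -> list:
    """Return a list of problems found in the expected dataset layout."""
    problems = []
    if not (root / "metadata.csv").is_file():
        problems.append("missing metadata.csv")
    wavs = root / "wavs"
    if not wavs.is_dir():
        problems.append("missing wavs/ directory")
    elif not any(wavs.glob("*.wav")):
        problems.append("wavs/ contains no .wav files")
    return problems

# Replace with your own base_project_path / dataset_name.
print(check_layout(Path("/base_project_path/dataset_name")))
```

An empty list means the basic structure is in place; the training script performs its own, more thorough checks later.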
Create a metadata file (e.g., `metadata.csv`) with three pipe-separated columns:

```
<wav_filename_without_extension>|<original_text>|<normalized_transcription>
```

Example lines:

```
segment_1|This is the original text.|this is the normalized text
segment_2|Another example.|another example
```
**IMPORTANT**: The script primarily uses the third column (`normalized_transcription`) as the text input for the model. Ensure this text is clean and suitable for your target language.
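Because a single malformed row can break training, it can be worth linting the metadata file first. The sketch below is a hypothetical helper (not part of the repository) that flags rows lacking exactly three pipe-separated fields:

```python
def lint_metadata(path):
    """Return (line_number, message) pairs for malformed metadata rows."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split("|")
            if len(fields) != 3:
                errors.append((lineno, f"expected 3 fields, got {len(fields)}"))
            elif not fields[2].strip():
                # The third column is what the model actually trains on.
                errors.append((lineno, "empty normalized transcription"))
    return errors
```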
- Place all `.wav` audio files inside a directory named `wavs` within your main dataset directory.
- **Critical**: Ensure all `.wav` files have the same sampling rate as specified in `config.yaml` (`sampling_rate`).

If you don't know your audio's sample rate:

```python
import torchaudio

audio_path = "your_audio_path"  # check one of your audio files
print(torchaudio.info(audio_path))
```

Then set:

- `sampling_rate` = sample rate from above
- `mel_fmax` = sample_rate / 2
The training script will automatically generate or update a file named `list.txt` inside your dataset directory. This file links audio/mel files to their transcriptions for the data loader.

- If you set `generate_mels: True` in `config.yaml`, the script will process all `.wav` files and save the corresponding Mel spectrograms as `.npy` files.
- If `load_mel_from_disk: True`, the script will use these pre-generated `.npy` files, which significantly speeds up data loading.
## Configuration (`config.yaml`)

Create a configuration file at `config/config.yaml`. Below is a comprehensive example with key sections:

```yaml
# --- Paths (⚠️ CRITICAL - SET THESE CAREFULLY ⚠️) ---
base_project_path: /path/to/your/project/tacotron2-tts-training # Root directory
dataset_name: MyTTSDataset # Name of the dataset directory
output_base_path: output # Base directory for outputs
output_dir_name: checkpoints # Subdirectory for model checkpoints
log_dir_name: logs # Subdirectory for logs
model_filename: tacotron_model # Base name for checkpoint files
metadata_filename: metadata.csv # Name of your metadata file
filelist_name: list.txt # Name for the generated filelist
default_pretrained_path: /path/to/your/project/tacotron2-tts-training/pretrained_model/tacotron2_statedict.pt # Optional path for warm start

# --- Training Control ---
epochs: 250 # Total training epochs
warm_start: False # True: Load only model weights, False: Load full checkpoint
save_interval: 10 # Save main checkpoint every N epochs
backup_interval: 25 # Save backup checkpoint every N epochs
validation_interval: 5 # Run validation every N epochs
log_interval: 100 # Log training progress every N iterations

# --- Data Loading & Preprocessing (⚠️ IMPORTANT ⚠️) ---
generate_mels: True # Generate Mel spectrograms before training
load_mel_from_disk: True # Load pre-generated mels during training

# --- Language Selection (⚠️ CHOOSE ONE ⚠️) ---
language: turkish
#language: english
#language: spanish
#language: french
#language: german
#language: italian
#language: portuguese

# --- Text Cleaners Selection (⚠️ CHOOSE ONE ⚠️) ---
text_cleaners: turkish_cleaners
#text_cleaners: english_cleaners
#text_cleaners: spanish_cleaners
#text_cleaners: french_cleaners
#text_cleaners: german_cleaners
#text_cleaners: italian_cleaners
#text_cleaners: portuguese_cleaners

# --- Hardware & Performance ---
n_gpus: 1 # Number of GPUs for training
rank: 0 # Process rank for distributed training
fp16_run: True # Enable mixed-precision training
cudnn_enabled: True # Enable cuDNN backend
cudnn_benchmark: False # Enable cuDNN benchmark mode
num_workers: 4 # CPU workers for data loading
batch_size: 4 # Adjust based on GPU memory

# --- Audio Parameters (⚠️ CRITICAL - MATCH YOUR DATASET ⚠️) ---
sampling_rate: 22050 # The exact sampling rate of ALL your .wav files
mel_fmax: 11025.0 # Max frequency for Mel spectrograms (usually sampling_rate / 2)

# --- Optimizer & Learning Rate ---
lr_schedule_A: 1e-4 # Initial learning rate
lr_schedule_B: 8000 # Decay rate factor
lr_schedule_C: 0 # LR schedule offset
lr_schedule_decay_start: 10000 # Iteration at which LR decay begins
min_learning_rate: 1e-5 # Minimum learning rate clamp
p_attention_dropout: 0.1 # Dropout rate for attention
p_decoder_dropout: 0.1 # Dropout rate for decoder

# --- Logging & Visualization ---
show_alignments: True # Log alignment plots to TensorBoard
alignment_graph_height: 600 # Height of alignment plot
alignment_graph_width: 1000 # Width of alignment plot
```

## Usage

- Configure `config/config.yaml` with your paths and settings
- Prepare your dataset as described above
- Start training:

  ```bash
  python train.py
  ```

- Monitor progress:

  ```bash
  tensorboard --logdir /path/to/your/project/output/logs
  ```
| Purpose | Command | Notes |
|---|---|---|
| Standard Training | `python train.py` | Uses settings from `config.yaml` |
| Monitor Training | `tensorboard --logdir output/logs` | Open the URL in a browser (e.g., http://localhost:6006/) |
| Check Audio Sample Rate | `python -c "import torchaudio; print(torchaudio.info('path/to/audio.wav'))"` | Verify the sampling rate matches `config.yaml` |
## Key Configuration Parameters (⚠️ Needs Attention!)

| Parameter | Description | Recommendation |
|---|---|---|
| `base_project_path`, `dataset_name` | Root paths for your project | Verify these point to the correct directories |
| `sampling_rate` | Audio sample rate | MUST match all your `.wav` files exactly |
| `text_cleaners` | Language cleaner | Must match your dataset's language |
| `batch_size` | Batch size for training | Start small (4-8) and increase if memory allows |
| `generate_mels` | Generate spectrograms | `True` for the first run, `False` for subsequent runs |
| `load_mel_from_disk` | Use pre-generated mels | Keep `True` after the first run for faster training |
| `warm_start` | Loading behavior | `True` to load model weights only, `False` to resume from a checkpoint |
## Common Issues and Solutions

| Issue | Possible Causes | Solutions |
|---|---|---|
| `FileNotFoundError` | Incorrect paths in config | Double-check all paths in `config.yaml` and the dataset structure |
| CUDA out of memory | Batch size too large | Reduce `batch_size` in `config.yaml` |
| Mismatched tensor shapes | Sample-rate mismatch or wrong cleaner | Verify `sampling_rate` matches your WAVs and `text_cleaners` matches your language |
| NaN loss values | Learning rate too high or data issues | Reduce `lr_schedule_A` or check for silent/corrupt audio files |
| Slow training | Not using pre-generated mels | Set `generate_mels: True` once, then `load_mel_from_disk: True` |
| Poor audio quality | Wrong cleaner or insufficient training | Verify the correct `text_cleaners` for your language and train longer |
| Distributed training issues | Environment setup | Configure environment variables properly and check the NCCL installation |
## Text Cleaners and Language Support

The `text_cleaners` parameter selects which set of text normalization rules and character sets to use:

| Language | Cleaner Name | Features |
|---|---|---|
| Turkish | `turkish_cleaners` | Handles Turkish characters (İ, ı, etc.) |
| English | `english_cleaners` | Standard English processing |
| Spanish | `spanish_cleaners` | Spanish-specific character handling |
| French | `french_cleaners` | French accent and character support |
| German | `german_cleaners` | German-specific characters (ä, ö, ü, ß) |
| Italian | `italian_cleaners` | Italian accent handling |
| Portuguese | `portuguese_cleaners` | Portuguese-specific processing |
| Russian | `russian_cleaners` | Cyrillic character support |
| Arabic | `arabic_cleaners` | Arabic script handling |
Each cleaner typically performs:

- Lowercase conversion
- Whitespace normalization
- Language-specific character normalization
- Number-to-words expansion (using `num2words`)
- Punctuation handling
## Output Files

Training artifacts are saved in the directory specified by `output_base_path`:

| File | Path | Description |
|---|---|---|
| Main Checkpoints | `output/checkpoints/tacotron_model.pt` | Latest checkpoint, saved every `save_interval` epochs |
| Backup Checkpoints | `output/checkpoints/tacotron_model_epoch_<N>.pt` | Backups saved every `backup_interval` epochs |
| Final Model | `output/checkpoints/tacotron_model_final.pt` | Model saved at the end of training |
| Log File | `output/logs/training.log` | Detailed training logs |
| TensorBoard | `output/logs/events.out.tfevents.*` | Files for TensorBoard visualization |
## Training Workflow

The script performs these steps automatically based on your `config.yaml`:

- **Load Configuration**: Reads settings from `config/config.yaml`
- **Create/Update Filelist**: Generates or updates `list.txt` from your `metadata.csv`
- **Calculate Audio Duration**: Scans `.wav` files to report the total duration
- **Generate Mel Spectrograms**: Creates `.npy` files for each `.wav` file (if `generate_mels: True`)
- **Update Filelist for Mels**: Points the filelist to the `.npy` files (if `load_mel_from_disk: True`)
- **Check Dataset**: Verifies that every file listed in the final `list.txt` exists
- **Initialize Training**: Sets up the model, optimizer, data loaders, and logger
- **Load Checkpoint/Warm Start**: Loads an existing checkpoint or pre-trained model if configured
- **Start Training Loop**: Begins training with regular validation and checkpoint saving
## Advanced Features

### Mixed-Precision Training

Set `fp16_run: True` to enable mixed-precision training, which can significantly speed up training on modern GPUs with minimal accuracy loss.

### Multi-GPU Training

To use multiple GPUs:

- Set `n_gpus` to the number of available GPUs
- Ensure NCCL is installed if using NVIDIA GPUs
- Set the appropriate environment variables or use a launcher such as `torchrun`

### Warm Start vs. Resume

Use the `warm_start` parameter to control how previous checkpoints are loaded:

- `warm_start: True`: Loads only the model weights from `checkpoint_path` or `default_pretrained_path`. Useful for fine-tuning on a new dataset or changing optimizers.
- `warm_start: False`: Loads the entire state (model, optimizer, iteration count, learning rate) and resumes training exactly where it left off.
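In PyTorch terms, the difference between the two modes can be sketched roughly as follows. The checkpoint key names (`state_dict`, `optimizer`, `iteration`) are assumptions for illustration; this repository's checkpoints may use different keys:

```python
import torch

def load_for_training(model, optimizer, checkpoint_path, warm_start):
    """Return the iteration to resume from, loading state per warm_start mode."""
    # NOTE: key names below are illustrative, not necessarily the repo's.
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    if warm_start:
        # Weights only: optimizer state and iteration count start from scratch.
        model.load_state_dict(ckpt["state_dict"], strict=False)
        return 0
    # Full resume: restore optimizer state and continue at the saved iteration.
    model.load_state_dict(ckpt["state_dict"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["iteration"]
```

`strict=False` in the warm-start branch lets fine-tuning proceed even when the new model's layer set differs slightly from the pretrained one.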
## License

This implementation builds upon NVIDIA's Tacotron 2 implementation, which is licensed under the BSD 3-Clause License.
## Citations

```bibtex
@article{shen2018natural,
  title={Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions},
  author={Shen, Jonathan and Pang, Ruoming and Weiss, Ron J and Schuster, Mike and Jaitly, Navdeep and Yang, Zongheng and Chen, Zhifeng and Zhang, Yu and Wang, Yuxuan and Skerry-Ryan, RJ and others},
  journal={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018}
}

@inproceedings{wang2017tacotron,
  title={Tacotron: Towards End-to-End Speech Synthesis},
  author={Yuxuan Wang and RJ Skerry-Ryan and Daisy Stanton and Yonghui Wu and Ron J. Weiss and Navdeep Jaitly and Zongheng Yang and Ying Xiao and Zhifeng Chen and Samy Bengio and Quoc Le and Yannis Agiomyrgiannakis and Rob Clark and Rif A. Saurous},
  booktitle={Proceedings of Interspeech},
  year={2017}
}
```
For more details about the Tacotron 2 architecture, refer to the original paper.