A modular framework for evaluating Arabic Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems.
This framework provides:
- TTS Generation: Generate audio from text using various TTS models
- ASR Transcription: Transcribe audio using ASR models
- Evaluation: Calculate WER/CER metrics with Arabic text normalization
- Audio Quality Metrics: STOI, PESQ, Duration Error, and MCD
Installation:

```bash
# Core dependencies
pip install torch transformers soundfile pandas jiwer tqdm python-dotenv

# Audio quality metrics (optional but recommended)
pip install pystoi pesq librosa scipy
```

Usage:

```bash
# Full TTS-ASR pipeline with audio quality metrics
python main.py --dataset clArTTS --tts-model mms-tts-ara --asr-model whisper-large-v3
# ASR-only evaluation
python main.py --dataset everyayah --asr-model whisper-large-v3
# Skip TTS (use existing audio)
python main.py --dataset clArTTS --tts-model mms-tts-ara --asr-model whisper-large-v3 --skip-tts
# Skip audio quality metrics (faster evaluation)
python main.py --dataset clArTTS --tts-model mms-tts-ara --asr-model whisper-large-v3 --skip-audio-metrics
```

TTS Models:
- mms-tts-ara: Meta MMS-TTS Arabic
- openaudio-s1-mini: OpenAudio S1-mini (Fish Speech)
- elevenlabs-multilingual-v2: ElevenLabs API
- minimax-speech-02-hd: MiniMax API

ASR Models:
- whisper-large-v3: OpenAI Whisper Large V3
- qwen3-omni: Qwen3-Omni 30B
- conformer-ctc: NeMo Conformer-CTC

Datasets:
- clArTTS: Classical Arabic TTS dataset (205 samples)
- everyayah: Quran recitation dataset (~6,000 samples)
- arvoice: Arabic voice dataset
- Ruisheng_TTS: Ruisheng TTS dataset (68 samples)
Results are saved in results/{dataset}/{tts_model}_to_{asr_model}/:
- generated_audio/: Generated WAV files
- transcriptions.jsonl: ASR transcriptions
- evaluation_results.csv: Per-sample metrics (WER, CER, STOI, PESQ, DE, MCD)
- evaluation_summary.csv: Overall metrics with averages
- timing.json: Performance metrics
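
The per-sample CSV can be inspected directly with pandas (already a core dependency). A minimal sketch, assuming a finished run with the layout above:

```python
import pandas as pd

# Hypothetical run directory following the layout described above
run_dir = "results/clArTTS/mms-tts-ara_to_whisper-large-v3"
df = pd.read_csv(f"{run_dir}/evaluation_results.csv")

print(df.head())                   # per-sample rows
print(df.mean(numeric_only=True))  # quick aggregate over all numeric metrics
```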
Text Metrics:
- WER (Word Error Rate): Word-level transcription accuracy
- CER (Character Error Rate): Character-level transcription accuracy
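
As a toy illustration, both can be computed with jiwer (already a core dependency); note that the framework additionally applies Arabic text normalization before scoring, which this sketch omits:

```python
import jiwer

reference = "النص العربي هنا"    # ground-truth text
hypothesis = "النص العربي هناك"  # ASR output with one wrong word

print("WER:", jiwer.wer(reference, hypothesis))  # 1 of 3 words wrong -> ~0.33
print("CER:", jiwer.cer(reference, hypothesis))  # one inserted character
```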
Audio Quality Metrics:
- STOI (Short-Time Objective Intelligibility): Speech intelligibility (0-1, higher is better)
- PESQ (Perceptual Evaluation of Speech Quality): Speech quality (-0.5 to 4.5, higher is better)
- DE (Duration Error): Relative duration difference (0 to inf, lower is better)
- MCD (Mel-Cepstral Distortion): Spectral distance (lower is better, <6.0 is good)
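
A sketch of scoring one reference/generated pair with pystoi and pesq (the optional dependencies above); the file paths are hypothetical, and both files are assumed to be 16 kHz mono WAVs:

```python
import soundfile as sf
from pesq import pesq
from pystoi import stoi

# Hypothetical paths: the reference recording vs. the TTS output for the same text
ref, fs = sf.read("datasets/clArTTS/wav/00000.wav")
deg, _ = sf.read("results/clArTTS/mms-tts-ara_to_whisper-large-v3/generated_audio/00000.wav")

# Truncate to the shorter signal so the frame-wise comparison lines up
n = min(len(ref), len(deg))
ref, deg = ref[:n], deg[:n]

print("STOI:", stoi(ref, deg, fs))        # 0-1, higher is better
print("PESQ:", pesq(fs, ref, deg, "wb"))  # wideband mode requires fs == 16000
```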
Adding a New Dataset:
- Prepare dataset structure:

```text
datasets/my_dataset/
├── metadata.csv
└── wav/
    ├── 00000.wav
    └── ...
```
- Create metadata.csv:

```csv
id,file,text
0,00000.wav,النص العربي هنا
1,00001.wav,نص آخر
```
- Register in src/benchmark/config/dataset_config.py:

```python
"my_dataset": DatasetConfig(
    name="my_dataset",
    metadata_file="datasets/my_dataset/metadata.csv",
    audio_dir="datasets/my_dataset/wav",
    id_column="id",
    text_column="text",
    audio_column="file"
),
```

Adding a New TTS Model:
- Create TTS module in src/benchmark/modules/tts/my_tts.py:
```python
from .base_tts import BaseTTS


class MyTTS(BaseTTS):
    def load(self):
        # Load your model
        pass

    def synthesize(self, text: str, output_path: str) -> tuple[float, float]:
        # Generate audio and save to output_path
        # Return (generation_time, audio_duration)
        pass
```
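
For a concrete starting point, here is a minimal sketch of such a module built on the public facebook/mms-tts-ara VITS checkpoint from Hugging Face (the same family as the bundled mms-tts-ara option). The MmsExampleTTS name is hypothetical, and how BaseTTS subclasses are constructed is an assumption; only the load/synthesize signatures come from the template above.

```python
import time

import soundfile as sf
import torch
from transformers import AutoTokenizer, VitsModel

from .base_tts import BaseTTS


class MmsExampleTTS(BaseTTS):  # hypothetical name, for illustration only
    def load(self):
        # facebook/mms-tts-ara is a public Arabic VITS checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-ara")
        self.model = VitsModel.from_pretrained("facebook/mms-tts-ara")

    def synthesize(self, text: str, output_path: str) -> tuple[float, float]:
        start = time.time()
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            # VitsModel returns the waveform directly, shape (batch, samples)
            waveform = self.model(**inputs).waveform[0].cpu().numpy()
        generation_time = time.time() - start

        rate = self.model.config.sampling_rate  # 16 kHz for MMS-TTS
        sf.write(output_path, waveform, rate)
        return generation_time, len(waveform) / rate
```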
- Register in src/benchmark/modules/tts/__init__.py:

```python
from .my_tts import MyTTS
```

- Add config in src/benchmark/config/model_config.py:
"my-tts": TTSModelConfig(
    model_name="my-tts",
    model_type="my_tts",
    model_path="models/my-tts",
    device="cuda",
    sampling_rate=16000
),- Create ASR module in src/benchmark/modules/asr/my_asr.py:
```python
from .base_asr import BaseASR


class MyASR(BaseASR):
    def load(self):
        # Load your model
        pass

    def transcribe(self, audio_path: str) -> tuple[str, float]:
        # Transcribe audio
        # Return (transcription, transcription_time)
        pass
```
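
Similarly, a minimal sketch of a module wrapping the transformers ASR pipeline with openai/whisper-large-v3 (the checkpoint behind the bundled whisper-large-v3 option). The WhisperExampleASR name is hypothetical; only the load/transcribe signatures come from the template above.

```python
import time

from transformers import pipeline

from .base_asr import BaseASR


class WhisperExampleASR(BaseASR):  # hypothetical name, for illustration only
    def load(self):
        # openai/whisper-large-v3 is the public Whisper Large V3 checkpoint
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-large-v3",
            generate_kwargs={"language": "arabic"},
        )

    def transcribe(self, audio_path: str) -> tuple[str, float]:
        start = time.time()
        result = self.pipe(audio_path)
        return result["text"], time.time() - start
```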
- Register in src/benchmark/modules/asr/__init__.py:

```python
from .my_asr import MyASR
```

- Add config in src/benchmark/config/model_config.py:
"my-asr": ASRModelConfig(
    model_name="my-asr",
    model_type="my_asr",
    model_path="models/my-asr",
    device="cuda",
    language="ar"
),
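
Once registered, the new components should be selectable by name through the same CLI as above (using the hypothetical names from this section):

```bash
python main.py --dataset my_dataset --tts-model my-tts --asr-model my-asr
```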