Arabic TTS models (FastPitch, MixerTTS) from the tts-arabic-pytorch repo in the ONNX format, usable as a Python package for offline speech synthesis.
Audio samples can be found here.
Install with
pip install git+https://github.com/nipponjo/tts_arabic.git
Examples
# %%
from tts_arabic import tts
# %%
text = "اَلسَّلامُ عَلَيكُم يا صَدِيقِي."
wave = tts(text, speaker=2, pace=0.9, play=True)
# %% Buckwalter transliteration
text = ">als~alAmu Ealaykum yA Sadiyqiy."
wave = tts(text, speaker=0, play=True)
# %% Unvocalized input
text_unvoc = "القهوة مشروب يعد من بذور البن المحمصة"
wave = tts(text_unvoc, play=True, vowelizer='shakkelha')
Pretrained models
| Model | Model ID | Type | #params | Paper | Output |
|---|---|---|---|---|---|
| FastPitch | fastpitch | Text->Mel | 46.3M | arxiv | Mel (80 bins) |
| MixerTTS | mixer128 | Text->Mel | 2.9M | arxiv | Mel (80 bins) |
| MixerTTS | mixer80 | Text->Mel | 1.5M | arxiv | Mel (80 bins) |
| HiFi-GAN | hifigan | Vocoder | 13.9M | arxiv | Wave (22.05kHz) |
| Vocos | vocos | Vocoder | 13.4M | arxiv | Wave (22.05kHz) |
| Vocos | vocos44 | Vocoder | 14.0M | arxiv | Wave (44.1kHz) |
The sequence of transformations is as follows:
Text → Phonemizer → Phonemes → Tokenizer → Token Ids → Text->Mel model → Mel spectrogram → Vocoder model → Wave
The Text->Mel models map token ids to mel frames. All models use the 80-bin mel configuration proposed by HiFi-GAN; this mel spectrogram contains frequencies up to 8 kHz. The vocoder models map the mel spectrogram to a waveform. The vocoders with vocoder_id hifigan and vocos artificially extend the bandwidth to 11025 Hz, and vocos44 to 22050 Hz. Samples for comparing the models can be found here.
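As a rough sketch of how the two stages fit together, the Text->Mel model and the vocoder can be mixed and matched by passing the model ids from the table above to tts() via model_id and vocoder_id (documented under TTS options below); the output sample rate follows the chosen vocoder. The variable names below are illustrative, and the returned wave is assumed to be an array of audio samples.
# %% Pick a Text->Mel model and a vocoder by their model ids
from tts_arabic import tts
text = ">als~alAmu Ealaykum yA Sadiyqiy."
# small MixerTTS model + Vocos vocoder (22.05 kHz output)
wave_22k = tts(text, model_id='mixer80', vocoder_id='vocos', play=True)
# FastPitch + 44.1 kHz Vocos vocoder
wave_44k = tts(text, model_id='fastpitch', vocoder_id='vocos44', play=True)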
Manuscript
More information about how the models were trained can be found in the manuscript Arabic TTS with FastPitch: Reproducible Baselines, Adversarial Training, and Oversmoothing Analysis (arXiv | ResearchGate).
TTS options
from tts_arabic import tts
text = "اَلسَّلامُ عَلَيكُم يا صَدِيقِي."
wave = tts(
    text,                     # input text
    speaker = 1,              # speaker id; choose between 0, 1, 2, 3
    pace = 1,                 # speaking pace
    denoise = 0.005,          # vocoder denoiser strength
    volume = 0.9,             # max amplitude (between 0 and 1)
    play = True,              # play audio?
    pitch_mul = 1,            # pitch multiplier
    pitch_add = 0,            # pitch offset
    vowelizer = None,         # optional vowelizer model id
    model_id = 'fastpitch',   # model id for the Text->Mel model
    vocoder_id = 'hifigan',   # model id for the vocoder model
    cuda = None,              # optional; CUDA device index
    save_to = './test.wav',   # optional; path for saving the audio as a WAV file
    bits_per_sample = 32,     # bit depth when save_to is specified (8, 16 or 32 bits)
)
Vowelizer models
| Model | Model ID | Paper | Repo | Architecture |
|---|---|---|---|---|
| CATT | catt_eo | arxiv | github | Transformer Encoder |
| Shakkelha | shakkelha | arxiv | github | Bi-LSTM |
| Shakkala | shakkala | - | github | Bi-LSTM |
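As a minimal sketch for trying the diacritization models, any of the vowelizer model ids listed above can be passed to tts() via the vowelizer option (the example in the Examples section uses 'shakkelha'); the output file names below are illustrative and rely on the save_to option from TTS options.
# %% Compare the vowelizer models on the same unvocalized sentence
from tts_arabic import tts
text_unvoc = "القهوة مشروب يعد من بذور البن المحمصة"
for vowelizer_id in ('catt_eo', 'shakkelha', 'shakkala'):
    tts(text_unvoc,
        vowelizer=vowelizer_id,                  # diacritization model applied before synthesis
        save_to=f'./sample_{vowelizer_id}.wav',  # illustrative output path
        play=False)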
References
The vocoder vocos44 was converted from patriotyk/vocos-mel-hifigan-compat-44100khz.
The vowelizer catt_eo was converted from best_eo_mlm_ns_epoch_193.pt (https://github.com/abjadai/catt/releases/tag/v2, License: Apache-2.0).
