nipponjo/tts_arabic


Arabic TTS models (FastPitch, MixerTTS) from the tts-arabic-pytorch repo in the ONNX format, usable as a Python package for offline speech synthesis.

Audio samples can be found here.

Install with

pip install git+https://github.com/nipponjo/tts_arabic.git

Examples

# %%
from tts_arabic import tts

# %%
text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."
wave = tts(text, speaker=2, pace=0.9, play=True)

# %% Buckwalter transliteration
text = ">als~alAmu Ealaykum yA Sadiyqiy."
wave = tts(text, speaker=0, play=True)

# %% Unvocalized input
text_unvoc = "القهوة مشروب يعد من بذور البن المحمصة"
wave = tts(text_unvoc, play=True, vowelizer='shakkelha')
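
The synthesized wave can also be written to disk directly via the save_to option documented under TTS options below. A minimal sketch, assuming the package is installed as above; the output path is only an example:

# %% Save the output as a WAV file
text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."
wave = tts(text, speaker=1, save_to='./salam.wav', bits_per_sample=16)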

Pretrained models

| Model | Model ID | Type | #params | Paper | Output |
| --- | --- | --- | --- | --- | --- |
| FastPitch | fastpitch | Text->Mel | 46.3M | arxiv | Mel (80 bins) |
| MixerTTS | mixer128 | Text->Mel | 2.9M | arxiv | Mel (80 bins) |
| MixerTTS | mixer80 | Text->Mel | 1.5M | arxiv | Mel (80 bins) |
| HiFi-GAN | hifigan | Vocoder | 13.9M | arxiv | Wave (22.05 kHz) |
| Vocos | vocos | Vocoder | 13.4M | arxiv | Wave (22.05 kHz) |
| Vocos | vocos44 | Vocoder | 14.0M | arxiv | Wave (44.1 kHz) |

The sequence of transformations is as follows:

Text → Phonemizer → Phonemes → Tokenizer → Token Ids → Text->Mel model → Mel spectrogram → Vocoder model → Wave

The Text->Mel models map token ids to mel frames. All models use the 80-bin mel configuration proposed by HiFi-GAN, which covers frequencies up to 8 kHz. The vocoder models map the mel spectrogram back to a waveform. The vocoders with vocoder_id hifigan and vocos artificially extend the bandwidth to 11025 Hz, while vocos44 extends it to 22050 Hz. Samples for comparing the models can be found here.
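
Any Text->Mel model can be paired with any vocoder through the model_id and vocoder_id arguments listed under TTS options below. A minimal sketch, assuming the same tts import as in the examples, pairing the small MixerTTS model with the Vocos vocoder (both IDs taken from the table above):

# %% mixer128 (Text->Mel) + vocos (vocoder)
from tts_arabic import tts

text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."
wave = tts(text, speaker=0, model_id='mixer128', vocoder_id='vocos', play=True)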

Manuscript

More information about how the models were trained can be found in the manuscript Arabic TTS with FastPitch: Reproducible Baselines, Adversarial Training, and Oversmoothing Analysis (arXiv | ResearchGate).

TTS options

from tts_arabic import tts

text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."
wave = tts(
    text, # input text
    speaker = 1, # speaker id; choose from 0, 1, 2 or 3
    pace = 1, # speaker pace
    denoise = 0.005, # vocoder denoiser strength
    volume = 0.9, # Max amplitude (between 0 and 1)
    play = True, # play audio?
    pitch_mul = 1, # pitch multiplier
    pitch_add = 0, # pitch offset
    vowelizer = None, # vowelizer model
    model_id = 'fastpitch', # Model ID for Text->Mel model
    vocoder_id = 'hifigan', # Model ID for vocoder model
    cuda = None, # Optional; CUDA device index
    save_to = './test.wav', # optional; path for saving the audio as a WAV file
    bits_per_sample = 32, # when save_to is specified (8, 16 or 32 bits)
    )
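
For further processing, the sketch below assumes that tts returns the waveform as a 1-D NumPy float array in [-1, 1], sampled at the vocoder's output rate (22.05 kHz for hifigan/vocos, 44.1 kHz for vocos44); this return type is an assumption and should be checked against the package.

# %% Post-process the returned waveform (return type assumed, not confirmed)
import numpy as np
from scipy.io import wavfile

from tts_arabic import tts

wave = tts("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي.", vocoder_id='vocos44', play=False)

wave = np.asarray(wave, dtype=np.float32)      # assumed: float samples in [-1, 1]
print(f"duration: {len(wave) / 44100:.2f} s")  # vocos44 outputs 44.1 kHz audio
wavfile.write('./test_44k.wav', 44100, (wave * 32767).astype(np.int16))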

Vowelizer models

| Model | Model ID | Paper | Repo | Architecture |
| --- | --- | --- | --- | --- |
| CATT | catt_eo | arxiv | github | Transformer Encoder |
| Shakkelha | shakkelha | arxiv | github | Bi-LSTM |
| Shakkala | shakkala | - | github | Bi-LSTM |
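
For unvocalized input, the vowelizer argument selects which of these models restores the diacritics before phonemization. A minimal sketch looping over the three vowelizer IDs from the table above:

# %% Compare vowelizer models on unvocalized text
from tts_arabic import tts

text_unvoc = "القهوة مشروب يعد من بذور البن المحمصة"
for vowelizer_id in ('catt_eo', 'shakkelha', 'shakkala'):
    wave = tts(text_unvoc, vowelizer=vowelizer_id, play=True)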

References

The vocoder vocos44 was converted from patriotyk/vocos-mel-hifigan-compat-44100khz.

The vowelizer catt_eo was converted from the checkpoint best_eo_mlm_ns_epoch_193.pt at https://github.com/abjadai/catt/releases/tag/v2 (License: Apache-2.0).

