TTS-Longeval serves a number of purposes:
- Provide a unified wrapper around a number of TTS models.
- Provide a wrapper around Kyutai DSM TTS that supports large scale batching and generation.
- Provide a number of TTS benchmarks (existing and new) and the possibility to extend to new ones. In particular, this includes computing some metrics, such as WER or speaker similarity.
Note that this is mostly provided for reproducibility of our research work on delayed streams modeling, whose inference code is provided in our repository kyutai-labs/delayed-streams-modeling, as well as to document all the implementation details and hopefully offer some useful bits of code for others. I (adefossez) don't have enough time to offer full support for this repository, so improvements or PRs are unlikely to be accepted unless critical, and issues might not get a timely reply.
You will need at least Python 3.11, and you will need uv installed. I recommend working from a clone of this repository:
git clone https://github.com/kyutai-labs/tts_longeval.git
cd tts_longeval
git submodule init
git submodule update
You will need to download the WavLM speaker similarity model used by F5-TTS, and save it under ./models/wavlm_large_finetune.pth.
This repository supports the following TTS models: ElevenLabs (through their API), Dia, Orpheus, CSM, Chatterbox, and Kyutai DSM TTS.
Each TTS engine has its own folder in external_tts/, with its own separate environment using uv.
In particular, the TTS engines execute in subprocesses which are isolated from the main TTS-Longeval orchestrator, in order to reduce conflicts between requirements.
I tried my best to properly implement each one, but this might not be free of bugs! For instance, some models, such as CSM, do not really support monologues.
We provide the following datasets, each given by a file in datasets/:
- ntrex_eng, ntrex_fra: a monologue dataset with separated sentences in English and French, taken from the news article translation dataset NTREX. It is introduced and used for model evaluation in Kyutai's DSM TTS paper.
- synth_dialogs_en, synth_dialogs_fr: a synthetic dialog dataset introduced in the DSM TTS paper. Scripts are divided into three categories: daily life, technical discussions, and number-heavy discussions. This last category is especially challenging, but contains fewer scripts.
- seed_en: adapted from the SEED TTS Eval dataset by ByteDance.
- libri_pc: LibriSpeech test-clean with punctuation, following the exact same split as F5-TTS.
We support the following metrics:
- WER: word error rate, with text normalization either based on OpenAI English normalizer, or following F5-TTS.
- Speaker Similarity: computes speaker similarity with a WavLM-based model, inspired by the protocol used by F5-TTS. Both a cosine similarity to the relevant speaker and, for dialogs, a nearest-speaker metric are computed, i.e. whether the generated speech is more similar to the corresponding speaker than to the other speaker in the dialog.
Metrics can also be computed over quantiles of the audio duration or text length, e.g. over the first 25% of the words (for WER) or the first 25% of seconds (for speaker similarity), then from 25% to 50%, and so on.
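To make the dialog protocol concrete, here is a minimal sketch of the cosine and nearest-speaker computation, assuming speaker embeddings have already been extracted with the WavLM model (the function and variable names are hypothetical, not the repository's actual API):
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # plain cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dialog_speaker_metrics(segment_emb, target_emb, other_emb):
    # segment_emb: embedding of a generated segment attributed to one speaker.
    # target_emb: embedding of that speaker's reference audio.
    # other_emb: embedding of the other speaker in the dialog.
    sim_target = cosine(segment_emb, target_emb)
    sim_other = cosine(segment_emb, other_emb)
    # nearest is 1.0 when the segment is closer to the intended speaker.
    return {"cosine": sim_target, "nearest": float(sim_target > sim_other)}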
All outputs will be stored under ./outputs, although this can be changed by editing the entry output_folder in each .toml file.
You can run the main command as follows:
uv run -m tts_longeval -c configs/TOML_FILE.toml [-g N_GPU] [-D] [-s STAGE]
You will need to provide a TOML config file indicating the available TTS models and the datasets to process; see below for the file format. -g allows you to quickly change the number of GPU workers to schedule without
editing the config. -D enables debug mode: any failure in any of the workers leads to the immediate termination
of all workers and a traceback being printed. -s selects a single stage; available stages are
gen (generation with the TTS), asr (ASR transcription of the generated audio), spk (speaker similarity),
and met (metrics reporting). The metrics can be saved to a JSON file with --save-metrics FILENAME.json for later processing.
Not all flags are documented here; use uv run -m tts_longeval --help or check tts_longeval/__main__.py for more
information.
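For instance, assuming a non-SLURM machine with 8 GPUs and one of the provided configs, one could first run only the generation stage in debug mode, then let the remaining stages run and save the metrics:
uv run -m tts_longeval -c configs/librispeech.toml -g 8 -s gen -D
uv run -m tts_longeval -c configs/librispeech.toml -g 8 --save-metrics metrics.json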
Important: On the first run of a TTS engine, or of one of the follow-up tasks like ASR, models will be downloaded, uv envs
will be created, etc. Heavy parallelism might lead to corrupted envs or downloads. I thus recommend running uv sync --locked
in the root of the repo, as well as in each engine folder in external_tts/. In case of issues with the downloaded HuggingFace models,
you can try deleting the corresponding folder in .cache/huggingface/hub and re-running.
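A small shell loop can take care of the per-engine syncs, assuming every sub-directory of external_tts/ is an engine folder with its own uv project:
uv sync --locked
for engine in external_tts/*/; do (cd "$engine" && uv sync --locked); done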
Warning: Some methods are expected to fail on some samples, especially in configs/baselines.toml. Once you are sure all
important errors are fixed, remove the -D flag so that as many samples as possible get processed.
Here are some of the provided config files:
- configs/librispeech.toml: evaluates kyutai/tts-0.75b-en-public on LibriSpeech with punctuation, following the protocol from F5-TTS.
- configs/seeden.toml: evaluates kyutai/tts-0.75b-en-public on Seed TTS Eval en.
Here is the config format for the .toml file:
[main]
output_folder = "PATH_TO_OUTPUT"
queue_addr = "tcp://*:TCP_PORT"
[runner]
threads = 2 # number of threads for the API calls (ElevenLabs only)
[runner.submitit]
slurm = true # if you are on a SLURM cluster, otherwise false and the machine should have some GPUs.
partition = "default" # partition name for SLURM.
max_gpus = 64 # number of GPU workers to schedule.
gpus_per_task = 8 # number of GPUs per scheduled task, e.g. number of GPUs per machine.
cpus_per_gpus = 8 # number of CPUs to request per GPU.
time = 1440 # maximum time in minutes to let a job run.
[asr]
provider = "whisper_hf" # ASR backend, here Whisper from HuggingFace.
[asr.whisper_hf]
model_name = "openai/whisper-large-v3" # ASR model to use.
[speakersim]
model_path = "PATH_TO_WAVLM_SPEAKERSIM_MODEL" # if stored in a different place.
[tts.my_tts_name]
# `my_tts_name` can be anything and will be the name of the method in all reporting.
# This kind of entries can be repeated as many times as needed to support different models.
# The command launched should follow the TTS wrapper protocol described below.
command = ["uv", "run", "external_tts_dsm.py"] # command to run; it can be anything and will be run from `cwd`.
cwd = "external_tts/dsm" # working directory for the command.
max_batch_size = 32 # max batch size supported by the model.
supported_languages = ["fr", "en"] # languages supported by the TTS.
[dataset]
datasets = ["ntrex_eng", "ntrex_fra"] # each entry should correspond to a .jsonl file in ./datasets/
speaker_audio_root = 'hf-dataset://kyutai/voices_tts_longeval' # root HF repo or folder where to look for speaker audio filesEach dataset is a JSONL file, with each line being a dict with the following entries:
id: id of the sample, will also be the name of the file.turns: list of turns of speech. For dialogs, should correspond to the change of speakers. For monologues, it can either be a list with one string, or a list of strings, in which case some of the TTS backend can benefit from the text being splitted in chunks (for instance sentences).speaker_audios: list of paths to the audio file to use for audio conditioning for the speakers. Should contain one entry for monologues, and two entries for dialogs..wavshould be preferred for compatibility. Note that some TTS backends require the corresponding text to be available as a.txtfile next to the.wav. DSM TTS models with cross attention speaker conditioning require a.safetensorsfile containing the speaker embeddings. Those files can come from HuggingFace, seespeaker_audio_rootin the config above.language: language code, e.g.enorfr. This is used to skip generation for entries for a given TTS backend if the language is not supported.tags: arbitrary set of tags which can be used to further filter datapoints (see--tagsin the main command).
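For illustration, a dialog entry could look like the following single line (the id, file names and tag are made up for the example):
{"id": "dialog_0001", "turns": ["Hello, how are you?", "Fine, thanks! And you?"], "speaker_audios": ["speaker_a.wav", "speaker_b.wav"], "language": "en", "tags": ["daily_life"]}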
Below, we discuss some of the geeky internal details.
Looking at ExternalTTS in tts_longeval/tts.py, one can see the protocol used to communicate with a TTS engine.
The TTS subprocess is started, loads the model, and starts reading from stdin. The orchestrator writes a single-line JSON,
made of a list of dicts with the following keys:
[{"turns": TURNS, "speaker_audios": [SPEAKER_AUDIO_1, ...], "language": LANGUAGE, "output_file": OUTPUT_FILE}, ...]
TURNS consists of the turns of speech for a dialog (the case where there are two speaker audios), or the individual sentences
for a monologue (single speaker audio). Note that Kyutai DSM TTS supports monologues where all sentences are merged into
a single turn, but the other TTS backends do not. [SPEAKER_AUDIO_1, ...] is the list of audio files to use for speaker
conditioning, of size 1 for a monologue and 2 for a dialog. LANGUAGE is the language to generate (usually ignored),
and OUTPUT_FILE is a filename where to store the resulting waveform.
More than one element can be provided at once, in particular for batch processing, although only DSM TTS supports it.
The TTS subprocess can print anything, but any line written to stdout starting with external_tts: will be interpreted
as a signal that the generation is over. This line should also contain a single one-line JSON with the value {"status": "ok"}.
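To illustrate the protocol, here is a minimal sketch of what a wrapper script could look like; the synthesize function is a stand-in for an actual TTS backend, error handling is omitted, and the exact formatting of the acknowledgment line should be checked against ExternalTTS in tts_longeval/tts.py:
import json
import sys

import numpy as np
import soundfile as sf  # assuming soundfile is available in the engine environment

def synthesize(turns, speaker_audios, language):
    # placeholder: a real wrapper would run the actual TTS model here
    # and return a (samples, sample_rate) pair.
    return np.zeros(24000, dtype=np.float32), 24000

def main():
    # load the model once here, before reading requests from stdin.
    for line in sys.stdin:
        batch = json.loads(line)
        for item in batch:
            wav, sr = synthesize(item["turns"], item["speaker_audios"], item["language"])
            sf.write(item["output_file"], wav, sr)
        # signal the orchestrator that this batch is done.
        print("external_tts: " + json.dumps({"status": "ok"}), flush=True)

if __name__ == "__main__":
    main()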
I wanted to have a no-dependency job queue for dispatching the generation and metric evaluation to any number of GPUs.
While Redis or similar would have been great, it also requires installing and running an external service.
The file tts_longeval/zmqueue.py provides a minimalistic job queue. When running the main tts_longeval command,
the main process will start listening on a given TCP port, and host a number of job queues, each with a name, corresponding to a
single task (e.g. one metric or one model).
Each GPU worker, either started locally or through SLURM (see tts_longeval/runner.py) will connect to this address.
Each GPU worker first shuffles the list of possible queue names (e.g. models) and starts polling the first corresponding queue.
Once that queue is empty, the worker moves on to the next queue name, and so on. In particular, until a queue has been emptied,
the worker sticks to that queue, e.g. a specific model, in order to avoid reloading a different TTS model and subprocess
for each batch.
Note that this system is not at all fault tolerant, and if one worker goes away in the middle, you will have to relaunch the main command. It is however idempotent, and it should eventually complete all tasks!
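The queue selection policy can be summarized by the following toy sketch; it only illustrates the ordering logic, not the actual ZeroMQ based implementation in tts_longeval/zmqueue.py:
import random

def process(name: str, job) -> None:
    # hypothetical job processing function.
    print(f"processing {job!r} from queue {name}")

def worker_loop(queues: dict[str, list]) -> None:
    # queues maps a queue name (e.g. a model name) to its list of pending jobs.
    names = list(queues)
    random.shuffle(names)  # each worker picks its own random order
    for name in names:
        # stick to this queue until it is empty, so that the corresponding TTS
        # subprocess only has to be started (and its model loaded) once.
        while queues[name]:
            process(name, queues[name].pop(0))

# usage: two fake queues with a few jobs each.
worker_loop({"dsm": ["batch_0", "batch_1"], "orpheus": ["batch_0"]})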
The great English normalizer released by OpenAI
has been heavily used to normalize English texts before computing the WER. In particular, it tries to convert all numbers and ordinals
to an all-digits form. It also aims at supporting amounts of money with cents, etc.
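As an example of how such a normalizer is typically applied before scoring, here is what a call could look like with the EnglishTextNormalizer shipped with openai-whisper (the repository's own wiring may differ):
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
reference = "The price was twenty-five dollars."
hypothesis = "the price was $25"
# the normalizer lowercases, strips punctuation and converts spelled-out
# numbers and amounts to digits; the WER is then computed on the normalized texts.
print(normalizer(reference))
print(normalizer(hypothesis))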
One fun side quest for this repo was to reimplement a similar version for French, which you can find in
tts_longeval/normalizers/french.py.
It is not quite as complete, and honestly just using
the English version on French gives nearly the same WER results, but it was fun to play with Parsy.
In particular, Parsy greatly simplifies the definition of the grammar and of the transformations over the text, while being super lightweight.
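As a flavor of what Parsy looks like, here is a toy grammar for a handful of French number words; it is only a sketch, and the actual grammar in tts_longeval/normalizers/french.py is far more complete:
from parsy import seq, string

tens = string("vingt").result(20) | string("dix").result(10)
unit = (
    string("un").result(1)
    | string("deux").result(2)
    | string("trois").result(3)
)
joiner = string(" et ") | string("-")
# a compound such as "vingt et un" or "vingt-deux", or a bare tens/unit word
number = seq(tens << joiner, unit).combine(lambda t, u: t + u) | tens | unit

print(number.parse("vingt et un"))  # 21
print(number.parse("vingt-deux"))   # 22
print(number.parse("trois"))        # 3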
This code is released under the MIT license available in the ./LICENSE file,
except tts_longeval/wavlm.py which is released under CC Attribution-ShareAlike 3.0 Unported.
The datasets are shared with the following licenses (license files available in datasets/):
- datasets/ntrex_*.jsonl: derived from NTREX, originally under the CC-BY-SA-4.0 license, shared with the same license.
- datasets/synth_dialogs_*.jsonl: shared under the CC-BY-4.0 license.
- libri_pc.jsonl: processed from LibriSpeech-PC, originally under the CC-BY-4.0 license, shared under the same license.
- seed_en.jsonl: derived from CommonVoice, originally under the CC0 license, released under the same license.
If you use this repository for research, please cite the following paper.
@techreport{kyutai2025streaming,
title={Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling},
author={Neil Zeghidour and Eugene Kharitonov and Manu Orsini and Václav Volhejn and Gabriel de Marmiesse and Edouard Grave and Patrick Pérez and Laurent Mazaré and Alexandre Défossez},
year={2025},
eprint={2509.08753},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.08753},
}