TTS-Longeval serves a number of purposes:
- Provide a unified wrapper around a number of TTS models.
- Provide a wrapper around Kyutai DSM TTS that supports large scale batching and generation.
- Provide a number of TTS benchmarks (existing and new) and the possibility to extend to new ones. In particular, this includes computing some metrics, such as WER or speaker similarity.
Note that this is mostly provided for reproducibility of our research work on delayed streams modeling, whose inference code is provided in our repository kyutai-labs/delayed-streams-modeling, as well as to document all the implementation details and hopefully offer some useful bits of code for others. I (adefossez) don't have enough time to offer full support for this repository, so improvements or PRs are unlikely to be accepted unless critical, and issues might not get a timely reply.
You will need at least Python 3.11, and you will need uv installed. I recommend working from a clone of this repository:
git clone https://github.com/kyutai-labs/tts_longeval.git
cd tts_longeval
git submodule init
git submodule update
You will need to download the WavLM speaker similarity model used by F5-TTS, and save it under ./models/wavlm_large_finetune.pth.
This repository supports the following TTS models: ElevenLabs (through their API), Dia, Orpheus, CSM, Chatterbox, and Kyutai DSM TTS.
Each TTS engine has its own folder in external_tts/, with its own separate environment using uv.
In particular, the TTS engines execute in subprocesses which are isolated from the main TTS-Longeval orchestrator, in order to reduce conflicts between requirements.
I tried my best to properly implement each one, but this might not be free of bugs! For instance, some models, such as CSM, do not really support monologues.
We provide the following datasets, each given by a file in datasets/:
- ntrex_eng, ntrex_fra: a monologue dataset with separated sentences in English and French, taken from the news article translation dataset NTREX. It is introduced and used for model evaluation in Kyutai's DSM TTS paper.
- synth_dialogs_en, synth_dialogs_fr: a synthetic dialog dataset introduced in the DSM TTS paper. Scripts are divided into three categories: daily life, technical discussions, and number-heavy discussions. This last category is especially challenging, but contains fewer scripts.
- seed_en: adapted from the SEED TTS Eval dataset by ByteDance.
- libri_pc: LibriSpeech test-clean with punctuation, following the exact same split as F5-TTS.
We support the following metrics:
- WER: word error rate, with text normalization either based on OpenAI English normalizer, or following F5-TTS.
- Speaker Similarity: computes speaker similarity with a WavLM-based model, inspired by the protocol used by F5-TTS. Both a cosine similarity to the relevant speaker and, for dialogs, a nearest-speaker metric are computed, i.e. whether the generated speech is more similar to the corresponding speaker than to the other speaker in the dialog.
Metrics can also be computed over quantiles of the audio duration or text length, e.g. over the first 25% of the words (for WER) or the first 25% of seconds (for speaker similarity), then from 25% to 50%, and so on.
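To make the dialog protocol concrete, here is a minimal sketch of the cosine and nearest-speaker computation, assuming speaker embeddings have already been extracted with the WavLM model (the function and variable names are hypothetical, not the repository's actual API):
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # plain cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dialog_speaker_metrics(segment_emb, target_emb, other_emb):
    # segment_emb: embedding of a generated segment attributed to one speaker.
    # target_emb: embedding of that speaker's reference audio.
    # other_emb: embedding of the other speaker in the dialog.
    sim_target = cosine(segment_emb, target_emb)
    sim_other = cosine(segment_emb, other_emb)
    # nearest is 1.0 when the segment is closer to the intended speaker.
    return {"cosine": sim_target, "nearest": float(sim_target > sim_other)}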
All outputs will be stored under ./outputs, although this can be changed by editing the entry output_folder in each .toml file.
You can run the main command as follows:
uv run -m tts_longeval -c configs/TOML_FILE.toml [-g N_GPU] [-D] [-s STAGE]
You will need to provide a TOML config file indicating the available TTS models and the datasets to process; see below for the file format. -g allows you to quickly change the number of GPU workers to schedule without
editing the config. -D enables debug mode: any failure in any of the workers leads to the immediate termination
of all workers and a traceback being printed. -s selects a single stage; available stages are
gen (generation with the TTS), asr (ASR transcription of the generated audio), spk (speaker similarity),
and met (metrics reporting). The metrics can be saved to a JSON file with --save-metrics FILENAME.json for later processing.
Not all flags are documented here; use uv run -m tts_longeval --help or check tts_longeval/__main__.py for more
information.
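For instance, assuming a non-SLURM machine with 8 GPUs and one of the provided configs, one could first run only the generation stage in debug mode, then let the remaining stages run and save the metrics:
uv run -m tts_longeval -c configs/librispeech.toml -g 8 -s gen -D
uv run -m tts_longeval -c configs/librispeech.toml -g 8 --save-metrics metrics.json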
Important: On the first run of a TTS engine, or of one of the follow-up tasks like ASR, models will be downloaded, uv envs
will be created, etc. Heavy parallelism might lead to corrupted envs or downloads. I thus recommend running uv sync --locked
in the root of the repo, as well as in each engine folder in external_tts/. In case of issues with the downloaded HuggingFace models,
you can try deleting the corresponding folder in .cache/huggingface/hub and re-running.
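A small shell loop can take care of the per-engine syncs, assuming every sub-directory of external_tts/ is an engine folder with its own uv project:
uv sync --locked
for engine in external_tts/*/; do (cd "$engine" && uv sync --locked); done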
Warning: Some methods are expected to fail on some samples, especially in configs/baselines.toml. Once you are sure all
important errors are fixed, remove the -D flag so that as many samples as possible get processed.
Here are some of the provided config files:
- configs/librispeech.toml: evaluates kyutai/tts-0.75b-en-public on LibriSpeech with punctuation, following the protocol from F5-TTS.
- configs/seeden.toml: evaluates kyutai/tts-0.75b-en-public on Seed TTS Eval en.
Here is the config format for the .toml file:
[main]
output_folder = "PATH_TO_OUTPUT"
queue_addr = "tcp://*:TCP_PORT"
[runner]
threads = 2 # number of threads for the API calls (ElevenLabs only)
[runner.submitit]
slurm = true # if you are on a SLURM cluster, otherwise false and the machine should have some GPUs.
partition = "default" # partition name for SLURM.
max_gpus = 64 # number of GPU workers to schedule.
gpus_per_task = 8 # number of GPUs per scheduled task, e.g. number of GPUs per machine.
cpus_per_gpus = 8 # number of CPUs to request per GPU.
time = 1440 # maximum time in minutes to let a job run.
[asr]
provider = "whisper_hf" # ASR backend, here Whisper from HuggingFace.
[asr.whisper_hf]
model_name = "openai/whisper-large-v3" # ASR model to use.
[speakersim]
model_path = "PATH_TO_WAVLM_SPEAKERSIM_MODEL" # if stored in a different place.
[tts.my_tts_name]
# `my_tts_name` can be anything and will be the name of the method in all reporting.
# This kind of entries can be repeated as many times as needed to support different models.
# The command launched should follow the TTS wrapper protocol described below.
command = ["uv", "run", "external_tts_dsm.py"] # command to run; it can be anything and will be run from `cwd`.
cwd = "external_tts/dsm" # working directory for the command.
max_batch_size = 32 # max batch size supported by the model.
supported_languages = ["fr", "en"] # languages supported by the TTS.
[dataset]
datasets = ["ntrex_eng", "ntrex_fra"] # each entry should correspond to a .jsonl file in ./datasets/
speaker_audio_root = 'hf-dataset://kyutai/voices_tts_longeval' # root HF repo or folder where to look for speaker audio filesEach dataset is a JSONL file, with each line being a dict with the following entries:
id: id of the sample, will also be the name of the file.turns: list of turns of speech. For dialogs, should correspond to the change of speakers. For monologues, it can either be a list with one string, or a list of strings, in which case some of the TTS backend can benefit from the text being splitted in chunks (for instance sentences).speaker_audios: list of paths to the audio file to use for audio conditioning for the speakers. Should contain one entry for monologues, and two entries for dialogs..wavshould be preferred for compatibility. Note that some TTS backends require the corresponding text to be available as a.txtfile next to the.wav. DSM TTS models with cross attention speaker conditioning require a.safetensorsfile containing the speaker embeddings. Those files can come from HuggingFace, seespeaker_audio_rootin the config above.language: language code, e.g.enorfr. This is used to skip generation for entries for a given TTS backend if the language is not supported.tags: arbitrary set of tags which can be used to further filter datapoints (see--tagsin the main command).
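For illustration, a dialog entry could look like the following single line (the id, file names and tag are made up for the example):
{"id": "dialog_0001", "turns": ["Hello, how are you?", "Fine, thanks! And you?"], "speaker_audios": ["speaker_a.wav", "speaker_b.wav"], "language": "en", "tags": ["daily_life"]}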
Below, we discuss some of the geeky internal details.
Looking at ExternalTTS in tts_longeval/tts.py, one can see the protocol used to communicate with a TTS engine.
The TTS subprocess is started, loads the model, and starts reading from stdin. The orchestrator writes a single-line JSON,
made of a list of dicts with the following keys:
[{"turns": TURNS, "speaker_audios": [SPEAKER_AUDIO_1, ...], "language": LANGUAGE, "output_file": OUTPUT_FILE}, ...]
TURNS consists of the turns of speech for a dialog (the case where there are two speaker audios), or the individual sentences
for a monologue (single speaker audio). Note that Kyutai DSM TTS supports monologues where all sentences are merged into
a single turn, but the other TTS backends do not. [SPEAKER_AUDIO_1, ...] is the list of audio files to use for speaker
conditioning, of size 1 for a monologue and 2 for a dialog. LANGUAGE is the language to generate (usually ignored),
and OUTPUT_FILE is a filename where to store the resulting waveform.
More than one element can be provided at once, in particular for batch processing, although only DSM TTS supports it.
The TTS subprocess can print anything, but any line written to stdout starting with external_tts: will be interpreted
as a signal that the generation is over. This line should also contain a single one-line JSON with the value {"status": "ok"}.
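To illustrate the protocol, here is a minimal sketch of what a wrapper script could look like; the synthesize function is a stand-in for an actual TTS backend, error handling is omitted, and the exact formatting of the acknowledgment line should be checked against ExternalTTS in tts_longeval/tts.py:
import json
import sys

import numpy as np
import soundfile as sf  # assuming soundfile is available in the engine environment

def synthesize(turns, speaker_audios, language):
    # placeholder: a real wrapper would run the actual TTS model here
    # and return a (samples, sample_rate) pair.
    return np.zeros(24000, dtype=np.float32), 24000

def main():
    # load the model once here, before reading requests from stdin.
    for line in sys.stdin:
        batch = json.loads(line)
        for item in batch:
            wav, sr = synthesize(item["turns"], item["speaker_audios"], item["language"])
            sf.write(item["output_file"], wav, sr)
        # signal the orchestrator that this batch is done.
        print("external_tts: " + json.dumps({"status": "ok"}), flush=True)

if __name__ == "__main__":
    main()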
I wanted to have a no-dependency job queue for dispatching the generation and metric evaluation to any number of GPUs.
While Redis or similar would have been great, it also requires installing and running an external service.
The file tts_longeval/zmqueue.py provides a minimalistic job queue. When running the main tts_longeval command,
the main process will start listening on a given TCP port, and host a number of job queues, each with a name, corresponding to a
single task (e.g. one metric or one model).
Each GPU worker, either started locally or through SLURM (see tts_longeval/runner.py) will connect to this address.
Each GPU worker first shuffles the list of possible queue names (e.g. models) and starts polling the first corresponding queue.
Once that queue is empty, the worker moves on to the next queue name, and so on. In particular, until a queue has been emptied,
the worker sticks to that queue, e.g. a specific model, in order to avoid reloading a different TTS model and subprocess
for each batch.
Note that this system is not at all fault tolerant, and if one worker goes away in the middle, you will have to relaunch the main command. It is however idempotent, and it should eventually complete all tasks!
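The queue selection policy can be summarized by the following toy sketch; it only illustrates the ordering logic, not the actual ZeroMQ based implementation in tts_longeval/zmqueue.py:
import random

def process(name: str, job) -> None:
    # hypothetical job processing function.
    print(f"processing {job!r} from queue {name}")

def worker_loop(queues: dict[str, list]) -> None:
    # queues maps a queue name (e.g. a model name) to its list of pending jobs.
    names = list(queues)
    random.shuffle(names)  # each worker picks its own random order
    for name in names:
        # stick to this queue until it is empty, so that the corresponding TTS
        # subprocess only has to be started (and its model loaded) once.
        while queues[name]:
            process(name, queues[name].pop(0))

# usage: two fake queues with a few jobs each.
worker_loop({"dsm": ["batch_0", "batch_1"], "orpheus": ["batch_0"]})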
The great English normalizer released by OpenAI
has been heavily used to normalize English texts before computing the WER. In particular, it tries to convert all numbers and ordinals
to an all-digits form. It also aims at supporting amounts of money with cents, etc.
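As an example of how such a normalizer is typically applied before scoring, here is what a call could look like with the EnglishTextNormalizer shipped with openai-whisper (the repository's own wiring may differ):
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
reference = "The price was twenty-five dollars."
hypothesis = "the price was $25"
# the normalizer lowercases, strips punctuation and converts spelled-out
# numbers and amounts to digits; the WER is then computed on the normalized texts.
print(normalizer(reference))
print(normalizer(hypothesis))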
One fun side quest for this repo was to reimplement a similar version for French, which you can find in
tts_longeval/normalizers/french.py.
It is not quite as complete, and honestly just using
the English version on French gives nearly the same WER results, but it was fun to play with Parsy.
In particular, Parsy greatly simplifies the definition of the grammar and of the transformations over the text, while being super lightweight.
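As a flavor of what Parsy looks like, here is a toy grammar for a handful of French number words; it is only a sketch, and the actual grammar in tts_longeval/normalizers/french.py is far more complete:
from parsy import seq, string

tens = string("vingt").result(20) | string("dix").result(10)
unit = (
    string("un").result(1)
    | string("deux").result(2)
    | string("trois").result(3)
)
joiner = string(" et ") | string("-")
# a compound such as "vingt et un" or "vingt-deux", or a bare tens/unit word
number = seq(tens << joiner, unit).combine(lambda t, u: t + u) | tens | unit

print(number.parse("vingt et un"))  # 21
print(number.parse("vingt-deux"))   # 22
print(number.parse("trois"))        # 3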
This code is released under the MIT license available in the ./LICENSE file,
except tts_longeval/wavlm.py which is released under CC Attribution-ShareAlike 3.0 Unported.
The datasets are shared with the following licenses (license files available in datasets/):
- datasets/ntrex_*.jsonl: derived from NTREX, originally under the CC-BY-SA-4.0 license, shared with the same license.
- datasets/synth_dialogs_*.jsonl: shared under the CC-BY-4.0 license.
- libri_pc.jsonl: processed from LibriSpeech-PC, originally under the CC-BY-4.0 license, shared under the same license.
- seed_en.jsonl: derived from CommonVoice, originally under the CC0 license, released under the same license.
If you use this repository for research, please cite the following paper.
@techreport{kyutai2025streaming,
title={Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling},
author={Neil Zeghidour and Eugene Kharitonov and Manu Orsini and Václav Volhejn and Gabriel de Marmiesse and Edouard Grave and Patrick Pérez and Laurent Mazaré and Alexandre Défossez},
year={2025},
eprint={2509.08753},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.08753},
}