This script generates per-segment speech files from a JSON transcription file, such as one produced by WhisperX or other diarization tools.
It processes a JSON file containing speech segments, synthesizes audio for each segment using the ChatterboxTTS model, and assigns the correct voice based on speaker labels.
- Parses JSON transcription files with speaker, text, and timing information.
- Generates individual
.wavfiles for each speech segment. - Creates a
_manifest.txtfile that maps the generated audio files to their original timestamps and speaker IDs, suitable for use in audio editing or further processing.
Install the required Python libraries:
pip install chatterbox-tts torchaudioThe script requires a JSON file containing a list of speech segments. Each segment must be an object with start, end, text, and speaker keys.
Example dialogue.json:
{
"segments": [
{
"start": 2.86,
"end": 15.28,
"text": "This is the first line of dialogue spoken by speaker zero.",
"speaker": "SPEAKER_00"
},
{
"start": 16.1,
"end": 22.5,
"text": "And this is the second line, spoken by a different person.",
"speaker": "SPEAKER_01"
}
]
}You need at least one high-quality .wav file to serve as a voice reference for each speaker (e.g., speaker0.wav for SPEAKER_00).
Execute the script from your terminal, providing the path to your JSON transcription and the reference audio files.
python tts.py -t dialogue.json -r speaker0.wav speaker1.wav-t,--transcription: (Required) Path to the input JSON transcription file.-r,--references: (Required) A list of reference.wavfiles. The order must correspond to the speaker IDs (e.g., the first file forSPEAKER_00, the second forSPEAKER_01, and so on).--exaggeration: (Optional) The exaggeration factor for TTS generation. Defaults to0.6.--cfg_weight: (Optional) The CFG weight for TTS generation. Defaults to0.7.
The script generates two sets of outputs in the same directory as your transcription file:
- Numbered Audio Files: A
.wavfile for each segment (e.g.,000.wav,001.wav, ...). - Manifest File: A text file named
<your_transcription>_manifest.txt.
Example dialogue_manifest.txt:
[2.866s–15.285s] (SPEAKER_00) /path/to/your/project/000.wav
[16.100s–22.500s] (SPEAKER_01) /path/to/your/project/001.wav