This repository builds train-ready Image+Audio→Text instruction data by extending an existing image–text instruction corpus into a multimodal format with spoken user prompts. The pipeline combines (i) reproducible multilingual TTS benchmarking with benchmark-time model/voice selection, (ii) query rewriting into colloquial spoken-style prompts, and (iii) large-scale synthesis plus triplet packaging to produce standardized training triplets, manifests/metadata, and benchmark tables suitable for downstream mid-training (e.g., Apertus 1.5).
```mermaid
flowchart TB
A["Source data<br/>image + user + assistant"]
subgraph P["Parallel"]
direction LR
subgraph BA["A: TTS benchmark"]
direction TB
BA1["Gen audio<br/>m x lang x voice x id"]
BA2["Score<br/>Sim + UTMOSv2"]
BA3["Select configs<br/>best m-lang-voice"]
BA1 --> BA2 --> BA3
end
subgraph BB["B: Rephrase"]
direction TB
BB1["Rephrase user turns"]
BB2["Spoken style<br/>meaning preserved"]
BB1 --> BB2
end
end
M["Merge<br/>configs + rephrased"]
S["Synthesize at scale<br/>multi-lang, multi-voice"]
T["Package triplets<br/>(image,audio)->text<br/>+ manifests"]
A --> P
BA3 --> M
BB2 --> M
M --> S --> T
```
Inputs
- image (or image reference/path)
- user_text (written prompt/instruction/query)
- assistant_text (ground-truth target response)
Outputs
- Train-ready triplets: (image, audio_prompt) -> target_text
- Audio files (WAV, standardized sampling rate as configured)
- Manifests / metadata (JSONL/CSV with paths and per-sample attributes)
- Benchmark artifacts (tables for model/language/voice comparison and selected configurations)
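As a rough illustration of how the manifests can be consumed downstream, the sketch below reads a JSONL manifest and verifies that every referenced audio file exists. The manifest filename and the field names (audio_path, image_path, text) are assumptions for illustration; the actual schema is defined by the generation configs and the files under benchmark/output/.

```python
import json
from pathlib import Path

# Minimal sketch: iterate over a JSONL manifest and check referenced audio paths.
# Field names (audio_path, image_path, text) and the filename are hypothetical;
# consult the actual manifests produced under benchmark/output/.
manifest = Path("benchmark/output/metadata_example.jsonl")  # hypothetical filename

missing = []
with manifest.open(encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        audio = Path(record["audio_path"])
        if not audio.exists():
            missing.append(str(audio))

print(f"{len(missing)} referenced audio files are missing")
```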
- Setup: benchmark/set_up.md
- Benchmark configs and runners: benchmark/configs/, benchmark/models/
- Benchmark summary tables: benchmark/model_comparison.md
- Rephrasing: rephrase/rephrase.md
- Generation inputs: generate/input/text/
This repository follows a module-first layout aligned with the project pipeline:
benchmarking → model selection → rephrasing → synthesis → QC/evaluation → packaging.
Environment setup (Conda, Slurm GPU usage, common CSCS pitfalls) is documented here: benchmark/set_up.md
- benchmark/configs/: benchmark and generation configs (language/model/voices/audio format). See: benchmark/configs/
- benchmark/input/: benchmark prompt sets, translated prompts, and optional voice assets (when applicable). See: benchmark/input/
- benchmark/models/: per-model runners and model-specific guides. Start here if you want to run a specific TTS model.
- benchmark/output/: generated audio (WAV), logs, and manifests (metadata_*.jsonl, failed_*.jsonl). See: benchmark/output/
- benchmark/eval/: QC and evaluation (Whisper→SBERT similarity, UTMOSv2, human labels, selector outputs, aggregated tables).
This module rewrites user turns only into a more colloquial spoken style while keeping assistant answers unchanged.
- Documentation: rephrase/rephrase.md
- Dataset dump script: rephrase/dump_clevr_1000.py
- Rephrasing script: rephrase/rephrase_clevr_1000.py
- Example artifacts (JSONL): rephrase/clevr_first1000_raw.jsonl, rephrase/clevr_first1000_rephrased.jsonl
See: rephrase/
This module stores (and/or produces) inputs for large-scale audio synthesis after selecting the best model per language.
- TTS input texts (per language): generate/input/text/tts_inputs_en.jsonl, tts_inputs_zh.jsonl, tts_inputs_ja.jsonl, tts_inputs_fr.jsonl, tts_inputs_de.jsonl
See: generate/
How to run a model: open the corresponding folder under benchmark/models/<model>/ and follow its guide.
How to rephrase prompts: follow rephrase/rephrase.md.
How to locate synthesis inputs: see generate/input/text/.
- Benchmark and compare models: start at benchmark/, then see benchmark/classification.ipynb for QC/aggregation.
- Rephrase user queries (spoken-style): rephrase/rephrase.md
- Run a specific TTS model: pick a folder under benchmark/models/
- Use generation inputs (per-language JSONL): generate/input/text/
We benchmark multiple open-source TTS models under multilingual and multi-voice settings to select robust model–language configurations before large-scale synthesis. Selection is performed at benchmark time using a small set of human-labeled samples to calibrate an automatic quality gate, combining two automatic signals (Whisper→SBERT similarity and UTMOSv2) with a lightweight logistic-regression selector, rather than relying on post-hoc filtering of the final dataset.
Automatic signals
- Audio accuracy: Whisper ASR transcription → SBERT semantic similarity to the original prompt
- Naturalness/quality: UTMOSv2 MOS predictor score
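A minimal sketch of how these two signals can feed a calibrated quality gate is shown below. It assumes openai-whisper and sentence-transformers are installed, treats the UTMOSv2 score as an already-computed number, and uses scikit-learn's LogisticRegression as the lightweight selector. Model sizes, label data, and the acceptance threshold are illustrative assumptions, not the exact benchmark code (see benchmark/eval/ for the real pipeline).

```python
import numpy as np
import whisper  # openai-whisper
from sentence_transformers import SentenceTransformer, util
from sklearn.linear_model import LogisticRegression

asr = whisper.load_model("small")                # ASR model size is an assumption
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # SBERT encoder choice is an assumption

def similarity(audio_path: str, prompt_text: str) -> float:
    """Whisper transcription -> SBERT cosine similarity to the original prompt."""
    hypothesis = asr.transcribe(audio_path)["text"]
    emb = sbert.encode([hypothesis, prompt_text], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

# Calibrate the gate on a small human-labeled set:
# each row is (similarity, utmosv2_score); label 1 = acceptable, 0 = reject. Toy numbers.
X_labeled = np.array([[0.92, 3.8], [0.40, 2.1], [0.85, 3.2], [0.55, 2.6]])
y_labeled = np.array([1, 0, 1, 0])
gate = LogisticRegression().fit(X_labeled, y_labeled)

# Apply the gate to a new (model, language, voice) benchmark sample.
features = np.array([[similarity("sample.wav", "How many red cubes are there?"), 3.5]])
accept = gate.predict_proba(features)[0, 1] > 0.5
print("accept" if accept else "reject")
```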
| Language | Selected TTS model |
|---|---|
| English | CosyVoice |
| French | CosyVoice |
| Japanese | Chatterbox |
| German | Chatterbox |
| Chinese | Index-TTS |
Full benchmark tables and comparisons: benchmark/model_comparison.md
Reproducibility (navigation)
- Benchmark configs: benchmark/configs/
- Per-model runners and docs: benchmark/models/
- QC + aggregation outputs: benchmark/eval/
Many instruction datasets contain user prompts written in a formal or templated style, which can lead to unnatural prosody when directly synthesized by TTS. We therefore rephrase user turns only into a more colloquial, spoken style while preserving meaning and keeping assistant answers unchanged.
- Data source: the CLEVR configuration is streamed from the Hugging Face dataset mvp-lab/LLaVA-OneVision-1.5-Instruct-Data and dumped deterministically (first N samples).
- Model: Qwen/Qwen2.5-7B-Instruct (Transformers), deterministic decoding.
- Artifacts: *_raw.jsonl and *_rephrased.jsonl with identical structure; only the user text is rewritten.
Reproduce: rephrase/rephrase.md
Scripts: rephrase/dump_clevr_1000.py, rephrase/rephrase_clevr_1000.py
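The sketch below illustrates the general shape of this step: stream the first N CLEVR samples and rewrite each user turn with Qwen/Qwen2.5-7B-Instruct using greedy (deterministic) decoding. The configuration name passed to load_dataset, the system prompt wording, and the record field names are assumptions; the authoritative procedure is in rephrase/rephrase.md and the two scripts above.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto", torch_dtype="auto")

def rephrase(user_text: str) -> str:
    """Rewrite one written user turn into a colloquial spoken-style prompt (greedy decoding)."""
    messages = [
        {"role": "system", "content": "Rewrite the user's request in a casual spoken style. Keep the meaning."},
        {"role": "user", "content": user_text},
    ]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(lm.device)
    out = lm.generate(inputs, max_new_tokens=128, do_sample=False)  # deterministic decoding
    return tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True).strip()

# Stream the first N samples; the config/split names and field name are assumptions.
ds = load_dataset("mvp-lab/LLaVA-OneVision-1.5-Instruct-Data", "CLEVR",
                  split="train", streaming=True)
for i, sample in enumerate(ds):
    if i >= 1000:
        break
    spoken = rephrase(sample["question"])  # hypothetical field; only user turns are rewritten
```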
This stage merges:
- the benchmark-selected TTS configuration (best model per language), and
- the rephrased spoken-style prompts,
to synthesize audio prompts at scale and package train-ready (image, audio_prompt) → target_text triplets with manifests.
- Per-language TTS input JSONL: generate/input/text/ (tts_inputs_en.jsonl, tts_inputs_fr.jsonl, tts_inputs_de.jsonl, tts_inputs_ja.jsonl, tts_inputs_zh.jsonl)
- Rephrased prompts (if used as the source for building TTS inputs): rephrase/
Generation uses the best-performing model per language:
- English / French: CosyVoice
- Japanese / German: Chatterbox
- Chinese: Index-TTS
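For orientation, this per-language routing can be thought of as a simple mapping from language code to the benchmark-selected backend, sketched below. The backend identifiers and the printed runner paths are placeholders; each backend is actually invoked through its own runner under benchmark/models/<model>/.

```python
# Selected backend per language (from the benchmark table above).
LANG_TO_BACKEND = {
    "en": "cosyvoice",
    "fr": "cosyvoice",
    "ja": "chatterbox",
    "de": "chatterbox",
    "zh": "index-tts",
}

def backend_for(lang: str) -> str:
    """Return the benchmark-selected TTS backend for a language code."""
    return LANG_TO_BACKEND[lang]

# Example: route per-language TTS inputs to their runners (paths are illustrative).
for lang in ("en", "fr", "de", "ja", "zh"):
    print(f"generate/input/text/tts_inputs_{lang}.jsonl -> benchmark/models/{backend_for(lang)}/")
```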
Full benchmark tables and comparisons: benchmark/model_comparison.md
Each selected backend is executed via its model-specific runner under benchmark/models/<model>/.
- CosyVoice
- Chatterbox
- Index-TTS
Generated audio and manifests follow a standardized layout:
- WAV audio files (organized by language / voice)
- Manifests: metadata_*.jsonl and failed_*.jsonl
- Logs for reproducibility
Benchmark outputs live under: benchmark/output/
Final triplet dataset (Hugging Face): [TBD](https://huggingface.co/datasets/kkkyao/triplets_audio_image_text_v1)
Triplet format
- Each example is packaged as: (image, audio_prompt) → target_text
- target_text is copied verbatim from the source instruction dataset (assistant response); only the user prompt is converted into speech.
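A rough sketch of what one packaged example might look like is given below; the exact field names and directory layout are assumptions (the released Hugging Face dataset is authoritative), but the key invariant matches the description above: the assistant response is copied verbatim and only the user prompt becomes audio.

```python
import json

# Hypothetical triplet manifest record: (image, audio_prompt) -> target_text.
record = {
    "image": "images/clevr_000042.png",             # path or reference to the source image
    "audio_prompt": "audio/en/voice03/000042.wav",  # synthesized spoken user prompt
    "target_text": "There are three red cubes.",    # assistant response, copied verbatim
    "language": "en",
    "voice": "voice03",
    "tts_backend": "cosyvoice",
}

with open("triplets_manifest.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```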
Language & voice expansion
- 5 languages: Chinese (zh), English (en), Japanese (ja), French (fr), German (de)
- Up to 5 voices per configuration
- If a backend supports voice cloning, we synthesize multiple speaker styles using a shared set of reference voices.
- Otherwise, we use the model’s native speaker presets (when available).
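The cloning-vs-preset decision described above amounts to a small dispatch rule, sketched below. The function signature and voice names are illustrative assumptions, not the actual runner interface.

```python
from typing import Sequence

def pick_voices(
    supports_cloning: bool,
    reference_voices: Sequence[str],
    native_presets: Sequence[str],
    max_voices: int = 5,
) -> list[str]:
    """Select up to `max_voices` speaker styles for one backend/language configuration."""
    if supports_cloning:
        # Backend can clone voices: reuse the shared set of reference voices.
        return list(reference_voices)[:max_voices]
    # Otherwise fall back to the model's native speaker presets (if any).
    return list(native_presets)[:max_voices]

# Illustrative call (voice names are placeholders):
voices = pick_voices(True, ["ref_a.wav", "ref_b.wav", "ref_c.wav"], [], max_voices=5)
```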
Benchmark-driven selection
- For large-scale synthesis, we use the per-language best backend selected from the benchmarking stage (see the table in TTS Benchmarking above), rather than synthesizing with all candidate models.
This repository builds on prior work and third-party tools/datasets. Please cite the original sources when appropriate.
Full BibTeX entries are provided here: docs/references.bib
- Visual Instruction Tuning (LLaVA): arXiv:2304.08485
- Scaling Speech-Text Pre-training with Synthetic Interleaved Data: arXiv:2411.17607
- Qwen2.5 Technical Report: arXiv:2412.15115
- Whisper (Robust Speech Recognition): ICML 2023
- Sentence-BERT: arXiv:1908.10084
- Sentence-Transformers (SBERT similarity): Documentation
- UTMOSv2 (MOS prediction for TTS): GitHub
- UTMOS / VoiceMOS Challenge system description: arXiv:2204.02152
- Instruction dataset (CLEVR configuration): mvp-lab/LLaVA-OneVision-1.5-Instruct-Data
- Speech/voice datasets used for reference voice assets and/or auxiliary examples:
  - TTS Arena V2 (HF Space): TTS-AGI/TTS-Arena-V2
  - Open Source TTS Gallery (HF Space): Inferless/Open-Source-TTS-Gallary
- Licensing: Datasets, model weights, and third-party code used by this project are subject to their original licenses. Users are responsible for ensuring compliance with all upstream terms when reproducing results or redistributing artifacts.
- Reference voice assets: To reduce privacy/impersonation risk, reference voice recordings used for voice cloning are not included in this repository (and should not be redistributed unless you have explicit rights to do so).
- Intended use: Outputs are intended for research and training-data construction in an Image+Audio→Text setting (e.g., mid-training / alignment of multimodal instruction-following models). This pipeline is not intended for generating deceptive or identity-impersonating audio.