
Towards Train-Ready Image+Audio→Text Instruction Data

Reproducible TTS Benchmarking, Query Rewriting, and Triplet Construction

This repository builds train-ready Image+Audio→Text instruction data by extending an existing image–text instruction corpus into a multimodal format with spoken user prompts. The pipeline combines (i) reproducible multilingual TTS benchmarking with benchmark-time model/voice selection, (ii) query rewriting into colloquial spoken-style prompts, and (iii) large-scale synthesis plus triplet packaging to produce standardized training triplets, manifests/metadata, and benchmark tables suitable for downstream mid-training (e.g., Apertus 1.5).


1) Pipeline at a Glance

```mermaid
flowchart TB
  A["Source data<br/>image + user + assistant"]

  subgraph P["Parallel"]
    direction LR

    subgraph BA["A: TTS benchmark"]
      direction TB
      BA1["Gen audio<br/>m x lang x voice x id"]
      BA2["Score<br/>Sim + UTMOSv2"]
      BA3["Select configs<br/>best m-lang-voice"]
      BA1 --> BA2 --> BA3
    end

    subgraph BB["B: Rephrase"]
      direction TB
      BB1["Rephrase user turns"]
      BB2["Spoken style<br/>meaning preserved"]
      BB1 --> BB2
    end
  end

  M["Merge<br/>configs + rephrased"]
  S["Synthesize at scale<br/>multi-lang, multi-voice"]
  T["Package triplets<br/>(image,audio)->text<br/>+ manifests"]

  A --> P
  BA3 --> M
  BB2 --> M
  M --> S --> T
```

Inputs

  • image (or image reference/path)
  • user_text (written prompt/instruction/query)
  • assistant_text (ground-truth target response)

Outputs

  • Train-ready triplets: (image, audio_prompt) -> target_text
  • Audio files (WAV, standardized sampling rate as configured)
  • Manifests / metadata (JSONL/CSV with paths and per-sample attributes)
  • Benchmark artifacts (tables for model/language/voice comparison and selected configurations)
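
For concreteness, a single manifest record could look like the sketch below. The field names, paths, and sample rate are illustrative assumptions, not the repository's fixed schema.

```python
# Illustrative manifest record for one train-ready triplet (field names are assumed).
import json

record = {
    "id": "clevr_000001_en_voice03",                       # hypothetical sample identifier
    "image": "images/clevr_000001.png",                    # path to the source image
    "audio_prompt": "audio/en/voice03/clevr_000001.wav",   # synthesized spoken user prompt
    "target_text": "There are three cubes in the scene.",  # assistant response, copied verbatim
    "language": "en",
    "voice": "voice03",
    "tts_model": "CosyVoice",
    "sample_rate": 16000,                                  # standardized rate, as configured
}

with open("manifest.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```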

Reproducible Entry Points (Recommended Order)

  1. Setup: benchmark/set_up.md
  2. Benchmark configs and runners: benchmark/configs/, benchmark/models/
  3. Benchmark summary tables: benchmark/model_comparison.md
  4. Rephrasing: rephrase/rephrase.md
  5. Generation inputs: generate/input/text/

2) Repository Structure

This repository follows a module-first layout aligned with the project pipeline:
benchmarking → model selection → rephrasing → synthesis → QC/evaluation → packaging.

Setup (HPC / Bristen)

Environment setup (Conda, Slurm GPU usage, common CSCS pitfalls) is documented in benchmark/set_up.md.

benchmark/ — Benchmarking, model runners, evaluation, and outputs

rephrase/ — Query rephrasing (spoken-style prompts)

This module rewrites only the user turns into a more colloquial spoken style while keeping assistant answers unchanged.

See: rephrase/

generate/ — Large-scale synthesis inputs and generation artifacts

This module stores (and/or produces) inputs for large-scale audio synthesis after selecting the best model per language.

  • TTS input texts (per language): generate/input/text/
    • tts_inputs_en.jsonl, tts_inputs_zh.jsonl, tts_inputs_ja.jsonl, tts_inputs_fr.jsonl, tts_inputs_de.jsonl

See: generate/

How to run a model: open the corresponding folder under benchmark/models/<model>/ and follow its guide.
How to rephrase prompts: follow rephrase/rephrase.md.
How to locate synthesis inputs: see generate/input/text/.

Where to Start (Documentation Map)

3) TTS Benchmarking (Model Selection for Scalable Synthesis)

We benchmark multiple open-source TTS models under multilingual and multi-voice settings to select robust model–language configurations before large-scale synthesis. Selection is performed at benchmark time using a small set of human-labeled samples to calibrate an automatic quality gate, combining two automatic signals (Whisper→SBERT similarity and UTMOSv2) with a lightweight logistic-regression selector, rather than relying on post-hoc filtering of the final dataset.

Automatic signals

  • Audio accuracy: Whisper ASR transcription → SBERT semantic similarity to the original prompt
  • Naturalness/quality: UTMOSv2 MOS predictor score
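
As a rough sketch of how these two signals can feed a benchmark-time quality gate, the snippet below transcribes a clip with Whisper, scores Whisper→SBERT similarity against the original prompt, and combines it with a precomputed UTMOSv2 score in a logistic-regression selector calibrated on a handful of human labels. The model choices, feature layout, and toy calibration values are assumptions; the repository's actual selector lives under benchmark/.

```python
# Minimal sketch of the two automatic signals plus a logistic-regression gate.
# Assumes openai-whisper, sentence-transformers, and scikit-learn are installed;
# the UTMOSv2 score is treated as a precomputed input here.
import numpy as np
import whisper
from sentence_transformers import SentenceTransformer, util
from sklearn.linear_model import LogisticRegression

asr = whisper.load_model("base")
sbert = SentenceTransformer("all-MiniLM-L6-v2")

def similarity(wav_path: str, prompt_text: str) -> float:
    """Whisper transcription -> SBERT cosine similarity to the original prompt."""
    transcript = asr.transcribe(wav_path)["text"]
    emb = sbert.encode([transcript, prompt_text], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

# Calibrate the gate on a small set of human-labeled samples
# (features: [similarity, utmos_score]; labels: 1 = acceptable, 0 = rejected).
X_train = np.array([[0.92, 3.8], [0.41, 2.1], [0.88, 3.2], [0.35, 3.9]])  # toy values
y_train = np.array([1, 0, 1, 0])
gate = LogisticRegression().fit(X_train, y_train)

def accept(wav_path: str, prompt_text: str, utmos_score: float) -> bool:
    """Return True if the gate predicts this synthesized clip is acceptable."""
    feats = np.array([[similarity(wav_path, prompt_text), utmos_score]])
    return bool(gate.predict(feats)[0])
```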

Selected model per language (used for large-scale generation)

| Language | Selected TTS model |
| --- | --- |
| English  | CosyVoice |
| French   | CosyVoice |
| Japanese | Chatterbox |
| German   | Chatterbox |
| Chinese  | Index-TTS |

Full benchmark tables and comparisons: benchmark/model_comparison.md

Reproducibility (navigation)

Query Rephrasing (Spoken-Style Prompts)

Many instruction datasets contain user prompts written in a formal or templated style, which can lead to unnatural prosody when synthesized directly by TTS. We therefore rephrase only the user turns into a more colloquial, spoken style while preserving meaning and keeping assistant answers unchanged.

  • Data source: the CLEVR configuration is streamed from the Hugging Face dataset mvp-lab/LLaVA-OneVision-1.5-Instruct-Data and dumped deterministically (first N samples).
  • Model: Qwen/Qwen2.5-7B-Instruct (Transformers), deterministic decoding.
  • Artifacts: *_raw.jsonl and *_rephrased.jsonl with identical structure; only user text is rewritten.

Reproduce: rephrase/rephrase.md
Scripts: rephrase/dump_clevr_1000.py, rephrase/rephrase_clevr_1000.py
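
A condensed sketch of the two steps those scripts cover, with the Hugging Face config name, system prompt, and N treated as assumptions: stream the CLEVR configuration, keep the first N samples, and rewrite only the user turn with Qwen/Qwen2.5-7B-Instruct using greedy (deterministic) decoding.

```python
# Hedged sketch of dump + rephrase; the config name, prompt wording, and N are assumptions.
from itertools import islice

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

N = 1000
stream = load_dataset("mvp-lab/LLaVA-OneVision-1.5-Instruct-Data", "CLEVR", streaming=True)
samples = list(islice(stream["train"], N))  # deterministic: first N samples in stream order

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", device_map="auto")

def rephrase(user_text: str) -> str:
    """Rewrite one user turn into a colloquial spoken style, preserving its meaning."""
    messages = [
        {"role": "system", "content": "Rewrite the user's request in a casual spoken style. Keep the meaning unchanged."},
        {"role": "user", "content": user_text},
    ]
    input_ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(input_ids, max_new_tokens=128, do_sample=False)  # greedy decoding
    return tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()
```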

Large-Scale Generation (Synthesis) and Triplet Construction

This stage merges:

  1. the benchmark-selected TTS configuration (best model per language), and
  2. the rephrased spoken-style prompts.

The two are combined to synthesize audio prompts at scale and to package train-ready (image, audio_prompt) → target_text triplets with manifests.

Inputs

  • Per-language TTS input JSONL: generate/input/text/
    (tts_inputs_en.jsonl, tts_inputs_fr.jsonl, tts_inputs_de.jsonl, tts_inputs_ja.jsonl, tts_inputs_zh.jsonl)
  • Rephrased prompts (if used as source for building TTS inputs): rephrase/

Selected TTS backends (from benchmarking)

Generation uses the best-performing model per language, as listed in the selection table in the TTS Benchmarking section above.

Full benchmark tables and comparisons: benchmark/model_comparison.md

Where to run (model docs + runner scripts)

Each selected backend is executed via its model-specific runner under benchmark/models/<model>/.
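
A hypothetical driver for this step might look like the sketch below: map each language to its benchmark-selected backend, iterate the per-language input JSONL, and hand each item to the model-specific runner. The input field names and the synthesize_one() helper are assumptions standing in for the actual runner scripts.

```python
# Hypothetical driver: dispatch each language's TTS inputs to the selected backend.
import json
from pathlib import Path

SELECTED_BACKEND = {  # per-language winners from the benchmarking stage
    "en": "cosyvoice", "fr": "cosyvoice",
    "ja": "chatterbox", "de": "chatterbox",
    "zh": "index_tts",
}

def synthesize_one(backend: str, text: str, voice: str, out_wav: Path) -> None:
    """Placeholder for the model-specific runner under benchmark/models/<backend>/."""
    raise NotImplementedError

for lang, backend in SELECTED_BACKEND.items():
    in_path = Path(f"generate/input/text/tts_inputs_{lang}.jsonl")
    for line in in_path.open(encoding="utf-8"):
        item = json.loads(line)  # assumed keys: "id", "text", "voice"
        out_wav = Path(f"output/{lang}/{item['voice']}/{item['id']}.wav")
        out_wav.parent.mkdir(parents=True, exist_ok=True)
        synthesize_one(backend, item["text"], item["voice"], out_wav)
```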

Outputs

Generated audio and manifests follow a standardized layout:

  • WAV audio files (organized by language / voice)
  • metadata_*.jsonl and failed_*.jsonl
  • logs for reproducibility
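
A minimal sketch of the bookkeeping behind that layout, assuming one record per attempted sample: successful syntheses are appended to metadata_<lang>.jsonl and failures to failed_<lang>.jsonl (the record fields are illustrative).

```python
# Hedged sketch of per-sample bookkeeping; file naming follows the metadata_*/failed_* pattern.
import json

def log_result(lang: str, item: dict, wav_path: str | None = None, error: str | None = None) -> None:
    """Append one record to the success or failure manifest for this language."""
    target = f"metadata_{lang}.jsonl" if error is None else f"failed_{lang}.jsonl"
    record = {**item, "audio": wav_path, "error": error}
    with open(target, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example usage (hypothetical item fields):
log_result("en", {"id": "clevr_000001", "voice": "voice03"}, wav_path="output/en/voice03/clevr_000001.wav")
log_result("de", {"id": "clevr_000002", "voice": "voice01"}, error="CUDA out of memory")
```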

Benchmark outputs live under: benchmark/output/
Final triplet dataset (Hugging Face): [TBD](https://huggingface.co/datasets/kkkyao/triplets_audio_image_text_v1)

Triplet Dataset Summary (Train-Ready Artifacts)

Triplet format

  • Each example is packaged as: (image, audio_prompt) → target_text
  • target_text is copied verbatim from the source instruction dataset (assistant response); only the user prompt is converted into speech.

Language & voice expansion

  • 5 languages: Chinese (zh), English (en), Japanese (ja), French (fr), German (de)
  • Up to 5 voices per configuration
    • If a backend supports voice cloning, we synthesize multiple speaker styles using a shared set of reference voices.
    • Otherwise, we use the model’s native speaker presets (when available).

Benchmark-driven selection

  • For large-scale synthesis, we use the per-language best backend selected from the benchmarking stage (see the table in TTS Benchmarking above), rather than synthesizing with all candidate models.

References and Citations

This repository builds on prior work and third-party tools/datasets. Please cite the original sources when appropriate.
Full BibTeX entries are provided here: docs/references.bib

Key References (Papers)

Tools and Libraries Referenced

External Datasets (Hugging Face)

Community Resources (Candidate Discovery)

Licensing, Voice Assets, and Intended Use

  • Licensing: Datasets, model weights, and third-party code used by this project are subject to their original licenses. Users are responsible for ensuring compliance with all upstream terms when reproducing results or redistributing artifacts.
  • Reference voice assets: To reduce privacy/impersonation risk, reference voice recordings used for voice cloning are not included in this repository (and should not be redistributed unless you have explicit rights to do so).
  • Intended use: Outputs are intended for research and training-data construction in an Image+Audio→Text setting (e.g., mid-training / alignment of multimodal instruction-following models). This pipeline is not intended for generating deceptive or identity-impersonating audio.
