@@ -7,7 +7,7 @@ and accuracy (WER, MMSU, MMMU) across supported modality combinations.
 
 ```
 benchmarks/
-├── tasks/        # Per-task logic (tts, mmsu, visual_understand)
+├── tasks/        # Per-task logic (tts, audio_understanding, visual_understand)
 ├── metrics/      # Metric computation (performance, accuracy)
 ├── dataset/      # Dataset loaders + download helpers
 ├── benchmarker/  # Framework: runner, data structures, utilities
@@ -29,6 +29,10 @@ python -m sglang_omni.cli.cli serve \
   --model-path fishaudio/s2-pro \
   --config examples/configs/s2pro_tts.yaml --port 8000
 
+# Voxtral-4B-TTS — for section 2d (plain TTS, no voice cloning)
+python -m sglang_omni.cli.cli serve \
+  --model-path mistralai/Voxtral-4B-TTS-2603 --port 8000
+
 # Qwen3-Omni, speech mode — for section 3 (SeedTTS; multi-GPU)
 python -m sglang_omni.cli.cli serve \
   --model-path Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8000
@@ -56,11 +60,35 @@ python -m benchmarks.eval.benchmark_tts_seedtts \
   --model fishaudio/s2-pro \
   --output-dir results/s2pro_en --lang en --device cuda:0
 
-# 3. Qwen3-Omni — same two-phase pipeline
+# 2d. Voxtral — full pipeline without voice cloning
+python -m benchmarks.eval.benchmark_tts_seedtts \
+  --meta seedtts_testset/en/meta.lst \
+  --model mistralai/Voxtral-4B-TTS-2603 --port 8000 \
+  --max-concurrency 16 \
+  --output-dir results/voxtral_en --lang en --max-samples 50 \
+  --no-ref-audio --voice cheerful_female
+
+# 3a. Qwen3-Omni — full pipeline (generate + transcribe)
 python -m benchmarks.eval.benchmark_omni_seedtts \
   --meta seedtts_testset/en/meta.lst \
-  --model qwen3-omni --port 8000 \
-  --output-dir results/qwen3_omni_en --max-samples 50
+  --output-dir results/qwen3_omni_en \
+  --max-concurrency 16 \
+  --model qwen3-omni --port 8000 --max-samples 50
+
+# 3b. Qwen3-Omni — generate only (server required; use in CI to split phases)
+python -m benchmarks.eval.benchmark_omni_seedtts \
+  --generate-only \
+  --meta seedtts_testset/en/meta.lst \
+  --output-dir results/qwen3_omni_en \
+  --max-concurrency 16 \
+  --model qwen3-omni --port 8000 --max-samples 50
+
+# 3c. Qwen3-Omni — transcribe only (reuses audio; no server)
+python -m benchmarks.eval.benchmark_omni_seedtts \
+  --transcribe-only \
+  --meta seedtts_testset/en/meta.lst \
+  --output-dir results/qwen3_omni_en \
+  --model qwen3-omni --lang en --device cuda:0
 
 # 4. Qwen3-Omni — MMSU (audio comprehension)
 python -m benchmarks.eval.benchmark_omni_mmsu \
@@ -76,7 +104,7 @@ python -m benchmarks.eval.benchmark_omni_mmmu \
 
 | Script | Task | Model | API |
 | -------- | ------ | ------- | ----- |
-| `eval/benchmark_tts_seedtts.py` | TTS speed + WER (unified) | S2-Pro | `/v1/audio/speech` |
+| `eval/benchmark_tts_seedtts.py` | TTS speed + WER (unified) | e.g. S2-Pro, Voxtral | `/v1/audio/speech` |
 | `eval/benchmark_omni_seedtts.py` | TTS speed + WER (unified) | Qwen3-Omni | `/v1/chat/completions` |
 | `eval/benchmark_omni_mmsu.py` | MMSU (audio comprehension) | Qwen3-Omni | `/v1/chat/completions` |
 | `eval/benchmark_omni_mmmu.py` | MMMU (VLM accuracy + speed) | Qwen3-Omni | `/v1/chat/completions` |
@@ -85,7 +113,10 @@ The two `*_seedtts.py` scripts merge the previous `benchmark_*_tts_speed.py`
 and `voice_clone_*_wer.py` pairs into a single two-phase pipeline: phase 1
 generates + persists WAVs while the server runs, phase 2 transcribes offline
 to avoid GPU contention with the server. Use `--generate-only` or
-`--transcribe-only` to run a single phase.
+`--transcribe-only` to run a single phase. For TTS, `--concurrency` and
+`--max-concurrency` are equivalent (see `benchmark_tts_seedtts.py`).
+`benchmark_omni_seedtts.py` documents local vs CI GPU usage in its module
+docstring (sequential phases on CI to reduce OOM risk).
 
 ## Adding a New Model or Task
 
@@ -104,5 +135,12 @@ Download helpers live in `benchmarks/dataset/prepare.py`:
 python -m benchmarks.dataset.prepare --dataset seedtts       # full SeedTTS
 python -m benchmarks.dataset.prepare --dataset seedtts-mini  # smoke-test subset
 python -m benchmarks.dataset.prepare --dataset seedtts-50    # 50-sample subset
+python -m benchmarks.dataset.prepare --dataset mmmu          # full MMMU (30 subjects)
 python -m benchmarks.dataset.prepare --dataset mmmu-ci-50    # MMMU CI subset
+python -m benchmarks.dataset.prepare --dataset mmsu          # full MMSU (ddwang2000/MMSU)
 ```
+
+SeedTTS datasets are materialized into `./seedtts_testset/` (override with
+`--local-dir`). MMMU/MMSU datasets are pre-warmed into the default
+HuggingFace cache and consumed via `datasets.load_dataset(repo_id)`, so
+`--local-dir` is a no-op for them.
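The on-disk vs cache split described in that note can be sketched as a tiny dispatcher. `resolve_dataset_source` is an invented helper for illustration only; just the two materialization strategies come from the text above.

```python
from pathlib import Path

def resolve_dataset_source(name: str, local_dir: str = "./seedtts_testset") -> str:
    """Illustrative dispatcher (not prepare.py's real code).

    SeedTTS variants are plain files under local_dir, so --local-dir
    is honored; MMMU/MMSU are read back later through the HuggingFace
    datasets cache, so local_dir is ignored for them.
    """
    if name.startswith("seedtts"):
        return str(Path(local_dir).resolve())
    # Consumed via datasets.load_dataset(repo_id); the cache location is
    # governed by HF_HOME / HF_DATASETS_CACHE, not --local-dir.
    return f"hf-cache:{name}"
```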