Changes from 178 commits
239 commits
db6aeec
update design doc (#711)
hsliuustc0106 Jan 9, 2026
254b720
init
qibaoyuan Jan 10, 2026
9bce494
Refactor talker_mtp condition for clarity
qibaoyuan Jan 11, 2026
6e68f46
[Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni (#560)
gcanlin Jan 10, 2026
06daf28
[Doc]: update vllm serve param and base64 data truncation (#718)
nuclearwu Jan 10, 2026
619497b
code format
qibaoyuan Jan 11, 2026
16eb429
[Perf][Qwen3-Omni] Fuse QKV linear and gate_up proj (#734)
gcanlin Jan 11, 2026
d7383b0
fix yaml config
qibaoyuan Jan 12, 2026
5d0538e
add offline example
qibaoyuan Jan 12, 2026
c345d9e
add offline example
qibaoyuan Jan 12, 2026
4a5fdea
add offline example
qibaoyuan Jan 12, 2026
0de14f2
add online example
qibaoyuan Jan 12, 2026
4e81962
modify online example
qibaoyuan Jan 12, 2026
8d184ec
modify online example
qibaoyuan Jan 12, 2026
bf027d0
modify online example
qibaoyuan Jan 12, 2026
42a16ea
[Feat] Enable DiT tensor parallel for Diffusion Pipeline(Z-Image) (#735)
dongbo910220 Jan 12, 2026
9b6e08e
[Bugfix] Fix multi-audio input shape alignment for Qwen3-Omni Thinker…
LJH-LBJ Jan 12, 2026
0886c70
[ROCm] [CI] Add More Tests (#542)
tjtanaa Jan 12, 2026
f98ece3
[Docs] update design doc templated in RFC (#746)
hsliuustc0106 Jan 12, 2026
2d9ca87
Add description of code version for bug report (#745)
yenuo26 Jan 12, 2026
fcc2de3
docs: add specific chat template
Jan 12, 2026
62aae30
[misc] fix rfc template (#748)
hsliuustc0106 Jan 12, 2026
f6e6dd0
fix:#issue 432 (#517)
GG-li Jan 12, 2026
56535a4
[Diffusion][Feature] Implement SP support in LongCatImageTransformer …
mxuax Jan 12, 2026
af5fd2a
Merge branch 'main' into feature_mimo_audio
qibaoyuan Jan 12, 2026
90e0274
Clarify audio dialogue task description in README
qibaoyuan Jan 12, 2026
4516b91
update design doc (#711)
hsliuustc0106 Jan 9, 2026
44bf712
init
qibaoyuan Jan 10, 2026
7ec1143
Refactor talker_mtp condition for clarity
qibaoyuan Jan 11, 2026
ba0b4fc
[Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni (#560)
gcanlin Jan 10, 2026
d1937a5
[Doc]: update vllm serve param and base64 data truncation (#718)
nuclearwu Jan 10, 2026
5a3a853
code format
qibaoyuan Jan 11, 2026
4597350
[Perf][Qwen3-Omni] Fuse QKV linear and gate_up proj (#734)
gcanlin Jan 11, 2026
c051b1e
fix yaml config
qibaoyuan Jan 12, 2026
79c3286
add offline example
qibaoyuan Jan 12, 2026
6f1bc42
add offline example
qibaoyuan Jan 12, 2026
2d94fb4
add offline example
qibaoyuan Jan 12, 2026
fcfd700
add online example
qibaoyuan Jan 12, 2026
d237706
modify online example
qibaoyuan Jan 12, 2026
4b2f0a5
modify online example
qibaoyuan Jan 12, 2026
6846a97
modify online example
qibaoyuan Jan 12, 2026
beb38a7
[Feat] Enable DiT tensor parallel for Diffusion Pipeline(Z-Image) (#735)
dongbo910220 Jan 12, 2026
657c267
[Bugfix] Fix multi-audio input shape alignment for Qwen3-Omni Thinker…
LJH-LBJ Jan 12, 2026
0e92408
[ROCm] [CI] Add More Tests (#542)
tjtanaa Jan 12, 2026
282735f
[Docs] update design doc templated in RFC (#746)
hsliuustc0106 Jan 12, 2026
ba46010
Add description of code version for bug report (#745)
yenuo26 Jan 12, 2026
a5ff60e
docs: add specific chat template
Jan 12, 2026
422a7f2
[misc] fix rfc template (#748)
hsliuustc0106 Jan 12, 2026
7e672b3
fix:#issue 432 (#517)
GG-li Jan 12, 2026
06059db
[Diffusion][Feature] Implement SP support in LongCatImageTransformer …
mxuax Jan 12, 2026
bac71be
Clarify audio dialogue task description in README
qibaoyuan Jan 12, 2026
ab5b365
remove config
qibaoyuan Jan 12, 2026
1564dbb
remove config
qibaoyuan Jan 12, 2026
6d4ac27
remove config
qibaoyuan Jan 12, 2026
7b03a1d
[Debug] Clean code in Qwen 3 Omni and add warning for talker temperat…
tzhouam Jan 12, 2026
1bc045b
[feature] cpu offloading support for diffusion (#497)
LawJarp-A Jan 12, 2026
25d0074
remove config
qibaoyuan Jan 12, 2026
4fa693d
remove config
qibaoyuan Jan 12, 2026
96d42be
remove config
qibaoyuan Jan 13, 2026
abc0971
remove config
qibaoyuan Jan 12, 2026
4280e9a
Merge branch 'main' into feature_mimo_audio
qibaoyuan Jan 12, 2026
7842f3a
remove config
qibaoyuan Jan 12, 2026
e70cdee
Merge remote-tracking branch 'origin/feature_mimo_audio' into feature…
qibaoyuan Jan 12, 2026
4c846c4
format
qibaoyuan Jan 12, 2026
9c35167
format
qibaoyuan Jan 12, 2026
aa2f61e
Revert "remove config"
qibaoyuan Jan 13, 2026
15416bd
remove config
qibaoyuan Jan 13, 2026
417fb35
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 13, 2026
487653b
refactor: format codes and solve errors
Dovis01 Jan 13, 2026
bcc33f0
code format
qibaoyuan Jan 13, 2026
fa14126
Merge branch 'main' into feature_mimo_audio
qibaoyuan Jan 13, 2026
949b730
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 13, 2026
d1f3689
[bugfix] fix config path
qibaoyuan Jan 13, 2026
cf7a9b1
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 13, 2026
278ed17
[bugfix] code key check
qibaoyuan Jan 13, 2026
ed955e9
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 14, 2026
cf2243a
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 15, 2026
b72023a
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 15, 2026
c70a0f4
[example] add more examples
qibaoyuan Jan 15, 2026
9155296
[example] comment
qibaoyuan Jan 15, 2026
91e4d77
[example] add missing file
qibaoyuan Jan 15, 2026
f53c95f
[example] tts with ref audio
qibaoyuan Jan 15, 2026
1564a00
[example] fix: tts without ref audio
qibaoyuan Jan 16, 2026
367107d
[example] fix: tts without instruct
qibaoyuan Jan 16, 2026
e0cb70c
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 16, 2026
1e7cc96
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 16, 2026
e3d30ba
feat: add customized preprocess func for MiMo
Dovis01 Jan 16, 2026
b42d3f8
[mimo-audio] add preprocess, avoiding pass req_id
qibaoyuan Jan 16, 2026
a96eebb
refactor: format codes
Dovis01 Jan 16, 2026
7a3b803
fix: mm-embedding is none
Dovis01 Jan 16, 2026
32b1777
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 16, 2026
59dbaab
[mimo-audio] files
qibaoyuan Jan 16, 2026
583dc43
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 18, 2026
e764450
feat: done for replace llm
Dovis01 Jan 18, 2026
e92a011
Merge pull request #4 from qibaoyuan/zsj_dev/feat_mimo_audio_vllm_model
qibaoyuan Jan 18, 2026
6acd81c
[mimo-audio] code format
qibaoyuan Jan 18, 2026
fe82b4a
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 19, 2026
2da2577
[mimo-audio] local decode cudagraph
qibaoyuan Jan 20, 2026
4a036a8
add implementation source annotation
Jan 20, 2026
638139a
[mimo-audio] online example settings
qibaoyuan Jan 20, 2026
2b777cb
Merge pull request #5 from qibaoyuan/dn/mimo_cg
qibaoyuan Jan 20, 2026
18b27fe
[mimo-audio] code format
qibaoyuan Jan 20, 2026
f2e1a72
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 20, 2026
00085a0
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 20, 2026
8f5113f
[mimo-audio] set default model name
qibaoyuan Jan 20, 2026
7c858c0
[mimo-audio] disable cuda graph in localforward for further optimization
qibaoyuan Jan 20, 2026
f6af4f0
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 21, 2026
0693c81
[mimo-audio] adapt to vllm0.14.0
qibaoyuan Jan 21, 2026
d6edd8a
[mimo-audio] revert
qibaoyuan Jan 21, 2026
4faae11
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 21, 2026
8634cc1
[mimo-audio] comment
qibaoyuan Jan 21, 2026
963f36b
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 21, 2026
d492998
[mimo-audio] pre-commit error fix
qibaoyuan Jan 21, 2026
4a94c12
[mimo-audio] english comment
qibaoyuan Jan 21, 2026
26172a5
[mimo-audio] pre-commit error fix
qibaoyuan Jan 21, 2026
6098fa6
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 22, 2026
c43b4b0
[mimo-audio] more template
qibaoyuan Jan 22, 2026
a59e74a
[mimo-audio] use current gpu
qibaoyuan Jan 22, 2026
cd55822
[mimo-audio] rename
qibaoyuan Jan 22, 2026
3971d24
[mimo-audio] rename
qibaoyuan Jan 22, 2026
06959f4
[mimo-audio] flat-attn import fix
qibaoyuan Jan 22, 2026
3ba923e
[mimo-audio] move custom fuc to main class
qibaoyuan Jan 22, 2026
39cdbc4
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 22, 2026
3fe9484
feat: done for replace llm
Dovis01 Jan 18, 2026
9366109
feat: add audio transcribing example
Dovis01 Jan 18, 2026
1b462e6
feat([batch]): done for batch 2
Dovis01 Jan 22, 2026
6ecdc54
Merge branch 'feature_mimo_audio' into zsj_dev/feat_mimo_audio_vllm_m…
Dovis01 Jan 22, 2026
41d0e96
Merge pull request #6 from qibaoyuan/zsj_dev/feat_mimo_audio_vllm_model
qibaoyuan Jan 22, 2026
0549846
[mimo-audio] code format
qibaoyuan Jan 22, 2026
ae8c256
[mimo-audio] code format
qibaoyuan Jan 22, 2026
340b257
Merge branch 'main' into feature_mimo_audio
qibaoyuan Jan 23, 2026
65d1645
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 23, 2026
e189317
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 23, 2026
0c9fa9d
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 23, 2026
35af870
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 23, 2026
66659b2
feat: done for upgrading 14
Dovis01 Jan 23, 2026
980f93d
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 24, 2026
2eb12ea
Merge pull request #8 from qibaoyuan/zsj_dev/feature_mimo_14
qibaoyuan Jan 24, 2026
76996b3
[mimo-audio] code format
qibaoyuan Jan 24, 2026
45d4ae8
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 26, 2026
516cbeb
[mimo-audio] value check
qibaoyuan Jan 26, 2026
024f213
Merge branch 'main' into feature_mimo_audio
qibaoyuan Jan 26, 2026
75b3737
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 26, 2026
9a852d2
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 26, 2026
f8c639a
feat([batch]): done for higher batching size
Dovis01 Jan 27, 2026
c13b499
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 27, 2026
7089d13
Merge branch 'feature_mimo_audio' into zsj_dev/feature_mimo_14
qibaoyuan Jan 27, 2026
b7f7774
Merge pull request #9 from qibaoyuan/zsj_dev/feature_mimo_14
qibaoyuan Jan 27, 2026
8c5ad26
[mimo-audio] code format
qibaoyuan Jan 27, 2026
2388f7f
Revert "[mimo-audio] code format"
qibaoyuan Jan 27, 2026
8ed7f0b
[mimo-audio] code format
qibaoyuan Jan 27, 2026
962aee8
feat: add new supplimental configs
Dovis01 Jan 27, 2026
c4649f0
[mimo-audio] code format
qibaoyuan Jan 27, 2026
0e03317
[mimo-audio] yaml comment
qibaoyuan Jan 27, 2026
a015530
feat: terminate output audio for ASR
Dovis01 Jan 27, 2026
00cde49
[mimo-audio] code format
qibaoyuan Jan 27, 2026
5c0e367
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 27, 2026
ad4b60b
[mimo-audio] local forward & input local transformer cudagraph
Jan 27, 2026
2a59811
feat: change to none returning
Dovis01 Jan 27, 2026
dde676d
[mimo-audio] code format
qibaoyuan Jan 27, 2026
821e81e
[mimo-audio] remove torch.cuda.synchronize() from replay
Jan 27, 2026
045264b
Merge pull request #10 from qibaoyuan/dn/feature_mimo_auido_local_for…
qibaoyuan Jan 27, 2026
31a5097
[mimo-audio] code format
qibaoyuan Jan 27, 2026
aabaa4a
[mimo-audio] code format
qibaoyuan Jan 27, 2026
f1474ed
[mimo-audio] code format
qibaoyuan Jan 27, 2026
be77436
[mimo-audio] readme
qibaoyuan Jan 27, 2026
330f5b2
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 27, 2026
a92b110
[mimo-audio] use english
qibaoyuan Jan 27, 2026
93080da
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 27, 2026
14eac8b
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 28, 2026
a45a21e
[mimo-audio] use en instead of chi
qibaoyuan Jan 28, 2026
e8d9f25
refactor: add comments
Dovis01 Jan 28, 2026
ae23b97
[mimo-audio] format
qibaoyuan Jan 28, 2026
cd2d138
docs: add MiMo-Audio Offline README
Dovis01 Jan 28, 2026
5512694
[mimo-audio] use default batch-size
qibaoyuan Jan 28, 2026
d727972
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 28, 2026
4a58ae7
docs: add extra instruction
Dovis01 Jan 28, 2026
113e554
refactor: merge two config related py files
Dovis01 Jan 28, 2026
4e2cc79
[mimo-audio] code format
qibaoyuan Jan 28, 2026
157fbe1
[mimo-audio] logger
qibaoyuan Jan 28, 2026
44f8e32
[mimo-audio] return val
qibaoyuan Jan 28, 2026
7cfe57a
[mimo-audio] error check
qibaoyuan Jan 28, 2026
4d0b524
[mimo-audio] error check
qibaoyuan Jan 28, 2026
952c0c3
Update vllm_omni/model_executor/models/mimo_audio/mimo_audio_code2wav.py
qibaoyuan Jan 28, 2026
41d0630
Update vllm_omni/model_executor/models/mimo_audio/mimo_audio_code2wav.py
qibaoyuan Jan 28, 2026
c4e6558
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 28, 2026
a5232cc
fix: solve compatibility issues on stages passing
Dovis01 Jan 28, 2026
5677c89
[mimo-audio] chi
qibaoyuan Jan 28, 2026
f56a4c1
Merge remote-tracking branch 'origin/feature_mimo_audio' into feature…
qibaoyuan Jan 28, 2026
9726769
[mimo-audio] fix path
qibaoyuan Jan 28, 2026
019f139
[mimo-audio] fix return val
qibaoyuan Jan 28, 2026
64235e3
[mimo-audio] fix return val
qibaoyuan Jan 28, 2026
df72fdb
Merge branch 'main' into feature_mimo_audio
qibaoyuan Jan 28, 2026
e9d1e57
[mimo-audio] fix english
qibaoyuan Jan 28, 2026
ad2abb4
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 28, 2026
07e954e
[mimo-audio] mimo-audio check and unit test
qibaoyuan Jan 28, 2026
247f9d4
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 28, 2026
09e5670
chore: change to original sampling params
Dovis01 Jan 28, 2026
aedba9f
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 29, 2026
d60a749
[mimo-audio] remove debug
qibaoyuan Jan 29, 2026
5300820
[mimo-audio] update device def
qibaoyuan Jan 29, 2026
77fb0f3
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 29, 2026
2f106db
[mimo-audio] incr token len
qibaoyuan Jan 29, 2026
cd864bc
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 29, 2026
8244e62
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 29, 2026
5ed5686
[mimo-audio] gpu mem adapt on h20
qibaoyuan Jan 29, 2026
1876af3
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 30, 2026
1c8cc79
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 30, 2026
2427fe8
[mimo-audio] mimo-audio check
qibaoyuan Jan 30, 2026
df7626a
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Jan 30, 2026
c5b67e0
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 1, 2026
00f3988
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 2, 2026
d6ebaa5
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 3, 2026
1a54954
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 3, 2026
f97de11
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 3, 2026
6d92db4
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 3, 2026
b35768c
Merge branch 'main' into feature_mimo_audio
qibaoyuan Feb 3, 2026
910f222
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 4, 2026
cbec602
fix: Solve correct code merging
Dovis01 Feb 4, 2026
973a6eb
Merge branch 'main' into feature_mimo_audio
qibaoyuan Feb 4, 2026
c8b4fa9
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 4, 2026
ae9b3d2
feat[Streaming]: Done for streaming output on MiMo
Dovis01 Feb 4, 2026
eddfcdd
[mimo-audio] code format
qibaoyuan Feb 4, 2026
a756fd5
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 4, 2026
87fa83e
[mimo-audio] add async_chunk adn sync yaml file
qibaoyuan Feb 5, 2026
1b6d6b0
[mimo-audio] add async_chunk adn sync yaml file
qibaoyuan Feb 5, 2026
da87773
[mimo-audio] add async_chunk adn sync yaml file
qibaoyuan Feb 5, 2026
fdeac93
[mimo-audio] adapt vllm15
qibaoyuan Feb 5, 2026
0eb52a9
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 5, 2026
9831fb2
[MiMo-Audio] fix None value due to upgrading to 15
Dovis01 Feb 5, 2026
9af6ef9
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 5, 2026
e9fa9fd
[mimo-audio] code wrap
qibaoyuan Feb 5, 2026
e103f1f
fix[Streaming]: Solve non-finished req status
Dovis01 Feb 5, 2026
cd3b88d
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 5, 2026
f71483b
[mimo-audio] system prompt reset
qibaoyuan Feb 6, 2026
e3d60cc
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan Feb 6, 2026
b021820
[mimo-audio] revert
qibaoyuan Feb 6, 2026
315959d
feat([Streaming]): Done for batching streaming
Dovis01 Feb 6, 2026
a08e38c
[mimo-audio] print more
qibaoyuan Feb 6, 2026
193 changes: 193 additions & 0 deletions examples/offline_inference/mimo_audio/README.md
@@ -0,0 +1,193 @@
# MiMo-Audio Offline Inference

This directory contains an offline demo for running MiMo-Audio models with vLLM Omni. It builds task-specific inputs and generates WAV files or text outputs locally.

## Model Overview

MiMo-Audio provides multiple task variants for audio understanding and generation:

- **tts_sft**: Basic text-to-speech generation from text input.
- **tts_sft_with_instruct**: TTS generation with explicit voice style instructions.
- **tts_sft_with_audio**: TTS generation with audio reference for voice cloning.
- **tts_sft_with_natural_instruction**: TTS generation from natural language descriptions embedded in text.
- **audio_trancribing_sft**: Transcribe audio to text (speech-to-text).
- **audio_understanding_sft**: Understand and analyze audio content with text queries.
- **audio_understanding_sft_with_thinking**: Audio understanding with a reasoning chain.
- **spoken_dialogue_sft_multiturn**: Multi-turn spoken dialogue with audio input/output.
- **speech2text_dialogue_sft_multiturn**: Multi-turn dialogue converting speech to text.
- **text_dialogue_sft_multiturn**: Multi-turn text-only dialogue.

## Setup

Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.

### Environment Variables

The `MIMO_AUDIO_TOKENIZER_PATH` environment variable is required. Set it to the MiMo-Audio tokenizer before running any of the commands below:

```bash
export MIMO_AUDIO_TOKENIZER_PATH="XiaomiMiMo/MiMo-Audio-Tokenizer"
```
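
Because the variable is required, a small pre-flight check can catch a missing export before the model starts loading. The sketch below is only an illustration: `check_tokenizer_path` is a hypothetical helper, not something provided by `end2end.py` or vLLM Omni.

```bash
# Hypothetical helper (not part of end2end.py): verify that the tokenizer
# path is exported before launching the demo.
check_tokenizer_path() {
  if [ -z "${MIMO_AUDIO_TOKENIZER_PATH:-}" ]; then
    echo "MIMO_AUDIO_TOKENIZER_PATH is not set; export it before running end2end.py" >&2
    return 1
  fi
  echo "Using tokenizer: $MIMO_AUDIO_TOKENIZER_PATH"
}
```

Calling `check_tokenizer_path || exit 1` at the top of a wrapper script stops the run early when the variable is missing.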

## Quick Start

Run a single sample for basic TTS:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type tts_sft
```

Run batch samples for basic TTS:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type tts_sft \
--num-prompts {batch_size}
```

When running multiple prompts in a batch, the total number of tokens passed to the next stage may exceed the `max_model_len` value in `mimo_audio.yaml`. If you raise `max_model_len` to accommodate this, you must also update `max_position_embeddings` in `MiMo-Audio-7B-Instruct/config.json` to the same value.
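
One way to confirm that the two values agree is a quick text-based comparison. This is a minimal sketch under stated assumptions: the YAML contains plain `max_model_len: <integer>` lines (the largest one is used if several stages define it), and `config.json` contains a plain `"max_position_embeddings": <integer>` entry. The function names are hypothetical and the script does not parse YAML or JSON properly.

```bash
# Largest max_model_len in a stage config.
# Assumes lines of the form "max_model_len: <integer>".
max_model_len() {
  grep -Eo 'max_model_len:[[:space:]]*[0-9]+' "$1" | grep -Eo '[0-9]+$' | sort -n | tail -n 1
}

# max_position_embeddings from a model config.json.
# Assumes an entry of the form "max_position_embeddings": <integer>.
max_position_embeddings() {
  grep -Eo '"max_position_embeddings"[[:space:]]*:[[:space:]]*[0-9]+' "$1" | grep -Eo '[0-9]+$' | head -n 1
}

# Compare the two values; returns 0 on match, 1 on mismatch, 2 if unreadable.
check_lengths() {
  yaml_len=$(max_model_len "$1")
  json_len=$(max_position_embeddings "$2")
  if [ -z "$yaml_len" ] || [ -z "$json_len" ]; then
    echo "could not read max_model_len or max_position_embeddings" >&2
    return 2
  fi
  if [ "$yaml_len" -ne "$json_len" ]; then
    echo "mismatch: max_model_len=$yaml_len, max_position_embeddings=$json_len" >&2
    return 1
  fi
  echo "ok: both are $yaml_len"
}
```

For example, `check_lengths vllm_omni/model_executor/stage_configs/mimo_audio.yaml MiMo-Audio-7B-Instruct/config.json` prints an `ok` line when the values match and reports the two numbers otherwise.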

Generated audio files are saved to `output_audio/` by default. `--num-prompts` can also be used with all tasks below.

## Task Usage

### tts_sft (Basic Text-to-Speech)

Generate speech from text input:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type tts_sft \
--text "The weather is so nice today."
```

### tts_sft_with_instruct (TTS with Voice Instructions)

Generate speech with explicit voice style instructions:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type tts_sft_with_instruct \
--text "The weather is so nice today." \
--instruct "Speak happily in a child's voice"
```

### tts_sft_with_audio (TTS with Audio Reference)

Generate speech using an audio reference for voice cloning:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type tts_sft_with_audio \
--text "The weather is so nice today." \
--audio-path "./spoken_dialogue_assistant_turn_1.wav"
```

### tts_sft_with_natural_instruction (Natural Language TTS)

Generate speech from text containing natural voice descriptions:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type tts_sft_with_natural_instruction \
--text "In a panting young male voice, he said: I can't run anymore, wait for me!"
```

### audio_trancribing_sft (Speech-to-Text)

Transcribe audio to text:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type audio_trancribing_sft \
--audio-path "./spoken_dialogue_assistant_turn_1.wav"
```

### audio_understanding_sft (Audio Understanding)

Understand and analyze audio content with text queries:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type audio_understanding_sft \
--text "Summarize the audio." \
--audio-path "./spoken_dialogue_assistant_turn_1.wav"
```

### audio_understanding_sft_with_thinking (Audio Understanding with Reasoning)

Audio understanding with a reasoning chain:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type audio_understanding_sft_with_thinking \
--text "Summarize the audio." \
--audio-path "./spoken_dialogue_assistant_turn_1.wav"
```

### spoken_dialogue_sft_multiturn (Multi-turn Spoken Dialogue)

Multi-turn dialogue with audio input and output:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type spoken_dialogue_sft_multiturn \
--audio-path "./prompt_speech_zh_m.wav"
```

Note: This task uses hardcoded audio files in the script. The audio files used in examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples

### speech2text_dialogue_sft_multiturn (Speech-to-Text Dialogue)

Multi-turn dialogue converting speech to text:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type speech2text_dialogue_sft_multiturn
```

Note: This task uses hardcoded audio files and message lists in the script.

### text_dialogue_sft_multiturn (Text Dialogue)

Multi-turn text-only dialogue:

```bash
python3 -u end2end.py \
--stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
--model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
--query-type text_dialogue_sft_multiturn
```

Note: This task uses hardcoded message lists in the script.

## Notes

- The script uses default model paths and audio files embedded in `end2end.py`. Update them if your local cache path differs.
- Use `--output-dir` to change the output folder (default: `./output_audio`).
- Use `--num-prompts` to generate multiple prompts in one run (default: 1).
- Audio files used in multi-turn dialogue examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples
- The script supports various configuration options for initialization timeouts, batch timeouts, and shared memory thresholds. See `--help` for details.
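
When trying several tasks in a row, it can help to wrap the common arguments and keep each task's output in its own folder via `--output-dir`. The wrapper below is a sketch, not part of the demo: `run_task` and the `DRY_RUN` variable are hypothetical names, while the flags are the ones documented above.

```bash
# Hypothetical wrapper: run one task with the shared arguments and a
# per-task output directory. With DRY_RUN=1 the command is printed
# instead of executed.
run_task() {
  query_type=$1
  shift
  set -- python3 -u end2end.py \
    --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \
    --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \
    --query-type "$query_type" \
    --output-dir "output_audio/$query_type" \
    "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `run_task tts_sft --text "The weather is so nice today."` followed by `run_task text_dialogue_sft_multiturn` writes the results of the two tasks to separate subfolders of `output_audio/`.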