[New Model]: XiaomiMiMo/MiMo-Audio-7B-Instruct support #750
Open
qibaoyuan
wants to merge
239
commits into
vllm-project:main
from
qibaoyuan:feature_mimo_audio
+7,979
−0
Open
Changes from 185 commits
db6aeec
update design doc (#711)
hsliuustc0106 254b720
init
qibaoyuan 9bce494
Refactor talker_mtp condition for clarity
qibaoyuan 6e68f46
[Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni (#560)
gcanlin 06daf28
[Doc]: update vllm serve param and base64 data truncation (#718)
nuclearwu 619497b
code format
qibaoyuan 16eb429
[Perf][Qwen3-Omni] Fuse QKV linear and gate_up proj (#734)
gcanlin d7383b0
fix yaml config
qibaoyuan 5d0538e
add offline example
qibaoyuan c345d9e
add offline example
qibaoyuan 4a5fdea
add offline example
qibaoyuan 0de14f2
add online example
qibaoyuan 4e81962
modify online example
qibaoyuan 8d184ec
modify online example
qibaoyuan bf027d0
modify online example
qibaoyuan 42a16ea
[Feat] Enable DiT tensor parallel for Diffusion Pipeline(Z-Image) (#735)
dongbo910220 9b6e08e
[Bugfix] Fix multi-audio input shape alignment for Qwen3-Omni Thinker…
LJH-LBJ 0886c70
[ROCm] [CI] Add More Tests (#542)
tjtanaa f98ece3
[Docs] update design doc templated in RFC (#746)
hsliuustc0106 2d9ca87
Add description of code version for bug report (#745)
yenuo26 fcc2de3
docs: add specific chat template
62aae30
[misc] fix rfc template (#748)
hsliuustc0106 f6e6dd0
fix:#issue 432 (#517)
GG-li 56535a4
[Diffusion][Feature] Implement SP support in LongCatImageTransformer …
mxuax af5fd2a
Merge branch 'main' into feature_mimo_audio
qibaoyuan 90e0274
Clarify audio dialogue task description in README
qibaoyuan 4516b91
update design doc (#711)
hsliuustc0106 44bf712
init
qibaoyuan 7ec1143
Refactor talker_mtp condition for clarity
qibaoyuan ba0b4fc
[Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni (#560)
gcanlin d1937a5
[Doc]: update vllm serve param and base64 data truncation (#718)
nuclearwu 5a3a853
code format
qibaoyuan 4597350
[Perf][Qwen3-Omni] Fuse QKV linear and gate_up proj (#734)
gcanlin c051b1e
fix yaml config
qibaoyuan 79c3286
add offline example
qibaoyuan 6f1bc42
add offline example
qibaoyuan 2d94fb4
add offline example
qibaoyuan fcfd700
add online example
qibaoyuan d237706
modify online example
qibaoyuan 4b2f0a5
modify online example
qibaoyuan 6846a97
modify online example
qibaoyuan beb38a7
[Feat] Enable DiT tensor parallel for Diffusion Pipeline(Z-Image) (#735)
dongbo910220 657c267
[Bugfix] Fix multi-audio input shape alignment for Qwen3-Omni Thinker…
LJH-LBJ 0e92408
[ROCm] [CI] Add More Tests (#542)
tjtanaa 282735f
[Docs] update design doc templated in RFC (#746)
hsliuustc0106 ba46010
Add description of code version for bug report (#745)
yenuo26 a5ff60e
docs: add specific chat template
422a7f2
[misc] fix rfc template (#748)
hsliuustc0106 7e672b3
fix:#issue 432 (#517)
GG-li 06059db
[Diffusion][Feature] Implement SP support in LongCatImageTransformer …
mxuax bac71be
Clarify audio dialogue task description in README
qibaoyuan ab5b365
remove config
qibaoyuan 1564dbb
remove config
qibaoyuan 6d4ac27
remove config
qibaoyuan 7b03a1d
[Debug] Clean code in Qwen 3 Omni and add warning for talker temperat…
tzhouam 1bc045b
[feature] cpu offloading support for diffusion (#497)
LawJarp-A 25d0074
remove config
qibaoyuan 4fa693d
remove config
qibaoyuan 96d42be
remove config
qibaoyuan abc0971
remove config
qibaoyuan 4280e9a
Merge branch 'main' into feature_mimo_audio
qibaoyuan 7842f3a
remove config
qibaoyuan e70cdee
Merge remote-tracking branch 'origin/feature_mimo_audio' into feature…
qibaoyuan 4c846c4
format
qibaoyuan 9c35167
format
qibaoyuan aa2f61e
Revert "remove config"
qibaoyuan 15416bd
remove config
qibaoyuan 417fb35
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 487653b
refactor: format codes and solve errors
Dovis01 bcc33f0
code format
qibaoyuan fa14126
Merge branch 'main' into feature_mimo_audio
qibaoyuan 949b730
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan d1f3689
[bugfix] fix config path
qibaoyuan cf7a9b1
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 278ed17
[bugfix] code key check
qibaoyuan ed955e9
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan cf2243a
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan b72023a
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan c70a0f4
[example] add more examples
qibaoyuan 9155296
[example] comment
qibaoyuan 91e4d77
[example] add missing file
qibaoyuan f53c95f
[example] tts with ref audio
qibaoyuan 1564a00
[example] fix: tts without ref audio
qibaoyuan 367107d
[example] fix: tts without instruct
qibaoyuan e0cb70c
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 1e7cc96
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan e3d30ba
feat: add customized preprocess func for MiMo
Dovis01 b42d3f8
[mimo-audio] add preprocess, avoiding pass req_id
qibaoyuan a96eebb
refactor: format codes
Dovis01 7a3b803
fix: mm-embedding is none
Dovis01 32b1777
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 59dbaab
[mimo-audio] files
qibaoyuan 583dc43
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan e764450
feat: done for replace llm
Dovis01 e92a011
Merge pull request #4 from qibaoyuan/zsj_dev/feat_mimo_audio_vllm_model
qibaoyuan 6acd81c
[mimo-audio] code format
qibaoyuan fe82b4a
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 2da2577
[mimo-audio] local decode cudagraph
qibaoyuan 4a036a8
add implementation source annotation
638139a
[mimo-audio] online example settings
qibaoyuan 2b777cb
Merge pull request #5 from qibaoyuan/dn/mimo_cg
qibaoyuan 18b27fe
[mimo-audio] code format
qibaoyuan f2e1a72
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 00085a0
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 8f5113f
[mimo-audio] set default model name
qibaoyuan 7c858c0
[mimo-audio] disable cuda graph in localforward for further optimization
qibaoyuan f6af4f0
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 0693c81
[mimo-audio] adapt to vllm0.14.0
qibaoyuan d6edd8a
[mimo-audio] revert
qibaoyuan 4faae11
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 8634cc1
[mimo-audio] comment
qibaoyuan 963f36b
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan d492998
[mimo-audio] pre-commit error fix
qibaoyuan 4a94c12
[mimo-audio] english comment
qibaoyuan 26172a5
[mimo-audio] pre-commit error fix
qibaoyuan 6098fa6
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan c43b4b0
[mimo-audio] more template
qibaoyuan a59e74a
[mimo-audio] use current gpu
qibaoyuan cd55822
[mimo-audio] rename
qibaoyuan 3971d24
[mimo-audio] rename
qibaoyuan 06959f4
[mimo-audio] flat-attn import fix
qibaoyuan 3ba923e
[mimo-audio] move custom fuc to main class
qibaoyuan 39cdbc4
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 3fe9484
feat: done for replace llm
Dovis01 9366109
feat: add audio transcribing example
Dovis01 1b462e6
feat([batch]): done for batch 2
Dovis01 6ecdc54
Merge branch 'feature_mimo_audio' into zsj_dev/feat_mimo_audio_vllm_m…
Dovis01 41d0e96
Merge pull request #6 from qibaoyuan/zsj_dev/feat_mimo_audio_vllm_model
qibaoyuan 0549846
[mimo-audio] code format
qibaoyuan ae8c256
[mimo-audio] code format
qibaoyuan 340b257
Merge branch 'main' into feature_mimo_audio
qibaoyuan 65d1645
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan e189317
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 0c9fa9d
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 35af870
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 66659b2
feat: done for upgrading 14
Dovis01 980f93d
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 2eb12ea
Merge pull request #8 from qibaoyuan/zsj_dev/feature_mimo_14
qibaoyuan 76996b3
[mimo-audio] code format
qibaoyuan 45d4ae8
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 516cbeb
[mimo-audio] value check
qibaoyuan 024f213
Merge branch 'main' into feature_mimo_audio
qibaoyuan 75b3737
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 9a852d2
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan f8c639a
feat([batch]): done for higher batching size
Dovis01 c13b499
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 7089d13
Merge branch 'feature_mimo_audio' into zsj_dev/feature_mimo_14
qibaoyuan b7f7774
Merge pull request #9 from qibaoyuan/zsj_dev/feature_mimo_14
qibaoyuan 8c5ad26
[mimo-audio] code format
qibaoyuan 2388f7f
Revert "[mimo-audio] code format"
qibaoyuan 8ed7f0b
[mimo-audio] code format
qibaoyuan 962aee8
feat: add new supplimental configs
Dovis01 c4649f0
[mimo-audio] code format
qibaoyuan 0e03317
[mimo-audio] yaml comment
qibaoyuan a015530
feat: terminate output audio for ASR
Dovis01 00cde49
[mimo-audio] code format
qibaoyuan 5c0e367
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan ad4b60b
[mimo-audio] local forward & input local transformer cudagraph
2a59811
feat: change to none returning
Dovis01 dde676d
[mimo-audio] code format
qibaoyuan 821e81e
[mimo-audio] remove torch.cuda.synchronize() from replay
045264b
Merge pull request #10 from qibaoyuan/dn/feature_mimo_auido_local_for…
qibaoyuan 31a5097
[mimo-audio] code format
qibaoyuan aabaa4a
[mimo-audio] code format
qibaoyuan f1474ed
[mimo-audio] code format
qibaoyuan be77436
[mimo-audio] readme
qibaoyuan 330f5b2
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan a92b110
[mimo-audio] use english
qibaoyuan 93080da
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 14eac8b
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan a45a21e
[mimo-audio] use en instead of chi
qibaoyuan e8d9f25
refactor: add comments
Dovis01 ae23b97
[mimo-audio] format
qibaoyuan cd2d138
docs: add MiMo-Audio Offline README
Dovis01 5512694
[mimo-audio] use default batch-size
qibaoyuan d727972
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 4a58ae7
docs: add extra instruction
Dovis01 113e554
refactor: merge two config related py files
Dovis01 4e2cc79
[mimo-audio] code format
qibaoyuan 157fbe1
[mimo-audio] logger
qibaoyuan 44f8e32
[mimo-audio] return val
qibaoyuan 7cfe57a
[mimo-audio] error check
qibaoyuan 4d0b524
[mimo-audio] error check
qibaoyuan 952c0c3
Update vllm_omni/model_executor/models/mimo_audio/mimo_audio_code2wav.py
qibaoyuan 41d0630
Update vllm_omni/model_executor/models/mimo_audio/mimo_audio_code2wav.py
qibaoyuan c4e6558
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan a5232cc
fix: solve compatibility issues on stages passing
Dovis01 5677c89
[mimo-audio] chi
qibaoyuan f56a4c1
Merge remote-tracking branch 'origin/feature_mimo_audio' into feature…
qibaoyuan 9726769
[mimo-audio] fix path
qibaoyuan 019f139
[mimo-audio] fix return val
qibaoyuan 64235e3
[mimo-audio] fix return val
qibaoyuan df72fdb
Merge branch 'main' into feature_mimo_audio
qibaoyuan e9d1e57
[mimo-audio] fix english
qibaoyuan ad2abb4
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 07e954e
[mimo-audio] mimo-audio check and unit test
qibaoyuan 247f9d4
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 09e5670
chore: change to original sampling params
Dovis01 aedba9f
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan d60a749
[mimo-audio] remove debug
qibaoyuan 5300820
[mimo-audio] update device def
qibaoyuan 77fb0f3
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 2f106db
[mimo-audio] incr token len
qibaoyuan cd864bc
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 8244e62
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 5ed5686
[mimo-audio] gpu mem adapt on h20
qibaoyuan 1876af3
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 1c8cc79
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 2427fe8
[mimo-audio] mimo-audio check
qibaoyuan df7626a
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan c5b67e0
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 00f3988
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan d6ebaa5
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 1a54954
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan f97de11
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 6d92db4
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan b35768c
Merge branch 'main' into feature_mimo_audio
qibaoyuan 910f222
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan cbec602
fix: Solve correct code merging
Dovis01 973a6eb
Merge branch 'main' into feature_mimo_audio
qibaoyuan c8b4fa9
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan ae9b3d2
feat[Streaming]: Done for streaming output on MiMo
Dovis01 eddfcdd
[mimo-audio] code format
qibaoyuan a756fd5
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 87fa83e
[mimo-audio] add async_chunk adn sync yaml file
qibaoyuan 1b6d6b0
[mimo-audio] add async_chunk adn sync yaml file
qibaoyuan da87773
[mimo-audio] add async_chunk adn sync yaml file
qibaoyuan fdeac93
[mimo-audio] adapt vllm15
qibaoyuan 0eb52a9
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan 9831fb2
[MiMo-Audio] fix None value due to upgrading to 15
Dovis01 9af6ef9
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan e9fa9fd
[mimo-audio] code wrap
qibaoyuan e103f1f
fix[Streaming]: Solve non-finished req status
Dovis01 cd3b88d
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan f71483b
[mimo-audio] system prompt reset
qibaoyuan e3d60cc
Merge branch 'vllm-project:main' into feature_mimo_audio
qibaoyuan b021820
[mimo-audio] revert
qibaoyuan 315959d
feat([Streaming]): Done for batching streaming
Dovis01 a08e38c
[mimo-audio] print more
qibaoyuan
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,193 @@ | ||
| # MiMo-Audio Offline Inference | ||
|
|
||
| This directory contains an offline demo for running MiMo-Audio models with vLLM Omni. It builds task-specific inputs and generates WAV files or text outputs locally. | ||
|
|
||
| ## Model Overview | ||
|
|
||
| MiMo-Audio provides multiple task variants for audio understanding and generation: | ||
|
|
||
| - **tts_sft**: Basic text-to-speech generation from text input. | ||
| - **tts_sft_with_instruct**: TTS generation with explicit voice style instructions. | ||
| - **tts_sft_with_audio**: TTS generation with audio reference for voice cloning. | ||
| - **tts_sft_with_natural_instruction**: TTS generation from natural language descriptions embedded in text. | ||
| - **audio_trancribing_sft**: Transcribe audio to text (speech-to-text). | ||
| - **audio_understanding_sft**: Understand and analyze audio content with text queries. | ||
| - **audio_understanding_sft_with_thinking**: Audio understanding with reasoning chain. | ||
| - **spoken_dialogue_sft_multiturn**: Multi-turn spoken dialogue with audio input/output. | ||
| - **speech2text_dialogue_sft_multiturn**: Multi-turn dialogue converting speech to text. | ||
| - **text_dialogue_sft_multiturn**: Multi-turn text-only dialogue. | ||
|
|
||
| ## Setup | ||
|
|
||
| Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup. | ||
|
|
||
| ### Environment Variables | ||
|
|
||
| The `MIMO_AUDIO_TOKENIZER_PATH` environment variable is required. The audio tokenizer is a separate checkpoint from the main model, so its location must be set explicitly: | |
|
|
||
| ```bash | ||
| export MIMO_AUDIO_TOKENIZER_PATH="XiaomiMiMo/MiMo-Audio-Tokenizer" | ||
| ``` | ||
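Because the variable is mandatory, it can help to fail fast when it is missing rather than partway through model loading. The following is a minimal sketch; `require_tokenizer_path` is a hypothetical helper for illustration and is not part of `end2end.py`.

```bash
# Hypothetical helper (an assumption, not part of the repository):
# stop early with a clear message when the tokenizer path is not set.
require_tokenizer_path() {
    if [ -z "${MIMO_AUDIO_TOKENIZER_PATH:-}" ]; then
        echo "MIMO_AUDIO_TOKENIZER_PATH is not set" >&2
        return 1
    fi
    echo "Using audio tokenizer: ${MIMO_AUDIO_TOKENIZER_PATH}"
}
```

One way to use it is to call `require_tokenizer_path || exit 1` at the top of a wrapper script before invoking `end2end.py`.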
|
|
||
| ## Quick Start | ||
|
|
||
| Run a single sample for basic TTS: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type tts_sft | ||
| ``` | ||
|
|
||
| Run batch samples for basic TTS: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type tts_sft \ | ||
| --num-prompts {batch_size} | ||
| ``` | ||
|
|
||
| When batching multiple prompts, the total number of tokens passed to the next stage may exceed `max_model_len` in `mimo_audio.yaml`. If you raise `max_model_len` to accommodate this, also set `max_position_embeddings` in `MiMo-Audio-7B-Instruct/config.json` to the same value. | |
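After changing `max_model_len`, a quick check that `config.json` carries the same number can catch a forgotten edit. The sketch below is only illustrative: `check_position_embeddings` is a hypothetical helper, and it assumes the key appears in `config.json` as `"max_position_embeddings": <integer>`.

```bash
# Hypothetical helper (an assumption, not part of the repository):
# confirm that config.json holds the max_position_embeddings you expect.
# Assumes the key is written as "max_position_embeddings": <integer>.
check_position_embeddings() {
    cpe_config="$1"
    cpe_expected="$2"
    # Extract the first integer that follows the key.
    cpe_actual="$(sed -n 's/.*"max_position_embeddings"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$cpe_config" | head -n 1)"
    if [ -z "$cpe_actual" ]; then
        echo "max_position_embeddings not found in $cpe_config" >&2
        return 2
    fi
    if [ "$cpe_actual" -ne "$cpe_expected" ]; then
        echo "mismatch: max_position_embeddings=$cpe_actual, expected $cpe_expected"
        return 1
    fi
    echo "ok: max_position_embeddings=$cpe_actual"
}
```

For example, `check_position_embeddings MiMo-Audio-7B-Instruct/config.json 16384` checks against a placeholder value of 16384; substitute the `max_model_len` you actually configured.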
|
|
||
| Generated audio files are saved to `output_audio/` by default. `--num-prompts` can also be used with all of the tasks below. | |
|
|
||
| ## Task Usage | ||
|
|
||
| ### tts_sft (Basic Text-to-Speech) | ||
|
|
||
| Generate speech from text input: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type tts_sft \ | ||
| --text "The weather is so nice today." | ||
| ``` | ||
|
|
||
| ### tts_sft_with_instruct (TTS with Voice Instructions) | ||
|
|
||
| Generate speech with explicit voice style instructions: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type tts_sft_with_instruct \ | ||
| --text "The weather is so nice today." \ | ||
| --instruct "Speak happily in a child's voice" | ||
| ``` | ||
|
|
||
| ### tts_sft_with_audio (TTS with Audio Reference) | ||
|
|
||
| Generate speech using an audio reference for voice cloning: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type tts_sft_with_audio \ | ||
| --text "The weather is so nice today." \ | ||
| --audio-path "./spoken_dialogue_assistant_turn_1.wav" | ||
| ``` | ||
|
|
||
| ### tts_sft_with_natural_instruction (Natural Language TTS) | ||
|
|
||
| Generate speech from text containing natural voice descriptions: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type tts_sft_with_natural_instruction \ | ||
| --text "In a panting young male voice, he said: I can't run anymore, wait for me!" | ||
| ``` | ||
|
|
||
| ### audio_trancribing_sft (Speech-to-Text) | ||
|
|
||
| Transcribe audio to text: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type audio_trancribing_sft \ | ||
| --audio-path "./spoken_dialogue_assistant_turn_1.wav" | ||
| ``` | ||
|
|
||
| ### audio_understanding_sft (Audio Understanding) | ||
|
|
||
| Understand and analyze audio content with text queries: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type audio_understanding_sft \ | ||
| --text "Summarize the audio." \ | ||
| --audio-path "./spoken_dialogue_assistant_turn_1.wav" | ||
| ``` | ||
|
|
||
| ### audio_understanding_sft_with_thinking (Audio Understanding with Reasoning) | ||
|
|
||
| Audio understanding with reasoning chain: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type audio_understanding_sft_with_thinking \ | ||
| --text "Summarize the audio." \ | ||
| --audio-path "./spoken_dialogue_assistant_turn_1.wav" | ||
| ``` | ||
|
|
||
| ### spoken_dialogue_sft_multiturn (Multi-turn Spoken Dialogue) | ||
|
|
||
| Multi-turn dialogue with audio input and output: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type spoken_dialogue_sft_multiturn \ | ||
| --audio-path "./prompt_speech_zh_m.wav" | ||
| ``` | ||
|
|
||
| Note: This task uses hardcoded audio files in the script. The audio files used in examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples | ||
|
|
||
| ### speech2text_dialogue_sft_multiturn (Speech-to-Text Dialogue) | ||
|
|
||
| Multi-turn dialogue converting speech to text: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type speech2text_dialogue_sft_multiturn | ||
| ``` | ||
|
|
||
| Note: This task uses hardcoded audio files and message lists in the script. | ||
|
|
||
| ### text_dialogue_sft_multiturn (Text Dialogue) | ||
|
|
||
| Multi-turn text-only dialogue: | ||
|
|
||
| ```bash | ||
| python3 -u end2end.py \ | ||
| --stage-configs-path vllm_omni/model_executor/stage_configs/mimo_audio.yaml \ | ||
| --model-name XiaomiMiMo/MiMo-Audio-7B-Instruct \ | ||
| --query-type text_dialogue_sft_multiturn | ||
| ``` | ||
|
|
||
| Note: This task uses hardcoded message lists in the script. | ||
|
|
||
| ## Notes | ||
|
|
||
| - The script uses default model paths and audio files embedded in `end2end.py`. Update them if your local cache path differs. | ||
| - Use `--output-dir` to change the output folder (default: `./output_audio`). | ||
| - Use `--num-prompts` to generate multiple prompts in one run (default: 1). | ||
| - Audio files used in multi-turn dialogue examples are available at: https://github.com/XiaomiMiMo/MiMo-Audio/tree/main/examples | ||
| - The script supports various configuration options for initialization timeouts, batch timeouts, and shared memory thresholds. See `--help` for details. | ||
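The two audio understanding tasks above take the same inputs, so they are easy to run back to back on one clip for comparison. The wrapper below is a rough sketch rather than part of the PR: `run_understanding_tasks` and the `DRY_RUN` switch are hypothetical names, and only flags shown in this README are passed to `end2end.py`.

```bash
# Hypothetical wrapper (an assumption, not part of the repository): run
# both audio understanding variants on the same clip and question.
# Set DRY_RUN=1 to print the commands instead of executing them.
STAGE_CONFIG="vllm_omni/model_executor/stage_configs/mimo_audio.yaml"
MODEL_NAME="XiaomiMiMo/MiMo-Audio-7B-Instruct"

run_understanding_tasks() {
    rut_audio="$1"
    rut_question="$2"
    # In dry-run mode, prefix each command with echo so it is only printed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        rut_runner="echo"
    else
        rut_runner=""
    fi
    for rut_task in audio_understanding_sft audio_understanding_sft_with_thinking; do
        $rut_runner python3 -u end2end.py \
            --stage-configs-path "$STAGE_CONFIG" \
            --model-name "$MODEL_NAME" \
            --query-type "$rut_task" \
            --text "$rut_question" \
            --audio-path "$rut_audio" || return 1
    done
}
```

Running it once with `DRY_RUN=1` prints the two commands, which is a cheap way to check the arguments before launching the model.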