
[WIP] [Model] Step-Audio2 #464

Open
wuli666 wants to merge 18 commits into vllm-project:main from wuli666:feature/step-audio2-integration

Conversation

@wuli666

@wuli666 wuli666 commented Dec 24, 2025


Purpose

Resolves #271. Integrates the Step-Audio2 model: https://github.com/stepfun-ai/Step-Audio2

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


@wuli666 wuli666 mentioned this pull request Dec 24, 2025
@hsliuustc0106
Collaborator

@linyueqian PTAL

Contributor

@linyueqian linyueqian left a comment


Please add CI tests.

@wuli666 wuli666 force-pushed the feature/step-audio2-integration branch 3 times, most recently from 18e9df2 to f072adc on December 29, 2025 14:44
@wuli666 wuli666 marked this pull request as ready for review December 29, 2025 16:03

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 304 to 306
generated_speech_tokens: torch.Tensor | list,
prompt_wav: str,
return_bytes: bool = True,


P1: Token2Wav stage fails due to missing prompt_wav argument

The Token2Wav wrapper mandates a prompt_wav argument with no default (step_audio2_token2wav.py lines 304-306), but the Stage 0→1 input processor only forwards audio token IDs and never adds a speaker path, and the stage config uses the generic GPUGenerationWorker. When Stage 1 runs it will call this forward with only the generated tokens, leading to a TypeError: forward() missing required argument 'prompt_wav' before any audio synthesis, so the two-stage pipeline cannot execute.

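For illustration, a minimal sketch of a default-speaker fallback that would avoid this TypeError, in the spirit of the later commit "fix: use default prompt wav for Step-Audio2 token2wav". DEFAULT_PROMPT_WAV and synthesize() are assumed placeholder names, not the actual step_audio2_token2wav.py API.

```python
# Illustrative only: DEFAULT_PROMPT_WAV and synthesize() are assumed names.
DEFAULT_PROMPT_WAV = "assets/default_prompt.wav"  # assumed bundled speaker prompt


class Token2WavSketch:
    def forward(
        self,
        generated_speech_tokens,        # torch.Tensor | list, as in the reviewed lines
        prompt_wav: str | None = None,  # optional now; Stage 1 may omit it
        return_bytes: bool = True,
    ):
        # Stage 0 -> 1 only forwards audio token IDs, so fall back to a
        # default speaker prompt instead of raising a TypeError.
        if prompt_wav is None:
            prompt_wav = DEFAULT_PROMPT_WAV
        return self.synthesize(generated_speech_tokens, prompt_wav, return_bytes)
```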

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


@wuli666
Author

wuli666 commented Jan 4, 2026

> please follow this guideline: https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/model/adding_omni_model/

ok, I'll submit a version as soon as possible.

@hsliuustc0106
Collaborator

I think this PR is almost ready to be tested and merged; let's push it along faster.

@hsliuustc0106
Collaborator

@Bounty-hunter @wuhang2014 PTAL

@wuli666 wuli666 force-pushed the feature/step-audio2-integration branch 2 times, most recently from 4f94598 to a8d982b on January 7, 2026 08:48
@hsliuustc0106
Collaborator

  • fix DCO, docs, and pre-commit issues
  • add hardware usage guidance to your YAML config
  • test that both offline and online serving are supported
  • remove unnecessary comments
  • test the e2e speedup compared with the original stepfun implementation

@david6666666 david6666666 mentioned this pull request Jan 16, 2026
@wuli666 wuli666 force-pushed the feature/step-audio2-integration branch 2 times, most recently from 6b48cd6 to c2cdcb4 on January 19, 2026 14:38
Signed-off-by: wuli666 <421774554@qq.com>
Fixes the following review comments from @linyueqian:

1. Remove duplicate multimodal processor registration in step_audio2_thinker.py
   - Processor is already registered in step_audio2.py

2. Fix hardcoded 100-token placeholder
   - Now dynamically calculates the audio feature length based on audio_lens
   - Formula: (audio_len - 1) // 8 + 1 (after encoder + adapter processing); see the sketch after this commit message

3. Change kwargs.pop to kwargs.get in _parse_and_validate_audio_input
   - Avoids modifying original kwargs dict

4. Replace hardcoded .cuda() calls with configurable device
   - Added device parameter to StepAudio2Token2WavCore
   - Device is now obtained from vllm_config.device_config
   - All .cuda() calls replaced with .to(self.device)
   - torch.amp.autocast now uses dynamic device type

5. Remove redundant logger initialization in StepAudio2Token2WavForConditionalGenerationVLLM
   - Module-level logger already defined

Signed-off-by: wuli666 <421774554@qq.com>
Signed-off-by: wuli666 <421774554@qq.com>
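As a side note on item 2 above, a minimal sketch of the dynamic placeholder-length calculation; audio_feature_len is a hypothetical helper name, and only the formula itself comes from the commit message.

```python
def audio_feature_len(audio_len: int) -> int:
    # Encoder + adapter reduce the raw audio length by 8x, rounding up,
    # per the formula in the commit message.
    return (audio_len - 1) // 8 + 1


# The old hardcoded value of 100 placeholder tokens only matches a narrow
# range of input lengths; longer audio now gets a longer placeholder.
assert audio_feature_len(800) == 100
assert audio_feature_len(801) == 101
```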

 fix: use default prompt wav for Step-Audio2 token2wav

Signed-off-by: wuli666 <421774554@qq.com>
- Add online serving examples for step_audio2 model
- Refactor attention to use F.scaled_dot_product_attention for better performance (a minimal usage example follows below)
- Remove redundant comments and clean up code

Signed-off-by: wuli666 <421774554@qq.com>
Signed-off-by: wuli666 <421774554@qq.com>
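For reference, a minimal example of the fused attention call mentioned in the commit above; the (batch, num_heads, seq_len, head_dim) shapes are invented for illustration and are not taken from the model code.

```python
import torch
import torch.nn.functional as F

# Toy tensors in (batch, num_heads, seq_len, head_dim) layout.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)

# One fused call replaces the manual softmax(q @ k^T / sqrt(d)) @ v pattern
# and can dispatch to flash/memory-efficient kernels where available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```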
@wuli666 wuli666 force-pushed the feature/step-audio2-integration branch from c2cdcb4 to 01aa912 on January 19, 2026 14:39
wuli666 and others added 7 commits January 19, 2026 22:42
@hsliuustc0106
Collaborator

can async_chunk help accelerate this model as well? #962

@wuli666
Author

wuli666 commented Feb 4, 2026

> can async_chunk help accelerate this model as well? #962

I think it should help here too; we may just want to split it into two PRs.

