
[WIP] [Model] Step-Audio2 #464

Open
wuli666 wants to merge 18 commits into vllm-project:main from wuli666:feature/step-audio2-integration

Conversation

@wuli666

@wuli666 wuli666 commented Dec 24, 2025


Purpose

Resolves #271. Integrates the Step-Audio2 model: https://github.com/stepfun-ai/Step-Audio2

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


@wuli666 wuli666 mentioned this pull request Dec 24, 2025
@hsliuustc0106
Collaborator

@linyueqian PTAL

Contributor

@linyueqian linyueqian left a comment


Please add CI tests.

@wuli666 wuli666 force-pushed the feature/step-audio2-integration branch 3 times, most recently from 18e9df2 to f072adc on December 29, 2025 14:44
@wuli666 wuli666 marked this pull request as ready for review December 29, 2025 16:03

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 304 to 306
generated_speech_tokens: torch.Tensor | list,
prompt_wav: str,
return_bytes: bool = True,


P1: Token2Wav stage fails due to missing prompt_wav argument

The Token2Wav wrapper mandates a prompt_wav argument with no default (step_audio2_token2wav.py lines 304-306), but the Stage 0→1 input processor only forwards audio token IDs and never adds a speaker path, and the stage config uses the generic GPUGenerationWorker. When Stage 1 runs it will call this forward with only the generated tokens, leading to a TypeError: forward() missing required argument 'prompt_wav' before any audio synthesis, so the two-stage pipeline cannot execute.

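For illustration, a minimal sketch of a default-speaker fallback that would avoid this TypeError, in the spirit of the later commit "fix: use default prompt wav for Step-Audio2 token2wav". DEFAULT_PROMPT_WAV and synthesize() are assumed placeholder names, not the actual step_audio2_token2wav.py API.

```python
# Illustrative only: DEFAULT_PROMPT_WAV and synthesize() are assumed names.
DEFAULT_PROMPT_WAV = "assets/default_prompt.wav"  # assumed bundled speaker prompt


class Token2WavSketch:
    def forward(
        self,
        generated_speech_tokens,        # torch.Tensor | list, as in the reviewed lines
        prompt_wav: str | None = None,  # optional now; Stage 1 may omit it
        return_bytes: bool = True,
    ):
        # Stage 0 -> 1 only forwards audio token IDs, so fall back to a
        # default speaker prompt instead of raising a TypeError.
        if prompt_wav is None:
            prompt_wav = DEFAULT_PROMPT_WAV
        return self.synthesize(generated_speech_tokens, prompt_wav, return_bytes)
```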

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


@wuli666
Author

wuli666 commented Jan 4, 2026

> please follow this guideline: https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/model/adding_omni_model/

ok, I'll submit a version as soon as possible.

@hsliuustc0106
Collaborator

I think this PR is almost ready to be tested and merged; let's push it along faster.

@hsliuustc0106
Collaborator

@Bounty-hunter @wuhang2014 PTAL

@wuli666 wuli666 force-pushed the feature/step-audio2-integration branch 2 times, most recently from 4f94598 to a8d982b on January 7, 2026 08:48
@hsliuustc0106
Collaborator

  • fix DCO, docs, and pre-commit issues
  • add hardware usage guidance to your YAML config
  • test that both offline and online serving are supported
  • remove unnecessary comments
  • test the e2e speedup compared with the original stepfun implementation

@david6666666 david6666666 mentioned this pull request Jan 16, 2026
@wuli666 wuli666 force-pushed the feature/step-audio2-integration branch 2 times, most recently from 6b48cd6 to c2cdcb4 on January 19, 2026 14:38
Signed-off-by: wuli666 <421774554@qq.com>
Fixes the following review comments from @linyueqian:

1. Remove duplicate multimodal processor registration in step_audio2_thinker.py
   - Processor is already registered in step_audio2.py

2. Fix hardcoded 100-token placeholder
   - Now dynamically calculates the audio feature length based on audio_lens
   - Formula: (audio_len - 1) // 8 + 1 (after encoder + adapter processing); see the sketch after this commit message

3. Change kwargs.pop to kwargs.get in _parse_and_validate_audio_input
   - Avoids modifying original kwargs dict

4. Replace hardcoded .cuda() calls with configurable device
   - Added device parameter to StepAudio2Token2WavCore
   - Device is now obtained from vllm_config.device_config
   - All .cuda() calls replaced with .to(self.device)
   - torch.amp.autocast now uses dynamic device type

5. Remove redundant logger initialization in StepAudio2Token2WavForConditionalGenerationVLLM
   - Module-level logger already defined

Signed-off-by: wuli666 <421774554@qq.com>
Signed-off-by: wuli666 <421774554@qq.com>
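As a side note on item 2 above, a minimal sketch of the dynamic placeholder-length calculation; audio_feature_len is a hypothetical helper name, and only the formula itself comes from the commit message.

```python
def audio_feature_len(audio_len: int) -> int:
    # Encoder + adapter reduce the raw audio length by 8x, rounding up,
    # per the formula in the commit message.
    return (audio_len - 1) // 8 + 1


# The old hardcoded value of 100 placeholder tokens only matches a narrow
# range of input lengths; longer audio now gets a longer placeholder.
assert audio_feature_len(800) == 100
assert audio_feature_len(801) == 101
```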

 fix: use default prompt wav for Step-Audio2 token2wav

Signed-off-by: wuli666 <421774554@qq.com>
- Add online serving examples for step_audio2 model
- Refactor attention to use F.scaled_dot_product_attention for better performance (a minimal usage example follows below)
- Remove redundant comments and clean up code

Signed-off-by: wuli666 <421774554@qq.com>
Signed-off-by: wuli666 <421774554@qq.com>
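For reference, a minimal example of the fused attention call mentioned in the commit above; the (batch, num_heads, seq_len, head_dim) shapes are invented for illustration and are not taken from the model code.

```python
import torch
import torch.nn.functional as F

# Toy tensors in (batch, num_heads, seq_len, head_dim) layout.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)

# One fused call replaces the manual softmax(q @ k^T / sqrt(d)) @ v pattern
# and can dispatch to flash/memory-efficient kernels where available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```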
@wuli666 wuli666 force-pushed the feature/step-audio2-integration branch from c2cdcb4 to 01aa912 on January 19, 2026 14:39
wuli666 and others added 7 commits January 19, 2026 22:42
@hsliuustc0106
Collaborator

can async_chunk help accelerate this model as well? #962

@wuli666
Author

wuli666 commented Feb 4, 2026

> can async_chunk help accelerate this model as well? #962

I think it should help here too; we may just want to split it into two PRs.

