Conversation
@linyueqian PTAL
(Resolved review threads on step_audio2_thinker.py ×3 and step_audio2_token2wav.py ×2.)
Force-pushed from 18e9df2 to f072adc
💡 Codex Review
Here are some automated review suggestions for this pull request.
generated_speech_tokens: torch.Tensor | list,
prompt_wav: str,
return_bytes: bool = True,
Token2Wav stage fails due to missing prompt_wav argument
The Token2Wav wrapper mandates a prompt_wav argument with no default (step_audio2_token2wav.py lines 304-306), but the Stage 0→1 input processor only forwards audio token IDs and never adds a speaker path, and the stage config uses the generic GPUGenerationWorker. When Stage 1 runs, it will call this forward with only the generated tokens, raising TypeError: forward() missing required argument 'prompt_wav' before any audio synthesis, so the two-stage pipeline cannot execute.
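For reference, a minimal sketch of the direction the fix later took (see the "use default prompt wav" commit below): give prompt_wav a default so Stage 1 can run when the input processor forwards only token IDs. DEFAULT_PROMPT_WAV and the exact fallback handling are assumptions for illustration, not the PR's actual code.

```python
import torch

# Hypothetical default speaker prompt; the real path/constant may differ.
DEFAULT_PROMPT_WAV = "assets/default_speaker.wav"

def forward(
    self,
    generated_speech_tokens: torch.Tensor | list,
    prompt_wav: str | None = None,  # was required with no default
    return_bytes: bool = True,
):
    # Fall back to a packaged speaker prompt when Stage 0->1 passes none.
    if prompt_wav is None:
        prompt_wav = DEFAULT_PROMPT_WAV
    ...
```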
hsliuustc0106 left a comment
please follow this guideline:
https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/model/adding_omni_model/
ok, I'll submit a version as soon as possible.
I think this PR is almost ready to be tested and then merged; let's push it forward faster.
(Resolved review threads on step_audio2_processor.py, step_audio2_token2wav.py, and step_audio2_thinker.py ×2.)
Force-pushed from 4f94598 to a8d982b
(Resolved review threads on step_audio2_token2wav.py ×2 and step_audio2_thinker.py.)
Force-pushed from 6b48cd6 to c2cdcb4
Signed-off-by: wuli666 <421774554@qq.com>
Fixes the following review comments from @linyueqian:
1. Remove duplicate multimodal processor registration in step_audio2_thinker.py: the processor is already registered in step_audio2.py.
2. Fix the hardcoded 100-token placeholder: the audio feature length is now calculated dynamically from audio_lens with the formula (audio_len - 1) // 8 + 1 (after encoder + adapter processing).
3. Change kwargs.pop to kwargs.get in _parse_and_validate_audio_input to avoid modifying the original kwargs dict.
4. Replace hardcoded .cuda() calls with a configurable device: a device parameter was added to StepAudio2Token2WavCore, obtained from vllm_config.device_config; all .cuda() calls are replaced with .to(self.device), and torch.amp.autocast now uses the dynamic device type.
5. Remove redundant logger initialization in StepAudio2Token2WavForConditionalGenerationVLLM: a module-level logger is already defined.
Signed-off-by: wuli666 <421774554@qq.com>
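The placeholder change in item 2 above is the easiest to get wrong, so here is a small self-contained sketch of the calculation; the function name is illustrative, not the actual code. The formula is a ceiling division: after the encoder + adapter, every 8 input frames map to one audio feature.

```python
def num_audio_placeholder_tokens(audio_len: int) -> int:
    """Dynamic placeholder count replacing the hardcoded 100 tokens.

    Equivalent to ceil(audio_len / 8) for audio_len >= 1.
    """
    return (audio_len - 1) // 8 + 1

# Quick sanity checks of the boundary behavior.
assert num_audio_placeholder_tokens(1) == 1
assert num_audio_placeholder_tokens(8) == 1
assert num_audio_placeholder_tokens(9) == 2
assert num_audio_placeholder_tokens(800) == 100  # old hardcoded value matches only ~800-frame audio
```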
Signed-off-by: wuli666 <421774554@qq.com>
fix: use default prompt wav for Step-Audio2 token2wav
Signed-off-by: wuli666 <421774554@qq.com>
- Add online serving examples for step_audio2 model
- Refactor attention to use F.scaled_dot_product_attention for better performance
- Remove redundant comments and clean up code
Signed-off-by: wuli666 <421774554@qq.com>
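For context on the second bullet, a rough before/after sketch of what moving to F.scaled_dot_product_attention typically looks like; the tensor shapes and the absence of masking here are assumptions for illustration, not the model's exact attention code.

```python
import torch
import torch.nn.functional as F

# q, k, v shaped (batch, num_heads, seq_len, head_dim).
q, k, v = (torch.randn(1, 8, 16, 64) for _ in range(3))

# The manual pattern being replaced: explicit scale, matmul, softmax, matmul.
scale = q.shape[-1] ** -0.5
out_manual = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v

# The fused call, which can dispatch to FlashAttention / memory-efficient
# kernels under the hood; this is where the performance win comes from.
out_sdpa = F.scaled_dot_product_attention(q, k, v)

torch.testing.assert_close(out_manual, out_sdpa, atol=1e-5, rtol=1e-5)
```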
Force-pushed from c2cdcb4 to 01aa912
Signed-off-by: wuli666 <421774554@qq.com>
…chmarks
Signed-off-by: wuli666 <421774554@qq.com>
Signed-off-by: wuli666 <421774554@qq.com>
Signed-off-by: wuli666 <421774554@qq.com>
Signed-off-by: wuli666 <49897769+wuli666@users.noreply.github.com>
Can async_chunk help accelerate this model as well? #962
I think it should help here too; we may just want to split it into two PRs.
Signed-off-by: wuli666 <421774554@qq.com>
Purpose
Resolves #271. Integrates the Step-Audio2 model: https://github.com/stepfun-ai/Step-Audio2
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.