
[Frontend] Add speaker_embedding passthrough to /v1/audio/speech API#1227

Open
marksverdhei wants to merge 4 commits into vllm-project:main from marksverdhei:feat/speaker-embedding-passthrough

Conversation


@marksverdhei marksverdhei commented Feb 5, 2026

Summary

  • Add speaker_embedding field to OpenAICreateSpeechRequest — accepts a pre-computed 1024-dim float vector that bypasses the ECAPA-TDNN speaker encoder extraction step
  • Validate mutual exclusivity with ref_audio, enforce Base task requirement, auto-set x_vector_only_mode=True
  • Handle embedding passthrough in generate_voice_clone() by constructing VoiceClonePromptItem directly from the tensor
  • Add --speaker-embedding CLI flag to the example client (accepts JSON file path)
  • Add speaker_embedding_interpolation.py example script with offline ECAPA-TDNN extraction, SLERP interpolation, and API integration

This enables embedding interpolation (SLERP/LERP) between voices, embedding caching, and programmatic voice manipulation without requiring reference audio at inference time.
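The interpolation mentioned above can be sketched as a standalone function; this is a minimal SLERP over speaker-embedding vectors under the PR's stated use case (the function name and the near-parallel fallback threshold are my own, not the PR's code):

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two speaker embeddings.

    t=0.0 returns a, t=1.0 returns b; intermediate values follow the
    great-circle arc between the two directions.
    """
    a_n = a / np.linalg.norm(a)
    b_n = b / np.linalg.norm(b)
    dot = np.clip(np.dot(a_n, b_n), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < 1e-6:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return (1.0 - t) * a + t * b
    sin_theta = np.sin(theta)
    return (np.sin((1.0 - t) * theta) / sin_theta) * a + \
           (np.sin(t * theta) / sin_theta) * b
```

For unit-norm inputs, the SLERP midpoint stays on the unit sphere, unlike LERP, which shortens the vector; that is the usual reason to prefer SLERP for normalized speaker embeddings.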

Test plan

  • Verify speaker_embedding + ref_audio together returns validation error
  • Verify speaker_embedding without task_type=Base returns validation error
  • Verify Base task with only speaker_embedding (no ref_audio) generates audio successfully
  • Extract embeddings from two reference voices, SLERP at t=0.5, confirm blended output
  • Compare pure embedding outputs (t=0.0, t=1.0) against direct ref_audio outputs for consistency
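A request exercising the new field might be built as below; the `input`, `task_type`, and `speaker_embedding` field names come from this PR, while the model id is a placeholder:

```python
def build_speech_payload(text: str, embedding: list[float]) -> dict:
    """Build a JSON body for POST /v1/audio/speech using a pre-computed
    speaker embedding instead of ref_audio (the two are mutually
    exclusive per this PR's validation)."""
    return {
        "model": "qwen3-tts",  # placeholder; use the served model name
        "input": text,
        "task_type": "Base",  # speaker_embedding requires the Base task
        "speaker_embedding": embedding,
    }
```

The body would then be sent with an ordinary HTTP POST, e.g. `requests.post(f"{base_url}/v1/audio/speech", json=payload)`.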

🤖 Generated with Claude Code


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e686677cd


@marksverdhei force-pushed the feat/speaker-embedding-passthrough branch from 6cdea6e to 3bc6f0b on February 5, 2026 at 18:03
@marksverdhei changed the title from "feat: add speaker_embedding passthrough to /v1/audio/speech API" to "[Frontend] Add speaker_embedding passthrough to /v1/audio/speech API" on Feb 5, 2026
marksverdhei and others added 4 commits February 5, 2026 19:22
Allow users to pass a pre-computed 1024-dim speaker embedding vector
directly to the speech endpoint, bypassing ECAPA-TDNN extraction from
reference audio. This enables embedding interpolation (SLERP/LERP)
between voices, embedding caching, and programmatic voice manipulation.

- Add speaker_embedding field to OpenAICreateSpeechRequest
- Validate mutual exclusivity with ref_audio, Base task requirement
- Auto-set x_vector_only_mode=True when speaker_embedding is provided
- Handle embedding in generate_voice_clone() to construct prompt directly
- Add --speaker-embedding flag to example client
- Add speaker_embedding_interpolation.py example with SLERP demo

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
The inline audio extraction logic only checked two locations for
multimodal output. Refactor into _extract_audio_output() which also
checks output.request_output.outputs[i].multimodal_output
(CompletionOutput level, set via setattr by the output processor) and
normalises the "model_outputs" key to "audio" for consistent access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
- Remove hardcoded float32 dtype from speaker embedding tensor creation,
  letting downstream .to(self.talker.dtype) handle conversion (P1)
- Add length validation for speaker_embedding (64-8192 range) to catch
  malformed vectors before they reach model execution (P2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
@linyueqian
Contributor

Heads up — PR #1201 adds a voice upload API (POST /v1/audio/voices) and modifies the same validation/param-building code in serving_speech.py. To avoid conflicts, could you wait for #1201 to merge first, then rebase this PR on top?

After rebasing you'd need to:

  • Unify the Base task validation to handle all three voice sources (ref_audio, speaker_embedding, and uploaded voices)
  • Align _build_tts_params with the uploaded-voice handling added in #1201 (feat(tts): add voice upload API for Qwen3-TTS)
  • Consider also accepting speaker_embedding in POST /v1/audio/voices for registering embeddings as named voices
