
[Frontend] Add speaker_embedding passthrough to /v1/audio/speech API#1227

Open
marksverdhei wants to merge 4 commits into vllm-project:main from marksverdhei:feat/speaker-embedding-passthrough

Conversation


@marksverdhei marksverdhei commented Feb 5, 2026

Summary

  • Add speaker_embedding field to OpenAICreateSpeechRequest — accepts a pre-computed 1024-dim float vector that bypasses the ECAPA-TDNN speaker encoder extraction step
  • Validate mutual exclusivity with ref_audio, enforce Base task requirement, auto-set x_vector_only_mode=True
  • Handle embedding passthrough in generate_voice_clone() by constructing VoiceClonePromptItem directly from the tensor
  • Add --speaker-embedding CLI flag to the example client (accepts JSON file path)
  • Add speaker_embedding_interpolation.py example script with offline ECAPA-TDNN extraction, SLERP interpolation, and API integration

This enables embedding interpolation (SLERP/LERP) between voices, embedding caching, and programmatic voice manipulation without requiring reference audio at inference time.
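The interpolation mentioned above can be sketched as a standalone function; this is a minimal SLERP over speaker-embedding vectors under the PR's stated use case (the function name and the near-parallel fallback threshold are my own, not the PR's code):

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two speaker embeddings.

    t=0.0 returns a, t=1.0 returns b; intermediate values follow the
    great-circle arc between the two directions.
    """
    a_n = a / np.linalg.norm(a)
    b_n = b / np.linalg.norm(b)
    dot = np.clip(np.dot(a_n, b_n), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < 1e-6:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return (1.0 - t) * a + t * b
    sin_theta = np.sin(theta)
    return (np.sin((1.0 - t) * theta) / sin_theta) * a + \
           (np.sin(t * theta) / sin_theta) * b
```

For unit-norm inputs, the SLERP midpoint stays on the unit sphere, unlike LERP, which shortens the vector; that is the usual reason to prefer SLERP for normalized speaker embeddings.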

Test plan

  • Verify speaker_embedding + ref_audio together returns validation error
  • Verify speaker_embedding without task_type=Base returns validation error
  • Verify Base task with only speaker_embedding (no ref_audio) generates audio successfully
  • Extract embeddings from two reference voices, SLERP at t=0.5, confirm blended output
  • Compare pure embedding outputs (t=0.0, t=1.0) against direct ref_audio outputs for consistency
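A request exercising the new field might be built as below; the `input`, `task_type`, and `speaker_embedding` field names come from this PR, while the model id is a placeholder:

```python
def build_speech_payload(text: str, embedding: list[float]) -> dict:
    """Build a JSON body for POST /v1/audio/speech using a pre-computed
    speaker embedding instead of ref_audio (the two are mutually
    exclusive per this PR's validation)."""
    return {
        "model": "qwen3-tts",  # placeholder; use the served model name
        "input": text,
        "task_type": "Base",  # speaker_embedding requires the Base task
        "speaker_embedding": embedding,
    }
```

The body would then be sent with an ordinary HTTP POST, e.g. `requests.post(f"{base_url}/v1/audio/speech", json=payload)`.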

🤖 Generated with Claude Code


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e686677cd


@marksverdhei force-pushed the feat/speaker-embedding-passthrough branch from 6cdea6e to 3bc6f0b on February 5, 2026 at 18:03
@marksverdhei changed the title from "feat: add speaker_embedding passthrough to /v1/audio/speech API" to "[Frontend] Add speaker_embedding passthrough to /v1/audio/speech API" on Feb 5, 2026
marksverdhei and others added 4 commits February 5, 2026 19:22
Allow users to pass a pre-computed 1024-dim speaker embedding vector
directly to the speech endpoint, bypassing ECAPA-TDNN extraction from
reference audio. This enables embedding interpolation (SLERP/LERP)
between voices, embedding caching, and programmatic voice manipulation.

- Add speaker_embedding field to OpenAICreateSpeechRequest
- Validate mutual exclusivity with ref_audio, Base task requirement
- Auto-set x_vector_only_mode=True when speaker_embedding is provided
- Handle embedding in generate_voice_clone() to construct prompt directly
- Add --speaker-embedding flag to example client
- Add speaker_embedding_interpolation.py example with SLERP demo

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
The inline audio extraction logic only checked two locations for
multimodal output. Refactor into _extract_audio_output() which also
checks output.request_output.outputs[i].multimodal_output
(CompletionOutput level, set via setattr by the output processor) and
normalises the "model_outputs" key to "audio" for consistent access.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
- Remove hardcoded float32 dtype from speaker embedding tensor creation,
  letting downstream .to(self.talker.dtype) handle conversion (P1)
- Add length validation for speaker_embedding (64-8192 range) to catch
  malformed vectors before they reach model execution (P2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: marksverdhei <marksverdhei@hotmail.com>
@linyueqian
Contributor

Heads up — PR #1201 adds a voice upload API (POST /v1/audio/voices) and modifies the same validation/param-building code in serving_speech.py. To avoid conflicts, could you wait for #1201 to merge first, then rebase this PR on top?

After rebasing you'd need to:

  • Unify the Base task validation to handle all three voice sources (ref_audio, speaker_embedding, and uploaded voices)
  • Align _build_tts_params with the uploaded-voice handling added in #1201 (feat(tts): add voice upload API for Qwen3-TTS)
  • Consider also accepting speaker_embedding in POST /v1/audio/voices for registering embeddings as named voices
