feat(tts): add voice upload API for Qwen3-TTS#1201
feat(tts): add voice upload API for Qwen3-TTS#1201zhaotyer wants to merge 5 commits intovllm-project:mainfrom
Conversation
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e405d2eff
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # Validate Base task requirements | ||
| if task_type == "Base": | ||
| if task_type == "Base" and request.voice is None: | ||
| if request.ref_audio is None: | ||
| return "Base task requires 'ref_audio' for voice cloning" |
There was a problem hiding this comment.
Require ref_audio for Base when voice isn't uploaded
The new Base-task validation only enforces ref_audio when voice is missing, so a request like task_type=Base with a built-in speaker name but no ref_audio now passes validation. In that case _build_tts_params will send no ref_audio to the model (because the auto-fill only happens for uploaded voices), which breaks the Base task’s voice-cloning requirement and likely yields a model error or incorrect output. Consider requiring ref_audio unless voice refers to an uploaded speaker that will be auto-populated.
Useful? React with 👍 / 👎.
| # Generate filename | ||
| timestamp = int(time.time()) | ||
| file_ext = audio_file.filename.split('.')[-1] if '.' in audio_file.filename else "wav" | ||
| filename = f"{name}_{consent}_{timestamp}.{file_ext}" | ||
| file_path = self.uploaded_speakers_dir / filename | ||
|
|
There was a problem hiding this comment.
Prevent path traversal in uploaded voice filename
The upload endpoint builds filename directly from untrusted name and consent and then writes file_path = self.uploaded_speakers_dir / filename. If either field contains path separators or .., the resulting path can escape /tmp/voice_samples and overwrite arbitrary files on the host. This is a security issue that can be triggered by a client POSTing a crafted name/consent. Sanitize these inputs (e.g., allowlist safe characters) or normalize and validate that the resolved path stays within the upload directory.
Useful? React with 👍 / 👎.
|
A few thoughts: (1) Consider supporting pre-extracted embedding uploads (.pt/.npy) in addition to audio files to skip extraction overhead at inference time. (2) The /tmp/voice_samples storage is volatile. Maybe make this path configurable or document this limitation. (3) Missing a DELETE endpoint to remove uploaded voices. |
|
please also update the docs as well in apiserver |
we probably need to merge #1206 first |
There was a problem hiding this comment.
Pull request overview
This pull request adds voice upload functionality for Qwen3-TTS, allowing users to upload custom voice samples for voice cloning. The implementation adds new API endpoints for uploading and listing voice samples, along with automatic integration into the TTS workflow.
Changes:
- Added POST /v1/audio/voices endpoint for uploading custom voice samples (max 10MB)
- Modified GET /v1/audio/voices endpoint to return both built-in and uploaded voices
- Implemented auto-set behavior that automatically uses uploaded voice audio for Base task TTS requests
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 20 comments.
| File | Description |
|---|---|
| vllm_omni/entrypoints/openai/serving_speech.py | Core voice upload logic including file storage, metadata management, and auto-set ref_audio behavior for uploaded voices |
| vllm_omni/entrypoints/openai/api_server.py | API endpoint definitions for voice upload and enhanced voice listing with uploaded voice details |
| examples/online_serving/qwen3_tts/README.md | Documentation for new voice management endpoints with usage examples |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| # Validate Base task requirements | ||
| if task_type == "Base": | ||
| if task_type == "Base" and request.voice is None: |
There was a problem hiding this comment.
The validation doesn't check if an uploaded voice file actually exists when using Base task with an uploaded voice. If task_type is "Base" and voice is an uploaded voice name, but the audio file is missing or unreadable, the auto-set logic at lines 320-325 will silently fail (returning None from _get_uploaded_audio_data), and the Base task will proceed without ref_audio, potentially causing downstream errors. Consider adding validation to ensure uploaded voices have accessible audio files, especially for Base task.
| if task_type == "Base" and request.voice is None: | |
| if task_type == "Base": | |
| # Base task always requires explicit ref_audio to avoid relying on | |
| # potentially failing auto-set logic from uploaded voices. |
| def __init__(self, *args, **kwargs): | ||
| super().__init__(*args, **kwargs) | ||
| # Initialize uploaded speakers storage | ||
| self.uploaded_speakers_dir = Path("/tmp/voice_samples") |
There was a problem hiding this comment.
Using a hardcoded path '/tmp/voice_samples' poses several issues:
- Security: Multiple users/deployments on the same system will share this directory
- Persistence: Files in /tmp may be deleted by system cleanup processes
- Portability: This path may not work on all operating systems (e.g., Windows)
Consider using a configurable directory path that can be set via environment variable or configuration parameter, and ensure proper isolation for multi-tenant scenarios.
| self.uploaded_speakers_dir = Path("/tmp/voice_samples") | |
| base_dir_env = os.getenv("VLLM_OMNI_VOICE_SAMPLES_DIR") | |
| if base_dir_env: | |
| self.uploaded_speakers_dir = Path(base_dir_env) | |
| else: | |
| # Use a portable, user-specific cache directory by default | |
| xdg_cache_home = os.getenv("XDG_CACHE_HOME") | |
| if xdg_cache_home: | |
| cache_base = Path(xdg_cache_home) | |
| else: | |
| cache_base = Path.home() / ".cache" | |
| self.uploaded_speakers_dir = cache_base / "vllm_omni" / "voice_samples" |
|
|
||
| # Save audio file | ||
| try: | ||
| with open(file_path, 'wb') as f: | ||
| content = await audio_file.read() | ||
| f.write(content) | ||
| except Exception as e: | ||
| raise ValueError(f"Failed to save audio file: {e}") | ||
|
|
||
| # Update metadata | ||
| self.uploaded_speakers[voice_name_lower] = { | ||
| "name": name, | ||
| "consent": consent, | ||
| "file_path": str(file_path), | ||
| "created_at": timestamp, | ||
| "mime_type": mime_type, | ||
| "original_filename": audio_file.filename, | ||
| "file_size": file_size | ||
| } | ||
|
|
||
| # Update supported speakers | ||
| self.supported_speakers.add(voice_name_lower) | ||
|
|
||
| # Save metadata | ||
| self._save_uploaded_speakers() | ||
|
|
||
| logger.info(f"Uploaded new voice '{name}' with consent ID '{consent}'") | ||
|
|
||
| return { | ||
| "name": name, | ||
| "consent": consent, | ||
| "file_path": str(file_path), | ||
| "created_at": timestamp, | ||
| "mime_type": mime_type, | ||
| "file_size": file_size |
There was a problem hiding this comment.
There's a potential race condition: if the file is successfully written but saving metadata fails, the uploaded file becomes orphaned. Consider using a transaction-like pattern where you first save the file with a temporary name, then update metadata, and only rename to final name if both succeed. Also consider cleanup of orphaned files on initialization.
| # Save audio file | |
| try: | |
| with open(file_path, 'wb') as f: | |
| content = await audio_file.read() | |
| f.write(content) | |
| except Exception as e: | |
| raise ValueError(f"Failed to save audio file: {e}") | |
| # Update metadata | |
| self.uploaded_speakers[voice_name_lower] = { | |
| "name": name, | |
| "consent": consent, | |
| "file_path": str(file_path), | |
| "created_at": timestamp, | |
| "mime_type": mime_type, | |
| "original_filename": audio_file.filename, | |
| "file_size": file_size | |
| } | |
| # Update supported speakers | |
| self.supported_speakers.add(voice_name_lower) | |
| # Save metadata | |
| self._save_uploaded_speakers() | |
| logger.info(f"Uploaded new voice '{name}' with consent ID '{consent}'") | |
| return { | |
| "name": name, | |
| "consent": consent, | |
| "file_path": str(file_path), | |
| "created_at": timestamp, | |
| "mime_type": mime_type, | |
| "file_size": file_size | |
| temp_file_path = self.uploaded_speakers_dir / f"{filename}.tmp" | |
| # Save audio file to a temporary path first to avoid orphaned files | |
| try: | |
| content = await audio_file.read() | |
| with open(temp_file_path, "wb") as f: | |
| f.write(content) | |
| # Update metadata in memory | |
| self.uploaded_speakers[voice_name_lower] = { | |
| "name": name, | |
| "consent": consent, | |
| "file_path": str(file_path), | |
| "created_at": timestamp, | |
| "mime_type": mime_type, | |
| "original_filename": audio_file.filename, | |
| "file_size": file_size, | |
| } | |
| # Update supported speakers | |
| self.supported_speakers.add(voice_name_lower) | |
| # Persist metadata | |
| self._save_uploaded_speakers() | |
| # Atomically move the temp file to its final location | |
| os.replace(temp_file_path, file_path) | |
| except Exception as e: | |
| # Clean up temp file and roll back in-memory state on failure | |
| try: | |
| if isinstance(temp_file_path, Path): | |
| if temp_file_path.exists(): | |
| temp_file_path.unlink() | |
| else: | |
| if os.path.exists(temp_file_path): | |
| os.remove(temp_file_path) | |
| except Exception: | |
| # Best-effort cleanup; ignore secondary errors | |
| pass | |
| # Roll back any partially updated metadata | |
| if hasattr(self, "uploaded_speakers"): | |
| self.uploaded_speakers.pop(voice_name_lower, None) | |
| if hasattr(self, "supported_speakers"): | |
| try: | |
| self.supported_speakers.discard(voice_name_lower) | |
| except AttributeError: | |
| # In case supported_speakers is not a set-like object | |
| try: | |
| self.supported_speakers.remove(voice_name_lower) | |
| except Exception: | |
| pass | |
| raise ValueError(f"Failed to upload voice: {e}") | |
| logger.info(f"Uploaded new voice '{name}' with consent ID '{consent}'") | |
| return { | |
| "name": name, | |
| "consent": consent, | |
| "file_path": str(file_path), | |
| "created_at": timestamp, | |
| "mime_type": mime_type, | |
| "file_size": file_size, |
| def _save_uploaded_speakers(self) -> None: | ||
| """Save uploaded speakers to metadata file.""" | ||
| try: | ||
| metadata = {"uploaded_speakers": self.uploaded_speakers} | ||
| with open(self.metadata_file, 'w') as f: | ||
| json.dump(metadata, f, indent=2) | ||
| except Exception as e: | ||
| logger.error(f"Could not save uploaded speakers metadata: {e}") |
There was a problem hiding this comment.
The metadata.json file could grow unbounded as users upload more voices. There's no mechanism to limit the number of uploaded voices or to delete old voices. Consider implementing:
- A maximum number of uploaded voices per instance
- An API endpoint to delete uploaded voices
- A cleanup mechanism for old/unused voices
| def _save_uploaded_speakers(self) -> None: | ||
| """Save uploaded speakers to metadata file.""" | ||
| try: | ||
| metadata = {"uploaded_speakers": self.uploaded_speakers} | ||
| with open(self.metadata_file, 'w') as f: | ||
| json.dump(metadata, f, indent=2) | ||
| except Exception as e: | ||
| logger.error(f"Could not save uploaded speakers metadata: {e}") |
There was a problem hiding this comment.
The metadata file is not protected by any locking mechanism. In a multi-process or multi-threaded environment, concurrent uploads could lead to race conditions where:
- Two processes read the same metadata
- Both add their voice
- One overwrites the other's changes when saving
Consider using file locking (e.g., fcntl on Unix, msvcrt on Windows) or a database for thread-safe metadata storage.
| @@ -1,7 +1,11 @@ | |||
| import asyncio | |||
| import json | |||
| import os | |||
There was a problem hiding this comment.
The 'os' module is imported but never used in the code. This import should be removed to keep the codebase clean.
| import os |
| "voice": { | ||
| "name": "custom_voice_1", | ||
| "consent": "user_consent_id", | ||
| "file_path": "/tmp/voice_samples/custom_voice_1_user_consent_id_1738660000.wav", |
There was a problem hiding this comment.
The documentation exposes the internal file path '/tmp/voice_samples/' in the response example. This is a potential information disclosure issue as it reveals the server's internal directory structure. Consider either:
- Not returning the file_path in the API response
- Sanitizing the path to not reveal absolute server paths
- Returning a relative or opaque identifier instead
| "file_path": "/tmp/voice_samples/custom_voice_1_user_consent_id_1738660000.wav", | |
| "file_path": "custom_voice_1_user_consent_id_1738660000.wav", |
|
|
||
| # Generate filename | ||
| timestamp = int(time.time()) | ||
| file_ext = audio_file.filename.split('.')[-1] if '.' in audio_file.filename else "wav" |
There was a problem hiding this comment.
The file extension extraction logic is fragile. If the filename has no extension or multiple dots (e.g., 'my.voice.sample.wav'), splitting by '.' and taking the last element works, but if there's no dot in the filename, the entire filename becomes the extension. This should be handled more robustly, perhaps by using Path(audio_file.filename).suffix or providing a default extension if none is found.
| file_ext = audio_file.filename.split('.')[-1] if '.' in audio_file.filename else "wav" | |
| raw_filename = audio_file.filename or "" | |
| suffix = Path(raw_filename).suffix.lstrip(".") | |
| file_ext = suffix if suffix else "wav" |
| consent: str = Form(...), | ||
| name: str = Form(...), |
There was a problem hiding this comment.
The consent parameter is stored but never validated or used for any authorization checks. If consent is meant to represent user consent for voice cloning, there should be validation logic to verify:
- The consent ID format/validity
- Whether the consent is still active
- Logging/audit trail for consent usage
Without proper consent validation, this could lead to compliance issues with privacy regulations.
|
|
||
| #### POST /v1/audio/voices | ||
|
|
||
| Upload a new voice sample for voice cloning in Base task TTS requests. |
There was a problem hiding this comment.
The documentation states that uploaded voices can be used "for voice cloning in Base task TTS requests", but the implementation doesn't enforce that uploaded voices are only used with Base task. An uploaded voice can be used with any task type due to the auto-set logic at lines 320-325, which could lead to unexpected behavior. Consider either:
- Clarifying in the documentation that uploaded voices work with any task type
- Restricting uploaded voices to Base task only in the code
- Making the auto-set behavior conditional on task_type being "Base"
| Upload a new voice sample for voice cloning in Base task TTS requests. | |
| Upload a new voice sample that can be used for voice cloning in subsequent TTS requests with any supported task type. |
already add docs in apiserver, copy from #1206 |
|
A few issues from the Copilot review still look unaddressed after the latest commit: Security (must fix before merge):
Logic bugs (must fix):
Minor (nice to have):
Also heads up: PR #1227 adds |
you are right,i will fix it |
Port the voice upload API (POST /v1/audio/voices) from upstream vllm-project#1201 into the HT branch, adapted to coexist with HT's existing streaming and audio extraction changes. - Add upload_voice(), _load/_save_uploaded_speakers() to serving_speech - Add POST /v1/audio/voices endpoint to api_server - Modify GET /v1/audio/voices to include uploaded voice details - Auto-set ref_audio for uploaded voices in Base task - Add docs/serving/speech_api.md documentation Note: Known upstream review issues (path traversal, metadata locking, validation bypass for built-in voices) are carried as-is for parity and will be addressed in a follow-up. Upstream-PR: vllm-project#1201 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes security and logic issues flagged in upstream PR vllm-project#1201 review: Security: - Sanitize name/consent to alphanumeric/underscore/hyphen only - Validate resolved path stays within upload directory - Remove file_path from API responses (information disclosure) Logic bugs: - Base task validation now correctly requires ref_audio unless voice is specifically an uploaded voice (not just any voice name) - _get_uploaded_audio_data raises ValueError instead of returning None when audio file is missing, preventing silent failures Robustness: - Atomic metadata writes via tempfile + os.replace - File locking (fcntl.flock) on metadata.json reads and writes - Use Path().suffix for file extension extraction Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Prevent path traversal attacks in filename handling - Remove file_path from API responses - Update documentation - Improve validation logic
The new endpoints allow users to:
Purpose
support add voice upload API for Qwen3-TTS
Test Plan
Test Result
Details
Files changed: - vllm_omni/entrypoints/openai/api_server.py - vllm_omni/entrypoints/openai/serving_speech.py - examples/online_serving/qwen3_tts/README.mdBEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)