feat(tts): add voice upload API for Qwen3-TTS #1201
base: main
Changes from 1 commit
9e405d2
70c380d
1f93f21
90704e9
87a2933
    @@ -82,12 +82,63 @@ curl http://localhost:8000/v1/audio/voices

     ## API Reference

    -### Endpoint
     ### Endpoints
     #### GET /v1/audio/voices

     List all available voices/speakers from the loaded model, including both built-in model voices and uploaded custom voices.

     **Response Example:**
     ```json
     {
       "voices": ["vivian", "ryan", "custom_voice_1"],
       "uploaded_voices": [
         {
           "name": "custom_voice_1",
           "consent": "user_consent_id",
           "created_at": 1738660000,
           "file_size": 1024000,
           "mime_type": "audio/wav"
         }
       ]
     }
     ```
     POST /v1/audio/speech

     #### POST /v1/audio/voices

     Upload a new voice sample for voice cloning in Base task TTS requests.

     **Form Parameters:**
     - `audio_sample` (required): Audio file (max 10MB, supported formats: wav, mp3, flac, ogg, aac, webm, mp4)
     - `consent` (required): Consent recording ID
     - `name` (required): Name for the new voice

     **Response Example:**
     ```json
     {
       "success": true,
       "voice": {
         "name": "custom_voice_1",
         "consent": "user_consent_id",
         "file_path": "/tmp/voice_samples/custom_voice_1_user_consent_id_1738660000.wav",

Suggested change:

    -    "file_path": "/tmp/voice_samples/custom_voice_1_user_consent_id_1738660000.wav",
    +    "file_path": "custom_voice_1_user_consent_id_1738660000.wav",
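To make the documented form parameters concrete, here is a minimal client-side pre-check in Python. The helper name `check_voice_upload` and the constant names are illustrative assumptions, not part of the PR. Only the field names, the format list, and the 10MB limit come from the documentation above, and reading 10MB as 10 * 1024 * 1024 bytes is also an assumption.

```python
from pathlib import Path

# Limits copied from the documented form parameters; the names are illustrative.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumes "10MB" means 10 MiB
SUPPORTED_EXTENSIONS = {"wav", "mp3", "flac", "ogg", "aac", "webm", "mp4"}


def check_voice_upload(filename: str, size_bytes: int, consent: str, name: str) -> list[str]:
    """Return a list of problems; an empty list means the upload looks valid."""
    problems = []
    ext = Path(filename).suffix.lstrip(".").lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds 10MB")
    if not consent:
        problems.append("consent is required")
    if not name:
        problems.append("name is required")
    return problems
```

If the check passes, a client using the `requests` library could send the multipart form roughly as `requests.post("http://localhost:8000/v1/audio/voices", files={"audio_sample": f}, data={"consent": consent, "name": name})`, with `f` being the opened audio file.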
    @@ -815,8 +815,80 @@ async def list_voices(raw_request: Request):
         if handler is None:
             return base(raw_request).create_error_response(message="The model does not support Speech API")

         # Get all speakers (both model built-in and uploaded)
         speakers = sorted(handler.supported_speakers) if handler.supported_speakers else []
    -    return JSONResponse(content={"voices": speakers})

         # Get uploaded speakers details
         uploaded_speakers = []
         if hasattr(handler, 'uploaded_speakers'):
             for voice_name, info in handler.uploaded_speakers.items():
                 uploaded_speakers.append({
                     "name": info.get("name", voice_name),
                     "consent": info.get("consent", ""),
                     "created_at": info.get("created_at", 0),
                     "file_size": info.get("file_size", 0),
                     "mime_type": info.get("mime_type", "")
                 })

         return JSONResponse(content={
             "voices": speakers,
             "uploaded_voices": uploaded_speakers
         })


     @router.post(
         "/v1/audio/voices",
         responses={
             HTTPStatus.OK.value: {"model": dict},
             HTTPStatus.BAD_REQUEST.value: {"model": ErrorResponse},
             HTTPStatus.INTERNAL_SERVER_ERROR.value: {"model": ErrorResponse},
         },
     )
     async def upload_voice(
         raw_request: Request,
         audio_sample: UploadFile = File(...),
         consent: str = Form(...),
         name: str = Form(...),

Comment on lines +850 to +851

     ):
         """Upload a new voice sample for voice cloning.

         Uploads an audio file that can be used as a reference for voice cloning
         in Base task TTS requests. The voice can then be referenced by name
         in subsequent TTS requests.

         Args:
             audio_sample: Audio file (max 10MB)
             consent: Consent recording ID
             name: Name for the new voice
             raw_request: Raw FastAPI request

         Returns:
             JSON response with voice information
         """
         handler = Omnispeech(raw_request)
         if handler is None:
             return base(raw_request).create_error_response(message="The model does not support Speech API")

         try:
             # Validate required parameters
             if not consent:
                 return base(raw_request).create_error_response(message="consent is required")
             if not name:
                 return base(raw_request).create_error_response(message="name is required")

Comment on lines +873 to +878

    # Validate required parameters
    if not consent:
        return base(raw_request).create_error_response(message="consent is required")
    if not name:
        return base(raw_request).create_error_response(message="name is required")

    @@ -1,7 +1,11 @@
     import asyncio
     import json
     import os

     import os

Outdated
Copilot
AI
Feb 5, 2026
Using a hardcoded path '/tmp/voice_samples' poses several issues:
- Security: Multiple users/deployments on the same system will share this directory
- Persistence: Files in /tmp may be deleted by system cleanup processes
- Portability: This path may not work on all operating systems (e.g., Windows)
Consider using a configurable directory path that can be set via environment variable or configuration parameter, and ensure proper isolation for multi-tenant scenarios.
Suggested change:

    -self.uploaded_speakers_dir = Path("/tmp/voice_samples")
    +base_dir_env = os.getenv("VLLM_OMNI_VOICE_SAMPLES_DIR")
    +if base_dir_env:
    +    self.uploaded_speakers_dir = Path(base_dir_env)
    +else:
    +    # Use a portable, user-specific cache directory by default
    +    xdg_cache_home = os.getenv("XDG_CACHE_HOME")
    +    if xdg_cache_home:
    +        cache_base = Path(xdg_cache_home)
    +    else:
    +        cache_base = Path.home() / ".cache"
    +    self.uploaded_speakers_dir = cache_base / "vllm_omni" / "voice_samples"

Copilot
AI
Feb 5, 2026
There's an inconsistency in how the original voice name is preserved. The metadata stores the original case name in the "name" field but uses lowercase as the dictionary key. However, when listing voices in the API response (api_server.py lines 824-831), it retrieves the name from info.get("name", voice_name), which means it will preserve the original case. But at line 819, the voices list contains lowercase names from self.supported_speakers. This creates an inconsistency where the main "voices" array has lowercase names but "uploaded_voices" has original case names. Consider either:
- Storing both lowercase and original case names separately
- Standardizing on one format for the API response
Copilot
AI
Feb 5, 2026
The metadata.json file could grow unbounded as users upload more voices. There's no mechanism to limit the number of uploaded voices or to delete old voices. Consider implementing:
- A maximum number of uploaded voices per instance
- An API endpoint to delete uploaded voices
- A cleanup mechanism for old/unused voices
Copilot
AI
Feb 5, 2026
The metadata file is not protected by any locking mechanism. In a multi-process or multi-threaded environment, concurrent uploads could lead to race conditions where:
- Two processes read the same metadata
- Both add their voice
- One overwrites the other's changes when saving
Consider using file locking (e.g., fcntl on Unix, msvcrt on Windows) or a database for thread-safe metadata storage.
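The read-modify-write race described above can be closed with an exclusive lock around the whole update. The following is a minimal Unix-only sketch under stated assumptions: the function name `update_metadata`, the sidecar `.lock` file, and the `.tmp` suffix are hypothetical and do not appear in the PR.

```python
import fcntl
import json
import os
from pathlib import Path


def update_metadata(metadata_path: Path, voice_key: str, info: dict) -> dict:
    """Add one voice entry under an exclusive lock and return the merged metadata."""
    lock_path = metadata_path.with_name(metadata_path.name + ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock (Unix only).
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Re-read inside the lock so entries from concurrent writers are kept.
            if metadata_path.exists():
                metadata = json.loads(metadata_path.read_text())
            else:
                metadata = {}
            metadata[voice_key] = info
            # Write to a temp file, then swap it in atomically.
            tmp_path = metadata_path.with_name(metadata_path.name + ".tmp")
            tmp_path.write_text(json.dumps(metadata))
            os.replace(tmp_path, metadata_path)
            return metadata
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The key point is that the file is read after the lock is taken, so the second writer always sees the first writer's entry. Inside an asyncio server this blocking call would likely need to run in a worker thread.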
Outdated
Copilot
AI
Feb 5, 2026
The base64 module should be imported at the top of the file with other imports, not within the method. This is a standard Python convention and improves code maintainability.
Copilot
AI
Feb 5, 2026
The uploaded audio file is read into memory completely when used. For a 10MB file, this is acceptable, but this could be optimized by caching the base64-encoded data in memory after first access or storing it in the metadata to avoid repeated file I/O operations.
Copilot
AI
Feb 5, 2026
There is no input validation for the 'name' parameter. Malicious users could provide names containing path traversal characters (e.g., '../../../etc/passwd') or special characters that could cause issues with file operations. The name should be sanitized to allow only alphanumeric characters, underscores, and hyphens before use.
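The allowlist this comment asks for could look like the sketch below. The function name `validate_voice_name` and the 64-character cap are assumptions made for illustration. Only the permitted character set comes from the comment.

```python
import re

# Allowlist from the review comment: letters, digits, underscores, hyphens.
# The 64-character cap is an added assumption.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def validate_voice_name(name: str) -> str:
    """Return the name unchanged if it is safe to embed in a filename, else raise."""
    if not _SAFE_NAME.fullmatch(name):
        raise ValueError("name may contain only letters, digits, '_' and '-'")
    return name
```

Rejecting bad names outright, rather than silently rewriting them, keeps the name the client sent identical to the one it later uses in TTS requests. The same check would apply to `consent`, since it is also embedded in the filename.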
Outdated
Copilot
AI
Feb 5, 2026
The file extension extraction logic is fragile. If the filename has no extension or multiple dots (e.g., 'my.voice.sample.wav'), splitting by '.' and taking the last element works, but if there's no dot in the filename, the entire filename becomes the extension. This should be handled more robustly, perhaps by using Path(audio_file.filename).suffix or providing a default extension if none is found.
Suggested change:

    -file_ext = audio_file.filename.split('.')[-1] if '.' in audio_file.filename else "wav"
    +raw_filename = audio_file.filename or ""
    +suffix = Path(raw_filename).suffix.lstrip(".")
    +file_ext = suffix if suffix else "wav"

Outdated
Copilot
AI
Feb 5, 2026
The filename construction using user-provided 'name' and 'consent' without sanitization creates a security vulnerability. Both parameters should be validated/sanitized to prevent path traversal attacks. Additionally, the file extension is taken directly from the uploaded filename without validation, which could lead to unexpected behavior if the filename doesn't contain an extension or contains multiple dots.
Outdated
Prevent path traversal in uploaded voice filename
The upload endpoint builds filename directly from untrusted name and consent and then writes file_path = self.uploaded_speakers_dir / filename. If either field contains path separators or .., the resulting path can escape /tmp/voice_samples and overwrite arbitrary files on the host. This is a security issue that can be triggered by a client POSTing a crafted name/consent. Sanitize these inputs (e.g., allowlist safe characters) or normalize and validate that the resolved path stays within the upload directory.
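The second option mentioned here, normalizing and checking that the resolved path stays inside the upload directory, can be sketched as follows. `safe_upload_path` is a hypothetical helper, not code from the PR.

```python
from pathlib import Path


def safe_upload_path(upload_dir: Path, filename: str) -> Path:
    """Resolve filename under upload_dir, raising if the result escapes it."""
    base = upload_dir.resolve()
    candidate = (base / filename).resolve()
    # is_relative_to requires Python 3.9+.
    if candidate == base or not candidate.is_relative_to(base):
        raise ValueError(f"unsafe filename: {filename!r}")
    return candidate
```

This catches `..` segments and absolute paths, because joining an absolute path onto `base` discards `base` entirely. It works best as a second line of defense behind an allowlist on `name` and `consent`.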
Copilot
AI
Feb 5, 2026
There's no check for available disk space before writing the file. If the disk is full, the file write will fail with a potentially unclear error message. Consider either:
- Checking available disk space before attempting to save
- Providing a more specific error message for disk-full scenarios
This is especially important since files can be up to 10MB and multiple users may be uploading simultaneously.
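A pre-flight free-space check is short to write with the standard library. In this sketch, `has_room_for` and the 100 MiB default reserve are illustrative assumptions.

```python
import shutil
from pathlib import Path


def has_room_for(upload_dir: Path, incoming_bytes: int,
                 reserve_bytes: int = 100 * 1024 * 1024) -> bool:
    """Return True if upload_dir's filesystem can hold the file and keep a reserve free."""
    free = shutil.disk_usage(upload_dir).free
    return free >= incoming_bytes + reserve_bytes
```

The check is advisory only: another upload can consume the space between the check and the write, so the write path would still need to catch `OSError` and report `errno.ENOSPC` clearly.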
Copilot
AI
Feb 5, 2026
There's a potential race condition: if the file is successfully written but saving metadata fails, the uploaded file becomes orphaned. Consider using a transaction-like pattern where you first save the file with a temporary name, then update metadata, and only rename to final name if both succeed. Also consider cleanup of orphaned files on initialization.
Suggested change:

    -# Save audio file
    -try:
    -    with open(file_path, 'wb') as f:
    -        content = await audio_file.read()
    -        f.write(content)
    -except Exception as e:
    -    raise ValueError(f"Failed to save audio file: {e}")
    -# Update metadata
    -self.uploaded_speakers[voice_name_lower] = {
    -    "name": name,
    -    "consent": consent,
    -    "file_path": str(file_path),
    -    "created_at": timestamp,
    -    "mime_type": mime_type,
    -    "original_filename": audio_file.filename,
    -    "file_size": file_size
    -}
    -# Update supported speakers
    -self.supported_speakers.add(voice_name_lower)
    -# Save metadata
    -self._save_uploaded_speakers()
    -logger.info(f"Uploaded new voice '{name}' with consent ID '{consent}'")
    -return {
    -    "name": name,
    -    "consent": consent,
    -    "file_path": str(file_path),
    -    "created_at": timestamp,
    -    "mime_type": mime_type,
    -    "file_size": file_size
    +temp_file_path = self.uploaded_speakers_dir / f"{filename}.tmp"
    +# Save audio file to a temporary path first to avoid orphaned files
    +try:
    +    content = await audio_file.read()
    +    with open(temp_file_path, "wb") as f:
    +        f.write(content)
    +    # Update metadata in memory
    +    self.uploaded_speakers[voice_name_lower] = {
    +        "name": name,
    +        "consent": consent,
    +        "file_path": str(file_path),
    +        "created_at": timestamp,
    +        "mime_type": mime_type,
    +        "original_filename": audio_file.filename,
    +        "file_size": file_size,
    +    }
    +    # Update supported speakers
    +    self.supported_speakers.add(voice_name_lower)
    +    # Persist metadata
    +    self._save_uploaded_speakers()
    +    # Atomically move the temp file to its final location
    +    os.replace(temp_file_path, file_path)
    +except Exception as e:
    +    # Clean up temp file and roll back in-memory state on failure
    +    try:
    +        if isinstance(temp_file_path, Path):
    +            if temp_file_path.exists():
    +                temp_file_path.unlink()
    +        else:
    +            if os.path.exists(temp_file_path):
    +                os.remove(temp_file_path)
    +    except Exception:
    +        # Best-effort cleanup; ignore secondary errors
    +        pass
    +    # Roll back any partially updated metadata
    +    if hasattr(self, "uploaded_speakers"):
    +        self.uploaded_speakers.pop(voice_name_lower, None)
    +    if hasattr(self, "supported_speakers"):
    +        try:
    +            self.supported_speakers.discard(voice_name_lower)
    +        except AttributeError:
    +            # In case supported_speakers is not a set-like object
    +            try:
    +                self.supported_speakers.remove(voice_name_lower)
    +            except Exception:
    +                pass
    +    raise ValueError(f"Failed to upload voice: {e}")
    +logger.info(f"Uploaded new voice '{name}' with consent ID '{consent}'")
    +return {
    +    "name": name,
    +    "consent": consent,
    +    "file_path": str(file_path),
    +    "created_at": timestamp,
    +    "mime_type": mime_type,
    +    "file_size": file_size,

Copilot
AI
Feb 5, 2026
The new voice upload functionality lacks test coverage. The existing test file tests/entrypoints/openai_api/test_serving_speech.py tests the list_voices endpoint but doesn't include tests for:
- The POST /v1/audio/voices upload endpoint
- File validation (size limits, MIME types)
- Voice name collision handling
- The auto-set ref_audio behavior for uploaded voices
- Error scenarios (disk full, invalid files, etc.)
Consider adding comprehensive tests for this new functionality.
Copilot
AI
Feb 5, 2026
The API returns the full server file path in the response (line 219 and line 883 in api_server.py). This is an information disclosure vulnerability as it exposes the server's internal directory structure to clients. Consider either:
- Not returning file_path at all (clients don't need it)
- Returning only the voice name/ID as an identifier
- Returning an opaque reference that doesn't reveal the actual path
This also applies to the 'uploaded_voices' response at line 826 in api_server.py which indirectly exposes paths through the metadata.
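The first option, not returning `file_path` at all, amounts to filtering the stored record through an allowlist before it is serialized. In this sketch, `public_voice_info` and `PUBLIC_FIELDS` are hypothetical names. The field list mirrors the documented `uploaded_voices` entries.

```python
# Fields copied from the GET /v1/audio/voices response example;
# file_path and original_filename are deliberately absent.
PUBLIC_FIELDS = ("name", "consent", "created_at", "file_size", "mime_type")


def public_voice_info(info: dict) -> dict:
    """Return only the client-safe subset of a stored voice record."""
    return {key: info[key] for key in PUBLIC_FIELDS if key in info}
```

An allowlist is safer than deleting `file_path` from a copy, because any field added to the stored metadata later stays private until it is explicitly listed.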
Outdated
Copilot
AI
Feb 5, 2026
The validation doesn't check if an uploaded voice file actually exists when using Base task with an uploaded voice. If task_type is "Base" and voice is an uploaded voice name, but the audio file is missing or unreadable, the auto-set logic at lines 320-325 will silently fail (returning None from _get_uploaded_audio_data), and the Base task will proceed without ref_audio, potentially causing downstream errors. Consider adding validation to ensure uploaded voices have accessible audio files, especially for Base task.
Suggested change:

    -if task_type == "Base" and request.voice is None:
    +if task_type == "Base":
    +    # Base task always requires explicit ref_audio to avoid relying on
    +    # potentially failing auto-set logic from uploaded voices.

Outdated
Require ref_audio for Base when voice isn't uploaded
The new Base-task validation only enforces ref_audio when voice is missing, so a request like task_type=Base with a built-in speaker name but no ref_audio now passes validation. In that case _build_tts_params will send no ref_audio to the model (because the auto-fill only happens for uploaded voices), which breaks the Base task’s voice-cloning requirement and likely yields a model error or incorrect output. Consider requiring ref_audio unless voice refers to an uploaded speaker that will be auto-populated.
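The rule proposed here can be written as a small pure function. This is a sketch under stated assumptions: `base_task_error` is a hypothetical name, and `uploaded_voices` is assumed to hold lowercase keys, as an earlier review comment on this PR describes.

```python
from typing import Optional


def base_task_error(voice: Optional[str], ref_audio: Optional[str],
                    uploaded_voices: set) -> Optional[str]:
    """Return an error message if a Base-task request lacks reference audio, else None."""
    if ref_audio is not None:
        return None  # explicit reference audio always satisfies the requirement
    if voice is not None and voice.lower() in uploaded_voices:
        return None  # ref_audio will be auto-filled from the uploaded sample
    return "Base task requires ref_audio unless voice names an uploaded voice"
```

With this rule, a built-in speaker name without `ref_audio` is rejected at validation time instead of reaching the model without reference audio.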
Copilot
AI
Feb 5, 2026
The auto-set logic modifies the request parameters silently. When an uploaded voice is used, ref_audio and x_vector_only_mode are automatically set without informing the user. This could cause confusion if a user explicitly passes ref_audio with an uploaded voice - the user's ref_audio will be silently ignored. Consider:
- Logging a warning if user provides ref_audio for an uploaded voice
- Documenting this auto-set behavior clearly
- Only auto-setting if both voice is uploaded AND ref_audio is None (which is already done, but should be clarified)
The documentation states that uploaded voices can be used "for voice cloning in Base task TTS requests", but the implementation doesn't enforce that uploaded voices are only used with Base task. An uploaded voice can be used with any task type due to the auto-set logic at lines 320-325, which could lead to unexpected behavior. Consider either: