55 changes: 53 additions & 2 deletions examples/online_serving/qwen3_tts/README.md
@@ -82,12 +82,63 @@ curl http://localhost:8000/v1/audio/voices

## API Reference

### Endpoint
### Endpoints
#### GET /v1/audio/voices

List all available voices/speakers from the loaded model, including both built-in model voices and uploaded custom voices.

**Response Example:**
```json
{
  "voices": ["vivian", "ryan", "custom_voice_1"],
  "uploaded_voices": [
    {
      "name": "custom_voice_1",
      "consent": "user_consent_id",
      "created_at": 1738660000,
      "file_size": 1024000,
      "mime_type": "audio/wav"
    }
  ]
}
```
POST /v1/audio/speech

#### POST /v1/audio/voices

Upload a new voice sample for voice cloning in Base task TTS requests.

Copilot AI Feb 5, 2026

The documentation states that uploaded voices can be used "for voice cloning in Base task TTS requests", but the implementation doesn't enforce that uploaded voices are only used with Base task. An uploaded voice can be used with any task type due to the auto-set logic at lines 320-325, which could lead to unexpected behavior. Consider either:

  1. Clarifying in the documentation that uploaded voices work with any task type
  2. Restricting uploaded voices to Base task only in the code
  3. Making the auto-set behavior conditional on task_type being "Base"
Suggested change
Upload a new voice sample for voice cloning in Base task TTS requests.
Upload a new voice sample that can be used for voice cloning in subsequent TTS requests with any supported task type.


**Form Parameters:**
- `audio_sample` (required): Audio file (max 10MB, supported formats: wav, mp3, flac, ogg, aac, webm, mp4)
- `consent` (required): Consent recording ID
- `name` (required): Name for the new voice

**Response Example:**
```json
{
"success": true,
"voice": {
"name": "custom_voice_1",
"consent": "user_consent_id",
"file_path": "/tmp/voice_samples/custom_voice_1_user_consent_id_1738660000.wav",

Copilot AI Feb 5, 2026

The documentation exposes the internal file path '/tmp/voice_samples/' in the response example. This is a potential information disclosure issue as it reveals the server's internal directory structure. Consider either:

  1. Not returning the file_path in the API response
  2. Sanitizing the path to not reveal absolute server paths
  3. Returning a relative or opaque identifier instead
Suggested change
"file_path": "/tmp/voice_samples/custom_voice_1_user_consent_id_1738660000.wav",
"file_path": "custom_voice_1_user_consent_id_1738660000.wav",

"created_at": 1738660000,
"mime_type": "audio/wav",
"file_size": 1024000
}
}
```

**Usage Example:**
```bash
curl -X POST http://localhost:8000/v1/audio/voices \
-F "audio_sample=@/path/to/voice_sample.wav" \
-F "consent=user_consent_id" \
-F "name=custom_voice_1"
```
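The same upload can also be issued from Python; below is a minimal sketch using the `requests` library (the server URL and the sample path are placeholders):

```python
import requests

# Hypothetical endpoint and sample file; adjust to your deployment.
url = "http://localhost:8000/v1/audio/voices"
with open("/path/to/voice_sample.wav", "rb") as f:
    response = requests.post(
        url,
        files={"audio_sample": ("voice_sample.wav", f, "audio/wav")},
        data={"consent": "user_consent_id", "name": "custom_voice_1"},
    )
response.raise_for_status()
print(response.json()["voice"]["name"])  # -> "custom_voice_1"
```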


#### POST /v1/audio/speech


This endpoint follows the [OpenAI Audio Speech API](https://platform.openai.com/docs/api-reference/audio/createSpeech) format with additional Qwen3-TTS parameters.

### Request Body
74 changes: 73 additions & 1 deletion vllm_omni/entrypoints/openai/api_server.py
@@ -815,8 +815,80 @@ async def list_voices(raw_request: Request):
    if handler is None:
        return base(raw_request).create_error_response(message="The model does not support Speech API")

    # Get all speakers (both model built-in and uploaded)
    speakers = sorted(handler.supported_speakers) if handler.supported_speakers else []
    return JSONResponse(content={"voices": speakers})

    # Get uploaded speakers details
    uploaded_speakers = []
    if hasattr(handler, 'uploaded_speakers'):
        for voice_name, info in handler.uploaded_speakers.items():
            uploaded_speakers.append({
                "name": info.get("name", voice_name),
                "consent": info.get("consent", ""),
                "created_at": info.get("created_at", 0),
                "file_size": info.get("file_size", 0),
                "mime_type": info.get("mime_type", "")
            })

    return JSONResponse(content={
        "voices": speakers,
        "uploaded_voices": uploaded_speakers
    })


@router.post(
    "/v1/audio/voices",
    responses={
        HTTPStatus.OK.value: {"model": dict},
        HTTPStatus.BAD_REQUEST.value: {"model": ErrorResponse},
        HTTPStatus.INTERNAL_SERVER_ERROR.value: {"model": ErrorResponse},
    },
)
async def upload_voice(
    raw_request: Request,
    audio_sample: UploadFile = File(...),
    consent: str = Form(...),
    name: str = Form(...),
Comment on lines +850 to +851

Copilot AI Feb 5, 2026

The consent parameter is stored but never validated or used for any authorization checks. If consent is meant to represent user consent for voice cloning, there should be validation logic to verify:

  1. The consent ID format/validity
  2. Whether the consent is still active
  3. Logging/audit trail for consent usage

Without proper consent validation, this could lead to compliance issues with privacy regulations.
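As a rough illustration of the first two points, a format and registry check could look like the following sketch (the consent-ID format and the `_active_consent_ids` registry are hypothetical, not part of this PR):

```python
import re

CONSENT_ID_RE = re.compile(r"^[A-Za-z0-9_-]{8,64}$")  # hypothetical format

def _validate_consent(self, consent: str) -> None:
    """Reject malformed or unknown consent IDs before accepting an upload."""
    if not CONSENT_ID_RE.fullmatch(consent):
        raise ValueError("Invalid consent ID format")
    # Hypothetical registry of active consents, maintained elsewhere.
    if consent not in getattr(self, "_active_consent_ids", set()):
        raise ValueError("Consent ID is not active")
    logger.info("Consent %s used for voice upload", consent)  # audit trail
```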

):
    """Upload a new voice sample for voice cloning.

    Uploads an audio file that can be used as a reference for voice cloning
    in Base task TTS requests. The voice can then be referenced by name
    in subsequent TTS requests.

    Args:
        audio_sample: Audio file (max 10MB)
        consent: Consent recording ID
        name: Name for the new voice
        raw_request: Raw FastAPI request

    Returns:
        JSON response with voice information
    """
    handler = Omnispeech(raw_request)
    if handler is None:
        return base(raw_request).create_error_response(message="The model does not support Speech API")

    try:
        # Validate required parameters
        if not consent:
            return base(raw_request).create_error_response(message="consent is required")
        if not name:
            return base(raw_request).create_error_response(message="name is required")

Comment on lines +873 to +878

Copilot AI Feb 5, 2026

The error handling for empty consent/name is redundant because Form(...) already enforces that these fields are required. FastAPI will return a 422 error if these fields are missing. These checks at lines 874-877 will never be reached and should be removed to avoid confusion.

Suggested change
# Validate required parameters
if not consent:
return base(raw_request).create_error_response(message="consent is required")
if not name:
return base(raw_request).create_error_response(message="name is required")

        # Upload the voice
        result = await handler.upload_voice(audio_sample, consent, name)

        return JSONResponse(content={
            "success": True,
            "voice": result
        })

    except ValueError as e:
        return base(raw_request).create_error_response(message=str(e))
    except Exception as e:
        logger.exception(f"Failed to upload voice: {e}")
        return base(raw_request).create_error_response(message=f"Failed to upload voice: {str(e)}")


# Health and Model endpoints for diffusion mode
172 changes: 170 additions & 2 deletions vllm_omni/entrypoints/openai/serving_speech.py
@@ -1,7 +1,11 @@
import asyncio
import json
import os

Copilot AI Feb 5, 2026

The 'os' module is imported but never used in the code. This import should be removed to keep the codebase clean.

Suggested change
import os

import time
from pathlib import Path
from typing import Any

from fastapi import Request
from fastapi import Request, UploadFile
from fastapi.responses import Response
from vllm.entrypoints.openai.engine.serving import OpenAIServing
from vllm.logger import init_logger
@@ -40,9 +44,20 @@
class OmniOpenAIServingSpeech(OpenAIServing, AudioMixin):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Initialize uploaded speakers storage
        self.uploaded_speakers_dir = Path("/tmp/voice_samples")

Copilot AI Feb 5, 2026

Using a hardcoded path '/tmp/voice_samples' poses several issues:

  1. Security: Multiple users/deployments on the same system will share this directory
  2. Persistence: Files in /tmp may be deleted by system cleanup processes
  3. Portability: This path may not work on all operating systems (e.g., Windows)

Consider using a configurable directory path that can be set via environment variable or configuration parameter, and ensure proper isolation for multi-tenant scenarios.

Suggested change
self.uploaded_speakers_dir = Path("/tmp/voice_samples")
base_dir_env = os.getenv("VLLM_OMNI_VOICE_SAMPLES_DIR")
if base_dir_env:
self.uploaded_speakers_dir = Path(base_dir_env)
else:
# Use a portable, user-specific cache directory by default
xdg_cache_home = os.getenv("XDG_CACHE_HOME")
if xdg_cache_home:
cache_base = Path(xdg_cache_home)
else:
cache_base = Path.home() / ".cache"
self.uploaded_speakers_dir = cache_base / "vllm_omni" / "voice_samples"

        self.uploaded_speakers_dir.mkdir(parents=True, exist_ok=True)
        self.metadata_file = self.uploaded_speakers_dir / "metadata.json"

        # Load supported speakers
        self.supported_speakers = self._load_supported_speakers()
        # Load uploaded speakers
        self.uploaded_speakers = self._load_uploaded_speakers()
        # Merge supported speakers with uploaded speakers
        self.supported_speakers.update(self.uploaded_speakers.keys())
Comment on lines 91 to +95

Copilot AI Feb 5, 2026

There's an inconsistency in how the original voice name is preserved. The metadata stores the original case name in the "name" field but uses lowercase as the dictionary key. However, when listing voices in the API response (api_server.py lines 824-831), it retrieves the name from info.get("name", voice_name), which means it will preserve the original case. But at line 819, the voices list contains lowercase names from self.supported_speakers. This creates an inconsistency where the main "voices" array has lowercase names but "uploaded_voices" has original case names. Consider either:

  1. Storing both lowercase and original case names separately
  2. Standardizing on one format for the API response


        logger.info(f"Loaded {len(self.supported_speakers)} supported speakers: {sorted(self.supported_speakers)}")
        logger.info(f"Loaded {len(self.uploaded_speakers)} uploaded speakers")

    def _load_supported_speakers(self) -> set[str]:
        """Load supported speakers (case-insensitive) from the model configuration."""
@@ -62,6 +77,151 @@ def _load_supported_speakers(self) -> set[str]:

        return set()

    def _load_uploaded_speakers(self) -> dict[str, dict]:
        """Load uploaded speakers from metadata file."""
        if not self.metadata_file.exists():
            return {}

        try:
            with open(self.metadata_file, 'r') as f:
                metadata = json.load(f)
            return metadata.get("uploaded_speakers", {})
        except Exception as e:
            logger.warning(f"Could not load uploaded speakers metadata: {e}")
            return {}

    def _save_uploaded_speakers(self) -> None:
        """Save uploaded speakers to metadata file."""
        try:
            metadata = {"uploaded_speakers": self.uploaded_speakers}
            with open(self.metadata_file, 'w') as f:
                json.dump(metadata, f, indent=2)
        except Exception as e:
            logger.error(f"Could not save uploaded speakers metadata: {e}")
Comment on lines +131 to +138

Copilot AI Feb 5, 2026

The metadata.json file could grow unbounded as users upload more voices. There's no mechanism to limit the number of uploaded voices or to delete old voices. Consider implementing:

  1. A maximum number of uploaded voices per instance
  2. An API endpoint to delete uploaded voices
  3. A cleanup mechanism for old/unused voices
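
As a sketch of option 1, a cap could be enforced at the top of `upload_voice` (the limit below is a hypothetical value):

```python
MAX_UPLOADED_VOICES = 100  # hypothetical per-instance cap

if len(self.uploaded_speakers) >= MAX_UPLOADED_VOICES:
    raise ValueError(
        f"Maximum number of uploaded voices ({MAX_UPLOADED_VOICES}) reached"
    )
```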

Comment on lines +131 to +138

Copilot AI Feb 5, 2026

The metadata file is not protected by any locking mechanism. In a multi-process or multi-threaded environment, concurrent uploads could lead to race conditions where:

  1. Two processes read the same metadata
  2. Both add their voice
  3. One overwrites the other's changes when saving

Consider using file locking (e.g., fcntl on Unix, msvcrt on Windows) or a database for thread-safe metadata storage.
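
One possible shape for a locked write, using Unix-only `fcntl` on a sidecar lock file (a sketch, not the PR's implementation; it serializes writers on one host but does not by itself make the whole read-modify-write cycle atomic):

```python
import fcntl
import json

def _save_uploaded_speakers(self) -> None:
    """Persist uploaded-speaker metadata while holding an exclusive lock."""
    lock_path = self.metadata_file.with_suffix(".lock")
    try:
        with open(lock_path, "w") as lock_f:
            fcntl.flock(lock_f, fcntl.LOCK_EX)  # released when lock_f closes
            with open(self.metadata_file, "w") as f:
                json.dump({"uploaded_speakers": self.uploaded_speakers}, f, indent=2)
    except Exception as e:
        logger.error(f"Could not save uploaded speakers metadata: {e}")
```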


    def _get_uploaded_audio_data(self, voice_name: str) -> str | None:
        """Get base64 encoded audio data for uploaded voice."""
        voice_name_lower = voice_name.lower()
        if voice_name_lower not in self.uploaded_speakers:
            return None

        speaker_info = self.uploaded_speakers[voice_name_lower]
        file_path = Path(speaker_info["file_path"])

        if not file_path.exists():
            logger.warning(f"Audio file not found for voice {voice_name}: {file_path}")
            return None

        try:
            import base64

Copilot AI Feb 5, 2026

The base64 module should be imported at the top of the file with other imports, not within the method. This is a standard Python convention and improves code maintainability.


            # Read audio file
            with open(file_path, 'rb') as f:
                audio_bytes = f.read()

            # Encode to base64
            audio_b64 = base64.b64encode(audio_bytes).decode('utf-8')

            # Get MIME type from file extension
            mime_type = speaker_info.get("mime_type", "audio/wav")

            # Return as data URL
            return f"data:{mime_type};base64,{audio_b64}"
        except Exception as e:
            logger.error(f"Could not read audio file for voice {voice_name}: {e}")
            return None
Comment on lines 153 to 168

Copilot AI Feb 5, 2026

The uploaded audio file is read into memory completely when used. For a 10MB file, this is acceptable, but this could be optimized by caching the base64-encoded data in memory after first access or storing it in the metadata to avoid repeated file I/O operations.
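
A simple per-voice memoization along those lines might look like this (a sketch; `_audio_data_cache` is a hypothetical attribute on the serving class):

```python
def _get_uploaded_audio_data_cached(self, voice_name: str) -> str | None:
    """Memoize the base64 data URL per voice to avoid re-reading the file."""
    cache = getattr(self, "_audio_data_cache", None)
    if cache is None:
        cache = self._audio_data_cache = {}
    key = voice_name.lower()
    if key not in cache:
        cache[key] = self._get_uploaded_audio_data(voice_name)
    return cache[key]
```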


    async def upload_voice(self, audio_file: UploadFile, consent: str, name: str) -> dict:
        """Upload a new voice sample."""
        # Validate file size (max 10MB)
        MAX_FILE_SIZE = 10 * 1024 * 1024  # 10MB
        audio_file.file.seek(0, 2)  # Seek to end
        file_size = audio_file.file.tell()
        audio_file.file.seek(0)  # Reset to beginning

        if file_size > MAX_FILE_SIZE:
            raise ValueError(f"File size exceeds maximum limit of 10MB. Got {file_size} bytes.")

        # Detect MIME type from filename if content_type is generic
        mime_type = audio_file.content_type
        if mime_type == "application/octet-stream":
            # Simple MIME type detection based on file extension
            filename_lower = audio_file.filename.lower()
            if filename_lower.endswith(".wav"):
                mime_type = "audio/wav"
            elif filename_lower.endswith((".mp3", ".mpeg")):
                mime_type = "audio/mpeg"
            elif filename_lower.endswith(".flac"):
                mime_type = "audio/flac"
            elif filename_lower.endswith(".ogg"):
                mime_type = "audio/ogg"
            elif filename_lower.endswith(".aac"):
                mime_type = "audio/aac"
            elif filename_lower.endswith(".webm"):
                mime_type = "audio/webm"
            elif filename_lower.endswith(".mp4"):
                mime_type = "audio/mp4"
            else:
                mime_type = "audio/wav"  # Default

        # Validate MIME type
        allowed_mime_types = {
            "audio/mpeg", "audio/wav", "audio/x-wav", "audio/ogg",
            "audio/aac", "audio/flac", "audio/webm", "audio/mp4"
        }

        if mime_type not in allowed_mime_types:
            raise ValueError(f"Unsupported MIME type: {mime_type}. Allowed: {allowed_mime_types}")

        # Normalize voice name
        voice_name_lower = name.lower()

Copilot AI Feb 5, 2026

There is no input validation for the 'name' parameter. Malicious users could provide names containing path traversal characters (e.g., '../../../etc/passwd') or special characters that could cause issues with file operations. The name should be sanitized to allow only alphanumeric characters, underscores, and hyphens before use.
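
A minimal allowlist check of the kind suggested here could be (a hypothetical helper, not part of this PR):

```python
import re

def _sanitize_voice_name(name: str) -> str:
    """Strip everything except alphanumerics, underscores, and hyphens."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "", name)
    if not cleaned:
        raise ValueError("Voice name must contain at least one allowed character")
    return cleaned
```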


        # Check if voice already exists
        if voice_name_lower in self.uploaded_speakers:
            raise ValueError(f"Voice '{name}' already exists")

        # Generate filename
        timestamp = int(time.time())
        file_ext = audio_file.filename.split('.')[-1] if '.' in audio_file.filename else "wav"

Copilot AI Feb 5, 2026

The file extension extraction logic is fragile. Splitting by '.' and taking the last element works for names with multiple dots (e.g., 'my.voice.sample.wav'), but it yields an empty extension if the filename ends with a dot and fails outright if the filename is missing. This should be handled more robustly, perhaps by using Path(audio_file.filename).suffix or providing a default extension if none is found.

Suggested change
file_ext = audio_file.filename.split('.')[-1] if '.' in audio_file.filename else "wav"
raw_filename = audio_file.filename or ""
suffix = Path(raw_filename).suffix.lstrip(".")
file_ext = suffix if suffix else "wav"

        filename = f"{name}_{consent}_{timestamp}.{file_ext}"
        file_path = self.uploaded_speakers_dir / filename

Copilot AI Feb 5, 2026

The filename construction using user-provided 'name' and 'consent' without sanitization creates a security vulnerability. Both parameters should be validated/sanitized to prevent path traversal attacks. Additionally, the file extension is taken directly from the uploaded filename without validation, which could lead to unexpected behavior if the filename doesn't contain an extension or contains multiple dots.


P1: Prevent path traversal in uploaded voice filename

The upload endpoint builds filename directly from untrusted name and consent and then writes file_path = self.uploaded_speakers_dir / filename. If either field contains path separators or .., the resulting path can escape /tmp/voice_samples and overwrite arbitrary files on the host. This is a security issue that can be triggered by a client POSTing a crafted name/consent. Sanitize these inputs (e.g., allowlist safe characters) or normalize and validate that the resolved path stays within the upload directory.
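
One way to express the containment check (a sketch; `Path.is_relative_to` requires Python 3.9+):

```python
from pathlib import Path

def _resolve_inside(base_dir: Path, filename: str) -> Path:
    """Resolve filename under base_dir and reject anything that escapes it."""
    candidate = (base_dir / filename).resolve()
    if not candidate.is_relative_to(base_dir.resolve()):
        raise ValueError("Invalid voice filename")
    return candidate
```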


        # Save audio file
        try:
            with open(file_path, 'wb') as f:
                content = await audio_file.read()
                f.write(content)
        except Exception as e:
            raise ValueError(f"Failed to save audio file: {e}")
Comment on lines +239 to +245

Copilot AI Feb 5, 2026

There's no check for available disk space before writing the file. If the disk is full, the file write will fail with a potentially unclear error message. Consider either:

  1. Checking available disk space before attempting to save
  2. Providing a more specific error message for disk-full scenarios
    This is especially important since files can be up to 10MB and multiple users may be uploading simultaneously.


        # Update metadata
        self.uploaded_speakers[voice_name_lower] = {
            "name": name,
            "consent": consent,
            "file_path": str(file_path),
            "created_at": timestamp,
            "mime_type": mime_type,
            "original_filename": audio_file.filename,
            "file_size": file_size
        }

        # Update supported speakers
        self.supported_speakers.add(voice_name_lower)

        # Save metadata
        self._save_uploaded_speakers()

        logger.info(f"Uploaded new voice '{name}' with consent ID '{consent}'")

        return {
            "name": name,
            "consent": consent,
            "file_path": str(file_path),
            "created_at": timestamp,
            "mime_type": mime_type,
            "file_size": file_size
Comment on lines 234 to 272

Copilot AI Feb 5, 2026

There's a potential race condition: if the file is successfully written but saving metadata fails, the uploaded file becomes orphaned. Consider using a transaction-like pattern where you first save the file with a temporary name, then update metadata, and only rename to final name if both succeed. Also consider cleanup of orphaned files on initialization.

Suggested change
# Save audio file
try:
with open(file_path, 'wb') as f:
content = await audio_file.read()
f.write(content)
except Exception as e:
raise ValueError(f"Failed to save audio file: {e}")
# Update metadata
self.uploaded_speakers[voice_name_lower] = {
"name": name,
"consent": consent,
"file_path": str(file_path),
"created_at": timestamp,
"mime_type": mime_type,
"original_filename": audio_file.filename,
"file_size": file_size
}
# Update supported speakers
self.supported_speakers.add(voice_name_lower)
# Save metadata
self._save_uploaded_speakers()
logger.info(f"Uploaded new voice '{name}' with consent ID '{consent}'")
return {
"name": name,
"consent": consent,
"file_path": str(file_path),
"created_at": timestamp,
"mime_type": mime_type,
"file_size": file_size
temp_file_path = self.uploaded_speakers_dir / f"{filename}.tmp"
# Save audio file to a temporary path first to avoid orphaned files
try:
content = await audio_file.read()
with open(temp_file_path, "wb") as f:
f.write(content)
# Update metadata in memory
self.uploaded_speakers[voice_name_lower] = {
"name": name,
"consent": consent,
"file_path": str(file_path),
"created_at": timestamp,
"mime_type": mime_type,
"original_filename": audio_file.filename,
"file_size": file_size,
}
# Update supported speakers
self.supported_speakers.add(voice_name_lower)
# Persist metadata
self._save_uploaded_speakers()
# Atomically move the temp file to its final location
os.replace(temp_file_path, file_path)
except Exception as e:
# Clean up temp file and roll back in-memory state on failure
try:
if isinstance(temp_file_path, Path):
if temp_file_path.exists():
temp_file_path.unlink()
else:
if os.path.exists(temp_file_path):
os.remove(temp_file_path)
except Exception:
# Best-effort cleanup; ignore secondary errors
pass
# Roll back any partially updated metadata
if hasattr(self, "uploaded_speakers"):
self.uploaded_speakers.pop(voice_name_lower, None)
if hasattr(self, "supported_speakers"):
try:
self.supported_speakers.discard(voice_name_lower)
except AttributeError:
# In case supported_speakers is not a set-like object
try:
self.supported_speakers.remove(voice_name_lower)
except Exception:
pass
raise ValueError(f"Failed to upload voice: {e}")
logger.info(f"Uploaded new voice '{name}' with consent ID '{consent}'")
return {
"name": name,
"consent": consent,
"file_path": str(file_path),
"created_at": timestamp,
"mime_type": mime_type,
"file_size": file_size,

        }
Comment on lines 170 to 273

Copilot AI Feb 5, 2026

The new voice upload functionality lacks test coverage. The existing test file tests/entrypoints/openai_api/test_serving_speech.py tests the list_voices endpoint but doesn't include tests for:

  1. The POST /v1/audio/voices upload endpoint
  2. File validation (size limits, MIME types)
  3. Voice name collision handling
  4. The auto-set ref_audio behavior for uploaded voices
  5. Error scenarios (disk full, invalid files, etc.)

Consider adding comprehensive tests for this new functionality.
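
For illustration, a size-limit test might look like the following sketch (the `speech_handler` fixture is hypothetical and `pytest-asyncio` is assumed):

```python
import io

import pytest
from starlette.datastructures import UploadFile


@pytest.mark.asyncio
async def test_upload_voice_rejects_oversized_file(speech_handler):
    # speech_handler is a hypothetical fixture returning OmniOpenAIServingSpeech
    oversized = io.BytesIO(b"\0" * (10 * 1024 * 1024 + 1))
    upload = UploadFile(file=oversized, filename="sample.wav")
    with pytest.raises(ValueError, match="exceeds maximum limit"):
        await speech_handler.upload_voice(upload, consent="consent-1", name="too_big")
```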

Comment on lines 267 to 273

Copilot AI Feb 5, 2026

The API returns the full server file path in the response (line 219 and line 883 in api_server.py). This is an information disclosure vulnerability as it exposes the server's internal directory structure to clients. Consider either:

  1. Not returning file_path at all (clients don't need it)
  2. Returning only the voice name/ID as an identifier
  3. Returning an opaque reference that doesn't reveal the actual path

This also applies to the 'uploaded_voices' response at line 826 in api_server.py which indirectly exposes paths through the metadata.


    def _is_tts_model(self) -> bool:
        """Check if the current model is a supported TTS model."""
        stage_list = getattr(self.engine_client, "stage_list", None)
@@ -94,7 +254,7 @@ def _validate_tts_request(self, request: OpenAICreateSpeechRequest) -> str | Non
            return f"Invalid speaker '{request.voice}'. Supported: {', '.join(sorted(self.supported_speakers))}"

        # Validate Base task requirements
        if task_type == "Base":
        if task_type == "Base" and request.voice is None:

Copilot AI Feb 5, 2026

The validation doesn't check if an uploaded voice file actually exists when using Base task with an uploaded voice. If task_type is "Base" and voice is an uploaded voice name, but the audio file is missing or unreadable, the auto-set logic at lines 320-325 will silently fail (returning None from _get_uploaded_audio_data), and the Base task will proceed without ref_audio, potentially causing downstream errors. Consider adding validation to ensure uploaded voices have accessible audio files, especially for Base task.

Suggested change
if task_type == "Base" and request.voice is None:
if task_type == "Base":
# Base task always requires explicit ref_audio to avoid relying on
# potentially failing auto-set logic from uploaded voices.

            if request.ref_audio is None:
                return "Base task requires 'ref_audio' for voice cloning"


P2: Require ref_audio for Base when voice isn't uploaded

The new Base-task validation only enforces ref_audio when voice is missing, so a request like task_type=Base with a built-in speaker name but no ref_audio now passes validation. In that case _build_tts_params will send no ref_audio to the model (because the auto-fill only happens for uploaded voices), which breaks the Base task’s voice-cloning requirement and likely yields a model error or incorrect output. Consider requiring ref_audio unless voice refers to an uploaded speaker that will be auto-populated.
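
A sketch of that variant of the check (assuming `self.uploaded_speakers` is available here, as it is elsewhere in this class):

```python
# Validate Base task requirements
if task_type == "Base":
    voice_is_uploaded = (
        request.voice is not None
        and request.voice.lower() in self.uploaded_speakers
    )
    if request.ref_audio is None and not voice_is_uploaded:
        return "Base task requires 'ref_audio' for voice cloning"
```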


        # Validate ref_audio format
@@ -155,6 +315,14 @@ def _build_tts_params(self, request: OpenAICreateSpeechRequest) -> dict[str, Any
        # Speaker (voice)
        if request.voice is not None:
            params["speaker"] = [request.voice]

            # If voice is an uploaded speaker and no ref_audio provided, auto-set it
            if request.voice.lower() in self.uploaded_speakers and request.ref_audio is None:
                audio_data = self._get_uploaded_audio_data(request.voice)
                if audio_data:
                    params["ref_audio"] = [audio_data]
                    params["x_vector_only_mode"] = [True]
                    logger.info(f"Auto-set ref_audio for uploaded voice: {request.voice}")
Comment on lines +379 to +385

Copilot AI Feb 5, 2026

The auto-set logic modifies the request parameters silently. When an uploaded voice is used, ref_audio and x_vector_only_mode are automatically set without informing the user. This could cause confusion if a user explicitly passes ref_audio with an uploaded voice - the user's ref_audio will be silently ignored. Consider:

  1. Logging a warning if user provides ref_audio for an uploaded voice
  2. Documenting this auto-set behavior clearly
  3. Only auto-setting if both voice is uploaded AND ref_audio is None (which is already done, but should be clarified)

        elif params["task_type"][0] == "CustomVoice":
            params["speaker"] = ["Vivian"]  # Default for CustomVoice
