
Commit 7ffb87a

marksverdhei and claude committed
feat(tts): integrate voice upload API from upstream PR vllm-project#1201
Port the voice upload API (POST /v1/audio/voices) from upstream vllm-project#1201 into the HT branch, adapted to coexist with HT's existing streaming and audio extraction changes.

- Add upload_voice(), _load/_save_uploaded_speakers() to serving_speech
- Add POST /v1/audio/voices endpoint to api_server
- Modify GET /v1/audio/voices to include uploaded voice details
- Auto-set ref_audio for uploaded voices in Base task
- Add docs/serving/speech_api.md documentation

Note: Known upstream review issues (path traversal, metadata locking, validation bypass for built-in voices) are carried as-is for parity and will be addressed in a follow-up.

Upstream-PR: vllm-project#1201
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 3cd7943 commit 7ffb87a

File tree

4 files changed: +581 −5 lines changed


docs/serving/speech_api.md

Lines changed: 292 additions & 0 deletions
@@ -0,0 +1,292 @@
# Speech API

vLLM-Omni provides an OpenAI-compatible API for text-to-speech (TTS) generation using Qwen3-TTS models.

Each server instance runs a single model (specified at startup via `vllm serve <model> --omni`).

## Quick Start

### Start the Server

```bash
# CustomVoice model (predefined speakers)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --port 8000 --trust-remote-code --enforce-eager
```

### Generate Speech

**Using curl:**

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "voice": "vivian",
    "language": "English"
  }' --output output.wav
```

**Using Python:**

```python
import httpx

response = httpx.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Hello, how are you?",
        "voice": "vivian",
        "language": "English",
    },
    timeout=300.0,
)

with open("output.wav", "wb") as f:
    f.write(response.content)
```

**Using OpenAI SDK:**

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.audio.speech.create(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    voice="vivian",
    input="Hello, how are you?",
)

response.stream_to_file("output.wav")
```
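For long inputs, the OpenAI SDK can also write the response body to disk chunk by chunk instead of buffering the whole file in memory. A minimal sketch using the SDK's streaming-response helper (same request as above; whether audio is produced incrementally depends on the server):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Stream the HTTP response body to disk rather than holding it all in memory.
with client.audio.speech.with_streaming_response.create(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    voice="vivian",
    input="Hello, how are you?",
) as response:
    response.stream_to_file("output.wav")
```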
## API Reference

### Endpoint

```
POST /v1/audio/speech
Content-Type: application/json
```

### Request Parameters

#### OpenAI Standard Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input` | string | **required** | The text to synthesize into speech |
| `model` | string | server's model | Model to use (optional, should match server if specified) |
| `voice` | string | "vivian" | Speaker name (e.g., vivian, ryan, aiden) |
| `response_format` | string | "wav" | Audio format: wav, mp3, flac, pcm, aac, opus |
| `speed` | float | 1.0 | Playback speed (0.25-4.0) |

#### vLLM-Omni Extension Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `task_type` | string | "CustomVoice" | TTS task type: CustomVoice, VoiceDesign, or Base |
| `language` | string | "Auto" | Language (see supported languages below) |
| `instructions` | string | "" | Voice style/emotion instructions |
| `max_new_tokens` | integer | 2048 | Maximum tokens to generate |

**Supported languages:** Auto, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
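To illustrate how the extension parameters combine with the standard ones, here is a sketch using the same `httpx` pattern as the Quick Start; the parameter values are illustrative only:

```python
import httpx

response = httpx.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        # OpenAI-standard parameters
        "input": "Welcome aboard, and thanks for joining us today.",
        "voice": "ryan",
        "response_format": "mp3",
        "speed": 1.2,
        # vLLM-Omni extension parameters
        "task_type": "CustomVoice",
        "language": "English",
        "instructions": "Calm and professional",
        "max_new_tokens": 2048,
    },
    timeout=300.0,
)

with open("output.mp3", "wb") as f:
    f.write(response.content)
```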
#### Voice Clone Parameters (Base task)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `ref_audio` | string | null | Reference audio (URL or base64 data URL) |
| `ref_text` | string | null | Transcript of reference audio |
| `x_vector_only_mode` | bool | null | Use speaker embedding only (no ICL) |
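Because `ref_audio` accepts a base64 data URL as well as an HTTP URL, a local reference clip can be embedded directly in the request. A minimal sketch (the file path and transcript are placeholders, and the `data:audio/wav;base64,` prefix assumes a WAV clip):

```python
import base64
import httpx

# Encode a local reference clip as a data URL for the ref_audio field.
with open("reference.wav", "rb") as f:
    ref_audio = "data:audio/wav;base64," + base64.b64encode(f.read()).decode()

response = httpx.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Hello, this is a cloned voice",
        "task_type": "Base",
        "ref_audio": ref_audio,
        "ref_text": "Transcript of the reference clip",
    },
    timeout=300.0,
)

with open("cloned.wav", "wb") as f:
    f.write(response.content)
```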
### Response Format

Returns binary audio data with appropriate `Content-Type` header (e.g., `audio/wav`).

### Voices Endpoint

```
GET /v1/audio/voices
```

Lists available voices for the loaded model.

```json
{
  "voices": ["aiden", "dylan", "eric", "ono_anna", "ryan", "serena", "sohee", "uncle_fu", "vivian"]
}
```

```
POST /v1/audio/voices
Content-Type: multipart/form-data
```

Upload a new voice sample for voice cloning in Base task TTS requests.

**Form Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `audio_sample` | file | Yes | Audio file (max 10MB, supported formats: wav, mp3, flac, ogg, aac, webm, mp4) |
| `consent` | string | Yes | Consent recording ID |
| `name` | string | Yes | Name for the new voice |

**Response Example:**

```json
{
  "success": true,
  "voice": {
    "name": "custom_voice_1",
    "consent": "user_consent_id",
    "file_path": "/tmp/voice_samples/custom_voice_1_user_consent_id_1738660000.wav",
    "created_at": 1738660000,
    "mime_type": "audio/wav",
    "file_size": 1024000
  }
}
```

**Usage Example:**

```bash
curl -X POST http://localhost:8000/v1/audio/voices \
  -F "audio_sample=@/path/to/voice_sample.wav" \
  -F "consent=user_consent_id" \
  -F "name=custom_voice_1"
```
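The same upload can be issued from Python with `httpx`'s multipart support. Per the commit notes, an uploaded voice can then be referenced by name in a Base-task speech request (its `ref_audio` is auto-set server-side); the sketch below assumes that behaviour, and the file path and names are placeholders:

```python
import httpx

# Upload a reference clip as a new named voice.
with open("/path/to/voice_sample.wav", "rb") as f:
    upload = httpx.post(
        "http://localhost:8000/v1/audio/voices",
        files={"audio_sample": ("voice_sample.wav", f, "audio/wav")},
        data={"consent": "user_consent_id", "name": "custom_voice_1"},
        timeout=60.0,
    )
print(upload.json())

# Reference the uploaded voice by name in a Base-task request
# (ref_audio is assumed to be filled in automatically for uploaded voices).
speech = httpx.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Hello from my uploaded voice",
        "task_type": "Base",
        "voice": "custom_voice_1",
    },
    timeout=300.0,
)

with open("uploaded_voice.wav", "wb") as f:
    f.write(speech.content)
```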
## Examples

### CustomVoice with Style Instruction

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "I am so excited!",
    "voice": "vivian",
    "instructions": "Speak with great enthusiasm"
  }' --output excited.wav
```

### VoiceDesign (Natural Language Voice Description)

```bash
# Start server with VoiceDesign model first
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --port 8000 --trust-remote-code --enforce-eager
```

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello world",
    "task_type": "VoiceDesign",
    "instructions": "A warm, friendly female voice with a gentle tone"
  }' --output designed.wav
```

### Base (Voice Cloning)

```bash
# Start server with Base model first
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --port 8000 --trust-remote-code --enforce-eager
```

```bash
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, this is a cloned voice",
    "task_type": "Base",
    "ref_audio": "https://example.com/reference.wav",
    "ref_text": "Original transcript of the reference audio"
  }' --output cloned.wav
```
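To clone from the speaker embedding alone (the `x_vector_only_mode` flag from the parameter table), the same request can be sent with that flag set; a minimal sketch in Python, reusing the placeholder reference URL from above:

```python
import httpx

response = httpx.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": "Hello, this is a cloned voice",
        "task_type": "Base",
        "ref_audio": "https://example.com/reference.wav",
        "ref_text": "Original transcript of the reference audio",
        # Clone from the speaker embedding only, without in-context learning.
        "x_vector_only_mode": True,
    },
    timeout=300.0,
)

with open("cloned_xvector.wav", "wb") as f:
    f.write(response.content)
```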
## Supported Models

| Model | Task Type | Description |
|-------|-----------|-------------|
| `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | CustomVoice | Predefined speaker voices with optional style control |
| `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | VoiceDesign | Natural language voice style description |
| `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | Base | Voice cloning from reference audio |
| `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | CustomVoice | Smaller/faster variant |
| `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | Base | Smaller/faster variant for voice cloning |

## Error Responses

### 400 Bad Request

Invalid parameters:

```json
{
  "error": {
    "message": "Input text cannot be empty",
    "type": "BadRequestError",
    "param": null,
    "code": 400
  }
}
```

### 404 Not Found

Model not found:

```json
{
  "error": {
    "message": "The model `xxx` does not exist.",
    "type": "NotFoundError",
    "param": "model",
    "code": 404
  }
}
```
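Because error responses are JSON while successful responses are binary audio, clients should branch on the status code before writing the body to disk. A minimal sketch (the empty `input` deliberately triggers the 400 shown above):

```python
import httpx

response = httpx.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "", "voice": "vivian"},
    timeout=300.0,
)

if response.status_code == 200:
    with open("output.wav", "wb") as f:
        f.write(response.content)
else:
    # Error bodies follow the {"error": {...}} structure shown above.
    err = response.json()["error"]
    print(f"{response.status_code} {err['type']}: {err['message']}")
```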
## Troubleshooting

### "TTS model did not produce audio output"

Ensure you're using the correct model variant for your task type:

- CustomVoice task → CustomVoice model
- VoiceDesign task → VoiceDesign model
- Base task → Base model

### Server Not Running

```bash
# Check if server is responding
curl http://localhost:8000/v1/audio/voices
```

### Out of Memory

If you encounter OOM errors:

1. Use a smaller model variant: `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`
2. Reduce `--gpu-memory-utilization`

### Unsupported Speaker

Use `/v1/audio/voices` to list available voices for the loaded model.

## Development

Enable debug logging:

```bash
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --uvicorn-log-level debug
```

examples/online_serving/qwen3_tts/README.md

Lines changed: 52 additions & 2 deletions
@@ -82,12 +82,62 @@ curl http://localhost:8000/v1/audio/voices
## API Reference

### Endpoints

#### GET /v1/audio/voices

List all available voices/speakers from the loaded model, including both built-in model voices and uploaded custom voices.

**Response Example:**
```json
{
  "voices": ["vivian", "ryan", "custom_voice_1"],
  "uploaded_voices": [
    {
      "name": "custom_voice_1",
      "consent": "user_consent_id",
      "created_at": 1738660000,
      "file_size": 1024000,
      "mime_type": "audio/wav"
    }
  ]
}
```
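To inspect the uploaded entries programmatically, a short sketch that fetches the listing and prints the metadata fields shown above:

```python
import httpx

resp = httpx.get("http://localhost:8000/v1/audio/voices", timeout=30.0)
data = resp.json()

print("Available voices:", ", ".join(data["voices"]))
for voice in data.get("uploaded_voices", []):
    print(f"uploaded: {voice['name']} ({voice['mime_type']}, {voice['file_size']} bytes)")
```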
#### POST /v1/audio/voices

Upload a new voice sample for voice cloning in Base task TTS requests.

**Form Parameters:**
- `audio_sample` (required): Audio file (max 10MB, supported formats: wav, mp3, flac, ogg, aac, webm, mp4)
- `consent` (required): Consent recording ID
- `name` (required): Name for the new voice

**Response Example:**
```json
{
  "success": true,
  "voice": {
    "name": "custom_voice_1",
    "consent": "user_consent_id",
    "created_at": 1738660000,
    "mime_type": "audio/wav",
    "file_size": 1024000
  }
}
```

**Usage Example:**
```bash
curl -X POST http://localhost:8000/v1/audio/voices \
  -F "audio_sample=@/path/to/voice_sample.wav" \
  -F "consent=user_consent_id" \
  -F "name=custom_voice_1"
```

#### POST /v1/audio/speech

This endpoint follows the [OpenAI Audio Speech API](https://platform.openai.com/docs/api-reference/audio/createSpeech) format with additional Qwen3-TTS parameters.

### Request Body
