# Qwen3-TTS

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/qwen3_tts>.


## 🛠️ Installation

Please refer to [README.md](https://github.com/vllm-project/vllm-omni/tree/main/README.md) for installation instructions.

## Supported Models

| Model | Task Type | Description |
|-------|-----------|-------------|
| `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | CustomVoice | Predefined speaker voices with optional style control |
| `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | VoiceDesign | Natural language voice style description |
| `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | Base | Voice cloning from reference audio |
| `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | CustomVoice | Smaller/faster variant |
| `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | Base | Smaller/faster variant for voice cloning |

## Run examples (Qwen3-TTS)

### Launch the Server

```bash
# CustomVoice model (predefined speakers)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
    --omni --port 8091 --trust-remote-code --enforce-eager

# VoiceDesign model
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
    --omni --port 8091 --trust-remote-code --enforce-eager

# Base model (voice cloning)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
    --omni --port 8091 --trust-remote-code --enforce-eager
```

If you have a custom stage configs file, launch the server with the command below:

```bash
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --stage-configs-path /path/to/stage_configs_file \
    --omni --port 8091 --trust-remote-code --enforce-eager
```

Alternatively, use the convenience script:
```bash
./run_server.sh # Default: CustomVoice model
./run_server.sh CustomVoice # CustomVoice model
./run_server.sh VoiceDesign # VoiceDesign model
./run_server.sh Base # Base (voice clone) model
```

### Send TTS Request

Change into the example folder:
```bash
cd examples/online_serving/qwen3_tts
```

#### Send request via python

```bash
# CustomVoice: Use predefined speaker
python openai_speech_client.py \
    --text "你好,我是通义千问" \
    --voice vivian \
    --language Chinese

# CustomVoice with style instruction
python openai_speech_client.py \
    --text "今天天气真好" \
    --voice ryan \
    --instructions "用开心的语气说"

# VoiceDesign: Describe the voice style (illustrative text and description)
python openai_speech_client.py \
    --task-type VoiceDesign \
    --text "Hello, welcome to our service" \
    --instructions "A warm, gentle female voice, speaking at a moderate pace"

# Base: Voice cloning from reference audio (placeholder path)
python openai_speech_client.py \
    --task-type Base \
    --text "Text to speak in the cloned voice" \
    --ref-audio /path/to/reference.wav \
    --ref-text "Original transcript of the reference audio"
```

The Python client supports the following command-line arguments (a combined example follows the list):

- `--api-base`: API base URL (default: `http://localhost:8091`)
- `--model` (or `-m`): Model name/path (default: `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`)
- `--task-type` (or `-t`): TTS task type. Options: `CustomVoice`, `VoiceDesign`, `Base`
- `--text`: Text to synthesize (required)
- `--voice`: Speaker/voice name (default: `vivian`). Options: `vivian`, `ryan`, `aiden`, etc.
- `--language`: Language. Options: `Auto`, `Chinese`, `English`, `Japanese`, `Korean`, `German`, `French`, `Russian`, `Portuguese`, `Spanish`, `Italian`
- `--instructions`: Voice style/emotion instructions
- `--ref-audio`: Reference audio file path or URL for voice cloning (Base task)
- `--ref-text`: Reference audio transcript for voice cloning (Base task)
- `--response-format`: Audio output format (default: `wav`). Options: `wav`, `mp3`, `flac`, `pcm`, `aac`, `opus`
- `--output` (or `-o`): Output audio file path (default: `tts_output.wav`)
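
For instance, several of these flags can be combined in a single call; a voice-cloning invocation might look like this (the reference path and texts are placeholders):

```bash
python openai_speech_client.py \
    --task-type Base \
    --text "Text to speak in the cloned voice" \
    --ref-audio /path/to/reference.wav \
    --ref-text "Transcript of the reference audio" \
    --response-format mp3 \
    --output cloned.mp3
```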

#### Send request via curl

```bash
# Simple TTS request
curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Hello, how are you?",
        "voice": "vivian",
        "language": "English"
    }' --output output.wav

# With style instruction
curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "I am so excited!",
        "voice": "vivian",
        "instructions": "Speak with great enthusiasm"
    }' --output excited.wav

# List available voices in CustomVoice models
curl http://localhost:8091/v1/audio/voices
```
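
Voice cloning also works over raw HTTP. A sketch, assuming the server is running a Base model and the reference audio URL is reachable (placeholder URL):

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Text to speak in the cloned voice",
        "task_type": "Base",
        "ref_audio": "https://example.com/reference.wav",
        "ref_text": "Transcript of the reference audio"
    }' --output cloned.wav
```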

### Using OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")

response = client.audio.speech.create(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    voice="vivian",
    input="Hello, how are you?",
)

response.stream_to_file("output.wav")
```

### Using Python httpx

```python
import httpx

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={
        "input": "Hello, how are you?",
        "voice": "vivian",
        "language": "English",
    },
    timeout=300.0,
)

with open("output.wav", "wb") as f:
    f.write(response.content)
```

### FAQ

If you encounter an error about librosa's audio backend, install ffmpeg:

```bash
sudo apt update
sudo apt install ffmpeg
```

## API Reference

### Speech Endpoint

```
POST /v1/audio/speech
Content-Type: application/json
```

This endpoint follows the [OpenAI Audio Speech API](https://platform.openai.com/docs/api-reference/audio/createSpeech) format with additional Qwen3-TTS parameters.

### Voices Endpoint

```
GET /v1/audio/voices
```

Lists available voices for the loaded model:

```json
{
"voices": ["aiden", "dylan", "eric", "one_anna", "ryan", "serena", "sohee", "uncle_fu", "vivian"]
}
```
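
A client can query this endpoint before synthesizing to pick a valid speaker; a minimal sketch using httpx, matching the examples above:

```python
import httpx

# Fetch the speaker list exposed by the loaded model.
voices = httpx.get("http://localhost:8091/v1/audio/voices").json()["voices"]
print(voices)
```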

### Request Body

```json
{
  "input": "Text to synthesize",
  "voice": "vivian",
  "response_format": "wav",
  "task_type": "CustomVoice",
  "language": "Auto",
  "speed": 1.0,
  "instructions": "",
  "max_new_tokens": 2048,
  "ref_audio": "https://example.com/reference.wav",
  "ref_text": "Transcript of the reference audio",
  "x_vector_only_mode": false
}
```

### Response

Returns binary audio data with appropriate `Content-Type` header (e.g., `audio/wav`).

## Parameters

### OpenAI Standard Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input` | string | **required** | Text to synthesize |
| `model` | string | server's model | Model to use (optional, should match server if specified) |
| `voice` | string | "vivian" | Speaker name (e.g., vivian, ryan, aiden) |
| `response_format` | string | "wav" | Audio format: wav, mp3, flac, pcm, aac, opus |
| `speed` | float | 1.0 | Playback speed (0.25-4.0) |
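
These fields map directly onto the OpenAI SDK call shown earlier; for example (format and speed values chosen arbitrarily, within the ranges in the table):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")

response = client.audio.speech.create(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    voice="ryan",
    input="Hello, how are you?",
    response_format="mp3",  # wav, mp3, flac, pcm, aac, opus
    speed=1.25,             # 0.25-4.0
)

response.stream_to_file("output.mp3")
```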

### vLLM-Omni Extension Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `task_type` | string | "CustomVoice" | Task: CustomVoice, VoiceDesign, or Base |
| `language` | string | "Auto" | Language: Auto, Chinese, English, Japanese, Korean |
| `language` | string | "Auto" | Language (see supported languages below) |
| `instructions` | string | "" | Voice style/emotion instructions |
| `max_new_tokens` | int | 2048 | Maximum tokens to generate |

**Supported languages:** Auto, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
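
Since these parameters are vLLM-Omni extensions, they are passed as extra JSON fields in the request body; a minimal httpx sketch for a VoiceDesign request (the instruction text is illustrative):

```python
import httpx

# VoiceDesign request: the voice is described in natural language.
response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={
        "input": "Welcome aboard!",
        "task_type": "VoiceDesign",
        "language": "English",
        "instructions": "A calm, low-pitched male voice, speaking slowly",
        "max_new_tokens": 2048,
    },
    timeout=300.0,
)

with open("voicedesign.wav", "wb") as f:
    f.write(response.content)
```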

### Voice Clone Parameters (Base task)

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `ref_audio` | string | **Yes** | Reference audio (URL or base64 data URL) |
| `ref_text` | string | No | Transcript of reference audio (for ICL mode) |
| `x_vector_only_mode` | bool | No | Use speaker embedding only (no ICL) |
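
For voice cloning without a hosted file, `ref_audio` can carry the reference clip inline. A minimal sketch, assuming the server accepts a standard `data:` URL with base64 audio (the local path is a placeholder):

```python
import base64
import httpx

# Read a local reference clip and wrap it in a base64 data URL (assumed format).
with open("reference.wav", "rb") as f:
    ref_audio = "data:audio/wav;base64," + base64.b64encode(f.read()).decode()

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={
        "input": "Text to speak in the cloned voice",
        "task_type": "Base",
        "ref_audio": ref_audio,
        "ref_text": "Transcript of the reference clip",
    },
    timeout=300.0,
)

with open("cloned.wav", "wb") as f:
    f.write(response.content)
```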

## Limitations


## Troubleshooting

1. **"TTS model did not produce audio output"**: Ensure you're using the correct model variant for your task type (CustomVoice task → CustomVoice model, etc.)
2. **Connection refused**: Make sure the server is running on the correct port (see the check below the list)
3. **Out of memory**: Use a smaller model variant (`Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`) or reduce `--gpu-memory-utilization`
4. **Unsupported speaker**: Use `/v1/audio/voices` to list available voices for the loaded model
5. **Voice clone fails**: Ensure you're using the Base model variant for voice cloning
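
For item 2, a quick way to confirm the server is up is to query the OpenAI-compatible models endpoint (assuming vLLM-Omni exposes it like vLLM does):

```bash
curl http://localhost:8091/v1/models
```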

## Example materials
