19 changes: 19 additions & 0 deletions .buildkite/pipeline.yml
@@ -240,6 +240,25 @@ steps:
          volumes:
            - "/fsx/hf_cache:/fsx/hf_cache"

  - label: "Qwen3-TTS E2E Test"
    timeout_in_minutes: 15
    depends_on: image-build
    commands:
      - export VLLM_LOGGING_LEVEL=DEBUG
      - export VLLM_WORKER_MULTIPROC_METHOD=spawn
      - pytest -s -v tests/e2e/online_serving/test_qwen3_tts.py
    agents:
      queue: "gpu_4_queue"
    plugins:
      - docker#v5.2.0:
          image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
          always-pull: true
          propagate-environment: true
          environment:
            - "HF_HOME=/fsx/hf_cache"
          volumes:
            - "/fsx/hf_cache:/fsx/hf_cache"

  # - label: "Omni Model Test with H100"
  #   timeout_in_minutes: 30
  #   depends_on: image-build
1 change: 1 addition & 0 deletions docs/.nav.yml
@@ -8,6 +8,7 @@ nav:
  - OpenAI-Compatible API:
    - Image Generation: serving/image_generation_api.md
    - Image Edit: serving/image_edit_api.md
    - Text to Speech: serving/speech_api.md
  - Examples:
    - examples/README.md
    - Offline Inference:
254 changes: 254 additions & 0 deletions docs/serving/speech_api.md
@@ -0,0 +1,254 @@
# Speech API

vLLM-Omni provides an OpenAI-compatible API for text-to-speech (TTS) generation using Qwen3-TTS models.

Each server instance runs a single model (specified at startup via `vllm serve <model> --omni`).

## Quick Start

### Start the Server

```bash
# CustomVoice model (predefined speakers)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --port 8091 --trust-remote-code --enforce-eager
```

### Generate Speech

**Using curl:**

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "voice": "vivian",
    "language": "English"
  }' --output output.wav
```

**Using Python:**

```python
import httpx

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={
        "input": "Hello, how are you?",
        "voice": "vivian",
        "language": "English",
    },
    timeout=300.0,
)

with open("output.wav", "wb") as f:
    f.write(response.content)
```

**Using OpenAI SDK:**

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")

response = client.audio.speech.create(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    voice="vivian",
    input="Hello, how are you?",
)

response.stream_to_file("output.wav")
```
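
Recent versions of the OpenAI SDK deprecate calling `stream_to_file` on the plain response object. A streaming variant, assuming `openai>=1.x`, looks like this:

```python
# Stream the audio to disk as it arrives (avoids the deprecation warning
# on the non-streaming response object).
with client.audio.speech.with_streaming_response.create(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    voice="vivian",
    input="Hello, how are you?",
) as response:
    response.stream_to_file("output.wav")
```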

## API Reference

### Endpoint

```
POST /v1/audio/speech
Content-Type: application/json
```

### Request Parameters

#### OpenAI Standard Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input` | string | **required** | The text to synthesize into speech |
| `model` | string | server's model | Model to use (optional; if specified, it should match the server's loaded model) |
| `voice` | string | "vivian" | Speaker name (e.g., vivian, ryan, aiden) |
| `response_format` | string | "wav" | Audio format: wav, mp3, flac, pcm, aac, opus |
| `speed` | float | 1.0 | Playback speed (0.25-4.0) |
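
For example, a request combining the standard parameters (all names are from the table above; `mp3` output assumes the server's codec support):

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "voice": "vivian",
    "response_format": "mp3",
    "speed": 1.25
  }' --output output.mp3
```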

#### vLLM-Omni Extension Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `task_type` | string | "CustomVoice" | TTS task type: CustomVoice, VoiceDesign, or Base |
| `language` | string | "Auto" | Language (see supported languages below) |
| `instructions` | string | "" | Voice style/emotion instructions |
| `max_new_tokens` | integer | 2048 | Maximum tokens to generate |

**Supported languages:** Auto, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
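
As a sketch, a request pinning the language and capping generation length (whether a given voice supports a given language depends on the loaded model):

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Guten Tag, wie geht es Ihnen?",
    "voice": "vivian",
    "language": "German",
    "max_new_tokens": 1024
  }' --output german.wav
```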

#### Voice Clone Parameters (Base task)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `ref_audio` | string | null | Reference audio (URL or base64 data URL) |
| `ref_text` | string | null | Transcript of reference audio |
| `x_vector_only_mode` | bool | null | Use the speaker embedding only (no in-context learning, ICL) |
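
A request body for embedding-only cloning might look like the sketch below; omitting `ref_text` in this mode is an assumption, since in-context learning from the transcript is skipped:

```json
{
  "input": "Hello, this is a cloned voice",
  "task_type": "Base",
  "ref_audio": "https://example.com/reference.wav",
  "x_vector_only_mode": true
}
```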

### Response Format

Returns binary audio data with appropriate `Content-Type` header (e.g., `audio/wav`).
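
To check which `Content-Type` the server actually returned, you can dump the response headers while saving the body (a sketch using plain curl):

```bash
# -D - prints response headers to stdout; the audio still goes to the file
curl -s -D - -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello", "voice": "vivian"}' \
  --output output.wav
```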

### Voices Endpoint

```
GET /v1/audio/voices
```

Lists available voices for the loaded model.
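
```bash
curl http://localhost:8091/v1/audio/voices
```

Example response: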

```json
{
  "voices": ["aiden", "dylan", "eric", "ono_anna", "ryan", "serena", "sohee", "uncle_fu", "vivian"]
}
```

## Examples

### CustomVoice with Style Instruction

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "I am so excited!",
    "voice": "vivian",
    "instructions": "Speak with great enthusiasm"
  }' --output excited.wav
```

### VoiceDesign (Natural Language Voice Description)

```bash
# Start server with VoiceDesign model first
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --port 8091 --trust-remote-code --enforce-eager
```

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello world",
    "task_type": "VoiceDesign",
    "instructions": "A warm, friendly female voice with a gentle tone"
  }' --output designed.wav
```

### Base (Voice Cloning)

```bash
# Start server with Base model first
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --port 8091 --trust-remote-code --enforce-eager
```

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, this is a cloned voice",
    "task_type": "Base",
    "ref_audio": "https://example.com/reference.wav",
    "ref_text": "Original transcript of the reference audio"
  }' --output cloned.wav
```
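
`ref_audio` also accepts a base64 data URL, which avoids hosting the reference file. A minimal sketch (the `audio/wav` MIME prefix is an assumption based on the data-URL convention):

```python
import base64

import httpx

# Encode a local reference recording as a data URL.
with open("reference.wav", "rb") as f:
    ref_audio = "data:audio/wav;base64," + base64.b64encode(f.read()).decode()

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={
        "input": "Hello, this is a cloned voice",
        "task_type": "Base",
        "ref_audio": ref_audio,
        "ref_text": "Original transcript of the reference audio",
    },
    timeout=300.0,
)

with open("cloned.wav", "wb") as f:
    f.write(response.content)
```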

## Supported Models

| Model | Task Type | Description |
|-------|-----------|-------------|
| `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | CustomVoice | Predefined speaker voices with optional style control |
| `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | VoiceDesign | Natural language voice style description |
| `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | Base | Voice cloning from reference audio |
| `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | CustomVoice | Smaller/faster variant |
| `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | Base | Smaller/faster variant for voice cloning |

## Error Responses

### 400 Bad Request

Invalid parameters:

```json
{
  "error": {
    "message": "Input text cannot be empty",
    "type": "BadRequestError",
    "param": null,
    "code": 400
  }
}
```

### 404 Not Found

Model not found:

```json
{
  "error": {
    "message": "The model `xxx` does not exist.",
    "type": "NotFoundError",
    "param": "model",
    "code": 404
  }
}
```
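
Because errors come back as JSON while successful responses are binary audio, client code can branch on the status code. A sketch with httpx:

```python
import httpx

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={"input": "", "voice": "vivian"},  # empty input should trigger a 400
    timeout=300.0,
)

if response.status_code == 200:
    with open("output.wav", "wb") as f:
        f.write(response.content)
else:
    # Error payloads follow the shape shown above.
    print(response.json()["error"]["message"])
```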

## Troubleshooting

### "TTS model did not produce audio output"

Ensure you're using the correct model variant for your task type:
- CustomVoice task → CustomVoice model
- VoiceDesign task → VoiceDesign model
- Base task → Base model

### Server Not Running

```bash
# Check if server is responding
curl http://localhost:8091/v1/audio/voices
```

### Out of Memory

If you encounter OOM errors:
1. Use a smaller model variant: `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`
2. Reduce `--gpu-memory-utilization`, as in the sketch below
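
For example (the `0.85` value is an illustrative starting point, not a tuned recommendation):

```bash
vllm serve Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --port 8091 --trust-remote-code --enforce-eager \
  --gpu-memory-utilization 0.85
```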

### Unsupported Speaker

Use `/v1/audio/voices` to list available voices for the loaded model.

## Development

Enable debug logging:

```bash
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --uvicorn-log-level debug
```
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -199,5 +199,5 @@ extend-ignore-identifiers-re = [
    ".*nothink.*",
    ".*NOTHINK.*",
    ".*nin.*",
    "Ono_Anna",
    ".*[Oo]no_[Aa]nna.*",
]