19 changes: 19 additions & 0 deletions .buildkite/pipeline.yml
@@ -240,6 +240,25 @@ steps:
          volumes:
            - "/fsx/hf_cache:/fsx/hf_cache"

  - label: "Qwen3-TTS E2E Test"
    timeout_in_minutes: 15
    depends_on: image-build
    commands:
      - export VLLM_LOGGING_LEVEL=DEBUG
      - export VLLM_WORKER_MULTIPROC_METHOD=spawn
      - pytest -s -v tests/e2e/online_serving/test_qwen3_tts.py
    agents:
      queue: "gpu_4_queue"
    plugins:
      - docker#v5.2.0:
          image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
          always-pull: true
          propagate-environment: true
          environment:
            - "HF_HOME=/fsx/hf_cache"
          volumes:
            - "/fsx/hf_cache:/fsx/hf_cache"

  # - label: "Omni Model Test with H100"
  #   timeout_in_minutes: 30
  #   depends_on: image-build
1 change: 1 addition & 0 deletions docs/.nav.yml
@@ -8,6 +8,7 @@ nav:
  - OpenAI-Compatible API:
    - Image Generation: serving/image_generation_api.md
    - Image Edit: serving/image_edit_api.md
    - Text to Speech: serving/speech_api.md
  - Examples:
    - examples/README.md
    - Offline Inference:
254 changes: 254 additions & 0 deletions docs/serving/speech_api.md
@@ -0,0 +1,254 @@
# Speech API

vLLM-Omni provides an OpenAI-compatible API for text-to-speech (TTS) generation using Qwen3-TTS models.

Each server instance runs a single model (specified at startup via `vllm serve <model> --omni`).

## Quick Start

### Start the Server

```bash
# CustomVoice model (predefined speakers)
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --port 8091 --trust-remote-code --enforce-eager
```

### Generate Speech

**Using curl:**

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "voice": "vivian",
    "language": "English"
  }' --output output.wav
```

**Using Python:**

```python
import httpx

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={
        "input": "Hello, how are you?",
        "voice": "vivian",
        "language": "English",
    },
    timeout=300.0,
)

with open("output.wav", "wb") as f:
    f.write(response.content)
```

**Using OpenAI SDK:**

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")

response = client.audio.speech.create(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    voice="vivian",
    input="Hello, how are you?",
)

response.stream_to_file("output.wav")
```
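
Recent versions of the OpenAI SDK deprecate calling `stream_to_file` on the plain response object. A streaming variant, assuming `openai>=1.x`, looks like this:

```python
# Stream the audio to disk as it arrives (avoids the deprecation warning
# on the non-streaming response object).
with client.audio.speech.with_streaming_response.create(
    model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    voice="vivian",
    input="Hello, how are you?",
) as response:
    response.stream_to_file("output.wav")
```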

## API Reference

### Endpoint

```
POST /v1/audio/speech
Content-Type: application/json
```

### Request Parameters

#### OpenAI Standard Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input` | string | **required** | The text to synthesize into speech |
| `model` | string | server's model | Model to use (optional; if specified, it should match the server's loaded model) |
| `voice` | string | "vivian" | Speaker name (e.g., vivian, ryan, aiden) |
| `response_format` | string | "wav" | Audio format: wav, mp3, flac, pcm, aac, opus |
| `speed` | float | 1.0 | Playback speed (0.25-4.0) |
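
For example, a request combining the standard parameters (all names are from the table above; `mp3` output assumes the server's codec support):

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "voice": "vivian",
    "response_format": "mp3",
    "speed": 1.25
  }' --output output.mp3
```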

#### vLLM-Omni Extension Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `task_type` | string | "CustomVoice" | TTS task type: CustomVoice, VoiceDesign, or Base |
| `language` | string | "Auto" | Language (see supported languages below) |
| `instructions` | string | "" | Voice style/emotion instructions |
| `max_new_tokens` | integer | 2048 | Maximum tokens to generate |

**Supported languages:** Auto, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
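
As a sketch, a request pinning the language and capping generation length (whether a given voice supports a given language depends on the loaded model):

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Guten Tag, wie geht es Ihnen?",
    "voice": "vivian",
    "language": "German",
    "max_new_tokens": 1024
  }' --output german.wav
```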

#### Voice Clone Parameters (Base task)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `ref_audio` | string | null | Reference audio (URL or base64 data URL) |
| `ref_text` | string | null | Transcript of reference audio |
| `x_vector_only_mode` | bool | null | Use the speaker embedding only (no in-context learning, ICL) |
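
A request body for embedding-only cloning might look like the sketch below; omitting `ref_text` in this mode is an assumption, since in-context learning from the transcript is skipped:

```json
{
  "input": "Hello, this is a cloned voice",
  "task_type": "Base",
  "ref_audio": "https://example.com/reference.wav",
  "x_vector_only_mode": true
}
```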

### Response Format

Returns binary audio data with appropriate `Content-Type` header (e.g., `audio/wav`).
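
To check which `Content-Type` the server actually returned, you can dump the response headers while saving the body (a sketch using plain curl):

```bash
# -D - prints response headers to stdout; the audio still goes to the file
curl -s -D - -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello", "voice": "vivian"}' \
  --output output.wav
```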

### Voices Endpoint

```
GET /v1/audio/voices
```

Lists available voices for the loaded model.
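
```bash
curl http://localhost:8091/v1/audio/voices
```

Example response: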

```json
{
  "voices": ["aiden", "dylan", "eric", "ono_anna", "ryan", "serena", "sohee", "uncle_fu", "vivian"]
}
```

## Examples

### CustomVoice with Style Instruction

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "I am so excited!",
    "voice": "vivian",
    "instructions": "Speak with great enthusiasm"
  }' --output excited.wav
```

### VoiceDesign (Natural Language Voice Description)

```bash
# Start server with VoiceDesign model first
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --port 8091 --trust-remote-code --enforce-eager
```

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello world",
    "task_type": "VoiceDesign",
    "instructions": "A warm, friendly female voice with a gentle tone"
  }' --output designed.wav
```

### Base (Voice Cloning)

```bash
# Start server with Base model first
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --port 8091 --trust-remote-code --enforce-eager
```

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, this is a cloned voice",
    "task_type": "Base",
    "ref_audio": "https://example.com/reference.wav",
    "ref_text": "Original transcript of the reference audio"
  }' --output cloned.wav
```
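
`ref_audio` also accepts a base64 data URL, which avoids hosting the reference file. A minimal sketch (the `audio/wav` MIME prefix is an assumption based on the data-URL convention):

```python
import base64

import httpx

# Encode a local reference recording as a data URL.
with open("reference.wav", "rb") as f:
    ref_audio = "data:audio/wav;base64," + base64.b64encode(f.read()).decode()

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={
        "input": "Hello, this is a cloned voice",
        "task_type": "Base",
        "ref_audio": ref_audio,
        "ref_text": "Original transcript of the reference audio",
    },
    timeout=300.0,
)

with open("cloned.wav", "wb") as f:
    f.write(response.content)
```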

## Supported Models

| Model | Task Type | Description |
|-------|-----------|-------------|
| `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | CustomVoice | Predefined speaker voices with optional style control |
| `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | VoiceDesign | Natural language voice style description |
| `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | Base | Voice cloning from reference audio |
| `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | CustomVoice | Smaller/faster variant |
| `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | Base | Smaller/faster variant for voice cloning |

## Error Responses

### 400 Bad Request

Invalid parameters:

```json
{
  "error": {
    "message": "Input text cannot be empty",
    "type": "BadRequestError",
    "param": null,
    "code": 400
  }
}
```

### 404 Not Found

Model not found:

```json
{
  "error": {
    "message": "The model `xxx` does not exist.",
    "type": "NotFoundError",
    "param": "model",
    "code": 404
  }
}
```
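
Because errors come back as JSON while successful responses are binary audio, client code can branch on the status code. A sketch with httpx:

```python
import httpx

response = httpx.post(
    "http://localhost:8091/v1/audio/speech",
    json={"input": "", "voice": "vivian"},  # empty input should trigger a 400
    timeout=300.0,
)

if response.status_code == 200:
    with open("output.wav", "wb") as f:
        f.write(response.content)
else:
    # Error payloads follow the shape shown above.
    print(response.json()["error"]["message"])
```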

## Troubleshooting

### "TTS model did not produce audio output"

Ensure you're using the correct model variant for your task type:
- CustomVoice task → CustomVoice model
- VoiceDesign task → VoiceDesign model
- Base task → Base model

### Server Not Running

```bash
# Check if server is responding
curl http://localhost:8091/v1/audio/voices
```

### Out of Memory

If you encounter OOM errors:
1. Use a smaller model variant: `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`
2. Reduce `--gpu-memory-utilization`, as in the sketch below
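
For example (the `0.85` value is an illustrative starting point, not a tuned recommendation):

```bash
vllm serve Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --port 8091 --trust-remote-code --enforce-eager \
  --gpu-memory-utilization 0.85
```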

### Unsupported Speaker

Use `/v1/audio/voices` to list available voices for the loaded model.

## Development

Enable debug logging:

```bash
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --omni --uvicorn-log-level debug
```
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -199,5 +199,5 @@ extend-ignore-identifiers-re = [
    ".*nothink.*",
    ".*NOTHINK.*",
    ".*nin.*",
    "Ono_Anna",
    ".*[Oo]no_[Aa]nna.*",
]