diff --git a/docs/user_guide/examples/online_serving/qwen3_tts.md b/docs/user_guide/examples/online_serving/qwen3_tts.md index 48bd58b00c..f899e362ee 100644 --- a/docs/user_guide/examples/online_serving/qwen3_tts.md +++ b/docs/user_guide/examples/online_serving/qwen3_tts.md @@ -1,9 +1,11 @@ -# Qwen3-TTS Online Serving +# Qwen3-TTS Source . -This directory contains examples for running Qwen3-TTS models with vLLM-Omni's online serving API. +## 🛠️ Installation + +Please refer to the [README.md](https://github.com/vllm-project/vllm-omni/tree/main/README.md) for installation instructions. ## Supported Models @@ -12,34 +14,77 @@ This directory contains examples for running Qwen3-TTS models with vLLM-Omni's o | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | CustomVoice | Predefined speaker voices with optional style control | | `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | VoiceDesign | Natural language voice style description | | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | Base | Voice cloning from reference audio | +| `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | CustomVoice | Smaller/faster variant | +| `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | Base | Smaller/faster variant for voice cloning | + +## Run examples (Qwen3-TTS) -## Quick Start +### Launch the Server + +```bash +# CustomVoice model (predefined speakers) +vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ + --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --omni \ + --port 8091 \ + --trust-remote-code \ + --enforce-eager + +# VoiceDesign model +vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign \ + --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --omni \ + --port 8091 \ + --trust-remote-code \ + --enforce-eager + +# Base model (voice cloning) +vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \ + --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --omni \ + --port 8091 \ + --trust-remote-code \ + --enforce-eager +``` -### 1. 
Start the Server +If you have a custom stage-configs file, launch the server with the command below: +```bash +vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ + --stage-configs-path /path/to/stage_configs_file \ + --omni \ + --port 8091 \ + --trust-remote-code \ + --enforce-eager +``` +Alternatively, use the convenience script: ```bash -# CustomVoice model (default) -./run_server.sh +./run_server.sh # Default: CustomVoice model +./run_server.sh CustomVoice # CustomVoice model +./run_server.sh VoiceDesign # VoiceDesign model +./run_server.sh Base # Base (voice clone) model +``` -# Or specify task type -./run_server.sh CustomVoice -./run_server.sh VoiceDesign -./run_server.sh Base +### Send TTS Request + +Navigate to the example folder: +```bash +cd examples/online_serving/qwen3_tts ``` -### 2. Run the Client +#### Send a request via Python ```bash # CustomVoice: Use predefined speaker python openai_speech_client.py \ --text "你好,我是通义千问" \ - --voice Vivian \ + --voice vivian \ --language Chinese # CustomVoice with style instruction python openai_speech_client.py \ --text "今天天气真好" \ - --voice Ryan \ + --voice ryan \ --instructions "用开心的语气说" # VoiceDesign: Describe the voice style @@ -58,29 +103,86 @@ python openai_speech_client.py \ --ref-text "Original transcript of the reference audio" ``` -### 3. Using curl +The Python client supports the following command-line arguments: + +- `--api-base`: API base URL (default: `http://localhost:8091`) +- `--model` (or `-m`): Model name/path (default: `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`) +- `--task-type` (or `-t`): TTS task type. Options: `CustomVoice`, `VoiceDesign`, `Base` +- `--text`: Text to synthesize (required) +- `--voice`: Speaker/voice name (default: `vivian`). Options: `vivian`, `ryan`, `aiden`, etc. +- `--language`: Language. 
Options: `Auto`, `Chinese`, `English`, `Japanese`, `Korean`, `German`, `French`, `Russian`, `Portuguese`, `Spanish`, `Italian` +- `--instructions`: Voice style/emotion instructions +- `--ref-audio`: Reference audio file path or URL for voice cloning (Base task) +- `--ref-text`: Reference audio transcript for voice cloning (Base task) +- `--response-format`: Audio output format (default: `wav`). Options: `wav`, `mp3`, `flac`, `pcm`, `aac`, `opus` +- `--output` (or `-o`): Output audio file path (default: `tts_output.wav`) + +#### Send a request via curl ```bash # Simple TTS request -curl -X POST http://localhost:8000/v1/audio/speech \ +curl -X POST http://localhost:8091/v1/audio/speech \ -H "Content-Type: application/json" \ -d '{ "input": "Hello, how are you?", - "voice": "Vivian", + "voice": "vivian", "language": "English" }' --output output.wav # With style instruction -curl -X POST http://localhost:8000/v1/audio/speech \ +curl -X POST http://localhost:8091/v1/audio/speech \ -H "Content-Type: application/json" \ -d '{ "input": "I am so excited!", - "voice": "Vivian", + "voice": "vivian", "instructions": "Speak with great enthusiasm" }' --output excited.wav # List available voices in CustomVoice models -curl http://localhost:8000/v1/audio/voices +curl http://localhost:8091/v1/audio/voices +``` + +### Using OpenAI SDK + +```python +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:8091/v1", api_key="none") + +response = client.audio.speech.create( + model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", + voice="vivian", + input="Hello, how are you?", +) + +response.stream_to_file("output.wav") +``` + +### Using Python httpx + +```python +import httpx + +response = httpx.post( + "http://localhost:8091/v1/audio/speech", + json={ + "input": "Hello, how are you?", + "voice": "vivian", + "language": "English", + }, + timeout=300.0, +) + +with open("output.wav", "wb") as f: + f.write(response.content) +``` + +### FAQ + +If you encounter an error about a missing librosa audio backend, install ffmpeg with the command below: +```bash +sudo apt update +sudo apt install ffmpeg ``` ## API Reference @@ -89,16 +191,31 @@ curl http://localhost:8000/v1/audio/voices ``` POST /v1/audio/speech +Content-Type: application/json ``` This endpoint follows the [OpenAI Audio Speech API](https://platform.openai.com/docs/api-reference/audio/createSpeech) format with additional Qwen3-TTS parameters. +### Voices Endpoint + +``` +GET /v1/audio/voices +``` + +Lists available voices for the loaded model: + +```json +{ + "voices": ["aiden", "dylan", "eric", "one_anna", "ryan", "serena", "sohee", "uncle_fu", "vivian"] +} +``` + ### Request Body ```json { "input": "Text to synthesize", - "voice": "Vivian", + "voice": "vivian", "response_format": "wav", "task_type": "CustomVoice", "language": "Auto", @@ -114,56 +231,38 @@ This endpoint follows the [OpenAI Audio Speech API](https://platform.openai.com/ ### Response -Returns audio data in the requested format (default: WAV). +Returns binary audio data with the appropriate `Content-Type` header (e.g., `audio/wav`). 
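For the Base (voice-clone) task, the `ref_audio` field accepts a URL or a base64 data URL. A minimal sketch of building such a data URL from a local clip — the helper name and the `data:audio/wav;base64,` scheme shown here are illustrative assumptions, not part of the official client:

```python
import base64


def ref_audio_data_url(path: str, mime: str = "audio/wav") -> str:
    """Read a local reference clip and wrap it as a base64 data URL."""
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"


# The resulting string can be placed in the "ref_audio" field of a
# Base-task request body, e.g.:
#   {"input": "...", "task_type": "Base",
#    "ref_audio": ref_audio_data_url("reference.wav")}
```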
## Parameters -### Standard OpenAI Parameters +### OpenAI Standard Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| -| `input` | string | required | Text to synthesize | -| `voice` | string | "Vivian" | Speaker/voice name | +| `input` | string | **required** | Text to synthesize | +| `model` | string | server's model | Model to use (optional, should match server if specified) | +| `voice` | string | "vivian" | Speaker name (e.g., vivian, ryan, aiden) | | `response_format` | string | "wav" | Audio format: wav, mp3, flac, pcm, aac, opus | | `speed` | float | 1.0 | Playback speed (0.25-4.0) | -| `model` | string | optional | Model name (optional when serving single model) | -### Qwen3-TTS Parameters +### vLLM-Omni Extension Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `task_type` | string | "CustomVoice" | Task: CustomVoice, VoiceDesign, or Base | -| `language` | string | "Auto" | Language: Auto, Chinese, English, Japanese, Korean | +| `language` | string | "Auto" | Language (see supported languages below) | | `instructions` | string | "" | Voice style/emotion instructions | | `max_new_tokens` | int | 2048 | Maximum tokens to generate | +**Supported languages:** Auto, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian + ### Voice Clone Parameters (Base task) | Parameter | Type | Required | Description | |-----------|------|----------|-------------| -| `ref_audio` | string | Yes* | Reference audio (file path, URL, or base64) | +| `ref_audio` | string | **Yes** | Reference audio (URL or base64 data URL) | | `ref_text` | string | No | Transcript of reference audio (for ICL mode) | -| `x_vector_only_mode` | bool | false | Use speaker embedding only (no ICL) | - -## Python Usage - -```python -import httpx - -# Simple request -response = httpx.post( - "http://localhost:8000/v1/audio/speech", - json={ - "model": 
"Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", - "input": "Hello world", - "voice": "Vivian", - }, - timeout=300.0, -) - -with open("output.wav", "wb") as f: - f.write(response.content) -``` +| `x_vector_only_mode` | bool | No | Use speaker embedding only (no ICL) | ## Limitations @@ -172,10 +271,11 @@ with open("output.wav", "wb") as f: ## Troubleshooting -1. **Connection refused**: Make sure the server is running on the correct port -2. **Out of memory**: Reduce `--gpu-memory-utilization` in run_server.sh -3. **Unsupported speaker**: Check supported speakers via model documentation -4. **Voice clone fails**: Ensure you're using the Base model variant for voice cloning +1. **TTS model did not produce audio output**: Ensure you're using the correct model variant for your task type (CustomVoice task → CustomVoice model, etc.) +2. **Connection refused**: Make sure the server is running on the correct port +3. **Out of memory**: Use smaller model variant (`Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`) or reduce `--gpu-memory-utilization` +4. **Unsupported speaker**: Use `/v1/audio/voices` to list available voices for the loaded model +5. 
**Voice clone fails**: Ensure you're using the Base model variant for voice cloning ## Example materials diff --git a/examples/online_serving/qwen3_tts/README.md b/examples/online_serving/qwen3_tts/README.md index 1c9bd48203..e1ad3edc20 100644 --- a/examples/online_serving/qwen3_tts/README.md +++ b/examples/online_serving/qwen3_tts/README.md @@ -9,6 +9,8 @@ This directory contains examples for running Qwen3-TTS models with vLLM-Omni's o | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | CustomVoice | Predefined speaker voices with optional style control | | `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | VoiceDesign | Natural language voice style description | | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | Base | Voice cloning from reference audio | +| `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | CustomVoice | Smaller/faster variant | +| `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | Base | Smaller/faster variant for voice cloning | ## Quick Start @@ -24,19 +26,27 @@ This directory contains examples for running Qwen3-TTS models with vLLM-Omni's o ./run_server.sh Base ``` +Or launch directly with vllm serve: + +```bash +vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \ + --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ + --omni --port 8091 --trust-remote-code --enforce-eager +``` + ### 2. 
Run the Client ```bash # CustomVoice: Use predefined speaker python openai_speech_client.py \ --text "你好,我是通义千问" \ - --voice Vivian \ + --voice vivian \ --language Chinese # CustomVoice with style instruction python openai_speech_client.py \ --text "今天天气真好" \ - --voice Ryan \ + --voice ryan \ --instructions "用开心的语气说" # VoiceDesign: Describe the voice style @@ -59,25 +69,25 @@ python openai_speech_client.py \ ```bash # Simple TTS request -curl -X POST http://localhost:8000/v1/audio/speech \ +curl -X POST http://localhost:8091/v1/audio/speech \ -H "Content-Type: application/json" \ -d '{ "input": "Hello, how are you?", - "voice": "Vivian", + "voice": "vivian", "language": "English" }' --output output.wav # With style instruction -curl -X POST http://localhost:8000/v1/audio/speech \ +curl -X POST http://localhost:8091/v1/audio/speech \ -H "Content-Type: application/json" \ -d '{ "input": "I am so excited!", - "voice": "Vivian", + "voice": "vivian", "instructions": "Speak with great enthusiasm" }' --output excited.wav # List available voices in CustomVoice models -curl http://localhost:8000/v1/audio/voices +curl http://localhost:8091/v1/audio/voices ``` ## API Reference @@ -90,12 +100,26 @@ POST /v1/audio/speech This endpoint follows the [OpenAI Audio Speech API](https://platform.openai.com/docs/api-reference/audio/createSpeech) format with additional Qwen3-TTS parameters. +### Voices Endpoint + +``` +GET /v1/audio/voices +``` + +Lists available voices for the loaded model: + +```json +{ + "voices": ["aiden", "dylan", "eric", "one_anna", "ryan", "serena", "sohee", "uncle_fu", "vivian"] +} +``` + ### Request Body ```json { "input": "Text to synthesize", - "voice": "Vivian", + "voice": "vivian", "response_format": "wav", "task_type": "CustomVoice", "language": "Auto", @@ -115,45 +139,64 @@ Returns audio data in the requested format (default: WAV). 
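The returned bytes for the default `wav` format can be sanity-checked with the standard-library `wave` module; a small sketch (the helper is illustrative and not part of the example scripts):

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Parse WAV bytes and report basic stream properties."""
    with wave.open(io.BytesIO(data), "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_rate_hz": w.getframerate(),
            "sample_width_bytes": w.getsampwidth(),
            "duration_s": w.getnframes() / w.getframerate(),
        }


# Example: after saving a response, inspect it with
#   describe_wav(open("output.wav", "rb").read())
```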
## Parameters -### Standard OpenAI Parameters +### OpenAI Standard Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| -| `input` | string | required | Text to synthesize | -| `voice` | string | "Vivian" | Speaker/voice name | +| `input` | string | **required** | Text to synthesize | +| `model` | string | server's model | Model to use (optional, should match server if specified) | +| `voice` | string | "vivian" | Speaker name (e.g., vivian, ryan, aiden) | | `response_format` | string | "wav" | Audio format: wav, mp3, flac, pcm, aac, opus | | `speed` | float | 1.0 | Playback speed (0.25-4.0) | -| `model` | string | optional | Model name (optional when serving single model) | -### Qwen3-TTS Parameters +### vLLM-Omni Extension Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `task_type` | string | "CustomVoice" | Task: CustomVoice, VoiceDesign, or Base | -| `language` | string | "Auto" | Language: Auto, Chinese, English, Japanese, Korean | +| `language` | string | "Auto" | Language (see supported languages below) | | `instructions` | string | "" | Voice style/emotion instructions | | `max_new_tokens` | int | 2048 | Maximum tokens to generate | +**Supported languages:** Auto, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian + ### Voice Clone Parameters (Base task) -| Parameter | Type | Required | Description | +| Parameter | Type | Default | Description | |-----------|------|----------|-------------| -| `ref_audio` | string | Yes* | Reference audio (file path, URL, or base64) | -| `ref_text` | string | No | Transcript of reference audio (for ICL mode) | -| `x_vector_only_mode` | bool | false | Use speaker embedding only (no ICL) | +| `ref_audio` | string | null | Reference audio (URL or base64 data URL) | +| `ref_text` | string | null | Transcript of reference audio | +| `x_vector_only_mode` | bool | null | Use speaker 
embedding only (no ICL) | ## Python Usage +### Using OpenAI SDK + +```python +from openai import OpenAI + +client = OpenAI(base_url="http://localhost:8091/v1", api_key="none") + +response = client.audio.speech.create( + model="Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", + voice="vivian", + input="Hello, how are you?", +) + +response.stream_to_file("output.wav") +``` + +### Using httpx + ```python import httpx -# Simple request response = httpx.post( - "http://localhost:8000/v1/audio/speech", + "http://localhost:8091/v1/audio/speech", json={ - "model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", - "input": "Hello world", - "voice": "Vivian", + "input": "Hello, how are you?", + "voice": "vivian", + "language": "English", }, timeout=300.0, ) @@ -169,7 +212,8 @@ with open("output.wav", "wb") as f: ## Troubleshooting -1. **Connection refused**: Make sure the server is running on the correct port -2. **Out of memory**: Reduce `--gpu-memory-utilization` in run_server.sh -3. **Unsupported speaker**: Check supported speakers via model documentation -4. **Voice clone fails**: Ensure you're using the Base model variant for voice cloning +1. **"TTS model did not produce audio output"**: Ensure you're using the correct model variant for your task type (CustomVoice task → CustomVoice model, etc.) +2. **Connection refused**: Make sure the server is running on the correct port +3. **Out of memory**: Use smaller model variant (`Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`) or reduce `--gpu-memory-utilization` +4. **Unsupported speaker**: Use `/v1/audio/voices` to list available voices for the loaded model +5. 
**Voice clone fails**: Ensure you're using the Base model variant for voice cloning diff --git a/examples/online_serving/qwen3_tts/openai_speech_client.py b/examples/online_serving/qwen3_tts/openai_speech_client.py index 4d70460940..bbc13081da 100644 --- a/examples/online_serving/qwen3_tts/openai_speech_client.py +++ b/examples/online_serving/qwen3_tts/openai_speech_client.py @@ -5,10 +5,10 @@ Examples: # CustomVoice task (predefined speaker) - python openai_speech_client.py --text "Hello, how are you?" --voice Vivian + python openai_speech_client.py --text "Hello, how are you?" --voice vivian # CustomVoice with emotion instruction - python openai_speech_client.py --text "I'm so happy!" --voice Vivian \ + python openai_speech_client.py --text "I'm so happy!" --voice vivian \ --instructions "Speak with excitement" # VoiceDesign task (voice from description) @@ -30,7 +30,7 @@ import httpx # Default server configuration -DEFAULT_API_BASE = "http://localhost:8000" +DEFAULT_API_BASE = "http://localhost:8091" DEFAULT_API_KEY = "EMPTY" @@ -173,8 +173,8 @@ def parse_args(): parser.add_argument( "--voice", type=str, - default="Vivian", - help="Speaker/voice name (default: Vivian). Options: Vivian, Ryan, etc.", + default="vivian", + help="Speaker/voice name (default: vivian). Options: vivian, ryan, aiden, etc.", ) parser.add_argument( "--language", diff --git a/examples/online_serving/qwen3_tts/run_server.sh b/examples/online_serving/qwen3_tts/run_server.sh index 203ed76e8d..fd020f18b2 100755 --- a/examples/online_serving/qwen3_tts/run_server.sh +++ b/examples/online_serving/qwen3_tts/run_server.sh @@ -33,7 +33,7 @@ echo "Starting Qwen3-TTS server with model: $MODEL" vllm-omni serve "$MODEL" \ --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \ --host 0.0.0.0 \ - --port 8000 \ + --port 8091 \ --gpu-memory-utilization 0.9 \ --trust-remote-code \ --enforce-eager \