This guide explains how to run Qwen3-TTS efficiently on CPU-only systems, particularly Intel processors like the i5-1240P.
- Overview
- CPU Backend Options
- Quick Start: PyTorch CPU Backend
- Advanced: Intel Extension for PyTorch (IPEX)
- Experimental: OpenVINO Backend
- Performance Tuning
- Troubleshooting
The Qwen3-TTS API server now supports optimized CPU inference for systems without a GPU. The implementation includes:
- CPU-optimized PyTorch backend - Recommended for most CPU users
- Intel Extension for PyTorch (IPEX) - Optional performance boost on Intel CPUs
- OpenVINO backend - Experimental, requires manual model export
- ✅ It will run, but expect slower inference compared to GPU
- 🎯 Best strategy: Keep it simple, reduce overhead
- 📊 Recommended model: `Qwen3-TTS-12Hz-0.6B-Base` (smaller, faster on CPU)
- ⚠️ Not a drop-in win: OpenVINO requires manual export and may not accelerate all components
The API server supports three backend options for CPU inference:
| Backend | Setup | Performance | Stability | Recommended For |
|---|---|---|---|---|
| PyTorch CPU | ✅ Simple | ⭐⭐⭐ Good | ✅ Stable | All CPU users |
| PyTorch + IPEX | Extra `pip install` | ⭐⭐⭐⭐ Better | ✅ Stable | Intel CPUs (Linux) |
| OpenVINO | Manual model export | ⭐⭐⭐⭐⭐ Best* | Experimental | Advanced users |
*OpenVINO may only accelerate parts of the pipeline
Create a .env file or export these variables:
# Backend selection
export TTS_BACKEND=pytorch # Use CPU-optimized PyTorch backend
# Model configuration
export TTS_MODEL_ID=Qwen/Qwen3-TTS-12Hz-0.6B-Base # Smaller model for CPU
# Device and precision
export TTS_DEVICE=cpu # Force CPU device
export TTS_DTYPE=float32 # Recommended for CPU (stable, fast)
export TTS_ATTN=sdpa # Scaled Dot Product Attention (CPU-friendly)
# CPU threading (adjust for your CPU)
export CPU_THREADS=12 # Physical cores (i5-1240P: 4 P-cores + 8 E-cores)
export CPU_INTEROP=2 # Inter-op parallelism
# Optional: Set OpenMP/MKL threads
export OMP_NUM_THREADS=12
export MKL_NUM_THREADS=12

Then start the server:
python -m api.main
# or
qwen-tts-api

Test it with the OpenAI Python client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")
response = client.audio.speech.create(
model="qwen3-tts",
voice="Vivian",
input="Hello! This is Qwen3-TTS running on CPU."
)
response.stream_to_file("output.mp3")

IPEX can provide a significant speedup on Intel CPUs by optimizing matrix operations.
# Linux (recommended)
pip install intel-extension-for-pytorch
# Windows/macOS support may vary

Then enable it:
export USE_IPEX=true
export TTS_BACKEND=pytorch
export TTS_DEVICE=cpu

- Expected speedup: 20-40% on Intel CPUs
- Best for: Matmul/linear operations (the model's main compute)
- Compatibility: Works with PyTorch CPU backend automatically
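The optional nature of IPEX described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the server's actual code: `env_flag` and `maybe_apply_ipex` are hypothetical helpers, and the model is assumed to be a `torch.nn.Module` used for inference only. `ipex.optimize` is the extension's public entry point; when the package is not installed, the model is returned unchanged.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable such as USE_IPEX=true as a boolean."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def maybe_apply_ipex(model, enabled: bool):
    """Return (model, applied). Hypothetical helper for illustration only.

    Falls back to the unmodified model when IPEX is disabled or not installed.
    """
    if not enabled:
        return model, False
    try:
        import intel_extension_for_pytorch as ipex  # optional dependency
    except ImportError:
        return model, False
    # ipex.optimize expects an inference model, so switch to eval mode first.
    return ipex.optimize(model.eval()), True
```

A caller might write `model, used_ipex = maybe_apply_ipex(model, env_flag("USE_IPEX"))` and log `used_ipex` so it is obvious at startup whether the optimization took effect.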
Qwen3-TTS includes components that may not export cleanly:
- Codec/tokenizer decode to waveform
- Generation loop with dynamic behavior
- Custom audio processing
You might end up accelerating only part of the pipeline.
- Export the Qwen3-TTS model to OpenVINO IR format
- Place `model.xml` and `model.bin` in the model directory
- Set up the OpenVINO backend
export TTS_BACKEND=openvino
export OV_DEVICE=CPU # CPU, GPU, or AUTO
export OV_MODEL_DIR=./.ov_models # Directory with model.xml/model.bin
export OV_CACHE_DIR=./.ov_cache # Compilation cache

If you want to attempt OpenVINO export:
# This is a conceptual example - actual export may require significant work
from optimum.intel import OVModelForCausalLM
# Export only the text/token model part
# Keep audio decode in PyTorch
# This is NOT a working recipe for Qwen3-TTS - it's a starting point

Recommendation: Don't spend time fighting export issues. Use PyTorch CPU + IPEX instead.
The most important tuning for CPU inference is thread configuration:
# For i5-1240P (4 P-cores + 8 E-cores = 12 cores)
export CPU_THREADS=12 # Total cores
export CPU_INTEROP=2 # Keep low (1-2)
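# For other CPUs, use the physical core count
# Check with: lscpu | grep -E "^(Core\(s\) per socket|Socket\(s\)):"
# (the "CPU(s):" line counts logical CPUs, including hyperthreads)

Choose a model based on your CPU performance: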
| Model | Parameters | Speed on CPU | Quality | Best For |
|---|---|---|---|---|
| 0.6B-Base | 600M | ⚡⚡⚡ Fast | ⭐⭐⭐ Good | CPU inference |
| 1.7B-Base | 1.7B | ⚡⚡ Moderate | ⭐⭐⭐⭐ Excellent | GPU/fast CPU |
| 1.7B-CustomVoice | 1.7B | ⚡⚡ Moderate | ⭐⭐⭐⭐ Excellent | GPU/fast CPU |
Recommendation: Use Qwen3-TTS-12Hz-0.6B-Base for CPU inference.
# Recommended for CPU (in order of preference)
export TTS_ATTN=sdpa # Best for CPU (PyTorch native, optimized)
# export TTS_ATTN=eager # Fallback if sdpa has issues
# NOT recommended for CPU
# export TTS_ATTN=flash_attention_2 # GPU-only, will auto-fallback to sdpa

# Recommended for CPU
export TTS_DTYPE=float32 # Most stable and often fastest on CPU
# NOT recommended for CPU
# export TTS_DTYPE=float16 # May have precision issues on CPU
# export TTS_DTYPE=bfloat16 # GPU-optimized, not recommended for CPU

Baseline configuration (12 threads):
export TTS_BACKEND=pytorch
export TTS_MODEL_ID=Qwen/Qwen3-TTS-12Hz-0.6B-Base
export TTS_DEVICE=cpu
export TTS_DTYPE=float32
export TTS_ATTN=sdpa
export CPU_THREADS=12
export CPU_INTEROP=2

With IPEX enabled:
export TTS_BACKEND=pytorch
export TTS_MODEL_ID=Qwen/Qwen3-TTS-12Hz-0.6B-Base
export TTS_DEVICE=cpu
export TTS_DTYPE=float32
export TTS_ATTN=sdpa
export CPU_THREADS=12
export CPU_INTEROP=2
export USE_IPEX=true

Reduced thread count (4 threads):
export TTS_BACKEND=pytorch
export TTS_MODEL_ID=Qwen/Qwen3-TTS-12Hz-0.6B-Base
export TTS_DEVICE=cpu
export TTS_DTYPE=float32
export TTS_ATTN=sdpa
export CPU_THREADS=4
export CPU_INTEROP=1

Solutions for slow inference:
- Use smaller model: `Qwen3-TTS-12Hz-0.6B-Base`
- Optimize thread count: Match your CPU core count
- Enable IPEX: `USE_IPEX=true` (Intel CPUs only)
- Check CPU load: Ensure other processes aren't competing
Solutions for high memory usage:
- Use smaller model: 0.6B instead of 1.7B
- Close other applications
- Reduce batch size (if processing multiple requests)
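To see why the smaller model helps with memory, here is a rough back-of-the-envelope estimate. It is a sketch that counts model weights only, using the nominal parameter counts from the model table (600M and 1.7B); actual usage will be higher because of activations, the audio codec, and runtime overhead. `weight_memory_gb` is a hypothetical helper name.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: float, dtype: str = "float32") -> float:
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 1e9


for name, params in [("0.6B", 0.6e9), ("1.7B", 1.7e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB of float32 weights")
```

With these nominal counts, the float32 weights come to roughly 2.4 GB for the 0.6B model and 6.8 GB for the 1.7B model, which is why the smaller model is the safer choice on laptops with limited RAM.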
If you see "cannot set number of interop threads" warnings:
- This is non-critical - The backend will use existing thread settings
- Usually happens when creating multiple backend instances in tests
- In production (single backend), this won't occur
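The behavior described above can be sketched like this. It is an illustration under assumptions, not the server's implementation: `read_thread_settings` and `configure_threads` are hypothetical helpers, and the fallback defaults are arbitrary. The one PyTorch-specific fact it relies on is that `torch.set_num_interop_threads` raises `RuntimeError` when called after parallel work has started or after it was already set in the same process.

```python
import os


def read_thread_settings(env=os.environ):
    """Read CPU_THREADS and CPU_INTEROP, falling back to assumed defaults.

    os.cpu_count() reports logical CPUs, so an explicit CPU_THREADS is better.
    """
    default_threads = os.cpu_count() or 1
    threads = int(env.get("CPU_THREADS", default_threads))
    interop = int(env.get("CPU_INTEROP", 1))
    return max(1, threads), max(1, interop)


def configure_threads(torch_module, threads: int, interop: int) -> bool:
    """Apply thread settings; return False if the inter-op setting was rejected."""
    torch_module.set_num_threads(threads)
    try:
        torch_module.set_num_interop_threads(interop)
    except RuntimeError:
        # Non-critical: the process keeps its existing inter-op thread setting.
        return False
    return True
```

Typical use would be `import torch` followed by `configure_threads(torch, *read_thread_settings())`, called once before the first inference. Passing the module in as an argument keeps the helper easy to exercise with a stub in tests.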
If IPEX is not installed or fails to import:
# Install IPEX
pip install intel-extension-for-pytorch
# Verify installation
python -c "import intel_extension_for_pytorch as ipex; print(ipex.__version__)"

Expected behavior when the OpenVINO model files are missing:
RuntimeError: OpenVINO IR model not found at ./.ov_models/model.xml
To use the OpenVINO backend, you need to:
1. Export the Qwen3-TTS model to OpenVINO IR format
2. Place the model.xml and model.bin files in ./.ov_models
Solution:
- OpenVINO backend requires manual model export
- For reliable CPU inference, use `TTS_BACKEND=pytorch` instead
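A pre-flight check can catch the missing-model case before the server starts. The sketch below is a hypothetical wrapper, not part of the API server (which raises the `RuntimeError` shown above): it only tests whether `model.xml` and `model.bin` exist in the directory that `OV_MODEL_DIR` points to, and suggests `pytorch` when they do not.

```python
from pathlib import Path


def openvino_ir_available(model_dir: str) -> bool:
    """True when both IR files (model.xml and model.bin) exist in model_dir."""
    root = Path(model_dir)
    return (root / "model.xml").is_file() and (root / "model.bin").is_file()


def choose_backend(requested: str, model_dir: str = "./.ov_models") -> str:
    """Hypothetical helper: fall back to 'pytorch' when the IR files are missing."""
    if requested == "openvino" and not openvino_ir_available(model_dir):
        return "pytorch"
    return requested
```

A launch script could call `choose_backend(os.environ.get("TTS_BACKEND", "pytorch"))` and export the result, so a missing export degrades to the PyTorch backend instead of failing at startup.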
| Configuration | RTF* | First Request | Subsequent Requests |
|---|---|---|---|
| PyTorch CPU | ~2.5-3.0 | ~30-45s (model load) | ~2-3s per request |
| PyTorch + IPEX | ~2.0-2.5 | ~30-45s (model load) | ~1.5-2.5s per request |
*RTF = Real-Time Factor (lower is better, <1.0 means faster than real-time)
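As a worked illustration of the RTF definition, the sketch below divides processing time by the duration of the generated audio. The `synthesize` callable is a hypothetical stand-in for whatever produces the audio and is assumed to return the audio duration in seconds; the numbers in the example are made up.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(synthesize, text: str) -> float:
    """Time one call to a hypothetical synthesize(text) -> audio seconds."""
    start = time.perf_counter()
    audio_seconds = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


# Made-up example: 2.5 s of compute for 1.0 s of audio gives an RTF of 2.5.
print(real_time_factor(2.5, 1.0))
```

An RTF of 2.5 means each second of speech takes 2.5 seconds to generate, so a ten-second clip would take about 25 seconds at that rate.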
- First request is slow: Model loading takes 30-45 seconds
- Warmup recommended: Set `TTS_WARMUP_ON_START=true` for production
- CPU inference is slower than GPU: This is expected, CPU is ~10-50x slower
- Quality is identical: CPU and GPU produce the same audio quality
Configuration for fast startup (warmup skipped):
export TTS_BACKEND=pytorch
export TTS_MODEL_ID=Qwen/Qwen3-TTS-12Hz-0.6B-Base
export TTS_DEVICE=cpu
export TTS_DTYPE=float32
export TTS_WARMUP_ON_START=false # Skip warmup for faster startup

Configuration for production (warmup enabled):
export TTS_BACKEND=pytorch
export TTS_MODEL_ID=Qwen/Qwen3-TTS-12Hz-0.6B-Base
export TTS_DEVICE=cpu
export TTS_DTYPE=float32
export TTS_WARMUP_ON_START=true # Warmup on startup
export USE_IPEX=true # Enable IPEX if available
export CPU_THREADS=12 # Match your CPU
export CPU_INTEROP=2

Recommended approach for i5-1240P:
- ✅ Use PyTorch CPU backend - Simple, reliable, fast enough
- ✅ Enable IPEX (optional) - 20-40% speedup on Intel CPUs
- ✅ Use 0.6B model - Faster inference on CPU
- ✅ Tune thread count - Match your CPU cores
- ❌ Skip OpenVINO (for now) - Experimental, not worth the complexity
Expected result: Reliable CPU inference with reasonable performance for development and low-volume production use.
For high-throughput production, consider GPU deployment or a faster CPU.