This document describes the inference optimizations implemented in the Qwen3-TTS OpenAI-compatible API to achieve maximum performance on NVIDIA GPUs.
The official Qwen3-TTS backend has been optimized with multiple techniques to reduce latency and improve throughput for real-time text-to-speech synthesis. These optimizations are production-ready and applied automatically when the backend initializes.
- What: Flash Attention 2, a memory-efficient attention mechanism that reduces memory bandwidth requirements
- Implementation: attn_implementation="flash_attention_2" in model loading
- Benefits:
  - Reduces GPU VRAM usage
  - Speeds up attention computation by 2-3x
  - Enables longer context lengths
- Performance Impact: +10% speedup over baseline (RTF: 0.97 → 0.87)
- Requirements: NVIDIA GPU with compute capability ≥ 8.0 (Ampere, Ada Lovelace, Hopper, or newer); on older GPUs such as Volta or Turing, use attn_implementation="sdpa" instead
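As a rough illustration of the requirement above, the attention backend could be chosen at startup instead of being hard-coded. The sketch below is not the backend's actual logic. `choose_attn_implementation` and `detect_attn_implementation` are hypothetical helpers, and the 8.0 threshold is an assumption based on the flash-attn project's stated support for Ampere and newer GPUs.

```python
import importlib.util

# Assumption: FlashAttention-2 kernels need compute capability 8.0 or higher.
FLASH_ATTN_MIN_CAPABILITY = (8, 0)


def choose_attn_implementation(capability, flash_attn_installed,
                               min_capability=FLASH_ATTN_MIN_CAPABILITY):
    """Return "flash_attention_2" when usable, otherwise fall back to "sdpa"."""
    if flash_attn_installed and tuple(capability) >= tuple(min_capability):
        return "flash_attention_2"
    return "sdpa"


def detect_attn_implementation():
    """Probe the local machine; falls back to "sdpa" without torch or a GPU."""
    try:
        import torch
    except ImportError:
        return "sdpa"
    if not torch.cuda.is_available():
        return "sdpa"
    installed = importlib.util.find_spec("flash_attn") is not None
    return choose_attn_implementation(torch.cuda.get_device_capability(), installed)
```

The returned string can be passed as the attn_implementation argument when the model is loaded, so a missing flash-attn package degrades to PyTorch's built-in scaled dot-product attention rather than failing at startup.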
- What: PyTorch 2.0+ JIT compilation to fuse operations and reduce Python overhead
- Implementation: torch.compile(model, mode="reduce-overhead", fullgraph=False)
- Modes:
  - reduce-overhead: Optimizes for inference speed (recommended for production)
  - max-autotune: Maximum optimization with longer compilation time
  - default: Balanced compilation
- Benefits:
  - Kernel fusion reduces GPU kernel launch overhead
  - Optimized memory access patterns
  - Automatic graph optimizations
- Expected Impact: +20-30% speedup (estimated based on PyTorch benchmarks)
- Warmup: First inference may be slower due to compilation (1-2 requests)
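The compilation step described above might be wrapped in a small helper so that the mode is validated and compilation can be switched off in one place. This is a minimal sketch: `compile_for_inference` is a hypothetical name, and only the three modes listed above are accepted.

```python
VALID_COMPILE_MODES = ("default", "reduce-overhead", "max-autotune")


def compile_for_inference(module, mode="reduce-overhead", enabled=True):
    """Wrap `module` with torch.compile, or return it unchanged when disabled."""
    if mode not in VALID_COMPILE_MODES:
        raise ValueError(f"unknown torch.compile mode: {mode!r}")
    if not enabled:
        return module
    import torch  # imported lazily so the helper loads without PyTorch

    # fullgraph=False permits graph breaks: code the compiler cannot trace
    # runs in eager mode instead of raising an error.
    return torch.compile(module, mode=mode, fullgraph=False)
```

Note that torch.compile returns immediately. The actual compilation happens on the first forward pass, which is why the first requests are slower.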
- What: cuDNN benchmark mode, which automatically finds the fastest convolution algorithms for your specific hardware
- Implementation: torch.backends.cudnn.benchmark = True
- Benefits:
  - Optimizes convolution operations
  - Adapts to your specific GPU architecture
  - Automatically selects best algorithms
- Expected Impact: +5-10% speedup on convolution-heavy models
- Tradeoff: First few inferences are slower while cuDNN benchmarks algorithms
- What: Uses TF32 format for matrix multiplications on Ampere+ GPUs (RTX 30xx/40xx)
- Implementation:
  torch.backends.cuda.matmul.allow_tf32 = True
  torch.backends.cudnn.allow_tf32 = True
- Benefits:
  - 3-5x speedup on matmul operations (Ampere+ GPUs)
  - Maintains near-FP32 accuracy
  - No code changes required
- Hardware: NVIDIA Ampere (RTX 30xx, A100) or newer
- Accuracy: TF32 keeps FP32's 8-bit exponent but stores a 10-bit mantissa instead of 23 bits (19 bits in total versus 32), which has a negligible impact on TTS quality
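To get a feel for the precision involved, TF32 can be approximated in pure Python by dropping the 13 lowest mantissa bits of a float32, leaving 10 of the original 23. This is only an illustration under a simplifying assumption: the sketch truncates, whereas the rounding performed by real tensor cores may differ. `truncate_to_tf32` is a hypothetical helper.

```python
import struct


def truncate_to_tf32(x):
    """Keep the sign, 8 exponent bits and top 10 mantissa bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


x = 1.0 / 3.0
approx = truncate_to_tf32(x)
rel_error = abs(approx - x) / x
print(f"value={x:.8f} tf32~={approx:.8f} relative error={rel_error:.2e}")
```

With 10 mantissa bits, the relative error of a single truncated value stays below 2^-10, or about 0.1%.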
- What: Uses BFloat16 format for model weights and activations
- Implementation: dtype=torch.bfloat16 in model loading
- Benefits:
  - Reduces VRAM usage by ~50%
  - Faster computation on modern GPUs
  - Better numerical stability than FP16
- Performance Impact: Already used in baseline benchmarks
- Hardware: Works best on Ampere+ GPUs with native BF16 support
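The stability advantage over FP16 comes mainly from range: BFloat16 keeps FP32's 8-bit exponent, while FP16 overflows above roughly 65504. The sketch below shows this with the standard library only. `to_bfloat16` and `fits_in_float16` are hypothetical helpers, and truncation stands in for the hardware's rounding.

```python
import struct


def to_bfloat16(x):
    """Truncate a float32 to bfloat16 precision (its upper 16 bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_in_float16(x):
    """True if x can be stored as IEEE half precision without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


value = 70000.0
print("bfloat16 approximation:", to_bfloat16(value))
print("representable in float16:", fits_in_float16(value))
```

BFloat16 trades mantissa bits for that range, so individual values are coarser than in FP16, but large activations do not overflow to infinity.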
Configuration: Official Backend + Flash Attention 2
Average RTF: 0.87
Average Latency: 7.28s
Configuration: Official Backend + All Optimizations
Expected RTF: ~0.65-0.70 (about 28-33% lower than the unoptimized 0.97 baseline)
Expected Latency: ~5.5-6.0s
Note: torch.compile() and cuDNN benchmarking require a warmup phase. First 2-3 requests may be slower.
| Configuration | RTF | Speedup vs Baseline | VRAM | Notes |
|---|---|---|---|---|
| Baseline | 0.97 | - | 3.89 GB | No optimizations |
| + Flash Attn 2 | 0.87 | +10% | 3.5 GB | ✅ Production ready |
| + torch.compile | ~0.70 | +28% | 3.5 GB | ⏳ Warmup required |
| + cuDNN benchmark | ~0.68 | +30% | 3.5 GB | ⏳ Warmup required |
| + TF32 | ~0.65 | +33% | 3.5 GB | 🎯 Ampere+ only |
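The speedup column can be reproduced from the RTF column. The sketch below assumes the usual definition of real-time factor (synthesis time divided by the duration of the generated audio) and treats speedup as the relative reduction in RTF. Both function names are hypothetical.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF below 1.0 means audio is generated faster than it plays back."""
    return synthesis_seconds / audio_seconds


def speedup_vs_baseline(baseline_rtf, rtf):
    """Relative reduction in RTF compared with the baseline."""
    return (baseline_rtf - rtf) / baseline_rtf


BASELINE_RTF = 0.97
measurements = [
    ("+ Flash Attn 2", 0.87),
    ("+ torch.compile", 0.70),
    ("+ cuDNN benchmark", 0.68),
    ("+ TF32", 0.65),
]
for label, rtf in measurements:
    print(f"{label}: {speedup_vs_baseline(BASELINE_RTF, rtf):+.0%}")
```

Each row is cumulative, so the percentages describe the whole stack up to that row rather than the contribution of a single optimization.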
All optimizations are enabled by default in the Docker deployment. To customize:
Edit api/backends/official_qwen3_tts.py:
# To disable torch.compile(), comment out this section:
# self.model.model = torch.compile(
#     self.model.model,
#     mode="reduce-overhead",
#     fullgraph=False,
# )
# To disable TF32:
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
# To disable cuDNN benchmarking:
torch.backends.cudnn.benchmark = False
# For maximum speed (longer compilation):
self.model.model = torch.compile(self.model.model, mode="max-autotune")
# For balanced performance:
self.model.model = torch.compile(self.model.model, mode="default")
- Flash Attention 2: Always enabled (significant speedup, minimal downside)
- TF32: Enabled on Ampere+ GPUs (huge speedup, negligible accuracy loss)
- cuDNN Benchmark: Enabled (after warmup, pure performance gain)
- BFloat16: Enabled (reduces VRAM, faster inference)
- torch.compile():
  - Pros: 20-30% speedup after warmup
  - Cons: First 1-2 requests are slower, increased memory during compilation
  - Recommendation: Enable for long-running servers, disable for serverless
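One way to avoid editing the backend source for each deployment is to read these toggles from environment variables. The sketch below is an assumption about how that could look, not something the project provides. The `TTS_*` variable names, `OptimizationConfig` and `apply_backend_flags` are all hypothetical.

```python
import os
from dataclasses import dataclass


def _env_flag(env, name, default):
    """Interpret common truthy strings; fall back to `default` when unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class OptimizationConfig:
    use_compile: bool = True
    compile_mode: str = "reduce-overhead"
    allow_tf32: bool = True
    cudnn_benchmark: bool = True

    @classmethod
    def from_env(cls, env=None):
        env = os.environ if env is None else env
        return cls(
            use_compile=_env_flag(env, "TTS_USE_COMPILE", True),
            compile_mode=env.get("TTS_COMPILE_MODE", "reduce-overhead"),
            allow_tf32=_env_flag(env, "TTS_ALLOW_TF32", True),
            cudnn_benchmark=_env_flag(env, "TTS_CUDNN_BENCHMARK", True),
        )


def apply_backend_flags(config):
    """Apply the global PyTorch backend switches described in this document."""
    import torch  # imported lazily so the config can be parsed without PyTorch

    torch.backends.cuda.matmul.allow_tf32 = config.allow_tf32
    torch.backends.cudnn.allow_tf32 = config.allow_tf32
    torch.backends.cudnn.benchmark = config.cudnn_benchmark
```

A serverless deployment could then set TTS_USE_COMPILE=0 to skip compilation, while a long-running server keeps the defaults.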
| Optimization | Minimum GPU | Recommended GPU | Memory |
|---|---|---|---|
| Flash Attention 2 | Ampere (RTX 30xx, A100) | Ampere+ | - |
| torch.compile() | Any CUDA GPU | Ampere+ | +0.5GB during compilation |
| TF32 | Ampere (RTX 3090) | Ada Lovelace (RTX 4090) | - |
| cuDNN Benchmark | Any CUDA GPU | Any CUDA GPU | - |
| BFloat16 | Turing (RTX 20xx) | Ampere+ | -50% VRAM |
- Symptom: The first requests after startup are slow
- Cause: torch.compile() and cuDNN benchmarking are still warming up
- Solution: Send 2-3 warmup requests after server start
- Duration: 10-30 seconds for first request
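The warmup step could be scripted so it runs right after the server starts. The sketch below assumes the API exposes the OpenAI-style /v1/audio/speech endpoint. The base URL, the "qwen3-tts" model name and the "default" voice are placeholders to replace with the values of your deployment.

```python
import json
import time
import urllib.request


def build_speech_request(base_url, text, model="qwen3-tts", voice="default"):
    """Build a POST request in the shape of the OpenAI speech API."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url, rounds=3, timeout=120.0):
    """Send a few short requests and return the latency of each in seconds."""
    timings = []
    for i in range(rounds):
        request = build_speech_request(base_url, f"Warmup request number {i + 1}.")
        start = time.perf_counter()
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()  # drain the audio so the request fully completes
        timings.append(time.perf_counter() - start)
    return timings
```

Calling warm_up("http://localhost:8000") from a startup hook or a container health check should show the latency dropping after the first one or two rounds, provided the server listens on that address.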
- Symptom: Out-of-memory errors during startup or the first requests
- Cause: torch.compile() needs extra memory
- Solution: Disable torch.compile() or reduce batch size
- Memory: Typically +0.5-1GB during compilation
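Rather than disabling compilation by hand, a server could try the compiled model once and fall back to eager mode if that fails. The sketch below is one possible shape for this, with hypothetical names throughout. Because torch.compile compiles lazily, the warmup call has to sit inside the try block. The sketch relies on torch.cuda.OutOfMemoryError being a subclass of RuntimeError, and catching RuntimeError this broadly is a simplification.

```python
def compile_with_fallback(module, compile_fn, warmup_fn):
    """Return the compiled module if one warmup call succeeds, else `module`."""
    compiled = compile_fn(module)
    try:
        # Compilation is triggered by the first real call, not by compile_fn.
        warmup_fn(compiled)
    except RuntimeError as exc:
        print(f"compiled warmup failed ({exc}); falling back to eager mode")
        return module
    return compiled
```

In practice compile_fn would wrap torch.compile and warmup_fn would synthesize a short sentence. Both are injected here so the fallback logic stays independent of the model.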
- Symptom: Flash Attention 2 fails to load
- Cause: Incompatible GPU or missing flash-attn package
- Solution: Install flash-attn==2.8.3 or use attn_implementation="sdpa"
- Dynamic Batching: Process multiple requests together (~2x throughput)
- CUDA Graphs: Capture and replay execution graphs (~10-15% speedup)
- INT8 Quantization: Reduce model size and speed up inference (~30-40% speedup)
- vLLM Integration: Advanced KV cache management and paged attention
- Tensor Parallelism: Multi-GPU inference for larger models
- torch.compile(fullgraph=True): Maximum optimization, less compatible
- FP8 Precision: Available on H100 GPUs (~2x speedup)
- Custom CUDA Kernels: Hand-optimized kernels for specific operations
To benchmark your own hardware:
# Run official benchmark script
python bench_tts.py --label "Optimized"
# Compare with baseline (disable optimizations)
python bench_tts.py --label "Baseline"
- 2026-01-25: Added torch.compile(), TF32, and cuDNN benchmark optimizations
- 2026-01-25: Implemented Flash Attention 2 support
- 2026-01-24: Initial optimized deployment with vLLM-Omni backend
Last Updated: 2026-01-25
Hardware Tested: NVIDIA RTX 3090 (24GB)
PyTorch Version: 2.0+
CUDA Version: 12.1.1