Skip to content

stonebeta/GPT-SoVITS_minimal_inference

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

57 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

⚡ GPT-SoVITS Minimal Inference

High-Performance | Production-Ready | Zero-Copy Pipeline

License Python GPU ONNX TensorRT

简体中文 | English

"Not just a refactor, but a deep squeeze of GPT-SoVITS potential."


Engineered for Speed: A completely refactored inference engine for GPT-SoVITS, featuring ONNX/TensorRT support, KV-Cache optimization, and zero-copy streaming.


🌟 Core Vision

We ain't here to nerf your model accuracy or break your production setup with retraining nonsense. We are here to smash those bottlenecks into oblivion.

Our goal is simple: Make GPU go brrr. We strive for: Fast AF 🏎️, Space-Time Tradeoff ⚖️, **Compatible AF 🤝 **, and Portable 🌍. No cap, just pure speed. 😤

🚀 Performance Benchmarks

Environment: I7 12700 | RTX 2080TI (22G) | CUDA 12.9 | FP16 Precision

Metric Native PyTorch(Original Project) Native PyTorch(This Project) ONNX ONNX Stream TensorRT (fitted)
First Token Latency (↓) 5.417s 2.424 s 2.683 s 1.000 s 2.022 s
Inference Speed (↑) 148.65 tokens/s 144.8 tok/s 172.4 tok/s 167.5 tok/s 291.6 tok/s (🤯)
RTF (↓) 0.5229 0.3434 0.3325 0.3100 0.2096
VRAM Usage (↓) 3 G 2.8 G 3.9 G 4.5 G 3.4 G

🛠️ Deep Analysis: Why Refactor?

1. Eliminating Dynamic Graph & Python Overhead

The original GPT-SoVITS is based on PyTorch dynamic graphs. During the AR decoding stage, generating each token incurs significant Python interpreter scheduling overhead. In long-text scenarios, this linear accumulation of latency is a nightmare for production.

2. Extreme VRAM Management Optimization

  • KV-Cache Pre-allocation: Avoids the "idling" and frequent memory copies caused by torch.cat after ONNX export.
  • Static Dimension Alignment: Optimized for TensorRT to ensure stable static execution plans and avoid re-build issues caused by dynamic shapes.

💎 Core Optimizations

1. "Surgical" Operator Rewriting

We decoupled the GPT model into two independent computational graphs:

  • GPTEncoder (Context Phase): Processes prompts and BERT features in one go.
  • GPTStep (Decoding Phase): Executes single-step decoding with $O(1)$ complexity and sinks Top-K Sampling into the ONNX graph, drastically reducing GPU->CPU data transfer.

2. Full Pipeline Zero-Copy

Utilizing ONNX Runtime's IOBinding technology:

  • VRAM Residency: Input/output are bound directly to VRAM addresses. The new_k_cache from the previous round is used directly as the next round's input, eliminating PCIe bandwidth bottlenecks.

3. Artifact-Free Streaming

Original Lookahead + History Window mechanism:

  • Performs linear weighted fusion (Cross-Fade) at chunk boundaries, completely eliminating the "clicking" sounds common in traditional streaming inference.

🏁 Quick Start

1. Export Model

python export_onnx.py \
    --gpt_path "pretrained_models\GPT_weights_v2ProPlus/firefly_v2_pp-e25.ckpt"
    --sovits_path "pretrained_models\SoVITS_weights_v2ProPlus/firefly_v2_pp_e10_s590.pth"
    --cnhubert_base_path pretrained_models\chinese-hubert-base
    --bert_path pretrained_models\chinese-roberta-wwm-ext-large
    --output_dir  "onnx_export/firefly_v2_proplus"
    --max_len 1000 # Reducing the size of the GPU can speed up throughput and decrease the pre-allocated video memory, but it requires parameter modification. Generally speaking, 1000 can find a relatively acceptable balance in most scenarios (text of varying lengths).

2. FP16 Optimization (Optional)

python onnx_to_fp16.py \
    --input_dir "onnx_export/firefly_v2_proplus" \
    --output_dir "onnx_export/firefly_v2_proplus_fp16"

3. Run High-Performance Inference

# Pure streaming inference
python run_onnx_streaming_inference.py \
    --onnx_dir onnx_export/firefly_v2_proplus_fp16 \
    --ref_audio "pretrained_models\看,这尊雕像就是匹诺康尼大名鼎鼎的卡通人物钟表小子.wav" \
    --ref_text "看,这尊雕像就是匹诺康尼大名鼎鼎的卡通人物“钟表小子" \
    --ref_lang "zh" \
    --text "范肖有一项奇特的能力,可以把自己的运气像钱一样攒起来用。攒的越多,越能撞大运。比如攒一个月,就能中彩票。那么,攒到极限会发生什么呢?"
     --lang "zh" --output "out_onnx_stream.wav"

# Launch full-featured WebUI
python run_optimized_inference.py --onnx_dir onnx_export/firefly_v2_proplus_fp16 --webui

Export TensorRT Engine

Note: Compiling TRT engines takes time and must be done for each specific hardware/CUDA/TRT version combination.

# Auto-detect GPU VRAM and select optimal shape profile
python onnx2trt.py \
    --input_dir onnx_export/firefly_v2_proplus_fp16 \
    --output_dir onnx_export/firefly_v2_proplus_fp16

# For VRAM-constrained GPUs, use a tighter profile
python onnx2trt.py \
    --input_dir onnx_export/firefly_v2_proplus_fp16 \
    --output_dir onnx_export/firefly_v2_proplus_fp16 \
    --shape_profile fitted --opt_level 2

# See all options
python onnx2trt.py --help

Available shape profiles:

Profile sovits max sem Max audio/seg Recommended VRAM
small 150 ~6s <=12GB
fitted 250 ~10s 8-24GB (profiled)
medium 400 ~16s 16-24GB (default)
large 1000 ~40s >=32GB

Tip: Run inference with ONNX first to collect a Shape Profile Summary, then choose the best profile. The fitted profile is optimized based on real profiling data.


🌐 API Service

If you're tired of staring at the terminal or want your backend to talk to this beast directly, we've squeezed out an OpenAI-compatible API service with streaming support. It's basically "Plug and Play".

  • PyTorch (Stable): python api_server.py (Port 8000, for the traditionalists)
  • ONNX (Turbo): python api_server_onnx.py (Port 8001, CPU users' salvation, easy deployment)
  • TensorRT (Godspeed): python api_server_trt.py (Port 8002, GPU screaming, performance peaking)

👉 Check the API Documentation — Please, just read the docs. I beg you. Everything is in there.


🛠️ SDK

C++: GPT-SoVITS-Devel/GPT-SoVITS-cpp


🗺️ Roadmap

  • V2 / V2ProPlus full support
  • TensorRT static engine acceleration
  • Zero-Copy IOBinding optimization
  • Multi-Language Binding:
    • C++ SDK (In development)
    • Rust / Golang / Android Wrapper
  • V3 / V4 model adaptation
  • Docker one-click deployment image

🤝 Acknowledgments

Special thanks to the GPT-SoVITS team for providing an excellent foundation. This project aims to push its engineering capabilities even further.

If this project helps you, please give us a ⭐! It keeps us motivated! 🤗

About

GPT-SoVITS的最小推理实现,onnx优化,支持tensorrt推理,原生性能. 带openai api兼容api.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages

  • Python 99.7%
  • Other 0.3%