High-Performance | Production-Ready | Zero-Copy Pipeline
"Not just a refactor, but a deep squeeze of GPT-SoVITS potential."
Engineered for Speed: A completely refactored inference engine for GPT-SoVITS, featuring ONNX/TensorRT support, KV-Cache optimization, and zero-copy streaming.
We ain't here to nerf your model accuracy or break your production setup with retraining nonsense. We are here to smash those bottlenecks into oblivion.
Our goal is simple: Make GPU go brrr. We strive for: Fast AF 🏎️, Space-Time Tradeoff ⚖️, **Compatible AF 🤝**, and Portable 🌍. No cap, just pure speed. 😤
Environment: i7-12700 | RTX 2080 Ti (22G) | CUDA 12.9 | FP16 precision
| Metric | Native PyTorch (Original Project) | Native PyTorch (This Project) | ONNX | ONNX Stream | TensorRT (fitted) |
|---|---|---|---|---|---|
| First Token Latency (↓) | 5.417 s | 2.424 s | 2.683 s | 1.000 s | 2.022 s |
| Inference Speed (↑) | 148.65 tokens/s | 144.8 tokens/s | 172.4 tokens/s | 167.5 tokens/s | 291.6 tokens/s (🤯) |
| RTF (↓) | 0.5229 | 0.3434 | 0.3325 | 0.3100 | 0.2096 |
| VRAM Usage (↓) | 3.0 GB | 2.8 GB | 3.9 GB | 4.5 GB | 3.4 GB |
The original GPT-SoVITS is based on PyTorch dynamic graphs. During the AR decoding stage, generating each token incurs
significant Python interpreter scheduling overhead. In long-text scenarios, this linear accumulation of latency is a
nightmare for production.
- KV-Cache Pre-allocation: Avoids the "idling" and frequent memory copies caused by `torch.cat` after ONNX export (see the sketch below).
- Static Dimension Alignment: Optimized for TensorRT to ensure stable static execution plans and avoid rebuild issues caused by dynamic shapes.
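A minimal sketch of the pre-allocation idea (shapes and names are illustrative, not the exported graph's actual tensors):

```python
import torch

# Hypothetical cache shape: (batch, max_len, n_heads, head_dim).
MAX_LEN, N_HEADS, HEAD_DIM = 1000, 16, 64

# Dynamic-graph decoding typically grows the cache every step:
#   k_cache = torch.cat([k_cache, new_k], dim=1)  # reallocate + copy each token
#
# Pre-allocation reserves the full buffer once and writes in place instead.
k_cache = torch.zeros(1, MAX_LEN, N_HEADS, HEAD_DIM, dtype=torch.float16)
v_cache = torch.zeros_like(k_cache)

def append_kv(step: int, new_k: torch.Tensor, new_v: torch.Tensor) -> None:
    """Write this step's K/V into the static buffer: no reallocation, static shapes."""
    k_cache[:, step] = new_k
    v_cache[:, step] = new_v

append_kv(0,
          torch.randn(1, N_HEADS, HEAD_DIM, dtype=torch.float16),
          torch.randn(1, N_HEADS, HEAD_DIM, dtype=torch.float16))
```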
We decoupled the GPT model into two independent computational graphs:
- `GPTEncoder` (Context Phase): Processes prompts and BERT features in one go.
- `GPTStep` (Decoding Phase): Executes single-step decoding with $O(1)$ complexity per token and sinks Top-K sampling into the ONNX graph, drastically reducing GPU->CPU data transfer (see the decode-loop sketch below).
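The resulting decode loop looks roughly like this (a sketch only; the file names, input/output names, and EOS handling are assumptions, not the exported graphs' exact signatures):

```python
import numpy as np
import onnxruntime as ort

# Assumed graph names and I/O signatures, for illustration only.
encoder = ort.InferenceSession("gpt_encoder.onnx", providers=["CUDAExecutionProvider"])
step = ort.InferenceSession("gpt_step.onnx", providers=["CUDAExecutionProvider"])

phones = np.zeros((1, 64), dtype=np.int64)        # placeholder prompt phonemes
bert = np.zeros((1, 1024, 64), dtype=np.float32)  # placeholder BERT features
EOS_ID, MAX_NEW_TOKENS = 1024, 500                # placeholder constants

# Context phase: one pass over the prompt produces the initial KV cache.
y, k_cache, v_cache = encoder.run(None, {"phones": phones, "bert": bert})

# Decoding phase: O(1) work per token. Top-K sampling lives inside the graph,
# so only the sampled token id (not the full logits) crosses back to the CPU.
for _ in range(MAX_NEW_TOKENS):
    y, k_cache, v_cache = step.run(None, {"y": y, "k_cache": k_cache, "v_cache": v_cache})
    if int(y[0, -1]) == EOS_ID:
        break
```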
Utilizing ONNX Runtime's IOBinding technology:
- VRAM Residency: Inputs and outputs are bound directly to VRAM addresses. The `new_k_cache` from the previous round is used directly as the next round's input, eliminating PCIe bandwidth bottlenecks (see the IOBinding sketch below).
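Conceptually, the binding pattern looks like this (tensor names and shapes are assumptions for illustration; only the pattern matters):

```python
import numpy as np
import onnxruntime as ort

step = ort.InferenceSession("gpt_step.onnx", providers=["CUDAExecutionProvider"])

# Allocate both cache buffers once, directly on the GPU.
shape = (1, 1000, 16, 64)  # illustrative cache shape
k_cache = ort.OrtValue.ortvalue_from_numpy(np.zeros(shape, dtype=np.float16), "cuda", 0)
new_k_cache = ort.OrtValue.ortvalue_from_numpy(np.zeros(shape, dtype=np.float16), "cuda", 0)

binding = step.io_binding()
binding.bind_ortvalue_input("k_cache", k_cache)
binding.bind_ortvalue_output("new_k_cache", new_k_cache)

step.run_with_iobinding(binding)

# Next round: swap the buffers so the previous output feeds the next input,
# without the cache ever crossing PCIe back to host memory.
k_cache, new_k_cache = new_k_cache, k_cache
```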
A self-developed Lookahead + History Window mechanism:
- Performs linear weighted fusion (cross-fade) at chunk boundaries, completely eliminating the "clicking" sounds common in traditional streaming inference (see the sketch below).
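The boundary blend itself is a plain linear cross-fade; a minimal sketch (overlap length is an assumption):

```python
import numpy as np

def crossfade(prev_tail: np.ndarray, cur_head: np.ndarray) -> np.ndarray:
    """Linearly blend the overlap between two adjacent audio chunks.

    prev_tail fades out while cur_head fades in, removing the discontinuity
    ("click") that a hard cut at the chunk boundary would produce.
    """
    n = len(prev_tail)
    fade_out = np.linspace(1.0, 0.0, n, dtype=np.float32)
    return prev_tail * fade_out + cur_head * (1.0 - fade_out)

# Example: blend a 480-sample overlap (~15 ms at 32 kHz).
a = np.random.randn(480).astype(np.float32)
b = np.random.randn(480).astype(np.float32)
blended = crossfade(a, b)
```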
Export the ONNX models:

```bash
python export_onnx.py \
    --gpt_path "pretrained_models\GPT_weights_v2ProPlus/firefly_v2_pp-e25.ckpt" \
    --sovits_path "pretrained_models\SoVITS_weights_v2ProPlus/firefly_v2_pp_e10_s590.pth" \
    --cnhubert_base_path pretrained_models\chinese-hubert-base \
    --bert_path pretrained_models\chinese-roberta-wwm-ext-large \
    --output_dir "onnx_export/firefly_v2_proplus" \
    --max_len 1000
```

`--max_len` sets the pre-allocated decode length: lowering it reduces pre-allocated VRAM and can improve throughput, but longer texts then require changing the parameter. Generally, 1000 strikes an acceptable balance for most scenarios (texts of varying lengths).

Convert the exported models to FP16:

```bash
python onnx_to_fp16.py \
    --input_dir "onnx_export/firefly_v2_proplus" \
    --output_dir "onnx_export/firefly_v2_proplus_fp16"
```
```bash
# Pure streaming inference
python run_onnx_streaming_inference.py \
    --onnx_dir onnx_export/firefly_v2_proplus_fp16 \
    --ref_audio "pretrained_models\看,这尊雕像就是匹诺康尼大名鼎鼎的卡通人物钟表小子.wav" \
    --ref_text "看,这尊雕像就是匹诺康尼大名鼎鼎的卡通人物“钟表小子" \
    --ref_lang "zh" \
    --text "范肖有一项奇特的能力,可以把自己的运气像钱一样攒起来用。攒的越多,越能撞大运。比如攒一个月,就能中彩票。那么,攒到极限会发生什么呢?" \
    --lang "zh" \
    --output "out_onnx_stream.wav"
```
```bash
# Launch the full-featured WebUI
python run_optimized_inference.py --onnx_dir onnx_export/firefly_v2_proplus_fp16 --webui
```

Note: Compiling TRT engines takes time and must be done for each specific hardware/CUDA/TRT version combination.
```bash
# Auto-detect GPU VRAM and select the optimal shape profile
python onnx2trt.py \
    --input_dir onnx_export/firefly_v2_proplus_fp16 \
    --output_dir onnx_export/firefly_v2_proplus_fp16

# For VRAM-constrained GPUs, use a tighter profile
python onnx2trt.py \
    --input_dir onnx_export/firefly_v2_proplus_fp16 \
    --output_dir onnx_export/firefly_v2_proplus_fp16 \
    --shape_profile fitted --opt_level 2

# See all options
python onnx2trt.py --help
```

Available shape profiles:
| Profile | SoVITS max semantic length | Max audio per segment | Recommended VRAM |
|---|---|---|---|
| `small` | 150 | ~6 s | <= 12 GB |
| `fitted` | 250 | ~10 s | 8-24 GB (profiled) |
| `medium` | 400 | ~16 s | 16-24 GB (default) |
| `large` | 1000 | ~40 s | >= 32 GB |
Tip: Run inference with ONNX first to collect a Shape Profile Summary, then choose the best profile. The `fitted` profile is optimized based on real profiling data.
If you're tired of staring at the terminal or want your backend to talk to this beast directly, we've squeezed out an OpenAI-compatible API service with streaming support. It's basically "Plug and Play".
- PyTorch (Stable): `python api_server.py` (Port 8000, for the traditionalists)
- ONNX (Turbo): `python api_server_onnx.py` (Port 8001, CPU users' salvation, easy deployment)
- TensorRT (Godspeed): `python api_server_trt.py` (Port 8002, GPU screaming, performance peaking)
👉 Check the API Documentation — Please, just read the docs. I beg you. Everything is in there.
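For a quick smoke test, here is a minimal request sketch. The route and payload follow OpenAI's `/v1/audio/speech` convention; treat the exact fields as assumptions and confirm them against the API documentation above.

```python
import requests

# ONNX server on port 8001; route/fields assumed from OpenAI's audio.speech API.
resp = requests.post(
    "http://127.0.0.1:8001/v1/audio/speech",
    json={
        "model": "gpt-sovits",          # placeholder model name
        "input": "Hello from the optimized pipeline!",
        "voice": "default",             # placeholder voice / reference id
    },
    timeout=120,
)
resp.raise_for_status()
with open("out_api.wav", "wb") as f:
    f.write(resp.content)
```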
C++: GPT-SoVITS-Devel/GPT-SoVITS-cpp
- V2 / V2ProPlus full support
- TensorRT static engine acceleration
- Zero-Copy IOBinding optimization
- Multi-Language Binding:
  - C++ SDK (In development)
  - Rust / Golang / Android Wrapper
- V3 / V4 model adaptation
- Docker one-click deployment image
Special thanks to the GPT-SoVITS team for providing an excellent foundation. This project aims to push its engineering capabilities even further.
If this project helps you, please give us a ⭐! It keeps us motivated! 🤗