Production-Grade | Zero-Copy | Multi-Language Binding
"Python was the prototype. C++ is the weapon."
The production inference engine for GPT-SoVITS Minimal Inference. The Python repo handles model export and tech preview. This repo is where the real work happens.
The Python project proved the concept. Now we're done being polite about performance.
This C++ SDK takes every optimization from the Python pipeline — KV-Cache pre-allocation, IOBinding zero-copy, lookahead streaming — and removes the last remaining bottleneck: the Python interpreter itself.
Goals: Faster 🏎️, Embeddable 🔩, Bindable to Everything 🌍, No Runtime Tax 💀.
Environment: i7-12700 | RTX 2080 Ti (22 GB) | CUDA 12.9 | FP16 precision
Test text: multilingual ZH/JA/EN mixed, ~19s audio output
ONNX:
| Metric | Python ONNX | C++ ONNX | Python ONNX Stream | C++ ONNX Stream |
|---|---|---|---|---|
| Inference Speed (↑) | 172.4 tok/s | 215.1 tok/s | 167.5 tok/s | 222.73 tok/s |
| RTF (↓) | 0.3325 | 0.2398 | 0.3100 | 0.4894 |
| First Packet Latency (↓) | 2.683 s | 1.210 s | 1.000 s | 1.250 s |
| VRAM Usage (↓) | 3.9 G | 3.6 G | 4.5 G | 4.0 G |
TRT:
| Metric | Python TRT | C++ TRT | C++ TRT Stream |
|---|---|---|---|
| Inference Speed (↑) | 291.6 tok/s | 357.66 tok/s | 355.65 tok/s |
| RTF (↓) | 0.2096 | 0.1020 | 0.1205 |
| First Packet Latency (↓) | 2.683 s | 0.5 s | 0.46 s |
| VRAM Usage (↓) | 3.4 G | 2.8 G | 2.3 G |
This SDK implements a distributed inference model — speaker creation and inference are decoupled:
[Cloud / Offline] [Edge / Production]
Reference Audio .gsppkg file
↓ ↓
CreateSpeaker() → ImportSpeaker()
ExportSpeaker(.gsppkg) ↓
Infer() / InferStreaming()
↓
Audio Output
| Mode | Description |
|---|---|
| Edge Pipeline | Inference only. Loads speaker from .gsppkg. Minimal VRAM. |
| Streaming Pipeline | Chunk-based real-time generation with crossfade. |
| Full Pipeline | Speaker creation + inference. For most scenarios. |
CNBertModel → Phoneme + BERT features
GPTEncoderModel → Context encoding (one-shot)
GPTStepModel → Autoregressive decoding (O(1) per step, KV-cache)
SoVITSModel → Neural vocoder → PCM audio
- CMake 3.20+
- ONNX Runtime 1.16+ (CUDA build for GPU)
- CUDA 12.6+ (optional, for GPU)
# CPU build (ONNX only)
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
# CUDA build with ONNX
cmake -B build -S . \
-DENABLE_CUDA=1 \
-DONNXRUNTIME_PATH=/path/to/onnxruntime \
-DCUDA_TOOLKIT_ROOT_DIR=/path/to/cuda \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
# TensorRT build
cmake -B build -S . \
-DUSE_TENSORRT=1 \
-DENABLE_CUDA=1 \
-DTENSORRT_PATH=/path/to/tensorrt \
-DCUDA_TOOLKIT_ROOT_DIR=/path/to/cuda \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
For ONNX:
# In GPT-SoVITS_minimal_inference repo
python export_onnx.py \
--gpt_path "pretrained_models/GPT_weights_v2ProPlus/your_model.ckpt" \
--sovits_path "pretrained_models/SoVITS_weights_v2ProPlus/your_model.pth" \
--cnhubert_base_path pretrained_models/chinese-hubert-base \
--bert_path pretrained_models/chinese-roberta-wwm-ext-large \
--output_dir "onnx_export/my_onnx_model" \
--max_len 1000 \
--validate \
--validation_device cuda
# Optimize: convert the exported models to FP16
python onnx_to_fp16.py \
--input_dir "onnx_export/my_onnx_model" \
--output_dir "onnx_export/my_onnx_model_fp16"
For TensorRT:
python onnx2trt.py \
--input_dir onnx_export/my_onnx_model \
--output_dir onnx_export/my_trt_model \
--shape_profile fitted \
--precision auto
Shape Profile Notes:
- C++ SDK uses fitted profile by default (optimized for 3-8s audio, ~10s max per segment)
- Suitable for ≤24GB VRAM with sentence-based inference
- For longer audio or larger VRAM, modify GetProfileDefs() in src/model/backend/tensorrt_backend.cpp
ONNX:
./build/example/gpt_sovits_cpp_cloud_create_onnx \
my_speaker \
ref.wav \
"参考文本" \
zh \
my_speaker.gsppkg
TensorRT:
./build/example/gpt_sovits_cpp_cloud_create_trt \
my_speaker \
ref.wav \
"参考文本" \
zh \
my_speaker.gsppkg
ONNX Edge Inference:
./build/example/gpt_sovits_cpp_edge_inference_onnx \
my_speaker.gsppkg \
"要合成的文本" \
zh \
my_speaker \
output.wav
ONNX Streaming Inference:
./build/example/gpt_sovits_cpp_streaming_onnx \
--speaker-package my_speaker.gsppkg \
--text "要合成的文本" \
--lang zh \
--chunk-length 24 \
--output output.wav
TensorRT Edge Inference:
./build/example/gpt_sovits_cpp_edge_inference_trt \
my_speaker.gsppkg \
"要合成的文本" \
zh \
my_speaker \
output_trt.wav
TensorRT Streaming Inference:
./build/example/gpt_sovits_cpp_streaming_trt \
--speaker-package my_speaker.gsppkg \
--text "要合成的文本" \
--lang zh \
--chunk-length 24 \
--output output_trt.wav
ONNX Backend:
#include "GPTSoVITS/InferencePipeline.h"
// Full mode: loads all models including speaker creation models
GPTSoVITS::PipelineConfig config = GPTSoVITS::PipelineConfig::Full(
"/path/to/model", GPTSoVITS::Model::DeviceType::kCUDA, 0);
config.resources_path = "./res";
config.backend = GPTSoVITS::Model::BackendType::kONNX;
GPTSoVITS::InferencePipeline pipeline(config);
// Create speaker from reference audio
pipeline.CreateSpeaker("my_speaker", "zh", "ref.wav", "参考文本");
// Export to portable package
pipeline.ExportSpeaker("my_speaker", "my_speaker.gsppkg");
TensorRT Backend:
#include "GPTSoVITS/InferencePipeline.h"
GPTSoVITS::PipelineConfig config = GPTSoVITS::PipelineConfig::Full(
"/path/to/model", GPTSoVITS::Model::DeviceType::kCUDA, 0);
config.resources_path = "./res";
config.backend = GPTSoVITS::Model::BackendType::kTensorRT;
config.engine_cache_dir = "./trt_cache"; // TensorRT engine cache
GPTSoVITS::InferencePipeline pipeline(config);
// TensorRT engines will be built automatically on first run
pipeline.CreateSpeaker("my_speaker", "zh", "ref.wav", "参考文本");
pipeline.ExportSpeaker("my_speaker", "my_speaker.gsppkg");
// Edge mode: inference only, no creation models loaded
GPTSoVITS::PipelineConfig config = GPTSoVITS::PipelineConfig::Edge(
"/path/to/model", GPTSoVITS::Model::DeviceType::kCUDA, 0);
config.resources_path = "./res";
GPTSoVITS::InferencePipeline pipeline(config);
pipeline.ImportSpeaker("my_speaker.gsppkg", "my_speaker"); // second arg: rename (optional)
GPTSoVITS::Model::SampleConfig sample_config;
sample_config.temperature = 1.0f;
sample_config.top_k = 40;
sample_config.top_p = 0.6f;
GPTSoVITS::Model::InferStats stats;
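double first_latency_ms = 0.0;
// Requires #include <chrono>
const auto infer_start = std::chrono::steady_clock::now();
auto audio = pipeline.Infer(
"my_speaker", "要合成的文本", "zh",
sample_config, /*noise_scale=*/0.35f, /*speed=*/1.0f,
&stats,
[&]() { // fires after first segment is ready
first_latency_ms = std::chrono::duration<double, std::milli>(
std::chrono::steady_clock::now() - infer_start).count();
});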
audio->SaveToFile("output.wav");
// stats.TokensPerSec(), stats.gpt_time_s, stats.sovits_time_s
ONNX Backend:
#include "GPTSoVITS/EdgePipeline.h"
#include "GPTSoVITS/StreamingPipeline.h"
// Build EdgePipeline (shared models, can serve multiple StreamingPipelines)
auto g2p = std::make_shared<GPTSoVITS::G2P::G2PPipline>();
auto bert = std::make_unique<GPTSoVITS::Model::CNBertModel>();
bert->Init<GPTSoVITS::Model::ONNXBackend>(model_path + "/bert.onnx",
"./res/bert_tokenizer.json", device);
g2p->RegisterLangProcess("zh", std::make_unique<GPTSoVITS::G2P::G2PZH>(),
std::move(bert), true);
auto enc = std::make_shared<GPTSoVITS::Model::GPTEncoderModel>();
enc->Init<GPTSoVITS::Model::ONNXBackend>(model_path + "/gpt_encoder.onnx", device);
auto step = std::make_shared<GPTSoVITS::Model::GPTStepModel>();
step->Init<GPTSoVITS::Model::ONNXBackend>(model_path + "/gpt_step.onnx", device);
auto sovits = std::make_shared<GPTSoVITS::Model::SoVITSModel>();
sovits->Init<GPTSoVITS::Model::ONNXBackend>(model_path + "/sovits.onnx", device);
auto edge = std::make_shared<GPTSoVITS::EdgePipeline>(config_json, model_path,
g2p, enc, step, sovits);
edge->ImportSpeaker("my_speaker.gsppkg", "my_speaker");
GPTSoVITS::StreamingConfig stream_cfg;
stream_cfg.chunk_length = 24; // tokens per chunk
stream_cfg.pause_length = 0.3f; // silence between sentences (s)
stream_cfg.h_len = 512; // history tokens for crossfade
stream_cfg.l_len = 16; // lookahead tokens for crossfade
stream_cfg.enable_fade = true;
auto streaming = std::make_shared<GPTSoVITS::StreamingPipeline>(edge, stream_cfg);
GPTSoVITS::Model::InferStats stats;
streaming->InferSpeakerStreaming(
"my_speaker", "要合成的文本", "zh",
[](const GPTSoVITS::AudioChunk& chunk) {
// chunk.audio_data — PCM float32 samples
// chunk.duration — seconds
// chunk.is_first / chunk.is_last
// feed to audio device or accumulate
},
/*sample_config=*/{}, /*noise_scale=*/0.35f, /*speed=*/1.0f,
&stats);
TensorRT Backend:
#include "GPTSoVITS/EdgePipeline.h"
#include "GPTSoVITS/StreamingPipeline.h"
// Build EdgePipeline with TensorRT backend
auto g2p = std::make_shared<GPTSoVITS::G2P::G2PPipline>();
auto bert = std::make_unique<GPTSoVITS::Model::CNBertModel>();
bert->Init<GPTSoVITS::Model::TensorRTBackend>(model_path + "/bert.onnx",
"./res/bert_tokenizer.json", device);
g2p->RegisterLangProcess("zh", std::make_unique<GPTSoVITS::G2P::G2PZH>(),
std::move(bert), true);
auto enc = std::make_shared<GPTSoVITS::Model::GPTEncoderModel>();
enc->Init<GPTSoVITS::Model::TensorRTBackend>(model_path + "/gpt_encoder.engine", device);
auto step = std::make_shared<GPTSoVITS::Model::GPTStepModel>();
step->Init<GPTSoVITS::Model::TensorRTBackend>(model_path + "/gpt_step.engine", device);
auto sovits = std::make_shared<GPTSoVITS::Model::SoVITSModel>();
sovits->Init<GPTSoVITS::Model::TensorRTBackend>(model_path + "/sovits.engine", device);
auto edge = std::make_shared<GPTSoVITS::EdgePipeline>(config_json, model_path,
g2p, enc, step, sovits);
edge->ImportSpeaker("my_speaker.gsppkg", "my_speaker");
// Same streaming configuration as ONNX
GPTSoVITS::StreamingConfig stream_cfg;
stream_cfg.chunk_length = 24;
stream_cfg.pause_length = 0.3f;
stream_cfg.h_len = 512;
stream_cfg.l_len = 16;
stream_cfg.enable_fade = true;
auto streaming = std::make_shared<GPTSoVITS::StreamingPipeline>(edge, stream_cfg);
// Streaming callback remains the same
streaming->InferSpeakerStreaming(
"my_speaker", "要合成的文本", "zh",
[](const GPTSoVITS::AudioChunk& chunk) {
// Process audio chunks in real-time
},
/*sample_config=*/{}, /*noise_scale=*/0.35f, /*speed=*/1.0f);
ONNX Runtime IOBinding keeps KV-cache tensors resident in VRAM across every autoregressive step. No PCIe round-trips, no cudaMemcpy per token.
Pre-allocated k_cache / v_cache output buffers. Each step swaps pointers — zero allocation in the hot loop.
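The pointer-swap idea can be illustrated with a small ping-pong buffer. This is only a sketch of the general technique, not the SDK's code: the class name `KvCachePingPong` is hypothetical, and host `std::vector` storage stands in for the device buffers that would be bound through IOBinding in practice.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical double-buffered KV cache. Both buffers are sized once for the
// maximum sequence length and then reused for every decoding step.
class KvCachePingPong {
 public:
  explicit KvCachePingPong(std::size_t max_elems)
      : a_(max_elems, 0.0f), b_(max_elems, 0.0f), in_(&a_), out_(&b_) {}

  // Copying would leave the internal pointers aimed at the source object.
  KvCachePingPong(const KvCachePingPong&) = delete;
  KvCachePingPong& operator=(const KvCachePingPong&) = delete;

  // Buffer the step model reads the previous cache from.
  std::vector<float>& input() { return *in_; }
  // Buffer the step model writes the updated cache into.
  std::vector<float>& output() { return *out_; }

  // After a step, the freshly written cache becomes the next step's input.
  // Only two pointers change; nothing is allocated or copied.
  void Swap() { std::swap(in_, out_); }

 private:
  std::vector<float> a_, b_;
  std::vector<float>* in_;
  std::vector<float>* out_;
};
```

In a decode loop, each iteration would run the step model from `input()` into `output()` and then call `Swap()`, so per-token cost stays free of allocation regardless of sequence length.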
Lookahead + history window with linear crossfade at chunk boundaries. No clicks, no pops, even at aggressive chunk sizes.
- ONNX Runtime backend (CPU + CUDA)
- Distributed inference (speaker package workflow)
- Streaming inference with crossfade
- Multi-language G2P (ZH / EN / JA)
- InferStats — tokens/s, RTF, first-packet latency
- TensorRT backend
- INT8 quantization
- Language Bindings:
- C API
- Python binding
- Rust binding
- Go binding
- Android / iOS wrapper
- (... more bindings)
Built on top of GPT-SoVITS and the engineering work in GPT-SoVITS_minimal_inference.
If this project helps you, drop a ⭐. It's free and it means a lot. 🤗