Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS
- Added data2vec as content encoder for improved semantic representation
- Implemented hybrid pitch detection: 75% CREPE + 15% RMVPE for more robust pitch extraction
- Added Mel Cepstrum loss for better spectral envelope matching
- Continued improvements from previous versions:
  - Whisper fusion of v2 and v3 models for content encoding
  - Phase loss, high-frequency mel loss, and VGGish loss for enhanced audio quality
  - ContentVec integration for better content representation
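The hybrid pitch detection listed above can be sketched as a per-frame weighted blend of the two F0 tracks. This is an illustrative sketch only: the function name, the voicing fallback, and the normalization are assumptions, and the weights simply follow the values stated in the changelog.

```python
import numpy as np

def blend_f0(f0_crepe: np.ndarray, f0_rmvpe: np.ndarray,
             w_crepe: float = 0.75, w_rmvpe: float = 0.15) -> np.ndarray:
    """Illustrative weighted blend of two frame-aligned F0 tracks (Hz).

    Frames where only one estimator detects voicing (non-zero F0)
    fall back to that estimator alone; weights are renormalized so
    blended frames stay on the same Hz scale.
    """
    f0_crepe = np.asarray(f0_crepe, dtype=np.float64)
    f0_rmvpe = np.asarray(f0_rmvpe, dtype=np.float64)
    both = (f0_crepe > 0) & (f0_rmvpe > 0)
    blended = (w_crepe * f0_crepe + w_rmvpe * f0_rmvpe) / (w_crepe + w_rmvpe)
    # np.maximum picks the single voiced estimate when the other is 0
    return np.where(both, blended, np.maximum(f0_crepe, f0_rmvpe))
```

With frame-aligned inputs `[220, 0]` and `[222, 110]`, the first frame is a renormalized blend of both estimates and the second falls back to RMVPE alone.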
This project now supports multiple compute devices:
- NVIDIA GPUs (CUDA): Full support with optimal performance
- Apple Silicon (M1/M2/M3 via MPS): Hardware acceleration via Metal Performance Shaders
- CPU: Fallback option when no GPU is available
The device is automatically detected and selected based on availability. Priority order: CUDA > MPS > CPU.
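The priority order above can be sketched as follows; the function name is illustrative, and the project's actual detection helper may differ:

```python
import torch

def select_device() -> torch.device:
    """Pick the best available device: CUDA > MPS > CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda:0")
    # torch.backends.mps only exists on PyTorch builds with MPS support
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```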
After installing PyTorch, you can verify that your device is correctly detected:
```shell
python verify_device.py
```

This script will display:
- Your system information and architecture
- Available compute devices (CUDA/MPS/CPU)
- Which device will be used for training and inference
- A simple test to verify the device is working correctly
Expected output on Apple Silicon Macs:

```
Selected Device: mps
Device Type: mps
Device Name: MPS (Apple Silicon)
✓ Apple Silicon (MPS) GPU acceleration is available and working
```
Expected output on NVIDIA GPU systems:

```
Selected Device: cuda:0
Device Type: cuda
Device Name: CUDA (NVIDIA GeForce RTX ...)
✓ CUDA GPU acceleration is available and working
```
To run the device detection test suite:

```shell
# Run all device detection tests
python -m pytest tests/test_device_detection.py -v

# Or run with unittest
python tests/test_device_detection.py
```

SoVits-SVC can be bundled as a native macOS application using PyInstaller and pywebview.
On macOS systems, you can build a standalone `.app` bundle:

```shell
./build_local.sh
```

This creates `dist/SoVitsSVC.app`, which can be distributed to users.
- Native macOS app with pywebview for a native window experience
- Code signed (ad-hoc signing by default, can use Apple Developer certificate)
- DMG installer for easy distribution
- Apple Silicon optimized - automatically uses Metal Performance Shaders (MPS)
- No terminal required - runs as a standard macOS application
For detailed build instructions, see `BUILD_MACOS.md`.
For automated builds via GitHub Actions, see `.github/workflows/build-release.yml`.
- Install PyTorch. For Apple Silicon (M1/M2/M3) users: PyTorch will automatically use MPS (Metal Performance Shaders) for GPU acceleration when available.
- Install project dependencies:

  ```shell
  pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
  ```

  Note: whisper is already built in; do not install it again, or it will cause conflicts and errors.
- Download the Timbre Encoder: Speaker-Encoder by @mueller91, and put `best_model.pth.tar` into `speaker_pretrain/`.
- Download the whisper model (whisper-large-v2 or whisper-large-v3). Make sure to download the model file and put it into `whisper_pretrain/`.
- Download the hubert_soft model and put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.
- Download the RMVPE model and put `rmvpe.pt` into `rmvpe_pretrain/`.
- Download the pretrained model sovits5.0.pretrain.pth and put it into `vits_pretrain/`. Then run inference:

  ```shell
  python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
  ```
If MPS is not detected on your Mac:
- Check your PyTorch version: MPS support requires PyTorch 1.12 or later.

  ```shell
  python -c "import torch; print(torch.__version__)"
  ```

- Verify your macOS version: MPS requires macOS 12.3 or later.

  ```shell
  sw_vers
  ```

- Install/update PyTorch:

  ```shell
  pip3 install --upgrade torch torchvision torchaudio
  ```

- Verify MPS availability:

  ```shell
  python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
  ```

- Run the verification script for detailed diagnostics:

  ```shell
  python verify_device.py
  ```
"MPS backend out of memory":
- MPS has memory limitations; try reducing the batch size or using the CPU for large models.
- You can force CPU usage by setting the device preference in the code.
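Forcing CPU usage can be sketched as below; the tiny `Linear` model is only a placeholder standing in for the real SVC model.

```python
import torch

# Pin everything to CPU even when CUDA/MPS is available,
# e.g. to work around "MPS backend out of memory".
device = torch.device("cpu")
model = torch.nn.Linear(4, 2).to(device)
x = torch.randn(1, 4, device=device)
y = model(x)
print(y.device)  # cpu
```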
Performance Issues on MPS:
- First run may be slower due to Metal shader compilation
- Some operations may fall back to CPU automatically
- Overall performance should still be significantly better than CPU-only
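PyTorch also exposes the `PYTORCH_ENABLE_MPS_FALLBACK` environment variable, which lets operators without an MPS kernel fall back to CPU instead of raising an error; a sketch of enabling it (it must be set before `torch` is imported):

```python
import os

# Must be set before torch is imported: operators without an MPS
# kernel then fall back to CPU instead of raising an error.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # noqa: E402
```

Equivalently, set it in the shell before launching: `PYTORCH_ENABLE_MPS_FALLBACK=1 python svc_inference.py ...`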
Necessary pre-processing:
- Separate vocals from accompaniment with UVR (skip if there is no accompaniment).
- Cut the audio into shorter clips with a slicer; Whisper accepts inputs shorter than 30 seconds.
- Manually check the generated clips; remove any shorter than 2 seconds or with obvious noise.
- Adjust loudness if necessary; Adobe Audition is recommended.
- Put the dataset into the `dataset_raw` directory following the structure below.