Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS
- Added data2vec as content encoder for improved semantic representation
- Implemented hybrid pitch detection: 75% CREPE + 15% RMVPE for more robust pitch extraction
- Added Mel Cepstrum loss for better spectral envelope matching
- Continued improvements from previous versions:
  - Whisper fusion of v2 and v3 models for content encoding
  - Phase loss, high-frequency mel loss, and VGGish loss for enhanced audio quality
  - ContentVec integration for better content representation
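The hybrid pitch detection listed above can be sketched as a per-frame weighted blend of the two F0 tracks. This is an illustrative sketch only: the function name, the voicing fallback, and the normalization are assumptions, and the weights simply follow the values stated in the changelog.

```python
import numpy as np

def blend_f0(f0_crepe: np.ndarray, f0_rmvpe: np.ndarray,
             w_crepe: float = 0.75, w_rmvpe: float = 0.15) -> np.ndarray:
    """Illustrative weighted blend of two frame-aligned F0 tracks (Hz).

    Frames where only one estimator detects voicing (non-zero F0)
    fall back to that estimator alone; weights are renormalized so
    blended frames stay on the same Hz scale.
    """
    f0_crepe = np.asarray(f0_crepe, dtype=np.float64)
    f0_rmvpe = np.asarray(f0_rmvpe, dtype=np.float64)
    both = (f0_crepe > 0) & (f0_rmvpe > 0)
    blended = (w_crepe * f0_crepe + w_rmvpe * f0_rmvpe) / (w_crepe + w_rmvpe)
    # np.maximum picks the single voiced estimate when the other is 0
    return np.where(both, blended, np.maximum(f0_crepe, f0_rmvpe))
```

With frame-aligned inputs `[220, 0]` and `[222, 110]`, the first frame is a renormalized blend of both estimates and the second falls back to RMVPE alone.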
This project now supports multiple compute devices:
- NVIDIA GPUs (CUDA): Full support with optimal performance
- Apple Silicon (M1/M2/M3 via MPS): Hardware acceleration via Metal Performance Shaders
- CPU: Fallback option when no GPU is available
The device is automatically detected and selected based on availability. Priority order: CUDA > MPS > CPU.
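The priority order above can be sketched as follows; the function name is illustrative, and the project's actual detection helper may differ:

```python
import torch

def select_device() -> torch.device:
    """Pick the best available device: CUDA > MPS > CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda:0")
    # torch.backends.mps only exists on PyTorch builds with MPS support
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```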
After installing PyTorch, you can verify that your device is correctly detected:
```shell
python verify_device.py
```

This script will display:
- Your system information and architecture
- Available compute devices (CUDA/MPS/CPU)
- Which device will be used for training and inference
- A simple test to verify the device is working correctly
Expected output on Apple Silicon Macs:

```
Selected Device: mps
Device Type: mps
Device Name: MPS (Apple Silicon)
✓ Apple Silicon (MPS) GPU acceleration is available and working
```
Expected output on NVIDIA GPU systems:

```
Selected Device: cuda:0
Device Type: cuda
Device Name: CUDA (NVIDIA GeForce RTX ...)
✓ CUDA GPU acceleration is available and working
```
To run the device detection test suite:

```shell
# Run all device detection tests
python -m pytest tests/test_device_detection.py -v

# Or run with unittest
python tests/test_device_detection.py
```

SoVits-SVC can be bundled as a native macOS application using PyInstaller and pywebview.
On macOS systems, you can build a standalone `.app` bundle:

```shell
./build_local.sh
```

This creates `dist/SoVitsSVC.app`, which can be distributed to users.
- Native macOS app with pywebview for a native window experience
- Code signed (ad-hoc signing by default, can use Apple Developer certificate)
- DMG installer for easy distribution
- Apple Silicon optimized - automatically uses Metal Performance Shaders (MPS)
- No terminal required - runs as a standard macOS application
For detailed build instructions, see `BUILD_MACOS.md`.
For automated builds via GitHub Actions, see `.github/workflows/build-release.yml`.
- Install PyTorch. For Apple Silicon (M1/M2/M3) users: PyTorch will automatically use MPS (Metal Performance Shaders) for GPU acceleration when available.
- Install project dependencies:

  ```shell
  pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
  ```

  Note: whisper is already built in; do not install it again, or it will cause conflicts and errors.
- Download the Timbre Encoder: Speaker-Encoder by @mueller91, and put `best_model.pth.tar` into `speaker_pretrain/`.
- Download the whisper model (whisper-large-v2 or whisper-large-v3). Make sure to download the model file and put it into `whisper_pretrain/`.
- Download the hubert_soft model and put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.
- Download the RMVPE model and put `rmvpe.pt` into `rmvpe_pretrain/`.
- Download the pretrained model sovits5.0.pretrain.pth and put it into `vits_pretrain/`. Then run inference:

  ```shell
  python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
  ```
If MPS is not detected on your Mac:
- Check your PyTorch version: MPS support requires PyTorch 1.12 or later.

  ```shell
  python -c "import torch; print(torch.__version__)"
  ```

- Verify your macOS version: MPS requires macOS 12.3 or later.

  ```shell
  sw_vers
  ```

- Install/update PyTorch:

  ```shell
  pip3 install --upgrade torch torchvision torchaudio
  ```

- Verify MPS availability:

  ```shell
  python -c "import torch; print(f'MPS available: {torch.backends.mps.is_available()}')"
  ```

- Run the verification script for detailed diagnostics:

  ```shell
  python verify_device.py
  ```
"MPS backend out of memory":
- MPS has memory limitations; try reducing the batch size or using the CPU for large models.
- You can force CPU usage by setting the device preference in the code.
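Forcing CPU usage can be sketched as below; the tiny `Linear` model is only a placeholder standing in for the real SVC model.

```python
import torch

# Pin everything to CPU even when CUDA/MPS is available,
# e.g. to work around "MPS backend out of memory".
device = torch.device("cpu")
model = torch.nn.Linear(4, 2).to(device)
x = torch.randn(1, 4, device=device)
y = model(x)
print(y.device)  # cpu
```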
Performance Issues on MPS:
- First run may be slower due to Metal shader compilation
- Some operations may fall back to CPU automatically
- Overall performance should still be significantly better than CPU-only
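PyTorch also exposes the `PYTORCH_ENABLE_MPS_FALLBACK` environment variable, which lets operators without an MPS kernel fall back to CPU instead of raising an error; a sketch of enabling it (it must be set before `torch` is imported):

```python
import os

# Must be set before torch is imported: operators without an MPS
# kernel then fall back to CPU instead of raising an error.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # noqa: E402
```

Equivalently, set it in the shell before launching: `PYTORCH_ENABLE_MPS_FALLBACK=1 python svc_inference.py ...`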
Necessary pre-processing:
- Separate vocals from accompaniment with UVR (skip if there is no accompaniment).
- Cut the audio into shorter clips with a slicer; Whisper accepts inputs shorter than 30 seconds.
- Manually check the generated clips; remove any shorter than 2 seconds or with obvious noise.
- Adjust loudness if necessary; Adobe Audition is recommended.
- Put the dataset into the `dataset_raw` directory following the structure below.