DillWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech through iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. a log-scaled Mel spectrogram). The model and architecture details are described in the paper "DiffWave: A Versatile Diffusion Model for Audio Synthesis".
Credit to the original repo here.
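To make "iterative refinement" concrete, here is a minimal sketch of the reverse diffusion loop in the style of the DiffWave paper, written in plain Python. It is not DillWave's actual code: the function names, the schedule values, and the `placeholder_noise_predictor` (which stands in for the trained network and simply predicts zero noise) are all assumptions made for illustration.

```python
import math
import random


def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.05):
    """Evenly spaced noise levels. Values are illustrative, not DillWave's config."""
    if steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (steps - 1)
    return [beta_start + i * step for i in range(steps)]


def placeholder_noise_predictor(x, t, conditioning):
    """Hypothetical stand-in for the trained network: predicts zero noise."""
    return [0.0] * len(x)


def sample(num_samples, betas, predict_noise, conditioning=None, seed=0):
    """Start from Gaussian noise and refine it step by step (DDPM-style)."""
    rng = random.Random(seed)
    alphas = [1.0 - b for b in betas]

    # Cumulative products of alpha, one per diffusion step.
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # x_T: pure Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t, conditioning)
        c1 = 1.0 / math.sqrt(alphas[t])
        c2 = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Remove the predicted noise component.
        x = [c1 * (xi - c2 * ei) for xi, ei in zip(x, eps)]
        if t > 0:
            # Re-inject a smaller amount of noise on every step except the last.
            sigma = math.sqrt(
                (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
            )
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


if __name__ == "__main__":
    waveform = sample(16, linear_beta_schedule(10), placeholder_noise_predictor)
    print(len(waveform))
```

With a real model, `predict_noise` would be the trained network and `conditioning` would carry the spectrogram; with the placeholder the output is just noise, so the sketch only shows the control flow.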
You need an NVIDIA GPU in the RTX 30XX-40XX range.
For training, 16+ GB of VRAM is recommended. For inference, at least 4 GB of VRAM is recommended.
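If you are unsure how much VRAM your card has, a small script along these lines can check it against the figures above. This is a sketch, not part of dillwave: `recommended_use` and `detect_vram_bytes` are hypothetical helper names, and the detection step assumes PyTorch with CUDA support is already installed (it returns `None` otherwise). Thresholds are compared in decimal gigabytes, since cards sold as "16 GB" usually report slightly under 16 GiB.

```python
def recommended_use(total_bytes):
    """Map a VRAM size in bytes onto the recommendations in this README."""
    gb = total_bytes / 1e9  # decimal gigabytes
    if gb >= 16:
        return "training and inference"
    if gb >= 4:
        return "inference only"
    return "below the recommended minimum"


def detect_vram_bytes(device_index=0):
    """Return total VRAM of a CUDA device in bytes, or None if unavailable."""
    try:
        import torch  # imported lazily so the helper above works without it
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(device_index).total_memory


if __name__ == "__main__":
    vram = detect_vram_bytes()
    if vram is None:
        print("No CUDA device detected (or PyTorch is not installed).")
    else:
        print(f"{vram / 1e9:.1f} GB VRAM: {recommended_use(vram)}")
```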
First install PyTorch (the GPU version is recommended). You also need Python; version 3.10.X is recommended for dillwave.
From GitHub:
git clone https://github.com/dillfrescott/dillwave
pip install -e dillwave
or
pip install git+https://github.com/dillfrescott/dillwave
You need Git installed for either of these "From GitHub" install methods to work.
python -m dillwave.preprocess /path/to/dir/containing/wavs # expects 48000 Hz, 1 channel (mono) WAVs; 8 seconds per clip is recommended
python -m dillwave /path/to/model/dir /path/to/dir/containing/wavs
# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all
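Before running the preprocess step, it can help to confirm that every clip matches the expected format (48000 Hz, 1 channel). The following is a minimal sketch using only Python's standard `wave` module; `check_wav` and `check_directory` are hypothetical helpers, not part of dillwave. It assumes PCM WAV files sitting directly in one directory (the `wave` module cannot read float-encoded WAVs, and subdirectories are not scanned).

```python
import wave
from pathlib import Path


def check_wav(path, expected_rate=48000, expected_channels=1):
    """Return (list of problems, duration in seconds) for one PCM WAV file."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channels, expected {expected_channels}")
    return problems, duration


def check_directory(directory):
    """Map file name -> problems for every *.wav in `directory` that has any."""
    report = {}
    for path in sorted(Path(directory).glob("*.wav")):
        problems, _ = check_wav(path)
        if problems:
            report[path.name] = problems
    return report


if __name__ == "__main__":
    import sys

    for name, problems in check_directory(sys.argv[1]).items():
        print(f"{name}: {'; '.join(problems)}")
```

Run it with the dataset directory as the only argument; no output means every file passed. The returned duration can also be used to spot clips that are far from the recommended 8 seconds.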
python -m dillwave.inference /path/to/model --spectrogram_path /path/to/spectrogram -o output.wav [--fast]