Conversation
```python
sample_rate: int = 22050,
n_fft: int = 1024,
hop_length: int = 256,
n_mels: int = 80,
```
---
Is n_mels in loss.py here meant to have its default changed to 80? In feature_extractors.py it remains at 100, so presumably the default in loss.py was also meant to stay at 100 and only be adjusted by vocos-matcha.yaml?
---
You're right, we should keep n_mels at 100 in loss.py. Also, in feature_extractors.py the defaults should be:

```python
f_max=None,
norm=None,
mel_scale="htk",
```
---
Would you happen to have any reference on the decision between 80 and 100 n_mels?
I understand 80 has been quite common, so many models are trained with it as a result, but I'm curious about the original rationale:
- Is 80 intended to be sufficient for speech specifically?
- I came across a paper recently that cited 96 as a minimum for covering not only speech, but also music and general sound effects.
80 and 96 are both multiples of 8, which I'm familiar with being preferable for compute (at least traditionally, much as games used such sizes for textures, although those tend to be powers of 2, hence 64 vs 128). Perhaps Vocos just rounded that up to 100 🤔 I'm not sure whether that would actually regress anywhere vs 96 😅
Various Text-to-Speech (TTS) implementations (Grad-TTS, Matcha-TTS, P-flow) rely on the mel spectrogram feature extractor code found in hifi-gan.
This PR modifies the feature extractor so that Vocos works seamlessly with the outputs generated by those TTS systems.
To achieve this, the parameters of torchaudio.transforms.MelSpectrogram were adjusted to match the features generated in the hifi-gan codebase; specifically, the frequency limits and the mel scale were changed. We trained Vocos for 400k steps with these changes and were able to obtain reasonably good-quality audio from the output of Matcha-TTS.
Closes #39