Question about MelSpectrogram #287

ductho9799 · 2024-10-01T04:51:29Z

In the StyleTTS2 paper, the datasets were resampled to 24 kHz. But in the StyleTTS2 code, the mel spectrogram is computed with:
to_mel = torchaudio.transforms.MelSpectrogram(n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
Since sample_rate is not passed, this uses the default sample_rate=16000.
If I change sample_rate to 24000, will it affect the model results?

magicse · 2025-08-03T15:16:30Z

Hi @ductho9799
HiFi-GAN was trained to map "16 kHz" mels (computed from 24 kHz audio, but with the incorrect sample_rate of 16 kHz) to 24 kHz waveforms, so it works somewhat like a fake upsampler.
Therefore, we now have to feed 24 kHz audio into a mel spectrogram configured for 16 kHz, without any real resampling.

I think this was a mistake made when training the HiFi-GAN model (torchaudio.transforms.MelSpectrogram defaults to sample_rate=16000), and now it is baked into the pipeline. The correct way, of course, is to compute the mel spectrograms from 24 kHz audio with the real sample rate of 24 kHz, so that HiFi-GAN produces a 24 kHz waveform from them.
To fix this, you need to retrain all the models (ASR, JDC, and HiFi-GAN) with correct mels, explicitly setting sample_rate=24000:

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000,
    n_fft=2048,
    win_length=1200,
    hop_length=300,
    n_mels=80,
)
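
To see what the sample_rate argument actually changes, here is a rough standard-library sketch (not the StyleTTS2 or torchaudio code). It assumes the torchaudio defaults as I understand them: HTK mel scale, f_min=0, and f_max=sample_rate/2. The helper names (hz_to_mel, mel_centers, effective_centers) are made up for this illustration. The STFT itself does not depend on sample_rate; only the placement of the mel filters does, so a filterbank built for 16 kHz and applied to 24 kHz audio ends up with every filter centre stretched by 24000/16000 = 1.5.

```python
import math

def hz_to_mel(f):
    # HTK mel formula (assumed to match torchaudio's default mel_scale="htk")
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_centers(nominal_sr, n_mels=80, f_min=0.0):
    # Filter centres are the interior points of n_mels + 2 values
    # spaced evenly on the mel scale between f_min and nominal_sr / 2.
    m_min = hz_to_mel(f_min)
    m_max = hz_to_mel(nominal_sr / 2)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]

def effective_centers(nominal_sr, true_sr, n_mels=80):
    # FFT bin k is treated as k * nominal_sr / n_fft when building the
    # filters, but really sits at k * true_sr / n_fft in the audio.
    scale = true_sr / nominal_sr
    return [c * scale for c in mel_centers(nominal_sr, n_mels)]

baked_in = effective_centers(16000, 24000)   # what the released pipeline does
corrected = effective_centers(24000, 24000)  # sample_rate=24000 on 24 kHz audio

for i in (0, 39, 79):
    print(f"filter {i:2d}: baked-in {baked_in[i]:8.1f} Hz, corrected {corrected[i]:8.1f} Hz")
```

Under these assumptions both filterbanks still cover the full 0 to 12 kHz band, but the filters are warped differently, which is why mels computed with sample_rate=24000 would not match what the pretrained models saw during training.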
