The paper says that all audio is resampled to 16 kHz and that the model is trained on Libri-Light. But the VAE operates on 24 kHz latents?