Code repository for the 2025 ICML WMLA workshop submission.
```bibtex
@misc{zhang2025audiovisualspeechseparationbottleneck,
    title={Audio-Visual Speech Separation via Bottleneck Iterative Network},
    author={Sidong Zhang and Shiv Shankar and Trang Nguyen and Andrea Fanelli and Madalina Fiterau},
    year={2025},
    eprint={2507.07270},
    archivePrefix={arXiv},
    primaryClass={cs.SD},
    url={https://arxiv.org/abs/2507.07270},
}
```
`profusion_mbt` and `prombt` are alternative/historical names of our proposed Bottleneck Iterative Network (`BIN`); `avlit`, `iia` (short for IIA-Net), and `rtfs` (short for RTFS-Net) are benchmark models studied in our work.
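For quick reference, the codename-to-model mapping above can be captured in a small lookup table. This dict and the `describe` helper are our own illustration, not part of the repo:

```python
# Maps the codenames that appear in script/file names to the models they
# refer to (names taken from the README; the helper itself is hypothetical).
MODEL_CODENAMES = {
    "profusion_mbt": "Bottleneck Iterative Network (BIN, proposed)",
    "prombt": "Bottleneck Iterative Network (BIN, proposed)",
    "avlit": "AVLIT (benchmark)",
    "iia": "IIA-Net (benchmark)",
    "rtfs": "RTFS-Net (benchmark)",
}

def describe(codename: str) -> str:
    """Return a human-readable description for a model codename."""
    return MODEL_CODENAMES.get(codename, f"unknown codename: {codename!r}")
```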
- sbatch scripts in the `sbatch` folder follow the naming pattern `<data>_<model>_<gpu-type>.sh`; for example, `sbatch/lrs3wham_profusion_mbt_a100.sh` trains the proposed `BIN` model on LRS3 + WHAM data on an A100 GPU via the Python script `scripts/run_lrs3wham_prombt.py`.
- `BIN` implementation can be found here:
- The main training pipeline follows this abstract class:
- Training pipeline for `BIN` on LRS3 is here:
- Training pipeline for `BIN` on NTCD-TIMIT is here:
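The `<data>_<model>_<gpu-type>.sh` naming convention can be decoded mechanically. A minimal sketch (our own helper, not part of the repo), assuming the dataset is the first underscore-separated token and the GPU type is the last, so multi-token codenames like `profusion_mbt` stay intact:

```python
from pathlib import Path

def parse_sbatch_name(path: str) -> dict:
    """Split a training sbatch filename of the form <data>_<model>_<gpu-type>.sh."""
    stem = Path(path).stem                      # drop the .sh suffix
    data, *model_parts, gpu = stem.split("_")   # first token = data, last = gpu
    return {"data": data, "model": "_".join(model_parts), "gpu": gpu}

# e.g. parse_sbatch_name("sbatch/lrs3wham_profusion_mbt_a100.sh")
# -> {"data": "lrs3wham", "model": "profusion_mbt", "gpu": "a100"}
```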
- Evaluation sbatch scripts follow the naming pattern `eval_<data>_<model>.sh`, for example `sbatch/eval_lrs3wham_prombt.sh`.
- This standalone evaluation script generates the separated audio tracks for each test mixture, instead of only reporting an overall performance metric as the evaluation inside the training process does.
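Writing one wav per separated source per test mixture can be sketched as follows. This is a minimal stdlib-only illustration; the `<mixture_id>_spk<i>.wav` layout and the synthetic sine-tone "model output" are our assumptions, not the repo's actual format:

```python
import math
import struct
import wave
from pathlib import Path

def write_wav(path, samples, sr=16000):
    """Write a mono 16-bit PCM wav from an iterable of floats in [-1, 1]."""
    with wave.open(str(path), "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sr)
        f.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples))

def save_separated_tracks(out_dir, mixture_id, tracks, sr=16000):
    """Save one wav per separated source for a given test mixture.

    The <mixture_id>_spk<i>.wav naming is hypothetical; the repo's
    evaluation script may use a different output layout.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, samples in enumerate(tracks):
        p = out / f"{mixture_id}_spk{i}.wav"
        write_wav(p, samples, sr)
        paths.append(p)
    return paths

# Demo: two synthetic sine tones stand in for the model's separated sources.
def tone(hz, sr=16000, n=1600):
    return [0.5 * math.sin(2 * math.pi * hz * t / sr) for t in range(n)]

saved = save_separated_tracks("sep_out", "mix0001", [tone(440), tone(660)])
```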