This repository contains the code and materials for the Enrichment Assignments of the course SOW-MKI85 Machine Hearing (2024-2025).
These assignments aim to give you hands-on experience with deep learning approaches for machine hearing. You will learn to pre-process audio, extract relevant audio features, train a ResNet-18 model on an environmental sound classification task, and evaluate the performance of the trained model.
The assignments consist of six sessions. Session 1 and Session 2 introduce the assignment, the dataset, relevant concepts, frameworks, libraries, and audio feature extraction. Session 3 and Session 4 cover data preprocessing and training the ResNet-18 model on different sets of audio features, while Session 5 focuses on analyzing and visualizing model performance. To conclude, Session 6 is dedicated to placing your findings within a wider theoretical framework based on your newly acquired knowledge of AI for Audio.
After successful completion of the enrichment assignments, you can...
• Describe, extract and analyze relevant audio features for sound classification.
• Implement and train a ResNet-18 model using various audio features.
• Evaluate and compare model performance for a sound classification task using relevant performance metrics.
• Visualize data and results in a meaningful, informative way.
• Interpret findings within the wider theoretical framework of AI for Audio.
The dataset that we are using for these assignments is the Environmental Sound Classification 50 (ESC-50) dataset [1]. This dataset consists of 2,000 labeled sound clips of 5 seconds each, covering 50 classes grouped into five major categories: “Animals”, “Natural soundscapes & water sounds”, “Human, non-speech sounds”, “Interior/domestic sounds”, and “Exterior/urban noises”. More information about the dataset can be found [here](https://github.com/karolpiczak/ESC-50).
The link to the ESC-50 database and metafile is on Brightspace in 'Content' --> 'Practical' --> 'Dataset'.
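To get a first overview of the data, you can inspect the metadata file with pandas. The sketch below is a minimal example; the path assumes you unpacked the dataset to a folder named `ESC-50` (for example on your Google Drive when working in Colab), so adjust it to your own setup.

```python
import pandas as pd

# Path to the ESC-50 metadata file; adjust to where you stored the dataset
# (for example on Google Drive when working in Colab).
META_PATH = "ESC-50/meta/esc50.csv"

meta = pd.read_csv(META_PATH)

# Each row describes one 5-second clip: filename, cross-validation fold,
# numeric target (0-49) and the human-readable category label.
print(meta[["filename", "fold", "target", "category"]].head())

# 50 classes with 40 clips each.
print(meta["category"].value_counts())
```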
For these assignments, we make use of the ResNet-18 model [2]. ResNet models use skip connections to learn residual functions with respect to the input, rather than learning unreferenced functions as is the case in most neural networks. These skip connections mitigate the vanishing/exploding gradient problem that deep neural networks encounter, resulting in faster convergence and better performance.
Here, we make use of the 18-layer variant of the ResNet architecture. Despite its relatively low complexity, this small ResNet performs on par with other state-of-the-art architectures and converges faster [2].
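To make the skip connection concrete, the sketch below shows a minimal residual block in PyTorch, where the learned residual F(x) is added back to the input before the final activation. It is an illustration only, not the exact implementation used in the assignments; one possible starting point is `torchvision.models.resnet18`, with the first convolution and the final fully connected layer adapted to single-channel spectrogram input and the 50 ESC-50 classes, as shown at the end of the snippet.

```python
import torch
import torch.nn as nn
import torchvision

class BasicResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # F(x): two convolution + batch-norm stages
        residual = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        # Skip connection: add the input back before the final non-linearity
        return torch.relu(residual + x)

# One possible starting point: torchvision's ResNet-18, adapted to
# single-channel spectrogram input and the 50 ESC-50 classes.
model = torchvision.models.resnet18(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 50)
```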
- PyTorch: The assignments use the open-source deep learning library PyTorch to implement the ResNet-18 model.
- Torchaudio: The assignments use Torchaudio to compute and extract relevant audio features. Torchaudio is a library for audio and signal processing with PyTorch; a good alternative is Librosa. A minimal feature-extraction example is sketched below this list.
- WandB (Weights & Biases): The assignments use the Weights & Biases MLOps platform to visualize and track training progress.
- Colab: You can work on the assignments using Google Colab, a hosted Jupyter Notebook service. Use Google Drive for data storage. If you do not yet have an account, please sign up for one.
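As referenced in the Torchaudio item above, here is a minimal feature-extraction sketch. The file path and transform parameters (FFT size, hop length, number of mel bands/MFCCs) are example assumptions, not the settings prescribed in the assignments.

```python
import torchaudio

# Load one ESC-50 clip; replace the path with a file from your own copy of the dataset.
waveform, sample_rate = torchaudio.load("ESC-50/audio/1-100032-A-0.wav")

# Log-mel spectrogram: a common input representation for ResNet-style models.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=64
)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel_transform(waveform))

# MFCCs as an alternative feature set.
mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)(waveform)

print(log_mel.shape, mfcc.shape)  # (channels, n_mels, frames) and (channels, n_mfcc, frames)
```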
You can download the assignments, clone the repository, or open an assignment in Google Colab using the link at the top of the script. Assignments will be added no later than the evening before each practical session. In total, there will be six assignments.
The Practical Assignments are completed with a Practical Report consisting of two parts:
- Part 1: Audio feature extraction; template practical report part 1.
- Part 2: Sound classification using the ResNet-18 model; template practical report part 2.
[1] Piczak, K. J. (2015, October). ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia (pp. 1015-1018).
[2] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).