Final Year Project submitted for the degree of B.Sc. (Hons.) in Computer Engineering at the University of Malta, under the supervision of Dr. Trevor Spiteri.
This project explores the application of machine learning (ML) techniques for speech enhancement, aiming to demonstrate their advantages over classical methods in real-time denoising scenarios.
A modular and extensible Python pipeline was built to support:
- Variable-length waveform inputs using dynamic bucketing
- Real-time inference via spectrogram-based denoising
- Consistent metric-based evaluation for batch and single-file use
- Reproducible experiments via centralised configuration and logging
Three classical denoising methods — Spectral Subtraction (SS), Wiener Filtering (WF), and MMSE-LSA — were implemented to provide baselines.
These were compared against five ML models trained from scratch:
- CNN: A shallow baseline to verify the pipeline
- CED / R-CED: Encoder-decoder variants with and without residuals
- UNet: A deeper skip-connected model for better feature preservation
- Conv-TasNet: A time-domain network yielding top performance
All models were trained on magnitude spectrograms, with batch handling strategies (Static, Dynamic Bucketing, PTO) evaluated separately.
The pipeline supports training from scratch, memory-efficient inference, and consistent evaluation, laying the foundation for future experimentation with advanced architectures.
All methods were evaluated with five metrics:
- SNR – Signal-to-noise ratio
- MSE – Mean squared error
- PESQ – Perceptual evaluation of speech quality
- STOI – Short-time objective intelligibility
- LSD – Log-spectral distance
The metrics were selected based on relevance in related literature and provide a balanced view of both numerical accuracy and perceptual quality in denoising performance.
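A minimal sketch of how these metrics can be computed, assuming equal-length 16 kHz mono NumPy arrays and the third-party `pesq` and `pystoi` packages; the function names and file paths are illustrative, not the pipeline's actual code.

```python
# Illustrative metric computations, assuming `clean` and `denoised` are
# equal-length 16 kHz mono float arrays of real speech (PESQ rejects
# non-speech input). Not the pipeline's actual evaluation code.
import numpy as np
import librosa
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def snr_db(clean, denoised):
    noise = clean - denoised
    return 10 * np.log10(np.sum(clean**2) / np.sum(noise**2))

def lsd_db(clean, denoised, n_fft=512, hop=128, eps=1e-10):
    # Log-spectral distance: RMS of the per-frame log power-spectrum gap.
    s1 = np.abs(librosa.stft(clean, n_fft=n_fft, hop_length=hop)) ** 2
    s2 = np.abs(librosa.stft(denoised, n_fft=n_fft, hop_length=hop)) ** 2
    diff = 10 * np.log10((s1 + eps) / (s2 + eps))
    return float(np.mean(np.sqrt(np.mean(diff**2, axis=0))))

fs = 16000
clean, _ = librosa.load("clean.wav", sr=fs)        # placeholder paths
denoised, _ = librosa.load("denoised.wav", sr=fs)
print("SNR :", snr_db(clean, denoised))
print("MSE :", float(np.mean((clean - denoised) ** 2)))
print("PESQ:", pesq(fs, clean, denoised, "wb"))    # wideband mode at 16 kHz
print("STOI:", stoi(clean, denoised, fs, extended=False))
print("LSD :", lsd_db(clean, denoised))
```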
| Method | ↑SNR (dB) | ↓MSE | ↑PESQ | ↑STOI | ↓LSD (dB) | Time (s) |
|---|---|---|---|---|---|---|
| Baseline | -2.28 | 0.005152 | 1.8451 | 0.8928 | 0.9042 | 56 |
| SS | 3.09 | 0.001525 | 1.4535 | 0.8457 | 0.7671 | 61 |
| WF | 0.46 | 0.002875 | 2.0639 | 0.8889 | 0.7535 | 73 |
| MMSE-LSA | -0.86 | 0.003726 | 2.0238 | 0.8943 | 0.7971 | 83 |
| CNN | 4.64 | 0.001344 | 1.7410 | 0.8073 | 0.7956 | 71 |
| CED | 13.19 | 0.000161 | 1.6780 | 0.8386 | 0.7655 | 65 |
| R-CED | 14.53 | 0.000117 | 2.0542 | 0.8677 | 0.6480 | 74 |
| UNet | 16.99 | 0.000069 | 2.1384 | 0.8940 | 0.7076 | 87 |
| Conv-TasNet | 18.06 | 0.000063 | 2.4329 | 0.9112 | 0.6741 | 139 |
Dynamic Bucketing was selected as the preferred dataset handling method. It used K-Means clustering to assign samples into optimally sized buckets, improving training efficiency over Static Bucketing while avoiding the runtime penalties of Padding-Truncation Output-Truncation (PTO) during inference.
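A toy sketch of the K-Means bucketing idea, assuming a list of clip lengths; this helper is illustrative, not the implementation in Utils/dataset.py.

```python
# Toy K-Means length bucketing: cluster clip lengths, then pad each clip
# only to the longest clip in its own bucket, instead of the global maximum.
import numpy as np
from sklearn.cluster import KMeans

def assign_buckets(lengths, n_buckets=3, seed=0):
    x = np.asarray(lengths, dtype=float).reshape(-1, 1)
    labels = KMeans(n_clusters=n_buckets, n_init=10, random_state=seed).fit_predict(x)
    # Pad target per bucket: the longest clip assigned to that bucket.
    pad_to = {int(b): int(x[labels == b].max()) for b in np.unique(labels)}
    return labels, pad_to

lengths = [120, 130, 150, 410, 430, 875, 880, 900]  # frames per clip
labels, pad_to = assign_buckets(lengths)
print(labels, pad_to)  # short clips avoid being padded to the global max (900)
```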
OOM mitigation techniques, including mixed-precision (FP16) training, garbage collection, and gradient accumulation, enabled training of deeper models on the university GPU cluster. Evaluation showed no degradation in model quality and, in some cases, slight improvements attributable to the stabilising effect of gradient accumulation.
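A hedged PyTorch sketch of these mitigations (FP16 autocast, gradient accumulation, periodic cache clearing); the model, loss, and data below are placeholders, not the project's training loop.

```python
# Placeholder training loop showing FP16 autocast, gradient accumulation,
# and periodic memory clean-up; all names here are illustrative.
import gc
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(1, 1, 3, padding=1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()
loader = [(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)) for _ in range(8)]

scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
accum_steps = 4  # effective batch = loader batch size * accum_steps

optimizer.zero_grad(set_to_none=True)
for step, (noisy, clean) in enumerate(loader):
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):  # FP16 forward
        loss = criterion(model(noisy.to(device)), clean.to(device)) / accum_steps
    scaler.scale(loss).backward()            # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        gc.collect()                         # reclaim Python-side garbage
        if device == "cuda":
            torch.cuda.empty_cache()         # release cached GPU blocks
```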
Among classical methods:
- Spectral Subtraction (SS) achieved strong numerical performance (e.g. SNR) but performed poorly on perceptual metrics such as PESQ and STOI (a minimal sketch follows this list).
- Wiener Filtering (WF) and MMSE-LSA achieved better perceptual quality but failed to match the ML models in numerical fidelity.
- Overall, no classical approach delivered a comprehensive improvement across all evaluation dimensions.
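A minimal spectral-subtraction sketch, assuming the first few STFT frames are noise-only; the project's SS baseline may estimate noise differently.

```python
# Minimal spectral subtraction: estimate the noise magnitude from the first
# frames, subtract it, floor the result, and resynthesise with the noisy
# phase. A simplification, not the project's exact SS baseline.
import numpy as np
import librosa

def spectral_subtraction(noisy, n_fft=512, hop=128, noise_frames=10, floor=0.02):
    stft = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(stft), np.angle(stft)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # noise estimate
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # floor avoids negative magnitudes
    return librosa.istft(clean_mag * np.exp(1j * phase), hop_length=hop)

noisy, sr = librosa.load("noisy.wav", sr=16000)  # placeholder path
denoised = spectral_subtraction(noisy)
```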
Conv-TasNet emerged as the top-performing ML model, achieving the best SNR, PESQ, and STOI. Originally developed for speech separation, its temporal masking architecture and learned bottlenecks translated effectively to the spectrogram-based denoising task used in this pipeline.
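A toy sketch of the mask-and-reconstruct idea behind Conv-TasNet (learned encoder, mask estimator, decoder); the dimensions and single-layer mask network are illustrative simplifications, not the actual architecture.

```python
# Toy encoder -> mask -> decoder network illustrating Conv-TasNet's masking
# paradigm; a drastic simplification of the real model.
import torch
import torch.nn as nn

class TinyMaskNet(nn.Module):
    def __init__(self, channels=64, kernel=16, stride=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride)
        self.mask = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)

    def forward(self, x):                    # x: (batch, 1, samples)
        feats = torch.relu(self.encoder(x))  # learned analysis transform
        return self.decoder(feats * self.mask(feats))  # masked resynthesis

out = TinyMaskNet()(torch.randn(2, 1, 16000))
print(out.shape)  # (2, 1, 16000)
```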
The pipeline is built for future extensibility and supports:
- Integration of transformer and diffusion-based models
- Real-time inference with beamforming support
- Exploration of unseen noise generalisation and more diverse datasets
Despite achieving strong numerical and perceptual performance, this project leaves several avenues for improvement:
- Perceptual Ceiling: While Conv-TasNet outperformed classical models, its PESQ score (2.43) remains well below the perceptual upper bound of 4.5.
- Generalisation to Unseen Noise: The models struggled with noise types not present during training. More diverse datasets are needed to improve real-world robustness.
- Resource Constraints: Due to limited GPU memory, batch sizes and model complexity were capped. Techniques like gradient accumulation and FP16 were essential but not ideal.
- Model Expansion: Incorporating transformer-based models (e.g. ScaleFormer) or diffusion-based architectures could unlock further gains in intelligibility and naturalness.
- Multi-Channel Input: Extend the pipeline to support beamforming and microphone array input for spatial filtering in real-world deployments.
- Self-Supervised Pretraining: Introduce SSL or reinforcement learning strategies to improve generalisation with limited labelled data.
- Real-Time Integration: Adapt the inference system for on-device deployment in edge hardware like headphones or smartphones with Active Noise Cancellation (ANC) support.
Project/
├── main.py # Entry point for training/evaluation
├── config.py # Central config for datasets, models, training params
├── Utils/
│ ├── dataset.py # Spectrogram conversion + augmentation
│ ├── denoise.py # Inference utilities
│ ├── train.py # Model training/validation logic
│ └── models.py # Model architectures (CNN, CED, R-CED, UNet, Conv-TasNet)
├── Models/ # Saved model weights by experiment
│ ├── 25/
│ ├── dataset/
│ └── oom/
├── Output/ # Denoising outputs (wav, txt, png)
│ ├── 25/
│ ├── dataset/
│ ├── oom/
│ ├── png/
│ ├── txt/
│ └── wav/
├── Cache/ # Cached spectrograms and length logs
│ ├── dynamic/
│ ├── static/
│ └── pto/
├── ssh/ # SLURM-compatible job scripts
│ ├── main.sh
│ ├── latex.sh
│ └── notebook.sh
├── Template/ # Report LaTeX source
│ ├── main.pdf # Final Year Project Report
│ ├── main.tex
│ ├── build/
│ ├── content/
│ └── references.bib
└── .gitignore

The system uses the Noisy Speech Database from the University of Edinburgh:
- https://datashare.ed.ac.uk/handle/10283/2791
- License: Creative Commons Attribution 4.0 International
Audio is converted to magnitude spectrograms for all training, validation, and inference steps.
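An illustrative conversion, assuming 16 kHz audio and placeholder STFT parameters; Utils/dataset.py may use different settings.

```python
# Waveform -> magnitude spectrogram, keeping the noisy phase so a denoised
# magnitude can be inverted back to audio. Parameters are illustrative.
import numpy as np
import librosa

wav, sr = librosa.load("noisy.wav", sr=16000, mono=True)  # placeholder path
stft = librosa.stft(wav, n_fft=512, hop_length=128)
mag, phase = np.abs(stft), np.angle(stft)   # models operate on `mag` only

# After denoising `mag`, reconstruct using the stored noisy phase:
restored = librosa.istft(mag * np.exp(1j * phase), hop_length=128)
```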
All functionality is controlled via main.py and config.py:
python main.py

To modify:
- Dataset or output locations
- Model selection or loss functions
- Batch sizes, precision, or training strategy
Edit config.py accordingly.
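For orientation, a hypothetical shape of config.py; the actual field names and values in the repository will differ.

```python
# Hypothetical config.py layout -- field names are assumptions, not the
# repository's actual settings.
DATASET_DIR = "data/noisy_speech"   # dataset location
OUTPUT_DIR = "Output"               # where wav/txt/png results land
MODEL = "unet"                      # cnn | ced | rced | unet | convtasnet
LOSS = "mse"                        # training loss
BATCH_SIZE = 16
PRECISION = "fp16"                  # fp16 | fp32
BATCHING = "dynamic"                # static | dynamic | pto
```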
Inference can denoise a single .wav file, or run full-batch denoising with metric evaluation saved as a .txt report.
Read the full dissertation here: main.pdf
Includes methodology, system design, ablation studies, and model evaluation.
This project received an A Grade in the B.Sc. (Hons.) Computer Engineering programme.
Graham Pellegrini
B.Sc. (Hons.) Computer Engineering
University of Malta
GitHub: @GrahamPellegrini