Machine Learning Noise Cancellation System

ICT3908 · PyTorch · Noisy Speech DB

Final Year Project submitted for the degree of B.Sc. (Hons.) in Computer Engineering at the University of Malta, under the supervision of Dr. Trevor Spiteri.


Project Overview

This project explores the application of machine learning (ML) techniques to speech enhancement, aiming to demonstrate their advantages over classical methods in real-time denoising scenarios.

A modular and extensible Python pipeline was built to support:

  • Variable-length waveform inputs using dynamic bucketing
  • Real-time inference via spectrogram-based denoising
  • Consistent metric-based evaluation for batch and single-file use
  • Reproducible experiments via centralised configuration and logging

Three classical denoising methods — Spectral Subtraction (SS), Wiener Filtering (WF), and MMSE-LSA — were implemented to provide baselines.
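
As an illustration of the simplest of these baselines, here is a minimal spectral subtraction sketch using SciPy; the frame size, noise-frame count, and spectral floor are illustrative choices, not the repository's implementation:

import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, floor=0.02):
    """Basic magnitude spectral subtraction with a spectral floor."""
    f, t, Z = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)
    # Estimate the noise spectrum from the first few (assumed noise-only) frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract and clamp to a floor to limit musical-noise artefacts.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced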

These baselines were compared against five ML models trained from scratch:

  • CNN: A shallow baseline to verify the pipeline
  • CED / R-CED: Encoder-decoder variants with and without residuals
  • UNet: A deeper skip-connected model for better feature preservation
  • Conv-TasNet: A time-domain network yielding top performance

All models were trained on magnitude spectrograms, with batch handling strategies (Static, Dynamic Bucketing, PTO) evaluated separately.

The pipeline supports training from scratch, memory-efficient inference, and consistent evaluation, laying the foundation for future experimentation with advanced architectures.


Evaluation Metrics

The evaluation metrics (SNR, MSE, PESQ, STOI, and LSD, alongside processing time) were selected for their relevance in the related literature and together provide a balanced view of both numerical accuracy and perceptual quality in denoising performance.
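
As a concrete example, the two purely numerical metrics can be computed directly from the clean and denoised waveforms. A minimal NumPy sketch follows (PESQ, STOI, and LSD need dedicated implementations, e.g. the pesq and pystoi packages):

import numpy as np

def snr_db(clean, denoised):
    """Signal-to-noise ratio of the residual error, in dB (higher is better)."""
    noise = clean - denoised
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def mse(clean, denoised):
    """Mean squared error between clean and denoised signals (lower is better)."""
    return np.mean((clean - denoised) ** 2)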


Model & Method Performance

Method        ↑SNR (dB)   ↓MSE       ↑PESQ    ↑STOI    ↓LSD (dB)   Time (s)
Baseline       -2.28      0.005152   1.8451   0.8928   0.9042        56
SS              3.09      0.001525   1.4535   0.8457   0.7671        61
WF              0.46      0.002875   2.0639   0.8889   0.7535        73
MMSE-LSA       -0.86      0.003726   2.0238   0.8943   0.7971        83
CNN             4.64      0.001344   1.7410   0.8073   0.7956        71
CED            13.19      0.000161   1.6780   0.8386   0.7655        65
R-CED          14.53      0.000117   2.0542   0.8677   0.6480        74
UNet           16.99      0.000069   2.1384   0.8940   0.7076        87
Conv-TasNet    18.06      0.000063   2.4329   0.9112   0.6741       139

Key Findings

  • Dynamic Bucketing was selected as the preferred dataset handling method. It used K-Means clustering to assign samples to optimally sized buckets, improving training efficiency over Static Bucketing while avoiding the runtime penalties of Padding-Truncation Output-Truncation (PTO) during inference (see the bucketing sketch after this list).

  • OOM mitigation techniques — including mixed-precision (FP16), garbage collection, and gradient accumulation — enabled training of deeper models on the university GPU cluster. Evaluation showed no degradation in model quality, and in some cases, slight improvements due to accumulated gradient stability.

  • Among classical methods:

    • Spectral Subtraction (SS) achieved strong numerical performance (e.g., SNR), but performed poorly on perceptual metrics like PESQ and STOI.
    • Wiener Filtering (WF) and MMSE-LSA achieved better perceptual quality, but failed to match ML models in numerical fidelity.
    • Overall, no classical approach provided a comprehensive improvement across all evaluation dimensions.
  • Conv-TasNet emerged as the top-performing ML model, achieving the best SNR, PESQ, and STOI. Originally developed for speech separation, its temporal masking architecture and learned bottlenecks translated effectively to the spectrogram-based denoising task used in this pipeline.

  • The pipeline is built for future extensibility and supports:

    • Integration of transformer and diffusion-based models
    • Real-time inference with beamforming support
    • Exploration of unseen noise generalisation and more diverse datasets
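
A minimal sketch of the Dynamic Bucketing idea referenced above, using 1-D K-Means over clip lengths; the function and parameter names are illustrative, not the repository's actual API:

import numpy as np
from sklearn.cluster import KMeans

def make_buckets(lengths, n_buckets=8):
    """Group sample indices into buckets of similar length via 1-D K-Means."""
    X = np.asarray(lengths, dtype=np.float64).reshape(-1, 1)
    labels = KMeans(n_clusters=n_buckets, n_init=10, random_state=0).fit_predict(X)
    # Each bucket is then padded only to its own maximum length, not the global one.
    return {b: np.flatnonzero(labels == b).tolist() for b in range(n_buckets)}

# Toy example: clip lengths in samples (1 to 10 seconds at 16 kHz).
lengths = np.random.randint(16_000, 160_000, size=1_000)
for b, idx in make_buckets(lengths).items():
    print(f"bucket {b}: {len(idx)} clips, pad to {max(lengths[i] for i in idx)} samples")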

Limitations & Future Work

Despite achieving strong numerical and perceptual performance, this project leaves several avenues for improvement:

Limitations

  • Perceptual Ceiling: While Conv-TasNet outperformed classical models, its PESQ score (2.43) remains well below the perceptual upper bound of 4.5.
  • Generalisation to Unseen Noise: The models struggled with noise types not present during training. More diverse datasets are needed to improve real-world robustness.
  • Resource Constraints: Due to limited GPU memory, batch sizes and model complexity were capped. Techniques like gradient accumulation and FP16 were essential but not ideal.
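
For reference, the FP16 and gradient-accumulation pattern mentioned above typically looks like the following in PyTorch; the model, loader, and accumulation step count are placeholders rather than the project's actual training loop:

import torch

def train_epoch(model, loader, optimizer, criterion, accum_steps=4, device="cuda"):
    """One epoch with mixed precision (FP16) and gradient accumulation."""
    scaler = torch.cuda.amp.GradScaler()
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (noisy, clean) in enumerate(loader):
        noisy, clean = noisy.to(device), clean.to(device)
        with torch.cuda.amp.autocast():              # FP16 forward pass
            loss = criterion(model(noisy), clean) / accum_steps
        scaler.scale(loss).backward()                # scaled to avoid FP16 underflow
        if (step + 1) % accum_steps == 0:            # emulate a larger batch size
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)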

Future Directions

  • Model Expansion: Incorporating transformer-based models (e.g. ScaleFormer) or diffusion-based architectures could unlock further gains in intelligibility and naturalness.
  • Multi-Channel Input: Extend the pipeline to support beamforming and microphone array input for spatial filtering in real-world deployments.
  • Self-Supervised Pretraining: Introduce SSL or reinforcement learning strategies to improve generalisation with limited labelled data.
  • Real-Time Integration: Adapt the inference system for on-device deployment in edge hardware like headphones or smartphones with Active Noise Cancellation (ANC) support.

Repository Structure

Project/
├── main.py              # Entry point for training/evaluation
├── config.py            # Central config for datasets, models, training params
├── Utils/
│   ├── dataset.py       # Spectrogram conversion + augmentation
│   ├── denoise.py       # Inference utilities
│   ├── train.py         # Model training/validation logic
│   └── models.py        # Model architectures (CNN, CED, R-CED, UNet, Conv-TasNet)
├── Models/              # Saved model weights by experiment
│   ├── 25/
│   ├── dataset/
│   └── oom/
├── Output/              # Denoising outputs (wav, txt, png)
│   ├── 25/
│   ├── dataset/
│   ├── oom/
│   ├── png/
│   ├── txt/
│   └── wav/
├── Cache/               # Cached spectrograms and length logs
│   ├── dynamic/
│   ├── static/
│   └── pto/
├── ssh/                 # SLURM-compatible job scripts
│   ├── main.sh
│   ├── latex.sh
│   └── notebook.sh
├── Template/            # Report LaTeX source
│   ├── main.pdf         # Final Year Project Report
│   ├── main.tex
│   ├── build/
│   ├── content/
│   └── references.bib
└── .gitignore

Dataset

The system uses the Noisy Speech Database from the University of Edinburgh.

Audio is converted to magnitude spectrograms for all training, validation, and inference steps.
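
A minimal sketch of that waveform-to-magnitude-spectrogram step, assuming torchaudio; the STFT parameters here are illustrative (the project's actual settings live in config.py):

import torchaudio

def to_magnitude_spectrogram(wav_path, n_fft=512, hop_length=128):
    """Load a wav file and return its magnitude spectrogram."""
    waveform, sample_rate = torchaudio.load(wav_path)
    transform = torchaudio.transforms.Spectrogram(
        n_fft=n_fft, hop_length=hop_length, power=1.0  # power=1.0 -> magnitude
    )
    return transform(waveform), sample_rate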


Usage

All functionality is controlled via main.py and config.py:

python main.py

To modify:

  • Dataset or output locations
  • Model selection or loss functions
  • Batch sizes, precision, or training strategy

Edit config.py accordingly.
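
As an illustration, such a configuration might look like the following; all field names here are hypothetical, not the repository's actual config.py:

# Illustrative config.py sketch; the real file may use different names.
DATASET_DIR = "data/noisy_speech"   # dataset location
OUTPUT_DIR  = "Output/"             # wav/txt/png results
MODEL       = "conv_tasnet"         # cnn | ced | rced | unet | conv_tasnet
LOSS        = "mse"                 # loss function
BATCHING    = "dynamic"             # static | dynamic | pto
BATCH_SIZE  = 8
FP16        = True                  # mixed-precision training
ACCUM_STEPS = 4                     # gradient accumulation steps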

Inference can denoise a single .wav file or run a full batch denoising pass, writing metric evaluations to a .txt file.


Final Report

Read the full dissertation here: main.pdf

Includes methodology, system design, ablation studies, and model evaluation.

This project received an A Grade in the B.Sc. (Hons.) Computer Engineering programme.


Author

Graham Pellegrini
B.Sc. (Hons.) Computer Engineering
University of Malta
GitHub: @GrahamPellegrini
