Final Year Project submitted for the degree of B.Sc. (Hons.) in Computer Engineering at the University of Malta, under the supervision of Dr. Trevor Spiteri.
This project explores the application of machine learning (ML) techniques for speech enhancement, aiming to demonstrate their advantages over classical methods in real-time denoising scenarios.
A modular and extensible Python pipeline was built to support:
- Variable-length waveform inputs using dynamic bucketing
- Real-time inference via spectrogram-based denoising
- Consistent metric-based evaluation for batch and single-file use
- Reproducible experiments via centralised configuration and logging
Three classical denoising methods — Spectral Subtraction (SS), Wiener Filtering (WF), and MMSE-LSA — were implemented to provide baselines.
These were compared against five ML models trained from scratch:
- CNN: A shallow baseline to verify the pipeline
- CED / R-CED: Encoder-decoder variants with and without residuals
- UNet: A deeper skip-connected model for better feature preservation
- Conv-TasNet: A time-domain network yielding top performance
All models were trained on magnitude spectrograms, with batch handling strategies (Static, Dynamic Bucketing, PTO) evaluated separately.
The pipeline supports training from scratch, memory-efficient inference, and consistent evaluation, laying the foundation for future experimentation with advanced architectures.
All methods were evaluated with five metrics:
- SNR – Signal-to-noise ratio
- MSE – Mean squared error
- PESQ – Perceptual evaluation of speech quality
- STOI – Short-time objective intelligibility
- LSD – Log-spectral distance
The metrics were selected based on relevance in related literature and provide a balanced view of both numerical accuracy and perceptual quality in denoising performance.
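A minimal sketch of how these metrics can be computed, assuming equal-length 16 kHz mono NumPy arrays and the third-party `pesq` and `pystoi` packages; the function names and file paths are illustrative, not the pipeline's actual code.

```python
# Illustrative metric computations, assuming `clean` and `denoised` are
# equal-length 16 kHz mono float arrays of real speech (PESQ rejects
# non-speech input). Not the pipeline's actual evaluation code.
import numpy as np
import librosa
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

def snr_db(clean, denoised):
    noise = clean - denoised
    return 10 * np.log10(np.sum(clean**2) / np.sum(noise**2))

def lsd_db(clean, denoised, n_fft=512, hop=128, eps=1e-10):
    # Log-spectral distance: RMS of the per-frame log power-spectrum gap.
    s1 = np.abs(librosa.stft(clean, n_fft=n_fft, hop_length=hop)) ** 2
    s2 = np.abs(librosa.stft(denoised, n_fft=n_fft, hop_length=hop)) ** 2
    diff = 10 * np.log10((s1 + eps) / (s2 + eps))
    return float(np.mean(np.sqrt(np.mean(diff**2, axis=0))))

fs = 16000
clean, _ = librosa.load("clean.wav", sr=fs)        # placeholder paths
denoised, _ = librosa.load("denoised.wav", sr=fs)
print("SNR :", snr_db(clean, denoised))
print("MSE :", float(np.mean((clean - denoised) ** 2)))
print("PESQ:", pesq(fs, clean, denoised, "wb"))    # wideband mode at 16 kHz
print("STOI:", stoi(clean, denoised, fs, extended=False))
print("LSD :", lsd_db(clean, denoised))
```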
| Method | ↑SNR (dB) | ↓MSE | ↑PESQ | ↑STOI | ↓LSD (dB) | Time (s) |
|---|---|---|---|---|---|---|
| Baseline | -2.28 | 0.005152 | 1.8451 | 0.8928 | 0.9042 | 56 |
| SS | 3.09 | 0.001525 | 1.4535 | 0.8457 | 0.7671 | 61 |
| WF | 0.46 | 0.002875 | 2.0639 | 0.8889 | 0.7535 | 73 |
| MMSE-LSA | -0.86 | 0.003726 | 2.0238 | 0.8943 | 0.7971 | 83 |
| CNN | 4.64 | 0.001344 | 1.7410 | 0.8073 | 0.7956 | 71 |
| CED | 13.19 | 0.000161 | 1.6780 | 0.8386 | 0.7655 | 65 |
| R-CED | 14.53 | 0.000117 | 2.0542 | 0.8677 | 0.6480 | 74 |
| UNet | 16.99 | 0.000069 | 2.1384 | 0.8940 | 0.7076 | 87 |
| Conv-TasNet | 18.06 | 0.000063 | 2.4329 | 0.9112 | 0.6741 | 139 |
Dynamic Bucketing was selected as the preferred dataset handling method. It used K-Means clustering to assign samples into optimally sized buckets, improving training efficiency over Static Bucketing while avoiding the runtime penalties of Padding-Truncation Output-Truncation (PTO) during inference.
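A toy sketch of the K-Means bucketing idea, assuming a list of clip lengths; this helper is illustrative, not the implementation in Utils/dataset.py.

```python
# Toy K-Means length bucketing: cluster clip lengths, then pad each clip
# only to the longest clip in its own bucket, instead of the global maximum.
import numpy as np
from sklearn.cluster import KMeans

def assign_buckets(lengths, n_buckets=3, seed=0):
    x = np.asarray(lengths, dtype=float).reshape(-1, 1)
    labels = KMeans(n_clusters=n_buckets, n_init=10, random_state=seed).fit_predict(x)
    # Pad target per bucket: the longest clip assigned to that bucket.
    pad_to = {int(b): int(x[labels == b].max()) for b in np.unique(labels)}
    return labels, pad_to

lengths = [120, 130, 150, 410, 430, 875, 880, 900]  # frames per clip
labels, pad_to = assign_buckets(lengths)
print(labels, pad_to)  # short clips avoid being padded to the global max (900)
```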
OOM mitigation techniques, including mixed-precision (FP16) training, garbage collection, and gradient accumulation, enabled training of deeper models on the university GPU cluster. Evaluation showed no degradation in model quality and, in some cases, slight improvements attributable to the stabilising effect of gradient accumulation.
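A hedged PyTorch sketch of these mitigations (FP16 autocast, gradient accumulation, periodic cache clearing); the model, loss, and data below are placeholders, not the project's training loop.

```python
# Placeholder training loop showing FP16 autocast, gradient accumulation,
# and periodic memory clean-up; all names here are illustrative.
import gc
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(1, 1, 3, padding=1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()
loader = [(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)) for _ in range(8)]

scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
accum_steps = 4  # effective batch = loader batch size * accum_steps

optimizer.zero_grad(set_to_none=True)
for step, (noisy, clean) in enumerate(loader):
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):  # FP16 forward
        loss = criterion(model(noisy.to(device)), clean.to(device)) / accum_steps
    scaler.scale(loss).backward()            # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        gc.collect()                         # reclaim Python-side garbage
        if device == "cuda":
            torch.cuda.empty_cache()         # release cached GPU blocks
```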
Among classical methods:
- Spectral Subtraction (SS) achieved strong numerical performance (e.g. SNR) but performed poorly on perceptual metrics such as PESQ and STOI (a minimal sketch follows this list).
- Wiener Filtering (WF) and MMSE-LSA achieved better perceptual quality but failed to match the ML models in numerical fidelity.
- Overall, no classical approach delivered a comprehensive improvement across all evaluation dimensions.
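A minimal spectral-subtraction sketch, assuming the first few STFT frames are noise-only; the project's SS baseline may estimate noise differently.

```python
# Minimal spectral subtraction: estimate the noise magnitude from the first
# frames, subtract it, floor the result, and resynthesise with the noisy
# phase. A simplification, not the project's exact SS baseline.
import numpy as np
import librosa

def spectral_subtraction(noisy, n_fft=512, hop=128, noise_frames=10, floor=0.02):
    stft = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(stft), np.angle(stft)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # noise estimate
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # floor avoids negative magnitudes
    return librosa.istft(clean_mag * np.exp(1j * phase), hop_length=hop)

noisy, sr = librosa.load("noisy.wav", sr=16000)  # placeholder path
denoised = spectral_subtraction(noisy)
```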
Conv-TasNet emerged as the top-performing ML model, achieving the best SNR, PESQ, and STOI. Originally developed for speech separation, its temporal masking architecture and learned bottlenecks translated effectively to the spectrogram-based denoising task used in this pipeline.
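A toy sketch of the mask-and-reconstruct idea behind Conv-TasNet (learned encoder, mask estimator, decoder); the dimensions and single-layer mask network are illustrative simplifications, not the actual architecture.

```python
# Toy encoder -> mask -> decoder network illustrating Conv-TasNet's masking
# paradigm; a drastic simplification of the real model.
import torch
import torch.nn as nn

class TinyMaskNet(nn.Module):
    def __init__(self, channels=64, kernel=16, stride=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride)
        self.mask = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)

    def forward(self, x):                    # x: (batch, 1, samples)
        feats = torch.relu(self.encoder(x))  # learned analysis transform
        return self.decoder(feats * self.mask(feats))  # masked resynthesis

out = TinyMaskNet()(torch.randn(2, 1, 16000))
print(out.shape)  # (2, 1, 16000)
```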
The pipeline is built for future extensibility and supports:
- Integration of transformer and diffusion-based models
- Real-time inference with beamforming support
- Exploration of unseen noise generalisation and more diverse datasets
Despite achieving strong numerical and perceptual performance, this project leaves several avenues for improvement:
- Perceptual Ceiling: While Conv-TasNet outperformed classical models, its PESQ score (2.43) remains well below the perceptual upper bound of 4.5.
- Generalisation to Unseen Noise: The models struggled with noise types not present during training. More diverse datasets are needed to improve real-world robustness.
- Resource Constraints: Due to limited GPU memory, batch sizes and model complexity were capped. Techniques like gradient accumulation and FP16 were essential but not ideal.
- Model Expansion: Incorporating transformer-based models (e.g. ScaleFormer) or diffusion-based architectures could unlock further gains in intelligibility and naturalness.
- Multi-Channel Input: Extend the pipeline to support beamforming and microphone array input for spatial filtering in real-world deployments.
- Self-Supervised Pretraining: Introduce SSL or reinforcement learning strategies to improve generalisation with limited labelled data.
- Real-Time Integration: Adapt the inference system for on-device deployment in edge hardware like headphones or smartphones with Active Noise Cancellation (ANC) support.
Project/
├── main.py # Entry point for training/evaluation
├── config.py # Central config for datasets, models, training params
├── Utils/
│ ├── dataset.py # Spectrogram conversion + augmentation
│ ├── denoise.py # Inference utilities
│ ├── train.py # Model training/validation logic
│ └── models.py # Model architectures (CNN, CED, R-CED, UNet, Conv-TasNet)
├── Models/ # Saved model weights by experiment
│ ├── 25/
│ ├── dataset/
│ └── oom/
├── Output/ # Denoising outputs (wav, txt, png)
│ ├── 25/
│ ├── dataset/
│ ├── oom/
│ ├── png/
│ ├── txt/
│ └── wav/
├── Cache/ # Cached spectrograms and length logs
│ ├── dynamic/
│ ├── static/
│ └── pto/
├── ssh/ # SLURM-compatible job scripts
│ ├── main.sh
│ ├── latex.sh
│ └── notebook.sh
├── Template/ # Report LaTeX source
│ ├── main.pdf # Final Year Project Report
│ ├── main.tex
│ ├── build/
│ ├── content/
│ └── references.bib
└── .gitignore

The system uses the Noisy Speech Database from the University of Edinburgh:
- https://datashare.ed.ac.uk/handle/10283/2791
- License: Creative Commons Attribution 4.0 International
Audio is converted to magnitude spectrograms for all training, validation, and inference steps.
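An illustrative conversion, assuming 16 kHz audio and placeholder STFT parameters; Utils/dataset.py may use different settings.

```python
# Waveform -> magnitude spectrogram, keeping the noisy phase so a denoised
# magnitude can be inverted back to audio. Parameters are illustrative.
import numpy as np
import librosa

wav, sr = librosa.load("noisy.wav", sr=16000, mono=True)  # placeholder path
stft = librosa.stft(wav, n_fft=512, hop_length=128)
mag, phase = np.abs(stft), np.angle(stft)   # models operate on `mag` only

# After denoising `mag`, reconstruct using the stored noisy phase:
restored = librosa.istft(mag * np.exp(1j * phase), hop_length=128)
```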
All functionality is controlled via main.py and config.py:
python main.py

To modify:
- Dataset or output locations
- Model selection or loss functions
- Batch sizes, precision, or training strategy
Edit config.py accordingly.
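For orientation, a hypothetical shape of config.py; the actual field names and values in the repository will differ.

```python
# Hypothetical config.py layout -- field names are assumptions, not the
# repository's actual settings.
DATASET_DIR = "data/noisy_speech"   # dataset location
OUTPUT_DIR = "Output"               # where wav/txt/png results land
MODEL = "unet"                      # cnn | ced | rced | unet | convtasnet
LOSS = "mse"                        # training loss
BATCH_SIZE = 16
PRECISION = "fp16"                  # fp16 | fp32
BATCHING = "dynamic"                # static | dynamic | pto
```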
Inference can denoise a single .wav file, or run full-batch denoising with metric evaluation saved as a .txt report.
Read the full dissertation here: main.pdf
Includes methodology, system design, ablation studies, and model evaluation.
This project received an A Grade in the B.Sc. (Hons.) Computer Engineering programme.
Graham Pellegrini
B.Sc. (Hons.) Computer Engineering
University of Malta
GitHub: @GrahamPellegrini