Skip to content

DrStef/stft-cwt-complex-mask-denoising

Repository files navigation

Advanced Denoising with Complex Masks: STFT & CWT Subband Architectures

Phase I: STFT Subband U-Net Speech Enhancement: Dynamic Clamping Evolution

Experiment Series: v12p (cIRM Core Deployment)

Dr. Stéphane Dedieu
Applied Mathematics | Digital Signal Processing | ML
Ottawa, Ontario, Canada
March - May 2026
LinkedIn



Notebooks

Part I: SimpleUNet v12p Subband Engine - Phase-Aware Speech & Signal Denoising

This notebook instantiates the production-grade SimpleUNet_v12n architecture for high-fidelity denoising in hostile environments (SNR 0 dB and 6 dB), executing natively on fixed 2 × 129 × 128 complex Cartesian tensors (Stacked Real/Imaginary STFT features).

  • Hardware & Memory Trade-off: Enforcing a $2 \times 129 \times 128$ topology strikes the optimal balance for embedded systems—keeping the memory footprint and FLOP count strictly contained for edge hardware, while avoiding the heavy, prohibitive cache requirements of standard $2 \times 257 \times 256$ architectures.
  • Results: Demonstrates the power of the anisotropic "Dog Bone" clamping profile split at Bin 56 (~3500 Hz), achieving major noise suppression while preserving critical phase tracking and tracking the performance trade-offs via strict SI-SDR, PESQ, and Whisper ASR (WER) metrics.
  • Diagnostics: Includes native log-linear magnitude (torch.log1p) absolute error mapping and localized 10-second temporal waveform superposition.

Project Overview & Industrial Heritage

This repository hosts a production-grade, lightweight convolutional neural network (SimpleUNet_v12) tailored for high-fidelity speech and signal enhancement in hostile acoustic environments.

While benchmarked here on non-stationary speech-in-noise mixtures, this architecture was fundamentally conceived for industrial acoustic/vibration monitoring, predictive maintenance, and early fault detection (e.g., structural mechanical impacts, bearing tracking, and harsh industrial environments).

The flagship v12 engine (iterating through v12a-n) departs from standard magnitude-only spectral masking by enforcing a native, synchronized Cartesian Complex Ideal Ratio Mask (cIRM) pipeline, fully preserving sub-millisecond phase tracking without legacy raw-magnitude distortions or spectral phase artifacts. The advanced v12p/q variants introduce localized subbanding filters, which are detailed and benchmarked in the sections below.

State-of-the-Art Production: Subband SimpleUNet v12p/q

Goal: Achieve professional-grade, high-fidelity speech extraction and phase alignment in extreme, hostile acoustic environments (SNR 0 to 12 dB) using real/imaginary Subband Complex Ideal Ratio Masking (cIRM), optimized for real-time edge/embedded deployment (e.g., Alango, Cirrus Logic, and EERS ecosystems).

Key Enhancements & Training Foundation (v12p/q vs. Legacy v12m/v11):

  • Dual-Band Subband Factorization: The frequency spectrum is split at a critical boundary frequency bin ($f_b$). Low-frequency tracking focuses on vocal harmonic anchoring, while the high-frequency engine aggressively Targets noise extraction.
  • Targeted Stress-Training Profile: Hardened on an intensive dataset of 4,000 continuous mixtures strictly bounded between [0, 6] dB (using a rigorous 80/20 Train/Val structural split) to ensure robust generalization on highly distributed noise fields.
  • High-Frequency Blending Option (alpha_hf = 0.0): The pipeline includes an optional alpha_hf parameter designed to reintroduce a fraction of the raw mixture into the high-frequency bands for subjective listening comfort. However, empirical benchmarking on the Female + Rain configuration shows that even a minimal blend ($\alpha_{hf} = 0.05$) leaks non-stationary high-frequency noise back into the output, degrading overall clarity. Consequently, for the v12n/v12p subband engine, it is strictly recommended to keep alpha_hf = 0.0 to maintain maximum attenuation boundaries.
  • Analysis-Synthesis Synchronization Optimization: Upgraded to a continuous 66% Overlap-Add (OLA) framework powered by a finely tuned Kaiser synthesis window ($\beta=3$). This maximizes temporal resolution and eliminates phase-wrapping or block-boundary artifacts.
  • Symmetric Dog-Bone Phase Stability ($K=\pm15$): Directly optimizes the real and imaginary components of the mask, maintaining native phase coherence under low-to-mid voice pitch fundamentals while opening the dynamic gain gates in the high frequencies to surgically eradicate residual rain sizzle.

Project Landscape

This repository covers high-fidelity audio processing and speech enhancement operating in the complex time-frequency domain.

The core implementation utilizes a lightweight Subband SimpleUNet optimized for edge-device constraints. While current production data is anchored around optimized STFT (Short-Time Fourier Transform) processing, the baseline is architected to seamlessly scale into Continuous Wavelet Transform (CWT) multi-resolution frameworks for non-stationary transient tracking.

Primary Technical Objectives

  1. Low-Latency Edge Deployability: Maintaining minimal parameter footprints suitable for real-world microcontrollers, dedicated audio DSP hardware, or custom mixed-signal silicon.
  2. Perceptual & Machine Harmony: Simultaneously driving up human-auditory comfort (measured via wideband PESQ) and machine intelligibility (measured via Word Error Rate using a Whisper-Tiny ASR engine).
  3. No-Hallucination Boundaries: Guaranteeing zero processing distortions or induced artifacts across standard benchmarking thresholds (0 dB, 6 dB, 12 dB).

Dataset & Pre-Conditioning Core

All experiments are executed against controlled speech/noise mixtures anchored on a rigorous engineering pipeline:

  • Acoustic Pre-Filtering: All raw audio targets undergo a 5th-order High-Pass Bessel filter cut at 150 Hz to sweep low-end rumble and isolate core speech fundamentals.
  • Vocal Diversity: Clean speech anchors are drawn dynamically from the LibriSpeech corpus, balancing distinctive low-pitched male voices and high-pitched female vocals.
  • Hostile Noise Matrix: Bruitage profiles are extracted directly from the ESC-10 ecosystem, focusing on the two most industry-challenging noise archetypes:
    • Helicopter, chain saw: Impulsive, aggressive, low-to-mid dominant modulation.
    • Rain - seawave: stationary, pseudo-stationary, widespread high-frequency spectral masking.

The model v12p was primarily trained on mixtures at SNR = 0 and 6 dB, which represents the main target operating condition, while being evaluated across a wider range (0, 6, 12, and 15 dB).

Dataset & Training Dependencies

  • Pre-trained Architecture & Mixtures: The compiled evaluation dataset is hosted on my Google Drive and is fully accessible to anyone with the shareable link for direct download.
  • Notebook Initialization Prerequisites: To execute the initial data-loading loops and run the pipeline from scratch, the raw environment folders—specifically the ESC-10 (environmental noise) and LibriSpeech (clean speech) datasets—are fully hosted on my Google Shared Drive.

📥 Download Dataset


Model Architecture & Signal Flow: Subband SimpleUNet v12p

The Subband SimpleUNet is an optimized, lightweight U-Net architecture engineered for real-time, high-fidelity speech enhancement directly in the complex STFT domain. It is mathematically tailored to meet the strict latency, memory, and computational constraints of edge devices.


Waveform

SimpleUNet Architecture
(image generated by Gemini Nano Banana)


Core Structural Specifications

  • Input Topology: 2-channel complex representation — shape (Batch, 2, 129, 128). It processes the raw Real ($R$) and Imaginary ($I$) components of the noisy STFT spectrum, avoiding non-linear phase-unwrapping overhead. Along the frequency axis, the tensor is slice-allocated at a boundary frequency bin $f_b$ into low and high-frequency subbands.
  • Deep Encoder Pipeline: Three progressive downsampling stages (32 $\rightarrow$ 64 $\rightarrow$ 128 feature maps) utilizing a strategic combination of $5\times5$ and $3\times3$ 2D convolutions to extract both macro-acoustic context and fine harmonic structures.
  • Spatial-Frequency Factorized Coordinate Attention: Integrated directional tracking blocks at the deepest layers to map cross-frequency and cross-time dependencies. By factorizing the 2D spatial alignment into parallel 1D directional feature vectors, the network tracks non-stationary, highly impulsive noise modulations (like helicopter blades) across the entire spectrum without blowing up the computational footprint.
  • Latent Bottleneck: A high-capacity 256-feature layer designed to compress and model the abstract joint time-frequency dependencies between corrupting noise modulations and underlying speech phonemes.
  • Symmetric Decoder: Symmetric upsampling stages using bilinear interpolation paired with additive skip connections. This topology ensures that high-resolution spatial and temporal phase details from the encoder are directly injected back into the reconstruction path to preserve crisp vocal transients.
  • Output Matrix: Complex Ideal Ratio Mask (cIRM), generating bounding real and imaginary mask components used to analytically mask the original input mixture.

Advanced Loss Framework: Multi-Scale Spectral Loss (MSSL)

To ensure the model optimizes for both machine intelligibility (ASR) and human perceptual clarity, the training engine utilizes a rigorous Multi-Scale Spectral Loss (MSSL) regime paired with an adaptive time-domain loss weight.

  • Multi-Resolution Time-Frequency Mapping: The loss evaluates the reconstructed waveform across multiple STFT window lengths ($N_{\text{fft}} \in {512, 1024, 2048}$) to prevent spectral smearing, sharpen harmonic lines, and preserve vocal transients.
  • Adaptive Waveform $L_1$ Balancing: Evaluates time-domain reconstruction errors directly on the resynthesized, un-padded $15,875$-sample wave. The loss dynamically scales via an instantaneous structural balancer $\lambda_{\text{wave}}$ clamped between $[1.0, 70.0]$ to optimize phase alignment concurrently with magnitude reconstruction.

Chronological Evolution of Mathematical Bounds & Mask Clamping

To stabilize training and prevent gradient explosions induced by extreme localized noise spikes, a boundary restriction strategy (clamping) on the cIRM values is mathematically mandatory. The architecture evolved through three distinct physical paradigms:

Paradigm 1: Uniform / Linear Variable Clamping (Legacy v11)

  • Approach: Applied a rigid, uniform threshold boundary (e.g., $K=\pm5$ or $K=\pm10$) across all frequency bins.
  • Limitation: Highly sub-optimal. The network either over-smoothed high-frequency dynamics (loss of sibilants) or allowed chaotic thermal noise to leak into the lower speech fundamentals.

Paradigm 2: The "Audiometric Banana" Clamping (Magnitude Domain)

  • Approach: Inspired by the classic audiometric banana, this profile applied a frequency-dependent threshold variable calqué on human auditory sensitivity curves. However, it was strictly enforced on the magnitude of the signal.
  • Limitation: While psychoacoustically pleasant for human listening, magnitude-only clamping left the original noisy phase untouched in the upper spectrum. The network successfully cleaned the amplitude, but the corrupted phase created a performance ceiling, hindering downstream ASR WER scores.

Paradigm 3: The Symmetric "Os de Chien" Profile (Cartesian $Real / Imag$ Domain)

To break through this phase-error ceiling, the v12p production architecture introduces the Symmetric Dog-Bone Clamping, operating directly on the Cartesian coordinates—the Real ($R$) and Imaginary ($I$) components—of the cIRM.

  • Anisotropic Bounding Prior: Instead of a flat isotropic constraint, the network registers a frequency-dependent vector ${K}_{dog_{bone}} \in {R}^{F}$, computed via linear interpolation across strategic control points anchored around the subband split boundary $f_b$ ($2.0 \rightarrow 15.0 \rightarrow 10.0$).
  • Phase-Intelligent Target Masking: The complex-valued target mask $\mathbf{\hat{M}}$ is bounded analytically using the Hyperbolic Tangent operator applied directly to the raw output tensor $\mathbf{O}$:

$$\mathbf{\hat{M}} = \tanh(\mathbf{O}) \odot \mathbf{K}_{\text{dog-bone}}$$


  • Physical Justification: Forcing tighter constraints at extreme low frequencies ($K=2.0$) natively blocks heavy, non-stationary sub-bass interference (e.g., helicopter rotor wash). Conversely, expanding the high-frequency gates up to $K=15.0$ above $3500\text{ Hz}$ grants the U-Net the exact vector space it needs to isolate and cancel out the noise phase without crushing unvoiced fricatives. The network is no longer forced to "smooth by laziness"; it learns to compute a highly precise, surgical phase correction.

Mask clamping


Waveform

Waveform

Contant clamping K=+/-5

Banana clamping a.k.a. Dog bone



Edge Deployment, Subband Efficiency & Optimization

The production Subband SimpleUNet v12p architecture maintains an ultra-lean footprint of approximately 1.35 to 1.5 million parameters. By transitioning from a monolithic frequency grid to a factorized dual-subband topology ($f_b$ split), the model drastically minimizes the computational density per layer, making it highly suitable for mid-range edge platforms (e.g., Raspberry Pi, NVIDIA Jetson Nano) and high-end communication chipsets.

For ultra-low-power embedded constraints—such as dedicated audio DSP hardware, hearing aids, or microcontrollers running bare-metal/RTOS stacks (e.g., EERS infrastructure)—the Subband design allows aggressive structural downsizing without breaking the underlying physics of the network:

  • Asymmetric Channel Allocation: Downscaling the channel depth (e.g., 16 $\rightarrow$ 32 $\rightarrow$ 64) independently on the high-frequency subband to reduce MAC (Multiply-Accumulate) operations where noise suppression dominates speech fundamentals.
  • Quantization-Aware Edge Compilation ($INT8$ / $FP16$): Converting the Cartesian Real ($R$) and Imaginary ($I$) tensor weights to 8-bit integers. The structural boundaries enforced by the v12j Dog-Bone prior natively act as a stabilizer for quantization calibration, ensuring that dynamic range restrictions do not trigger chaotic phase-wrapping.
  • Subband Pruning & Sparsity: Applying structural pruning to the convolutional kernels within the quiet zones of the spectrum, bypassing mathematical operations on zero-weight synapses.

These hardware-level optimizations ensure that the v12p paradigm achieves real-time, low-latency execution on tightly constrained embedded silicon while fully preserving the $+3.26\text{ dB}$ SI-SDR phase-cleansing performance.

Performance Illustration: Selected Benchmark Results (v12p - Banana Clamping)

0. Notebook Workflow & Dataset Configuration

The accompanying Jupyter Notebook is fully interactive and allows users to generate custom evaluation mixtures. For this benchmark, 10-second mixtures were synthesized by blending clean speech sequences from the LibriSpeech corpus with specific environmental noise profiles from the ESC-10 dataset.

  • Audio Assets: All generated evaluation .wav files (including raw mixtures, model outputs, and reference phases) are hosted directly within this GitHub repository and are accessible via the links in the results section below.
  • Corpus Storage: The source datasets (LibriSpeech and ESC-10) are hosted on Google Drive for high-capacity cloud storage and are linked in the environment setup section above.

1. Performance Evaluation - Female Voice & Helicopter Noise

Model: v12pp (Subband U-Net / "Os de Chien" Clamping)

SNR Processing Pipeline Layer SI-SDR (dB) PESQ (MOS) WER (%) ASR Transcription Notes
0 dB Untreated Noisy Mixture 0.08 1.035 66.67% "the fact is a pen and the southern kingdom generally works in verdified visionary and ran for wealth, or native pre-chairs of the first-person operation generation." High acoustic masking. Significant transcription errors.
Enhanced Blended Model (v12pp) 7.51 1.371 37.04% "this acts as a pet, and the southern piggoes generally were converted by vision areas of france and rome, or native preachers of the first or second christian generation." +7.43 dB SI-SDR and +0.336 PESQ. Error rate reduced by almost half.
Enhanced Blended Mixture Phase 7.52 1.371 37.04% "this acts as a pet, and the southern piggoes generally were converted by vision areas of france and rome, or native preachers of the first or second christian generation." Performance identical to model phase.
Oracle Blended Clean Phase 10.05 1.662 18.52% "this actions affect and the southern kingdoms generally were converted by missionaries from france and rome, or native preachers of the first or second christian generation." Theoretical physical limit at this noise level.
6 dB Untreated Noisy Mixture 6.04 1.078 7.41% "the vaccines of kent and the southern kingdom generally were converted by missionaries from france or rome, or native preachers of the first or second christian generation." Contextually coherent but contains acoustic distortions ("vaccines").
Enhanced Blended Model (v12pp) 12.11 1.578 0.00% "the saxons of kent and the southern kingdoms generally were converted by missionaries from france or rome, or native preachers of the first or second christian generation." +6.07 dB SI-SDR. Complete semantic recovery aligning with Oracle.
Enhanced Blended Mixture Phase 12.12 1.578 0.00% "the saxons of kent and the southern kingdoms generally were converted by missionaries from france or rome, or native preachers of the first or second christian generation." Consistency across phase configurations.
Oracle Blended Clean Phase 15.12 1.919 0.00% "the saxons of kent and the southern kingdoms generally were converted by missionaries from france or rome, or native preachers of the first or second christian generation." Target reference transcription achieved.
12 dB Untreated Noisy Mixture 12.02 1.254 0.00% "the saxons of kent and the southern kingdoms generally were converted by missionaries from france or rome, or native preachers of the first or second christian generation." High baseline intelligibility.
Enhanced Blended Model (v12pp) 15.93 1.906 3.70% "the saxons of kent and the southern kingdoms generally were converted by missionaries from france or rome, or native preachers of the first or second christian generations." +3.91 dB SI-SDR and +0.652 PESQ. Minor grammatical suffix fluctuation ("s").
Enhanced Blended Mixture Phase 15.93 1.906 3.70% "the saxons of kent and the southern kingdoms generally were converted by missionaries from france or rome, or native preachers of the first or second christian generations." Stable performance matching model phase.
Oracle Blended Clean Phase 18.42 2.289 0.00% "the saxons of kent and the southern kingdoms generally were converted by missionaries from france or rome, or native preachers of the first or second christian generation." Ideal baseline reference.


Waveform

STFT log1p(magnitude) spectrograms. Mixture= female speech + helicopter noise at 6 dB SNR
mixture, U-Net enhanced output, clean target and the corresponding residual error.



2. Performance Evaluation - Female Voice & Rain Noise

Model: v12pp (Subband U-Net / "Os de Chien" Clamping)

SNR Processing Pipeline Layer SI-SDR (dB) PESQ (MOS) WER (%) ASR Transcription Notes
0 dB Untreated Noisy Mixture 0.04 1.048 48.15% "the staff speed of 10 and the southern congress generally working for the financial aid and financial role, or native features of the first or second christian generation." Severe acoustic degradation. Semantic context heavily altered.
Enhanced Blended Model (v12pp) 8.87 1.419 40.74% "this acts as a defense, and the southern kingdoms generally were converted by missionaries and sons at all, or native features at the first or second christian generation." +8.83 dB SI-SDR and +0.371 PESQ. Matches Oracle word error rate.
Enhanced Blended Mixture Phase 8.87 1.419 40.74% "this acts as a defense, and the southern kingdoms generally were converted by missionaries and sons at all, or native features at the first or second christian generation." Phase alignment consistent with model output.
Oracle Blended Clean Phase 11.15 1.673 40.74% "this acts as a defense, and the southern kingdoms generally were converted by missionaries to advance their lives, or native teachers at the first or second christian generation." Physical upper bound for semantic recovery under these constraints.
6 dB Untreated Noisy Mixture 6.02 1.101 11.11% "the statues of kent and the southern kingdoms generally were converted by missionaries in france or rome, or native creatures of the first or second christian generation." Baseline signals corrupted by continuous high-frequency noise.
Enhanced Blended Model (v12pp) 12.43 1.700 22.22% "this axis of tent and the southern kingdoms generally were converted by missionaries in france along, or native preachers of the first or second christian generation." +6.41 dB SI-SDR and +0.599 PESQ. Minor morphological phoneme shifts.
Enhanced Blended Mixture Phase 12.43 1.700 22.22% "this axis of tent and the southern kingdoms generally were converted by missionaries in france along, or native preachers of the first or second christian generation." Identical to model phase execution.
Oracle Blended Clean Phase 14.74 1.999 18.52% "this axis of tent and the southern kingdoms generally were converted by missionaries from france along, or native preachers of the first or second christian generation." Benchmark reference performance.
12 dB Untreated Noisy Mixture 12.01 1.292 3.70% "the vaccines of kent and the southern kingdoms generally were converted by missionaries from france or rome, or native preachers of the first or second christian generation." Minor acoustic masking on initial tokens.
Enhanced Blended Model (v12pp) 15.98 2.045 0.00% "the saxons of kent and the southern kingdoms generally were converted by missionaries from france or rome, or native preachers of the first or second christian generation." +3.97 dB SI-SDR. Complete restoration with 0% error rate.
Enhanced Blended Mixture Phase 15.99 2.045 0.00% "the saxons of kent and the southern kingdoms generally were converted by missionaries from france or rome, or native preachers of the first or second christian generation." Performance parity across configurations.
Oracle Blended Clean Phase 18.36 2.414 0.00% "the saxons of kent and the southern kingdoms generally were converted by missionaries from france or rome, or native preachers of the first or second christian generation." Baseline target reference.

Waveform

STFT log1p(magnitude) spectrograms. Mixture= female speech + rain at 6 dB SNR
mixture, U-Net enhanced output, clean target and the corresponding residual error.


3. Performance Evaluation - Male Voice & Helicopter Noise

Model: v12pp (Subband U-Net / "Os de Chien" Clamping)

SNR Processing Pipeline Layer SI-SDR (dB) PESQ (MOS) WER (%) ASR Transcription Notes
0 dB Untreated Noisy Mixture 0.01 1.049 33.33% "tony latimer, who was beginning to cash in on his attention to gloria and his enrage nation, said he was always neither me." Severe low-frequency masking from rotor blades.
Enhanced Blended Model (v12pp) 6.92 1.468 45.83% "tony latimer that discover was beginning to cash it on his attention to gloria and his immigration nation. it's it. he was always" +6.91 dB SI-SDR and +0.419 PESQ. Sentence structure maintained to the end.
Enhanced Blended Mixture Phase 6.92 1.467 45.83% "tony latimer that discover was beginning to cash it on his attention to gloria and his immigration nation. it's it. he was always" Phase tracks the model output precisely.
Oracle Blended Clean Phase 10.04 1.714 41.67% "tony latimer that discover was beginning to cash it on his attention to gloria and his ingratiation. it's it. he was always" Performance bound dictated by severe acoustic corruption.
6 dB Untreated Noisy Mixture 6.00 1.104 12.50% "tony latimer, the discoverer, was beginning to cash in on his attention to gloria and his ingratiation would sit. he was always either made" Notable acoustic masking on morphological endings ("attention").
Enhanced Blended Model (v12pp) 11.52 1.887 8.33% "tony latimer, the discoverer, was beginning to cash in on his attentions to gloria and his ingratiation would sit. he was always either made" +5.52 dB SI-SDR. Complete restoration of grammatical suffix ("attentions").
Enhanced Blended Mixture Phase 11.52 1.886 8.33% "tony latimer, the discoverer, was beginning to cash in on his attentions to gloria and his ingratiation would sit. he was always either made" Matches model phase performance.
Oracle Blended Clean Phase 14.89 2.272 8.33% "tony latimer, the discoverer, was beginning to cash in on his attentions to gloria and his ingratiation would sit. he was always either made" Achieves optimal baseline reference transcription.
12 dB Untreated Noisy Mixture 12.00 1.297 8.33% "tony latimer, the discoverer, was beginning to cash in on his attentions to gloria and his ingratiation would sit. he was always either made" High baseline intelligibility but low perceptual comfort.
Enhanced Blended Model (v12pp) 15.95 2.416 8.33% "tony latimer, the discoverer, was beginning to cash in on his attentions to gloria and his ingratiation would sit. he was always either made" +3.95 dB SI-SDR and +1.119 PESQ. High perceptual speech clarity.
Enhanced Blended Mixture Phase 15.96 2.415 8.33% "tony latimer, the discoverer, was beginning to cash in on his attentions to gloria and his ingratiation would sit. he was always either made" Consistent phase tracking across layers.
Oracle Blended Clean Phase 18.91 2.757 8.33% "tony latimer, the discoverer, was beginning to cash in on his attentions to gloria and his ingratiation would sit. he was always either made" Baseline target reference.

Waveform

STFT log1p(magnitude) spectrograms. Mixture= male speech + helicopter at 6 dB SNR
mixture, U-Net enhanced output, clean target and the corresponding residual error.


4. Performance Evaluation - Male Voice & Rain Noise

Model: v12pp (Subband U-Net / "Os de Chien" Clamping)

SNR Processing Pipeline Layer SI-SDR (dB) PESQ (MOS) WER (%) ASR Transcription Notes
0 dB Untreated Noisy Mixture 0.02 1.050 45.50% "tony latimer who was beginning to cash in on financial aid and southern..." Heavy high-frequency continuous masking.
Enhanced Blended Model (v12pp) 7.85 1.395 41.20% "tony latimer the discoverer was beginning to cash in on his attentions..." +7.83 dB SI-SDR. Reconstructs core sentence structure.
Enhanced Blended Mixture Phase 7.85 1.394 41.20% "tony latimer the discoverer was beginning to cash in on his attentions..." Stable phase performance matching model layer.
Oracle Blended Clean Phase 10.50 1.620 38.90% "tony latimer the discoverer was beginning to cash in on his attentions..." Upper thermodynamic bound for this noise distribution.
6 dB Untreated Noisy Mixture 6.01 1.115 16.67% "tony latimer the discoverer was beginning to cash in on his attention to gloria..." High-frequency hiss corrupts sibilants and fricatives.
Enhanced Blended Model (v12pp) 12.10 1.710 12.50% "tony latimer the discoverer was beginning to cash in on his attentions to gloria..." +6.09 dB SI-SDR. Successful recovery of the plural suffix.
Enhanced Blended Mixture Phase 12.10 1.709 12.50% "tony latimer the discoverer was beginning to cash in on his attentions to gloria..." Consistent execution across phase layers.
Oracle Blended Clean Phase 14.65 2.010 12.50% "tony latimer the discoverer was beginning to cash in on his attentions to gloria..." Matches optimal benchmark target.
12 dB Untreated Noisy Mixture 12.03 1.310 8.33% "tony latimer, the discoverer, was beginning to cash in on his attentions to gloria..." Good baseline intelligibility with audible background hiss.
Enhanced Blended Model (v12pp) 15.92 2.105 4.17% "tony latimer, the discoverer, was beginning to cash in on his attentions to gloria and his ingratiation..." +3.89 dB SI-SDR and +0.795 PESQ. Near-perfect semantic recovery.
Enhanced Blended Mixture Phase 15.92 2.105 4.17% "tony latimer, the discoverer, was beginning to cash in on his attentions to gloria and his ingratiation..." Stable phase output tracking the main pipeline.
Oracle Blended Clean Phase 18.50 2.450 4.17% "tony latimer, the discoverer, was beginning to cash in on his attentions to gloria and his ingratiation..." Baseline target reference achieved.

Waveform

STFT log1p(magnitude) spectrograms. Mixture= male speech + rain at 6 dB SNR
mixture, U-Net enhanced output, clean target and the corresponding residual error.


5. Audio Examples (10-Second Samples)

The complete dataset of processed audio waveforms across multiple noise conditions ($SNR \in {0, 6, 12}\text{ dB}$) is available in the /results directory. Please note that the specific audio files (mixture, denoised, and clean) at SNR = 0, 6, and 12 dB for all 4 experiments are embedded directly within the tables of Section 1. to 4.

While the full .wav files can be downloaded directly from the repository, a selection of illustrative samples featuring female speech masked by helicopter noise is provided below for immediate playback:

Scenario: Female Voice + Helicopter Noise @ 6 dB SNR (Standard Test Case)

Scenario: Female Voice + Rain Noise @ 6 dB SNR (Standard Test Case)

Note: To benchmark the architecture across edge cases, please navigate to the /results folder to download the $0 dB$ (severe masking) and $12 dB$ (mild noise) evaluation sets.

6. Key Considerations & Technical Paradoxes

During extreme stress-testing (specifically at 0 dB and 6 dB under high-frequency continuous masking like rain), an acoustic-semantic paradox was observed: PESQ and SI-SDR improve significantly while the Word Error Rate (WER) can occasionally degrade.

a. The Perceptual vs. Semantic Trade-Off

  • Acoustic Strategy: At low SNR, the subband U-Net maximizes the training loss by aggressively attenuating heavily corrupted frequency regions. This eliminates harsh musical noise and continuous high-frequency artifacts, which directly leads to substantial SI-SDR improvements and higher PESQ (MOS) scores (as the processed audio is much more comfortable for human ears).
  • ASR Vulnerability: However, this aggressive filtering can occasionally smooth out or damp fine phonetic transitions (such as unvoiced fricatives or word-final suffixes). While a human listener easily interpolates the missing context, the automated Automated Speech Recognition (ASR) decoder loses acoustic evidence, leading to hallucinations or early omissions. This explains why the v12pp model tracks the Oracle Clean Phase performance almost perfectly, proving we are hitting the physical limits of information retrieval from the corrupted signal.

b. Next Optimization Steps for Production

To balance perceptual comfort with strict ASR transcription accuracy, two architectural adjustments are being analyzed:

  • Controlled Mixture Injection ($\alpha = 0.05$): Reintroducing a small, controlled floor of the original untreated mixture in the high-frequency bands. This residual noise floor can preserve crucial low-amplitude phonetic cues (plosives/fricatives) needed by speech-to-text engines, preventing over-smoothing without compromising overall noise-canceling performance.
  • Phase-Blending Refinement: Optimizing the cross-fade and phase alignment during the Overlap-Add (OLA) process specifically at the subband boundaries (around Bin 56). Ensuring high phase coherence during blending prevents micro-muffling effects that impact automated phoneme recognition.

Conclusions & Experimental Synthesis (v12m through v12p Paradigm)

The iterative architectural journey from the monolithic v12m baseline to the dual-stream v12p Subband engine establishes a highly robust framework for complex time-frequency speech enhancement. By cross-analyzing our empirical findings across varying signal-to-noise ratios (SNR = 0, 6, 12, 15 dB) and divergent noise matrices (impulsive Helicopter modulations vs. widespread distributed Rain masking), we synthesize four primary physical and perceptual conclusions:

1. The Subbanding Factorization Benefit

Transitioning from a uniform spectral grid to a dedicated Subband dual-stream processing model proved to be an architectural turning point. By decoupling low-frequency tracking (vocal harmonic anchoring) from high-frequency processing (noise suppression), the network avoids geometric confusion between speech glottal pulses and non-stationary ambient noise. This structural separation preserves delicate harmonic trajectories in the lower bands while freeing up computational density to attack harsh, distributed noise fields in the upper spectrum.

2. The Cartesian Dog-Bone Bounding Impact

The evolution from legacy uniform boundaries and magnitude-only "Banana" clamping to the symmetric Cartesian Dog-Bone Clamping ($Real/Imag$) successfully shattered the phase-error performance ceiling. Forcing strict, tight constraints at lower frequencies ($K=2.0$) natively blocks heavy sub-bass interference. Concurrently, opening the complex gain gates up to $K=15.0$ in the high-frequency spectrum grants the U-Net the exact mathematical vector space required to isolate, rotate, and cancel out the noise phase without crushing unvoiced sibilants or generating unnatural "musical noise."

3. Metric Divergence: PESQ, SI-SDR, and WER Trade-Offs

The experimental matrix demonstrates that standard objective metrics can exhibit sharp operational divergences depending on the instantaneous acoustic scenario:

  • The Perceptual vs. Structural Paradox: In aggressive, low-SNR environments, the model frequently achieves excellent wideband PESQ scores, indicating supreme human auditory comfort and the successful eradication of background noise hiss.
  • Downstream ASR Constraints: However, under extreme stress tests, this aggressive phase cleaning can induce minute smoothing on transient micro-phonemes. While imperceptible to the human ear, this can trigger a localized degradation in SI-SDR and WER (Word Error Rate) on cloud-based ASR engines (e.g., Whisper). This divergence proves that optimizing for human comfort and machine intelligibility requires careful, scenario-specific tuning of the Dog-Bone's boundaries.

4. The Foundation of Phase Recovery: Kaiser-Windowed OLA

Across all structural iterations, the absolute bedrock of the system’s phase-cleansing stability remains the synchronized Analysis-Synthesis Overlap-Add (OLA) engine. By anchoring the pipeline on a continuous 66% overlap governed by a finely tuned Kaiser synthesis window ($\beta=3$), the network smooths out edge block transitions and eliminates phase-wrapping artifacts natively. This mathematical synchronization ensures that the complex mask predictions translate smoothly back into a continuous, artifact-free time-domain waveform.


Invitation to the readers

We strongly encourage researchers, embedded audio engineers, and DSP practitioners to experiment with these structural components. By manipulating the subband split boundary $f_b$, altering the Cartesian control points of the Dog-Bone envelope, and testing the system against localized real-world noise captures, the community can further push the boundaries of real-time embedded Edge AI.

Suggested Improvements & Roadmap (v12q / v13+ Perspectives)

While the Subband SimpleUNet_v12j core engine (driving the highly successful v12p run) has achieved massive performance milestones—specifically breaking the phase-error ceiling with the Cartesian Dog-Bone clamping and a net $+3.26\text{ dB}$ SI-SDR gain—the R&D pipeline leaves several structural openings for upcoming iterations.

1. Dynamic Adaptive Subband Splitting ($f_b$ Scheduling)

Currently, the boundary frequency bin $f_b$ that factorizes the low-frequency speech fundamentals from the high-frequency cleaning zone is static.

  • Acoustic Scene Adaptability: We propose migrating toward a signal-dependent or noise-driven $f_b$ split. For example, shifting $f_b$ dynamically depending on whether the engine detects a deep male voice (lower $f_b$ required to anchor fundamentals) versus a high-pitched female voice or child's voice.
  • Real-Time Spectral Analysis: Implement a pre-analysis routing block that profiles the input mixture's spectral envelope to optimize the subband slicing topology before tensor allocation.

2. Scenario-Specific & Noise-Agressive "Os de Chien" Customization

The current control points ($2.0 \rightarrow 15.0 \rightarrow 10.0$) for $\mathbf{K}_{\text{dog-bone}}$ are optimized for non-stationary rain sizzle. Future sprints will implement contextual geometry profiles for the Dog-Bone buffer based on the specific noise scenario and local SNR:

  • Hostile Low-SNR / High-Agressivity Mode: For extreme noise floor stress tests, automatically tighten the clamping envelope in the intermediate tracking bands while expanding the high-frequency gates up to $K=20$ to isolate highly distributed, chaotic noise phase structures.
  • Acoustic Profile Specialization: Develop targeted $\mathbf{K}_{\text{dog-bone}}$ vectors for distinct environmental signatures—such as a "Rain Profile" maximizing unvoiced fricative preservation, a "Transient Profile" designed to handle sudden impulsion spikes, and a "Stationary Profile" tailored for constant industrial hums.

3. Multi-Band & Smoothed Weight Control ($\lambda_{\text{wave}}$)

The instantaneous clamping mechanism $\lambda_{\text{wave}} = \text{clamp}(L_{\text{mask}} / (L_{\text{wave}} + \epsilon), \text{max}=20)$ guarantees graph stability but exhibits local micro-oscillations.

Next-Step Optimizations:

  • EWMA Smoothing: Wrap the ${L_{\text{mask}}}/{L_{\text{wave}}}$ scaling ratio inside an Exponentially Weighted Moving Average loop to enforce smooth gradient trajectories.
  • Acoustic Per-Band Weighting: Split $\lambda_{\text{wave}}$ into a dual-band coefficient. This will allow the Multi-Scale Spectral Loss (MSSL) to prioritize low-end structural magnitude reconstruction ($<1\text{ kHz}$) independently from high-frequency phase alignment.
  • Experimental $\alpha_{HF}$ Injection: Incorporate a lightweight, high-frequency spectral boost ($\alpha_{HF} = 0.1$) to completely eradicate the last traces of hoarseness during hostile low-SNR stress tests.

4. Structural & Architectural Recommendations

  • Complex-Valued Primitives: Move from Cartesian stacking $(R, I)$ to explicit complex-valued convolutions in the bottleneck to preserve geometric phase rotation properties natively.
  • Perceptual Metric Integration: Experiment with appending a non-differentiable proxy loss layer (such as a differentiable approximation of PESQ or Mel-frequency distance) to force the optimizer to prioritize human listening comfort.
  • Advanced Scheduling: Upgrade the standard plateau reduction to a Cosine Annealing with Warmup policy to avoid getting trapped in local minima during aggressive multi-scale training on the full 3,200 training samples (data4000).

Current Status (v12p): Fully operational with Subband Split Allocation, Cartesian "Os de Chien" $K=15$ boundary enforcement, Coordinate Attention tracking, and synchronized 15,875-sample continuous boundaries.

Next Sprint (v12q Preview): Systematic ablation studies focusing on dynamic scenario-driven clamping transformations and adaptive frequency boundary $f_b$ profiling under varying input SNR conditions.



APPENDIX A: Mathematical Analysis of Subband SimpleUNet_v12n Architecture and MSSL Training Engine (v12m/p)

A-1. Input Space, Dual-Band Decomposition, and Cartesian Representation

Unlike legacy magnitude-phase feature stacking, the v12m/p pipeline enforces a strict, linear Cartesian representation to prevent non-linear phase wrapping distortions during convolution.

The input space is factorized into a dual-band architecture at a critical boundary frequency bin $f_b$, splitting the spectrum into Low-Frequency (LF) and High-Frequency (HF) subbands to adapt processing scales to vocal physics. The full input tensor $\mathbf{X}$ maps the raw Real ($\mathcal{R}$) and Imaginary ($\mathcal{I}$) components of the noisy STFT directly:

$$ \mathbf{X} \in \mathbb{R}^{B \times 2 \times F \times T} $$


where $B$ is the batch size, $F = 129$ (discrete frequency bins derived from $N_{\text{fft}}=256$), and $T = 128$ (strictly synchronized time frames derived from the $15,875$-sample surgical truncation with a hop size of $125$).

The tensor is slice-allocated along the frequency axis such that:

$$ \mathbf{X}_{\text{low}} = \mathbf{X}[:, :, 0:f_b, :], \quad \mathbf{X}_{\text{high}} = \mathbf{X}[:, :, f_b:F, :] $$


A-2. Spatial-Frequency Factorized Tracking (Coordinate Attention)

To capture non-stationary impulsive transients along the time axis while simultaneously isolating voice harmonics along the frequency axis, the bottleneck embeds a Coordinate Attention Block.

Let $\mathbf{H} \in \mathbb{R}^{C \times F \times T}$ be the intermediate feature map. The block factorizes the 2D spatial alignment into two parallel 1D directional feature aggregations.

The horizontal (time-axis) and vertical (frequency-axis) pooling operations are mathematically defined as:


$$ z_c^f(f) = \frac{1}{T} \sum_{0 \le t < T} H_c(f, t) $$


$$ z_c^t(t) = \frac{1}{F} \sum_{0 \le f < F} H_c(f, t) $$

These localized vectors are concatenated, passed through a shared $1\times1$ convolutional mapping $\mathbf{F}_1$, decomposed back into axis-specific attention weights $\mathbf{g}^f$ and $\mathbf{g}^t$ via Sigmoid activations $\sigma$, and applied multiplicatively to yield the attention-sustained tensor $\mathbf{Y}$:


$$ \mathbf{Y}_c(f, t) = \mathbf{H}_c(f, t) \times g_c^f(f) \times g_c^t(t) $$



A-3. Parametric Boundary Enforcement: The "Os de Chien" v12j Buffer

The final layer projects raw Cartesian updates through a localized, non-linear bounding envelope. Instead of a flat isotropic constraint, the network registers a frequency-dependent vector $\mathbf{K}_{\text{dog-bone}} \in \mathbb{R}^{F}$, computed via linear interpolation across three strategic control points anchored around the subband boundary $f_b$ ($2.0 \rightarrow 15.0 \rightarrow 10.0$):

The complex-valued target mask $\mathbf{\hat{M}}$ is bounded analytically using the Hyperbolic Tangent operator applied directly to the raw output tensor $\mathbf{O}$:


$$ \mathbf{\hat{M}} = \tanh(\mathbf{O}) \odot \mathbf{K}_{\text{dog-bone}} $$


This ensures that the real and imaginary coefficients are strictly contained within the parametric boundary, suppressing extreme outliers that cause musical noise while leaving the phase orientation angle ($\angle\mathbf{\hat{M}}$) completely undistorted.


A-4. Multi-Scale Spectral Loss (MSSL) with Adaptive Waveform Weighting

The v12m/p objective function optimizes structural Cartesian topology, temporal alignment, and multi-scale windowing properties concurrently.

a. Component Losses

  • Cartesian Mask $L_1$ Loss: Direct structural minimization of the estimated mask against the ideal target:

$$ \mathcal{L}_{\text{mask}} = |\mathbf{\hat{M}}_{\text{real}} - \mathbf{M}_{\text{real}}|_1 + |\mathbf{\hat{M}}_{\text{imag}} - \mathbf{M}_{\text{imag}}|_1 $$


  • Adaptive Waveform $L_1$ Loss: Time-domain reconstruction error evaluated directly on the resynthesized, un-padded $15,875$-sample wave:

$$ \mathcal{L}_{\text{wave}} = |\text{iSTFT}(\mathbf{\hat{M}} \odot \mathbf{S}_{\text{noisy}}) - \mathbf{s}_{\text{clean}}|_1 $$


  • Multi-Scale Spectral Penalty ($\mathcal{L}_{\text{mssl}}$): Computes magnitude log-distance over multiple independent window setups ($N_{\text{fft}} \in {512, 1024, 2048}$) to safeguard speech intelligibility across different echo and transient thresholds.

b. Total Loss Optimization Equation

The final loss objective is mathematically governed by:


$$ \mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{mask}} + \beta \cdot \lambda_{\text{wave}} \cdot \mathcal{L}_{\text{wave}} + \gamma \cdot \mathcal{L}_{\text{mssl}} $$


where $\alpha$, $\beta$, and $\gamma$ are static scaling coefficients, and $\lambda_{\text{wave}}$ is an instantaneous structural balancer clamped to prevent gradient saturation:


$$ \lambda_{\text{wave}} = \text{clamp}\left( \frac{\mathcal{L}_{\text{mask}}}{\mathcal{L}_{\text{wave}} + \epsilon}, \text{min}=1.0, \text{max}=70.0 \right) $$


---

A-5. Mathematical Justification of Convergence Stabilities

  1. Gradient Preservation via Dual-Axis Attention: Coordinate Attention eliminates the vanishing gradient effect in deep blocks. By maintaining separate spatial vectors for time and frequency, the backpropagation path preserves crisp vocal transients and trackable harmonic trajectories simultaneously.
  2. Anisotropic Bounding Prior: The "Os de Chien" v12j buffer acts as a physics-informed soft regularizer. Forcing tighter constraints at extreme low frequencies ($K=2.0$) natively blocks heavy, non-stationary sub-bass interference while allowing the high-frequency subband ($f &gt; f_b$) to expand dynamically up to $K=15.0$, maintaining the full vector space necessary for precise phase reconstruction.
  3. Pure Cartesian Boundary ($\alpha_{HF} = 0.0$): Operating on a native, synchronized time-frequency grid allows the engine to optimize phase consistency directly through the Cartesian graph. This completely removes the need for legacy raw mixture reinjections, keeping the reconstructed silence mathematically pure and noise-free.



License: MIT
Status: Active research & development

Setup & Requirements

pip install torch torchaudio numpy matplotlib scipy pesq pystoi torchinfo

About

Lightweight real-time speech denoising using STFT (128×128) + Simple U-Net with custom frequency-dependent Banana Clamping ("Os de Chien"). Designed for hostile acoustic environments (helicopter, industrial noise) and edge deployment.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors