Classify short audio clips (e.g., dog bark, bird chirp, siren, rain) with a ResNet-style CNN trained on Mel Spectrograms. The project includes a full training pipeline (PyTorch), FastAPI inference service, serverless GPU inference with Modal, and an interactive Next.js + React dashboard for uploads, real-time predictions, and feature‑map visualization.
- 🧠 Deep Audio CNN for sound classification
- 🧱 ResNet-style architecture with residual blocks
- 🎼 Mel Spectrogram audio-to-image conversion
- 🎛️ Data augmentation: Mixup + SpecAugment (Time/Freq masking)
- ⚡ Serverless GPU inference with Modal
- 📊 Interactive Next.js & React dashboard (Tailwind + shadcn/ui)
- 📈 Real-time classification with confidence scores
- 🌊 Waveform & Spectrogram visualization
- 🚀 FastAPI inference endpoint
- 📈 TensorBoard integration for training analysis
- ✅ Pydantic validation for robust API requests
- Why Mel Spectrograms? They convert raw audio into a perceptual time–frequency image that CNNs handle well (first sketch below).
- Why ResNet? Residual connections ease optimization of deeper models and boost accuracy (second sketch).
- Why Mixup/SpecAugment? They provide strong regularization for robustness against noise and domain shift (third sketch).
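A minimal sketch of the audio-to-image step with torchaudio (the file name and parameter values such as `n_mels=128` are illustrative, not the project's exact settings):

```python
import torchaudio
import torchaudio.transforms as T

# Load a clip and resample to a fixed rate.
waveform, sr = torchaudio.load("dog_bark.wav")  # hypothetical example file
waveform = T.Resample(orig_freq=sr, new_freq=22050)(waveform)

# Log-scaled Mel spectrogram: the 2D "image" the CNN consumes.
mel = T.MelSpectrogram(sample_rate=22050, n_fft=1024, hop_length=512, n_mels=128)(waveform)
log_mel = T.AmplitudeToDB()(mel)  # shape: (channels, n_mels, time_frames)
```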
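A bare-bones residual block in the same spirit as the model here (channel counts and the exact block layout are assumptions, not the repo's definition):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv -> BN -> ReLU -> Conv -> BN, plus a skip connection."""
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # 1x1 projection so the skip path matches shape when stride/channels change.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))
```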
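SpecAugment masking ships with torchaudio, and Mixup is a few lines on top of a batch; the shapes, mask sizes, and Beta parameter below are illustrative:

```python
import torch
import torchaudio.transforms as T

# Dummy batch of log-Mel spectrograms and one-hot labels (shapes are illustrative).
specs = torch.randn(8, 128, 256)                      # (batch, n_mels, time_frames)
labels = torch.eye(10)[torch.randint(0, 10, (8,))]    # (batch, num_classes)

# SpecAugment: randomly zero out frequency bands and time spans.
spec_augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=24),
    T.TimeMasking(time_mask_param=48),
)
augmented = spec_augment(specs)

# Mixup: blend pairs of examples and their labels with a Beta-sampled weight.
def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

mixed_x, mixed_y = mixup(augmented, labels)
```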
Server:

```bash
cd server
conda create -n audio-cnn python=3.11 -y
conda activate audio-cnn
pip install -r requirements.txt
```

Client:

```bash
cd client
npm install
npm run dev
```

Create a `.env` file in your client root and point it at your deployed inference endpoint:

```
NEXT_PUBLIC_MODAL_API_ENDPOINT="<your Modal endpoint URL>"
```
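The dashboard posts audio to the endpoint configured above. A minimal sketch of what such a FastAPI endpoint with Pydantic validation can look like (the route, field names, and base64 payload format are assumptions for illustration, not the project's exact contract):

```python
import base64
import io

import torchaudio
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class InferenceRequest(BaseModel):
    audio_data: str  # base64-encoded audio bytes (assumed payload format)

class Prediction(BaseModel):
    label: str
    confidence: float

@app.post("/predict", response_model=list[Prediction])
def predict(req: InferenceRequest) -> list[Prediction]:
    # Decode the clip, build the log-Mel input, and run the pre-loaded model.
    waveform, sr = torchaudio.load(io.BytesIO(base64.b64decode(req.audio_data)))
    # ... preprocess to a log-Mel spectrogram and call the model here ...
    return [Prediction(label="dog_bark", confidence=0.93)]  # placeholder output
```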
- Torchaudio backend errors: ensure `ffmpeg`/`libsndfile` are installed.
- Noisy predictions: increase clip length, tweak the Mixup `alpha`, reduce masks.
- Overfitting: stronger Mixup/SpecAugment, Dropout in the classifier, early stopping.
- Underfitting: deeper ResNet, higher `base_channels`, longer training, lower weight decay.
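A hedged sketch of where some of these knobs sit in a typical training loop (the stand-in model, placeholder losses, and all values are assumptions, not the project's settings):

```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

# Stand-in classifier head; in the real project this sits on top of the ResNet-style CNN.
model = nn.Sequential(nn.Flatten(), nn.Dropout(p=0.5), nn.Linear(128 * 256, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
writer = SummaryWriter("runs/audio-cnn")

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    val_loss = torch.rand(1).item()  # placeholder; compute the real validation loss here
    writer.add_scalar("loss/val", val_loss, epoch)
    # Early stopping: quit once validation loss stops improving for `patience` epochs.
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
writer.close()
```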
Feel free to contact me on LinkedIn.