music-detector-demo.mp4
Real-time music detection running from the microphone — predictions update every second with confidence scores.
This repository showcases the capabilities of DeepAudioX, a lightweight PyTorch-based library that my team at MagCIL and I have developed. DeepAudioX provides simple APIs for training, evaluating, and running inference with audio models, using pretrained models as feature extractors!
In this project we train a binary classifier that detects Music vs. Non-Music excerpts, using a small sample of the AudioSet dataset that you can download from Kaggle (see the instructions below).
Let's get started 🚀
```text
music-detector/
├── configs/
│   ├── training_config.yaml            # Training hyperparameters and dataset paths
│   ├── online_inference_config.yaml    # Real-time inference settings
│   ├── music_non_music_map.json        # Train set: YouTube ID -> class label mapping
│   └── valid_music_non_music_map.json  # Validation set: YouTube ID -> class label mapping
├── scripts/
│   ├── dataset.py                      # Dataset construction utilities
│   ├── train.py                        # Training and evaluation script
│   └── online-inference.py             # Real-time microphone inference
├── pretrained_models/                  # Trained model checkpoints (.pt files)
├── misc/                               # Demo videos and media assets
├── data/                               # AudioSet audio files (not tracked in git)
├── pyproject.toml
└── README.md
```
- Python 3.13+
- uv package manager
- PortAudio (required by `sounddevice` for microphone access)
```bash
git clone https://github.com/ChrisNick92/music-detector.git
cd music-detector
uv sync
```
Download the AudioSet train and validation sets from Kaggle:
- Train set: https://www.kaggle.com/datasets/zfturbo/audioset
- Validation set: https://www.kaggle.com/datasets/zfturbo/audioset-valid
Place the downloaded data so the directory structure looks like:
```text
data/
├── audio_set_train/
│   └── train_wav/    # .wav files
└── audio_set_valid/
    └── valid_wav/    # .wav files
```
The class label mappings (`music_non_music_map.json` and `valid_music_non_music_map.json`) are already included in the `configs/` directory. By default, `training_config.yaml` expects them under `data/`: either copy them there (as in the snippet below) or update the `train_mapping` / `valid_mapping` paths in the config to point to `configs/` instead.
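If you choose to copy them, a couple of lines of Python will do it (a plain shell `cp` works just as well):

```python
import shutil
from pathlib import Path

# Copy the bundled class mappings from configs/ into data/,
# where training_config.yaml looks for them by default.
Path("data").mkdir(exist_ok=True)
for name in ("music_non_music_map.json", "valid_music_non_music_map.json"):
    shutil.copy(Path("configs") / name, Path("data") / name)
```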
Train a music detection model using a pretrained backbone:
```bash
uv run python scripts/train.py
```
Or specify a custom config:
```bash
uv run python scripts/train.py --config configs/training_config.yaml
```
Edit `configs/training_config.yaml` to customize the training pipeline:
| Section | Key | Description |
|---|---|---|
| `dataset` | `train_data_dir` | Path to the directory with training `.wav` files |
| `dataset` | `train_mapping` | Path to JSON mapping YouTube IDs to `"Music"` / `"Non-Music"` |
| `dataset` | `valid_data_dir` | Path to the directory with validation `.wav` files |
| `dataset` | `valid_mapping` | Path to JSON mapping for the validation set |
| `dataset` | `sample_rate` | Audio sampling rate in Hz (default: `16000`) |
| `dataset` | `segment_duration` | Segment length in seconds; `null` for full files (default: `10.0`) |
| `model` | `backbone` | Pretrained backbone: `beats`, `passt`, `mobilenet_05_as`, `mobilenet_10_as`, `mobilenet_40_as` |
| `model` | `pooling` | Pooling method: `gap`, `simpool`, `ep`, or `null` (default: `ep`) |
| `model` | `freeze_backbone` | Freeze backbone weights during training (default: `true`) |
| `model` | `classifier_hidden_layers` | List of hidden layer sizes for the MLP head, e.g. `[256]` or `[]` |
| `model` | `activation` | Activation function: `relu`, `gelu`, `tanh`, `leakyrelu` |
| `model` | `pretrained` | Load pretrained backbone weights (default: `true`) |
| `training` | `learning_rate` | Initial learning rate (default: `0.001`) |
| `training` | `epochs` | Maximum training epochs (default: `100`) |
| `training` | `patience` | Early stopping patience (default: `15`) |
| `training` | `batch_size` | Batch size (default: `16`) |
| `training` | `num_workers` | DataLoader workers (default: `4`) |
| `training` | `checkpoint_path` | Where to save the best model checkpoint |
| `training` | `device_index` | GPU index to use; omit or `null` for CPU |
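If you prefer deriving experiment configs programmatically instead of editing the YAML by hand, here's a minimal sketch (the keys come from the table above; the new values and the `my_experiment.yaml` filename are purely illustrative):

```python
import yaml

# Load the default training config and override a few documented keys,
# e.g. to fine-tune the backbone with a smaller learning rate.
with open("configs/training_config.yaml") as f:
    config = yaml.safe_load(f)

config["model"]["freeze_backbone"] = False   # unfreeze the backbone
config["training"]["learning_rate"] = 1e-4   # smaller LR for fine-tuning
config["training"]["batch_size"] = 8

# Write an experiment config (illustrative filename), then train with:
#   uv run python scripts/train.py --config configs/my_experiment.yaml
with open("configs/my_experiment.yaml", "w") as f:
    yaml.safe_dump(config, f)
```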
After training completes, the script automatically evaluates the best checkpoint on the validation set and prints a classification report.
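If you want the same kind of summary for predictions you generate yourself, scikit-learn's `classification_report` produces one (an illustrative sketch with dummy labels; not necessarily what DeepAudioX calls internally):

```python
from sklearn.metrics import classification_report

# Dummy labels and predictions, just to show the report format
# (0 = Non-Music, 1 = Music).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
print(classification_report(y_true, y_pred, target_names=["Non-Music", "Music"]))
```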
Run music detection from your microphone in real time:
```bash
uv run python scripts/online-inference.py
```
Or with a custom config:
```bash
uv run python scripts/online-inference.py --config configs/online_inference_config.yaml
```
The script captures audio segments from your microphone and prints predictions continuously:
```text
==========================================================
Music Detector — Real-Time Inference
Segment: 1s | Sample rate: 16000 Hz
==========================================================
Press Ctrl+C to stop.

2025-02-28 14:34:56 ♫ Music     [||||||||||||||||....] 82.3%
2025-02-28 14:34:57 ∅ Non-Music [||||||||||||||......] 71.5%
2025-02-28 14:34:58 ♫ Music     [||||||||||||||||||||] 97.1%
```
Edit `configs/online_inference_config.yaml`:
| Section | Key | Description |
|---|---|---|
| `model` | `backbone` | Must match the backbone used during training |
| `model` | `pooling` | Must match the pooling used during training |
| `model` | `classifier_hidden_layers` | Must match the architecture used during training |
| `model` | `checkpoint_path` | Path to the trained `.pt` checkpoint |
| `inference` | `sample_rate` | Must match the sample rate used during training |
| `inference` | `segment_duration` | Duration of each audio segment to classify, in seconds |
| `inference` | `device_index` | GPU index to use; omit or `null` for CPU |
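Because several of these values must match the training run, a quick consistency check can save a confusing debugging session (a sketch assuming the YAML layouts shown in the two tables above):

```python
import yaml

with open("configs/training_config.yaml") as f:
    train_cfg = yaml.safe_load(f)
with open("configs/online_inference_config.yaml") as f:
    infer_cfg = yaml.safe_load(f)

# These must agree with the settings used to train the checkpoint.
assert infer_cfg["model"]["backbone"] == train_cfg["model"]["backbone"]
assert infer_cfg["model"]["pooling"] == train_cfg["model"]["pooling"]
assert infer_cfg["inference"]["sample_rate"] == train_cfg["dataset"]["sample_rate"]
print("Inference config matches the training config.")
```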
`scripts/dataset.py` provides `build_music_detection_dataset()`, which constructs a DeepAudioX `AudioClassificationDataset` from a directory of `.wav` files and a JSON class mapping. It is used by the training script and can also be imported for custom workflows (see the sketch below).
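For a custom workflow, usage looks roughly like this (the import path and argument names are assumptions; check the function's signature in `scripts/dataset.py` for the exact API):

```python
from scripts.dataset import build_music_detection_dataset

# Argument names below are illustrative, not the confirmed API.
dataset = build_music_detection_dataset(
    data_dir="data/audio_set_train/train_wav",
    mapping_path="configs/music_non_music_map.json",
)
print(f"{len(dataset)} labeled excerpts")
```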
`scripts/train.py` is the end-to-end training pipeline: it loads the config, builds the datasets, constructs an `AudioClassifier` with a pretrained backbone, trains with early stopping, and evaluates the best checkpoint. All configuration is driven by YAML, so no code changes are needed to experiment.
`scripts/online-inference.py` runs real-time inference on microphone input: it loads a trained checkpoint, captures audio in fixed-duration segments, and prints timestamped predictions with confidence scores (a capture-loop sketch follows).
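The capture loop boils down to something like this sketch built on `sounddevice` (simplified: `classify()` is a placeholder standing in for the trained model's forward pass, which the real script loads from the checkpoint):

```python
import datetime

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000      # must match training (see the config tables)
SEGMENT_DURATION = 1.0   # seconds per prediction, as in the demo output

def classify(segment: np.ndarray) -> tuple[str, float]:
    """Placeholder for the trained model's forward pass."""
    return "Music", 0.99

while True:
    # Record one fixed-duration mono segment, blocking until it's full.
    frames = int(SAMPLE_RATE * SEGMENT_DURATION)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    label, conf = classify(audio.squeeze())
    print(f"{datetime.datetime.now():%Y-%m-%d %H:%M:%S} {label} {conf:.1%}")
```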
A pretrained checkpoint (`pretrained_models/mobilenet_05.pt`) is included in the repo so you can test the music detector right away; no training required!
Just run:
```bash
uv run python scripts/online-inference.py
```
Then play some music from your phone, speakers, or any source near your microphone and watch the predictions roll in. 🎶