Demix

Comparing U-Net and Vision Transformers for audio source separation.

About

U-Nets have been the go-to architecture for audio source separation (ASS) for years. They suit the task because it resembles their original purpose, image segmentation. In both cases, the model generates masks that isolate specific elements: organs in medical images, instruments in spectrograms.

This project questions whether that dominance is justified. Vision Transformers (ViTs) have become a leading architecture in computer vision by using self-attention to capture global context. The hypothesis here is simple: if music has long-range patterns (like a drum loop that repeats throughout a song), Transformers should handle them better than CNNs, whose receptive fields are limited.
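
To make "limited receptive fields" concrete, here is a rough sketch that counts how many spectrogram frames a single bottleneck unit can see along the time axis. The layer stack is hypothetical (it is not the U-Net in this repo), and the 512-sample hop is an assumption; 44.1 kHz is the MUSDB18 sample rate.

```python
def receptive_field(layers):
    """Receptive field along one axis for a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the view by (kernel - 1) input steps
        jump *= stride             # striding spreads later layers further apart
    return rf


# Hypothetical encoder: 4 blocks of two 3-wide convs plus a 2x pooling step.
encoder = [(3, 1), (3, 1), (2, 2)] * 4

frames = receptive_field(encoder)

HOP_LENGTH = 512      # assumed STFT hop, in samples
SAMPLE_RATE = 44100   # MUSDB18 sample rate
seconds = frames * HOP_LENGTH / SAMPLE_RATE

print(f"{frames} frames, about {seconds:.2f} s of audio")
```

For this made-up stack the bottleneck sees 76 frames, a bit under a second of audio at the assumed hop. A self-attention layer, by contrast, can relate any two patches of the input in a single step.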

The Goal

Train both architectures on the same data and compare their performance using Signal-to-Distortion Ratio (SDR).
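
As a reference point, this is a minimal sketch of the plain SDR definition, 10 * log10(||s||^2 / ||s - s_hat||^2), in NumPy. It is not necessarily the exact metric used in this project: published MUSDB18 results are usually computed with the museval package, whose BSS Eval-based SDR is calculated differently. The function name and the `eps` guard are illustrative choices.

```python
import numpy as np


def sdr(reference, estimate, eps=1e-10):
    """Plain signal-to-distortion ratio in dB (higher is better)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps avoids division by zero and log(0) for silent or perfect signals.
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))


# Toy check: a 440 Hz tone plus noise at one tenth of its amplitude.
rng = np.random.default_rng(0)
t = np.arange(44100) / 44100.0
reference = np.sin(2 * np.pi * 440.0 * t)
estimate = reference + 0.1 * rng.standard_normal(t.shape)

score = sdr(reference, estimate)
print(f"SDR: {score:.2f} dB")
```

The noisy tone scores roughly 17 dB. An estimate that is just the reference at half volume scores about 6 dB, which shows that this plain definition also penalizes gain errors.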

- Input: time-frequency representation of a song (spectrogram)
- Output: 4 separation masks, one for each source (vocals, drums, bass, other)
- Dataset: MUSDB18

Both models will generate masks that segment the spectrogram into isolated sources. The masks are then used to reconstruct the individual audio signals.
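
The masking step can be sketched as follows. This is an illustration under assumptions, not the repo's code: the softmax normalization across sources, the array shapes, and the `apply_masks` name are all made up here, and random numbers stand in for the model output.

```python
import numpy as np

SOURCES = ("vocals", "drums", "bass", "other")


def apply_masks(mixture_spec, mask_logits):
    """Apply per-source soft masks to a complex mixture spectrogram.

    mixture_spec: complex array, shape (freq_bins, frames)
    mask_logits:  real array, shape (4, freq_bins, frames)
    Returns a dict mapping source name to a complex spectrogram estimate.
    """
    # Softmax over the source axis: masks lie in [0, 1] and sum to 1 per bin.
    shifted = mask_logits - mask_logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    masks = exp / exp.sum(axis=0, keepdims=True)
    # A real-valued mask scales the magnitude and reuses the mixture phase.
    return {name: masks[i] * mixture_spec for i, name in enumerate(SOURCES)}


rng = np.random.default_rng(0)
freq_bins, frames = 513, 100
mixture_spec = (rng.standard_normal((freq_bins, frames))
                + 1j * rng.standard_normal((freq_bins, frames)))
mask_logits = rng.standard_normal((len(SOURCES), freq_bins, frames))  # stand-in for model output

estimates = apply_masks(mixture_spec, mask_logits)
total = sum(estimates.values())
print(np.allclose(total, mixture_spec))
```

Because the masks sum to 1 in every bin, the four estimates add back up to the mixture. Each estimate would then go through an inverse STFT (for example `librosa.istft`) to get a waveform.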

Why it matters

If ViT outperforms U-Net, it suggests that global attention is more important than local convolutions for understanding musical structure. That opens doors to better separation models and potentially transfer learning from pretrained vision models.

If U-Net wins, that supports the view that convolutional inductive biases still have value and aren't going away anytime soon.

Structure

demix/
├── data/
│   ├── raw/              # MUSDB18 dataset
│   ├── processed/        # preprocessed spectrograms
│   └── temp/             # temp files for API
│
├── src/
│   ├── models/           # U-Net and ViT implementations
│   ├── data/             # data loading and preprocessing
│   └── api/              # FastAPI server for inference
│
├── notebooks/            # experimentation
├── docker/               # containerization
└── results/              # metrics and plots

Tech

- TensorFlow 2.10
- librosa for audio
- FastAPI for serving
- Docker for deployment
- LocalStack for simulating AWS services (S3 and SQS) locally, and for practice
- Terraform for managing LocalStack

Building

Download the MUSDB18 dataset
- Get it from https://sigsep.github.io/datasets/musdb.html
- Extract it to data/raw/musdb18/

Build and run
```
docker compose up --build
```
- FastAPI: http://localhost:8000
- LocalStack: http://localhost:4566

Building with curiosity and a GTX 1050 Ti
