Filter the VGGSound dataset to identify videos containing only sound effects (no music, no speech) and generate rich text descriptions for them.
This pipeline processes VGGSound videos through multiple stages:
- Extraction: Videos from tar.gz → WAV audio (16kHz mono)
- Label Pre-filtering: Fast rejection of obvious music/speech labels
- Speech Detection: Silero VAD for voice activity detection
- Music Detection: CLAP zero-shot classification
- Audio Captioning: Microsoft CLAP clapcap (batch processing)
- Multimodal Verification (optional): Qwen2.5-Omni for uncertain cases
┌─────────────────────────────────────────────────────────────────────────────┐
│ 1. EXTRACTION       2. FILTERING           3. CAPTIONING    4. OUTPUT       │
│ ┌──────────────┐    ┌────────────────┐     ┌─────────────┐  ┌───────────┐   │
│ │ MP4 → WAV    │───▶│ Speech: Silero │────▶│ CLAP clapcap│─▶│   JSONL   │   │
│ │ (ffmpeg)     │    │ Music: CLAP    │     │ (captioning)│  │ + scores  │   │
│ └──────────────┘    └────────────────┘     └─────────────┘  └───────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
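Concretely, stage 1 reduces to one ffmpeg call per video. A minimal sketch in Python (the `extract_wav` helper is illustrative, not the pipeline's actual API):

```python
import subprocess
from pathlib import Path

def extract_wav(mp4_path: Path, wav_path: Path) -> None:
    """Extract 16 kHz mono WAV audio from a video file via ffmpeg."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", str(mp4_path),
            "-vn",           # drop the video stream
            "-ac", "1",      # downmix to mono
            "-ar", "16000",  # resample to 16 kHz
            str(wav_path),
        ],
        check=True,
        capture_output=True,
    )
```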
- Python 3.10+
- PyTorch 2.6+
- ffmpeg (for audio extraction)
- Supported platforms:
- Linux (x86_64, aarch64) - CPU or CUDA GPU
- macOS (Apple Silicon only) - CPU-based inference
- Windows (x86_64) - CPU or CUDA GPU
Note: Intel Macs (x86_64) are not supported as PyTorch dropped support for them.
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and install
git clone <repo_url>
cd vggsound-pipeline
# Choose your backend:
# Apple Silicon Mac or Linux CPU-only
uv sync --extra cpu
# Linux with CUDA 12.4 (Colab, cloud GPUs)
uv sync --extra cuda
# Add dev tools (pytest, ruff)
uv sync --extra cpu --extra dev

# CPU-only (macOS or Linux)
pip install -e ".[cpu]"
# With CUDA 12.4 (Colab, cloud GPUs)
pip install -e ".[cuda]"

# Basic run with 100 samples
vggsound run vggsound_00.tar.gz vggsound.csv --sample-limit 100
# Full run with multimodal verification
vggsound run vggsound_00.tar.gz vggsound.csv --enable-multimodal
# View statistics
vggsound stats output/sfx_filtered.jsonl
# Validate with random samples
vggsound validate output/sfx_filtered.jsonl --sample-size 20

Main pipeline command.
vggsound run INPUT_TAR CSV_PATH [OPTIONS]
Options:
--output, -o PATH Output JSONL path [default: output/sfx_filtered.jsonl]
--skip-label-filter Skip label-based pre-filtering, use ML only
--enable-multimodal Enable Qwen2.5-Omni for uncertain samples
--batch-size, -b INT Batch size for GPU inference [default: 8]
--num-workers, -w INT Workers for audio extraction [default: 4]
--sample-limit, -n INT Limit samples (for testing)
--speech-threshold FLOAT Max speech ratio [default: 0.1]
--music-threshold FLOAT Max music score [default: 0.3]
--device TEXT Device: auto, cuda, cpu, mps [default: auto]
--resume Resume from checkpoint

Show statistics for pipeline output.
vggsound stats OUTPUT_JSONL

Random sampling for manual validation.
vggsound validate OUTPUT_JSONL [OPTIONS]
Options:
--sample-size, -n INT Number of samples [default: 50]
--audio-dir PATH Directory with WAV files for playback

Extract and categorize VGGSound labels.
vggsound extract-labels CSV_PATH [OPTIONS]
Options:
--output, -o PATH Output JSON path

{"video_id": "abc123_000030", "audio_text_description": "The sound of a dog barking repeatedly...", "speech_score": 0.02, "music_score": 0.05, "confidence": "high", "original_label": "dog barking"}

Fields:
- video_id: Unique identifier (YouTube ID + start time)
- audio_text_description: Rich text description from the captioning model (Microsoft CLAP clapcap)
- speech_score: Detected speech ratio (0-1)
- music_score: Music probability score (0-1)
- confidence: Classification confidence level
- original_label: VGGSound category label
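Because the output is one JSON object per line, it can be streamed record by record. A minimal reader, assuming the default output path:

```python
import json

high_confidence = []
with open("output/sfx_filtered.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Keep only records the pipeline marked as clearly SFX.
        if record["confidence"] == "high":
            high_confidence.append(record)

print(f"kept {len(high_confidence)} high-confidence SFX clips")
```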
| Component | Model | Purpose |
|---|---|---|
| Speech Detection | Silero VAD | Fast CPU-based voice detection |
| Music Detection | LAION CLAP | Zero-shot audio classification |
| Captioning | Microsoft CLAP clapcap | Audio captioning with batch support |
| Multimodal | Qwen2.5-Omni-7B | Video+audio verification (optional) |
Note: Microsoft CLAP clapcap was chosen for captioning - it's faster than LLM-based approaches and compatible with modern transformers. See IMPLEMENTATION_DETAILS.md.
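For reference, the music-detection stage can be sketched with Hugging Face's zero-shot-audio-classification pipeline and a LAION CLAP checkpoint; the checkpoint and candidate labels below are illustrative assumptions, not necessarily what the pipeline ships with:

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",  # assumed checkpoint
)

# Scores are normalized across the candidate labels.
results = classifier(
    "clip.wav",
    candidate_labels=["music", "speech", "sound effects"],
)
music_score = next(r["score"] for r in results if r["label"] == "music")
is_music = music_score > 0.3  # the --music-threshold default
```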
Environment variables (prefix with VGGSOUND_):
export VGGSOUND_SPEECH_THRESHOLD=0.1
export VGGSOUND_MUSIC_THRESHOLD=0.3
export VGGSOUND_BATCH_SIZE=8

Colab provides T4 GPUs with CUDA, so use the `cuda` extra:
# Install uv (faster than pip)
!curl -LsSf https://astral.sh/uv/install.sh | sh
!source ~/.local/bin/env
# Clone and install with CUDA support
!git clone <repo_url>
%cd vggsound-pipeline
!uv sync --extra cuda
# Or use pip if you prefer
!pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
!pip install -e ".[cuda]"
# Upload data and run
!vggsound run vggsound_00.tar.gz vggsound.csv --sample-limit 200

# Install dev dependencies
uv sync --extra cpu --extra dev
# Run tests
uv run pytest
# Lint and format
uv run ruff check .
uv run ruff format .

- Silero VAD: Extremely fast (<1ms/chunk), works great on CPU, no GPU needed (see the sketch after this list)
- LAION CLAP: Zero-shot classification with no task-specific training needed; pretrained on data that includes music and speech
- Microsoft CLAP clapcap: Built-in captioning, compatible with transformers>=4.34, batch processing
- Qwen2.5-Omni: When audio-only is uncertain, video context helps disambiguate
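A minimal sketch of the Silero VAD stage via torch.hub; deriving speech_score as the fraction of samples inside detected speech segments is an assumption about the pipeline's internals:

```python
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("clip.wav", sampling_rate=16000)
segments = get_speech_timestamps(wav, model, sampling_rate=16000)

# Fraction of samples that fall inside detected speech segments.
speech_ratio = sum(s["end"] - s["start"] for s in segments) / len(wav)
keep = speech_ratio < 0.1  # the --speech-threshold default
```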
For production-scale data pipelines:
- Compatibility: Works with modern transformers (>=4.34), no dependency conflicts
- Efficiency: Batch processing support, reasonable memory usage
- Quality: Good descriptions without requiring 7B+ parameter LLMs
For eval sets requiring highest quality, consider Qwen3-Omni-30B-A3B-Captioner via vLLM on A100s.
- high: Strong signal that audio is SFX (speech < 0.05, music < 0.15)
- medium: Meets thresholds but not definitive
- low: Uncertain, may need multimodal verification
- rejected: Clear speech or music detected
- multimodal_verified: Verified by video+audio analysis
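A plausible mapping from scores to these levels is sketched below; only the high cutoffs (speech < 0.05, music < 0.15) and the rejection thresholds come from this README, and the medium/low boundary is an illustrative assumption:

```python
def assign_confidence(
    speech_score: float,
    music_score: float,
    speech_threshold: float = 0.1,
    music_threshold: float = 0.3,
) -> str:
    if speech_score >= speech_threshold or music_score >= music_threshold:
        return "rejected"  # clear speech or music detected
    if speech_score < 0.05 and music_score < 0.15:
        return "high"      # strong SFX signal
    # Assumed boundary: "medium" means comfortably under the thresholds.
    if speech_score < 0.8 * speech_threshold and music_score < 0.8 * music_threshold:
        return "medium"
    return "low"           # borderline: candidate for multimodal verification
```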
- Models loaded sequentially (not all at once)
- CLAP models are lightweight, run on any GPU including T4
- CUDA cache cleared between stages
- Checkpointing for long runs and resume support
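In other words, each stage loads its model, runs, and releases it before the next stage starts. A minimal sketch (function names are hypothetical):

```python
import gc
import torch

def run_stage(load_model, batches):
    model = load_model()                 # load only this stage's model
    outputs = [model(batch) for batch in batches]
    del model                            # release before the next stage
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()         # clear CUDA cache between stages
    return outputs
```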
Note: The Installation section lists outdated dependencies. Run `uv sync` or check `pyproject.toml` for the current requirements.
| Platform | Notes |
|---|---|
| CUDA (Linux/Windows) | Best performance, MS CLAP uses ~2-3GB VRAM |
| MPS (Apple Silicon) | GPU-accelerated, good performance |
| CPU | Slower but functional, MS CLAP is lightweight |
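Device resolution for --device auto presumably follows the table's preference order (CUDA, then MPS, then CPU); a sketch with a hypothetical helper:

```python
import torch

def resolve_device(requested: str = "auto") -> str:
    if requested != "auto":
        return requested
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```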
MIT