Filter the VGGSound dataset to identify videos containing only sound effects (no music, no speech) and generate rich text descriptions for them.
This pipeline processes VGGSound videos through multiple stages:
- Extraction: Videos from tar.gz → WAV audio (16kHz mono)
- Label Pre-filtering: Fast rejection of obvious music/speech labels
- Speech Detection: Silero VAD for voice activity detection
- Music Detection: CLAP zero-shot classification
- Audio Captioning: Microsoft CLAP clapcap (batch processing)
- Multimodal Verification (optional): Qwen2.5-Omni for uncertain cases
┌─────────────────────────────────────────────────────────────────────────────┐
│ 1. EXTRACTION       2. FILTERING           3. CAPTIONING    4. OUTPUT       │
│ ┌──────────────┐    ┌────────────────┐     ┌─────────────┐  ┌───────────┐   │
│ │ MP4 → WAV    │───▶│ Speech: Silero │────▶│ CLAP clapcap│─▶│   JSONL   │   │
│ │ (ffmpeg)     │    │ Music: CLAP    │     │ (captioning)│  │ + scores  │   │
│ └──────────────┘    └────────────────┘     └─────────────┘  └───────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
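Concretely, stage 1 reduces to one ffmpeg call per video. A minimal sketch in Python (the `extract_wav` helper is illustrative, not the pipeline's actual API):

```python
import subprocess
from pathlib import Path

def extract_wav(mp4_path: Path, wav_path: Path) -> None:
    """Extract 16 kHz mono WAV audio from a video file via ffmpeg."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", str(mp4_path),
            "-vn",           # drop the video stream
            "-ac", "1",      # downmix to mono
            "-ar", "16000",  # resample to 16 kHz
            str(wav_path),
        ],
        check=True,
        capture_output=True,
    )
```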
- Python 3.10+
- PyTorch 2.6+
- ffmpeg (for audio extraction)
- Supported platforms:
- Linux (x86_64, aarch64) - CPU or CUDA GPU
- macOS (Apple Silicon only) - CPU-based inference
- Windows (x86_64) - CPU or CUDA GPU
Note: Intel Macs (x86_64) are not supported as PyTorch dropped support for them.
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and install
git clone <repo_url>
cd vggsound-pipeline
# Choose your backend:
# Apple Silicon Mac or Linux CPU-only
uv sync --extra cpu
# Linux with CUDA 12.4 (Colab, cloud GPUs)
uv sync --extra cuda
# Add dev tools (pytest, ruff)
uv sync --extra cpu --extra dev

# CPU-only (macOS or Linux)
pip install -e ".[cpu]"
# With CUDA 12.4 (Colab, cloud GPUs)
pip install -e ".[cuda]"

# Basic run with 100 samples
vggsound run vggsound_00.tar.gz vggsound.csv --sample-limit 100
# Full run with multimodal verification
vggsound run vggsound_00.tar.gz vggsound.csv --enable-multimodal
# View statistics
vggsound stats output/sfx_filtered.jsonl
# Validate with random samples
vggsound validate output/sfx_filtered.jsonl --sample-size 20

Main pipeline command.
vggsound run INPUT_TAR CSV_PATH [OPTIONS]
Options:
--output, -o PATH Output JSONL path [default: output/sfx_filtered.jsonl]
--skip-label-filter Skip label-based pre-filtering, use ML only
--enable-multimodal Enable Qwen2.5-Omni for uncertain samples
--batch-size, -b INT Batch size for GPU inference [default: 8]
--num-workers, -w INT Workers for audio extraction [default: 4]
--sample-limit, -n INT Limit samples (for testing)
--speech-threshold FLOAT Max speech ratio [default: 0.1]
--music-threshold FLOAT Max music score [default: 0.3]
--device TEXT Device: auto, cuda, cpu, mps [default: auto]
--resume Resume from checkpoint

Show statistics for pipeline output.
vggsound stats OUTPUT_JSONL

Random sampling for manual validation.
vggsound validate OUTPUT_JSONL [OPTIONS]
Options:
--sample-size, -n INT Number of samples [default: 50]
--audio-dir PATH Directory with WAV files for playback

Extract and categorize VGGSound labels.
vggsound extract-labels CSV_PATH [OPTIONS]
Options:
--output, -o PATH Output JSON path

{"video_id": "abc123_000030", "audio_text_description": "The sound of a dog barking repeatedly...", "speech_score": 0.02, "music_score": 0.05, "confidence": "high", "original_label": "dog barking"}

Fields:
- video_id: Unique identifier (YouTube ID + start time)
- audio_text_description: Rich text description from the captioning model (Microsoft CLAP clapcap)
- speech_score: Detected speech ratio (0-1)
- music_score: Music probability score (0-1)
- confidence: Classification confidence level
- original_label: VGGSound category label
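Because the output is one JSON object per line, it can be streamed record by record. A minimal reader, assuming the default output path:

```python
import json

high_confidence = []
with open("output/sfx_filtered.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Keep only records the pipeline marked as clearly SFX.
        if record["confidence"] == "high":
            high_confidence.append(record)

print(f"kept {len(high_confidence)} high-confidence SFX clips")
```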
| Component | Model | Purpose |
|---|---|---|
| Speech Detection | Silero VAD | Fast CPU-based voice detection |
| Music Detection | LAION CLAP | Zero-shot audio classification |
| Captioning | Microsoft CLAP clapcap | Audio captioning with batch support |
| Multimodal | Qwen2.5-Omni-7B | Video+audio verification (optional) |
Note: Microsoft CLAP clapcap was chosen for captioning - it's faster than LLM-based approaches and compatible with modern transformers. See IMPLEMENTATION_DETAILS.md.
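For reference, the music-detection stage can be sketched with Hugging Face's zero-shot-audio-classification pipeline and a LAION CLAP checkpoint; the checkpoint and candidate labels below are illustrative assumptions, not necessarily what the pipeline ships with:

```python
from transformers import pipeline

classifier = pipeline(
    "zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",  # assumed checkpoint
)

# Scores are normalized across the candidate labels.
results = classifier(
    "clip.wav",
    candidate_labels=["music", "speech", "sound effects"],
)
music_score = next(r["score"] for r in results if r["label"] == "music")
is_music = music_score > 0.3  # the --music-threshold default
```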
Environment variables (prefix with VGGSOUND_):
export VGGSOUND_SPEECH_THRESHOLD=0.1
export VGGSOUND_MUSIC_THRESHOLD=0.3
export VGGSOUND_BATCH_SIZE=8

Colab provides T4 GPUs with CUDA, so use the `cuda` extra:
# Install uv (faster than pip)
!curl -LsSf https://astral.sh/uv/install.sh | sh
!source ~/.local/bin/env
# Clone and install with CUDA support
!git clone <repo_url>
%cd vggsound-pipeline
!uv sync --extra cuda
# Or use pip if you prefer
!pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
!pip install -e ".[cuda]"
# Upload data and run
!vggsound run vggsound_00.tar.gz vggsound.csv --sample-limit 200

# Install dev dependencies
uv sync --extra cpu --extra dev
# Run tests
uv run pytest
# Lint and format
uv run ruff check .
uv run ruff format .

- Silero VAD: Extremely fast (<1ms/chunk), works great on CPU, no GPU needed (see the sketch after this list)
- LAION CLAP: Zero-shot classification with no task-specific training needed; pretrained on data that includes music and speech
- Microsoft CLAP clapcap: Built-in captioning, compatible with transformers>=4.34, batch processing
- Qwen2.5-Omni: When audio-only is uncertain, video context helps disambiguate
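A minimal sketch of the Silero VAD stage via torch.hub; deriving speech_score as the fraction of samples inside detected speech segments is an assumption about the pipeline's internals:

```python
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("clip.wav", sampling_rate=16000)
segments = get_speech_timestamps(wav, model, sampling_rate=16000)

# Fraction of samples that fall inside detected speech segments.
speech_ratio = sum(s["end"] - s["start"] for s in segments) / len(wav)
keep = speech_ratio < 0.1  # the --speech-threshold default
```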
For production-scale data pipelines:
- Compatibility: Works with modern transformers (>=4.34), no dependency conflicts
- Efficiency: Batch processing support, reasonable memory usage
- Quality: Good descriptions without requiring 7B+ parameter LLMs
For eval sets requiring highest quality, consider Qwen3-Omni-30B-A3B-Captioner via vLLM on A100s.
- high: Strong signal that audio is SFX (speech < 0.05, music < 0.15)
- medium: Meets thresholds but not definitive
- low: Uncertain, may need multimodal verification
- rejected: Clear speech or music detected
- multimodal_verified: Verified by video+audio analysis
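A plausible mapping from scores to these levels is sketched below; only the high cutoffs (speech < 0.05, music < 0.15) and the rejection thresholds come from this README, and the medium/low boundary is an illustrative assumption:

```python
def assign_confidence(
    speech_score: float,
    music_score: float,
    speech_threshold: float = 0.1,
    music_threshold: float = 0.3,
) -> str:
    if speech_score >= speech_threshold or music_score >= music_threshold:
        return "rejected"  # clear speech or music detected
    if speech_score < 0.05 and music_score < 0.15:
        return "high"      # strong SFX signal
    # Assumed boundary: "medium" means comfortably under the thresholds.
    if speech_score < 0.8 * speech_threshold and music_score < 0.8 * music_threshold:
        return "medium"
    return "low"           # borderline: candidate for multimodal verification
```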
- Models loaded sequentially (not all at once)
- CLAP models are lightweight, run on any GPU including T4
- CUDA cache cleared between stages
- Checkpointing for long runs and resume support
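In other words, each stage loads its model, runs, and releases it before the next stage starts. A minimal sketch (function names are hypothetical):

```python
import gc
import torch

def run_stage(load_model, batches):
    model = load_model()                 # load only this stage's model
    outputs = [model(batch) for batch in batches]
    del model                            # release before the next stage
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()         # clear CUDA cache between stages
    return outputs
```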
Note: The Installation section lists outdated dependencies. Run `uv sync` or check `pyproject.toml` for the current requirements.
| Platform | Notes |
|---|---|
| CUDA (Linux/Windows) | Best performance, MS CLAP uses ~2-3GB VRAM |
| MPS (Apple Silicon) | GPU-accelerated, good performance |
| CPU | Slower but functional, MS CLAP is lightweight |
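Device resolution for --device auto presumably follows the table's preference order (CUDA, then MPS, then CPU); a sketch with a hypothetical helper:

```python
import torch

def resolve_device(requested: str = "auto") -> str:
    if requested != "auto":
        return requested
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```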
MIT