AV-Curator

The Problem

If you've ever tried to train (or evaluate) an audio-visual model on "web video", you know the data pipeline is usually 90% of the work. Raw web video is full of:

title cards and black frames at scene boundaries
music intros and dead silence at clip ends
near-duplicate clips from re-uploads
watermarks, hardcoded subtitles, and PIP overlays
audio that's actually just the platform's stock background music
"speech" that's actually a melody or a foreign language you didn't ask for

A youtube-dl | ffmpeg | shuf | go pipeline picks up all of this, and your downstream training/eval ends up reflecting the noise rather than the signal.

The Approach

av-curator is a small, opinionated audio-visual data curation pipeline that runs a sequence of cheap, swappable filters over an input manifest and writes a clean output manifest (plus optional re-encoded clips).

The pipeline is intentionally modular — each filter is a function from (clip metadata, sources) -> (passes?, decision_record) — so you can:

start with just scene-cut + silence-trim and iterate from there
swap a heavy filter (e.g. CLIP-based dedup) for a fast one (perceptual hash) without rewiring anything
log every filter decision per clip, for auditability
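
To make the shape of that contract concrete, here is a minimal sketch of a runner that applies filters in order and records every decision on the clip. It is an illustration only, not av-curator's actual code: `ClipMeta`, `Sources`, `duration_between`, `has_audio`, and `run_pipeline` are hypothetical names, and the record fields simply mirror the `stage` / `verdict` / `note` keys shown in the manifest format.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-ins for illustration; av-curator's real types may differ.
ClipMeta = Dict[str, Any]
Sources = Dict[str, str]
Filter = Callable[[ClipMeta, Sources], Tuple[bool, Dict[str, str]]]


def duration_between(min_s: float, max_s: float) -> Filter:
    """Build a cheap filter that passes clips with min_s <= duration <= max_s."""
    def _filter(meta: ClipMeta, sources: Sources) -> Tuple[bool, Dict[str, str]]:
        ok = min_s <= meta["duration"] <= max_s
        note = "ok" if ok else f"duration {meta['duration']}s out of range"
        verdict = "kept" if ok else "dropped"
        return ok, {"stage": "duration_filter", "verdict": verdict, "note": note}
    return _filter


def has_audio(meta: ClipMeta, sources: Sources) -> Tuple[bool, Dict[str, str]]:
    """Pass only clips that carry an audio stream description."""
    ok = meta.get("audio") is not None
    verdict = "kept" if ok else "dropped"
    return ok, {"stage": "has_audio", "verdict": verdict, "note": "" if ok else "no audio"}


def run_pipeline(clips: List[ClipMeta], filters: List[Filter]) -> List[ClipMeta]:
    """Run filters in order; stop at the first failure, log every decision."""
    survivors = []
    for meta in clips:
        sources = {"path": meta["path"]}
        decisions = meta.setdefault("decisions", [])
        for f in filters:
            passed, record = f(meta, sources)
            decisions.append(record)  # logged whether the clip passes or not
            if not passed:
                break
        else:
            survivors.append(meta)  # no filter rejected the clip
    return survivors
```

Because every filter has the same signature, swapping one for another is a one-line change to the `filters` list, and dropped clips still carry the record of which stage rejected them.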

Show Me

# 1. Build a manifest from a directory of raw clips
av-curate manifest data/raw/ --out manifest.jsonl

# 2. Run the full default pipeline
av-curate run manifest.jsonl --config configs/default.yaml \
    --out manifest.clean.jsonl --report report.html

# 3. Slice the surviving clips with ffmpeg
av-curate slice manifest.clean.jsonl --out data/processed/

report.html contains a small per-stage funnel:

input:     10000 clips
├─ codec_filter             ──▶  9871  (dropped 129 unreadable)
├─ duration_filter (≥2s)    ──▶  9612  (dropped 259 too short)
├─ silence_filter           ──▶  8954  (trimmed 982 / dropped 658 silent)
├─ scene_cut (max 1 cut)    ──▶  8112  (split 612 / dropped 842)
├─ phash_dedup              ──▶  7444  (dropped 668 near-dupes)
├─ whisper_lang(en/zh)      ──▶  6320  (kept en/zh, dropped 1124)
└─ clip_text_align          ──▶  5907  (dropped 413)

Getting Started

git clone https://github.com/henliveira/av-curator
cd av-curator
pip install -e ".[full]"

System dependencies: ffmpeg, ffprobe. For CLIP/Whisper filters you'll want a CUDA-capable PyTorch install.

How it works

Manifest format

Every stage reads and writes a JSONL of Clip records:

{
  "id": "abc123",
  "path": "data/raw/abc123.mp4",
  "duration": 12.4,
  "video": {"fps": 25.0, "width": 1280, "height": 720, "codec": "h264"},
  "audio": {"sr": 44100, "channels": 2, "codec": "aac"},
  "trims": [[0.0, 12.4]],
  "tags": ["scene_clean", "lang=en"],
  "scores": {"phash_min_dist": 18.0, "clip_text_align": 0.27},
  "decisions": [
    {"stage": "silence_filter", "verdict": "kept", "note": "trimmed 0.6s tail"},
    ...
  ]
}
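
Since the manifest is plain JSONL, it can be inspected with a few lines of standard-library Python. The sketch below tallies verdicts per stage and sums the kept duration from `trims`; the helper names `tally_verdicts` and `kept_seconds` are made up for this example and are not part of av-curator.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def tally_verdicts(lines: Iterable[str]) -> "Counter[Tuple[str, str]]":
    """Count (stage, verdict) pairs across all clip records in a JSONL stream."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        clip = json.loads(line)
        for decision in clip.get("decisions", []):
            counts[(decision["stage"], decision["verdict"])] += 1
    return counts


def kept_seconds(clip: Dict[str, Any]) -> float:
    """Total duration covered by the clip's trim spans, in seconds."""
    return sum(end - start for start, end in clip.get("trims", []))
```

Usage, assuming a manifest on disk: `with open("manifest.clean.jsonl") as f: print(tally_verdicts(f))`.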

Filter contract

A filter is a Python callable:

def my_filter(clip: Clip, ctx: Context) -> Decision:
    ...
    # Return exactly one of the following:
    return Decision.keep(note="ok")
    # return Decision.drop(reason="too dark")
    # return Decision.trim([(0.2, clip.duration - 0.1)])
    # return Decision.split([(0.0, 4.1), (4.6, 12.4)])

Decision carries the verdict, an optional new trim list, and a free-form note that's logged into the clip's decisions field.
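
For readers who want a mental model of `Decision`, the following is one possible shape, written as a self-contained sketch. It is an assumption, not the class shipped in `avcurator`: the field names, the verdict strings other than "kept", and the example `edge_trim` filter are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


@dataclass(frozen=True)
class Decision:
    """Hypothetical model: a verdict, an optional new trim list, and a note."""
    verdict: str
    trims: Optional[List[Span]] = None
    note: str = ""

    @classmethod
    def keep(cls, note: str = "") -> "Decision":
        return cls("kept", None, note)

    @classmethod
    def drop(cls, reason: str) -> "Decision":
        return cls("dropped", None, reason)

    @classmethod
    def trim(cls, spans: List[Span], note: str = "") -> "Decision":
        return cls("trimmed", list(spans), note)

    @classmethod
    def split(cls, spans: List[Span], note: str = "") -> "Decision":
        return cls("split", list(spans), note)


def edge_trim(duration: float, head: float = 0.5, tail: float = 0.5) -> Decision:
    """Example filter body: shave the clip edges, or drop if nothing is left."""
    if duration <= head + tail:
        return Decision.drop(reason="shorter than the trimmed margins")
    return Decision.trim([(head, duration - tail)], note="trimmed edges")
```

Keeping the constructors as named class methods means a filter body reads as a statement of intent, and the runner only has to switch on `verdict`.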

Built-in filters

| Filter | Module | Cost |
| --- | --- | --- |
| codec_filter | `avcurator.filters.codec` | cheap (ffprobe) |
| duration_filter | `avcurator.filters.duration` | cheap |
| black_frame | `avcurator.filters.black_frame` | cheap (ffmpeg blackdetect) |
| silence_filter | `avcurator.filters.silence` | cheap (ffmpeg silencedetect) |
| scene_cut | `avcurator.filters.scene_cut` | medium (PySceneDetect) |
| phash_dedup | `avcurator.filters.phash` | medium |
| clip_phash_dedup | `avcurator.filters.clip_dedup` | heavy (CLIP embedding) |
| whisper_lang | `avcurator.filters.whisper` | heavy (Whisper inference) |
| clip_text_align | `avcurator.filters.clip_align` | heavy (CLIP) |
| watermark_detect | `avcurator.filters.watermark` | heavy (template match + CNN) |

Each filter is configured via a YAML stanza; see configs/default.yaml for the canonical example.
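
As an illustration of what a cheap ffmpeg-backed filter involves, the sketch below builds a `silencedetect` command and parses the `silence_start` / `silence_end` lines ffmpeg prints to stderr. This is not the code in `avcurator.filters.silence`; the function names are hypothetical, and the mapping of `noise_db` and `min_silence` onto the filter's `noise` and `d` options is an assumption.

```python
import re
from typing import List, Optional, Tuple

_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def silencedetect_cmd(path: str, noise_db: float, min_silence: float) -> List[str]:
    """Build an ffmpeg command that only analyses audio and writes no output."""
    return [
        "ffmpeg", "-hide_banner", "-nostats", "-i", path, "-vn",
        "-af", f"silencedetect=noise={noise_db}dB:d={min_silence}",
        "-f", "null", "-",
    ]


def parse_silences(stderr_text: str) -> List[Tuple[float, Optional[float]]]:
    """Pair up silence_start / silence_end lines into (start, end) spans.

    An unmatched trailing start is reported with end=None.
    """
    spans: List[Tuple[float, Optional[float]]] = []
    start: Optional[float] = None
    for line in stderr_text.splitlines():
        m = _START.search(line)
        if m:
            start = float(m.group(1))
            continue
        m = _END.search(line)
        if m and start is not None:
            spans.append((start, float(m.group(1))))
            start = None
    if start is not None:
        spans.append((start, None))
    return spans
```

To run it against a real file, pass the command to `subprocess.run(cmd, capture_output=True, text=True)` and feed the resulting `stderr` to `parse_silences`.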

Caching

Heavy filters cache their per-clip results to disk (.cache/<filter>/) keyed by a hash of (filter version, clip id, args). Re-running the pipeline after tweaking a cheap filter doesn't re-run the heavy ones.
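
A cache key of this kind can be sketched with the standard library as follows. The exact serialization av-curator uses is not documented here, so treat the JSON-plus-SHA-256 scheme and the `cache_key` / `cache_path` helpers as assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict


def cache_key(filter_version: str, clip_id: str, args: Dict[str, Any]) -> str:
    """Hash (filter version, clip id, args) into a stable hex digest.

    sort_keys makes the key independent of argument ordering in the config.
    """
    payload = json.dumps(
        {"version": filter_version, "clip": clip_id, "args": args},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_path(root: Path, filter_name: str, key: str) -> Path:
    """Location of a cached result, following the .cache/<filter>/ layout."""
    return root / filter_name / f"{key}.json"
```

The key depends only on the heavy filter's own version and arguments, which is why changing an upstream cheap filter leaves existing entries valid, while bumping the filter version or changing its arguments invalidates them.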

Examples

"I just want speech-heavy clips for ASR pre-training"

filters:
  - codec_filter
  - duration_filter:
      min: 2.0
      max: 30.0
  - silence_filter:
      noise_db: -30
      min_silence: 0.4
  - whisper_lang:
      keep: [en, zh]
      min_speech_ratio: 0.6
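
Note that the `filters` list mixes two stanza shapes: a bare name (`codec_filter`) and a single-key mapping of name to arguments. A loader has to normalize both. The sketch below shows one way to do that on the already-parsed YAML data; `normalize_filters` is a hypothetical helper, and av-curator's own loader may behave differently.

```python
from typing import Any, Dict, List, Tuple, Union

Stanza = Union[str, Dict[str, Any]]


def normalize_filters(stanzas: List[Stanza]) -> List[Tuple[str, Dict[str, Any]]]:
    """Turn a mixed list of names and {name: args} mappings into (name, args) pairs."""
    normalized: List[Tuple[str, Dict[str, Any]]] = []
    for stanza in stanzas:
        if isinstance(stanza, str):
            normalized.append((stanza, {}))  # bare name: no arguments
        elif isinstance(stanza, dict) and len(stanza) == 1:
            (name, args), = stanza.items()
            normalized.append((name, dict(args or {})))
        else:
            raise ValueError(f"unsupported filter stanza: {stanza!r}")
    return normalized


# The ASR config above, as a YAML parser would hand it over.
asr_filters = [
    "codec_filter",
    {"duration_filter": {"min": 2.0, "max": 30.0}},
    {"silence_filter": {"noise_db": -30, "min_silence": 0.4}},
    {"whisper_lang": {"keep": ["en", "zh"], "min_speech_ratio": 0.6}},
]
pipeline = normalize_filters(asr_filters)
```

Order is preserved, so the cheap filters listed first run before the heavy Whisper stage.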

"I want clean visual clips for video-LM pre-training"

filters:
  - codec_filter
  - duration_filter: {min: 3.0}
  - black_frame
  - scene_cut: {max_cuts_per_clip: 1}
  - phash_dedup: {hamming: 12}
  - clip_text_align: {threshold: 0.22}
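
The `hamming: 12` setting refers to Hamming distance between perceptual hashes. As a rough sketch of the idea, assuming 64-bit hashes stored as integers and assuming that a distance at or below the threshold counts as a near-duplicate (the source does not say whether the comparison is inclusive), a greedy dedup could look like this. `hamming` and `greedy_dedup` are illustrative names, not av-curator functions.

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def greedy_dedup(hashes: Dict[str, int], max_dist: int) -> List[str]:
    """Keep a clip only if it is farther than max_dist from every kept clip.

    Clips are visited in insertion order, so the first copy seen wins.
    This is O(n^2) and meant to show the logic, not to scale.
    """
    kept: List[str] = []
    for clip_id, h in hashes.items():
        if all(hamming(h, hashes[k]) > max_dist for k in kept):
            kept.append(clip_id)
    return kept
```

Lowering the threshold keeps more clips (only very close hashes are treated as duplicates); raising it drops more aggressively.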

Performance notes

The cheap filters run in a process pool over ffprobe / ffmpeg subprocesses — single-machine throughput of ~5k clips / minute on 32 cores.
The heavy filters are batched on GPU; rough numbers for one A100-40G:
- CLIP filter — ~1200 clips / min (8-frame sampling)
- Whisper-large — ~600 clips / min (chunked)
Heavy filters can be sharded across nodes via --shard idx/total if you bring your own scheduler.
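
To show what an `idx/total` shard spec implies, here is a small sketch that parses the spec and assigns each clip to exactly one shard by hashing its id. How av-curator actually partitions clips is not stated above; the zero-based index, the SHA-1 based assignment, and the helper names are assumptions.

```python
import hashlib
from typing import Tuple


def parse_shard(spec: str) -> Tuple[int, int]:
    """Parse 'idx/total' into integers, assuming a zero-based idx."""
    idx_s, total_s = spec.split("/")
    idx, total = int(idx_s), int(total_s)
    if total <= 0 or not 0 <= idx < total:
        raise ValueError(f"invalid shard spec: {spec!r}")
    return idx, total


def in_shard(clip_id: str, idx: int, total: int) -> bool:
    """Stable membership test: the same id lands in the same shard on every node.

    hashlib is used instead of the built-in hash(), which is salted per process.
    """
    digest = hashlib.sha1(clip_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total == idx
```

Each node filters the manifest with its own `idx`, so no coordination is needed beyond agreeing on `total`.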

What this isn't

A downloader. av-curate starts from clips on disk; bring your own yt-dlp step.
A training framework. This produces clean data, nothing else.
A perfect filter. No filter is. Many filters are tuned conservatively, erring toward dropping borderline clips (a drop is cheap; re-running training is not).

License

BSD 3-Clause.

Acknowledgments

Built on ffmpeg, PySceneDetect, CLIP, and OpenAI's Whisper. Inspired by several years of staring at messy web-video datasets.
