If you've ever tried to train (or evaluate) an audio-visual model on "web video", you know the data pipeline is usually 90% of the work. Raw web video is full of:
- title cards and black frames at scene boundaries
- music intros and dead silence at clip ends
- near-duplicate clips from re-uploads
- watermarks, hardcoded subtitles, and PIP overlays
- audio that's actually just the platform's stock background music
- "speech" that's actually a melody or a foreign language you didn't ask for
A youtube-dl | ffmpeg | shuf | go pipeline picks up all of this, and your
downstream training/eval ends up reflecting the noise rather than the signal.
av-curator is a small, opinionated audio-visual data curation pipeline
that runs a sequence of cheap, swappable filters over an input manifest
and writes a clean output manifest (plus optional re-encoded clips).
The pipeline is intentionally modular — each filter is a function from
(clip metadata, sources) -> (passes?, decision_record) — so you can:
- start with just scene-cut + silence-trim and iterate from there
- swap a heavy filter (e.g. CLIP-based dedup) for a fast one (perceptual hash) without rewiring anything
- log every filter decision per clip, for auditability
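
The "filter as a function" idea above can be sketched in a few lines. This is a minimal illustration, not av-curator's actual runner: `ClipRecord`, `min_duration`, `max_duration`, and `run_pipeline` are hypothetical names invented for the example, and the `"dropped"` verdict string is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class ClipRecord:
    """Hypothetical stand-in for a manifest entry (not the library's Clip)."""
    id: str
    duration: float
    decisions: List[Dict] = field(default_factory=list)


# A filter maps a clip to (passes?, decision_record).
Filter = Callable[[ClipRecord], Tuple[bool, Dict]]


def min_duration(min_s: float) -> Filter:
    def _filter(clip: ClipRecord) -> Tuple[bool, Dict]:
        ok = clip.duration >= min_s
        return ok, {"stage": "min_duration", "verdict": "kept" if ok else "dropped"}
    return _filter


def max_duration(max_s: float) -> Filter:
    def _filter(clip: ClipRecord) -> Tuple[bool, Dict]:
        ok = clip.duration <= max_s
        return ok, {"stage": "max_duration", "verdict": "kept" if ok else "dropped"}
    return _filter


def run_pipeline(clips: Iterable[ClipRecord], filters: List[Filter]) -> List[ClipRecord]:
    """Apply filters in order; stop at the first one that rejects a clip."""
    survivors = []
    for clip in clips:
        for f in filters:
            passes, record = f(clip)
            clip.decisions.append(record)  # every decision is logged on the clip
            if not passes:
                break
        else:
            survivors.append(clip)  # no filter rejected the clip
    return survivors
```

Because every filter has the same shape, swapping one for another only changes the list passed to `run_pipeline`.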

```bash
# 1. Build a manifest from a directory of raw clips
av-curate manifest data/raw/ --out manifest.jsonl

# 2. Run the full default pipeline
av-curate run manifest.jsonl --config configs/default.yaml \
    --out manifest.clean.jsonl --report report.html

# 3. Slice the surviving clips with ffmpeg
av-curate slice manifest.clean.jsonl --out data/processed/
```

`report.html` contains a small per-stage funnel:

```text
input: 10000 clips
├─ codec_filter           ──▶ 9871  (dropped 129 unreadable)
├─ duration_filter (≥2s)  ──▶ 9612  (dropped 259 too short)
├─ silence_filter         ──▶ 8954  (trimmed 982 / dropped 658 silent)
├─ scene_cut (max 1 cut)  ──▶ 8112  (split 612 / dropped 842)
├─ phash_dedup            ──▶ 7444  (dropped 668 near-dupes)
├─ whisper_lang (en, zh)  ──▶ 6320  (kept en/zh / dropped 1124)
└─ clip_text_align        ──▶ 5907  (dropped 413)
```
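
A funnel like the one above can be recomputed from the per-clip decision logs. The sketch below is an assumption-laden illustration: `funnel` is a hypothetical helper, it assumes dropped clips carry a `"dropped"` verdict, and it ignores splits (which add clips) for simplicity.

```python
from collections import Counter
from typing import Dict, List, Tuple


def funnel(records: List[Dict], stages: List[str]) -> List[Tuple[str, int, int]]:
    """Return [(stage, survivors_after_stage, dropped_at_stage)] in stage order."""
    dropped = Counter()
    for rec in records:
        for decision in rec.get("decisions", []):
            if decision["verdict"] == "dropped":  # assumed verdict string
                dropped[decision["stage"]] += 1
                break  # a dropped clip never reaches later stages
    remaining = len(records)
    rows = []
    for stage in stages:
        remaining -= dropped[stage]
        rows.append((stage, remaining, dropped[stage]))
    return rows


if __name__ == "__main__":
    demo = [
        {"id": "a", "decisions": [{"stage": "codec_filter", "verdict": "dropped"}]},
        {"id": "b", "decisions": [{"stage": "codec_filter", "verdict": "kept"}]},
    ]
    for stage, kept, lost in funnel(demo, ["codec_filter"]):
        print(f"{stage}: {kept} kept, {lost} dropped")
```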

```bash
git clone https://github.com/henliveira/av-curator
cd av-curator
pip install -e ".[full]"
```

System dependencies: ffmpeg, ffprobe. For the CLIP/Whisper filters you'll
want a CUDA-capable PyTorch install.

Every stage reads and writes a JSONL file of `Clip` records:

```json
{
  "id": "abc123",
  "path": "data/raw/abc123.mp4",
  "duration": 12.4,
  "video": {"fps": 25.0, "width": 1280, "height": 720, "codec": "h264"},
  "audio": {"sr": 44100, "channels": 2, "codec": "aac"},
  "trims": [[0.0, 12.4]],
  "tags": ["scene_clean", "lang=en"],
  "scores": {"phash_min_dist": 18.0, "clip_text_align": 0.27},
  "decisions": [
    {"stage": "silence_filter", "verdict": "kept", "note": "trimmed 0.6s tail"},
    ...
  ]
}
```

A filter is a Python callable:

```python
def my_filter(clip: Clip, ctx: Context) -> Decision:
    ...
    # A filter returns exactly one of the following:
    return Decision.keep(note="ok")
    return Decision.drop(reason="too dark")
    return Decision.trim([(0.2, clip.duration - 0.1)])
    return Decision.split([(0.0, 4.1), (4.6, 12.4)])
```

`Decision` carries the verdict, an optional new trim list, and a free-form
note that's logged into the clip's `decisions` field.

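
To make the filter contract concrete, here is a minimal sketch of a custom filter. The `Clip` and `Decision` classes below are simplified stand-ins written for this example so that it runs on its own; they mirror the constructors shown above but are not the library's real classes, and `duration_window` and the `"trimmed"` verdict string are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Clip:
    """Stand-in with only the fields this example needs."""
    id: str
    duration: float


@dataclass
class Decision:
    """Stand-in: a verdict, an optional new trim list, and a free-form note."""
    verdict: str
    trims: Optional[List[Tuple[float, float]]] = None
    note: str = ""

    @classmethod
    def keep(cls, note: str = "") -> "Decision":
        return cls("kept", note=note)

    @classmethod
    def drop(cls, reason: str = "") -> "Decision":
        return cls("dropped", note=reason)

    @classmethod
    def trim(cls, spans: List[Tuple[float, float]]) -> "Decision":
        return cls("trimmed", trims=list(spans))


def duration_window(clip: Clip, ctx=None, min_s: float = 2.0, max_s: float = 30.0) -> Decision:
    """Drop clips shorter than min_s; trim clips longer than max_s to their head."""
    if clip.duration < min_s:
        return Decision.drop(reason=f"shorter than {min_s}s")
    if clip.duration > max_s:
        return Decision.trim([(0.0, max_s)])
    return Decision.keep(note="ok")
```

Whether an over-long clip should be trimmed or dropped is a policy choice; the sketch trims only to show the `trim` path.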

| Filter | Module | Cost |
|---|---|---|
| codec_filter | avcurator.filters.codec | cheap (ffprobe) |
| duration_filter | avcurator.filters.duration | cheap |
| black_frame | avcurator.filters.black_frame | cheap (ffmpeg blackdetect) |
| silence_filter | avcurator.filters.silence | cheap (ffmpeg silencedetect) |
| scene_cut | avcurator.filters.scene_cut | medium (PySceneDetect) |
| phash_dedup | avcurator.filters.phash | medium |
| clip_phash_dedup | avcurator.filters.clip_dedup | heavy (CLIP embedding) |
| whisper_lang | avcurator.filters.whisper | heavy (Whisper inference) |
| clip_text_align | avcurator.filters.clip_align | heavy (CLIP) |
| watermark_detect | avcurator.filters.watermark | heavy (template match + CNN) |

Each filter is configured via a YAML stanza; see configs/default.yaml for
the canonical example.
Heavy filters cache their per-clip results to disk (.cache/<filter>/) keyed
by a hash of (filter version, clip id, args). Re-running the pipeline
after tweaking a cheap filter doesn't re-run the heavy ones.
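
The caching scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual implementation: `cache_key` and `cache_path` are hypothetical helpers, and the choice of SHA-256, JSON serialization, and a `.json` file suffix are guesses.

```python
import hashlib
import json
from pathlib import Path


def cache_key(filter_version: str, clip_id: str, args: dict) -> str:
    """Hash (filter version, clip id, args) into a stable hex digest."""
    # sort_keys makes the key independent of the order args were written in.
    payload = json.dumps([filter_version, clip_id, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_path(filter_name: str, filter_version: str, clip_id: str,
               args: dict, root: str = ".cache") -> Path:
    """Location of the cached result, following the .cache/<filter>/ layout."""
    return Path(root) / filter_name / (cache_key(filter_version, clip_id, args) + ".json")


if __name__ == "__main__":
    args = {"keep": ["en", "zh"], "min_speech_ratio": 0.6}
    print(cache_path("whisper_lang", "1.0", "abc123", args))
```

Including the filter version and arguments in the key means a changed threshold or an upgraded filter misses the cache, while an untouched heavy filter keeps hitting it.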

```yaml
filters:
  - codec_filter
  - duration_filter:
      min: 2.0
      max: 30.0
  - silence_filter:
      noise_db: -30
      min_silence: 0.4
  - whisper_lang:
      keep: [en, zh]
      min_speech_ratio: 0.6
```

```yaml
filters:
  - codec_filter
  - duration_filter: {min: 3.0}
  - black_frame
  - scene_cut: {max_cuts_per_clip: 1}
  - phash_dedup: {hamming: 12}
  - clip_text_align: {threshold: 0.22}
```

- The cheap filters run in a process pool over ffprobe / ffmpeg subprocesses, giving single-machine throughput of ~5k clips / minute on 32 cores.
- The heavy filters are batched on GPU; rough numbers for one A100-40G:
  - CLIP filter: ~1200 clips / min (8-frame sampling)
  - Whisper-large: ~600 clips / min (chunked)
- Heavy filters can be sharded across nodes via `--shard idx/total` if you bring your own scheduler.
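
One way a `--shard idx/total` split could work is to hash each clip id and keep the clips whose hash falls in the shard's bucket. This is only a sketch of that idea: `parse_shard` and `in_shard` are hypothetical helpers, and both the 0-based `idx` and the hash-based assignment are assumptions rather than documented behaviour.

```python
import hashlib
from typing import Tuple


def parse_shard(spec: str) -> Tuple[int, int]:
    """Parse an 'idx/total' spec, assuming idx is 0-based."""
    idx_s, total_s = spec.split("/")
    idx, total = int(idx_s), int(total_s)
    if not 0 <= idx < total:
        raise ValueError(f"shard index {idx} out of range for {total} shards")
    return idx, total


def in_shard(clip_id: str, idx: int, total: int) -> bool:
    """Deterministically assign a clip to exactly one of `total` shards."""
    # hashlib is stable across processes and machines, unlike built-in hash().
    digest = hashlib.sha256(clip_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total == idx


if __name__ == "__main__":
    idx, total = parse_shard("1/4")
    ids = [f"clip{i}" for i in range(10)]
    print([c for c in ids if in_shard(c, idx, total)])
```

Hashing the id rather than slicing the manifest by position keeps the assignment stable when clips are added or removed upstream.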

What `av-curator` is not:
- A downloader. `av-curate` starts from clips on disk; bring your own `yt-dlp` step.
- A training framework. This produces clean data, nothing else.
- A perfect filter. No filter is. Many filters are tuned conservatively (drop is cheap; re-running training is not).

BSD 3-Clause.
Built on ffmpeg, PySceneDetect,
CLIP, and OpenAI's Whisper. Inspired by
several years of staring at messy web-video datasets.