henliveira/av-curator


AV-Curator

The Problem

If you've ever tried to train (or evaluate) an audio-visual model on "web video", you know the data pipeline is usually 90% of the work. Raw web video is full of:

  • title cards and black frames at scene boundaries
  • music intros and dead silence at clip ends
  • near-duplicate clips from re-uploads
  • watermarks, hardcoded subtitles, and PIP overlays
  • audio that's actually just the platform's stock background music
  • "speech" that's actually a melody or a foreign language you didn't ask for

A quick "youtube-dl | ffmpeg | shuf and go" pipeline picks up all of this, so your downstream training and eval end up reflecting the noise rather than the signal.

The Approach

av-curator is a small, opinionated audio-visual data curation pipeline that runs a sequence of cheap, swappable filters over an input manifest and writes a clean output manifest (plus optional re-encoded clips).

The pipeline is intentionally modular — each filter is a function from (clip metadata, sources) -> (passes?, decision_record) — so you can:

  • start with just scene-cut + silence-trim and iterate from there
  • swap a heavy filter (e.g. CLIP-based dedup) for a fast one (perceptual hash) without rewiring anything
  • log every filter decision per clip, for auditability
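To make the filter shape concrete, here is a minimal sketch of a runner that applies filters in order and logs each decision. The type aliases, the two toy filters, and `run_filters` are all hypothetical names for illustration, not av-curator's actual API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shapes for illustration; av-curator's real types may differ.
ClipMeta = Dict[str, object]
Record = Dict[str, str]
Filter = Callable[[ClipMeta, Dict[str, str]], Tuple[bool, Record]]


def min_duration(clip: ClipMeta, sources: Dict[str, str]) -> Tuple[bool, Record]:
    """Cheap filter: pass clips that are at least 2 seconds long."""
    ok = float(clip["duration"]) >= 2.0
    return ok, {"stage": "min_duration", "verdict": "kept" if ok else "dropped"}


def has_audio(clip: ClipMeta, sources: Dict[str, str]) -> Tuple[bool, Record]:
    """Cheap filter: pass clips that carry an audio stream."""
    ok = clip.get("audio") is not None
    return ok, {"stage": "has_audio", "verdict": "kept" if ok else "dropped"}


def run_filters(clip: ClipMeta, sources: Dict[str, str],
                filters: List[Filter]) -> Tuple[bool, List[Record]]:
    """Apply filters in order, log every decision, stop at the first drop."""
    log: List[Record] = []
    for f in filters:
        passed, record = f(clip, sources)
        log.append(record)
        if not passed:
            return False, log
    return True, log
```

Because every filter has the same signature, swapping a heavy filter for a fast one is just a change to the `filters` list.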

Show Me

# 1. Build a manifest from a directory of raw clips
av-curate manifest data/raw/ --out manifest.jsonl

# 2. Run the full default pipeline
av-curate run manifest.jsonl --config configs/default.yaml \
    --out manifest.clean.jsonl --report report.html

# 3. Slice the surviving clips with ffmpeg
av-curate slice manifest.clean.jsonl --out data/processed/

report.html contains a small per-stage funnel:

input:     10000 clips
├─ codec_filter             ──▶  9871  (dropped 129 unreadable)
├─ duration_filter (≥2s)    ──▶  9612  (dropped 259 too short)
├─ silence_filter           ──▶  8954  (trimmed 982 / dropped 658 silent)
├─ scene_cut (max 1 cut)    ──▶  8112  (split 612 / dropped 842)
├─ phash_dedup              ──▶  7444  (dropped 668 near-dupes)
├─ whisper_lang (en, zh)    ──▶  6320  (kept en/zh, dropped 1124)
└─ clip_text_align          ──▶  5907  (dropped 413)
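A funnel like this can be recomputed from the per-clip `decisions` log. The sketch below assumes a `"dropped"` verdict string and that a dropped clip never reaches later stages; both are assumptions, and `funnel` is a hypothetical helper rather than part of av-curator.

```python
from collections import Counter
from typing import Dict, Iterable, List


def funnel(clips: Iterable[dict], stages: List[str]) -> Dict[str, int]:
    """Count how many clips survive each stage, in pipeline order.

    Assumes each clip has a "decisions" list of {"stage", "verdict"} records
    and that a "dropped" verdict removes the clip from all later stages.
    """
    dropped_at: Counter = Counter()
    total = 0
    for clip in clips:
        total += 1
        for d in clip.get("decisions", []):
            if d["verdict"] == "dropped":
                dropped_at[d["stage"]] += 1
                break  # a dropped clip has no later decisions that matter
    survivors: Dict[str, int] = {}
    remaining = total
    for stage in stages:
        remaining -= dropped_at[stage]
        survivors[stage] = remaining
    return survivors
```

Note that a stage that splits clips would add records; this sketch counts input clips only.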

Getting Started

git clone https://github.com/henliveira/av-curator
cd av-curator
pip install -e ".[full]"

System dependencies: ffmpeg, ffprobe. For CLIP/Whisper filters you'll want a CUDA-capable PyTorch install.

How it works

Manifest format

Every stage reads and writes a JSONL of Clip records:

{
  "id": "abc123",
  "path": "data/raw/abc123.mp4",
  "duration": 12.4,
  "video": {"fps": 25.0, "width": 1280, "height": 720, "codec": "h264"},
  "audio": {"sr": 44100, "channels": 2, "codec": "aac"},
  "trims": [[0.0, 12.4]],
  "tags": ["scene_clean", "lang=en"],
  "scores": {"phash_min_dist": 18.0, "clip_text_align": 0.27},
  "decisions": [
    {"stage": "silence_filter", "verdict": "kept", "note": "trimmed 0.6s tail"},
    ...
  ]
}
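Since the manifest is plain JSONL, it can be inspected with the standard library alone. The sketch below parses records and sums the seconds retained by `trims`; the helper names are hypothetical, and it assumes each trim is a `[start, end]` pair in seconds, as in the record above.

```python
import json
from typing import Iterable, Iterator


def iter_clips(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per non-empty line (the JSONL convention)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def kept_seconds(clip: dict) -> float:
    """Sum the lengths of the [start, end] spans in the clip's "trims" list."""
    return sum(end - start for start, end in clip.get("trims", []))
```

In practice you would pass an open file handle, for example `iter_clips(open("manifest.clean.jsonl"))`.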

Filter contract

A filter is a Python callable:

def my_filter(clip: Clip, ctx: Context) -> Decision:
    # Inspect the clip, then return exactly one Decision.
    # The four possible outcomes are:
    #   Decision.keep(note="ok")
    #   Decision.drop(reason="too dark")
    #   Decision.trim([(0.2, clip.duration - 0.1)])
    #   Decision.split([(0.0, 4.1), (4.6, 12.4)])
    return Decision.keep(note="ok")

Decision carries the verdict, an optional new trim list, and a free-form note that's logged into the clip's decisions field.
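As a rough illustration of that contract, here is a self-contained stand-in for `Decision` plus a toy filter. The field names, the `"dropped"` and `"trimmed"` verdict strings, and `tail_trim_filter` are assumptions made for this sketch; the real class in av-curator may look different.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Span = Tuple[float, float]


@dataclass(frozen=True)
class Decision:
    """Hypothetical stand-in for av-curator's Decision; the real class may differ."""
    verdict: str
    trims: Optional[List[Span]] = None
    note: str = ""

    @classmethod
    def keep(cls, note: str = "") -> "Decision":
        return cls("kept", None, note)

    @classmethod
    def drop(cls, reason: str) -> "Decision":
        return cls("dropped", None, reason)

    @classmethod
    def trim(cls, spans: List[Span], note: str = "") -> "Decision":
        return cls("trimmed", list(spans), note)


def tail_trim_filter(duration: float, silent_tail: float) -> Decision:
    """Toy filter: drop all-silent clips, trim a silent tail, otherwise keep."""
    if silent_tail >= duration:
        return Decision.drop(reason="silent")
    if silent_tail > 0:
        return Decision.trim([(0.0, duration - silent_tail)],
                             note=f"trimmed {silent_tail:.1f}s tail")
    return Decision.keep(note="ok")
```

Returning a value object rather than mutating the clip keeps filters side-effect free, which is what makes per-clip logging and caching straightforward.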

Built-in filters

| Filter | Module | Cost |
| --- | --- | --- |
| codec_filter | avcurator.filters.codec | cheap (ffprobe) |
| duration_filter | avcurator.filters.duration | cheap |
| black_frame | avcurator.filters.black_frame | cheap (ffmpeg blackdetect) |
| silence_filter | avcurator.filters.silence | cheap (ffmpeg silencedetect) |
| scene_cut | avcurator.filters.scene_cut | medium (PySceneDetect) |
| phash_dedup | avcurator.filters.phash | medium |
| clip_phash_dedup | avcurator.filters.clip_dedup | heavy (CLIP embedding) |
| whisper_lang | avcurator.filters.whisper | heavy (Whisper inference) |
| clip_text_align | avcurator.filters.clip_align | heavy (CLIP) |
| watermark_detect | avcurator.filters.watermark | heavy (template match + CNN) |

Each filter is configured via a YAML stanza; see configs/default.yaml for the canonical example.

Caching

Heavy filters cache their per-clip results to disk (.cache/<filter>/) keyed by a hash of (filter version, clip id, args). Re-running the pipeline after tweaking a cheap filter doesn't re-run the heavy ones.
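One plausible way to build such a key is sketched below. The choice of SHA-256, the JSON serialization, and the `<key>.json` file name are assumptions for illustration; only the `(filter version, clip id, args)` inputs and the `.cache/<filter>/` directory come from the description above.

```python
import hashlib
import json
from pathlib import Path


def cache_key(filter_version: str, clip_id: str, args: dict) -> str:
    """Hash (filter version, clip id, args) into a stable hex digest."""
    # sort_keys makes the key independent of dict insertion order.
    payload = json.dumps([filter_version, clip_id, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_path(filter_name: str, key: str, root: str = ".cache") -> Path:
    """Map a key to .cache/<filter>/<key>.json (the file name is an assumption)."""
    return Path(root) / filter_name / f"{key}.json"
```

Including the filter version and args in the key means that changing a heavy filter's own settings invalidates its entries, while edits to other filters leave them untouched.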

Examples

"I just want speech-heavy clips for ASR pre-training"

filters:
  - codec_filter
  - duration_filter:
      min: 2.0
      max: 30.0
  - silence_filter:
      noise_db: -30
      min_silence: 0.4
  - whisper_lang:
      keep: [en, zh]
      min_speech_ratio: 0.6

"I want clean visual clips for video-LM pre-training"

filters:
  - codec_filter
  - duration_filter: {min: 3.0}
  - black_frame
  - scene_cut: {max_cuts_per_clip: 1}
  - phash_dedup: {hamming: 12}
  - clip_text_align: {threshold: 0.22}
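For intuition about the `hamming: 12` setting, here is a minimal sketch of perceptual-hash dedup on integer hashes. It assumes that clips whose hashes differ by at most `max_dist` bits count as near-duplicates and that the first clip seen is the one kept; the actual phash_dedup semantics may differ.

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def dedup(hashes: Dict[str, int], max_dist: int = 12) -> List[str]:
    """Greedy dedup: keep a clip unless it is within max_dist of a kept one."""
    kept: List[str] = []
    for clip_id, h in hashes.items():
        if all(hamming(h, hashes[k]) > max_dist for k in kept):
            kept.append(clip_id)
    return kept
```

This pairwise version is quadratic in the number of kept clips; it is meant to show the threshold logic, not to scale to a full manifest.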

Performance notes

  • The cheap filters run in a process pool over ffprobe / ffmpeg subprocesses — single-machine throughput of ~5k clips / minute on 32 cores.
  • The heavy filters are batched on GPU; rough numbers for one A100-40G:
    • CLIP filter — ~1200 clips / min (8-frame sampling)
    • Whisper-large — ~600 clips / min (chunked)
  • Heavy filters can be sharded across nodes via --shard idx/total if you bring your own scheduler.
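If you are writing your own scheduler around `--shard idx/total`, a stable clip-to-shard assignment might look like the sketch below. The zero-based index and the hash-based assignment are assumptions; the README does not say how av-curate actually partitions clips.

```python
import hashlib
from typing import Tuple


def parse_shard(spec: str) -> Tuple[int, int]:
    """Parse "idx/total", e.g. "2/8", assuming idx is zero-based."""
    idx_s, total_s = spec.split("/")
    idx, total = int(idx_s), int(total_s)
    if not 0 <= idx < total:
        raise ValueError(f"shard index out of range: {spec}")
    return idx, total


def in_shard(clip_id: str, idx: int, total: int) -> bool:
    """Stable assignment: hash the clip id instead of using Python's salted hash()."""
    digest = hashlib.sha256(clip_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total == idx
```

Hashing the clip id, rather than slicing the manifest by position, keeps each clip on the same shard even if the manifest is reordered between runs.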

What this isn't

  • A downloader. av-curate starts from clips on disk; bring your own yt-dlp step.
  • A training framework. This produces clean data, nothing else.
  • A perfect filter. No filter is. Many filters are tuned conservatively, erring toward dropping borderline clips (a drop is cheap; re-running training is not).

License

BSD 3-Clause.

Acknowledgments

Built on ffmpeg, PySceneDetect, CLIP, and OpenAI's Whisper. Inspired by several years of staring at messy web-video datasets.

About

Audio-visual data curation pipeline — scene cuts, silence trim, dedup, CLIP/Whisper filtering for messy web video.
