Multi-Agent Coordination for Latency-Constrained VLMs: Anomaly Detection in Large-Scale Surveillance
This repository is the official evaluation and reproducibility bundle for research on multi-agent coordination, cascade early-exit, and latency-aware vision–language models (VLMs) applied to weakly supervised video anomaly detection in large-scale surveillance.
The system is built around progressive perception: inexpensive models run first, and early exit avoids invoking heavier models on “easy” frames. The Dual-Stage Perception Module (Stages I–II) combines a reconstruction gate (convolutional autoencoder) with YOLOv8-nano object screening. Frames that remain ambiguous after these stages are eligible for Stage III semantic reasoning with a vision–language model (LLaVA-Next–class), using embedding + prototype matching and a cosine-similarity threshold τ_c (with abstention when confidence is low). Event-driven and cyclical agents coordinate ingestion from multiple cameras over a Pub/Sub broker (Redis) in the full deployment narrative; this repository emphasizes offline evaluation and reproducible metric CSVs.
Because the VLM is orders of magnitude slower than the AE and YOLO stages, latency-constrained coordination is central: most traffic should exit early at Stage I or II. Representative profiling illustrates per-stage latency (note the log scale for milliseconds) and the fraction of frames routed to each stage—only a minority should reach the VLM under a tuned policy.
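As a rough sketch, the routing policy described above can be expressed as a per-frame decision function. The threshold names and values below are illustrative assumptions, not the paper's tuned settings:

```python
from dataclasses import dataclass

@dataclass
class FrameScores:
    recon_error: float   # Stage I autoencoder reconstruction error
    object_conf: float   # Stage II detector confidence (e.g. max over boxes)
    vlm_cosine: float    # Stage III cosine similarity to nearest prototype

def route(scores, tau_ae=0.05, tau_obj=0.5, tau_c=0.7):
    """Return (decision, exit_stage) for one frame.

    Thresholds here are placeholders; a tuned policy would pick them so
    that only a minority of frames reach the expensive Stage III VLM.
    """
    if scores.recon_error < tau_ae:
        return "normal", "I"      # cheap gate: obvious normality exits early
    if scores.object_conf < tau_obj:
        return "normal", "II"     # no salient object cues -> exit at Stage II
    if scores.vlm_cosine >= tau_c:
        return "anomaly", "III"   # confident semantic match against a prototype
    return "abstain", "III"       # low confidence -> abstain rather than guess
```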
The Combined Anomaly Detection Dashboard is an optional operator-facing tool (typically shipped alongside this bundle in a larger monorepo) for browsing large frame corpora. It surfaces Person (YOLO) and Anomaly (Autoencoder) events with reconstruction error, duration in frames, and source paths—suitable for audits and CSV export. It does not replace the paper’s csv/runs/ experiment outputs; see GUIDELINES.md.
CVCIMP26 is a research-grade evaluation bundle for Multi-Agent Coordination for Latency-Constrained VLMs: Anomaly Detection in Large-Scale Surveillance. It packages runnable Python tooling to reproduce and extend the quantitative experiments associated with the paper: frame-level detection metrics, cascade ablations, latency accounting, cross-dataset transfer characterization, and multi-stream scaling stress tests.
Large-scale surveillance produces high-volume, continuous video. Running a heavy vision–language model (VLM) on every frame is prohibitively slow and expensive. At the same time, weakly supervised anomaly detection on benchmarks such as UCF-Crime requires calibrated, comparable reporting (e.g. AUROC, AP, F1) under realistic operational constraints. This project addresses the gap by combining progressive inference (cheap gates before expensive reasoning), multi-agent coordination (who runs what, when, under a budget), and documented evaluation scripts that emit versioned, provenance-rich CSV outputs.
- Provide a transparent pipeline from benchmark manifests and local frame data to paper-aligned metric CSVs under csv/runs/.
- Encode the three-stage cascade (reconstruction → object cues → selective VLM semantics) in a way that separates fully runnable components (e.g. autoencoder scoring, YOLO paths where installed) from paper-described Stage III behavior (see honesty notes in MODELS_AND_STACK.md).
- Support reproducibility: pinned dependency guidance, experiment launchers (exp_E1–exp_E6), and timestamps on every generated result file.
- Document datasets, licenses, and non-redistribution expectations clearly so users can comply with benchmark terms.
| Theme | How it appears in this repository |
|---|---|
| Cascaded early exit | Ablation and latency scripts quantify stage contributions and timing; Stage I/II paths are exercised in evaluation and dashboard-adjacent workflows described in GUIDELINES.md. |
| Latency-constrained VLM use | Latency benchmarks and documentation reflect selective Stage III invocation rather than dense per-frame VLM scoring. |
| Multi-agent coordination | Scaling and orchestration narratives are supported by multi_agent_benchmark.py / E5 and the paper’s agent model; full broker infrastructure may be external. |
| Semantic stabilization | Described in the manuscript (CLIP-style embeddings, prototypes, abstention); optional Hugging Face stack in requirements.txt. |
| Rigorous metrics | code/metrics.py centralizes AUROC/AP/F1 and related reporting for frame-level studies. |
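The frame-level metrics that code/metrics.py is described as centralizing can be sketched with scikit-learn; this is a minimal stand-in, not the bundle's actual implementation, and the default threshold is an assumption:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

def frame_level_metrics(y_true, scores, threshold=0.5):
    """Frame-level AUROC / AP / F1 from anomaly scores in [0, 1].

    AUROC and AP are threshold-free ranking metrics; F1 requires
    binarizing the scores, here at an illustrative fixed threshold.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    return {
        "auroc": roc_auc_score(y_true, scores),
        "ap": average_precision_score(y_true, scores),
        "f1": f1_score(y_true, (scores >= threshold).astype(int)),
    }
```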
- E1 — Frame-level results on UCF-oriented frame subsets (see script and data layout).
- E2 — Cascade ablation (which stages matter).
- E3 — Latency profiling across stages (with documented caveats for simulated or partial Stage III).
- E4 — Cross-dataset behavior (ShanghaiTech, XD-Violence loaders and templates).
- E5 — Multi-stream / scaling benchmark.
- E6 — Autoencoder reconstruction diagnostics.
Exact mapping to LaTeX labels and sections: experiments/README.md. Supporting detail: MODELS_AND_STACK.md, GUIDELINES.md, BENCHMARKS.md.
- Researchers reproducing or comparing against the paper’s experimental protocol.
- Graduate students learning cascaded VAD + weak-supervision evaluation.
- Engineers prototyping surveillance analytics under compute budgets, extending manifests and loaders without rewriting metric code.
- Hosting full benchmark videos or proprietary weights in git.
- Guaranteeing identical numbers across all hardware without your own pinned environment and data snapshot.
- Replacing official dataset licenses or redistribution policies — users must obtain data legally.
Use the following in the repository About field on GitHub if you like:
- Description: Research code: multi-agent coordination and cascaded gates for latency-aware VLM anomaly detection on UCF-Crime, ShanghaiTech, and XD-Violence — evaluation bundle with provenance CSV outputs.
- Topics:
anomaly-detection, video-surveillance, weak-supervision, vision-language-model, pytorch, yolov8, ucf-crime, shanghai-tech, xd-violence, multi-agent, latency-optimization, computer-vision
Surveillance pipelines must balance accuracy, latency, and compute when scenes produce a continuous stream of frames. This project studies a three-stage cascaded detector:
- Stage I — Reconstruction gate: A lightweight convolutional autoencoder flags frames whose reconstruction error exceeds a threshold, suppressing obvious normality and reducing downstream load.
- Stage II — Object-level screening: A compact detector (YOLOv8-family) provides object-level cues for candidate anomalies.
- Stage III — Vision–language reasoning: A VLM (paper: LLaVA-Next–class backbone) performs selective semantic reasoning; CLIP-style text embeddings and prototype matching stabilize category decisions and support abstention when confidence is low.
Multi-agent coordination (event-driven and cyclical agents over a publish–subscribe style design, with deployment narrative including Redis and containers) governs when each stage runs and how partial results are fused under latency constraints.
The code here focuses on offline evaluation: frame-level AUROC / AP / F1, cascade ablations, per-stage latency, cross-dataset behavior, and multi-stream scaling, aligned with the paper’s experimental section.
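The Stage I gate above reduces to thresholding a per-frame reconstruction error. A minimal sketch, assuming frames and autoencoder reconstructions are already available as arrays (the real gate runs a convolutional autoencoder; the array layout and threshold are illustrative):

```python
import numpy as np

def reconstruction_gate(frames, recons, tau):
    """Stage I gate: per-frame mean squared reconstruction error vs tau.

    frames, recons: float arrays of shape (N, H, W, C) in [0, 1].
    Returns a boolean mask — True means 'ambiguous, route downstream';
    False means 'obvious normality, exit early'.
    """
    err = ((frames - recons) ** 2).mean(axis=(1, 2, 3))
    return err > tau
```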
This tree is the CVCIMP26 bundle: manifests, experiment launchers, mirrored evaluation scripts, and documented outputs.
| Path | Role |
|---|---|
| code/ | Evaluation scripts (generate_results.py, ablation_study.py, …) |
| code/results_io.py | Writes provenance-aware rows to csv/runs/ |
| code/metrics.py | AUROC, AP, F1, ROC helpers |
| experiments/ | One-command runners exp_E1 … exp_E6 (paper-aligned) |
| csv/manifests/ | Benchmark index templates (UCF-Crime, ShanghaiTech, XD-Violence) |
| csv/runs/ | Generated metrics (timestamps, hostname); often gitignored locally |
| csv/reference/ | Literature-only baseline table (not produced by our runs) |
| requirements.txt | Pinned-style dependency list with optional VLM block |
| BENCHMARKS.md, MODELS_AND_STACK.md, GUIDELINES.md | Datasets, model stack, operator FAQ |
To respect licenses, size, and reproducibility norms:
- Full UCF-Crime, ShanghaiTech Campus, or XD-Violence video archives (download from official sources; see BENCHMARKS.md).
- Pretrained checkpoints by default (*.pth / large *.pt); obtain or train your own and place them where scripts expect (typically next to Test/ for the bundled AE demos).
- A full live broker stack for every script: the paper describes deployment, while this repository emphasizes evaluation and optional dashboard tooling in a larger monorepo if present.
```
python -m venv .venv
.venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```
Use a CUDA build of PyTorch from pytorch.org when available.
Uncomment the Hugging Face–related lines at the bottom of requirements.txt when you extend Stage III inference.
- Download datasets per BENCHMARKS.md.
- Fill or generate manifests under csv/manifests/ (see experiments/load_manifest_example.py).
- For the default AE frame demo in generate_results.py, place extracted frames under Test/ (the repository working directory for experiments; see below) and autoencoder_model.pth where the script can load it.
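A manifest can be read with a few lines of standard-library Python. This is a generic sketch, not the bundled loader: the column names frame_path and label are assumptions, so match them to the actual templates under csv/manifests/:

```python
import csv
from pathlib import Path

def load_manifest(path):
    """Read a frame manifest CSV into (frame_path, label) pairs.

    Assumes columns named 'frame_path' and 'label' (illustrative;
    check the real templates and load_manifest_example.py).
    """
    rows = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rows.append((Path(row["frame_path"]), int(row["label"])))
    return rows
```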
Experiment launchers set the correct working directory and invoke code/. From the directory that contains this bundle:
Standalone clone (this repository only):
```
python experiments/exp_E1_frame_level_ucf.py
```
Inside a monorepo where this folder is named CVCIMP26/ under a parent project root (e.g. full CVC26 tree):
```
python CVCIMP26/experiments/exp_E1_frame_level_ucf.py
```
| ID | Launcher | Output CSV (csv/runs/) |
|---|---|---|
| E1 | experiments/exp_E1_frame_level_ucf.py | frame_results_ucf.csv |
| E2 | experiments/exp_E2_ablation_cascade.py | ablation_ucf.csv |
| E3 | experiments/exp_E3_latency_profile.py | latency_stages.csv |
| E4 | experiments/exp_E4_cross_dataset.py | cross_dataset.csv |
| E5 | experiments/exp_E5_multiagent_scaling.py | scaling_streams.csv |
| E6 | experiments/exp_E6_ae_training_metrics.py | ae_reconstruction_metrics.csv |
LaTeX label crosswalk: experiments/README.md.
Each run writes CSV rows with generated_utc (UTC ISO) and hostname, so tables can be traced to a specific machine and time. For paper tables, copy values from csv/runs/ into your manuscript; see csv/paper_tables/README.md.
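The provenance convention above (generated_utc plus hostname on every row) can be sketched with the standard library; this mirrors what results_io.py is described as doing, but the exact column schema here is an illustrative assumption:

```python
import csv
import socket
from datetime import datetime, timezone
from pathlib import Path

def append_run_row(csv_path, metrics):
    """Append one metric row stamped with generated_utc and hostname.

    Writes a header only when the file is new, so repeated runs
    accumulate traceable rows in the same CSV.
    """
    path = Path(csv_path)
    row = {
        "generated_utc": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        **metrics,
    }
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if new_file:
            writer.writeheader()
        writer.writerow(row)
```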
Dashboard exports (if you use a sibling anomaly_dashboard.py in a larger project) are not the same artifact as csv/runs/ — see the FAQ in GUIDELINES.md.
| Document | Contents |
|---|---|
| GUIDELINES.md | Architecture → data → CSV; dashboard vs experiments |
| MODELS_AND_STACK.md | Stages I–III; what is fully run vs simulated in places |
| BENCHMARKS.md | Dataset statistics and official links |
| experiments/README.md | E1–E6 ↔ paper / LaTeX labels |
When the camera-ready paper is available, cite it as your primary reference. For benchmark attribution, cite the original datasets (e.g. UCF-Crime: Sultani et al., CVPR 2018). Example placeholder:
```
@misc{multiagent_vlm_surveillance_2026,
  title        = {Multi-Agent Coordination for Latency-Constrained VLMs: Anomaly Detection in Large-Scale Surveillance},
  howpublished = {GitHub repository},
  year         = {2026},
  note         = {Replace with venue proceedings entry when published}
}
```

This bundle includes a LICENSE file (MIT); change it if your institution requires a different license for the public remote.
Author: Tayyab Rehman — please use GitHub Issues on the public repository for reproducibility questions.
Thanks to the creators of UCF-Crime, ShanghaiTech Campus, and XD-Violence, and to the PyTorch, Ultralytics, and Hugging Face communities.


