Rethinking Metrics and Benchmarks of Video Anomaly Detection

State Key Laboratory of Media Convergence and Communication, Communication University of China

This repository contains the code and datasets for the arXiv paper: https://arxiv.org/abs/2505.19022.

Video Anomaly Detection (VAD), which aims to detect anomalies that deviate from expectation, has attracted increasing attention in recent years. Existing advancements in VAD primarily focus on model architectures and training strategies, while devoting insufficient attention to evaluation metrics and benchmarks. In this paper, we rethink VAD evaluation methods through comprehensive analyses, revealing three critical limitations in current practices: 1) existing metrics are significantly influenced by single annotation bias; 2) current metrics fail to reward early detection of anomalies; 3) available benchmarks lack the capability to evaluate scene overfitting of fully/weakly-supervised algorithms. To address these limitations, we propose three novel evaluation methods: first, we establish probabilistic AUC/AP (Prob-AUC/AP) metrics utilizing multi-round annotations to mitigate single annotation bias; second, we develop a Latency-aware Average Precision (LaAP) metric that rewards early and accurate anomaly detection; and finally, we introduce two hard normal benchmarks (UCF-HN, MSAD-HN) with videos specifically designed to evaluate scene overfitting. We report performance comparisons of ten state-of-the-art VAD approaches using our proposed evaluation methods, providing novel perspectives for future VAD model development.

Project Structure

RethinkingVAD/
├── laap.py                   # Implementation of Latency-aware Average Precision (LaAP)
├── prob_auc_ap.py            # Implementation of Probabilistic AUC/AP metrics
├── data/                     # Annotations
│   ├── msad/                 # MSAD dataset annotations
│   ├── ucf/                  # UCF dataset annotations
│   ├── xd/                   # XD-Violence dataset annotations
│   └── prediction_results/   # Model prediction results for evaluation
└── experiments/              # Experimental notebooks and scripts

Requirements

If you only want to use the LaAP and Prob-AUC/AP metrics, install the following dependencies:

numpy
scikit-learn

The following dependencies are required to run the experimental notebooks:

matplotlib
seaborn
pandas
tabulate
pingouin
statsmodels

Quick Start

1. Probabilistic AUC/AP (Prob-AUC/AP)

Prob-AUC/AP are metrics that extend traditional AUC/AP to handle probabilistic (soft) labels, mitigating single annotation bias by utilizing multi-round annotations.

Usage

from prob_auc_ap import prob_average_precision_score, prob_roc_auc_score

# y_true: probabilistic labels (values in [0, 1])
# y_score: model predictions
prob_auc = prob_roc_auc_score(y_true, y_score)
prob_ap = prob_average_precision_score(y_true, y_score)
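The soft labels passed as y_true have to be built from the multi-round annotations first. The sketch below shows one plausible way to do that, assuming each round provides a binary per-frame label and that the soft label is the fraction of rounds marking the frame as anomalous. The array `annotations` and the averaging rule are illustrative assumptions, not necessarily the aggregation used in the paper.

```python
import numpy as np

# Hypothetical example: three annotation rounds for a 6-frame clip.
# Each row is one round; 1 marks an anomalous frame, 0 a normal one.
annotations = np.array([
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0],
])

# Assumed aggregation: per-frame mean over rounds, giving values in [0, 1].
soft_labels = annotations.mean(axis=0)

# Frames where the rounds disagree are exactly those with a label
# strictly between 0 and 1 (here: the event boundaries).
disputed_frames = np.flatnonzero((soft_labels > 0) & (soft_labels < 1))

print(soft_labels)
print(disputed_frames)
```

The resulting `soft_labels` array has one value per frame and can be passed as y_true to the functions above.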

2. Latency-aware Average Precision (LaAP)

LaAP is a metric that rewards early and accurate anomaly detection by considering both detection accuracy and detection latency.

Usage

from laap import get_la_score

la_score, lar_values, precision_values = get_la_score(gts=gts, preds=preds)

Parameters

gts: List of ground truth binary arrays (1 for anomaly, 0 for normal)
preds: List of prediction arrays (anomaly scores)
interval: Interval for scoring points (default: 16), corresponding to $\phi$ in the paper.
sigmoid_k: Parameter k for the sigmoid function (default: 7), corresponding to $\beta$ in the paper.
weight_base: Base for the exponential weight decay (default: 2), corresponding to $1/\alpha$ in the paper.
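To build intuition for what "detection latency" means at the frame level, the following sketch measures, for each anomaly event in a binary ground truth array, how many frames pass before the score first reaches a threshold. This is only an illustration and not the LaAP computation in laap.py; the helper names `anomaly_events` and `detection_latencies` and the fixed threshold of 0.5 are assumptions made for the example.

```python
import numpy as np

def anomaly_events(gt):
    """Return (start, end) pairs, end exclusive, for each run of 1s in gt."""
    padded = np.concatenate(([0], np.asarray(gt, dtype=int), [0]))
    diff = np.diff(padded)
    starts = np.flatnonzero(diff == 1)   # 0 -> 1 transitions
    ends = np.flatnonzero(diff == -1)    # 1 -> 0 transitions
    return list(zip(starts.tolist(), ends.tolist()))

def detection_latencies(gt, pred, threshold=0.5):
    """Frames from each event's start to the first score >= threshold.

    An event that is never detected within its extent yields None.
    """
    pred = np.asarray(pred, dtype=float)
    latencies = []
    for start, end in anomaly_events(gt):
        hits = np.flatnonzero(pred[start:end] >= threshold)
        latencies.append(int(hits[0]) if hits.size else None)
    return latencies

gt = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0])
pred = np.array([0.1, 0.2, 0.7, 0.9, 0.1, 0.1, 0.3, 0.4, 0.2])

events = anomaly_events(gt)
latencies = detection_latencies(gt, pred)
print(events, latencies)
```

In this toy input the first event is detected one frame after it starts, and the second event is missed at the chosen threshold. A latency-aware metric would penalize both cases relative to an immediate detection.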

3. Complete Evaluation Example

For a complete evaluation example using all three metrics, refer to experiments/LaAP Prob-AUC Prob-AP Evaluation.py.

# Example of evaluating a model with all three metrics
import numpy as np
from laap import get_la_score
from prob_auc_ap import prob_average_precision_score, prob_roc_auc_score
from sklearn.metrics import average_precision_score, roc_auc_score

# Prepare data
gts = [np.array([0, 1, 1, 0, 0])]  # Ground truth
preds = [np.array([0.1, 0.9, 0.8, 0.2, 0.1])]  # Predictions

# Original AUC/AP (binary labels)
orig_auc = roc_auc_score(gts[0], preds[0])
orig_ap = average_precision_score(gts[0], preds[0])

# Probabilistic AUC/AP (soft labels)
soft_gts = [np.array([0.0, 0.8, 0.9, 0.1, 0.0])]  # Soft labels from multiple annotators
prob_auc = prob_roc_auc_score(soft_gts[0], preds[0])
prob_ap = prob_average_precision_score(soft_gts[0], preds[0])

# LaAP
la_score, _, _ = get_la_score(gts, preds)

print(f"Original AUC: {orig_auc:.4f}, Original AP: {orig_ap:.4f}")
print(f"Probabilistic AUC: {prob_auc:.4f}, Probabilistic AP: {prob_ap:.4f}")
print(f"LaAP Score: {la_score:.4f}")

Hard Normal Benchmarks (UCF-HN, MSAD-HN)

We release two hard normal benchmarks, UCF-HN and MSAD-HN, which are specifically designed to evaluate scene overfitting of fully/weakly-supervised VAD models.

Usage: Download the videos, make predictions using your trained models, and expect the predictions to be zero for all frames.
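Because every frame in these videos is normal, the labels contain a single class and sklearn's roc_auc_score cannot be computed on them. One possible way to summarize the predictions is a frame-level false alarm rate, sketched below. The function name `false_alarm_rate`, the threshold of 0.5, and the sample scores are assumptions for illustration and may differ from the protocol used in the paper.

```python
import numpy as np

def false_alarm_rate(preds, threshold=0.5):
    """Fraction of frames, pooled over all videos, scored at or above threshold.

    preds is a list with one array of anomaly scores per video. All frames
    are assumed to be normal, so every such frame counts as a false alarm.
    """
    scores = np.concatenate([np.asarray(p, dtype=float) for p in preds])
    return float(np.mean(scores >= threshold))

# Hypothetical predictions for two hard normal videos of four frames each.
hn_preds = [
    np.array([0.05, 0.10, 0.70, 0.20]),
    np.array([0.02, 0.60, 0.03, 0.01]),
]

far = false_alarm_rate(hn_preds)
print(f"False alarm rate: {far:.4f}")
```

A lower value indicates fewer false alarms on hard normal scenes; a value of zero matches the expectation stated above.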

The videos are hosted on ModelScope and HuggingFace (coming soon).

License

Our re-annotations of the UCF-Crime, MSAD, and XD-Violence datasets are released under the CC BY-NC 4.0 license.

Our hard normal benchmarks, UCF-HN and MSAD-HN, are released under the CC BY-NC 4.0 license.

The code in this repository is licensed under the MIT License.

Citation

@article{liu2025rethinking,
  title={Rethinking Metrics and Benchmarks of Video Anomaly Detection},
  author={Liu, Zihao and Wu, Xiaoyu and Li, Wenna and Yang, Linlin},
  journal={arXiv preprint arXiv:2505.19022},
  year={2025}
}
