State Key Laboratory of Media Convergence and Communication, Communication University of China
This repository contains the code and datasets for the arXiv paper: https://arxiv.org/abs/2505.19022.
Video Anomaly Detection (VAD), which aims to detect anomalies that deviate from expectation, has attracted increasing attention in recent years. Existing advancements in VAD primarily focus on model architectures and training strategies, while devoting insufficient attention to evaluation metrics and benchmarks. In this paper, we rethink VAD evaluation methods through comprehensive analyses, revealing three critical limitations in current practices: 1) existing metrics are significantly influenced by single annotation bias; 2) current metrics fail to reward early detection of anomalies; 3) available benchmarks lack the capability to evaluate scene overfitting of fully/weakly-supervised algorithms. To address these limitations, we propose three novel evaluation methods: first, we establish probabilistic AUC/AP (Prob-AUC/AP) metrics utilizing multi-round annotations to mitigate single annotation bias; second, we develop a Latency-aware Average Precision (LaAP) metric that rewards early and accurate anomaly detection; and finally, we introduce two hard normal benchmarks (UCF-HN, MSAD-HN) with videos specifically designed to evaluate scene overfitting. We report performance comparisons of ten state-of-the-art VAD approaches using our proposed evaluation methods, providing novel perspectives for future VAD model development.
RethinkingVAD/
├── laap.py # Implementation of Latency-aware Average Precision (LaAP)
├── prob_auc_ap.py # Implementation of Probabilistic AUC/AP metrics
├── data/ # Annotations
│ ├── msad/ # MSAD dataset annotations
│ ├── ucf/ # UCF dataset annotations
│ ├── xd/ # XD-Violence dataset annotations
│ └── prediction_results/ # Model prediction results for evaluation
└── experiments/ # Experimental notebooks and scripts
To use the LaAP and Prob-AUC/AP metrics, you only need to install the following dependencies:
- numpy
- scikit-learn (imported as sklearn)
The following dependencies are required to run the experimental notebooks:
- matplotlib
- seaborn
- pandas
- tabulate
- pingouin
- statsmodels
Prob-AUC/AP are metrics that extend traditional AUC/AP to handle probabilistic (soft) labels, mitigating single annotation bias by utilizing multi-round annotations.
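As a minimal sketch of where such soft labels could come from, the snippet below averages several rounds of binary frame-level annotations into per-frame probabilities. The averaging rule, the helper name `soft_labels_from_rounds`, and the toy annotations are assumptions for illustration; the aggregation actually used in the paper may differ.

```python
import numpy as np

def soft_labels_from_rounds(rounds):
    """Average binary annotation rounds into per-frame soft labels.

    Hypothetical helper (not part of this repository): `rounds` is a sequence
    of equal-length binary arrays, one per annotation round.
    """
    stacked = np.asarray(rounds, dtype=float)  # shape: (n_rounds, n_frames)
    if stacked.ndim != 2:
        raise ValueError("expected equal-length annotation rounds")
    # Fraction of rounds that marked each frame as anomalous.
    return stacked.mean(axis=0)

# Toy example: three annotation rounds over a five-frame video.
rounds = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
y_true = soft_labels_from_rounds(rounds)  # values in [0, 1]
```

Frames on which the rounds disagree receive intermediate values, so an array like `y_true` can be passed to `prob_roc_auc_score` and `prob_average_precision_score` in place of binary labels.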
from prob_auc_ap import prob_average_precision_score, prob_roc_auc_score
# y_true: probabilistic labels (values in [0, 1])
# y_score: model predictions
prob_auc = prob_roc_auc_score(y_true, y_score)
prob_ap = prob_average_precision_score(y_true, y_score)
LaAP is a metric that rewards early and accurate anomaly detection by considering both detection accuracy and detection latency.
from laap import get_la_score
la_score, lar_values, precision_values = get_la_score(gts=gts, preds=preds)
- gts: List of ground truth binary arrays (1 for anomaly, 0 for normal)
- preds: List of prediction arrays (anomaly scores)
- interval: Interval for scoring points (default: 16), corresponding to $\phi$ in the paper.
- sigmoid_k: Parameter k for the sigmoid function (default: 7), corresponding to $\beta$ in the paper.
- weight_base: Base for the exponential weight decay (default: 2), corresponding to $1/\alpha$ in the paper.
For a complete evaluation example using all three metrics, refer to experiments/LaAP Prob-AUC Prob-AP Evaluation.py.
# Example of evaluating a model with all three metrics
import numpy as np
from laap import get_la_score
from prob_auc_ap import prob_average_precision_score, prob_roc_auc_score
from sklearn.metrics import average_precision_score, roc_auc_score
# Prepare data
gts = [np.array([0, 1, 1, 0, 0])] # Ground truth
preds = [np.array([0.1, 0.9, 0.8, 0.2, 0.1])] # Predictions
# Original AUC/AP (binary labels)
orig_auc = roc_auc_score(gts[0], preds[0])
orig_ap = average_precision_score(gts[0], preds[0])
# Probabilistic AUC/AP (soft labels)
soft_gts = [np.array([0.0, 0.8, 0.9, 0.1, 0.0])] # Soft labels from multiple annotators
prob_auc = prob_roc_auc_score(soft_gts[0], preds[0])
prob_ap = prob_average_precision_score(soft_gts[0], preds[0])
# LaAP
la_score, _, _ = get_la_score(gts, preds)
print(f"Original AUC: {orig_auc:.4f}, Original AP: {orig_ap:.4f}")
print(f"Probabilistic AUC: {prob_auc:.4f}, Probabilistic AP: {prob_ap:.4f}")
print(f"LaAP Score: {la_score:.4f}")
We release two hard normal benchmarks, UCF-HN and MSAD-HN, which are specifically designed to evaluate scene overfitting of fully/weakly-supervised VAD models.
Usage: Download the videos and run your trained model on them. Because every frame in these videos is normal, the predicted anomaly score should be zero for all frames.
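One way to summarize results on these all-normal videos is a frame-level false alarm rate: the fraction of frames whose score exceeds a decision threshold. The sketch below assumes per-video score arrays, a threshold of 0.5, and a hypothetical helper named `false_alarm_rate`; none of these are prescribed by this repository.

```python
import numpy as np

def false_alarm_rate(preds, threshold=0.5):
    """Fraction of frames scored above `threshold` across all-normal videos.

    Hypothetical helper: `preds` is a list of 1-D score arrays, one per video.
    The 0.5 threshold is an illustrative assumption.
    """
    scores = np.concatenate([np.asarray(p, dtype=float) for p in preds])
    return float(np.mean(scores > threshold))

# Toy predictions for two hard normal videos (every frame is normal).
preds = [
    np.array([0.05, 0.10, 0.70, 0.20]),
    np.array([0.00, 0.60, 0.10, 0.05]),
]
far = false_alarm_rate(preds)
print(f"False alarm rate: {far:.2f}")  # 2 of 8 frames exceed 0.5
```

A rate near zero suggests the model is not simply reacting to the scene, while a high rate points to scene overfitting.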
The videos are hosted on ModelScope and HuggingFace (coming soon).
Our re-annotations of the UCF-Crime, MSAD, and XD-Violence datasets are released under the CC BY-NC 4.0 license.
Our hard normal benchmarks, UCF-HN and MSAD-HN, are released under the CC BY-NC 4.0 license.
The code in this repo is licensed under the MIT License.
@article{liu2025rethinking,
title={Rethinking Metrics and Benchmarks of Video Anomaly Detection},
author={Liu, Zihao and Wu, Xiaoyu and Li, Wenna and Yang, Linlin},
journal={arXiv preprint arXiv:2505.19022},
year={2025}
}
