This repository contains the codebase, baseline models, and evaluation tools for the Blended Emotion Recognition Challenge (BlEmoRe).
BlEmoRe introduces a novel, publicly available dataset for multimodal (video + audio) recognition of blended emotions, featuring fine-grained salience annotations and standardized evaluation protocols.
The dataset is available on Zenodo.
- A multimodal dataset of 3,050 video clips from 58 actors (video + audio)
- Includes both single and blended emotions with controlled salience (e.g., 70/30 blend)
- Evaluation metrics for emotion presence and salience prediction
- Baselines across visual, audio, and fused modalities
- Public train/test split and 5-fold validation partitions
To participate in the challenge, download the test dataset from Zenodo and submit your predictions to CodaBench using the submission template.
Baselines are being updated as additional pre-trained encoders are incorporated.
We provide baselines trained on features from the following pre-trained encoders:
- Frame-level: OpenFace 2.0, CLIP, ImageBind
- Spatiotemporal: VideoMAEv2, Video Swin Transformer
- Audio: HuBERT (LL-60k), WavLM (Large)
- Fusion: early fusion via concatenation (e.g., ImageBind + WavLM, VideoMAEv2 + HuBERT)
- Linear, MLP-256, and MLP-512 architectures (see the sketch below) trained on:
  - aggregated features (mean, std, percentiles), or
  - short video clip subsamples (e.g., 16 frames for VideoMAEv2)
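The overall pipeline behind these baselines can be sketched in a few lines. The snippet below is a minimal, unofficial illustration of feature aggregation, early fusion by concatenation, and an MLP-256 head; the array shapes, the random placeholder features, and the use of scikit-learn are assumptions for illustration, not the repository's actual training code (see `main.py` for that).

```python
# Minimal sketch of an aggregation + early-fusion + MLP-256 baseline.
# Feature arrays, shapes, and labels are placeholder assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

def aggregate(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, dim) time series into fixed-size stats."""
    return np.concatenate([
        frames.mean(axis=0),
        frames.std(axis=0),
        np.percentile(frames, [25, 50, 75], axis=0).ravel(),
    ])

rng = np.random.default_rng(0)
# Random stand-ins for per-clip ImageBind (video) and WavLM (audio)
# features; replace with the actual extracted features.
video_feats = [rng.normal(size=(120, 1024)) for _ in range(8)]
audio_feats = [rng.normal(size=(300, 1024)) for _ in range(8)]
labels = rng.integers(0, 2, size=8)  # toy presence labels

# Early fusion: concatenate the aggregated modality vectors per clip.
X = np.stack([
    np.concatenate([aggregate(v), aggregate(a)])
    for v, a in zip(video_feats, audio_feats)
])

clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
clf.fit(X, labels)
print(clf.predict(X[:2]))
```

An MLP-512 variant would use `hidden_layer_sizes=(512,)`, and the linear baseline corresponds to dropping the hidden layer (e.g., logistic regression over the same fused features).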
Figure 2: PCA projection of VideoMAEv2 features (Happy vs. Sad).
Figure 3: PCA projection of WavLM features (Happy vs. Sad).
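Projections like these are straightforward to reproduce with scikit-learn. The sketch below uses random placeholder arrays where the extracted encoder features and class labels would be loaded; the feature dimensionality and class split are assumptions.

```python
# Minimal sketch of a 2-D PCA projection of encoder features for two
# emotion classes; the feature arrays here are random placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
happy = rng.normal(0.0, 1.0, size=(100, 768))  # stand-in for 'hap' clips
sad = rng.normal(0.5, 1.0, size=(100, 768))    # stand-in for 'sad' clips

X = np.vstack([happy, sad])
proj = PCA(n_components=2).fit_transform(X)

plt.scatter(proj[:100, 0], proj[:100, 1], label="happy", alpha=0.6)
plt.scatter(proj[100:, 0], proj[100:, 1], label="sad", alpha=0.6)
plt.legend()
plt.title("PCA projection of encoder features")
plt.show()
```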
5-fold validation results (mean ± std):

| Encoder(s) | Presence Accuracy | Salience Accuracy |
|---|---|---|
| ImageBind | 0.290 ± 0.028 | 0.130 ± 0.008 |
| WavLM | 0.265 ± 0.027 | 0.121 ± 0.012 |
| VideoMAEv2 (subsampled) | 0.260 ± 0.030 | 0.124 ± 0.027 |
| ImageBind + WavLM | 0.345 ± 0.035 | 0.170 ± 0.055 |
Trivial baselines:
The single emotion baseline always predicts the most frequent single emotion;
the blend baseline always predicts the most frequent emotion pair with a fixed salience ratio.
They serve as reference points and yield the following accuracies on the full validation set:
Single emotion: Presence = 0.078, Salience = 0.000
Blend: Presence = 0.057, Salience = 0.035
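For concreteness, both trivial baselines can be written down directly in the prediction-dictionary format described further below; the specific emotions, the 50/50 ratio, and the filenames here are invented for illustration.

```python
# Sketch of the trivial baselines in the prediction-dictionary format
# used by the accuracy functions (the "most frequent" choices and the
# fixed 50/50 ratio below are assumptions for illustration).
most_frequent_single = "hap"
most_frequent_pair = ("ang", "hap")

test_files = ["A411_mix_ang_hap_30_70_ver1", "A102_ang_int1_ver1"]

single_baseline = {
    f: [{"emotion": most_frequent_single, "salience": 100.0}]
    for f in test_files
}
blend_baseline = {
    f: [
        {"emotion": most_frequent_pair[0], "salience": 50.0},
        {"emotion": most_frequent_pair[1], "salience": 50.0},
    ]
    for f in test_files
}
```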
Test set results:

| Encoder(s) | Presence Accuracy | Salience Accuracy |
|---|---|---|
| WavLM | 0.311 | 0.084 |
| VideoMAEv2 | 0.293 | 0.054 |
| VideoMAEv2 + HuBERT | 0.332 | 0.114 |
| ImageBind + WavLM | 0.327 | 0.114 |
Trivial baselines (same strategy as above) yield the following accuracies on the test set:
Single emotion: Presence = 0.074, Salience = 0.000
Blend: Presence = 0.059, Salience = 0.036
To reproduce the baselines:

1. Download the dataset from Zenodo and extract it.
2. Extract features using the provided scripts in `feature_extraction`.
3. Aggregate features:

   ```
   python feature_extraction/video_encoding/timeseries2aggregate.py
   ```

4. Train and evaluate the models. For aggregation-based features:

   ```
   python main.py
   ```

   For subsampled features:

   ```
   python main_subsampling.py
   ```

Note: Make sure to update the dataset and feature paths in the corresponding scripts before running.
A utility to parse filenames and extract their metadata is available in `utils/filename_parser.py`.
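For reference, the filename convention can be inferred from the examples in this README. The sketch below is a hypothetical re-implementation of such a parser (the exact fields and their order, including the mapping of salience values to emotions, are assumptions); `utils/filename_parser.py` remains the authoritative version.

```python
# Hypothetical filename parser, inferred from the example filenames
# 'A411_mix_ang_hap_30_70_ver1' and 'A102_ang_int1_ver1' in this README;
# utils/filename_parser.py is the authoritative implementation.
import re

def parse_filename(name: str) -> dict:
    # Blend: actor, two emotion codes, two salience values, version.
    mix = re.match(
        r"(?P<actor>A\d+)_mix_(?P<e1>\w{3})_(?P<e2>\w{3})"
        r"_(?P<s1>\d+)_(?P<s2>\d+)_ver(?P<ver>\d+)", name)
    if mix:
        g = mix.groupdict()
        # Assumes salience values align positionally with the emotions.
        return {"actor": g["actor"], "version": int(g["ver"]),
                "emotions": {g["e1"]: float(g["s1"]),
                             g["e2"]: float(g["s2"])}}
    # Single emotion: actor, emotion code, intensity level, version.
    single = re.match(
        r"(?P<actor>A\d+)_(?P<emo>\w{3})_int(?P<int>\d+)_ver(?P<ver>\d+)",
        name)
    if single:
        g = single.groupdict()
        return {"actor": g["actor"], "version": int(g["ver"]),
                "emotions": {g["emo"]: 100.0},  # assumed full salience
                "intensity": int(g["int"])}
    raise ValueError(f"Unrecognized filename: {name}")

print(parse_filename("A411_mix_ang_hap_30_70_ver1"))
print(parse_filename("A102_ang_int1_ver1"))
```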
Generic functions to calculate the accuracy metrics are available in `utils/generic_accuracy/accuracy_funcs.py`.
These functions rely on predictions provided in the following dictionary format:

```python
{
    # The key is the filename of the video.
    'A411_mix_ang_hap_30_70_ver1':
        # The value is a list of dictionaries,
        # each containing a predicted emotion and its salience.
        [
            {'emotion': 'hap', 'salience': 70.0},
            {'emotion': 'ang', 'salience': 30.0}
        ],
    'A102_ang_int1_ver1':
        [
            {'emotion': 'neu', 'salience': 100.0}
        ],
    ...
}
```

We employ two main evaluation metrics: `ACC_presence` and `ACC_salience`.
- `ACC_presence` measures whether the correct label(s) are predicted without errors. A correct prediction must include all present emotions while avoiding false negatives (e.g., predicting only one emotion in a blend) and false positives (e.g., predicting emotions that are not part of the label).
- `ACC_salience` extends `ACC_presence` by considering the relative prominence of each emotion. It evaluates whether the predicted proportions reflect the correct ranking, i.e., whether the emotions are equally present or one is more dominant than the other. This metric applies only to blended emotions.
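These definitions boil down to a small amount of set and ranking logic. The sketch below is an unofficial illustration for a single clip under those assumptions (in particular, the exact tie-handling rules are guesses); `utils/generic_accuracy/accuracy_funcs.py` is the reference implementation.

```python
# Illustrative (unofficial) per-clip versions of the two metrics;
# utils/generic_accuracy/accuracy_funcs.py is the reference.
def presence_correct(pred, truth):
    """Exact match of the predicted and true emotion sets."""
    return {p["emotion"] for p in pred} == {t["emotion"] for t in truth}

def salience_correct(pred, truth):
    """Presence must be correct and the salience ranking must agree.

    Applies only to blends: emotions tied in the ground truth must be
    tied in the prediction, otherwise the dominance order must match.
    """
    if len(truth) < 2 or not presence_correct(pred, truth):
        return False
    tied_t = len({d["salience"] for d in truth}) == 1
    tied_p = len({d["salience"] for d in pred}) == 1
    if tied_t or tied_p:
        return tied_t and tied_p

    def order(items):
        return [d["emotion"] for d in
                sorted(items, key=lambda d: -d["salience"])]

    return order(pred) == order(truth)

truth = [{"emotion": "hap", "salience": 70.0},
         {"emotion": "ang", "salience": 30.0}]
pred = [{"emotion": "hap", "salience": 60.0},
        {"emotion": "ang", "salience": 40.0}]
print(presence_correct(pred, truth), salience_correct(pred, truth))  # True True
```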
To cite this work, use the arXiv entry below:
@misc{lachmann2026blendsequalblemoredataset,
title={Not all Blends are Equal: The BLEMORE Dataset of Blended Emotion Expressions with Relative Salience Annotations},
author={Tim Lachmann and Alexandra Israelsson and Christina Tornberg and Teimuraz Saghinadze and Michal Balazia and Philipp Müller and Petri Laukka},
year={2026},
eprint={2601.13225},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.13225},
}

