This repository contains the codebase, baseline models, and evaluation tools for the Blended Emotion Recognition Challenge (BlEmoRe).
BlEmoRe introduces a novel, publicly available dataset for multimodal (video + audio) recognition of blended emotions, featuring fine-grained salience annotations and standardized evaluation protocols.
The dataset is available on Zenodo.
- A multimodal dataset of 3,050 video clips from 58 actors (video + audio)
- Includes both single and blended emotions with controlled salience (e.g., 70/30 blend)
- Evaluation metrics for emotion presence and salience prediction
- Baselines across visual, audio, and fused modalities
- Public train/test split and 5-fold validation partitions
To participate in the challenge, download the test dataset from Zenodo and submit your predictions to CodaBench using the submission template.
Baselines are being updated as additional pre-trained encoders are incorporated.
We provide baselines trained on features from the following pre-trained encoders:
- Frame-level: OpenFace 2.0, CLIP, ImageBind
- Spatiotemporal: VideoMAEv2, Video Swin Transformer
- Audio: HuBERT (LL-60k), WavLM (Large)
- Fusion: early fusion via concatenation (e.g., ImageBind + WavLM, VideoMAEv2 + HuBERT)
- Linear, MLP-256, and MLP-512 architectures (see the sketch below) trained on:
  - aggregated features (mean, std, percentiles), or
  - short video clip subsamples (e.g., 16 frames for VideoMAEv2)
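The overall pipeline behind these baselines can be sketched in a few lines. The snippet below is a minimal, unofficial illustration of feature aggregation, early fusion by concatenation, and an MLP-256 head; the array shapes, the random placeholder features, and the use of scikit-learn are assumptions for illustration, not the repository's actual training code (see `main.py` for that).

```python
# Minimal sketch of an aggregation + early-fusion + MLP-256 baseline.
# Feature arrays, shapes, and labels are placeholder assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

def aggregate(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, dim) time series into fixed-size stats."""
    return np.concatenate([
        frames.mean(axis=0),
        frames.std(axis=0),
        np.percentile(frames, [25, 50, 75], axis=0).ravel(),
    ])

rng = np.random.default_rng(0)
# Random stand-ins for per-clip ImageBind (video) and WavLM (audio)
# features; replace with the actual extracted features.
video_feats = [rng.normal(size=(120, 1024)) for _ in range(8)]
audio_feats = [rng.normal(size=(300, 1024)) for _ in range(8)]
labels = rng.integers(0, 2, size=8)  # toy presence labels

# Early fusion: concatenate the aggregated modality vectors per clip.
X = np.stack([
    np.concatenate([aggregate(v), aggregate(a)])
    for v, a in zip(video_feats, audio_feats)
])

clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
clf.fit(X, labels)
print(clf.predict(X[:2]))
```

An MLP-512 variant would use `hidden_layer_sizes=(512,)`, and the linear baseline corresponds to dropping the hidden layer (e.g., logistic regression over the same fused features).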
Figure 2: PCA projection of VideoMAEv2 features (Happy vs. Sad).
Figure 3: PCA projection of WavLM features (Happy vs. Sad).
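Projections like these are straightforward to reproduce with scikit-learn. The sketch below uses random placeholder arrays where the extracted encoder features and class labels would be loaded; the feature dimensionality and class split are assumptions.

```python
# Minimal sketch of a 2-D PCA projection of encoder features for two
# emotion classes; the feature arrays here are random placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
happy = rng.normal(0.0, 1.0, size=(100, 768))  # stand-in for 'hap' clips
sad = rng.normal(0.5, 1.0, size=(100, 768))    # stand-in for 'sad' clips

X = np.vstack([happy, sad])
proj = PCA(n_components=2).fit_transform(X)

plt.scatter(proj[:100, 0], proj[:100, 1], label="happy", alpha=0.6)
plt.scatter(proj[100:, 0], proj[100:, 1], label="sad", alpha=0.6)
plt.legend()
plt.title("PCA projection of encoder features")
plt.show()
```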
5-fold validation results (mean ± std):

| Encoder(s) | Presence Accuracy | Salience Accuracy |
|---|---|---|
| ImageBind | 0.290 ± 0.028 | 0.130 ± 0.008 |
| WavLM | 0.265 ± 0.027 | 0.121 ± 0.012 |
| VideoMAEv2 (subsampled) | 0.260 ± 0.030 | 0.124 ± 0.027 |
| ImageBind + WavLM | 0.345 ± 0.035 | 0.170 ± 0.055 |
Trivial baselines:
The single emotion baseline always predicts the most frequent single emotion;
the blend baseline always predicts the most frequent emotion pair with a fixed salience ratio.
They serve as reference points and yield the following accuracies on the full validation set:
Single emotion: Presence = 0.078, Salience = 0.000
Blend: Presence = 0.057, Salience = 0.035
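For concreteness, both trivial baselines can be written down directly in the prediction-dictionary format described further below; the specific emotions, the 50/50 ratio, and the filenames here are invented for illustration.

```python
# Sketch of the trivial baselines in the prediction-dictionary format
# used by the accuracy functions (the "most frequent" choices and the
# fixed 50/50 ratio below are assumptions for illustration).
most_frequent_single = "hap"
most_frequent_pair = ("ang", "hap")

test_files = ["A411_mix_ang_hap_30_70_ver1", "A102_ang_int1_ver1"]

single_baseline = {
    f: [{"emotion": most_frequent_single, "salience": 100.0}]
    for f in test_files
}
blend_baseline = {
    f: [
        {"emotion": most_frequent_pair[0], "salience": 50.0},
        {"emotion": most_frequent_pair[1], "salience": 50.0},
    ]
    for f in test_files
}
```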
Test set results:

| Encoder(s) | Presence Accuracy | Salience Accuracy |
|---|---|---|
| WavLM | 0.311 | 0.084 |
| VideoMAEv2 | 0.293 | 0.054 |
| VideoMAEv2 + HuBERT | 0.332 | 0.114 |
| ImageBind + WavLM | 0.327 | 0.114 |
Trivial baselines (same strategy as above) yield the following accuracies on the test set:
Single emotion: Presence = 0.074, Salience = 0.000
Blend: Presence = 0.059, Salience = 0.036
To reproduce the baselines:

1. Download the dataset from Zenodo and extract it.
2. Extract features using the provided scripts in `feature_extraction`.
3. Aggregate features:

   ```
   python feature_extraction/video_encoding/timeseries2aggregate.py
   ```

4. Train and evaluate the models. For aggregation-based features:

   ```
   python main.py
   ```

   For subsampled features:

   ```
   python main_subsampling.py
   ```

Note: Make sure to update the dataset and feature paths in the corresponding scripts before running.
A utility to parse filenames and extract their metadata is available in `utils/filename_parser.py`.
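For reference, the filename convention can be inferred from the examples in this README. The sketch below is a hypothetical re-implementation of such a parser (the exact fields and their order, including the mapping of salience values to emotions, are assumptions); `utils/filename_parser.py` remains the authoritative version.

```python
# Hypothetical filename parser, inferred from the example filenames
# 'A411_mix_ang_hap_30_70_ver1' and 'A102_ang_int1_ver1' in this README;
# utils/filename_parser.py is the authoritative implementation.
import re

def parse_filename(name: str) -> dict:
    # Blend: actor, two emotion codes, two salience values, version.
    mix = re.match(
        r"(?P<actor>A\d+)_mix_(?P<e1>\w{3})_(?P<e2>\w{3})"
        r"_(?P<s1>\d+)_(?P<s2>\d+)_ver(?P<ver>\d+)", name)
    if mix:
        g = mix.groupdict()
        # Assumes salience values align positionally with the emotions.
        return {"actor": g["actor"], "version": int(g["ver"]),
                "emotions": {g["e1"]: float(g["s1"]),
                             g["e2"]: float(g["s2"])}}
    # Single emotion: actor, emotion code, intensity level, version.
    single = re.match(
        r"(?P<actor>A\d+)_(?P<emo>\w{3})_int(?P<int>\d+)_ver(?P<ver>\d+)",
        name)
    if single:
        g = single.groupdict()
        return {"actor": g["actor"], "version": int(g["ver"]),
                "emotions": {g["emo"]: 100.0},  # assumed full salience
                "intensity": int(g["int"])}
    raise ValueError(f"Unrecognized filename: {name}")

print(parse_filename("A411_mix_ang_hap_30_70_ver1"))
print(parse_filename("A102_ang_int1_ver1"))
```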
Generic functions to calculate the accuracy metrics are available in `utils/generic_accuracy/accuracy_funcs.py`.
These functions rely on predictions provided in the following dictionary format:

```python
{
    # The key is the filename of the video.
    'A411_mix_ang_hap_30_70_ver1':
        # The value is a list of dictionaries,
        # each containing a predicted emotion and its salience.
        [
            {'emotion': 'hap', 'salience': 70.0},
            {'emotion': 'ang', 'salience': 30.0}
        ],
    'A102_ang_int1_ver1':
        [
            {'emotion': 'neu', 'salience': 100.0}
        ],
    ...
}
```

We employ two main evaluation metrics: `ACC_presence` and `ACC_salience`.
- `ACC_presence` measures whether the correct label(s) are predicted without errors. A correct prediction must include all present emotions while avoiding false negatives (e.g., predicting only one emotion in a blend) and false positives (e.g., predicting emotions that are not part of the label).
- `ACC_salience` extends `ACC_presence` by considering the relative prominence of each emotion. It evaluates whether the predicted proportions reflect the correct ranking, i.e., whether the emotions are equally present or one is more dominant than the other. This metric applies only to blended emotions.
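These definitions boil down to a small amount of set and ranking logic. The sketch below is an unofficial illustration for a single clip under those assumptions (in particular, the exact tie-handling rules are guesses); `utils/generic_accuracy/accuracy_funcs.py` is the reference implementation.

```python
# Illustrative (unofficial) per-clip versions of the two metrics;
# utils/generic_accuracy/accuracy_funcs.py is the reference.
def presence_correct(pred, truth):
    """Exact match of the predicted and true emotion sets."""
    return {p["emotion"] for p in pred} == {t["emotion"] for t in truth}

def salience_correct(pred, truth):
    """Presence must be correct and the salience ranking must agree.

    Applies only to blends: emotions tied in the ground truth must be
    tied in the prediction, otherwise the dominance order must match.
    """
    if len(truth) < 2 or not presence_correct(pred, truth):
        return False
    tied_t = len({d["salience"] for d in truth}) == 1
    tied_p = len({d["salience"] for d in pred}) == 1
    if tied_t or tied_p:
        return tied_t and tied_p

    def order(items):
        return [d["emotion"] for d in
                sorted(items, key=lambda d: -d["salience"])]

    return order(pred) == order(truth)

truth = [{"emotion": "hap", "salience": 70.0},
         {"emotion": "ang", "salience": 30.0}]
pred = [{"emotion": "hap", "salience": 60.0},
        {"emotion": "ang", "salience": 40.0}]
print(presence_correct(pred, truth), salience_correct(pred, truth))  # True True
```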
To cite this work, use the arXiv entry below:
@misc{lachmann2026blendsequalblemoredataset,
title={Not all Blends are Equal: The BLEMORE Dataset of Blended Emotion Expressions with Relative Salience Annotations},
author={Tim Lachmann and Alexandra Israelsson and Christina Tornberg and Teimuraz Saghinadze and Michal Balazia and Philipp Müller and Petri Laukka},
year={2026},
eprint={2601.13225},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.13225},
}

