Peihao Xiang, Chaohao Lin, Kaida Wu, and Ou Bai
HCPS Laboratory, Department of Electrical and Computer Engineering, Florida International University
Official TensorFlow implementation and pre-trained VideoMAE models for MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition.
Note: The .ipynb notebook is only a minimal example. In addition, the VideoMAE encoder is expected to be pre-trained with the MAE-DFER method; that pre-trained model is not provided in this repository.
This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across the visual and audio modalities. Built on a pre-trained masked autoencoder model, MultiMAE-DER is obtained through simple, straightforward fine-tuning. Its performance is further improved by optimizing six fusion strategies for the multimodal input sequence. These strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. Compared with state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER improves the weighted average recall (WAR) by 4.41% on the RAVDESS dataset and by 2.06% on CREMA-D. Furthermore, compared with the state-of-the-art multimodal self-supervised learning model, MultiMAE-DER achieves a 1.86% higher WAR on the IEMOCAP dataset.

Fig. 1 Illustration of our MultiMAE-DER.
General Multimodal Model vs. MultiMAE-DER. The uniqueness of our approach lies in extracting features from cross-domain data with a single encoder, eliminating the need for modality-specific feature extractors.
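The single-encoder idea can be illustrated with a minimal NumPy sketch of one possible fusion strategy: the audio is rendered as a spectrogram "image," tiled to three channels at the frame resolution, and appended along the temporal axis so that one VideoMAE-style encoder sees a single spatiotemporal sequence. All shapes and the specific fusion choice here are illustrative assumptions, not the paper's exact configuration (the paper compares six such strategies; see Fig. 2).

```python
import numpy as np

# Hypothetical input sizes; the actual model config follows VideoMAE.
T, H, W, C = 16, 224, 224, 3

# Visual modality: T RGB frames.
video = np.random.rand(T, H, W, C).astype(np.float32)

# Audio modality: a single log-mel-style spectrogram (here random data),
# tiled to 3 channels and repeated over time so it matches the frame shape.
spec = np.random.rand(H, W).astype(np.float32)
audio = np.repeat(spec[None, :, :, None], C, axis=-1)   # (1, H, W, 3)
audio = np.repeat(audio, T, axis=0)                     # (T, H, W, 3)

# Temporal fusion: concatenate audio "frames" after the video frames,
# producing one sequence a lone encoder can process end to end.
fused = np.concatenate([video, audio], axis=0)          # (2T, H, W, 3)
```

Because both modalities live in the same token space after fusion, no audio-specific branch is needed downstream.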

Fig. 2 Multimodal Sequence Fusion Strategies.
Fig. 3 The architecture of MultiMAE-DER.
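For intuition on how the fused sequence enters the encoder in Fig. 3, the sketch below shows VideoMAE-style tubelet tokenization: the sequence is split into non-overlapping t×p×p tubelets, each flattened and linearly projected to an embedding. The tiny sizes and the random projection matrix are illustrative assumptions, not the repository's actual hyperparameters.

```python
import numpy as np

# Toy sizes for illustration; real settings follow the VideoMAE config.
T, H, W, C = 4, 8, 8, 3   # tiny spatiotemporal input
t, p = 2, 4               # tubelet depth and spatial patch size
D = 16                    # embedding dimension

x = np.random.rand(T, H, W, C).astype(np.float32)

# Carve the input into (t, p, p) tubelets and flatten each one.
tubes = x.reshape(T // t, t, H // p, p, W // p, p, C)
tubes = tubes.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * p * p * C)

# Linear projection to D dims -- the token sequence fed to the ViT encoder.
W_embed = np.random.rand(t * p * p * C, D).astype(np.float32)
tokens = tubes @ W_embed  # (num_tubelets, D)
```

With these toy sizes the sequence yields (4/2)·(8/4)·(8/4) = 8 tokens of dimension 16; masking and the transformer blocks then operate on this token sequence.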
If you have any questions, please feel free to reach out to me at pxian001@fiu.edu.
This project is built upon VideoMAE and MAE-DFER. Thanks to their authors for the great codebases.
This project is under the Apache License 2.0. See LICENSE for details.
If you find this repository helpful, please consider citing our work:
@misc{xiang2024multimaeder,
  title={MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition},
  author={Peihao Xiang and Chaohao Lin and Kaida Wu and Ou Bai},
  year={2024},
  eprint={2404.18327},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
@INPROCEEDINGS{10677820,
  author={Xiang, Peihao and Lin, Chaohao and Wu, Kaida and Bai, Ou},
  booktitle={2024 14th International Conference on Pattern Recognition Systems (ICPRS)},
  title={MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition},
  year={2024},
  pages={1-7},
  keywords={Emotion recognition;Visualization;Correlation;Supervised learning;Semantics;Self-supervised learning;Transformers;Dynamic Emotion Recognition;Multimodal Model;Self-Supervised Learning;Video Masked Autoencoder;Vision Transformer},
  doi={10.1109/ICPRS62101.2024.10677820}
}

