Awesome Emotion Models



📖 Our Survey

🔥🔥🔥 A Comprehensive Review in Unimodal and Multimodal Emotion Recognition

[📄 Paper] | [🌟 Project Page (This Page)] | [📝 Citation] | [💬 WeChat Group (Emo WeChat discussion group; all are welcome to join)]

This survey provides a unified synthesis of deep learning-based uni-modal and multi-modal emotion recognition within a coherent analytical framework that spans the full learning pipeline — from emotion modeling and dataset curation to modality-specific representation learning, fusion strategy design, and evaluation.

Key Contributions:

  • 🔬 Deep Analytical Framework: A structured taxonomy covering data preprocessing, input representations, uni-modal learning, multi-modal fusion, and evaluation strategies.
  • 📚 Systematic Synthesis: Comprehensive comparison of uni-modal (Face, Speech, Text) and multi-modal emotion recognition methods.
  • 🗺️ Future Roadmap: Concrete research directions grounded in identified gaps across modeling, data, and evaluation.

Resources: https://github.com/jackchen69/Awesome-Emotion-Models


🔥 Our Emo Works

EmoBench-M

🔥🔥🔥 EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models

📽 Demo | 📖 Paper | 🌟 GitHub | 🤖 Basic Demo | 💬 WeChat

A representative evaluation benchmark for multimodal emotion recognition. All code has been released! ✨


Other Emo Works

🔥 Work Links
MERBench: A Unified Evaluation Benchmark for Multimodal Emotion Recognition [Paper] [GitHub]
emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation [Paper] [GitHub]
Uncertain Multimodal Intention and Emotion Understanding in the Wild [Paper] [GitHub]
MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark [Paper] [GitHub]
Belief Mismatch Coefficient (BMC), ACII 2023 Best Paper 🏆 [Paper]
1st Place Solution to Odyssey Emotion Recognition Challenge Task1 🥇 [Paper]
Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective [Paper] [GitHub]
HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition [Paper] [GitHub]
Spectral Representation of Behaviour Primitives for Depression Analysis, IEEE TAFFC Best Paper Runner-Up 🏆 [Paper] [GitHub]
Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning [Paper]
A Scoping Review of Large Language Models for Generative Tasks in Mental Health Care (NPJ Digital Medicine) [Paper]


📚 Classic Books

Essential reading for researchers in affective computing, emotion recognition, and related fields. Organized by theme.


🧠 Affective Computing & Human-Computer Interaction

Cover Title Author(s) Year Publisher Link
📖 Affective Computing Rosalind W. Picard 1997 MIT Press Amazon · MIT Press
📖 The Oxford Handbook of Affective Computing Calvo, D'Mello, Gratch, Kappas (Eds.) 2015 Oxford Univ. Press Amazon · OUP
📖 Applied Affective Computing Schuller, Batliner et al. 2022 ACM Books ACM DL
📖 The Empathic Brain Christian Keysers 2011 Social Brain Press Amazon
📖 Wired for Culture: Origins of the Human Social Mind Mark Pagel 2012 Norton Amazon

💡 Picard (1997) is the founding text of affective computing — essential first read. The Oxford Handbook is the most comprehensive reference with 41 chapters on detection, generation, methodology, and applications.


😊 Emotion Psychology & Theory

Cover Title Author(s) Year Publisher Link
📖 Emotions Revealed: Recognizing Faces and Feelings Paul Ekman 2003 Times Books Amazon
📖 The Expression of the Emotions in Man and Animals Charles Darwin 1872 John Murray Free PDF · Amazon
📖 Emotion: Theory, Research, and Experience (Vol. 1) Robert Plutchik & Henry Kellerman (Eds.) 1980 Academic Press Amazon
📖 Handbook of Affective Sciences Davidson, Scherer, Goldsmith (Eds.) 2003 Oxford Univ. Press Amazon · OUP
📖 The Emotional Brain Joseph LeDoux 1996 Simon & Schuster Amazon
📖 Descartes' Error: Emotion, Reason, and the Human Brain António Damásio 1994 Putnam Amazon
📖 The Feeling of What Happens: Body, Emotion and the Making of Consciousness António Damásio 1999 Harcourt Amazon
📖 How Emotions Are Made: The Secret Life of the Brain Lisa Feldman Barrett 2017 Houghton Mifflin Amazon
📖 Emotions in Social Psychology: Essential Readings W. Gerrod Parrott (Ed.) 2001 Psychology Press Amazon
📖 The Nature of Emotion: Fundamental Questions Ekman & Davidson (Eds.) 1994 Oxford Univ. Press Amazon

💡 Ekman (2003) is the definitive guide to reading facial expressions. Damásio (1994) revolutionized understanding of the emotion-cognition relationship and remains highly influential in affective computing.


🗣️ Speech & Audio Emotion

Cover Title Author(s) Year Publisher Link
📖 Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing Björn Schuller & Anton Batliner 2013 Wiley Amazon · Wiley
📖 Speech and Language Processing (3rd ed.) Jurafsky & Martin 2023 Prentice Hall Free Draft
📖 Spoken Language Processing: A Guide to Theory, Algorithm, and System Development Huang, Acero & Hon 2001 Prentice Hall Amazon
📖 Fundamentals of Speech Recognition Rabiner & Juang 1993 Prentice Hall Amazon

💡 Schuller & Batliner (2013) is the go-to textbook for speech-based emotion and paralinguistics — directly relevant to SER research. Jurafsky & Martin is the standard NLP reference, freely available online.


📝 Sentiment Analysis & NLP

Cover Title Author(s) Year Publisher Link
📖 Sentiment Analysis: Mining Opinions, Sentiments, and Emotions (2nd ed.) Bing Liu 2020 Cambridge Univ. Press Amazon · Cambridge
📖 Sentiment Analysis and Opinion Mining Bing Liu 2012 Morgan & Claypool Free PDF · Amazon
📖 Natural Language Processing with Python Bird, Klein & Loper 2009 O'Reilly Free Online · Amazon
📖 Neural Network Methods for Natural Language Processing Yoav Goldberg 2017 Morgan & Claypool Amazon
📖 Speech and Language Processing (3rd ed.) Jurafsky & Martin 2023 Prentice Hall Free Draft

💡 Bing Liu (2020) is the definitive NLP text on sentiment analysis, now including deep learning and multimodal emotion analysis. The 2012 version is freely available as a PDF.


👁️ Computer Vision & Facial Expression

Cover Title Author(s) Year Publisher Link
📖 Facial Action Coding System (FACS): A Technique for the Measurement of Facial Movement Ekman & Friesen 1978 Consulting Psychologists Press Reference
📖 Deep Learning Goodfellow, Bengio & Courville 2016 MIT Press Free Online · Amazon
📖 Computer Vision: Algorithms and Applications (2nd ed.) Richard Szeliski 2022 Springer Free Online · Amazon
📖 Programming Computer Vision with Python Jan Erik Solem 2012 O'Reilly Free Online

💡 Ekman & Friesen's FACS (1978) is the foundational system for coding facial expressions used in virtually all FER datasets. Goodfellow et al. is the essential deep learning reference.


🤖 Deep Learning & Machine Learning

Cover Title Author(s) Year Publisher Link
📖 Deep Learning Goodfellow, Bengio & Courville 2016 MIT Press Free Online · Amazon
📖 Dive into Deep Learning Zhang, Lipton, Li & Smola 2023 Cambridge Univ. Press Free Online · Amazon
📖 Pattern Recognition and Machine Learning Christopher Bishop 2006 Springer Free PDF · Amazon
📖 Transformers for Natural Language Processing Denis Rothman 2022 Packt Amazon
📖 Attention Is All You Need (landmark paper) Vaswani et al. 2017 NeurIPS arXiv

🌐 Multimodal Learning & Fusion

Cover Title Author(s) Year Publisher Link
📖 Multimodal Machine Learning: A Survey and Taxonomy Baltrušaitis, Ahuja & Morency 2019 IEEE TPAMI arXiv · IEEE
📖 Foundations and Trends in Multimodal Machine Learning Liang, Zadeh & Morency 2022 Now Publishers arXiv
📖 Multimodal Deep Learning Ngiam et al. 2011 ICML PDF

🧬 Neuroscience & Cognitive Science of Emotion

Cover Title Author(s) Year Publisher Link
📖 The Emotional Brain Joseph LeDoux 1996 Simon & Schuster Amazon
📖 Descartes' Error António Damásio 1994 Putnam Amazon
📖 How Emotions Are Made Lisa Feldman Barrett 2017 Houghton Mifflin Amazon
📖 The Handbook of Emotion (4th ed.) Lewis, Haviland-Jones & Barrett (Eds.) 2016 Guilford Press Amazon
📖 Cognitive Neuroscience of Emotion Lane & Nadel (Eds.) 2000 Oxford Univ. Press Amazon

⚖️ Ethics, Fairness & Society

Cover Title Author(s) Year Publisher Link
📖 The Oxford Handbook of Ethics of AI Dubber, Pasquale & Das (Eds.) 2020 Oxford Univ. Press Amazon
📖 Weapons of Math Destruction Cathy O'Neil 2016 Crown Amazon
📖 Atlas of AI Kate Crawford 2021 Yale Univ. Press Amazon
📖 The Oxford Handbook of Affective Computing (Ethics Section) Calvo et al. (Eds.) 2015 Oxford Univ. Press Amazon

📋 Quick Reference: Books by Research Focus

If you work on... Read this first
Affective Computing (foundations) Picard, Affective Computing (1997)
Emotion Theory & Psychology Ekman, Emotions Revealed (2003)
Speech Emotion Recognition Schuller & Batliner, Computational Paralinguistics (2013)
Text / Sentiment Analysis Bing Liu, Sentiment Analysis (2020)
Facial Expression Recognition Ekman & Friesen, FACS (1978)
Deep Learning Methods Goodfellow et al., Deep Learning (2016)
Multimodal Fusion Baltrušaitis et al., Multimodal ML Survey (2019)
Neuroscience of Emotion Damásio, Descartes' Error (1994)
AI Ethics & Fairness O'Neil, Weapons of Math Destruction (2016)
Comprehensive Reference Calvo et al., Oxford Handbook of Affective Computing (2015)

Survey Comparison (2020–2026)

A = Audio, T = Text, V = Visual, P = Physiological

Publication Year Modality
Speech Commun 2020 A
IEEE TAFFC 2020 A
Information Fusion 2020 A,T,V
Electronics 2021 A,T,V
IEEE Signal Process. Mag. 2021 A,T,V
Information Science 2022 A,T,V
Neurocomputing 2022 A,T,V
Information Fusion 2022 A,T,V
IEEE TIM 2023 V
Proc. IEEE 2023 V
IEEE TAFFC 2023 T
Speech Commun 2023 A
IEEE Access 2023 A
Information Fusion 2023 A,T,V
Entropy 2023 A,T,V
Neurocomputing 2023 A,T,V,P
Information Fusion 2024 V
Information Fusion 2024 A,T,V,P
IEEE Access 2024 A,T,V
Expert Syst. Appl. 2024 A,T,V
Expert Systems 2025 A,T,V
ACM TOMM 2025 A,T,V,P
IEEE Access 2025 A,T,V
Information Fusion 2026 S
Ours 2026 A,T,V,P

📊 Awesome Datasets

Uni-modal Datasets

Facial Expression Datasets

Dataset Modality Emotion Labels Samples Paper/Link
CK+ V Anger, Disgust, Fear, Happy, Sad, Surprise, Neutral, Contempt 593 videos Paper
AffectNet V Neutral, Happy, Sad, Surprise, Fear, Disgust, Anger, Contempt 1,000,000 images Paper
FER+ V Anger, Disgust, Fear, Happy, Sad, Surprise, Neutral, Contempt 35,887 images Paper
RAF-DB V Basic & compound emotions 29,672 images Paper
EmoReact V Curiosity, Uncertainty, Excitement, Happy, Surprise, Disgust, Fear, Frustration 1,102 videos Paper
Aff-Wild2 V Valence, Arousal 558 videos Paper
FERV39K V 7 basic emotions 38,935 video clips Paper

Speech Emotion Datasets

Dataset Modality Emotion Labels Samples Paper/Link
TESS A Anger, Disgust, Fear, Happy, Pleasant Surprise, Sadness, Neutral 2,800 utterances Paper
EmoDB 2.0 A Anger, Boredom, Disgust, Fear, Happy, Neutral, Sadness 817 utterances Paper
RAVDESS A, V Calm, Happy, Sad, Angry, Fearful, Surprise, Disgust 7,356 videos Paper
IEMOCAP A, V, T Happy, Angry, Sad, Frustrated, Neutral; Valence, Arousal, Dominance 12.46h video Paper
MSP-Podcast A Anger, Contempt, Disgust, Fear, Happy, Neutral, Sadness, Surprise 264,705 turns Paper
CREMA-D A, V Anger, Disgust, Fear, Happy, Neutral, Sad 7,442 clips GitHub
EMO-DB A Anger, Boredom, Disgust, Fear, Happy, Neutral, Sad 535 utterances -

Text Emotion Datasets

Dataset Modality Emotion Labels Samples Paper/Link
ISEAR T Joy, Fear, Anger, Sadness, Disgust, Shame, Guilt 7,666 sentences Paper
EmoBank T Valence, Arousal, Dominance (reader/writer) 10,548 sentences Paper
SemEval-2018 Task 1 T 11 emotions + Neutral 22,000 sentences Paper
GoEmotions T 27 emotion categories 58,000 Reddit comments Paper
Empathetic Dialogues T 32 emotion categories 24,850 conversations Paper
WRIME T 8 emotions (reader/writer) 17,000 social media posts Paper

Multi-modal Datasets

Dataset Modality Type Emotion Labels Samples Paper/Link
eNTERFACE'05 A, V Acted Anger, Disgust, Fear, Happy, Sad, Surprise 1,166 videos Paper
SAVEE A, V Acted Anger, Disgust, Fear, Happy, Sad, Surprise, Neutral 480 videos Paper
AFEW A, V Natural Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise 1,426 videos Paper
CHEAVD A, V Natural Anger, Happy, Sad, Worried, Anxious, Surprise, Disgust, Neutral 140min video Paper
SEWA A, V Natural Valence, Arousal 2,000min video Paper
AMIGOS A, V Natural Valence, Arousal, Dominance 40 videos Paper
CMU-MOSI A, V, T Induced Continuous Sentiment Score 3,702 clips Paper
CMU-MOSEI A, V, T Induced Happy, Sad, Angry, Disgust, Surprise, Fear 23,500 clips Paper
MELD A, V, T Induced Anger, Disgust, Fear, Joy, Neutral, Sad, Surprise 13,708 utterances GitHub
IEMOCAP A, V, T Induced Happy, Angry, Sad, Frustrated, Neutral 12.46h video Paper
CH-SIMS A, V, T Induced 5-class sentiment 2,281 clips Paper
RAMAS A, V Induced Anger, Sad, Disgust, Happy, Fear, Surprise 7h video Paper
MER2023 A, V, T Natural 6 discrete + continuous 5,030 clips Paper
MER2024 A, V, T Natural Multi-label + OV Extended Paper
MER2025 A, V, T Natural Open-vocabulary Extended Paper

🏆 Awesome Papers

📈 Benchmark Comparison

Model Performance on Key Benchmarks

Vision Models

Model Framework Input Loss Performance Dataset Paper
C3D 3D Conv Video Softmax Acc: 59.02% AFEW Fan et al., 2016
I3D Inflated 3D Video Softmax Acc: 68.90% GreSti Ghaleb et al., 2021
SlowFast Dual CNN Video Softmax WAR: 49.34% FERV39K Neshov et al., 2024
ViT-B/16+SAM Transformer Video Cross-Entropy Acc: 52.42% FER-2013 Arnab et al., 2021
DTL-I-ResNet18 3D ResNet Video Softmax Acc: 83.0% FER2013 Helaly et al., 2023
ESTLNet CNN-LSTM Video Cross-Entropy Acc: 53.79% AFEW Wang et al., 2022
D2SP Dual Purification Video Cross-Entropy WAR: 50.5% FERV39k CVPR 2025

Audio Models

Model Framework Input Loss Performance Dataset Paper
HuBERT CNN+Transformer Raw audio Contrastive WA: 79.58% IEMOCAP Wang et al., 2021
Wav2Vec 1D CNN Raw audio Contrastive WA: 77.00% IEMOCAP Wang et al., 2021
emotion2vec Online Distillation Raw audio Utterance+Frame WA: 85.0% RAVDESS Ma et al., 2024
SL-GEmo-CLAP CNN+Transformer WavLM-large KL Loss WAR: 81.43% IEMOCAP Pan et al., 2024
WavLM CNN+Transformer Raw audio Discriminative Macro-F1: 33.6% IEMOCAP Wu et al., 2024
Mockingjay NPC Raw audio L1/MSE Acc: 50.28% IEMOCAP Liu et al., 2024
DeCoAR SVM Mel FBANK L1/MSE UAR: 71.93% IEMOCAP Stanea et al., 2023
Vesper CNN+Transformer Raw audio MSE WA: 54.2% IEMOCAP Chen et al., 2024
Audio-Transformer Transformer Spectrogram Cross-Entropy Acc: 75.42% EMO-DB Bayraktar et al., 2023
DTNet CNN+Transformer Raw audio Cross-Entropy UA: 74.8% IEMOCAP Yuan et al., 2024

Text Models

Model Framework Input Loss Performance Dataset Paper
BERT Transformer Text token MLM+NSP Acc: 70.09% ISEAR Adoma et al., 2020
RoBERTa Transformer Text token Cross-Entropy Acc: 74.31% ISEAR Adoma et al., 2020
XLNet Transformer Permuted tokens Permuted LM Acc: 72.99% ISEAR Adoma et al., 2020
ALBERT Transformer Text token Focal+KL Acc: 73.86% ISEAR Adoma et al., 2020
DistilBERT Transformer Text token MLM+Distillation Acc: 66.93% ISEAR Adoma et al., 2020
DeBERTa-v3 Transformer Text token Cross-Entropy F1: 66.2% WRIME Takenaka et al., 2025
ChatGPT-4o Transformer Text token Prompt-based F1: 52.7% WRIME Atitienei et al., 2024
GloVe Co-occurrence matrix Text tokens Weighted LS Acc: 95.09% Twitter Gupta et al., 2021
Word2Vec CBOW Text tokens Hierarchical Softmax Macro-F1: 73.21% Tweets Tang et al., 2014
ELMo BiLSTM Context. vectors Cross-Entropy Acc: 88.91% Wikipedia Yang et al., 2021
COMET Transformer Commonsense triple Cross-Entropy W-Avg F1: 65.21% MELD Zhang et al., 2021

Uni-modal Emotion Recognition

Facial Emotion Recognition

FER has evolved from hand-crafted descriptors (LBP, HOG, Gabor) → CNN-based end-to-end learning → spatio-temporal models → Transformer-based architectures → self-supervised pretraining. A minimal sketch of the CNN stage follows the table below.

Title Venue Date Code
D2SP: Dual Denoising via Saliency Prompt for Video-based Emotion Recognition CVPR 2025 GitHub
Facial Emotion Recognition using CNN arXiv 2023 GitHub
DPCNet: Dual Path Multi-Excitation Collaborative Network for Facial Expression Representation Learning in Videos ACM MM 2022 -
STCAM: Spatio-Temporal and Channel Attention Module for Dynamic Facial Expression Recognition IEEE TAFFC 2020 -
DTL: Disentangled Transfer Learning for Visual Recognition arXiv 2023 -
SlowFast Networks for Video Recognition ICCV 2019 GitHub
ViViT: A Video Vision Transformer ICCV 2021 GitHub
Big Self-Supervised Models are Strong Semi-Supervised Learners NeurIPS 2020 GitHub
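
To ground the CNN-based stage, here is a minimal PyTorch sketch of a frame-level expression classifier. The 48×48 grayscale input (FER-2013-style), layer widths, and class count are illustrative assumptions, not a published architecture.

```python
# Minimal CNN for frame-level facial expression recognition (illustrative sketch).
# Assumes 48x48 grayscale crops (FER-2013-style) and 7 basic emotion classes.
import torch
import torch.nn as nn

class TinyFERNet(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 12 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 6 * 6, 256), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = TinyFERNet()(torch.randn(8, 1, 48, 48))  # -> (8, 7) class logits
```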

Speech Emotion Recognition

SER has transitioned from hand-crafted prosodic/spectral features → deep CNN/LSTM models → Transformer-based architectures → Self-Supervised Learning (SSL) as the dominant paradigm. A frozen-SSL probe sketch follows the table below.

Title Venue Date Code
emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation arXiv 2023 GitHub
SL-GEmo-CLAP: Contrastive Language-Audio Pretraining for Speech Emotion Interspeech 2024 -
HuBERT: Self-Supervised Speech Representation Learning IEEE/ACM TASLP 2021 GitHub
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing IEEE JSTSP 2022 GitHub
Wav2Vec: Unsupervised Pre-Training for Speech Recognition Interspeech 2019 GitHub
Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition IEEE TASLP 2024 GitHub
DTNet: Disentanglement Learning for Speech Emotion Recognition ICASSP 2024 -
Audio Transformer for Speech Emotion Recognition ACM MM Asia 2023 -
Mockingjay: Unsupervised Speech Representation Learning ICASSP 2020 GitHub
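
As a concrete instance of the SSL paradigm, the sketch below freezes a pretrained wav2vec 2.0 encoder (via the Hugging Face transformers library) and trains only a mean-pooling linear probe on top. The checkpoint name and the 4-class setup are illustrative assumptions.

```python
# Frozen SSL encoder + linear probe for speech emotion recognition (sketch).
# Requires: pip install torch transformers. Checkpoint and class count are illustrative.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLProbe(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.encoder.requires_grad_(False)  # freeze the SSL backbone; train probe only
        self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples), mono audio at 16 kHz
        frames = self.encoder(waveform).last_hidden_state  # (batch, T, hidden)
        return self.head(frames.mean(dim=1))               # mean-pool over time -> logits

logits = SSLProbe()(torch.randn(2, 16000))  # 1 s of fake audio -> (2, 4)
```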

Text Emotion Recognition

TER has evolved from lexicon-based methods → static embeddings → transformer pretraining (BERT family) → Large Language Models enabling zero-shot generalization. A BERT fine-tuning sketch follows the table below.

Title Venue Date Code
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding NAACL 2019 GitHub
RoBERTa: A Robustly Optimized BERT Pretraining Approach arXiv 2019 GitHub
DeBERTa: Decoding-enhanced BERT with Disentangled Attention ICLR 2021 GitHub
GloVe: Global Vectors for Word Representation EMNLP 2014 GitHub
Word2Vec: Efficient Estimation of Word Representations in Vector Space ICLR 2013 GitHub
ELMo: Deep Contextualized Word Representations NAACL 2018 GitHub
COMET: Commonsense Transformers for Automatic Knowledge Graph Construction ACL 2019 GitHub
XLNet: Generalized Autoregressive Pretraining for Language Understanding NeurIPS 2019 GitHub
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations ICLR 2020 GitHub
DistilBERT: A distilled version of BERT NeurIPS Workshop 2019 GitHub
DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models arXiv 2023 GitHub
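
To illustrate the transformer-pretraining stage, the sketch below loads BERT with a fresh classification head over ISEAR's seven labels using Hugging Face transformers. In practice the model would be fine-tuned first; the untrained head here produces arbitrary predictions.

```python
# Fine-tuning-style forward pass for text emotion recognition (illustrative sketch).
# Requires: pip install torch transformers. Uses ISEAR's 7-label scheme.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["joy", "fear", "anger", "sadness", "disgust", "shame", "guilt"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)  # attaches a fresh classification head
)

batch = tokenizer(["I can't believe they did this to me."],
                  return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():  # during training, compute the loss and backprop instead
    logits = model(**batch).logits              # (1, 7)
print(LABELS[logits.argmax(dim=-1).item()])     # untrained head -> arbitrary label
```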

Multi-modal Emotion Recognition

Fusion Strategy

Fusion strategy determines when and how modalities interact: Early Fusion (feature-level) → Late Fusion (decision-level) → Model-level Fusion → Hybrid Fusion. A schematic contrast of early and late fusion follows the table below.

Title Venue Date Code Fusion Type
M²FNet: Multi-scale Multi-modal Fusion Network for Emotion Recognition in Conversations CVPRW 2022 GitHub Early
TDFNet: Text-Directed Fusion Network for Multimodal Sentiment Analysis arXiv 2023 - Early
UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition EMNLP 2022 GitHub Hybrid
Cross-Modal Fusion Network with Dual-Task Interaction IEEE TAFFC 2023 - Hybrid
Memory Fusion Network for Multi-view Sequential Learning AAAI 2018 GitHub Late
MISA: Modality-Invariant and -Specific Representations ACM MM 2020 GitHub Model-level
Efficient Low-rank Multimodal Fusion with Modality-Specific Factors ACL 2018 GitHub Model-level
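
The sketch below contrasts early (feature-level) and late (decision-level) fusion on precomputed per-modality embeddings. The dimensions, class count, and uniform decision weights are illustrative assumptions, not any specific paper's configuration.

```python
# Early (feature-level) vs. late (decision-level) fusion -- a schematic contrast.
import torch
import torch.nn as nn

audio = torch.randn(8, 128)   # per-utterance audio embedding (illustrative dims)
text = torch.randn(8, 256)    # text embedding
video = torch.randn(8, 512)   # visual embedding
NUM_CLASSES = 6

# Early fusion: concatenate features, then one joint classifier.
early_clf = nn.Linear(128 + 256 + 512, NUM_CLASSES)
early_logits = early_clf(torch.cat([audio, text, video], dim=-1))

# Late fusion: one classifier per modality, then average the decisions.
clfs = [nn.Linear(d, NUM_CLASSES) for d in (128, 256, 512)]
late_logits = torch.stack(
    [clf(feat) for clf, feat in zip(clfs, (audio, text, video))]
).mean(dim=0)  # could also use learned or confidence-based weights

print(early_logits.shape, late_logits.shape)  # both (8, 6)
```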

Fusion Granularity

Four core granularity challenges: Modality Alignment · Modality Dominance · Modality Complementarity · Modality Robustness

Title Venue Date Code Granularity
MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences ACL 2019 GitHub Alignment
TFN: Tensor Fusion Network for Multimodal Sentiment Analysis EMNLP 2017 GitHub Alignment
DialogueMMT: Distribution-Aware Multi-modal Dialogue Emotion Recognition Interspeech 2025 - Alignment
MAG-BERT: Integrating Multimodal Information in Large Pretrained Transformers ACL 2020 GitHub Dominance
MMIN: Multimodal Multiple Instance Learning AAAI 2021 GitHub Robustness
IMDer: Incomplete Multimodal Learning for Emotion Recognition AAAI 2024 - Robustness
GCNet: Graph Completion Network for Incomplete Multimodal Learning in Conversation IEEE TPAMI 2023 GitHub Robustness

Model Architectures

Multi-modal fusion architectures are broadly classified into 7 categories: Kernel-based · Graph-based · Neural Network-based · Attention-based · Transformer-based · Generative-based · LLM-based.


1. Kernel-based Architectures

MKL established the first principled framework for multi-modal fusion by associating each modality with its own similarity function, recognizing that heterogeneous affective signals cannot be compared through a single shared kernel. A weighted multi-kernel sketch follows the table below.

Title Venue Date Code
Multiple Kernel Learning for Emotion Recognition in the Wild ACM MM 2013 -
Emotion Recognition in the Wild via CNN and Mapped Binary Patterns ACM MM 2015 -
Ensemble of SVM Trees for Multimodal Emotion Recognition ACII 2017 -
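
A minimal sketch of the multiple-kernel idea, assuming per-modality RBF kernels combined by a fixed weighted sum before an SVM with a precomputed kernel. Proper MKL would learn the kernel weights jointly rather than fixing them.

```python
# Multiple-kernel fusion sketch: one RBF kernel per modality, weighted sum, SVM.
# Weights and gammas are illustrative; true MKL learns the weights jointly.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_audio = rng.normal(size=(40, 64))   # toy per-modality features
X_video = rng.normal(size=(40, 128))
y = rng.integers(0, 2, size=40)       # toy binary emotion labels

# Each modality gets its own similarity function (different gamma here).
K = 0.6 * rbf_kernel(X_audio, gamma=0.01) + 0.4 * rbf_kernel(X_video, gamma=0.005)

svm = SVC(kernel="precomputed").fit(K, y)
print(svm.predict(K[:5]))  # rows = kernel between test and training samples
```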

2. Graph-based Architectures

Graph-based models treat a dialogue as a heterogeneous graph in which nodes represent utterances and edges encode speaker, temporal, and cross-modal dependencies, capturing the insight that emotional meaning emerges from relational structure rather than from individual utterances alone. A one-layer graph propagation sketch follows the table below.

Title Venue Date Code
DialogueGCN: A Graph Convolutional Neural Network for ERC EMNLP 2019 GitHub
COGMEN: COntextualized GNN based Multimodal Emotion recognitioN NAACL 2022 GitHub
M3GAT: Multi-granularity Multi-scale Multi-modal Graph Attention Network ACM TOMM 2023 -
M²FNet: Multi-scale Multi-modal Fusion Network for ERC CVPRW 2022 GitHub
MMGCN: Multi-relational Graph Convolutional Network for Multimodal ERC ACM MM 2021 GitHub
Hierarchical Heterogeneous Graph for Multimodal ERC AAAI 2025 -
Decoupled Distillation Graph for Cross-modal ERC ACM MM 2024 -
Persona-aware ERC with Graph Network and Turn Interaction EMNLP 2024 -
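
A minimal sketch of the graph view, assuming binary temporal and same-speaker edges and a single GCN propagation step in plain PyTorch. Real models such as DialogueGCN use typed, relation-specific edges; the edge rules and dimensions here are illustrative.

```python
# Dialogue-as-graph sketch: utterance nodes, temporal + same-speaker edges,
# one GCN propagation step. Edge rules and sizes are illustrative.
import torch

num_utt, dim = 6, 32
speakers = [0, 1, 0, 1, 1, 0]            # speaker id per utterance
H = torch.randn(num_utt, dim)            # fused utterance features

A = torch.eye(num_utt)                   # self-loops
for i in range(num_utt):
    for j in range(num_utt):
        if abs(i - j) == 1:              # temporal neighbours
            A[i, j] = 1.0
        if i != j and speakers[i] == speakers[j]:
            A[i, j] = 1.0                # same-speaker edges

deg_inv_sqrt = A.sum(dim=1).pow(-0.5)
A_hat = deg_inv_sqrt[:, None] * A * deg_inv_sqrt[None, :]  # symmetric normalization

W = torch.nn.Linear(dim, dim, bias=False)
H_out = torch.relu(A_hat @ W(H))         # one GCN layer: aggregate neighbour states
print(H_out.shape)                       # (6, 32) context-aware utterance states
```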

3. Neural Network-based Architectures

Early neural MER established the template of separate uni-modal encoders (CNN/LSTM) followed by joint fusion. The concatenate-then-classify bottleneck, however, lets the classifier ignore weaker modalities whenever the dominant one already minimizes the training loss. The template is sketched after the table below.

Title Venue Date Code
Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities IEEE TAFFC 2011 -
LSTM-based Multimodal Affect Prediction ACII 2013 -
End-to-End Multimodal Emotion Recognition using Deep Neural Networks IEEE JSTSP 2017 -
Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling KBS 2019 GitHub
Semi-supervised Multimodal Emotion Recognition AAAI 2017 -
Ensemble CNN for Multimodal Emotion Recognition IEEE TAFFC 2020 -
MIST: Multi-modal Integration with Semi-supervised and Transfer Learning arXiv 2025 -
DISD-Net: Dynamic Interaction Self-Distillation for Cross-Subject ERC arXiv 2025 -
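
The template itself, in sketch form: two recurrent encoders whose final states are concatenated and classified jointly. Feature dimensions are illustrative assumptions; nothing in the objective forces the classifier to use every branch, which is exactly the bottleneck described above.

```python
# The classic "separate encoders -> concatenate -> classify" template (sketch).
# If one modality alone minimizes training loss, the other branch can be ignored.
import torch
import torch.nn as nn

class ConcatFusionMER(nn.Module):
    def __init__(self, num_classes: int = 6, hid: int = 64):
        super().__init__()
        self.audio_enc = nn.LSTM(40, hid, batch_first=True)   # e.g. 40-d MFCC frames
        self.text_enc = nn.LSTM(300, hid, batch_first=True)   # e.g. 300-d word vectors
        self.clf = nn.Linear(2 * hid, num_classes)

    def forward(self, audio_seq, text_seq):
        _, (a_h, _) = self.audio_enc(audio_seq)        # last hidden state per modality
        _, (t_h, _) = self.text_enc(text_seq)
        fused = torch.cat([a_h[-1], t_h[-1]], dim=-1)  # the concat bottleneck
        return self.clf(fused)

model = ConcatFusionMER()
print(model(torch.randn(4, 100, 40), torch.randn(4, 20, 300)).shape)  # (4, 6)
```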

4. Attention-based Architectures

Attention mechanisms span four complementary granularities: self-modal (intra-modal noise suppression), cross-modal (inter-modal integration), spatial (discriminative region selection), and temporal (salient moment capture). Each targets a distinct failure mode; they are not interchangeable solutions. A temporal attention pooling sketch follows the table below.

Title Venue Date Code
MultiEMO: An Attention-Based Correlation-Aware Multimodal Fusion Framework ACL 2023 GitHub
CTNet: Conversational Transformer Network for Emotion Recognition IEEE/ACM TASLP 2021 -
Attentive Modality Hopping for Speech Emotion Recognition ICASSP 2020 -
Multi-Modal Multi-Scale Temporal Self-Attention for Multimodal Sentiment Analysis ACM MM 2021 -
Cross-Modal Residual Attention for Multimodal ERC ACL 2021 -
Spatial Attention for Image-Text Emotion Recognition AAAI 2020 -
Temporal Attention for Video-based Emotion Recognition ICCV 2019 -
Knowledge-aware Graph-based Co-Attention for Multimodal ERC EMNLP 2023 -
MSER: Multi-Scale Emotion Recognition with Orthogonal Learning AAAI 2024 -
Phy-FusionNet: Temporal Attention with Memory-Augmented Periodic Modeling arXiv 2025 -
Conv-Attention Adapter for LLM-based ERC arXiv 2024 -
Joint Transformer-Attention for Multimodal ERC ICASSP 2023 -
Bayesian Co-Attention for Uncertainty-Aware Multimodal ERC ACL 2023 -
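
A sketch of the temporal granularity: attention pooling that scores each frame and pools by salience instead of uniform averaging. The layer is generic, not taken from any specific paper, and the dimensions are illustrative.

```python
# Temporal attention pooling: score each time step, softmax over time, and pool
# frames by their salience -- the "salient moment capture" idea in sketch form.
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one salience score per frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) frame-level features
        alpha = torch.softmax(self.score(frames), dim=1)  # (batch, time, 1)
        return (alpha * frames).sum(dim=1)                # salience-weighted pooling

pooled = TemporalAttentionPool(dim=256)(torch.randn(4, 50, 256))
print(pooled.shape)  # (4, 256) -- emphasizes emotionally salient frames
```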

5. Transformer-based Architectures

Transformers perform modality alignment and cross-modal interaction simultaneously, via cross-attention at every layer, rather than sequentially, which lends structural robustness to noise and modality inconsistency. A key open question is whether the gains come from the architecture itself or from the pretrained representations it builds on. A single cross-modal block is sketched after the table below.

Title Venue Date Code
MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences ACL 2019 GitHub
UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition EMNLP 2022 GitHub
CTNet: Conversational Transformer Network for ERC IEEE/ACM TASLP 2021 -
TDFNet: Text-Directed Fusion Network with Cross-Modal Transformer arXiv 2023 -
Cross-Modality Transformer for Robust Multimodal ERC arXiv 2021 -
Pre-trained Audio-Visual Transformers for Multimodal ERC ICASSP 2022 -
Hierarchical Transformer Fusion for Multi-level ERC ACM MM 2022 -
Modality-Aware Transformer with Dynamic Modality Prioritization EMNLP 2022 -
DQ-based Multimodal Fusion with Dynamic Query Selection ACL 2024 -
DialogueMMT: Distribution-Aware Multi-modal Dialogue ERC Interspeech 2025 -
Capsule Graph Transformer for Multimodal ERC arXiv 2025 -
TACFN: Transformer-based Adaptive Cross-Modal Fusion Network arXiv 2025 -
Phy-FusionNet: Memory-Augmented Periodicity-Aware Fusion arXiv 2025 -
RRMER-DT: Diffusion-Enhanced Transformer for Conversational MER arXiv 2025 -
Self-Supervised Multimodal Transformer for Sentiment Analysis arXiv 2020 -
Flexible-Input Transformer for Continuous and Multi-label ERC EMNLP 2023 -
Modality-Collaborative Transformer with Feature Reconstruction ACM MM 2024 -
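
One cross-modal block in sketch form, assuming MulT-style directional attention in which the target modality queries the source. Sizes are illustrative, and real models stack such blocks in both directions across all modality pairs.

```python
# One MulT-style cross-modal block: the target modality (e.g. text) queries the
# source modality (e.g. audio) with cross-attention, then refines with a feedforward.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_t, dim); source: (batch, T_s, dim).
        # No length alignment is required: attention handles unaligned sequences.
        attended, _ = self.attn(query=target, key=source, value=source)
        x = self.norm1(target + attended)     # residual cross-modal update
        return self.norm2(x + self.ff(x))

out = CrossModalBlock()(torch.randn(2, 20, 64), torch.randn(2, 120, 64))
print(out.shape)  # (2, 20, 64): text sequence enriched with audio context
```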

6. Generative-based Architectures

Generative models address two problems that discriminative models cannot solve by design: (1) data scarcity, via GAN/diffusion augmentation, and (2) missing-modality robustness, via conditional reconstruction. The critical insight: perceptual plausibility ≠ affective coherence, so generative and discriminative objectives must be jointly optimized. A conditional-reconstruction sketch follows the table below.

Title Venue Date Code
IMDer: Incomplete Multimodal Learning via Diffusion for ERC AAAI 2024 -
GAN-based Multimodal Emotion Data Augmentation ICASSP 2019 -
MALN: Modality-Adversarial Learning Network for Multimodal ERC ACM MM 2023 -
Deep Autoencoder-Based Fusion for Multimodal Sentiment Analysis ACL 2021 -
Diffusion-Based Multi-modal Emotion Recovery arXiv 2025 -
DiffuFuse: Diffusion-based Fusion for Incomplete Multimodal ERC arXiv 2025 -
Progressive Cross-Modal Reconstruction under Missing Modality arXiv 2025 -
RoHyDr: Robust Hybrid Discriminative-Generative MER arXiv 2025 -
Cross-Modal Adversarial-Generative Framework for Robust ERC ACM MM 2024 -
Novel Autoencoder Fusion with Affective Discriminative Constraints ICASSP 2024 -
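
A conditional-reconstruction sketch: regress the missing audio embedding from the available text and visual embeddings, then classify on the completed triple. Dimensions are illustrative assumptions, and this is a schematic of the idea rather than a reimplementation of IMDer or GCNet.

```python
# Missing-modality reconstruction sketch: when audio is absent at test time,
# predict its embedding from text + video, then classify on the completed input.
import torch
import torch.nn as nn

D_A, D_T, D_V, NUM_CLASSES = 128, 256, 512, 6

reconstruct_audio = nn.Sequential(          # text+video -> pseudo-audio embedding
    nn.Linear(D_T + D_V, 256), nn.ReLU(), nn.Linear(256, D_A)
)
classifier = nn.Linear(D_A + D_T + D_V, NUM_CLASSES)

text, video = torch.randn(8, D_T), torch.randn(8, D_V)
audio_hat = reconstruct_audio(torch.cat([text, video], dim=-1))
logits = classifier(torch.cat([audio_hat, text, video], dim=-1))

# Training would combine a reconstruction loss (e.g. MSE to the true audio
# embedding, when available) with the emotion classification loss, so that
# reconstructions stay affectively useful rather than merely plausible.
print(logits.shape)  # (8, 6)
```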

7. Large Language Model-Based Architectures

LLMs shift MER from task-specific fusion toward language-mediated reasoning over heterogeneous inputs, enabling zero-shot generalization, open-vocabulary recognition, and natural-language explainability. Key challenges remain: prompt sensitivity, multimodal hallucination, and memory forgetting. A prompt-construction sketch follows the table below.

Title Venue Date Code
AffectGPT: Explainable Multimodal Emotion Reasoning arXiv 2025 -
AffectGPT-R1: RL-Enhanced Interpretable Affective Reasoning arXiv 2025 -
OV-MER: Open-Vocabulary Multimodal Emotion Recognition arXiv 2025 -
EmoLLM: Multimodal Emotional Understanding with LLMs arXiv 2024 GitHub
Emotion-LLaMA: Multimodal ERC and Reasoning with Instruction Tuning NeurIPS 2024 GitHub
DialogueLLM: Context and Emotion Knowledge-Tuned LLM for ERC arXiv 2023 GitHub
DialogueMLLM: Instruction-Tuned MLLM for Conversational ERC arXiv 2025 -
R1-Omni: Explainable Omni-Multimodal ERC with Reinforcement Learning arXiv 2025 GitHub
OMNISAPIENS-7B: Unified Multimodal Human Behaviour Understanding arXiv 2025 -
OMNISAPIENS-7B 2.0: RL-Balanced Multimodal Behaviour Understanding arXiv 2026 -
GPT-4V for Zero-Shot Multimodal Emotion Recognition arXiv 2024 -
AUGESC: LLM-based Data Augmentation for Emotional Support Conversations ACL Findings 2023 GitHub
Generalized LLMs with Emergent Cross-Domain Emotion Reasoning arXiv 2024 -
EmoBench: Evaluating the Emotional Intelligence of LLMs arXiv 2024 GitHub
Beyond Text: LLMs Integrating Vocal and Visual Signals for Emotion arXiv 2025 -
REVISE: Prompt Sensitivity in LLMs for Emotion Recognition arXiv 2025 -
EmotionHallucer: Detecting Hallucination in LLM Emotion Prediction arXiv 2025 -
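
A prompt-construction sketch for zero-shot recognition: constrain the model to a fixed label set (MELD's seven emotions here) and supply conversational context. The chat format is generic and `call_llm` is a hypothetical placeholder; no specific provider or API is assumed.

```python
# Prompt construction for zero-shot emotion recognition with an instruction-tuned
# (M)LLM. `call_llm` below is a hypothetical placeholder for your client of choice.
LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def build_prompt(utterance: str, context: list[str]) -> list[dict]:
    history = "\n".join(f"- {turn}" for turn in context)
    return [
        {"role": "system",
         "content": "You are an emotion recognition assistant. Answer with "
                    f"exactly one label from: {', '.join(LABELS)}."},
        {"role": "user",
         "content": f"Conversation so far:\n{history}\n\n"
                    f"Speaker's next utterance: \"{utterance}\"\n"
                    "Which emotion does the speaker express?"},
    ]

messages = build_prompt("Oh great, my flight got cancelled again.",
                        ["A: How was your trip?"])
# reply = call_llm(messages)  # hypothetical client call; parse reply against LABELS
print(messages[1]["content"])
```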

Summary of Multi-modal Architectures

Architecture Representative Models Core Strength Key Limitation
Kernel-based MKL, SVM-Ensemble Modality-specific similarity kernels Poor scalability, cannot be end-to-end
Graph-based DialogueGCN, COGMEN, M3GAT Relational & conversational structure Graph construction sensitivity
Neural Network LSTM-based, CNN-based, MIST Temporal & sequential emotion modeling Concatenate-then-classify bottleneck
Attention-based MultiEMO, Phy-FusionNet, MSER Selective salient cue capture Noise amplification from unreliable cues
Transformer-based MulT, UniMSE, CTNet, TDFNet Simultaneous cross-modal alignment Pretrain bias; compute cost
Generative-based IMDer, DiffuFuse, MALN Missing-modality robustness; data augmentation Plausibility ≠ affective coherence
LLM-based AffectGPT, EmoLLM, R1-Omni Zero-shot + explainability + open-vocabulary Prompt sensitivity; hallucination

If you find this repository or our survey useful for your research, please cite:

@article{luo2026comprehensive,
  title     = {A Comprehensive Review in Unimodal and Multimodal Emotion Recognition},
  author    = {Luo, Jiachen and Yang, Qu and He, Jiajun and Hua, Yining and 
               Zheng, Lian and Li, Yuanchao and Song, Siyang and Mathur, Leena and 
               Wen, Wu and Wang, Dingdong and Shen, Shuai and Wu, Jingyao and 
               Hu, Guimin and Hu, He and Li, Yong and Zhang, Zixing and 
               Wang, Jiadong and Zhou, Sifan and Tang, Zuojin and Cao, Canran and 
               Xu, Sheng and Zhao, Zhenjun and Toda, Tomoki and Xue, Xiangyang and 
               Zhao, Siyang and Sun, Licai and Zhang, Liyun and Cai, Cong and 
               Du, Jiamin and Ma, Ziyang and Chen, Mingjie and Qian, Chengxuan and 
               Phan, Huy and Wang, Lin and Schuller, Bjoern and Reiss, Joshua},
  journal   = {ACM Transactions on Intelligent Systems and Technology},
  year      = {2026},
  note      = {Resources: \url{https://github.com/jackchen69/Awesome-Emotion-Models}}
}

🤝 Contributing

We welcome contributions! If you have papers, datasets, or models to add:

  1. Fork this repository
  2. Add your entry following the existing table format
  3. Submit a Pull Request with a brief description

Please ensure the added work is peer-reviewed or on arXiv with verifiable results.


📬 Contact

💬 WeChat Group: Scan the QR code here to join our Emo discussion group (Emo微信交流群,欢迎加入)


⭐ Star this repository if you find it helpful! ⭐
