🔥🔥🔥 A Comprehensive Review in Unimodal and Multimodal Emotion Recognition
[📄 Paper] | [🌟 Project Page (This Page)] | [📝 Citation] | [💬 WeChat Group (Emo WeChat discussion group, all welcome)]
This survey provides a unified synthesis of deep learning-based uni-modal and multi-modal emotion recognition within a coherent analytical framework that spans the full learning pipeline — from emotion modeling and dataset curation to modality-specific representation learning, fusion strategy design, and evaluation.
Key Contributions:
- 🔬 Deep Analytical Framework: A structured taxonomy covering data preprocessing, input representations, uni-modal learning, multi-modal fusion, and evaluation strategies.
- 📚 Systematic Synthesis: Comprehensive comparison of uni-modal (Face, Speech, Text) and multi-modal emotion recognition methods.
- 🗺️ Future Roadmap: Concrete research directions grounded in identified gaps across modeling, data, and evaluation.
Resources: https://github.com/jackchen69/Awesome-Emotion-Models
🔥🔥🔥 EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models
📽 Demo | 📖 Paper | 🌟 GitHub | 🤖 Basic Demo | 💬 WeChat
A representative evaluation benchmark for multimodal emotion recognition. All code has been released! ✨
| 🔥 Work | Links |
|---|---|
| MERBench: A Unified Evaluation Benchmark for Multimodal Emotion Recognition | [Paper] [GitHub] |
| emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation | [Paper] [GitHub] |
| Uncertain Multimodal Intention and Emotion Understanding in the Wild | [Paper] [GitHub] |
| MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark | [Paper] [GitHub] |
| Belief Mismatch Coefficient (BMC) ⭐ ACII 2023 Best Paper | [Paper] |
| 1st Place Solution to Odyssey Emotion Recognition Challenge Task1 🥇 | [Paper] |
| Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective | [Paper] [GitHub] |
| HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition | [Paper] [GitHub] |
| Spectral Representation of Behaviour Primitives for Depression Analysis ⭐ IEEE TAFFC Best Paper Runner-Up | [Paper] [GitHub] |
| Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning | [Paper] |
| A Scoping Review of Large Language Models for Generative Tasks in Mental Health Care (NPJ Digital Medicine) | [Paper] |
Essential reading for researchers in affective computing, emotion recognition, and related fields. Organized by theme.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Affective Computing | Rosalind W. Picard | 1997 | MIT Press | Amazon · MIT Press |
| 📖 | The Oxford Handbook of Affective Computing | Calvo, D'Mello, Gratch, Kappas (Eds.) | 2015 | Oxford Univ. Press | Amazon · OUP |
| 📖 | Applied Affective Computing | Schuller, Batliner et al. | 2022 | ACM Books | ACM DL |
| 📖 | The Empathic Brain | Christian Keysers | 2011 | Social Brain Press | Amazon |
| 📖 | Wired for Culture: Origins of the Human Social Mind | Mark Pagel | 2012 | Norton | Amazon |
💡 Picard (1997) is the founding text of affective computing — essential first read. The Oxford Handbook is the most comprehensive reference with 41 chapters on detection, generation, methodology, and applications.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Emotions Revealed: Recognizing Faces and Feelings | Paul Ekman | 2003 | Times Books | Amazon |
| 📖 | The Expression of the Emotions in Man and Animals | Charles Darwin | 1872 | John Murray | Free PDF · Amazon |
| 📖 | Emotion: Theory, Research, and Experience (Vol. 1) | Robert Plutchik & Henry Kellerman (Eds.) | 1980 | Academic Press | Amazon |
| 📖 | Handbook of Affective Sciences | Davidson, Scherer, Goldsmith (Eds.) | 2003 | Oxford Univ. Press | Amazon · OUP |
| 📖 | The Emotional Brain | Joseph LeDoux | 1996 | Simon & Schuster | Amazon |
| 📖 | Descartes' Error: Emotion, Reason, and the Human Brain | António Damásio | 1994 | Putnam | Amazon |
| 📖 | The Feeling of What Happens: Body, Emotion and the Making of Consciousness | António Damásio | 1999 | Harcourt | Amazon |
| 📖 | How Emotions Are Made: The Secret Life of the Brain | Lisa Feldman Barrett | 2017 | Houghton Mifflin | Amazon |
| 📖 | Emotions in Social Psychology: Essential Readings | W. Gerrod Parrott (Ed.) | 2001 | Psychology Press | Amazon |
| 📖 | The Nature of Emotion: Fundamental Questions | Ekman & Davidson (Eds.) | 1994 | Oxford Univ. Press | Amazon |
💡 Ekman (2003) is the definitive guide to reading facial expressions. Damásio (1994) revolutionized understanding of the emotion-cognition relationship and remains highly influential in affective computing.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing | Björn Schuller & Anton Batliner | 2013 | Wiley | Amazon · Wiley |
| 📖 | Speech and Language Processing (3rd ed.) | Jurafsky & Martin | 2023 | Prentice Hall | Free Draft |
| 📖 | Spoken Language Processing: A Guide to Theory, Algorithm, and System Development | Huang, Acero & Hon | 2001 | Prentice Hall | Amazon |
| 📖 | Fundamentals of Speech Recognition | Rabiner & Juang | 1993 | Prentice Hall | Amazon |
💡 Schuller & Batliner (2013) is the go-to textbook for speech-based emotion and paralinguistics — directly relevant to SER research. Jurafsky & Martin is the standard NLP reference, freely available online.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Sentiment Analysis: Mining Opinions, Sentiments, and Emotions (2nd ed.) | Bing Liu | 2020 | Cambridge Univ. Press | Amazon · Cambridge |
| 📖 | Sentiment Analysis and Opinion Mining | Bing Liu | 2012 | Morgan & Claypool | Free PDF · Amazon |
| 📖 | Natural Language Processing with Python | Bird, Klein & Loper | 2009 | O'Reilly | Free Online · Amazon |
| 📖 | Neural Network Methods for Natural Language Processing | Yoav Goldberg | 2017 | Morgan & Claypool | Amazon |
| 📖 | Speech and Language Processing (3rd ed.) | Jurafsky & Martin | 2023 | Prentice Hall | Free Draft |
💡 Bing Liu (2020) is the definitive NLP text on sentiment analysis, now including deep learning and multimodal emotion analysis. The 2012 version is freely available as a PDF.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Facial Action Coding System (FACS): A Technique for the Measurement of Facial Movement | Ekman & Friesen | 1978 | Consulting Psychologists Press | Reference |
| 📖 | Deep Learning | Goodfellow, Bengio & Courville | 2016 | MIT Press | Free Online · Amazon |
| 📖 | Computer Vision: Algorithms and Applications (2nd ed.) | Richard Szeliski | 2022 | Springer | Free Online · Amazon |
| 📖 | Programming Computer Vision with Python | Jan Erik Solem | 2012 | O'Reilly | Free Online |
💡 Ekman & Friesen's FACS (1978) is the foundational system for coding facial expressions used in virtually all FER datasets. Goodfellow et al. is the essential deep learning reference.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Deep Learning | Goodfellow, Bengio & Courville | 2016 | MIT Press | Free Online · Amazon |
| 📖 | Dive into Deep Learning | Zhang, Lipton, Li & Smola | 2023 | Cambridge Univ. Press | Free Online · Amazon |
| 📖 | Pattern Recognition and Machine Learning | Christopher Bishop | 2006 | Springer | Free PDF · Amazon |
| 📖 | Transformers for Natural Language Processing | Denis Rothman | 2022 | Packt | Amazon |
| 📖 | Attention Is All You Need (landmark paper) | Vaswani et al. | 2017 | NeurIPS | arXiv |
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Multimodal Machine Learning: A Survey and Taxonomy | Baltrušaitis, Ahuja & Morency | 2019 | IEEE TPAMI | arXiv · IEEE |
| 📖 | Foundations and Trends in Multimodal Machine Learning | Liang, Zadeh & Morency | 2022 | Now Publishers | arXiv |
| 📖 | Multimodal Deep Learning | Ngiam et al. | 2011 | ICML | - |
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | The Emotional Brain | Joseph LeDoux | 1996 | Simon & Schuster | Amazon |
| 📖 | Descartes' Error | António Damásio | 1994 | Putnam | Amazon |
| 📖 | How Emotions Are Made | Lisa Feldman Barrett | 2017 | Houghton Mifflin | Amazon |
| 📖 | The Handbook of Emotion (4th ed.) | Lewis, Haviland-Jones & Barrett (Eds.) | 2016 | Guilford Press | Amazon |
| 📖 | Cognitive Neuroscience of Emotion | Lane & Nadel (Eds.) | 2000 | Oxford Univ. Press | Amazon |
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | The Oxford Handbook of Ethics of AI | Dubber, Pasquale & Das (Eds.) | 2020 | Oxford Univ. Press | Amazon |
| 📖 | Weapons of Math Destruction | Cathy O'Neil | 2016 | Crown | Amazon |
| 📖 | Atlas of AI | Kate Crawford | 2021 | Yale Univ. Press | Amazon |
| 📖 | The Oxford Handbook of Affective Computing (Ethics Section) | Calvo et al. (Eds.) | 2015 | Oxford Univ. Press | Amazon |
| If you work on... | Read this first |
|---|---|
| Affective Computing (foundations) | Picard, Affective Computing (1997) |
| Emotion Theory & Psychology | Ekman, Emotions Revealed (2003) |
| Speech Emotion Recognition | Schuller & Batliner, Computational Paralinguistics (2013) |
| Text / Sentiment Analysis | Bing Liu, Sentiment Analysis (2020) |
| Facial Expression Recognition | Ekman & Friesen, FACS (1978) |
| Deep Learning Methods | Goodfellow et al., Deep Learning (2016) |
| Multimodal Fusion | Baltrušaitis et al., Multimodal ML Survey (2019) |
| Neuroscience of Emotion | Damásio, Descartes' Error (1994) |
| AI Ethics & Fairness | O'Neil, Weapons of Math Destruction (2016) |
| Comprehensive Reference | Calvo et al., Oxford Handbook of Affective Computing (2015) |
A = Audio, T = Text, V = Visual, P = Physiological
| Publication | Year | Modality | Uni-modal | Multi-modal | Evaluation | Pipeline | Dataset |
|---|---|---|---|---|---|---|---|
| Speech Commun | 2020 | A | ✅ | ❌ | ✅ | ❌ | ✅ |
| IEEE TAFFC | 2020 | A | ✅ | ❌ | ❌ | ❌ | ❌ |
| Information Fusion | 2020 | A,T,V | ❌ | ✅ | ❌ | ❌ | ✅ |
| Electronics | 2021 | A,T,V | ✅ | ✅ | ❌ | ✅ | ✅ |
| IEEE Signal Process. Mag. | 2021 | A,T,V | ❌ | ✅ | ❌ | ❌ | ✅ |
| Information Science | 2022 | A,T,V | ✅ | ❌ | ❌ | ❌ | ✅ |
| Neurocomputing | 2022 | A,T,V | ❌ | ✅ | ❌ | ❌ | ✅ |
| Information Fusion | 2022 | A,T,V | ✅ | ✅ | ❌ | ❌ | ✅ |
| IEEE TIM | 2023 | V | ✅ | ❌ | ❌ | ❌ | ✅ |
| Proc. IEEE | 2023 | V | ✅ | ❌ | ✅ | ❌ | ✅ |
| IEEE TAFFC | 2023 | T | ✅ | ❌ | ❌ | ❌ | ✅ |
| Speech Commun | 2023 | A | ✅ | ❌ | ❌ | ❌ | ✅ |
| IEEE Access | 2023 | A | ✅ | ❌ | ❌ | ❌ | ✅ |
| Information Fusion | 2023 | A,T,V | ✅ | ✅ | ❌ | ❌ | ✅ |
| Entropy | 2023 | A,T,V | ✅ | ✅ | ✅ | ✅ | ✅ |
| Neurocomputing | 2023 | A,T,V,P | ✅ | ✅ | ❌ | ❌ | ✅ |
| Information Fusion | 2024 | V | ✅ | ❌ | ❌ | ❌ | ✅ |
| Information Fusion | 2024 | A,T,V,P | ❌ | ✅ | ❌ | ❌ | ✅ |
| IEEE Access | 2024 | A,T,V | ❌ | ✅ | ❌ | ❌ | ✅ |
| Expert Syst. Appl. | 2024 | A,T,V | ✅ | ✅ | ❌ | ❌ | ✅ |
| Expert Systems | 2025 | A,T,V | ✅ | ✅ | ❌ | ❌ | ✅ |
| ACM TOMM | 2025 | A,T,V,P | ❌ | ✅ | ❌ | ❌ | ✅ |
| IEEE Access | 2025 | A,T,V | ✅ | ✅ | ❌ | ❌ | ✅ |
| Information Fusion | 2026 | S | ✅ | ❌ | ✅ | ✅ | ✅ |
| Ours | 2026 | A,T,V,P | ✅ | ✅ | ✅ | ✅ | ✅ |
| Dataset | Modality | Emotion Labels | Samples | Paper/Link |
|---|---|---|---|---|
| CK+ | V | Anger, Disgust, Fear, Happy, Sad, Surprise, Neutral, Contempt | 593 videos | Paper |
| AffectNet | V | Neutral, Happy, Sad, Surprise, Fear, Disgust, Anger, Contempt | 1,000,000 images | Paper |
| FER+ | V | Anger, Disgust, Fear, Happy, Sad, Surprise, Neutral, Contempt | 35,887 images | Paper |
| RAF-DB | V | Basic & compound emotions | 29,672 images | Paper |
| EmoReact | V | Curiosity, Uncertainty, Excitement, Happy, Surprise, Disgust, Fear, Frustration | 1,102 videos | Paper |
| Aff-Wild2 | V | Valence, Arousal | 558 videos | Paper |
| FERV39K | V | 7 basic emotions | 38,935 video clips | Paper |
| Dataset | Modality | Emotion Labels | Samples | Paper/Link |
|---|---|---|---|---|
| TESS | A | Anger, Disgust, Fear, Happy, Pleasant Surprise, Sadness, Neutral | 2,800 utterances | Paper |
| EmoDB 2.0 | A | Anger, Boredom, Disgust, Fear, Happy, Neutral, Sadness | 817 utterances | Paper |
| RAVDESS | A, V | Calm, Happy, Sad, Angry, Fearful, Surprise, Disgust | 7,356 videos | Paper |
| IEMOCAP | A, V, T | Happy, Angry, Sad, Frustrated, Neutral; Valence, Arousal, Dominance | 12.46h video | Paper |
| MSP-Podcast | A | Anger, Contempt, Disgust, Fear, Happy, Neutral, Sadness, Surprise | 264,705 turns | Paper |
| CREMA-D | A, V | Anger, Disgust, Fear, Happy, Neutral, Sad | 7,442 clips | GitHub |
| EMO-DB | A | Anger, Boredom, Disgust, Fear, Happy, Neutral, Sad | 535 utterances | - |
| Dataset | Modality | Emotion Labels | Samples | Paper/Link |
|---|---|---|---|---|
| ISEAR | T | Joy, Fear, Anger, Sadness, Disgust, Shame, Guilt | 7,666 sentences | Paper |
| EmoBank | T | Valence, Arousal, Dominance (writer & reader perspectives) | 10,548 sentences | Paper |
| SemEval-2018 Task 1 | T | 11 emotions + Neutral | 22,000 sentences | Paper |
| GoEmotions | T | 27 emotion categories | 58,000 Reddit comments | Paper |
| Empathetic Dialogues | T | 32 emotion categories | 24,850 conversations | Paper |
| WRIME | T | 8 emotions (reader/writer) | 17,000 social media posts | Paper |
| Dataset | Modality | Type | Emotion Labels | Samples | Paper/Link |
|---|---|---|---|---|---|
| eNTERFACE'05 | A, V | Acted | Anger, Disgust, Fear, Happy, Sad, Surprise | 1,166 videos | Paper |
| SAVEE | A, V | Acted | Anger, Disgust, Fear, Happy, Sad, Surprise, Neutral | 480 videos | Paper |
| AFEW | A, V | Natural | Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise | 1,426 videos | Paper |
| CHEAVD | A, V | Natural | Anger, Happy, Sad, Worried, Anxious, Surprise, Disgust, Neutral | 140min video | Paper |
| SEWA | A, V | Natural | Valence, Arousal | 2,000min video | Paper |
| AMIGOS | A, V | Natural | Valence, Arousal, Dominance | 40 videos | Paper |
| CMU-MOSI | A, V, T | Induced | Continuous Sentiment Score | 3,702 clips | Paper |
| CMU-MOSEI | A, V, T | Induced | Happy, Sad, Angry, Disgust, Surprise, Fear | 23,500 clips | Paper |
| MELD | A, V, T | Induced | Anger, Disgust, Fear, Joy, Neutral, Sad, Surprise | 13,708 utterances | GitHub |
| IEMOCAP | A, V, T | Induced | Happy, Angry, Sad, Frustrated, Neutral | 12.46h video | Paper |
| CH-SIMS | A, V, T | Induced | 5-class sentiment | 2,281 clips | Paper |
| RAMAS | A, V | Induced | Anger, Sad, Disgust, Happy, Fear, Surprise | 7h video | Paper |
| MER2023 | A, V, T | Natural | 6 discrete + continuous | 5,030 clips | Paper |
| MER2024 | A, V, T | Natural | Multi-label + OV | Extended | Paper |
| MER2025 | A, V, T | Natural | Open-vocabulary | Extended | Paper |
| Model | Framework | Input | Loss | Performance | Dataset | Paper |
|---|---|---|---|---|---|---|
| C3D | 3D Conv | Video | Softmax | Acc: 59.02% | AFEW | Fan et al., 2016 |
| I3D | Inflated 3D | Video | Softmax | Acc: 68.90% | GreSti | Ghaleb et al., 2021 |
| SlowFast | Dual CNN | Video | Softmax | WAR: 49.34% | FERV39K | Neshov et al., 2024 |
| ViT-B/16+SAM | Transformer | Video | Cross-Entropy | Acc: 52.42% | FER-2013 | Arnab et al., 2021 |
| DTL-I-ResNet18 | 3D ResNet | Video | Softmax | Acc: 83.0% | FER2013 | Helaly et al., 2023 |
| ESTLNet | CNN-LSTM | Video | Cross-Entropy | Acc: 53.79% | AFEW | Wang et al., 2022 |
| D2SP | Dual Purification | Video | Cross-Entropy | WAR: 50.5% | FERV39k | CVPR 2025 |
| Model | Framework | Input | Loss | Performance | Dataset | Paper |
|---|---|---|---|---|---|---|
| HuBERT | CNN+Transformer | Raw audio | Contrastive | WA: 79.58% | IEMOCAP | Wang et al., 2021 |
| Wav2Vec | 1D CNN | Raw audio | Contrastive | WA: 77.00% | IEMOCAP | Wang et al., 2021 |
| emotion2vec | Online Distillation | Raw audio | Utterance+Frame | WA: 85.0% | RAVDESS | Ma et al., 2024 |
| SL-GEmo-CLAP | CNN+Transformer | WavLM-large | KL Loss | WAR: 81.43% | IEMOCAP | Pan et al., 2024 |
| WavLM | CNN+Transformer | Raw audio | Discriminative | Macro-F1: 33.6% | IEMOCAP | Wu et al., 2024 |
| Mockingjay | NPC | Raw audio | L1/MSE | Acc: 50.28% | IEMOCAP | Liu et al., 2024 |
| DeCoAR | SVM | Mel FBANK | L1/MSE | UAR: 71.93% | IEMOCAP | Stanea et al., 2023 |
| Vesper | CNN+Transformer | Raw audio | MSE | WA: 54.2% | IEMOCAP | Chen et al., 2024 |
| Audio-Transformer | Transformer | Spectrogram | Cross-Entropy | Acc: 75.42% | EMO-DB | Bayraktar et al., 2023 |
| DTNet | CNN+Transformer | Raw audio | Cross-Entropy | UA: 74.8% | IEMOCAP | Yuan et al., 2024 |
| Model | Framework | Input | Loss | Performance | Dataset | Paper |
|---|---|---|---|---|---|---|
| BERT | Transformer | Text token | MLM+NSP | Acc: 70.09% | ISEAR | Adoma et al., 2020 |
| RoBERTa | Transformer | Text token | Cross-Entropy | Acc: 74.31% | ISEAR | Adoma et al., 2020 |
| XLNet | Transformer | Permuted tokens | Permuted LM | Acc: 72.99% | ISEAR | Adoma et al., 2020 |
| ALBERT | Transformer | Text token | Focal+KL | Acc: 73.86% | ISEAR | Adoma et al., 2020 |
| DistilBERT | Transformer | Text token | MLM+Distillation | Acc: 66.93% | ISEAR | Adoma et al., 2020 |
| DeBERTa-v3 | Transformer | Text token | Cross-Entropy | F1: 66.2% | WRIME | Takenaka et al., 2025 |
| ChatGPT-4o | Transformer | Text token | Prompt-based | F1: 52.7% | WRIME | Atitienei et al., 2024 |
| GloVe | Co-occurrence matrix | Text tokens | Weighted LS | Acc: 95.09% | - | Gupta et al., 2021 |
| Word2Vec | CBOW | Text tokens | Hierarchical Softmax | Macro-F1: 73.21% | Tweets | Tang et al., 2014 |
| ELMo | BiLSTM | Context. vectors | Cross-Entropy | Acc: 88.91% | Wikipedia | Yang et al., 2021 |
| COMET | Transformer | Commonsense triple | Cross-Entropy | W-Avg F1: 65.21% | MELD | Zhang et al., 2021 |
FER has evolved from hand-crafted descriptors (LBP, HOG, Gabor) → CNN-based end-to-end learning → spatio-temporal models → Transformer-based architectures → self-supervised pretraining.
SER has transitioned from hand-crafted prosodic/spectral features → deep CNN/LSTM → Transformer-based → Self-Supervised Learning (SSL) as the dominant paradigm.
| Title | Venue | Date | Code |
|---|---|---|---|
| emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation | arXiv | 2023 | GitHub |
| SL-GEmo-CLAP: Contrastive Language-Audio Pretraining for Speech Emotion | Interspeech | 2024 | - |
| HuBERT: Self-Supervised Speech Representation Learning | IEEE/ACM TASLP | 2021 | GitHub |
| WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | IEEE JSTSP | 2022 | GitHub |
| Wav2Vec: Unsupervised Pre-Training for Speech Recognition | Interspeech | 2019 | GitHub |
| Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition | IEEE TASLP | 2024 | GitHub |
| DTNet: Disentanglement Learning for Speech Emotion Recognition | ICASSP | 2024 | - |
| Audio Transformer for Speech Emotion Recognition | ACM MM Asia | 2023 | - |
| Mockingjay: Unsupervised Speech Representation Learning | ICASSP | 2020 | GitHub |
TER has evolved from lexicon-based methods → static embeddings → transformer pretraining (BERT family) → Large Language Models enabling zero-shot generalization.
Fusion strategy determines when and how modalities interact: Early Fusion (feature-level) → Late Fusion (decision-level) → Model-level Fusion → Hybrid Fusion.
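To make the distinction concrete, here is a minimal sketch (synthetic data, hypothetical feature dimensions) contrasting early fusion, where features are concatenated before a single classifier, with late fusion, where per-modality classifiers are combined at the decision level. The table below lists representative methods for each fusion type.

```python
# Minimal early-vs-late fusion sketch (synthetic features, hypothetical dims).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
audio = rng.normal(size=(n, 88))    # e.g. eGeMAPS-sized audio features (assumed)
text = rng.normal(size=(n, 768))    # e.g. BERT sentence embeddings (assumed)
y = rng.integers(0, 4, size=n)      # 4 emotion classes

# Early fusion: modalities interact at the feature level.
early = LogisticRegression(max_iter=1000).fit(np.hstack([audio, text]), y)
p_early = early.predict_proba(np.hstack([audio, text]))

# Late fusion: modalities only interact at the decision level.
clf_a = LogisticRegression(max_iter=1000).fit(audio, y)
clf_t = LogisticRegression(max_iter=1000).fit(text, y)
p_late = 0.5 * clf_a.predict_proba(audio) + 0.5 * clf_t.predict_proba(text)
```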
| Title | Venue | Date | Code | Fusion Type |
|---|---|---|---|---|
| M²FNet: Multi-scale Multi-modal Fusion Network for Emotion Recognition in Conversations | CVPRW | 2022 | GitHub | Early |
| TDFNet: Text-Directed Fusion Network for Multimodal Sentiment Analysis | arXiv | 2023 | - | Early |
| UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition | EMNLP | 2022 | GitHub | Hybrid |
| Cross-Modal Fusion Network with Dual-Task Interaction | IEEE TAFFC | 2023 | - | Hybrid |
| Memory Fusion Network for Multi-view Sequential Learning | AAAI | 2018 | GitHub | Late |
| MISA: Modality-Invariant and -Specific Representations | ACM MM | 2020 | GitHub | Model-level |
| Efficient Low-rank Multimodal Fusion with Modality-Specific Factors | ACL | 2018 | GitHub | Model-level |
Four core granularity challenges: Modality Alignment · Modality Dominance · Modality Complementarity · Modality Robustness
| Title | Venue | Date | Code | Granularity |
|---|---|---|---|---|
| MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences | ACL | 2019 | GitHub | Alignment |
| TFN: Tensor Fusion Network for Multimodal Sentiment Analysis | EMNLP | 2017 | GitHub | Alignment |
| DialogueMMT: Distribution-Aware Multi-modal Dialogue Emotion Recognition | Interspeech | 2025 | - | Alignment |
| MAG-BERT: Integrating Multimodal Information in Large Pretrained Transformers | ACL | 2020 | GitHub | Dominance |
| MMIN: Missing Modality Imagination Network for Emotion Recognition | AAAI | 2021 | GitHub | Robustness |
| IMDer: Incomplete Multimodal Learning for Emotion Recognition | AAAI | 2024 | - | Robustness |
| GCNet: Graph Completion Network for Incomplete Multimodal Learning in Conversation | IEEE TPAMI | 2023 | GitHub | Robustness |
Multi-modal fusion architectures are broadly classified into 7 categories: Kernel-based · Graph-based · Neural Network-based · Attention-based · Transformer-based · Generative-based · LLM-based.
Multiple kernel learning (MKL) established the first principled framework for multi-modal fusion by associating each modality with its own similarity function, recognizing that heterogeneous affective signals cannot be compared through a single shared kernel.
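As a rough illustration of the idea (not any specific published system), the sketch below combines one RBF kernel per modality with hand-set weights and feeds the combined kernel to a precomputed-kernel SVM; a full MKL method would learn the combination weights jointly with the classifier.

```python
# MKL-style fusion sketch: one kernel per modality, weighted combination, SVM.
# Data shapes and weights are hypothetical.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_test = 200, 50
X_audio_tr, X_audio_te = rng.normal(size=(n_train, 88)), rng.normal(size=(n_test, 88))
X_video_tr, X_video_te = rng.normal(size=(n_train, 512)), rng.normal(size=(n_test, 512))
y_tr = rng.integers(0, 4, size=n_train)  # 4 emotion classes

def combined_kernel(Xa, Xa_ref, Xv, Xv_ref, w_audio=0.4, w_video=0.6):
    """Weighted sum of per-modality RBF kernels (the MKL combination rule)."""
    return w_audio * rbf_kernel(Xa, Xa_ref) + w_video * rbf_kernel(Xv, Xv_ref)

K_train = combined_kernel(X_audio_tr, X_audio_tr, X_video_tr, X_video_tr)
K_test = combined_kernel(X_audio_te, X_audio_tr, X_video_te, X_video_tr)

clf = SVC(kernel="precomputed").fit(K_train, y_tr)
pred = clf.predict(K_test)
```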
| Title | Venue | Date | Code |
|---|---|---|---|
| Multiple Kernel Learning for Emotion Recognition in the Wild | ACM MM | 2013 | - |
| Emotion Recognition in the Wild via CNN and Mapped Binary Patterns | ACM MM | 2015 | - |
| Ensemble of SVM Trees for Multimodal Emotion Recognition | ACII | 2017 | - |
Graph-based models treat dialogue as a heterogeneous graph where nodes represent utterances and edges encode speaker, temporal, and cross-modal dependencies — capturing that emotional meaning emerges from relational structure, not just individual utterances.
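A minimal sketch of this idea follows, with hypothetical shapes and a hand-built adjacency rather than any particular published graph construction: utterance nodes are linked by temporal and same-speaker edges, and one round of normalized message passing mixes neighbour information before per-utterance classification.

```python
# Toy dialogue-as-graph sketch: utterance nodes, speaker/temporal edges,
# one step of normalized message passing (GCN-like), then classification.
import torch

num_utt, dim, num_classes = 6, 256, 7
x = torch.randn(num_utt, dim)                 # per-utterance fused features
speakers = torch.tensor([0, 1, 0, 1, 0, 1])   # speaker id per utterance

# Build adjacency: temporal neighbours + same-speaker links + self-loops.
A = torch.eye(num_utt)
for i in range(num_utt - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0           # temporal edge
for i in range(num_utt):
    for j in range(num_utt):
        if speakers[i] == speakers[j]:
            A[i, j] = 1.0                      # speaker edge

deg = A.sum(dim=1, keepdim=True)
A_norm = A / deg                               # row-normalized propagation matrix

W = torch.nn.Linear(dim, dim)
classifier = torch.nn.Linear(dim, num_classes)

h = torch.relu(W(A_norm @ x))                  # one graph-convolution step
logits = classifier(h)                         # per-utterance emotion logits
```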
| Title | Venue | Date | Code |
|---|---|---|---|
| DialogueGCN: A Graph Convolutional Neural Network for ERC | EMNLP | 2019 | GitHub |
| COGMEN: COntextualized GNN based Multimodal Emotion recognitioN | NAACL | 2022 | GitHub |
| M3GAT: Multi-granularity Multi-scale Multi-modal Graph Attention Network | ACM TOMM | 2023 | - |
| M²FNet: Multi-scale Multi-modal Fusion Network for ERC | CVPRW | 2022 | GitHub |
| MMGCN: Multi-relational Graph Convolutional Network for Multimodal ERC | ACM MM | 2021 | GitHub |
| Hierarchical Heterogeneous Graph for Multimodal ERC | AAAI | 2025 | - |
| Decoupled Distillation Graph for Cross-modal ERC | ACM MM | 2024 | - |
| Persona-aware ERC with Graph Network and Turn Interaction | EMNLP | 2024 | - |
Early neural MER established the template of separate uni-modal encoders (CNN/LSTM) followed by joint fusion — but the concatenate-then-classify bottleneck allows classifiers to ignore weaker modalities whenever the dominant one minimizes training loss.
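The sketch below illustrates that template in PyTorch, with hypothetical dimensions: two independent uni-modal encoders, feature concatenation, and a shared linear classifier. Nothing in the objective forces the classifier to rely on both branches, which is exactly the dominance issue noted above.

```python
# Concatenate-then-classify MER template (hypothetical dims, toy sketch).
import torch
import torch.nn as nn

class ConcatFusionMER(nn.Module):
    def __init__(self, audio_dim=40, text_dim=300, hidden=128, num_classes=4):
        super().__init__()
        self.audio_enc = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.text_enc = nn.LSTM(text_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, audio_seq, text_seq):
        _, (h_a, _) = self.audio_enc(audio_seq)        # last hidden state per modality
        _, (h_t, _) = self.text_enc(text_seq)
        fused = torch.cat([h_a[-1], h_t[-1]], dim=-1)  # concatenate-then-classify
        return self.classifier(fused)

model = ConcatFusionMER()
logits = model(torch.randn(8, 100, 40), torch.randn(8, 20, 300))  # (batch, time, dim)
```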
| Title | Venue | Date | Code |
|---|---|---|---|
| Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities | IEEE TAFFC | 2011 | - |
| LSTM-based Multimodal Affect Prediction | ACII | 2013 | - |
| End-to-End Multimodal Emotion Recognition using Deep Neural Networks | IEEE JSTSP | 2017 | - |
| Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling | KBS | 2019 | GitHub |
| Semi-supervised Multimodal Emotion Recognition | AAAI | 2017 | - |
| Ensemble CNN for Multimodal Emotion Recognition | IEEE TAFFC | 2020 | - |
| MIST: Multi-modal Integration with Semi-supervised and Transfer Learning | arXiv | 2025 | - |
| DISD-Net: Dynamic Interaction Self-Distillation for Cross-Subject ERC | arXiv | 2025 | - |
Attention mechanisms span four complementary granularities: self-modal (intra-modal noise), cross-modal (inter-modal integration), spatial (discriminative region selection), temporal (salient moment capture) — each targeting distinct failure modes, not interchangeable solutions.
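As an example of the cross-modal granularity, the sketch below (hypothetical shapes, standard PyTorch multi-head attention) lets text tokens act as queries over the audio frame sequence, so each word attends to the acoustic evidence most relevant to it.

```python
# Cross-modal attention sketch: text queries attend over audio frames.
import torch
import torch.nn as nn

batch, t_text, t_audio, dim = 8, 20, 100, 256
text = torch.randn(batch, t_text, dim)    # text token features (assumed pre-encoded)
audio = torch.randn(batch, t_audio, dim)  # audio frame features (assumed pre-encoded)

cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
# query = text, key/value = audio -> audio-aware text representations
attended, weights = cross_attn(query=text, key=audio, value=audio)
utterance_repr = attended.mean(dim=1)      # pool for utterance-level emotion prediction
```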
| Title | Venue | Date | Code |
|---|---|---|---|
| MultiEMO: An Attention-Based Correlation-Aware Multimodal Fusion Framework | ACL | 2023 | GitHub |
| CTNet: Conversational Transformer Network for Emotion Recognition | IEEE/ACM TASLP | 2021 | - |
| Attentive Modality Hopping for Speech Emotion Recognition | ICASSP | 2020 | - |
| Multi-Modal Multi-Scale Temporal Self-Attention for Multimodal Sentiment Analysis | ACM MM | 2021 | - |
| Cross-Modal Residual Attention for Multimodal ERC | ACL | 2021 | - |
| Spatial Attention for Image-Text Emotion Recognition | AAAI | 2020 | - |
| Temporal Attention for Video-based Emotion Recognition | ICCV | 2019 | - |
| Knowledge-aware Graph-based Co-Attention for Multimodal ERC | EMNLP | 2023 | - |
| MSER: Multi-Scale Emotion Recognition with Orthogonal Learning | AAAI | 2024 | - |
| Phy-FusionNet: Temporal Attention with Memory-Augmented Periodic Modeling | arXiv | 2025 | - |
| Conv-Attention Adapter for LLM-based ERC | arXiv | 2024 | - |
| Joint Transformer-Attention for Multimodal ERC | ICASSP | 2023 | - |
| Bayesian Co-Attention for Uncertainty-Aware Multimodal ERC | ACL | 2023 | - |
Transformers perform modality alignment and cross-modal interaction simultaneously, through cross-attention at every layer rather than as sequential steps, which provides structural robustness to noise and modality inconsistency. Key open question: do the gains come from the architecture itself or from the pretrained representations?
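A minimal sketch in the spirit of MulT-style crossmodal transformers (hypothetical shapes, not the original implementation): cross-attention is applied inside every block, so the text stream re-attends to audio at each layer instead of fusing once at the end.

```python
# Layered cross-modal transformer sketch: cross-attention at every layer.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, target, source):
        # target (e.g. text) queries source (e.g. audio), with residual connections
        h, _ = self.attn(self.norm1(target), self.norm1(source), self.norm1(source))
        target = target + h
        return target + self.ff(self.norm2(target))

layers = nn.ModuleList([CrossModalBlock() for _ in range(4)])
text, audio = torch.randn(8, 20, 256), torch.randn(8, 100, 256)
for layer in layers:                 # cross-modal interaction happens at every layer
    text = layer(text, audio)
```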
Generative models address two problems discriminative models cannot solve by design: (1) data scarcity via GAN/diffusion augmentation, and (2) missing-modality robustness via conditional reconstruction. Critical insight: perceptual plausibility ≠ affective coherence — generative and discriminative objectives must be jointly optimized.
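To illustrate the conditional-reconstruction idea (a toy sketch with hypothetical embedding sizes, not any specific published model): a small network imputes a missing audio embedding from text and video, and the reconstruction loss is optimized jointly with the emotion classification loss so that imputed features remain affect-relevant rather than merely plausible.

```python
# Missing-modality reconstruction with a joint generative + discriminative objective.
import torch
import torch.nn as nn

dim, num_classes = 256, 4
reconstructor = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
classifier = nn.Linear(3 * dim, num_classes)

text = torch.randn(8, dim)
video = torch.randn(8, dim)
audio_true = torch.randn(8, dim)        # available only at training time
labels = torch.randint(0, num_classes, (8,))

audio_hat = reconstructor(torch.cat([text, video], dim=-1))   # impute missing audio
logits = classifier(torch.cat([text, video, audio_hat], dim=-1))

loss = nn.functional.mse_loss(audio_hat, audio_true) \
     + nn.functional.cross_entropy(logits, labels)   # joint reconstruction + classification
loss.backward()
```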
| Title | Venue | Date | Code |
|---|---|---|---|
| IMDer: Incomplete Multimodal Learning via Diffusion for ERC | AAAI | 2024 | - |
| GAN-based Multimodal Emotion Data Augmentation | ICASSP | 2019 | - |
| MALN: Modality-Adversarial Learning Network for Multimodal ERC | ACM MM | 2023 | - |
| Deep Autoencoder-Based Fusion for Multimodal Sentiment Analysis | ACL | 2021 | - |
| Diffusion-Based Multi-modal Emotion Recovery | arXiv | 2025 | - |
| DiffuFuse: Diffusion-based Fusion for Incomplete Multimodal ERC | arXiv | 2025 | - |
| Progressive Cross-Modal Reconstruction under Missing Modality | arXiv | 2025 | - |
| RoHyDr: Robust Hybrid Discriminative-Generative MER | arXiv | 2025 | - |
| Cross-Modal Adversarial-Generative Framework for Robust ERC | ACM MM | 2024 | - |
| Novel Autoencoder Fusion with Affective Discriminative Constraints | ICASSP | 2024 | - |
LLMs shift MER from task-specific fusion toward language-mediated reasoning over heterogeneous inputs — enabling zero-shot generalization, open-vocabulary recognition, and natural language explainability. Key challenges remain: prompt sensitivity, multimodal hallucination, and memory forgetting.
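As an illustration of language-mediated reasoning (a toy sketch; the prompt wording and the model call are hypothetical and depend on the MLLM used), heterogeneous cues are verbalized and placed into a single open-vocabulary prompt:

```python
# Prompt-construction sketch for LLM-based emotion reasoning (hypothetical wording).
def build_emotion_prompt(transcript, audio_description, face_description):
    return (
        "You are an emotion analysis assistant.\n"
        f"Transcript: {transcript}\n"
        f"Vocal cues: {audio_description}\n"
        f"Facial cues: {face_description}\n"
        "Describe the speaker's emotional state in your own words "
        "(open vocabulary), then briefly explain which cues support it."
    )

prompt = build_emotion_prompt(
    transcript="I guess it's fine, whatever you want.",
    audio_description="flat pitch, slow speaking rate, long pauses",
    face_description="averted gaze, slight frown",
)
# response = some_multimodal_llm(prompt)  # placeholder: model and API are not specified here
```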
| Architecture | Representative Models | Core Strength | Key Limitation |
|---|---|---|---|
| Kernel-based | MKL, SVM-Ensemble | Modality-specific similarity kernels | Poor scalability, cannot be end-to-end |
| Graph-based | DialogueGCN, COGMEN, M3GAT | Relational & conversational structure | Graph construction sensitivity |
| Neural Network | LSTM-based, CNN-based, MIST | Temporal & sequential emotion modeling | Concatenate-then-classify bottleneck |
| Attention-based | MultiEMO, Phy-FusionNet, MSER | Selective salient cue capture | Noise amplification from unreliable cues |
| Transformer-based | MulT, UniMSE, CTNet, TDFNet | Simultaneous cross-modal alignment | Pretrain bias; compute cost |
| Generative-based | IMDer, DiffuFuse, MALN | Missing-modality robustness; data augmentation | Plausibility ≠ affective coherence |
| LLM-based | AffectGPT, EmoLLM, R1-Omni | Zero-shot + explainability + open-vocabulary | Prompt sensitivity; hallucination |
If you find this repository or our survey useful for your research, please cite:
@article{luo2026comprehensive,
title = {A Comprehensive Review in Unimodal and Multimodal Emotion Recognition},
author = {Luo, Jiachen and Yang, Qu and He, Jiajun and Hua, Yining and
Zheng, Lian and Li, Yuanchao and Song, Siyang and Mathur, Leena and
Wen, Wu and Wang, Dingdong and Shen, Shuai and Wu, Jingyao and
Hu, Guimin and Hu, He and Li, Yong and Zhang, Zixing and
Wang, Jiadong and Zhou, Sifan and Tang, Zuojin and Cao, Canran and
Xu, Sheng and Zhao, Zhenjun and Toda, Tomoki and Xue, Xiangyang and
Zhao, Siyang and Sun, Licai and Zhang, Liyun and Cai, Cong and
Du, Jiamin and Ma, Ziyang and Chen, Mingjie and Qian, Chengxuan and
Phan, Huy and Wang, Lin and Schuller, Bjoern and Reiss, Joshua},
journal = {ACM Transactions on Intelligent Systems and Technology},
year = {2026},
note = {Resources: \url{https://github.com/jackchen69/Awesome-Emotion-Models}}
}

We welcome contributions! If you have papers, datasets, or models to add:
- Fork this repository
- Add your entry following the existing table format
- Submit a Pull Request with a brief description
Please ensure the added work is peer-reviewed or on arXiv with verifiable results.
- Jiachen Luo — jiachen.luo@qmul.ac.uk — Queen Mary University of London / TU Munich
- Lin Wang — lin.wang@qmul.ac.uk — Queen Mary University of London
- Bjoern Schuller — schuller@tum.de — Imperial College London / TU Munich
- Joshua Reiss — joshua.reiss@qmul.ac.uk — Queen Mary University of London
💬 WeChat Group: Scan the QR code here to join our Emo discussion group (all are welcome)
⭐ Star this repository if you find it helpful! ⭐

