🔥🔥🔥 A Comprehensive Review in Unimodal and Multimodal Emotion Recognition
[📄 Paper] | [🌟 Project Page (This Page)] | [📝 Citation] | [💬 WeChat Group (Emo WeChat discussion group, all welcome)]
This survey provides a unified synthesis of deep learning-based uni-modal and multi-modal emotion recognition within a coherent analytical framework that spans the full learning pipeline — from emotion modeling and dataset curation to modality-specific representation learning, fusion strategy design, and evaluation.
Key Contributions:
- 🔬 Deep Analytical Framework: A structured taxonomy covering data preprocessing, input representations, uni-modal learning, multi-modal fusion, and evaluation strategies.
- 📚 Systematic Synthesis: Comprehensive comparison of uni-modal (Face, Speech, Text) and multi-modal emotion recognition methods.
- 🗺️ Future Roadmap: Concrete research directions grounded in identified gaps across modeling, data, and evaluation.
Resources: https://github.com/jackchen69/Awesome-Emotion-Models
🔥🔥🔥 EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models
📽 Demo | 📖 Paper | 🌟 GitHub | 🤖 Basic Demo | 💬 WeChat
A representative evaluation benchmark for multimodal emotion recognition. All code has been released! ✨
| 🔥 Work | Links |
|---|---|
| MERBench: A Unified Evaluation Benchmark for Multimodal Emotion Recognition | [Paper] [GitHub] |
| emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation | [Paper] [GitHub] |
| Uncertain Multimodal Intention and Emotion Understanding in the Wild | [Paper] [GitHub] |
| MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark | [Paper] [GitHub] |
| Belief Mismatch Coefficient (BMC) ⭐ ACII 2023 Best Paper | [Paper] |
| 1st Place Solution to Odyssey Emotion Recognition Challenge Task1 🥇 | [Paper] |
| Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective | [Paper] [GitHub] |
| HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition | [Paper] [GitHub] |
| Spectral Representation of Behaviour Primitives for Depression Analysis ⭐ IEEE TAFFC Best Paper Runner-Up | [Paper] [GitHub] |
| Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning | [Paper] |
| A Scoping Review of Large Language Models for Generative Tasks in Mental Health Care (NPJ Digital Medicine) | [Paper] |
Essential reading for researchers in affective computing, emotion recognition, and related fields. Organized by theme.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Affective Computing | Rosalind W. Picard | 1997 | MIT Press | Amazon · MIT Press |
| 📖 | The Oxford Handbook of Affective Computing | Calvo, D'Mello, Gratch, Kappas (Eds.) | 2015 | Oxford Univ. Press | Amazon · OUP |
| 📖 | Applied Affective Computing | Schuller, Batliner et al. | 2022 | ACM Books | ACM DL |
| 📖 | The Empathic Brain | Christian Keysers | 2011 | Social Brain Press | Amazon |
| 📖 | Wired for Culture: Origins of the Human Social Mind | Mark Pagel | 2012 | Norton | Amazon |
💡 Picard (1997) is the founding text of affective computing — essential first read. The Oxford Handbook is the most comprehensive reference with 41 chapters on detection, generation, methodology, and applications.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Emotions Revealed: Recognizing Faces and Feelings | Paul Ekman | 2003 | Times Books | Amazon |
| 📖 | The Expression of the Emotions in Man and Animals | Charles Darwin | 1872 | John Murray | Free PDF · Amazon |
| 📖 | Emotion: Theory, Research, and Experience (Vol. 1) | Robert Plutchik & Henry Kellerman (Eds.) | 1980 | Academic Press | Amazon |
| 📖 | Handbook of Affective Sciences | Davidson, Scherer, Goldsmith (Eds.) | 2003 | Oxford Univ. Press | Amazon · OUP |
| 📖 | The Emotional Brain | Joseph LeDoux | 1996 | Simon & Schuster | Amazon |
| 📖 | Descartes' Error: Emotion, Reason, and the Human Brain | António Damásio | 1994 | Putnam | Amazon |
| 📖 | The Feeling of What Happens: Body, Emotion and the Making of Consciousness | António Damásio | 1999 | Harcourt | Amazon |
| 📖 | How Emotions Are Made: The Secret Life of the Brain | Lisa Feldman Barrett | 2017 | Houghton Mifflin | Amazon |
| 📖 | Emotions in Social Psychology: Essential Readings | W. Gerrod Parrott (Ed.) | 2001 | Psychology Press | Amazon |
| 📖 | The Nature of Emotion: Fundamental Questions | Ekman & Davidson (Eds.) | 1994 | Oxford Univ. Press | Amazon |
💡 Ekman (2003) is the definitive guide to reading facial expressions. Damásio (1994) revolutionized understanding of the emotion-cognition relationship and remains highly influential in affective computing.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing | Björn Schuller & Anton Batliner | 2013 | Wiley | Amazon · Wiley |
| 📖 | Speech and Language Processing (3rd ed.) | Jurafsky & Martin | 2023 | Prentice Hall | Free Draft |
| 📖 | Spoken Language Processing: A Guide to Theory, Algorithm, and System Development | Huang, Acero & Hon | 2001 | Prentice Hall | Amazon |
| 📖 | Fundamentals of Speech Recognition | Rabiner & Juang | 1993 | Prentice Hall | Amazon |
💡 Schuller & Batliner (2013) is the go-to textbook for speech-based emotion and paralinguistics — directly relevant to SER research. Jurafsky & Martin is the standard NLP reference, freely available online.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Sentiment Analysis: Mining Opinions, Sentiments, and Emotions (2nd ed.) | Bing Liu | 2020 | Cambridge Univ. Press | Amazon · Cambridge |
| 📖 | Sentiment Analysis and Opinion Mining | Bing Liu | 2012 | Morgan & Claypool | Free PDF · Amazon |
| 📖 | Natural Language Processing with Python | Bird, Klein & Loper | 2009 | O'Reilly | Free Online · Amazon |
| 📖 | Neural Network Methods for Natural Language Processing | Yoav Goldberg | 2017 | Morgan & Claypool | Amazon |
| 📖 | Speech and Language Processing (3rd ed.) | Jurafsky & Martin | 2023 | Prentice Hall | Free Draft |
💡 Bing Liu (2020) is the definitive NLP text on sentiment analysis, now including deep learning and multimodal emotion analysis. The 2012 version is freely available as a PDF.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Facial Action Coding System (FACS): A Technique for the Measurement of Facial Movement | Ekman & Friesen | 1978 | Consulting Psychologists Press | Reference |
| 📖 | Deep Learning | Goodfellow, Bengio & Courville | 2016 | MIT Press | Free Online · Amazon |
| 📖 | Computer Vision: Algorithms and Applications (2nd ed.) | Richard Szeliski | 2022 | Springer | Free Online · Amazon |
| 📖 | Programming Computer Vision with Python | Jan Erik Solem | 2012 | O'Reilly | Free Online |
💡 Ekman & Friesen's FACS (1978) is the foundational system for coding facial expressions used in virtually all FER datasets. Goodfellow et al. is the essential deep learning reference.
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Deep Learning | Goodfellow, Bengio & Courville | 2016 | MIT Press | Free Online · Amazon |
| 📖 | Dive into Deep Learning | Zhang, Lipton, Li & Smola | 2023 | Cambridge Univ. Press | Free Online · Amazon |
| 📖 | Pattern Recognition and Machine Learning | Christopher Bishop | 2006 | Springer | Free PDF · Amazon |
| 📖 | Transformers for Natural Language Processing | Denis Rothman | 2022 | Packt | Amazon |
| 📖 | Attention Is All You Need (landmark paper) | Vaswani et al. | 2017 | NeurIPS | arXiv |
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | Multimodal Machine Learning: A Survey and Taxonomy | Baltrušaitis, Ahuja & Morency | 2019 | IEEE TPAMI | arXiv · IEEE |
| 📖 | Foundations and Trends in Multimodal Machine Learning | Liang, Zadeh & Morency | 2022 | Now Publishers | arXiv |
| 📖 | Multimodal Deep Learning | Ngiam et al. | 2011 | ICML | - |
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | The Emotional Brain | Joseph LeDoux | 1996 | Simon & Schuster | Amazon |
| 📖 | Descartes' Error | António Damásio | 1994 | Putnam | Amazon |
| 📖 | How Emotions Are Made | Lisa Feldman Barrett | 2017 | Houghton Mifflin | Amazon |
| 📖 | The Handbook of Emotion (4th ed.) | Lewis, Haviland-Jones & Barrett (Eds.) | 2016 | Guilford Press | Amazon |
| 📖 | Cognitive Neuroscience of Emotion | Lane & Nadel (Eds.) | 2000 | Oxford Univ. Press | Amazon |
| Cover | Title | Author(s) | Year | Publisher | Link |
|---|---|---|---|---|---|
| 📖 | The Oxford Handbook of Ethics of AI | Dubber, Pasquale & Das (Eds.) | 2020 | Oxford Univ. Press | Amazon |
| 📖 | Weapons of Math Destruction | Cathy O'Neil | 2016 | Crown | Amazon |
| 📖 | Atlas of AI | Kate Crawford | 2021 | Yale Univ. Press | Amazon |
| 📖 | The Oxford Handbook of Affective Computing (Ethics Section) | Calvo et al. (Eds.) | 2015 | Oxford Univ. Press | Amazon |
| If you work on... | Read this first |
|---|---|
| Affective Computing (foundations) | Picard, Affective Computing (1997) |
| Emotion Theory & Psychology | Ekman, Emotions Revealed (2003) |
| Speech Emotion Recognition | Schuller & Batliner, Computational Paralinguistics (2013) |
| Text / Sentiment Analysis | Bing Liu, Sentiment Analysis (2020) |
| Facial Expression Recognition | Ekman & Friesen, FACS (1978) |
| Deep Learning Methods | Goodfellow et al., Deep Learning (2016) |
| Multimodal Fusion | Baltrušaitis et al., Multimodal ML Survey (2019) |
| Neuroscience of Emotion | Damásio, Descartes' Error (1994) |
| AI Ethics & Fairness | O'Neil, Weapons of Math Destruction (2016) |
| Comprehensive Reference | Calvo et al., Oxford Handbook of Affective Computing (2015) |
A = Audio, T = Text, V = Visual, P = Physiological
| Publication | Year | Modality | Uni-modal | Multi-modal | Evaluation | Pipeline | Dataset |
|---|---|---|---|---|---|---|---|
| Speech Commun | 2020 | A | ✅ | ❌ | ✅ | ❌ | ✅ |
| IEEE TAFFC | 2020 | A | ✅ | ❌ | ❌ | ❌ | ❌ |
| Information Fusion | 2020 | A,T,V | ❌ | ✅ | ❌ | ❌ | ✅ |
| Electronics | 2021 | A,T,V | ✅ | ✅ | ❌ | ✅ | ✅ |
| IEEE Signal Process. Mag. | 2021 | A,T,V | ❌ | ✅ | ❌ | ❌ | ✅ |
| Information Science | 2022 | A,T,V | ✅ | ❌ | ❌ | ❌ | ✅ |
| Neurocomputing | 2022 | A,T,V | ❌ | ✅ | ❌ | ❌ | ✅ |
| Information Fusion | 2022 | A,T,V | ✅ | ✅ | ❌ | ❌ | ✅ |
| IEEE TIM | 2023 | V | ✅ | ❌ | ❌ | ❌ | ✅ |
| Proc. IEEE | 2023 | V | ✅ | ❌ | ✅ | ❌ | ✅ |
| IEEE TAFFC | 2023 | T | ✅ | ❌ | ❌ | ❌ | ✅ |
| Speech Commun | 2023 | A | ✅ | ❌ | ❌ | ❌ | ✅ |
| IEEE Access | 2023 | A | ✅ | ❌ | ❌ | ❌ | ✅ |
| Information Fusion | 2023 | A,T,V | ✅ | ✅ | ❌ | ❌ | ✅ |
| Entropy | 2023 | A,T,V | ✅ | ✅ | ✅ | ✅ | ✅ |
| Neurocomputing | 2023 | A,T,V,P | ✅ | ✅ | ❌ | ❌ | ✅ |
| Information Fusion | 2024 | V | ✅ | ❌ | ❌ | ❌ | ✅ |
| Information Fusion | 2024 | A,T,V,P | ❌ | ✅ | ❌ | ❌ | ✅ |
| IEEE Access | 2024 | A,T,V | ❌ | ✅ | ❌ | ❌ | ✅ |
| Expert Syst. Appl. | 2024 | A,T,V | ✅ | ✅ | ❌ | ❌ | ✅ |
| Expert Systems | 2025 | A,T,V | ✅ | ✅ | ❌ | ❌ | ✅ |
| ACM TOMM | 2025 | A,T,V,P | ❌ | ✅ | ❌ | ❌ | ✅ |
| IEEE Access | 2025 | A,T,V | ✅ | ✅ | ❌ | ❌ | ✅ |
| Information Fusion | 2026 | S | ✅ | ❌ | ✅ | ✅ | ✅ |
| Ours | 2026 | A,T,V,P | ✅ | ✅ | ✅ | ✅ | ✅ |
| Dataset | Modality | Emotion Labels | Samples | Paper/Link |
|---|---|---|---|---|
| CK+ | V | Anger, Disgust, Fear, Happy, Sad, Surprise, Neutral, Contempt | 593 videos | Paper |
| AffectNet | V | Neutral, Happy, Sad, Surprise, Fear, Disgust, Anger, Contempt | 1,000,000 images | Paper |
| FER+ | V | Anger, Disgust, Fear, Happy, Sad, Surprise, Neutral, Contempt | 35,887 images | Paper |
| RAF-DB | V | Basic & compound emotions | 29,672 images | Paper |
| EmoReact | V | Curiosity, Uncertainty, Excitement, Happy, Surprise, Disgust, Fear, Frustration | 1,102 videos | Paper |
| Aff-Wild2 | V | Valence, Arousal | 558 videos | Paper |
| FERV39K | V | 7 basic emotions | 38,935 video clips | Paper |
| Dataset | Modality | Emotion Labels | Samples | Paper/Link |
|---|---|---|---|---|
| TESS | A | Anger, Disgust, Fear, Happy, Pleasant Surprise, Sadness, Neutral | 2,800 utterances | Paper |
| EmoDB 2.0 | A | Anger, Boredom, Disgust, Fear, Happy, Neutral, Sadness | 817 utterances | Paper |
| RAVDESS | A, V | Calm, Happy, Sad, Angry, Fearful, Surprise, Disgust | 7,356 videos | Paper |
| IEMOCAP | A, V, T | Happy, Angry, Sad, Frustrated, Neutral; Valence, Arousal, Dominance | 12.46h video | Paper |
| MSP-Podcast | A | Anger, Contempt, Disgust, Fear, Happy, Neutral, Sadness, Surprise | 264,705 turns | Paper |
| CREMA-D | A, V | Anger, Disgust, Fear, Happy, Neutral, Sad | 7,442 clips | GitHub |
| EMO-DB | A | Anger, Boredom, Disgust, Fear, Happy, Neutral, Sad | 535 utterances | - |
| Dataset | Modality | Emotion Labels | Samples | Paper/Link |
|---|---|---|---|---|
| ISEAR | T | Joy, Fear, Anger, Sadness, Disgust, Shame, Guilt | 7,666 sentences | Paper |
| EmoBank | T | Valence, Arousal, Dominance (writer & reader perspectives) | 10,548 sentences | Paper |
| SemEval-2018 Task 1 | T | 11 emotions + Neutral | 22,000 sentences | Paper |
| GoEmotions | T | 27 emotion categories | 58,000 Reddit comments | Paper |
| Empathetic Dialogues | T | 32 emotion categories | 24,850 conversations | Paper |
| WRIME | T | 8 emotions (reader/writer) | 17,000 social media posts | Paper |
| Dataset | Modality | Type | Emotion Labels | Samples | Paper/Link |
|---|---|---|---|---|---|
| eNTERFACE'05 | A, V | Acted | Anger, Disgust, Fear, Happy, Sad, Surprise | 1,166 videos | Paper |
| SAVEE | A, V | Acted | Anger, Disgust, Fear, Happy, Sad, Surprise, Neutral | 480 videos | Paper |
| AFEW | A, V | Natural | Anger, Disgust, Fear, Happy, Neutral, Sad, Surprise | 1,426 videos | Paper |
| CHEAVD | A, V | Natural | Anger, Happy, Sad, Worried, Anxious, Surprise, Disgust, Neutral | 140min video | Paper |
| SEWA | A, V | Natural | Valence, Arousal | 2,000min video | Paper |
| AMIGOS | A, V | Natural | Valence, Arousal, Dominance | 40 videos | Paper |
| CMU-MOSI | A, V, T | Induced | Continuous Sentiment Score | 3,702 clips | Paper |
| CMU-MOSEI | A, V, T | Induced | Happy, Sad, Angry, Disgust, Surprise, Fear | 23,500 clips | Paper |
| MELD | A, V, T | Induced | Anger, Disgust, Fear, Joy, Neutral, Sad, Surprise | 13,708 utterances | GitHub |
| IEMOCAP | A, V, T | Induced | Happy, Angry, Sad, Frustrated, Neutral | 12.46h video | Paper |
| CH-SIMS | A, V, T | Induced | 5-class sentiment | 2,281 clips | Paper |
| RAMAS | A, V | Induced | Anger, Sad, Disgust, Happy, Fear, Surprise | 7h video | Paper |
| MER2023 | A, V, T | Natural | 6 discrete + continuous | 5,030 clips | Paper |
| MER2024 | A, V, T | Natural | Multi-label + OV | Extended | Paper |
| MER2025 | A, V, T | Natural | Open-vocabulary | Extended | Paper |
| Model | Framework | Input | Loss | Performance | Dataset | Paper |
|---|---|---|---|---|---|---|
| C3D | 3D Conv | Video | Softmax | Acc: 59.02% | AFEW | Fan et al., 2016 |
| I3D | Inflated 3D | Video | Softmax | Acc: 68.90% | GreSti | Ghaleb et al., 2021 |
| SlowFast | Dual CNN | Video | Softmax | WAR: 49.34% | FERV39K | Neshov et al., 2024 |
| ViT-B/16+SAM | Transformer | Video | Cross-Entropy | Acc: 52.42% | FER-2013 | Arnab et al., 2021 |
| DTL-I-ResNet18 | 3D ResNet | Video | Softmax | Acc: 83.0% | FER2013 | Helaly et al., 2023 |
| ESTLNet | CNN-LSTM | Video | Cross-Entropy | Acc: 53.79% | AFEW | Wang et al., 2022 |
| D2SP | Dual Purification | Video | Cross-Entropy | WAR: 50.5% | FERV39k | CVPR 2025 |
| Model | Framework | Input | Loss | Performance | Dataset | Paper |
|---|---|---|---|---|---|---|
| HuBERT | CNN+Transformer | Raw audio | Contrastive | WA: 79.58% | IEMOCAP | Wang et al., 2021 |
| Wav2Vec | 1D CNN | Raw audio | Contrastive | WA: 77.00% | IEMOCAP | Wang et al., 2021 |
| emotion2vec | Online Distillation | Raw audio | Utterance+Frame | WA: 85.0% | RAVDESS | Ma et al., 2024 |
| SL-GEmo-CLAP | CNN+Transformer | WavLM-large | KL Loss | WAR: 81.43% | IEMOCAP | Pan et al., 2024 |
| WavLM | CNN+Transformer | Raw audio | Discriminative | Macro-F1: 33.6% | IEMOCAP | Wu et al., 2024 |
| Mockingjay | NPC | Raw audio | L1/MSE | Acc: 50.28% | IEMOCAP | Liu et al., 2024 |
| DeCoAR | SVM | Mel FBANK | L1/MSE | UAR: 71.93% | IEMOCAP | Stanea et al., 2023 |
| Vesper | CNN+Transformer | Raw audio | MSE | WA: 54.2% | IEMOCAP | Chen et al., 2024 |
| Audio-Transformer | Transformer | Spectrogram | Cross-Entropy | Acc: 75.42% | EMO-DB | Bayraktar et al., 2023 |
| DTNet | CNN+Transformer | Raw audio | Cross-Entropy | UA: 74.8% | IEMOCAP | Yuan et al., 2024 |
| Model | Framework | Input | Loss | Performance | Dataset | Paper |
|---|---|---|---|---|---|---|
| BERT | Transformer | Text token | MLM+NSP | Acc: 70.09% | ISEAR | Adoma et al., 2020 |
| RoBERTa | Transformer | Text token | Cross-Entropy | Acc: 74.31% | ISEAR | Adoma et al., 2020 |
| XLNet | Transformer | Permuted tokens | Permuted LM | Acc: 72.99% | ISEAR | Adoma et al., 2020 |
| ALBERT | Transformer | Text token | Focal+KL | Acc: 73.86% | ISEAR | Adoma et al., 2020 |
| DistilBERT | Transformer | Text token | MLM+Distillation | Acc: 66.93% | ISEAR | Adoma et al., 2020 |
| DeBERTa-v3 | Transformer | Text token | Cross-Entropy | F1: 66.2% | WRIME | Takenaka et al., 2025 |
| ChatGPT-4o | Transformer | Text token | Prompt-based | F1: 52.7% | WRIME | Atitienei et al., 2024 |
| GloVe | Co-occurrence matrix | Text tokens | Weighted LS | Acc: 95.09% | - | Gupta et al., 2021 |
| Word2Vec | CBOW | Text tokens | Hierarchical Softmax | Macro-F1: 73.21% | Tweets | Tang et al., 2014 |
| ELMo | BiLSTM | Context. vectors | Cross-Entropy | Acc: 88.91% | Wikipedia | Yang et al., 2021 |
| COMET | Transformer | Commonsense triple | Cross-Entropy | W-Avg F1: 65.21% | MELD | Zhang et al., 2021 |
FER has evolved from hand-crafted descriptors (LBP, HOG, Gabor) → CNN-based end-to-end learning → spatio-temporal models → Transformer-based architectures → self-supervised pretraining.
SER has transitioned from hand-crafted prosodic/spectral features → deep CNN/LSTM → Transformer-based → Self-Supervised Learning (SSL) as the dominant paradigm.
| Title | Venue | Date | Code |
|---|---|---|---|
| emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation | arXiv | 2023 | GitHub |
| SL-GEmo-CLAP: Contrastive Language-Audio Pretraining for Speech Emotion | Interspeech | 2024 | - |
| HuBERT: Self-Supervised Speech Representation Learning | IEEE/ACM TASLP | 2021 | GitHub |
| WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | IEEE JSTSP | 2022 | GitHub |
| Wav2Vec: Unsupervised Pre-Training for Speech Recognition | Interspeech | 2019 | GitHub |
| Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition | IEEE TASLP | 2024 | GitHub |
| DTNet: Disentanglement Learning for Speech Emotion Recognition | ICASSP | 2024 | - |
| Audio Transformer for Speech Emotion Recognition | ACM MM Asia | 2023 | - |
| Mockingjay: Unsupervised Speech Representation Learning | ICASSP | 2020 | GitHub |
TER has evolved from lexicon-based methods → static embeddings → transformer pretraining (BERT family) → Large Language Models enabling zero-shot generalization.
Fusion strategy determines when and how modalities interact: Early Fusion (feature-level) → Late Fusion (decision-level) → Model-level Fusion → Hybrid Fusion.
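To make the distinction concrete, here is a minimal sketch (synthetic data, hypothetical feature dimensions) contrasting early fusion, where features are concatenated before a single classifier, with late fusion, where per-modality classifiers are combined at the decision level. The table below lists representative methods for each fusion type.

```python
# Minimal early-vs-late fusion sketch (synthetic features, hypothetical dims).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
audio = rng.normal(size=(n, 88))    # e.g. eGeMAPS-sized audio features (assumed)
text = rng.normal(size=(n, 768))    # e.g. BERT sentence embeddings (assumed)
y = rng.integers(0, 4, size=n)      # 4 emotion classes

# Early fusion: modalities interact at the feature level.
early = LogisticRegression(max_iter=1000).fit(np.hstack([audio, text]), y)
p_early = early.predict_proba(np.hstack([audio, text]))

# Late fusion: modalities only interact at the decision level.
clf_a = LogisticRegression(max_iter=1000).fit(audio, y)
clf_t = LogisticRegression(max_iter=1000).fit(text, y)
p_late = 0.5 * clf_a.predict_proba(audio) + 0.5 * clf_t.predict_proba(text)
```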
| Title | Venue | Date | Code | Fusion Type |
|---|---|---|---|---|
| M²FNet: Multi-scale Multi-modal Fusion Network for Emotion Recognition in Conversations | CVPRW | 2022 | GitHub | Early |
| TDFNet: Text-Directed Fusion Network for Multimodal Sentiment Analysis | arXiv | 2023 | - | Early |
| UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition | EMNLP | 2022 | GitHub | Hybrid |
| Cross-Modal Fusion Network with Dual-Task Interaction | IEEE TAFFC | 2023 | - | Hybrid |
| Memory Fusion Network for Multi-view Sequential Learning | AAAI | 2018 | GitHub | Late |
| MISA: Modality-Invariant and -Specific Representations | ACM MM | 2020 | GitHub | Model-level |
| Efficient Low-rank Multimodal Fusion with Modality-Specific Factors | ACL | 2018 | GitHub | Model-level |
Four core granularity challenges: Modality Alignment · Modality Dominance · Modality Complementarity · Modality Robustness
| Title | Venue | Date | Code | Granularity |
|---|---|---|---|---|
| MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences | ACL | 2019 | GitHub | Alignment |
| TFN: Tensor Fusion Network for Multimodal Sentiment Analysis | EMNLP | 2017 | GitHub | Alignment |
| DialogueMMT: Distribution-Aware Multi-modal Dialogue Emotion Recognition | Interspeech | 2025 | - | Alignment |
| MAG-BERT: Integrating Multimodal Information in Large Pretrained Transformers | ACL | 2020 | GitHub | Dominance |
| MMIN: Missing Modality Imagination Network for Emotion Recognition | AAAI | 2021 | GitHub | Robustness |
| IMDer: Incomplete Multimodal Learning for Emotion Recognition | AAAI | 2024 | - | Robustness |
| GCNet: Graph Completion Network for Incomplete Multimodal Learning in Conversation | IEEE TPAMI | 2023 | GitHub | Robustness |
Multi-modal fusion architectures are broadly classified into 7 categories: Kernel-based · Graph-based · Neural Network-based · Attention-based · Transformer-based · Generative-based · LLM-based.
Multiple kernel learning (MKL) established the first principled framework for multi-modal fusion by associating each modality with its own similarity function, recognizing that heterogeneous affective signals cannot be compared through a single shared kernel.
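As a rough illustration of the idea (not any specific published system), the sketch below combines one RBF kernel per modality with hand-set weights and feeds the combined kernel to a precomputed-kernel SVM; a full MKL method would learn the combination weights jointly with the classifier.

```python
# MKL-style fusion sketch: one kernel per modality, weighted combination, SVM.
# Data shapes and weights are hypothetical.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_test = 200, 50
X_audio_tr, X_audio_te = rng.normal(size=(n_train, 88)), rng.normal(size=(n_test, 88))
X_video_tr, X_video_te = rng.normal(size=(n_train, 512)), rng.normal(size=(n_test, 512))
y_tr = rng.integers(0, 4, size=n_train)  # 4 emotion classes

def combined_kernel(Xa, Xa_ref, Xv, Xv_ref, w_audio=0.4, w_video=0.6):
    """Weighted sum of per-modality RBF kernels (the MKL combination rule)."""
    return w_audio * rbf_kernel(Xa, Xa_ref) + w_video * rbf_kernel(Xv, Xv_ref)

K_train = combined_kernel(X_audio_tr, X_audio_tr, X_video_tr, X_video_tr)
K_test = combined_kernel(X_audio_te, X_audio_tr, X_video_te, X_video_tr)

clf = SVC(kernel="precomputed").fit(K_train, y_tr)
pred = clf.predict(K_test)
```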
| Title | Venue | Date | Code |
|---|---|---|---|
| Multiple Kernel Learning for Emotion Recognition in the Wild | ACM MM | 2013 | - |
| Emotion Recognition in the Wild via CNN and Mapped Binary Patterns | ACM MM | 2015 | - |
| Ensemble of SVM Trees for Multimodal Emotion Recognition | ACII | 2017 | - |
Graph-based models treat dialogue as a heterogeneous graph where nodes represent utterances and edges encode speaker, temporal, and cross-modal dependencies — capturing that emotional meaning emerges from relational structure, not just individual utterances.
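A minimal sketch of this idea follows, with hypothetical shapes and a hand-built adjacency rather than any particular published graph construction: utterance nodes are linked by temporal and same-speaker edges, and one round of normalized message passing mixes neighbour information before per-utterance classification.

```python
# Toy dialogue-as-graph sketch: utterance nodes, speaker/temporal edges,
# one step of normalized message passing (GCN-like), then classification.
import torch

num_utt, dim, num_classes = 6, 256, 7
x = torch.randn(num_utt, dim)                 # per-utterance fused features
speakers = torch.tensor([0, 1, 0, 1, 0, 1])   # speaker id per utterance

# Build adjacency: temporal neighbours + same-speaker links + self-loops.
A = torch.eye(num_utt)
for i in range(num_utt - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0           # temporal edge
for i in range(num_utt):
    for j in range(num_utt):
        if speakers[i] == speakers[j]:
            A[i, j] = 1.0                      # speaker edge

deg = A.sum(dim=1, keepdim=True)
A_norm = A / deg                               # row-normalized propagation matrix

W = torch.nn.Linear(dim, dim)
classifier = torch.nn.Linear(dim, num_classes)

h = torch.relu(W(A_norm @ x))                  # one graph-convolution step
logits = classifier(h)                         # per-utterance emotion logits
```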
| Title | Venue | Date | Code |
|---|---|---|---|
| DialogueGCN: A Graph Convolutional Neural Network for ERC | EMNLP | 2019 | GitHub |
| COGMEN: COntextualized GNN based Multimodal Emotion recognitioN | NAACL | 2022 | GitHub |
| M3GAT: Multi-granularity Multi-scale Multi-modal Graph Attention Network | ACM TOMM | 2023 | - |
| M²FNet: Multi-scale Multi-modal Fusion Network for ERC | CVPRW | 2022 | GitHub |
| MMGCN: Multi-relational Graph Convolutional Network for Multimodal ERC | ACM MM | 2021 | GitHub |
| Hierarchical Heterogeneous Graph for Multimodal ERC | AAAI | 2025 | - |
| Decoupled Distillation Graph for Cross-modal ERC | ACM MM | 2024 | - |
| Persona-aware ERC with Graph Network and Turn Interaction | EMNLP | 2024 | - |
Early neural MER established the template of separate uni-modal encoders (CNN/LSTM) followed by joint fusion — but the concatenate-then-classify bottleneck allows classifiers to ignore weaker modalities whenever the dominant one minimizes training loss.
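The sketch below illustrates that template in PyTorch, with hypothetical dimensions: two independent uni-modal encoders, feature concatenation, and a shared linear classifier. Nothing in the objective forces the classifier to rely on both branches, which is exactly the dominance issue noted above.

```python
# Concatenate-then-classify MER template (hypothetical dims, toy sketch).
import torch
import torch.nn as nn

class ConcatFusionMER(nn.Module):
    def __init__(self, audio_dim=40, text_dim=300, hidden=128, num_classes=4):
        super().__init__()
        self.audio_enc = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.text_enc = nn.LSTM(text_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, audio_seq, text_seq):
        _, (h_a, _) = self.audio_enc(audio_seq)        # last hidden state per modality
        _, (h_t, _) = self.text_enc(text_seq)
        fused = torch.cat([h_a[-1], h_t[-1]], dim=-1)  # concatenate-then-classify
        return self.classifier(fused)

model = ConcatFusionMER()
logits = model(torch.randn(8, 100, 40), torch.randn(8, 20, 300))  # (batch, time, dim)
```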
| Title | Venue | Date | Code |
|---|---|---|---|
| Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities | IEEE TAFFC | 2011 | - |
| LSTM-based Multimodal Affect Prediction | ACII | 2013 | - |
| End-to-End Multimodal Emotion Recognition using Deep Neural Networks | IEEE JSTSP | 2017 | - |
| Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling | KBS | 2019 | GitHub |
| Semi-supervised Multimodal Emotion Recognition | AAAI | 2017 | - |
| Ensemble CNN for Multimodal Emotion Recognition | IEEE TAFFC | 2020 | - |
| MIST: Multi-modal Integration with Semi-supervised and Transfer Learning | arXiv | 2025 | - |
| DISD-Net: Dynamic Interaction Self-Distillation for Cross-Subject ERC | arXiv | 2025 | - |
Attention mechanisms span four complementary granularities: self-modal (intra-modal noise), cross-modal (inter-modal integration), spatial (discriminative region selection), temporal (salient moment capture) — each targeting distinct failure modes, not interchangeable solutions.
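As an example of the cross-modal granularity, the sketch below (hypothetical shapes, standard PyTorch multi-head attention) lets text tokens act as queries over the audio frame sequence, so each word attends to the acoustic evidence most relevant to it.

```python
# Cross-modal attention sketch: text queries attend over audio frames.
import torch
import torch.nn as nn

batch, t_text, t_audio, dim = 8, 20, 100, 256
text = torch.randn(batch, t_text, dim)    # text token features (assumed pre-encoded)
audio = torch.randn(batch, t_audio, dim)  # audio frame features (assumed pre-encoded)

cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
# query = text, key/value = audio -> audio-aware text representations
attended, weights = cross_attn(query=text, key=audio, value=audio)
utterance_repr = attended.mean(dim=1)      # pool for utterance-level emotion prediction
```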
| Title | Venue | Date | Code |
|---|---|---|---|
| MultiEMO: An Attention-Based Correlation-Aware Multimodal Fusion Framework | ACL | 2023 | GitHub |
| CTNet: Conversational Transformer Network for Emotion Recognition | IEEE/ACM TASLP | 2021 | - |
| Attentive Modality Hopping for Speech Emotion Recognition | ICASSP | 2020 | - |
| Multi-Modal Multi-Scale Temporal Self-Attention for Multimodal Sentiment Analysis | ACM MM | 2021 | - |
| Cross-Modal Residual Attention for Multimodal ERC | ACL | 2021 | - |
| Spatial Attention for Image-Text Emotion Recognition | AAAI | 2020 | - |
| Temporal Attention for Video-based Emotion Recognition | ICCV | 2019 | - |
| Knowledge-aware Graph-based Co-Attention for Multimodal ERC | EMNLP | 2023 | - |
| MSER: Multi-Scale Emotion Recognition with Orthogonal Learning | AAAI | 2024 | - |
| Phy-FusionNet: Temporal Attention with Memory-Augmented Periodic Modeling | arXiv | 2025 | - |
| Conv-Attention Adapter for LLM-based ERC | arXiv | 2024 | - |
| Joint Transformer-Attention for Multimodal ERC | ICASSP | 2023 | - |
| Bayesian Co-Attention for Uncertainty-Aware Multimodal ERC | ACL | 2023 | - |
Transformers perform modality alignment and cross-modal interaction simultaneously, through cross-attention at every layer rather than as sequential steps, which provides structural robustness to noise and modality inconsistency. Key open question: do the gains come from the architecture itself or from the pretrained representations?
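A minimal sketch in the spirit of MulT-style crossmodal transformers (hypothetical shapes, not the original implementation): cross-attention is applied inside every block, so the text stream re-attends to audio at each layer instead of fusing once at the end.

```python
# Layered cross-modal transformer sketch: cross-attention at every layer.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, target, source):
        # target (e.g. text) queries source (e.g. audio), with residual connections
        h, _ = self.attn(self.norm1(target), self.norm1(source), self.norm1(source))
        target = target + h
        return target + self.ff(self.norm2(target))

layers = nn.ModuleList([CrossModalBlock() for _ in range(4)])
text, audio = torch.randn(8, 20, 256), torch.randn(8, 100, 256)
for layer in layers:                 # cross-modal interaction happens at every layer
    text = layer(text, audio)
```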
Generative models address two problems discriminative models cannot solve by design: (1) data scarcity via GAN/diffusion augmentation, and (2) missing-modality robustness via conditional reconstruction. Critical insight: perceptual plausibility ≠ affective coherence — generative and discriminative objectives must be jointly optimized.
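To illustrate the conditional-reconstruction idea (a toy sketch with hypothetical embedding sizes, not any specific published model): a small network imputes a missing audio embedding from text and video, and the reconstruction loss is optimized jointly with the emotion classification loss so that imputed features remain affect-relevant rather than merely plausible.

```python
# Missing-modality reconstruction with a joint generative + discriminative objective.
import torch
import torch.nn as nn

dim, num_classes = 256, 4
reconstructor = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
classifier = nn.Linear(3 * dim, num_classes)

text = torch.randn(8, dim)
video = torch.randn(8, dim)
audio_true = torch.randn(8, dim)        # available only at training time
labels = torch.randint(0, num_classes, (8,))

audio_hat = reconstructor(torch.cat([text, video], dim=-1))   # impute missing audio
logits = classifier(torch.cat([text, video, audio_hat], dim=-1))

loss = nn.functional.mse_loss(audio_hat, audio_true) \
     + nn.functional.cross_entropy(logits, labels)   # joint reconstruction + classification
loss.backward()
```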
| Title | Venue | Date | Code |
|---|---|---|---|
| IMDer: Incomplete Multimodal Learning via Diffusion for ERC | AAAI | 2024 | - |
| GAN-based Multimodal Emotion Data Augmentation | ICASSP | 2019 | - |
| MALN: Modality-Adversarial Learning Network for Multimodal ERC | ACM MM | 2023 | - |
| Deep Autoencoder-Based Fusion for Multimodal Sentiment Analysis | ACL | 2021 | - |
| Diffusion-Based Multi-modal Emotion Recovery | arXiv | 2025 | - |
| DiffuFuse: Diffusion-based Fusion for Incomplete Multimodal ERC | arXiv | 2025 | - |
| Progressive Cross-Modal Reconstruction under Missing Modality | arXiv | 2025 | - |
| RoHyDr: Robust Hybrid Discriminative-Generative MER | arXiv | 2025 | - |
| Cross-Modal Adversarial-Generative Framework for Robust ERC | ACM MM | 2024 | - |
| Novel Autoencoder Fusion with Affective Discriminative Constraints | ICASSP | 2024 | - |
LLMs shift MER from task-specific fusion toward language-mediated reasoning over heterogeneous inputs — enabling zero-shot generalization, open-vocabulary recognition, and natural language explainability. Key challenges remain: prompt sensitivity, multimodal hallucination, and memory forgetting.
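As an illustration of language-mediated reasoning (a toy sketch; the prompt wording and the model call are hypothetical and depend on the MLLM used), heterogeneous cues are verbalized and placed into a single open-vocabulary prompt:

```python
# Prompt-construction sketch for LLM-based emotion reasoning (hypothetical wording).
def build_emotion_prompt(transcript, audio_description, face_description):
    return (
        "You are an emotion analysis assistant.\n"
        f"Transcript: {transcript}\n"
        f"Vocal cues: {audio_description}\n"
        f"Facial cues: {face_description}\n"
        "Describe the speaker's emotional state in your own words "
        "(open vocabulary), then briefly explain which cues support it."
    )

prompt = build_emotion_prompt(
    transcript="I guess it's fine, whatever you want.",
    audio_description="flat pitch, slow speaking rate, long pauses",
    face_description="averted gaze, slight frown",
)
# response = some_multimodal_llm(prompt)  # placeholder: model and API are not specified here
```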
| Architecture | Representative Models | Core Strength | Key Limitation |
|---|---|---|---|
| Kernel-based | MKL, SVM-Ensemble | Modality-specific similarity kernels | Poor scalability, cannot be end-to-end |
| Graph-based | DialogueGCN, COGMEN, M3GAT | Relational & conversational structure | Graph construction sensitivity |
| Neural Network | LSTM-based, CNN-based, MIST | Temporal & sequential emotion modeling | Concatenate-then-classify bottleneck |
| Attention-based | MultiEMO, Phy-FusionNet, MSER | Selective salient cue capture | Noise amplification from unreliable cues |
| Transformer-based | MulT, UniMSE, CTNet, TDFNet | Simultaneous cross-modal alignment | Pretrain bias; compute cost |
| Generative-based | IMDer, DiffuFuse, MALN | Missing-modality robustness; data augmentation | Plausibility ≠ affective coherence |
| LLM-based | AffectGPT, EmoLLM, R1-Omni | Zero-shot + explainability + open-vocabulary | Prompt sensitivity; hallucination |
If you find this repository or our survey useful for your research, please cite:
@article{luo2026comprehensive,
title = {A Comprehensive Review in Unimodal and Multimodal Emotion Recognition},
author = {Luo, Jiachen and Yang, Qu and He, Jiajun and Hua, Yining and
Zheng, Lian and Li, Yuanchao and Song, Siyang and Mathur, Leena and
Wen, Wu and Wang, Dingdong and Shen, Shuai and Wu, Jingyao and
Hu, Guimin and Hu, He and Li, Yong and Zhang, Zixing and
Wang, Jiadong and Zhou, Sifan and Tang, Zuojin and Cao, Canran and
Xu, Sheng and Zhao, Zhenjun and Toda, Tomoki and Xue, Xiangyang and
Zhao, Siyang and Sun, Licai and Zhang, Liyun and Cai, Cong and
Du, Jiamin and Ma, Ziyang and Chen, Mingjie and Qian, Chengxuan and
Phan, Huy and Wang, Lin and Schuller, Bjoern and Reiss, Joshua},
journal = {ACM Transactions on Intelligent Systems and Technology},
year = {2026},
note = {Resources: \url{https://github.com/jackchen69/Awesome-Emotion-Models}}
}

We welcome contributions! If you have papers, datasets, or models to add:
- Fork this repository
- Add your entry following the existing table format
- Submit a Pull Request with a brief description
Please ensure the added work is peer-reviewed or on arXiv with verifiable results.
- Jiachen Luo — jiachen.luo@qmul.ac.uk — Queen Mary University of London / TU Munich
- Lin Wang — lin.wang@qmul.ac.uk — Queen Mary University of London
- Bjoern Schuller — schuller@tum.de — Imperial College London / TU Munich
- Joshua Reiss — joshua.reiss@qmul.ac.uk — Queen Mary University of London
💬 WeChat Group: Scan the QR code here to join our Emo discussion group (all are welcome)
⭐ Star this repository if you find it helpful! ⭐

