Jie Li, Shifei Ding, Lili Guo, and Xuan Li, "Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation". (IJCAI 2025, PyTorch code)
Emotion Recognition in Conversation (ERC) aims to detect the emotions of individual utterances within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted using different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and achieve state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.
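For a quick intuition of the fusion idea before reading the code, below is a minimal, hypothetical sketch of how an anchor modality (e.g., text) can gate the utterance-level representation of another modality. It illustrates the general gating mechanism only and is not the exact MAGT module from the paper; see model.py for the actual implementation.

```python
import torch
import torch.nn as nn

class AnchorGate(nn.Module):
    """Toy illustration: the anchor modality (e.g., text) controls how much of
    another modality's utterance-level representation is let through.
    This is NOT the exact MAGT module from the paper; see model.py instead."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, anchor: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # anchor, other: (batch, dim) utterance-level features
        g = torch.sigmoid(self.gate(torch.cat([anchor, other], dim=-1)))
        return anchor + g * other  # anchor-guided gated residual fusion


if __name__ == "__main__":
    text_feat, audio_feat = torch.randn(4, 768), torch.randn(4, 768)
    print(AnchorGate(768)(text_feat, audio_feat).shape)  # torch.Size([4, 768])
```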
The following pretrained models are used for feature extraction from the three modalities:
- Text Modality: RoBERTa-large
- Audio Modality: data2vec-audio-base-960h
- Video Modality: VideoMAE-base and TimeSformer-base-finetuned-k400
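As a rough illustration only (assuming the usual public Hugging Face hub identifiers; this repository instead loads local copies placed under pretrained_model/), the encoders can be instantiated with transformers as follows:

```python
# Hedged sketch: loading the four encoders via Hugging Face Transformers.
# The hub identifiers below are the standard public ones; in this repo the
# weights are expected to sit locally under pretrained_model/ instead.
from transformers import AutoModel, AutoTokenizer

text_encoder  = AutoModel.from_pretrained("roberta-large")
tokenizer     = AutoTokenizer.from_pretrained("roberta-large")
audio_encoder = AutoModel.from_pretrained("facebook/data2vec-audio-base-960h")
video_encoder = AutoModel.from_pretrained("MCG-NJU/videomae-base")
video_cls     = AutoModel.from_pretrained("facebook/timesformer-base-finetuned-k400")
```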
Python environment dependencies:
```
python==3.9.19
torch==1.13.1+cu116
torchvision==0.14.1+cu116
torchaudio==0.13.1+cu116
transformers==4.27.2
```
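An optional quick check that the installed versions match the ones listed above:

```python
import torch, torchvision, torchaudio, transformers

# Expected: 1.13.1+cu116 / 0.14.1+cu116 / 0.13.1+cu116 / 4.27.2
print(torch.__version__, torchvision.__version__,
      torchaudio.__version__, transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```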
We use two widely adopted ERC datasets: IEMOCAP and MELD.
In the first stage, modality-specific features are extracted via knowledge distillation. You may download and use the pre-extracted features provided below.
```
Project
| - datasets
|   | - IEMOCAP
|   |   | - IEMOCAP_train.csv
|   |   | - IEMOCAP_dev.csv
|   |   | - IEMOCAP_test.csv
|   |   | - Session1
|   |   | - ...
|   | - MELD
|   |   | - train_meld_emo.csv
|   |   | - dev_meld_emo.csv
|   |   | - test_meld_emo.csv
|   |   | - dev_splits_complete
|   |   | - train_splits
|   |   | - output_repeated_splits_test
|   |   | - ...
| - pretrained_model
|   | - roberta-large
|   | - data2vec-audio-base-960h
|   | - timesformer-base-finetuned-k400
|   | - videomae-base
| - IEMOCAP
|   | - feature
|   |   | - video
|   |   |   | - train
|   |   |   | - dev
|   |   |   | - test
|   |   | - IEMOCAP
|   | - save_model
|   |   | - text.bin
|   |   | - audio.bin
|   |   | - video.bin
|   |   | - text_KD_audio.bin
|   |   | - video_KD_text.bin
|   | - model.py
|   | - utils.py
|   | - preprocessing.py
|   | - dataset.py
|   | - text.py
|   | - audio.py
|   | - video.py
|   | - video_feature_extract.py
|   | - KD.py
| - MELD
```
To train from scratch on the IEMOCAP dataset:
```bash
# 1. Extract text features
python text.py

# 2. Extract audio features
python audio.py

# 3. Extract raw video features
python video_feature_extract.py

# 4. Process video features
python video.py

# 5. Perform knowledge distillation (audio student, text teacher)
python KD.py --student audio --teacher text

# 6. Perform knowledge distillation (video student, text teacher)
python KD.py --student video --teacher text

# 7. Extract fused features from all modalities (first stage)
python extract_first_stage_features.py

# 8. Perform multimodal fusion training and testing
python multimodal_fusion.py
```
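Steps 5-6 distill knowledge from the stronger text modality (teacher) into the weaker audio and video modalities (students). As a hedged sketch of what utterance-level logit distillation typically looks like (the loss actually implemented in KD.py may differ or include additional terms):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Generic logit distillation: KL divergence between softened teacher and
    student distributions, mixed with cross-entropy on the gold labels.
    This is only a sketch; see KD.py for the loss actually used."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Example with random tensors (batch of 8 utterances, 6 emotion classes):
student = torch.randn(8, 6)
teacher = torch.randn(8, 6)
labels = torch.randint(0, 6, (8,))
print(distillation_loss(student, teacher, labels))
```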
For the MELD dataset, the directory structure and training process are the same as for IEMOCAP.
To run testing only, first download the distilled first-stage features and place them into the corresponding dataset folders. Then run:
```bash
python multimodal_fusion.py --train False
```
If you find our work useful, please consider citing the following paper:
```bibtex
@inproceedings{Li-2025-MAGTKD,
  title     = {Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation},
  author    = {Li, Jie and Ding, Shifei and Guo, Lili and Li, Xuan},
  booktitle = {Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, {IJCAI-25}},
  pages     = {8141--8149},
  year      = {2025},
  doi       = {10.24963/ijcai.2025/905},
}
```
We also recommend citing the following related papers:
```bibtex
@inproceedings{song-etal-2022-supervised,
  title     = {Supervised Prototypical Contrastive Learning for Emotion Recognition in Conversation},
  author    = {Song, Xiaohui and Huang, Longtao and Xue, Hui and Hu, Songlin},
  booktitle = {EMNLP},
  year      = {2022},
  pages     = {5197--5206},
}

@inproceedings{yun-etal-2024-telme,
  title     = {{T}el{ME}: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation},
  author    = {Yun, Taeyang and Lim, Hyunkuk and Lee, Jeonghwan and Song, Min},
  booktitle = {NAACL},
  year      = {2024},
  pages     = {82--95},
}

@article{10109845,
  title   = {A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations},
  author  = {Ma, Hui and Wang, Jian and Lin, Hongfei and Zhang, Bo and Zhang, Yijia and Xu, Bo},
  journal = {IEEE Transactions on Multimedia},
  year    = {2024},
  volume  = {26},
  pages   = {776--788},
}
```
Our method builds upon the implementations of the projects cited above. We thank the authors of these works for sharing their codebases.