This repository provides the official implementation for the paper
"Multimodal Emotion Recognition in the Wild: Corruption Modeling and Relevance-Guided Scoring."
We present a corruption modeling framework and relevance-guided scoring mechanism for robust multimodal emotion recognition (MER) under real-world degraded conditions.
```
├── MER_gen.py
├── Visual_perturb.py
├── MER_audio_gen.py
├── Audio_perturb.py
├── feature_extractor/
│   ├── Audio_opensmile.py
│   └── face_detection_CLIP.py
```

This project uses the CMU-MOSI and CMU-MOSEI datasets together with external audio/visual resources for corruption modeling. Please follow the instructions below to prepare the required files.
- Download from the CMU Multimodal SDK
- Required:
  - Raw `.mp4` video files
  - Corresponding `.wav` audio files
- We use the MUSAN dataset for audio corruption.
- Recommended categories: the `speech`, `music`, and `noise` subfolders.
- Pass `--noise_dir ./noise/sample_wav` to the audio corruption module.
We utilize various occlusion patches from publicly available datasets:

| Source | Usage | Download Link |
| --- | --- | --- |
| 11k Hands | Hand occlusion patches | 11k Hands |
| COCO Object | Object-shaped masks and regions | COCO (2017 images) |
| DTD | Texture occlusion (random noise-like) | DTD |
Directory structure example:
```
occlusion_patch/
├── image/
│   ├── hand_img/
│   ├── object_img/
│   └── dtd/
├── mask/
│   ├── hand_masks/
│   └── object_masks/
```

- Facial landmarks for occlusion positioning can be extracted using Mediapipe:
```shell
python MER_gen.py \
    --MOSI_main_dir ./sample_data/video \
    --MOSI_landmark_dir ./sample_data/mediapipe_landmark \
    --MOSI_save_loc ./tmp \
    --occlusion_img ./occlusion_patch/image \
    --occlusion_mask ./occlusion_patch/mask \
    --mode occlusion \
    --noise_percent 0.0
```
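Conceptually, pasting a masked occlusion patch (hand, object, or texture) onto a frame at a landmark position works as below. This is a minimal numpy sketch under stated assumptions; `paste_occlusion` is an illustrative helper, not the repository's exact routine:

```python
import numpy as np

def paste_occlusion(frame, patch, mask, center):
    """Paste an RGB occlusion patch onto a frame, keeping only pixels
    where the mask is nonzero, centered at `center` = (row, col).

    frame: (H, W, 3) uint8, patch: (h, w, 3) uint8, mask: (h, w).
    Illustrative helper, not the repository's implementation.
    """
    out = frame.copy()
    h, w = patch.shape[:2]
    r0 = center[0] - h // 2
    c0 = center[1] - w // 2

    # Clip the patch region to the frame boundaries.
    fr0, fc0 = max(r0, 0), max(c0, 0)
    fr1 = min(r0 + h, frame.shape[0])
    fc1 = min(c0 + w, frame.shape[1])
    pr0, pc0 = fr0 - r0, fc0 - c0
    pr1, pc1 = pr0 + (fr1 - fr0), pc0 + (fc1 - fc0)

    # Copy only the masked pixels of the (clipped) patch.
    region = out[fr0:fr1, fc0:fc1]
    m = mask[pr0:pr1, pc0:pc1] > 0
    region[m] = patch[pr0:pr1, pc0:pc1][m]
    return out
```

In the actual pipeline the `center` would come from a Mediapipe facial landmark (e.g. a mouth or eye coordinate), so the patch tracks the face across frames.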
We simulate real-world corruptions in both the audio and visual modalities to evaluate model robustness.
- `MER_audio_gen.py`: Injects noise into `.wav` files using additive SNR-based corruption. Noise types include speech, music, environmental, and white noise.
- `Audio_perturb.py`: Noise-injection utility with optional segment masking and configurable SNR levels.
- `MER_gen.py`: Corrupts video frames via occlusion (hand/object/texture) or additive noise (blur, Gaussian). Uses Mediapipe to extract facial landmarks for occlusion positioning.
- `Visual_perturb.py`: Contains core functions for visual occlusion, noise injection, and corruption scheduling.
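The SNR-based additive corruption used for audio can be sketched as follows: the noise waveform is length-matched to the signal, then scaled so that the signal-to-noise power ratio equals the requested value in dB. This is a minimal numpy sketch; `mix_at_snr` is an illustrative helper, not the repository's exact implementation:

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Additively mix `noise` into `signal` at a target SNR (in dB).

    The noise is tiled or cropped to the signal length, then scaled so
    that 10 * log10(P_signal / P_noise) equals `snr_db`.
    Illustrative helper, not the repository's implementation.
    """
    # Match the noise length to the signal length.
    if len(noise) < len(signal):
        reps = int(np.ceil(len(signal) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(signal)]

    # Average power of each waveform.
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)

    # Scale factor that yields the requested SNR after mixing.
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise
```

Lower `snr_db` values produce harsher corruption (at 0 dB, signal and noise have equal power).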
We extract audio and visual embeddings for model input using the following scripts:
- `Audio_opensmile.py`: Uses OpenSMILE to extract low-level descriptors (eGeMAPS, ComParE-2016). Outputs a fixed-length feature matrix using resampling or cropping.

```shell
python feature_extractor/Audio_opensmile.py \
    --raw_dir ./results/audio_perturb/snr_10 \
    --pkl_dir ./features/audio
```
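The fixed-length output can be obtained by cropping long sequences and padding short ones, for example. A minimal numpy sketch (zero-padding for short clips is an assumption here; the script itself may resample instead):

```python
import numpy as np

def to_fixed_length(features, target_len):
    """Crop or zero-pad a (T, D) feature matrix along time to (target_len, D).

    Illustrative helper; the repository's script may use resampling instead
    of padding for short clips.
    """
    T, D = features.shape
    if T >= target_len:
        # Crop: keep the first target_len frames.
        return features[:target_len]
    # Pad: append zero frames up to target_len.
    pad = np.zeros((target_len - T, D), dtype=features.dtype)
    return np.concatenate([features, pad], axis=0)
```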
- `Visual_face_detection_CLIP.py`: Detects faces using Mediapipe, extracts face regions, and encodes them using CLIP (ViT-B/32). Embeddings are resampled to a fixed number of frames (500 by default).

```shell
python feature_extractor/Visual_face_detection_CLIP.py \
    --input_folder ./results/visual_perturb \
    --output_folder ./features/visual
```
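Resampling a variable-length sequence of per-frame embeddings to a fixed frame count can be done by linear interpolation along the time axis. A minimal numpy sketch; `resample_frames` is an illustrative helper, not necessarily the interpolation scheme the script uses:

```python
import numpy as np

def resample_frames(embeddings, num_frames=500):
    """Linearly interpolate a (T, D) sequence of frame embeddings to
    exactly (num_frames, D) along the time axis.

    Illustrative helper; the repository's script may use a different scheme.
    """
    T, D = embeddings.shape
    src = np.linspace(0.0, T - 1, num=T)          # original frame positions
    dst = np.linspace(0.0, T - 1, num=num_frames)  # target frame positions
    # Interpolate each embedding dimension independently.
    return np.stack(
        [np.interp(dst, src, embeddings[:, d]) for d in range(D)], axis=1
    )
```

This upsamples short clips and downsamples long ones while preserving the first and last frames.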
```shell
python MER_audio_gen.py \
    --main_dir ./sample_data/wav \
    --noise_dir ./noise/sample_wav \
    --save_loc ./results/audio_perturb/snr
```

```shell
python MER_gen.py \
    --MOSI_main_dir ./sample_data/video \
    --MOSI_landmark_dir ./sample_data/mediapipe_landmark \
    --MOSI_save_loc ./results/visual_perturb \
    --occlusion_img ./occlusion_patch/image \
    --occlusion_mask ./occlusion_patch/mask \
    --mode occlusion \
    --noise_percent 0.3
```

Install dependencies:
```shell
pip install -r requirements.txt
```

Note: for audio feature extraction, please install OpenSMILE separately.
If you use this code or find it helpful for your research, please cite our paper:
@article{lee2025cars,
title = {Multimodal Emotion Recognition in the Wild: Corruption Modeling and Relevance-Guided Scoring},
author = {Lee, Yoonsun and Cho, Sunyoung},
journal = {Under Review},
year = {2025}
}
@article{hong2023watch,
title={Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring},
author={Hong, Joanna and Kim, Minsu and Choi, Jeongsoo and Ro, Yong Man},
journal={arXiv preprint arXiv:2303.08536},
year={2023}
}

This repository extends and adapts AV-RelScore for emotion recognition.