CogniVoice: Multimodal and Multilingual Fusion Networks for Mild Cognitive Impairment Assessment from Spontaneous Speech
INTERSPEECH 2024, Paper
- Jiali Cheng (jiali_cheng@uml.edu)
- Mohamed Elgaar (mohamed_elgaar@uml.edu)
- Nidhi Vakil (nidhipiyush_vakil@uml.edu)
- Hadi Amiri (hadi_amiri@uml.edu)
Given an input speech sample, we extract features using transformers for the speech and its corresponding text, as well as acoustic features obtained from DisVoice. A standard training approach concatenates all features and optimizes the cross-entropy loss. To model potential shortcut signals within each feature set, we propose training with Product of Experts (PoE), applied to the multi-feature model and several uni-feature models, each of which predicts the labels using only one set of features. Our approach obtains ensemble logits from the multi-feature and uni-feature models via an element-wise product of their probabilities.
In addition, we show how PoE reduces the loss for samples correctly predicted using both multimodal and unimodal inputs, and increases the loss for samples that cannot be accurately predicted using one of the modalities, which helps identify and mitigate weaknesses in the model's predictive capabilities. The resulting ensemble logits therefore account for spurious correlations in the dataset while also being regularized to mitigate overfitting.
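For illustration, below is a minimal PyTorch sketch of a PoE objective of this form; it is not the repository's exact implementation. The element-wise product of expert probabilities is computed as a sum of log-probabilities, and we assume here that `poe_alpha` scales the uni-feature experts' contribution.

```python
import torch
import torch.nn.functional as F

def poe_loss(multi_logits, uni_logits_list, labels, poe_alpha=1.0):
    """Product-of-Experts cross-entropy over multi- and uni-feature experts."""
    # Element-wise product of expert probabilities == sum of log-probabilities.
    ensemble = F.log_softmax(multi_logits, dim=-1)
    for uni_logits in uni_logits_list:
        # poe_alpha (assumed) scales how strongly the uni-feature experts
        # contribute to the ensemble.
        ensemble = ensemble + poe_alpha * F.log_softmax(uni_logits, dim=-1)
    # cross_entropy applies log_softmax internally, renormalizing the product.
    return F.cross_entropy(ensemble, labels)

# Toy usage: one multi-feature expert and three uni-feature experts
# (e.g., speech, text, DisVoice); batch of 8 samples, 2 classes.
multi = torch.randn(8, 2)
unis = [torch.randn(8, 2) for _ in range(3)]
labels = torch.randint(0, 2, (8,))
loss = poe_loss(multi, unis, labels, poe_alpha=0.5)
```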
We use the TAUKADIAL 2024 challenge dataset. Please download the dataset from Link
CogniVoice has been tested using Python >=3.6 and transformers>=4.25.
Due to the limited data size, we use k-fold cross-validation and report the average validation performance.
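As a sketch of this evaluation protocol (illustrative only: the fold count, the stand-in classifier, and the F1 metric are placeholders, not the repository's actual setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(features, labels, k=5):
    """Average validation performance over k stratified folds."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in skf.split(features, labels):
        # LogisticRegression is a stand-in for one CogniVoice training run.
        clf = LogisticRegression(max_iter=1000)
        clf.fit(features[train_idx], labels[train_idx])
        scores.append(f1_score(labels[val_idx], clf.predict(features[val_idx])))
    # Report the average validation performance across folds.
    return float(np.mean(scores))
```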
To run MCI detection (classification):
```bash
bash run_cls.sh
```
To run MMSE prediction (regression):
```bash
bash run_reg.sh
```
Some important arguments are:
- `--poe`: whether to use Product of Experts (PoE)
- `--poe_alpha`: alpha controlling the debiasing strength in PoE
All Hugging Face `TrainingArguments` can be passed in as well, as in the example below.
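A hypothetical invocation, assuming the scripts forward extra flags to the training script (the values shown are placeholders, not recommended settings):

```bash
# MCI detection with PoE debiasing enabled, plus two standard
# Hugging Face TrainingArguments.
bash run_cls.sh --poe --poe_alpha 0.5 --learning_rate 2e-5 --num_train_epochs 10
```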
If you find CogniVoice useful for your research, please consider citing this paper:
```bibtex
@inproceedings{cheng24c_interspeech,
  title     = {CogniVoice: Multimodal and Multilingual Fusion Networks for Mild Cognitive Impairment Assessment from Spontaneous Speech},
  author    = {Jiali Cheng and Mohamed Elgaar and Nidhi Vakil and Hadi Amiri},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {4308--4312},
  doi       = {10.21437/Interspeech.2024-2370},
}
```
Please send any questions you might have about the code and/or the algorithm to jiali_cheng@uml.edu.
The CogniVoice codebase is released under the MIT license.
