Modality-Adaptive Decoding (MAD) is a training-free decoding method for MLLMs that adaptively weights modality-specific branches via self-assessed modality relevance, effectively reducing cross-modal hallucinations.
- ✅ Accepted at CVPR 2026
Cross-modal hallucination differs from conventional hallucination in that it arises in multimodal models where multiple modalities are provided as input. Instead of generating unsupported content purely from language priors, the model produces incorrect predictions due to interference between modalities. (See related work: AVHBench, CMM)
**Video 1 (`episode1.mp4`): Visual-Driven Audio Hallucination**

> Model Prediction: "The main source of sound is quacking of the ducks"

The scene shows a metal duck (a shooting target) being hit by a gun. Because the model visually detects a duck, it incorrectly concludes that the primary sound source is the quacking of ducks, even though no such sound exists.
**Video 2 (`episode2.mp4`): Audio-Driven Video Hallucination**

> Model Prediction: "He's holding the gun and then he pulls the trigger"

Gunshot sounds are heard in the environment. The man in the video is merely preparing to shoot, but the model, influenced by the audio cue, assumes that he actually pulls the trigger.
These examples illustrate how information from one modality (visual or audio) can bias the interpretation of another, leading to cross-modal hallucinations.
MAD (Modality-Adaptive Decoding) works in two simple steps:
**Step 1. Modality Weight Extraction**
The model first identifies which modality (audio, video, or both) is relevant to the question and assigns adaptive weights accordingly.

**Step 2. Adaptive Decoding**
It then computes logits under different modality settings (audio-only, video-only, full input) and fuses them using the extracted weights.
This allows the model to focus on the question-relevant modality and suppress cross-modal interference, reducing hallucinations.
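The two steps above can be sketched as a weighted logit fusion. The snippet below is an illustrative pure-Python sketch only: `fuse_logits`, the toy weight values, and the additive fusion rule are simplifying assumptions for exposition, not the paper's exact formulation.

```python
def fuse_logits(logits_av, logits_a, logits_v, w_a, w_v, gamma=2.5):
    # Illustrative sketch (not the paper's exact rule): blend the full
    # audio-visual logits with the audio-only / video-only branch logits
    # using the self-assessed modality weights; gamma mirrors the --gamma
    # flag used by the evaluation scripts below.
    return [
        av + gamma * (w_a * a + w_v * v)
        for av, a, v in zip(logits_av, logits_a, logits_v)
    ]

# Toy example: the question is judged mostly video-relevant, and the
# video branch strongly prefers token 2.
fused = fuse_logits(
    logits_av=[1.0, 1.0, 1.0],
    logits_a=[0.0, 0.0, 0.0],
    logits_v=[0.0, 0.0, 2.0],
    w_a=0.1,  # hypothetical self-assessed audio relevance
    w_v=0.9,  # hypothetical self-assessed video relevance
)
print(fused.index(max(fused)))  # → 2: the fused logits follow the video branch
```

With a question that depends on video, the video branch dominates the fused distribution, which is the intuition behind suppressing the interfering modality.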
This directory contains evaluation scripts for the following models:

- **Qwen2.5-Omni (3B / 7B)** — `qwen-omni/`
- **VideoLLaMA2-AV** — `VideoLLaMA2/`

### Installation

```bash
pip install transformers accelerate tqdm
```

For VideoLLaMA2-AV, follow the VideoLLaMA2 repository setup guide (audio-visual branch).
### Qwen2.5-Omni

#### Baseline Evaluation (AVHBench)

```bash
# Audio-video evaluation
accelerate launch qwen-omni/eval_batch.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av

# Audio-only / Video-only evaluation
accelerate launch qwen-omni/eval_batch.py --model-path Qwen/Qwen2.5-Omni-7B --modal-type a
accelerate launch qwen-omni/eval_batch.py --model-path Qwen/Qwen2.5-Omni-7B --modal-type v

# With memory optimization
accelerate launch qwen-omni/eval_batch.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av --load-4bit --disable-talker
```

#### MAD Evaluation (AVHBench)
```bash
accelerate launch qwen-omni/eval_batch_mad.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av \
    --use-contrast-decode \
    --gamma 2.5
```

#### Baseline Evaluation (CMM)
```bash
accelerate launch qwen-omni/eval_batch_cmm.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av \
    --category over-reliance_unimodal_priors
```

#### MAD Evaluation (CMM)
```bash
accelerate launch qwen-omni/eval_batch_cmm_mad.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av \
    --category over-reliance_unimodal_priors \
    --gamma 2.5
```

### VideoLLaMA2-AV

#### Baseline Evaluation (AVHBench)
```bash
accelerate launch VideoLLaMA2/eval_batch.py --modal-type av
```

#### MAD Evaluation (AVHBench)
```bash
accelerate launch VideoLLaMA2/eval_batch_mad.py \
    --modal-type av \
    --gamma 2.5
```

#### Baseline Evaluation (CMM)
```bash
accelerate launch VideoLLaMA2/eval_batch_cmm.py \
    --modal-type av \
    --category over-reliance_unimodal_priors
```

#### MAD Evaluation (CMM)
```bash
accelerate launch VideoLLaMA2/eval_batch_cmm_mad.py \
    --modal-type av \
    --category over-reliance_unimodal_priors \
    --gamma 2.5
```

### Multi-GPU Evaluation

```bash
accelerate launch --num_processes 4 qwen-omni/eval_batch_mad.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av \
    --use-contrast-decode \
    --gamma 2.5
```

### Output Format

Results are saved as JSON files with the following structure:
```json
[
  {
    "video_id": "01162",
    "task": "AV Matching",
    "question": "Are the contexts of audio and visual content matching?",
    "ground_truth": "No",
    "prediction": "No, the audio and visual content do not match...",
    "inference_time": 2.45,
    "device": "cuda:0"
  }
]
```

### Scoring

```bash
# AVHBench scoring
python qwen-omni/score.py --f <result_json>

# CMM scoring
python qwen-omni/score_cmm.py --f <result_json>
```

### Tips

- **Memory Usage**: Use `--disable-talker` if you only need text outputs (Qwen2.5-Omni only)
- **Speed**: Increase `--batch-size` if memory allows (default: 1)
- **Multi-GPU**: Use `accelerate launch --num_processes N` for N GPUs
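Given the JSON result schema shown above, per-item records can also be inspected directly. The helper below is hypothetical (it is not part of the provided `score.py` / `score_cmm.py` scripts) and assumes a simple prefix match between prediction and ground truth:

```python
import json

def quick_accuracy(path):
    # Hypothetical helper, separate from the repo's scoring scripts:
    # load a result file (a list of dicts as shown above) and count a
    # prediction as correct when it starts with its ground-truth answer,
    # e.g. "No, the audio and visual content do not match..." vs "No".
    with open(path) as f:
        records = json.load(f)
    correct = sum(
        r["prediction"].strip().lower().startswith(r["ground_truth"].strip().lower())
        for r in records
    )
    return correct / len(records)
```

For example, `quick_accuracy("results.json")` gives a rough yes/no accuracy; the official scripts should be used for reported numbers.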
### Troubleshooting

**Memory Issues**

- Use `--load-4bit` for a reduced memory footprint
- Reduce `--batch-size` to 1
- Enable `--disable-talker` (Qwen2.5-Omni only)

**Performance Issues**

- Check CUDA compatibility
- Verify GPU utilization with `nvidia-smi`
- Use multiple GPUs with `accelerate`
### Citation

```bibtex
@misc{chung2026madmodalityadaptivedecodingmitigating,
      title={MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models},
      author={Sangyun Chung and Se Yeon Kim and Youngchae Chee and Yong Man Ro},
      year={2026},
      eprint={2601.21181},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.21181},
}
```