Modality-Adaptive Decoding (MAD) is a training-free decoding method for MLLMs that adaptively weights modality-specific branches via self-assessed modality relevance, effectively reducing cross-modal hallucinations.
- ✅ Accepted at CVPR 2026
Cross-modal hallucination differs from conventional hallucination in that it arises in multimodal models where multiple modalities are provided as input. Instead of generating unsupported content purely from language priors, the model produces incorrect predictions due to interference between modalities. (See related work: AVHBench, CMM)
**Video 1 (`episode1.mp4`): Visual-Driven Audio Hallucination**

> Model Prediction: "The main source of sound is quacking of the ducks"

The scene shows a metal duck (a shooting target) being hit by a gun. Because the model visually detects a duck, it incorrectly concludes that the primary sound source is the quacking of ducks, even though no such sound exists.
**Video 2 (`episode2.mp4`): Audio-Driven Video Hallucination**

> Model Prediction: "He's holding the gun and then he pulls the trigger"

Gunshot sounds are heard in the environment. The man in the video is merely preparing to shoot, but the model, influenced by the audio cue, assumes that he actually pulls the trigger.
These examples illustrate how information from one modality (visual or audio) can bias the interpretation of another, leading to cross-modal hallucinations.
MAD (Modality-Adaptive Decoding) works in two simple steps:
**Step 1. Modality Weight Extraction**
The model first identifies which modality (audio, video, or both) is relevant to the question and assigns adaptive weights accordingly.

**Step 2. Adaptive Decoding**
It then computes logits under different modality settings (audio-only, video-only, full input) and fuses them using the extracted weights.
This allows the model to focus on the question-relevant modality and suppress cross-modal interference, reducing hallucinations.
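The two steps above can be sketched as a weighted logit fusion. The snippet below is an illustrative pure-Python sketch only: `fuse_logits`, the toy weight values, and the additive fusion rule are simplifying assumptions for exposition, not the paper's exact formulation.

```python
def fuse_logits(logits_av, logits_a, logits_v, w_a, w_v, gamma=2.5):
    # Illustrative sketch (not the paper's exact rule): blend the full
    # audio-visual logits with the audio-only / video-only branch logits
    # using the self-assessed modality weights; gamma mirrors the --gamma
    # flag used by the evaluation scripts below.
    return [
        av + gamma * (w_a * a + w_v * v)
        for av, a, v in zip(logits_av, logits_a, logits_v)
    ]

# Toy example: the question is judged mostly video-relevant, and the
# video branch strongly prefers token 2.
fused = fuse_logits(
    logits_av=[1.0, 1.0, 1.0],
    logits_a=[0.0, 0.0, 0.0],
    logits_v=[0.0, 0.0, 2.0],
    w_a=0.1,  # hypothetical self-assessed audio relevance
    w_v=0.9,  # hypothetical self-assessed video relevance
)
print(fused.index(max(fused)))  # → 2: the fused logits follow the video branch
```

With a question that depends on video, the video branch dominates the fused distribution, which is the intuition behind suppressing the interfering modality.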
This directory contains evaluation scripts for the following models:

- **Qwen2.5-Omni (3B / 7B)** — `qwen-omni/`
- **VideoLLaMA2-AV** — `VideoLLaMA2/`

### Installation

```bash
pip install transformers accelerate tqdm
```

For VideoLLaMA2-AV, follow the VideoLLaMA2 repository setup guide (audio-visual branch).
### Qwen2.5-Omni

#### Baseline Evaluation (AVHBench)

```bash
# Audio-video evaluation
accelerate launch qwen-omni/eval_batch.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av

# Audio-only / Video-only evaluation
accelerate launch qwen-omni/eval_batch.py --model-path Qwen/Qwen2.5-Omni-7B --modal-type a
accelerate launch qwen-omni/eval_batch.py --model-path Qwen/Qwen2.5-Omni-7B --modal-type v

# With memory optimization
accelerate launch qwen-omni/eval_batch.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av --load-4bit --disable-talker
```

#### MAD Evaluation (AVHBench)
```bash
accelerate launch qwen-omni/eval_batch_mad.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av \
    --use-contrast-decode \
    --gamma 2.5
```

#### Baseline Evaluation (CMM)
```bash
accelerate launch qwen-omni/eval_batch_cmm.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av \
    --category over-reliance_unimodal_priors
```

#### MAD Evaluation (CMM)
```bash
accelerate launch qwen-omni/eval_batch_cmm_mad.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av \
    --category over-reliance_unimodal_priors \
    --gamma 2.5
```

### VideoLLaMA2-AV

#### Baseline Evaluation (AVHBench)
```bash
accelerate launch VideoLLaMA2/eval_batch.py --modal-type av
```

#### MAD Evaluation (AVHBench)
```bash
accelerate launch VideoLLaMA2/eval_batch_mad.py \
    --modal-type av \
    --gamma 2.5
```

#### Baseline Evaluation (CMM)
```bash
accelerate launch VideoLLaMA2/eval_batch_cmm.py \
    --modal-type av \
    --category over-reliance_unimodal_priors
```

#### MAD Evaluation (CMM)
```bash
accelerate launch VideoLLaMA2/eval_batch_cmm_mad.py \
    --modal-type av \
    --category over-reliance_unimodal_priors \
    --gamma 2.5
```

### Multi-GPU Evaluation

```bash
accelerate launch --num_processes 4 qwen-omni/eval_batch_mad.py \
    --model-path Qwen/Qwen2.5-Omni-7B \
    --modal-type av \
    --use-contrast-decode \
    --gamma 2.5
```

### Output Format

Results are saved as JSON files with the following structure:
```json
[
  {
    "video_id": "01162",
    "task": "AV Matching",
    "question": "Are the contexts of audio and visual content matching?",
    "ground_truth": "No",
    "prediction": "No, the audio and visual content do not match...",
    "inference_time": 2.45,
    "device": "cuda:0"
  }
]
```

### Scoring

```bash
# AVHBench scoring
python qwen-omni/score.py --f <result_json>

# CMM scoring
python qwen-omni/score_cmm.py --f <result_json>
```

### Tips

- **Memory Usage**: Use `--disable-talker` if you only need text outputs (Qwen2.5-Omni only)
- **Speed**: Increase `--batch-size` if memory allows (default: 1)
- **Multi-GPU**: Use `accelerate launch --num_processes N` for N GPUs
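Given the JSON result schema shown above, per-item records can also be inspected directly. The helper below is hypothetical (it is not part of the provided `score.py` / `score_cmm.py` scripts) and assumes a simple prefix match between prediction and ground truth:

```python
import json

def quick_accuracy(path):
    # Hypothetical helper, separate from the repo's scoring scripts:
    # load a result file (a list of dicts as shown above) and count a
    # prediction as correct when it starts with its ground-truth answer,
    # e.g. "No, the audio and visual content do not match..." vs "No".
    with open(path) as f:
        records = json.load(f)
    correct = sum(
        r["prediction"].strip().lower().startswith(r["ground_truth"].strip().lower())
        for r in records
    )
    return correct / len(records)
```

For example, `quick_accuracy("results.json")` gives a rough yes/no accuracy; the official scripts should be used for reported numbers.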
### Troubleshooting

**Memory Issues**

- Use `--load-4bit` for a reduced memory footprint
- Reduce `--batch-size` to 1
- Enable `--disable-talker` (Qwen2.5-Omni only)

**Performance Issues**

- Check CUDA compatibility
- Verify GPU utilization with `nvidia-smi`
- Use multiple GPUs with `accelerate`
### Citation

```bibtex
@misc{chung2026madmodalityadaptivedecodingmitigating,
      title={MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models},
      author={Sangyun Chung and Se Yeon Kim and Youngchae Chee and Yong Man Ro},
      year={2026},
      eprint={2601.21181},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.21181},
}
```