This is the official repository for Audio-Visual Contrastive Decoding (AVCD), a simple, training-free method for mitigating hallucinations in AV-LLMs during decoding without relying on external tools.
- ✅ AVCD code released!
- ✅ Accepted at NeurIPS 2025
- Reformulates conventional CD (Contrastive Decoding) from single-instance (e.g., video–text) to three-modality interactions
- Dynamically detects the dominant modality and masks less dominant modalities before applying CD
- Introduces entropy-guided adaptive gating to skip unnecessary forward passes and improve inference speed
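The entropy-guided gating idea can be illustrated with a minimal sketch. It assumes the gate compares the entropy of the next-token distribution with a threshold and only triggers the extra contrastive forward passes when the model is uncertain. The function names and the threshold are hypothetical and are not taken from the AVCD code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def needs_contrastive_pass(logits, threshold):
    """Return True when the distribution is uncertain enough (entropy above
    the threshold) that the extra contrastive passes are assumed worthwhile."""
    return entropy(softmax(logits)) > threshold
```

With a confident distribution the gate stays closed and decoding can proceed from the original logits alone; with a flat distribution it opens.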
Follow the VideoLLaMA2 repository setup guide (audio-visual branch):
We use the AVHBench and MUSIC-AVQA datasets for AVCD. The repositories are available below:
| Dataset | Link |
|---|---|
| AVHBench | GitHub |
| MUSIC-AVQA | GitHub |
The data and scripts for running inference and evaluation are organized as follows:
| Purpose | Path |
|---|---|
| Data QA files | json/ |
| Inference scripts | videollama2/inference/ |
| Evaluation scripts | videollama2/eval/ |
git clone https://github.com/kaistmm/AVCD.git
cd AVCD

For research analysis code in this repository, correctness and interpretability are preferred over silent degradation.
- Do not add silent fallback logic by default.
- If a required intermediate quantity is missing, raise an explicit error instead of returning a degraded result.
- Only add downgrade or fallback paths when they are explicitly requested for debugging or batch robustness.
- For probing and visualization scripts, keep the experimental definition strict and avoid "best effort" behavior that can make incorrect results look valid.
This stage saves the generated answers from inference. An example command for running inference with the original model is shown below:
python videollama2/inference/inference_AVH_val.py

To enable AVCD, add the --use-AVCD argument:

python videollama2/inference/inference_AVH_val.py --use-AVCD True

The inference step generates a JSON file that includes the question, the answer, and the prediction.
During evaluation, these JSON files can be used to directly measure accuracy or compute scores using GPT-based evaluation.
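As a rough illustration of direct accuracy measurement, the sketch below scores a list of prediction records by case-insensitive exact match. It is not the logic of `eval_acc.py`. The key names `answer` and `prediction`, and the assumption that the file holds a JSON list of records, are hypothetical. A missing key raises an error rather than being skipped, in line with the no-silent-fallback policy above.

```python
import json


def load_records(path):
    """Load prediction records, assuming the file is a JSON list of objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def exact_match_accuracy(records, answer_key="answer", pred_key="prediction"):
    """Fraction of records whose prediction equals the answer after
    stripping whitespace and lowercasing. Raises on empty input or on a
    record that lacks either key (KeyError)."""
    if not records:
        raise ValueError("no prediction records to score")
    correct = 0
    for rec in records:
        gold = str(rec[answer_key]).strip().lower()
        pred = str(rec[pred_key]).strip().lower()
        correct += int(gold == pred)
    return correct / len(records)
```

Usage would look like `exact_match_accuracy(load_records("<path_to_preds>.json"))`.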
Accuracy (AVH)
- AVH: Audio-driven Video Hallucination, Video-driven Audio Hallucination, AV Matching
python videollama2/eval/eval_acc.py --pred-path <path_to_preds>.json

Captioning Score (AVH_cap)
- AVH_cap: AV Captioning
python videollama2/eval/eval_caption.py --pred-path <path_to_preds>.json --output-dir <dir>

Open-ended QA (MUSIC-AVQA)
python videollama2/eval/eval_gpt.py --pred-path <path_to_preds>.json --output-dir <dir>

This repository also includes a modular tri-modal decoding pipeline for video-audio-text research. Before formal decoding, the default probe estimates sample-level audio/video/text demand. At each decoding step, the controller selects modality-specific anchor layers from last-token attention summaries, projects the anchor-layer hidden states to logits, and softly calibrates the final logits:
z_cal = z_final + alpha * sum_m(w_m * g_m * z_anchor_m)
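The formula above can be written out in plain Python as a minimal sketch. It assumes `w_m` is the per-modality demand weight, `g_m` the per-modality gate value, and `z_anchor_m` the logits projected from that modality's anchor layer. The function name and the dict-based layout are hypothetical, not the repository's API.

```python
def soft_calibrate(z_final, z_anchor, w, g, alpha):
    """Compute z_cal = z_final + alpha * sum_m(w_m * g_m * z_anchor_m).

    z_final  : list of final-layer logits (one per vocabulary entry)
    z_anchor : dict mapping modality name -> list of anchor-layer logits
    w, g     : dicts mapping modality name -> demand weight / gate value
    """
    vocab = len(z_final)
    z_cal = list(z_final)
    for modality, anchor_logits in z_anchor.items():
        if len(anchor_logits) != vocab:
            raise ValueError(f"logit size mismatch for modality {modality!r}")
        # A missing weight or gate raises KeyError instead of defaulting.
        scale = alpha * w[modality] * g[modality]
        for i in range(vocab):
            z_cal[i] += scale * anchor_logits[i]
    return z_cal
```

A gate value of zero removes that modality's contribution entirely, so `z_cal` falls back to `z_final` when every gate is closed.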
The default demand estimator is a lightweight label-token probe: on top of the original question, it first asks which modality is needed to answer the question, then normalizes the configured audio / vision / text label-token logits into {audio, vision, text} demand weights for the whole sample. Attention-mass and uniform demand modes are available for ablation.
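The normalization step of the label-token probe might look like the sketch below. It assumes the normalization is a softmax over the three label-token logits, which the description does not state explicitly. The function name is hypothetical.

```python
import math

MODALITIES = ("audio", "vision", "text")


def demand_weights(label_logits):
    """Turn one logit per modality label token into weights that sum to 1.

    Assumes softmax normalization. Raises KeyError if a modality is missing
    rather than silently substituting a default.
    """
    missing = [m for m in MODALITIES if m not in label_logits]
    if missing:
        raise KeyError(f"missing label-token logits for: {missing}")
    peak = max(label_logits[m] for m in MODALITIES)
    exps = {m: math.exp(label_logits[m] - peak) for m in MODALITIES}
    total = sum(exps.values())
    return {m: exps[m] / total for m in MODALITIES}
```

Under this assumption, the uniform demand mode mentioned above corresponds to equal logits, which yields a weight of one third per modality.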
The calibration module supports three modes:
support_only: z_cal = z_final + alpha * z_sup
unsupported_penalty: z_cal = z_final - beta * relu(z_final - z_sup)
support_contrast: z_cal = z_final + alpha * z_sup - beta * relu(z_final - z_sup)
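The three modes differ only in which terms they keep, as the following sketch shows. It applies the formulas element-wise and takes `z_sup` (the support logits) as given. The function name and signature are hypothetical.

```python
def relu(x):
    """Rectifier: pass positive values through, clamp the rest to zero."""
    return x if x > 0.0 else 0.0


def calibrate(z_final, z_sup, mode, alpha=0.5, beta=0.0):
    """Apply one of the three calibration modes element-wise."""
    if len(z_final) != len(z_sup):
        raise ValueError("z_final and z_sup must have the same length")
    if mode == "support_only":
        return [f + alpha * s for f, s in zip(z_final, z_sup)]
    if mode == "unsupported_penalty":
        return [f - beta * relu(f - s) for f, s in zip(z_final, z_sup)]
    if mode == "support_contrast":
        return [f + alpha * s - beta * relu(f - s)
                for f, s in zip(z_final, z_sup)]
    # Unknown modes are an error, not a silent fallback to z_final.
    raise ValueError(f"unknown calibration mode: {mode!r}")
```

The penalty term is non-zero only for tokens whose final logit exceeds the support logit, so tokens that the anchors already support are left unpenalized.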
The first supported production path is Qwen2 / VideoLLaMA2.1-7B-AV. The implementation is disabled by default, so existing baseline and AVCD inference calls keep their current behavior unless use_tri_modal_decoding=True is passed.
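- Added CPI-ACD, a training-free tri-modal corridor decoding method; its entry point still reuses videollama2.tri_modal_decoding.
- The default configuration is in configs/cpi_acd_default.yaml.
- See docs/cpi_acd.md for detailed documentation and commands.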
Fake-model smoke demo without downloading model weights:
python -m videollama2.tri_modal_decoding.cli infer \
--model-path dummy \
--prompt "What happens in the clip?" \
--fake-model \
--trace-path outputs/tri_modal_trace.jsonl

Single real-model inference:
python -m videollama2.tri_modal_decoding.cli infer \
--model-path /path/to/VideoLLaMA2.1-7B-AV \
--video-path /path/to/video.mp4 \
--prompt "Answer yes or no." \
--config configs/tri_modal/default.yaml \
--trace-path outputs/tri_modal_trace.jsonl

Analyze traces:
python -m videollama2.tri_modal_decoding.cli analyze \
--trace-path outputs/tri_modal_trace.jsonl

Available configs:
- configs/tri_modal/default.yaml
- configs/tri_modal/ablate_uniform_demand.yaml
- configs/tri_modal/ablate_final_only.yaml
- configs/tri_modal/ablate_max_attention.yaml
- configs/tri_modal/ablate_no_gate.yaml
- configs/tri_modal/ablate_unsupported_penalty.yaml
- configs/tri_modal/ablate_support_contrast.yaml
The main fields are:
use_modality_probe: true
modality_probe_type: label_token_probe
use_audio_anchor: true
use_vision_anchor: true
use_text_anchor: true
anchor_score_type: max_attention
use_dynamic_gate: true
use_contrastive_calibration: false
alpha: 0.5
beta: 0.0
calibration:
entropy_gate_threshold: 0.6
top_p_candidate_only: false
export_trace: false
export_dir: outputs/tri_modal

Trace files are JSONL, one generated token per line. Each record contains token id/text, decoding mode, modality spans, per-layer modality attention summaries, demand weights, selected anchor layers, gate values, top final logits, top calibrated logits, and config id.
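For a quick look at a trace outside the provided analysis scripts, a JSONL file can be read with the standard library alone. The sketch below deliberately assumes no field names, since the exact keys are not listed here. It only parses each line and reports which keys occur. Both function names are hypothetical.

```python
import json


def read_trace(path):
    """Parse a JSONL trace into a list of dicts, one per generated token.

    Blank lines are skipped; a malformed line raises ValueError with its
    line number instead of being dropped silently.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
    return records


def field_counts(records):
    """Count how many records contain each top-level key."""
    counts = {}
    for rec in records:
        for key in rec:
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Calling `field_counts(read_trace("outputs/tri_modal_trace.jsonl"))` gives a quick check that every token record carries the same set of fields.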
python scripts/run_tri_modal_infer.py --model-path dummy --prompt "Demo" --fake-model
python scripts/run_tri_modal_eval.py --model-path dummy --input-file data.jsonl --output-file outputs/preds.jsonl --fake-model
python scripts/analyze_tri_modal_trace.py --trace-path outputs/tri_modal_trace.jsonl
python scripts/run_tri_modal_ablation.py --model-path dummy --prompt "Demo" --fake-model

python scripts/demo_tri_modal_decode.py infer \
--model-path dummy \
--prompt "Demo" \
--fake-model \
--trace-path outputs/demo_trace.jsonl

@inproceedings{jung2025avcd,
author = {Jung, Chaeyoung and Jang, Youngjoon and Chung, Joon Son},
title = {AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding},
booktitle = {NeurIPS},
year = {2025}
}
