AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding

This is the official repository for Audio-Visual Contrastive Decoding (AVCD), a simple, training-free method for mitigating hallucinations in AV-LLMs during decoding without relying on external tools.


🚀 Updates

  • ✅ AVCD code released!
  • ✅ Accepted at NeurIPS 2025

[Figure: Overview of AVCD]


📖 Overview

  • Reformulates conventional CD (Contrastive Decoding) from single-instance (e.g., video–text) to three-modality interactions
  • Dynamically detects the dominant modality and masks less dominant modalities before applying CD
  • Introduces entropy-guided adaptive gating to skip unnecessary forward passes and improve inference speed
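As a rough illustration of the first and third bullets, the sketch below combines conventional single-instance contrastive decoding with an entropy gate. It is not the AVCD tri-modal formulation: the function names, the (1 + alpha) / alpha weighting, and the threshold value are assumptions chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def contrastive_logits(full, masked, alpha=1.0):
    """Conventional CD: boost tokens the full input supports more than the masked input."""
    return [(1.0 + alpha) * f - alpha * m for f, m in zip(full, masked)]


def gated_decode_step(full_logits, masked_logits_fn, entropy_threshold=0.6, alpha=1.0):
    """Skip the extra (masked) forward pass when the model is already confident.

    masked_logits_fn is a hypothetical callable that runs the model with one
    modality masked; it is only invoked when the gate opens.
    """
    if entropy(softmax(full_logits)) < entropy_threshold:
        return full_logits, False
    return contrastive_logits(full_logits, masked_logits_fn(), alpha), True
```

Passing the masked forward pass as a callable keeps the saving explicit: a confident step never pays for the second pass.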

⚙️ Setup

1. Environment

Follow the setup guide in the VideoLLaMA2 repository (audio-visual branch).

2. Datasets

We use the AVHBench and MUSIC-AVQA datasets for AVCD. The repositories are available below:

Dataset      Link
AVHBench     GitHub
MUSIC-AVQA   GitHub

3. Repository layout

The data and scripts for running inference and evaluation are organized as follows:

Purpose              Path
Data QA files        json/
Inference scripts    videollama2/inference/
Evaluation scripts   videollama2/eval/

4. Usage

git clone https://github.com/kaistmm/AVCD.git
cd AVCD

4.1 Analysis Code Policy

For research analysis code in this repository, correctness and interpretability are preferred over silent degradation.

  • Do not add silent fallback logic by default.
  • If a required intermediate quantity is missing, raise an explicit error instead of returning a degraded result.
  • Only add downgrade or fallback paths when they are explicitly requested for debugging or batch robustness.
  • For probing and visualization scripts, keep the experimental definition strict and avoid "best effort" behavior that can make incorrect results look valid.
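One way to follow this policy in a probing script is sketched below. The exception class, the helper names, and the `attention_mass` key are hypothetical and do not refer to code in this repository.

```python
class MissingIntermediateError(RuntimeError):
    """Raised when a required intermediate quantity is absent."""


def require(record, key):
    """Return record[key], or fail loudly if it is missing, None, or empty."""
    value = record.get(key)
    if value is None or len(value) == 0:
        raise MissingIntermediateError(
            f"required intermediate quantity {key!r} is missing"
        )
    return value


def mean_attention_mass(record, *, allow_fallback=False):
    """Average a per-layer quantity; degrade only when explicitly asked to."""
    if allow_fallback and not record.get("attention_mass"):
        # Opt-in path for debugging or batch robustness; the caller must handle None.
        return None
    values = require(record, "attention_mass")
    return sum(values) / len(values)
```

The fallback is a keyword-only argument that defaults to off, so a degraded result can only appear when the caller asks for it by name.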

5. Inference

This stage runs the model and saves its generated answers. An example command for running inference with the original model is shown below:

python videollama2/inference/inference_AVH_val.py

To enable AVCD, add the --use-AVCD argument:

python videollama2/inference/inference_AVH_val.py --use-AVCD True

6. Evaluation

The inference step generates a JSON file that includes the question, the answer, and the prediction.
During evaluation, these JSON files can be used to directly measure accuracy or compute scores using GPT-based evaluation.
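For orientation, a minimal sketch of measuring yes/no accuracy from such a file follows. It is not the logic of eval_acc.py: it assumes the file is a JSON list of objects with `answer` and `prediction` keys and that answers are yes/no, none of which is confirmed here.

```python
import json


def normalize_yes_no(text):
    """Map free-form text to 'yes' / 'no' using its first word, else None."""
    words = text.strip().lower().replace(".", " ").replace(",", " ").split()
    first = words[0] if words else ""
    return first if first in ("yes", "no") else None


def accuracy(records):
    """Fraction of records whose prediction matches the gold yes/no answer."""
    if not records:
        raise ValueError("no prediction records to score")
    correct = 0
    for index, rec in enumerate(records):
        gold = normalize_yes_no(rec["answer"])
        if gold is None:
            # Fail loudly instead of silently scoring an unparseable label.
            raise ValueError(f"record {index}: gold answer is not yes/no")
        if normalize_yes_no(rec["prediction"]) == gold:
            correct += 1
    return correct / len(records)


def load_records(path):
    """Load the prediction file (assumed to be a JSON list)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A prediction that cannot be parsed as yes or no counts as wrong, while an unparseable gold answer raises an error.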

Accuracy (AVH)

  • AVH: Audio-driven Video Hallucination, Video-driven Audio Hallucination, AV Matching
python videollama2/eval/eval_acc.py --pred-path <path_to_preds>.json

Captioning Score (AVH_cap)

  • AVH_cap: AV Captioning
python videollama2/eval/eval_caption.py --pred-path <path_to_preds>.json --output-dir <dir>

Open-ended QA (MUSIC-AVQA)

python videollama2/eval/eval_gpt.py --pred-path <path_to_preds>.json --output-dir <dir>

Tri-Modal Demand-Aware Anchoring Decoding

This repository also includes a modular tri-modal decoding pipeline for video-audio-text research. Before decoding begins, the default probe estimates sample-level audio/video/text demand. At each decoding step, the controller then selects modality-specific anchor layers from last-token attention summaries, projects the anchor-layer hidden states to logits, and softly calibrates the final logits:

z_cal = z_final + alpha * sum_m(w_m * g_m * z_anchor_m)
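The formula can be read as a weighted sum over modalities, sketched below in plain Python. Treating `w_m` as the demand weight and `g_m` as the gate value for modality `m` is an interpretation of the surrounding text, and the function and argument names are hypothetical.

```python
def calibrate_logits(z_final, z_anchor, demand, gate, alpha=0.5):
    """z_cal = z_final + alpha * sum_m(w_m * g_m * z_anchor_m).

    z_final:  list of final-layer logits, one per vocabulary entry
    z_anchor: dict mapping modality -> anchor-layer logits (same length)
    demand:   dict mapping modality -> demand weight w_m (assumed meaning)
    gate:     dict mapping modality -> gate value g_m (assumed meaning)
    """
    z_cal = list(z_final)
    for modality, anchor_logits in z_anchor.items():
        scale = alpha * demand[modality] * gate[modality]
        for i, z in enumerate(anchor_logits):
            z_cal[i] += scale * z
    return z_cal
```

With a gate of zero, a modality contributes nothing, so closing every gate returns the final logits unchanged.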

The default demand estimator is a lightweight label-token probe: on top of the original question, it first asks which modality is needed to answer the question, then normalizes the configured audio / vision / text label-token logits into {audio, vision, text} demand weights for the whole sample. Attention-mass and uniform demand modes are available for ablation.
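The normalization step might look like the sketch below. It assumes a softmax over the three label-token logits, which the text does not specify, and the function name is hypothetical.

```python
import math


def demand_weights(label_logits):
    """Turn per-modality label-token logits into weights that sum to 1.

    label_logits: dict such as {"audio": 1.2, "vision": 0.3, "text": -0.5}
    A softmax is assumed here; other normalizations are possible.
    """
    peak = max(label_logits.values())
    exps = {name: math.exp(value - peak) for name, value in label_logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}
```

Equal logits give uniform weights, which matches the uniform demand mode mentioned for ablation.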

The calibration module supports three modes:

support_only:        z_cal = z_final + alpha * z_sup
unsupported_penalty: z_cal = z_final - beta * relu(z_final - z_sup)
support_contrast:    z_cal = z_final + alpha * z_sup - beta * relu(z_final - z_sup)
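The three modes can be written out elementwise as in the sketch below. The function signature is hypothetical and is not the repository's API; only the arithmetic follows the formulas above.

```python
def relu(x):
    return x if x > 0.0 else 0.0


MODES = ("support_only", "unsupported_penalty", "support_contrast")


def calibrate(z_final, z_sup, mode, alpha=0.5, beta=0.0):
    """Apply one calibration mode to final logits given support logits."""
    if mode not in MODES:
        raise ValueError(f"unknown calibration mode: {mode!r}")
    calibrated = []
    for zf, zs in zip(z_final, z_sup):
        # Penalty is non-zero only where the final logit exceeds its support.
        penalty = relu(zf - zs)
        if mode == "support_only":
            calibrated.append(zf + alpha * zs)
        elif mode == "unsupported_penalty":
            calibrated.append(zf - beta * penalty)
        else:  # support_contrast
            calibrated.append(zf + alpha * zs - beta * penalty)
    return calibrated
```

With `beta` set to 0.0, `support_contrast` reduces to `support_only` and `unsupported_penalty` leaves the final logits unchanged.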

The first supported production path is Qwen2 / VideoLLaMA2.1-7B-AV. The implementation is disabled by default, so existing baseline and AVCD inference calls keep their current behavior unless use_tri_modal_decoding=True is passed.

CPI-ACD

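  • Adds CPI-ACD, a training-free tri-modal corridor decoding method; its entry point still reuses videollama2.tri_modal_decoding
  • The default configuration is in configs/cpi_acd_default.yaml
  • See docs/cpi_acd.md for detailed documentation and commands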

CLI

Fake-model smoke demo without downloading model weights:

python -m videollama2.tri_modal_decoding.cli infer \
  --model-path dummy \
  --prompt "What happens in the clip?" \
  --fake-model \
  --trace-path outputs/tri_modal_trace.jsonl

Single real-model inference:

python -m videollama2.tri_modal_decoding.cli infer \
  --model-path /path/to/VideoLLaMA2.1-7B-AV \
  --video-path /path/to/video.mp4 \
  --prompt "Answer yes or no." \
  --config configs/tri_modal/default.yaml \
  --trace-path outputs/tri_modal_trace.jsonl

Analyze traces:

python -m videollama2.tri_modal_decoding.cli analyze \
  --trace-path outputs/tri_modal_trace.jsonl

Configs

Available configs:

  • configs/tri_modal/default.yaml
  • configs/tri_modal/ablate_uniform_demand.yaml
  • configs/tri_modal/ablate_final_only.yaml
  • configs/tri_modal/ablate_max_attention.yaml
  • configs/tri_modal/ablate_no_gate.yaml
  • configs/tri_modal/ablate_unsupported_penalty.yaml
  • configs/tri_modal/ablate_support_contrast.yaml

The main fields are:

use_modality_probe: true
modality_probe_type: label_token_probe
use_audio_anchor: true
use_vision_anchor: true
use_text_anchor: true
anchor_score_type: max_attention
use_dynamic_gate: true
use_contrastive_calibration: false
alpha: 0.5
beta: 0.0
calibration:
  entropy_gate_threshold: 0.6
top_p_candidate_only: false
export_trace: false
export_dir: outputs/tri_modal

Trace Schema

Trace files are JSONL, one generated token per line. Each record contains token id/text, decoding mode, modality spans, per-layer modality attention summaries, demand weights, selected anchor layers, gate values, top final logits, top calibrated logits, and config id.
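A trace in this layout can be summarized with a few lines of Python, as sketched below. The key name `demand_weights` is an assumption, since the actual field names are not listed here, and the function is hypothetical and separate from the `analyze` subcommand.

```python
import json
from collections import defaultdict


def mean_demand_weights(lines):
    """Average per-modality demand weights over all token records.

    lines: any iterable of JSONL strings, for example an open file handle.
    """
    totals = defaultdict(float)
    count = 0
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if "demand_weights" not in record:
            # Strict by design: a missing field is an error, not a skipped row.
            raise KeyError(f"line {line_no}: missing 'demand_weights'")
        for modality, weight in record["demand_weights"].items():
            totals[modality] += weight
        count += 1
    if count == 0:
        raise ValueError("trace contains no records")
    return {modality: total / count for modality, total in totals.items()}
```

Because the function accepts any iterable of lines, it can be called on a file opened from a path such as outputs/tri_modal_trace.jsonl or on an in-memory list in tests.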

Script Entrypoints

python scripts/run_tri_modal_infer.py --model-path dummy --prompt "Demo" --fake-model
python scripts/run_tri_modal_eval.py --model-path dummy --input-file data.jsonl --output-file outputs/preds.jsonl --fake-model
python scripts/analyze_tri_modal_trace.py --trace-path outputs/tri_modal_trace.jsonl
python scripts/run_tri_modal_ablation.py --model-path dummy --prompt "Demo" --fake-model

Minimal Demo Script

python scripts/demo_tri_modal_decode.py infer \
  --model-path dummy \
  --prompt "Demo" \
  --fake-model \
  --trace-path outputs/demo_trace.jsonl

📝 Citation

@inproceedings{jung2025avcd,
  author    = {Jung, Chaeyoung and Jang, Youngjoon and Chung, Joon Son},
  title     = {AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding},
  booktitle = {NeurIPS},
  year      = {2025}
}