
zxpg1/mm-dialog-bench

 
 


MM-Dialog-Bench: Multimodal Dialogue Evaluation

A small, reproducible evaluation toolkit for testing whether vision-language assistants actually follow grounded dialogue context.

MM-Dialog-Bench is a Python toolkit for building and scoring multimodal dialogue test sets. It focuses on a practical problem I kept running into while reading multimodal model papers: many demos look good, but it is hard to compare models when an answer depends on both the image and previous turns.

This repository is intentionally lightweight. It does not ship model weights and it does not require a specific vendor API. Instead, it provides dataset schemas, prompt builders, baseline adapters, scoring utilities, and report generation so experiments can be repeated with different local or remote models.

What it evaluates

  • Grounding — whether answers refer to visible evidence instead of generic guesses.
  • Context carry-over — whether a model remembers constraints from previous turns.
  • Refusal quality — whether uncertain visual questions are handled safely.
  • Answer stability — whether paraphrased prompts produce consistent judgments.
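As a rough illustration of the answer-stability idea, the sketch below scores how often a model gives the same normalized answer across paraphrases of one question. The function name `stability` and the input layout (item id mapped to a list of answers) are assumptions made for this example, not the toolkit's actual API.

```python
from collections import Counter


def stability(answers_by_item: dict) -> float:
    """Mean share of answers that agree with the most common answer per item.

    answers_by_item maps an item id to the answers a model gave to
    paraphrases of the same question (hypothetical layout).
    """
    shares = []
    for answers in answers_by_item.values():
        # Light normalization so "White" and "white" count as the same answer.
        normalized = [a.strip().lower() for a in answers]
        if not normalized:
            continue
        top_count = Counter(normalized).most_common(1)[0][1]
        shares.append(top_count / len(normalized))
    return sum(shares) / len(shares) if shares else 0.0


example = {
    "room-001": ["white", "White", "black", "white"],
    "room-002": ["yes", "yes"],
}
print(stability(example))
```

A score of 1.0 means every paraphrase produced the same answer; lower values indicate that wording changes flipped the model's judgment.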

Install

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Quick start

mmdb validate examples/tiny_dialogs.jsonl
mmdb run examples/tiny_dialogs.jsonl --adapter echo --out outputs/tiny_predictions.jsonl
mmdb score examples/tiny_dialogs.jsonl outputs/tiny_predictions.jsonl --report outputs/report.md

Dataset format

Each line is a JSON object:

{
  "id": "room-001",
  "image": "images/room.jpg",
  "turns": [
    {"role": "user", "text": "What is on the desk?"},
    {"role": "assistant", "text": "A laptop and a mug."},
    {"role": "user", "text": "What color is the mug?"}
  ],
  "answer": "white",
  "tags": ["grounding", "context"]
}
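To make the schema concrete, here is a minimal sketch of the kind of check a validator could run on each record. It is not the code behind `mmdb validate`; the helper names, the assumption that all five fields are required, and the rule that the last turn must come from the user are inferred from the example above.

```python
import json

# Field names come from the example record; treating all of them as
# required is an assumption of this sketch.
REQUIRED_FIELDS = {"id": str, "image": str, "turns": list, "answer": str, "tags": list}
VALID_ROLES = {"user", "assistant"}


def validate_record(record) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field} should be {expected_type.__name__}")
    turns = record.get("turns")
    if isinstance(turns, list):
        for i, turn in enumerate(turns):
            if not isinstance(turn, dict) or turn.get("role") not in VALID_ROLES:
                errors.append(f"turns[{i}] has an invalid role")
            elif not isinstance(turn.get("text"), str):
                errors.append(f"turns[{i}] is missing text")
        # Assumed rule: the final turn is the question being scored.
        if turns and isinstance(turns[-1], dict) and turns[-1].get("role") != "user":
            errors.append("last turn should come from the user")
    return errors


def validate_jsonl(lines):
    """Yield (line_number, errors) for every invalid line in a JSONL stream."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            yield number, [f"invalid JSON: {exc.msg}"]
            continue
        errors = validate_record(record)
        if errors:
            yield number, errors
```

With a file on disk, something like `for n, errs in validate_jsonl(open(path, encoding="utf-8")): print(n, errs)` would list the offending lines.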

Architecture

jsonl dataset ──▶ validator ──▶ prompt builder ──▶ model adapter
      │                                                   │
      └──────────────────▶ scorer ◀──────────────────────┘
                              │
                              ▼
                       markdown/json report
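The flow in the diagram can be sketched in a few lines of Python. Everything here is illustrative: `build_prompt`, `EchoAdapter`, `run`, and `score` are hypothetical names, the prompt layout is an assumption, and this echo adapter simply repeats the last user turn, which may differ from what the real `echo` adapter does.

```python
def build_prompt(record: dict) -> str:
    """Flatten the dialogue turns into one text prompt (assumed layout)."""
    lines = [f"{turn['role']}: {turn['text']}" for turn in record["turns"]]
    lines.append("assistant:")
    return "\n".join(lines)


class EchoAdapter:
    """Stand-in model that repeats the last user turn and ignores the image."""

    def generate(self, prompt: str, image_path: str) -> str:
        user_lines = [l for l in prompt.splitlines() if l.startswith("user: ")]
        return user_lines[-1][len("user: "):] if user_lines else ""


def run(records, adapter):
    """Dataset -> prompt builder -> adapter, producing one prediction per record."""
    return [
        {"id": r["id"], "prediction": adapter.generate(build_prompt(r), r["image"])}
        for r in records
    ]


def score(records, predictions):
    """Compare predictions with gold answers using a simple case-insensitive match."""
    gold = {r["id"]: r["answer"] for r in records}
    hits = sum(
        1 for p in predictions
        if p["prediction"].strip().lower() == gold[p["id"]].strip().lower()
    )
    return {"exact_match": hits / len(predictions) if predictions else 0.0}


records = [{
    "id": "room-001",
    "image": "images/room.jpg",
    "turns": [
        {"role": "user", "text": "What is on the desk?"},
        {"role": "assistant", "text": "A laptop and a mug."},
        {"role": "user", "text": "What color is the mug?"},
    ],
    "answer": "white",
    "tags": ["grounding", "context"],
}]
predictions = run(records, EchoAdapter())
print(score(records, predictions))
```

Keeping the adapter behind a single `generate` method is what would let a local or remote model be swapped in without touching the rest of the pipeline.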

Example report

| Metric | Meaning |
| --- | --- |
| exact_match | strict normalized answer match |
| token_f1 | bag-of-words overlap after normalization |
| contradiction_rate | fraction of answers matching known negative labels |
| tag_breakdown | per-skill summary for grounding/context/refusal |
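The metrics above could be computed roughly as follows. This is a sketch under stated assumptions rather than the toolkit's scorer: the normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace) is a guess at what "normalized" means, and the input shapes for `contradiction_rate` and `tag_breakdown` are hypothetical.

```python
import string
from collections import Counter, defaultdict


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed rules)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall over bag-of-words overlap."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def contradiction_rate(predictions, negative_labels) -> float:
    """Fraction of predictions equal to a known-wrong answer for their item.

    negative_labels[i] lists the wrong answers for predictions[i] (hypothetical shape).
    """
    if not predictions:
        return 0.0
    hits = sum(
        1 for pred, negatives in zip(predictions, negative_labels)
        if normalize(pred) in {normalize(n) for n in negatives}
    )
    return hits / len(predictions)


def tag_breakdown(rows) -> dict:
    """Average a per-item score for each tag; rows are (tags, score) pairs."""
    scores = defaultdict(list)
    for tags, value in rows:
        for tag in tags:
            scores[tag].append(value)
    return {tag: sum(values) / len(values) for tag, values in scores.items()}


print(exact_match("White.", "white"))
print(token_f1("a white mug", "white"))
```

In this sketch `token_f1` rewards partially correct answers such as "a white mug" that `exact_match` would score as zero, which is why the two metrics are useful side by side.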

Notes

This is not meant to replace large academic benchmarks. It is a compact tool for debugging model behavior before running expensive evaluations.

License

Apache-2.0
