MM-Dialog-Bench: Multimodal Dialogue Evaluation

A small, reproducible evaluation toolkit for testing whether vision-language assistants actually follow grounded dialogue context.

MM-Dialog-Bench is a Python toolkit for building and scoring multimodal dialogue test sets. It focuses on a practical problem I kept running into while reading multimodal model papers: many demos look good, but it is hard to compare models when an answer depends on both the image and previous turns.

This repository is intentionally lightweight. It does not ship model weights and it does not require a specific vendor API. Instead, it provides dataset schemas, prompt builders, baseline adapters, scoring utilities, and report generation so experiments can be repeated with different local or remote models.

What it evaluates

Grounding — whether answers refer to visible evidence instead of generic guesses.
Context carry-over — whether a model remembers constraints from previous turns.
Refusal quality — whether uncertain visual questions are handled safely.
Answer stability — whether paraphrased prompts produce consistent judgments.

Install

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Quick start

mmdb validate examples/tiny_dialogs.jsonl
mmdb run examples/tiny_dialogs.jsonl --adapter echo --out outputs/tiny_predictions.jsonl
mmdb score examples/tiny_dialogs.jsonl outputs/tiny_predictions.jsonl --report outputs/report.md

Dataset format

Each line is a JSON object:

{
  "id": "room-001",
  "image": "images/room.jpg",
  "turns": [
    {"role": "user", "text": "What is on the desk?"},
    {"role": "assistant", "text": "A laptop and a mug."},
    {"role": "user", "text": "What color is the mug?"}
  ],
  "answer": "white",
  "tags": ["grounding", "context"]
}
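
As a rough illustration of what a schema check for this format could look like, here is a minimal sketch. It is not the toolkit's actual validator: the function names `validate_record` and `validate_jsonl_text` are hypothetical, and the rules that `tags` is optional and that the final turn must come from the user are assumptions inferred from the example record above.

```python
import json

# Field names come from the example record; the type rules are assumptions.
REQUIRED_FIELDS = {"id": str, "image": str, "turns": list, "answer": str}
VALID_ROLES = {"user", "assistant"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")

    turns = record.get("turns")
    if isinstance(turns, list):
        if not turns:
            problems.append("turns is empty")
        for i, turn in enumerate(turns):
            if not isinstance(turn, dict):
                problems.append(f"turn {i} is not an object")
                continue
            if turn.get("role") not in VALID_ROLES:
                problems.append(f"turn {i} has unknown role: {turn.get('role')!r}")
            if not isinstance(turn.get("text"), str):
                problems.append(f"turn {i} has no text")
        # Assumption: the model is asked to answer the final user turn.
        if turns and isinstance(turns[-1], dict) and turns[-1].get("role") != "user":
            problems.append("last turn should be a user question")

    # Assumption: tags are optional, but must be a list of strings if present.
    tags = record.get("tags", [])
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        problems.append("tags should be a list of strings")
    return problems


def validate_jsonl_text(text: str) -> dict[int, list[str]]:
    """Validate JSONL text; map 1-based line numbers to their problems."""
    report = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            report[lineno] = ["invalid JSON"]
            continue
        if not isinstance(record, dict):
            report[lineno] = ["not a JSON object"]
            continue
        problems = validate_record(record)
        if problems:
            report[lineno] = problems
    return report
```

An empty result from `validate_jsonl_text` means every non-blank line parsed and passed the checks. Returning problems as data, rather than raising on the first error, makes it easier to list every issue in a file in one pass.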

Architecture

jsonl dataset ──▶ validator ──▶ prompt builder ──▶ model adapter
      │                                                   │
      └──────────────────▶ scorer ◀──────────────────────┘
                              │
                              ▼
                       markdown/json report
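
To make the data flow in the diagram concrete, the sketch below wires a prompt builder to a model adapter for a single record. Everything here is hypothetical: the `Adapter` protocol, the `generate` signature, the `build_prompt` format, and the behavior of `EchoAdapter` are assumptions for illustration and may not match the code in `src/mm_dialog_bench` or what `--adapter echo` actually does.

```python
from typing import Protocol


class Adapter(Protocol):
    """Hypothetical adapter interface: one prompt and one image in, one answer out."""

    def generate(self, prompt: str, image_path: str) -> str: ...


class EchoAdapter:
    """Stand-in adapter that returns the last prompt line unchanged.

    Assumption: useful only for checking the plumbing without a real model.
    """

    def generate(self, prompt: str, image_path: str) -> str:
        return prompt.splitlines()[-1]


def build_prompt(turns: list[dict]) -> str:
    """Flatten dialogue turns into a 'role: text' transcript, one turn per line."""
    return "\n".join(f"{turn['role']}: {turn['text']}" for turn in turns)


def run_record(record: dict, adapter: Adapter) -> dict:
    """Build the prompt for one record and collect the adapter's prediction."""
    prompt = build_prompt(record["turns"])
    prediction = adapter.generate(prompt, record["image"])
    return {"id": record["id"], "prediction": prediction}
```

Keeping the adapter behind a small interface like this is one way to swap local and remote models without touching the validator or scorer.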

Example report

Metric              Meaning
exact_match         strict normalized answer match
token_f1            bag-of-words overlap after normalization
contradiction_rate  fraction of answers matching known negative labels
tag_breakdown       per-skill summary for grounding/context/refusal
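
For readers who want to see how metrics of this kind are commonly computed, here is a minimal sketch of `exact_match`, `token_f1`, and a per-tag average. The normalization shown (lowercasing, stripping ASCII punctuation, collapsing whitespace) is an assumption; the toolkit's own normalization and aggregation may differ. `contradiction_rate` is left out because it depends on negative labels that the example record does not show.

```python
import string
from collections import Counter, defaultdict


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over bag-of-words overlap."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def tag_breakdown(scored: list[tuple[list[str], float]]) -> dict[str, float]:
    """Average a score per tag, given (tags, score) pairs for each example."""
    buckets = defaultdict(list)
    for tags, score in scored:
        for tag in tags:
            buckets[tag].append(score)
    return {tag: sum(values) / len(values) for tag, values in buckets.items()}
```

With these definitions, a verbose but correct answer such as "The mug is white" scores 0.0 on `exact_match` against "white" and 0.4 on `token_f1`, which is why reporting both can be informative.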

Notes

This is not meant to replace large academic benchmarks. It is a compact tool for debugging model behavior before running expensive evaluations.

License

Apache-2.0
