A small, reproducible evaluation toolkit for testing whether vision-language assistants actually follow grounded dialogue context.

MM-Dialog-Bench is a Python toolkit for building and scoring multimodal dialogue test sets. It focuses on a practical problem I kept running into while reading multimodal model papers: many demos look good, but it is hard to compare models when an answer depends on both the image and previous turns.

This repository is intentionally lightweight. It does not ship model weights and it does not require a specific vendor API. Instead, it provides dataset schemas, prompt builders, baseline adapters, scoring utilities, and report generation so experiments can be repeated with different local or remote models.

The toolkit targets four behaviors:

- Grounding: whether answers refer to visible evidence instead of generic guesses.
- Context carry-over: whether a model remembers constraints from previous turns.
- Refusal quality: whether uncertain visual questions are handled safely.
- Answer stability: whether paraphrased prompts produce consistent judgments.

To install the toolkit and run the bundled example:

    python -m venv .venv
    source .venv/bin/activate
    pip install -e ".[dev]"
    mmdb validate examples/tiny_dialogs.jsonl
    mmdb run examples/tiny_dialogs.jsonl --adapter echo --out outputs/tiny_predictions.jsonl
    mmdb score examples/tiny_dialogs.jsonl outputs/tiny_predictions.jsonl --report outputs/report.md

Each line of a dataset file is one JSON object (pretty-printed here for readability):

    {
      "id": "room-001",
      "image": "images/room.jpg",
      "turns": [
        {"role": "user", "text": "What is on the desk?"},
        {"role": "assistant", "text": "A laptop and a mug."},
        {"role": "user", "text": "What color is the mug?"}
      ],
      "answer": "white",
      "tags": ["grounding", "context"]
    }

Records flow through the pipeline as follows:

    jsonl dataset ──▶ validator ──▶ prompt builder ──▶ model adapter
          │                                                  │
          └──────────────────▶ scorer ◀──────────────────────┘
                                 │
                                 ▼
                        markdown/json report

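
The first half of the pipeline above (validate, build a prompt, call an adapter) can be illustrated with a minimal sketch. The function names `validate_record`, `build_prompt`, `echo_adapter`, and `run`, the validation rules, the prompt layout, and the echo behavior are all assumptions made for illustration; they are not taken from the `mmdb` source. Only the record fields come from the example record.

```python
import json

# Field names taken from the example record; treating them as required is an assumption.
REQUIRED_FIELDS = {"id": str, "image": str, "turns": list, "answer": str}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected):
            errors.append(f"{field}: missing or not {expected.__name__}")
    turns = record.get("turns")
    if isinstance(turns, list):
        if not turns:
            errors.append("turns: must not be empty")
        for i, turn in enumerate(turns):
            if not isinstance(turn, dict) or turn.get("role") not in {"user", "assistant"}:
                errors.append(f"turns[{i}]: role must be 'user' or 'assistant'")
            elif not isinstance(turn.get("text"), str):
                errors.append(f"turns[{i}]: text must be a string")
    return errors


def build_prompt(record: dict) -> str:
    """Flatten the image reference and dialogue history into one text prompt."""
    lines = [f"[image: {record['image']}]"]
    lines += [f"{turn['role']}: {turn['text']}" for turn in record["turns"]]
    lines.append("assistant:")
    return "\n".join(lines)


def echo_adapter(prompt: str) -> str:
    """Hypothetical stand-in model: repeat the final user question."""
    user_lines = [line for line in prompt.splitlines() if line.startswith("user: ")]
    return user_lines[-1][len("user: "):] if user_lines else ""


def run(jsonl_text: str) -> list[dict]:
    """Validate each JSONL line, then collect one prediction per record."""
    predictions = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        problems = validate_record(record)
        if problems:
            raise ValueError(f"line {line_no}: {problems}")
        answer = echo_adapter(build_prompt(record))
        predictions.append({"id": record["id"], "prediction": answer})
    return predictions


EXAMPLE = json.dumps({
    "id": "room-001",
    "image": "images/room.jpg",
    "turns": [
        {"role": "user", "text": "What is on the desk?"},
        {"role": "assistant", "text": "A laptop and a mug."},
        {"role": "user", "text": "What color is the mug?"},
    ],
    "answer": "white",
    "tags": ["grounding", "context"],
})
print(run(EXAMPLE))
```

Keeping the adapter as a plain function from prompt text to answer text is one way to stay independent of any vendor API: a local or remote model would only need to be wrapped in a function with the same shape.
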
| Metric | Meaning |
|---|---|
| exact_match | strict normalized answer match |
| token_f1 | bag-of-words overlap after normalization |
| contradiction_rate | fraction of answers matching known negative labels |
| tag_breakdown | per-skill summary for grounding/context/refusal |
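
As a rough illustration of the metrics in the table, here is a minimal sketch. The normalization (lowercasing, dropping punctuation, collapsing whitespace), the reading of "matching" as equality after normalization, and every function signature are assumptions for illustration, not the scorer actually shipped with the toolkit.

```python
import re
from collections import Counter, defaultdict


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (assumed normalization)."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def exact_match(prediction: str, answer: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(answer))


def token_f1(prediction: str, answer: str) -> float:
    """Harmonic mean of token precision and recall over bags of words."""
    pred, gold = normalize(prediction).split(), normalize(answer).split()
    if not pred or not gold:
        return float(pred == gold)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def contradiction_rate(predictions: list[str], negatives: list[list[str]]) -> float:
    """Fraction of predictions equal to one of their known-wrong labels."""
    if not predictions:
        return 0.0
    hits = sum(
        normalize(pred) in {normalize(neg) for neg in negs}
        for pred, negs in zip(predictions, negatives)
    )
    return hits / len(predictions)


def tag_breakdown(rows: list[tuple[list[str], float]]) -> dict[str, float]:
    """Mean score per tag, given (tags, score) pairs."""
    scores = defaultdict(list)
    for tags, score in rows:
        for tag in tags:
            scores[tag].append(score)
    return {tag: sum(vals) / len(vals) for tag, vals in scores.items()}


print(exact_match("White.", "white"))
print(token_f1("a white mug", "white"))
print(tag_breakdown([(["grounding", "context"], 1.0), (["grounding"], 0.0)]))
```

The gap between the two scores for "a white mug" against "white" shows why reporting both is useful: a verbose but correct answer fails the strict match while still earning partial token credit.
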

This is not meant to replace large academic benchmarks. It is a compact tool for debugging model behavior before running expensive evaluations.

License: Apache-2.0