ViVerBench evaluates whether multimodal models can verify that generated visual outputs satisfy prompt-level constraints.
- 3,594 examples across 16 task categories
- Binary verification target (true/false)
- Inputs can contain 1, 2, or 8 images
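
Because the target is binary, scoring can be pictured as plain accuracy over true/false verdicts. The sketch below is illustrative only: `parse_verdict` and `accuracy` are hypothetical helpers, and the accepted answer words and the treatment of unparseable replies as incorrect are assumptions, not the benchmark's actual scoring code.

```python
import re
from typing import Optional, Sequence


def parse_verdict(reply: str) -> Optional[bool]:
    """Map a free-form model reply to True/False, or None if it is unparseable.

    Assumption: the verdict is the first word of the reply.
    """
    match = re.match(r"\W*([a-z]+)", reply.lower())
    word = match.group(1) if match else ""
    if word in {"true", "yes"}:
        return True
    if word in {"false", "no"}:
        return False
    return None


def accuracy(replies: Sequence[str], labels: Sequence[bool]) -> float:
    """Fraction of replies whose parsed verdict equals the gold label.

    Unparseable replies (None) never equal a boolean label, so they count as wrong.
    """
    if len(replies) != len(labels):
        raise ValueError("replies and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(parse_verdict(r) == y for r, y in zip(replies, labels))
    return correct / len(labels)
```

For example, `accuracy(["True.", "maybe"], [True, True])` scores the first reply as correct and the second as incorrect, since "maybe" does not parse to a verdict.
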
python -m lmms_eval \
  --model <model_name> \
  --tasks viverbench \
  --batch_size 1 \
  --limit 8