
Commit eb2cffe

Add VisRes Bench benchmark (CVPR 2026) (#1245)
* add visres tasks
* add citation
* fix pre-commit

1 parent 0e1f330 commit eb2cffe

35 files changed: 314 additions & 0 deletions

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
# VisRes-Bench

[VisRes-Bench](https://huggingface.co/datasets/tiiuae/visres_bench) is a visual reasoning benchmark with tasks at three difficulty levels. This folder defines lmms-eval tasks for all dataset configs.

**Dataset:** `tiiuae/visres_bench` (Hugging Face). A valid Hugging Face token may be required; set `HUGGINGFACE_HUB_TOKEN` (recent `huggingface_hub` releases read `HF_TOKEN`) or run `huggingface-cli login` before evaluation.

---

## Running the tasks

Use the `--tasks` argument with one of the group names or a single task name. Example with `accelerate launch`:

```bash
# From the lmms-eval repo root
accelerate launch --num_processes=1 -m lmms_eval \
    --model <your_model> \
    --model_args <your_args> \
    --tasks <TASK_OR_GROUP> \
    --batch_size 1
```
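
For scripted runs, the same invocation can be assembled programmatically. This is a minimal sketch, assuming a hypothetical helper `build_lmms_eval_command` (not part of lmms-eval) that only arranges the flags shown above; the model name and model arguments are placeholders.

```python
import shlex


def build_lmms_eval_command(model, model_args, tasks, batch_size=1, num_processes=1):
    """Hypothetical helper: assemble the accelerate/lmms_eval argv shown above."""
    return [
        "accelerate", "launch", f"--num_processes={num_processes}",
        "-m", "lmms_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", tasks,
        "--batch_size", str(batch_size),
    ]


# Placeholder model name and arguments; substitute your own.
cmd = build_lmms_eval_command("my_model", "pretrained=my/checkpoint", "visres_bench_level_1")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the lmms-eval repo root would launch the run.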

### Run all tasks (28 tasks)

```bash
--tasks visres_bench
```

Includes every config: all level-1 tasks (including the three random_sampling variants), all level-2 tasks, and all level-3 tasks.

---

### Run Level 1 only (8 tasks, no random_sampling)

```bash
--tasks visres_bench_level_1
```

Tasks: `visres_bench_level_1_global_occlusion_50`, `visres_bench_level_1_global_occlusion_70`, `visres_bench_level_1_global_occlusion_80`, `visres_bench_level_1_edges`, `visres_bench_level_1_brightness`, `visres_bench_level_1_blur`, `visres_bench_level_1_rotation`, `visres_bench_level_1_location`.

---

### Run Level 2 only (12 tasks)

```bash
--tasks visres_bench_level_2
```

Tasks: `visres_bench_level_2_uniform_count`, `visres_bench_level_2_count_progression`, `visres_bench_level_2_uniform_orientation`, `visres_bench_level_2_count_2_same_1_diff`, `visres_bench_level_2_orientation_2same_1diff`, `visres_bench_level_2_uniform_color`, `visres_bench_level_2_count_arithmetic`, `visres_bench_level_2_count_minmax`, `visres_bench_level_2_orientation_3_diff`, `visres_bench_level_2_color_2same_1diff`, `visres_bench_level_2_color_3_diff`, `visres_bench_level_2_count_3_diff`.

---

### Run Level 3 only (5 tasks)

```bash
--tasks visres_bench_level_3
```

Tasks: `visres_bench_level_3_spiral_color_orientation`, `visres_bench_level_3_coupled_color_count`, `visres_bench_level_3_independent_color_object_rientation`, `visres_bench_level_3_coupled_color_orientation`, `visres_bench_level_3_Independent_count_object_color`.

---

## Single task

To run one config only, use the full task name, e.g.:

```bash
--tasks visres_bench_level_1_global_occlusion_50
```

---

## Question type (guided vs generic)

The default prompt uses the **guided_question** column. To use **generic_question** instead, select the `generic` entry of `lmms_eval_specific_kwargs` (e.g. `--format generic` if your runner supports it). The default template defines:

- `default`: `question_column: guided_question`
- `generic`: `question_column: generic_question`
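
The column selection can be sketched as follows. The function mirrors the `visres_bench_doc_to_text` helper added in this commit; the sample record and the `PROMPT_KWARGS` dictionary are made up for illustration, and how lmms-eval chooses between the `default` and `generic` entries is not shown here.

```python
# Illustrative copy of the two entries defined in the default template.
PROMPT_KWARGS = {
    "default": {"question_column": "guided_question", "pre_prompt": "", "post_prompt": ""},
    "generic": {"question_column": "generic_question", "pre_prompt": "", "post_prompt": ""},
}


def doc_to_text(doc, prompt_kwargs=None):
    """Build the prompt from the column named by `question_column`."""
    prompt_kwargs = prompt_kwargs or {}
    column = prompt_kwargs.get("question_column", "guided_question")
    pre = prompt_kwargs.get("pre_prompt", "")
    post = prompt_kwargs.get("post_prompt", "")
    return f"{pre}{doc.get(column)}{post}"


# Made-up record carrying both question columns.
sample = {
    "guided_question": "Which option completes the occluded region?",
    "generic_question": "Which option fits best?",
}

guided = doc_to_text(sample, PROMPT_KWARGS["default"])   # uses guided_question
generic = doc_to_text(sample, PROMPT_KWARGS["generic"])  # uses generic_question
```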

---

## Summary

| Group                  | Description                 | # tasks |
|------------------------|-----------------------------|--------:|
| `visres_bench`         | All configs                 |      28 |
| `visres_bench_level_1` | Level 1, no random_sampling |       8 |
| `visres_bench_level_2` | Level 2 only                |      12 |
| `visres_bench_level_3` | Level 3 only                |       5 |

---

## Citation

If you use VisRes-Bench in your work, please cite:

```bibtex
@article{tortei2025visres,
  title={VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs},
  author={T{\"o}rtei, Brigitta Malagurski and Dahou, Yasser and Huynh, Ngoc Dung and Para, Wamiq Reyaz and Khac, Ph{\'u}c H L{\^e} and Singh, Ankit and Chaybouti, Sofian and Narayan, Sanath},
  journal={arXiv preprint arXiv:2512.21194},
  year={2025}
}
```
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
dataset_path: tiiuae/visres_bench
dataset_kwargs:
  token: True

output_type: generate_until
test_split: test

doc_to_text: !function utils.visres_bench_doc_to_text
doc_to_visual: !function utils.visres_bench_doc_to_visual
process_results: !function utils.vp_process_results
doc_to_target: "answer"

metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true

# Switch question column: guided_question (default) or generic_question
lmms_eval_specific_kwargs:
  default:
    question_column: guided_question
    pre_prompt: ""
    post_prompt: ""
  generic:
    question_column: generic_question
    pre_prompt: ""
    post_prompt: ""
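
The `ignore_case` and `ignore_punctuation` options in `metric_list` suggest a normalized comparison. The following is a minimal sketch of what such a comparison could look like, assuming plain lowercasing and ASCII punctuation stripping. The function names are hypothetical and this is not lmms-eval's implementation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text, ignore_case=True, ignore_punctuation=True):
    """Hypothetical normalizer applied to both prediction and target."""
    if ignore_case:
        text = text.lower()
    if ignore_punctuation:
        text = text.translate(_PUNCT_TABLE)
    return text.strip()


def exact_match(prediction, target, ignore_case=True, ignore_punctuation=True):
    """Return 1 if the normalized strings are identical, else 0."""
    same = normalize(prediction, ignore_case, ignore_punctuation) == normalize(
        target, ignore_case, ignore_punctuation
    )
    return int(same)


print(exact_match("A)", "a"))                     # case and ")" both ignored
print(exact_match("A", "a", ignore_case=False))   # case now matters
```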
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
group: visres_bench
task:
  - visres_bench_level_1_global_occlusion_50
  - visres_bench_level_1_global_occlusion_70
  - visres_bench_level_1_global_occlusion_80
  - visres_bench_level_1_edges
  - visres_bench_level_1_location_random_sampling
  - visres_bench_level_1_brightness
  - visres_bench_level_1_blur
  - visres_bench_level_1_rotation
  - visres_bench_level_1_rotation_random_sampling
  - visres_bench_level_1_edges_random_sampling
  - visres_bench_level_1_location
  - visres_bench_level_2_uniform_count
  - visres_bench_level_2_count_progression
  - visres_bench_level_2_uniform_orientation
  - visres_bench_level_2_count_2_same_1_diff
  - visres_bench_level_2_orientation_2same_1diff
  - visres_bench_level_2_uniform_color
  - visres_bench_level_2_count_arithmetic
  - visres_bench_level_2_count_minmax
  - visres_bench_level_2_orientation_3_diff
  - visres_bench_level_2_color_2same_1diff
  - visres_bench_level_2_color_3_diff
  - visres_bench_level_2_count_3_diff
  - visres_bench_level_3_spiral_color_orientation
  - visres_bench_level_3_coupled_color_count
  - visres_bench_level_3_independent_color_object_rientation
  - visres_bench_level_3_coupled_color_orientation
  - visres_bench_level_3_Independent_count_object_color
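
The size of the `visres_bench` group can be cross-checked against the per-level groups. This is a small sketch, assuming the config names are copied verbatim from the group files in this commit; the variable and function names are illustrative.

```python
LEVEL_1 = ["global_occlusion_50", "global_occlusion_70", "global_occlusion_80",
           "edges", "brightness", "blur", "rotation", "location"]
# Level-1 variants that appear only in the full `visres_bench` group.
LEVEL_1_RANDOM = ["location_random_sampling", "rotation_random_sampling",
                  "edges_random_sampling"]
LEVEL_2 = ["uniform_count", "count_progression", "uniform_orientation",
           "count_2_same_1_diff", "orientation_2same_1diff", "uniform_color",
           "count_arithmetic", "count_minmax", "orientation_3_diff",
           "color_2same_1diff", "color_3_diff", "count_3_diff"]
LEVEL_3 = ["spiral_color_orientation", "coupled_color_count",
           "independent_color_object_rientation", "coupled_color_orientation",
           "Independent_count_object_color"]


def task_names(level, configs):
    """Prefix each config with the task naming scheme used in the group files."""
    return [f"visres_bench_level_{level}_{name}" for name in configs]


all_tasks = (task_names(1, LEVEL_1 + LEVEL_1_RANDOM)
             + task_names(2, LEVEL_2)
             + task_names(3, LEVEL_3))

print(len(LEVEL_1), len(LEVEL_1_RANDOM), len(LEVEL_2), len(LEVEL_3), len(all_tasks))
# prints: 8 3 12 5 28
```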
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
group: visres_bench_level_1
task:
  - visres_bench_level_1_global_occlusion_50
  - visres_bench_level_1_global_occlusion_70
  - visres_bench_level_1_global_occlusion_80
  - visres_bench_level_1_edges
  - visres_bench_level_1_brightness
  - visres_bench_level_1_blur
  - visres_bench_level_1_rotation
  - visres_bench_level_1_location
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
group: visres_bench_level_2
task:
  - visres_bench_level_2_uniform_count
  - visres_bench_level_2_count_progression
  - visres_bench_level_2_uniform_orientation
  - visres_bench_level_2_count_2_same_1_diff
  - visres_bench_level_2_orientation_2same_1diff
  - visres_bench_level_2_uniform_color
  - visres_bench_level_2_count_arithmetic
  - visres_bench_level_2_count_minmax
  - visres_bench_level_2_orientation_3_diff
  - visres_bench_level_2_color_2same_1diff
  - visres_bench_level_2_color_3_diff
  - visres_bench_level_2_count_3_diff
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
group: visres_bench_level_3
task:
  - visres_bench_level_3_spiral_color_orientation
  - visres_bench_level_3_coupled_color_count
  - visres_bench_level_3_independent_color_object_rientation
  - visres_bench_level_3_coupled_color_orientation
  - visres_bench_level_3_Independent_count_object_color
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
# Utils for visres_bench
def visres_bench_doc_to_text(doc, prompt_kwargs=None):
    """Use question_column in lmms_eval_specific_kwargs to switch guided vs generic question."""
    if prompt_kwargs is None:
        prompt_kwargs = {}
    # Fall back to the guided question when no column is configured.
    question_column = prompt_kwargs.get("question_column", "guided_question")
    text = doc.get(question_column)
    pre_prompt = prompt_kwargs.get("pre_prompt", "")
    post_prompt = prompt_kwargs.get("post_prompt", "")
    return f"{pre_prompt}{text}{post_prompt}"


def visres_bench_doc_to_visual(doc):
    """Return the document's image(s) as a list of RGB images, or None if there are none."""
    if not doc.get("images", None):
        return None
    imgs = doc["images"]
    if isinstance(imgs, list):
        return [img.convert("RGB") for img in imgs]
    return [imgs.convert("RGB")]


def vp_process_results(doc, result):
    """Score one response: keep the text before the first ")" and compare it to the answer."""
    answer = doc["answer"]
    # Strip whitespace so a response such as " A) ..." still yields "A".
    result = result[0].strip().split(")")[0].strip()
    accuracy = 1 if answer == result else 0

    return {
        "exact_match": accuracy,
        "submission": {
            "image": doc["id"],
            "answer": result,
        },
    }
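
To see how this scoring treats typical model outputs, the sketch below reproduces the `split(")")` parsing step on made-up responses. The helper names `extract_choice` and `score` and the sample strings are illustrative only, and the whitespace stripping is an added assumption.

```python
def extract_choice(response):
    """Keep the text before the first ')' and trim surrounding whitespace."""
    return response.split(")")[0].strip()


def score(answer, response):
    """Return 1 when the extracted choice equals the reference answer, else 0."""
    return int(extract_choice(response) == answer)


# Made-up model responses for a question whose reference answer is "B".
examples = ["B) the second tile", "B", " B) ", "(B) the second tile", "b) lowercase"]
scores = [score("B", response) for response in examples]
print(scores)  # [1, 1, 1, 0, 0]
```

Responses such as `(B)` or a lowercase `b)` score 0 under this parsing, so prompts may need to steer the model toward the bare `B)` form.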
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
task: visres_bench_level_1_blur
dataset_name: level_1_blur
include: _default_template_visres_bench_yaml
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
task: visres_bench_level_1_brightness
dataset_name: level_1_brightness
include: _default_template_visres_bench_yaml
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
task: visres_bench_level_1_edges
dataset_name: level_1_edges
include: _default_template_visres_bench_yaml
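
Each per-task file follows the same three-line pattern, so the files could be generated rather than written by hand. This is a minimal sketch, assuming a hypothetical helper `render_task_yaml` and the three config names shown above. Writing the strings to disk is left out because the target file names are not shown here.

```python
TEMPLATE_NAME = "_default_template_visres_bench_yaml"


def render_task_yaml(config_name):
    """Return the three-line task file content for one dataset config."""
    return (
        f"task: visres_bench_{config_name}\n"
        f"dataset_name: {config_name}\n"
        f"include: {TEMPLATE_NAME}\n"
    )


configs = ["level_1_blur", "level_1_brightness", "level_1_edges"]
rendered = {name: render_task_yaml(name) for name in configs}
print(rendered["level_1_blur"])
```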
