Commit 0b71775

[Benchmark Backfill] Integrate CountBench into lmms-eval (#1156)
* feat: add CountBench task config and scoring
* docs: add CountBench to current task catalog
1 parent ee50410 commit 0b71775

3 files changed

Lines changed: 74 additions & 0 deletions

docs/current_tasks.md

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ python -m lmms_eval --tasks list_with_num
 - COCO 2017 Caption MiniVal (coco2017_cap_val)
 - COCO 2017 Caption MiniTest (coco2017_cap_test)
 - [ConBench](https://github.com/foundation-multimodal-models/ConBench) (conbench)
+- [CountBench](https://huggingface.co/datasets/vikhyatk/CountBenchQA) (countbench)
 - [CV-Bench](https://github.com/nyu-visionx/CV-Bench) (cv_bench)
 - [DetailCaps-4870](https://github.com/foundation-multimodal-models/CAPTURE) (detailcaps)
 - [Flickr30K](https://github.com/BryanPlummer/flickr30k_entities) (flickr30k)

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+dataset_path: vikhyatk/CountBenchQA
+task: countbench
+test_split: test
+output_type: generate_until
+doc_to_visual: !function utils.countbench_doc_to_visual
+doc_to_text: !function utils.countbench_doc_to_text
+doc_to_target: !function utils.countbench_doc_to_target
+generation_kwargs:
+  max_new_tokens: 16
+  temperature: 0
+  do_sample: false
+process_results: !function utils.countbench_process_results
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+lmms_eval_specific_kwargs:
+  default:
+    pre_prompt: "Look at the image carefully and count the objects. Answer with just a number, without any additional text. "
+    post_prompt: ""
+metadata:
+  - version: 0.0

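
To make the prompt settings in this config concrete, the following minimal sketch reproduces how `countbench_doc_to_text` from this commit combines `pre_prompt`, the dataset question, and `post_prompt`. The function body and the two prompt strings are copied from the diff; the `DEFAULT_KWARGS` name and the example record are hypothetical and exist only for illustration, and the way the harness actually passes these kwargs is not shown in this commit.

```python
# Values copied from the `default` block under lmms_eval_specific_kwargs.
# DEFAULT_KWARGS is a hypothetical name used only in this sketch.
DEFAULT_KWARGS = {
    "pre_prompt": (
        "Look at the image carefully and count the objects. "
        "Answer with just a number, without any additional text. "
    ),
    "post_prompt": "",
}


def countbench_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    # Same logic as the function added in this commit.
    kwargs = lmms_eval_specific_kwargs or {}
    pre_prompt = kwargs.get("pre_prompt", "")
    post_prompt = kwargs.get("post_prompt", "")
    question = doc["question"].strip()
    return f"{pre_prompt}{question}{post_prompt}"


# Hypothetical record; real dataset rows may be worded differently.
example_doc = {"question": "  How many dogs are in the image?  "}

prompt = countbench_doc_to_text(example_doc, DEFAULT_KWARGS)
print(prompt)
```

Because `pre_prompt` ends with a trailing space and the question is stripped, the instruction and the question are joined by exactly one space. When no kwargs are supplied, the function returns the stripped question on its own.
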
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+NUMBER_WORD_TO_NUMERAL = {
+    "none": "0",
+    "zero": "0",
+    "one": "1",
+    "two": "2",
+    "three": "3",
+    "four": "4",
+    "five": "5",
+    "six": "6",
+    "seven": "7",
+    "eight": "8",
+    "nine": "9",
+    "ten": "10",
+    "eleven": "11",
+    "twelve": "12",
+    "thirteen": "13",
+    "fourteen": "14",
+    "fifteen": "15",
+    "sixteen": "16",
+    "seventeen": "17",
+    "eighteen": "18",
+    "nineteen": "19",
+    "twenty": "20",
+}
+
+
+def _normalize_count_answer(answer) -> str:
+    normalized = str(answer).strip().lower()
+    return NUMBER_WORD_TO_NUMERAL.get(normalized, normalized)
+
+
+def countbench_doc_to_visual(doc):
+    return [doc["image"].convert("RGB")]
+
+
+def countbench_doc_to_text(doc, lmms_eval_specific_kwargs=None):
+    kwargs = lmms_eval_specific_kwargs or {}
+    pre_prompt = kwargs.get("pre_prompt", "")
+    post_prompt = kwargs.get("post_prompt", "")
+    question = doc["question"].strip()
+    return f"{pre_prompt}{question}{post_prompt}"
+
+
+def countbench_doc_to_target(doc):
+    return _normalize_count_answer(doc["number"])
+
+
+def countbench_process_results(doc, results):
+    prediction = _normalize_count_answer(results[0])
+    target = countbench_doc_to_target(doc)
+    return {"acc": float(prediction == target)}
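
The scoring path above is an exact string match after normalization. The sketch below mirrors that logic to show which model outputs would count as correct. The word table is built compactly rather than written out, but it should contain the same entries as `NUMBER_WORD_TO_NUMERAL` in the diff; the `score` helper and all sample documents and predictions are hypothetical and are not part of the commit.

```python
# Compact rebuild of the number-word table from the diff ("none" and
# "zero" both map to "0"; "one" through "twenty" map to "1" through "20").
_WORDS = (
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty"
).split()
NUMBER_WORD_TO_NUMERAL = {word: str(i) for i, word in enumerate(_WORDS)}
NUMBER_WORD_TO_NUMERAL["none"] = "0"


def _normalize_count_answer(answer) -> str:
    # Same steps as in the commit: stringify, trim, lowercase, map words.
    normalized = str(answer).strip().lower()
    return NUMBER_WORD_TO_NUMERAL.get(normalized, normalized)


def score(doc, results):
    # Hypothetical helper mirroring countbench_process_results.
    prediction = _normalize_count_answer(results[0])
    target = _normalize_count_answer(doc["number"])
    return {"acc": float(prediction == target)}


doc = {"number": 3}  # hypothetical record with an integer count

exact_digit = score(doc, ["3"])                # digit string matches
number_word = score(doc, ["Three"])            # case-insensitive word lookup
padded = score(doc, [" 3\n"])                  # surrounding whitespace is stripped
trailing_period = score(doc, ["3."])           # punctuation is not removed
full_sentence = score(doc, ["There are 3 dogs."])  # no number extraction

for name, result in [
    ("exact_digit", exact_digit),
    ("number_word", number_word),
    ("padded", padded),
    ("trailing_period", trailing_period),
    ("full_sentence", full_sentence),
]:
    print(name, result["acc"])
```

Under this logic, only a bare numeral or a bare number word up to twenty scores 1.0. An answer with trailing punctuation or extra words scores 0.0, which is consistent with the config's `pre_prompt` asking the model to answer with just a number.
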
