feat: add CrossPoint-Bench task by njb-nvidia · Pull Request #1349 · EvolvingLMMs-Lab/lmms-eval

njb-nvidia · 2026-05-21T00:42:30Z

Summary

Adds CrossPoint-Bench, a 1,000-sample cross-view point correspondence benchmark covering four sub-tasks at two granularity levels (object / part):

| Sub-task | Count | Output type | Scoring |
| --- | --- | --- | --- |
| Fine-grained Grounding | 161 | coordinate (JSON) | point-in-mask hit |
| Visibility Reasoning | 220 | binary MCQ | letter match |
| Correspondence-Judgement | 156 | MCQ | letter match |
| Correspondence-Pointing | 463 | coordinate (JSON) | point-in-mask hit |
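The two scoring rules in the table can be sketched roughly as follows. This is an illustration rather than the code in `utils.py`: the function names, the `MCQ_TYPES` / `COORD_TYPES` grouping, and the use of a nested list as an already-decoded mask are all assumptions made for the example.

```python
import math
from typing import Optional, Sequence, Tuple

# Sub-task names come from the table above; the two groupings are illustrative.
MCQ_TYPES = {"Visibility Reasoning", "Correspondence-Judgement"}
COORD_TYPES = {"Fine-grained Grounding", "Correspondence-Pointing"}


def point_in_mask(point: Optional[Tuple[float, float]],
                  mask: Sequence[Sequence[int]]) -> bool:
    """Return True if the pixel containing (x, y) is nonzero in the mask."""
    if point is None or not mask:
        return False
    x, y = math.floor(point[0]), math.floor(point[1])
    height, width = len(mask), len(mask[0])
    if not (0 <= x < width and 0 <= y < height):
        return False  # a point outside the image can never be a hit
    return bool(mask[y][x])  # masks are indexed [row][column], i.e. [y][x]


def score(task_type: str, prediction, target) -> float:
    """Branch on task type: letter match for MCQ, in-mask hit for coordinates."""
    if task_type in MCQ_TYPES:
        if not isinstance(prediction, str):
            return 0.0
        return float(prediction.strip().upper() == str(target).strip().upper())
    if task_type in COORD_TYPES:
        return float(point_in_mask(prediction, target))
    raise ValueError(f"unknown task type: {task_type}")
```

A prediction that could not be parsed is passed as `None` and scores 0 in either branch.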

Dataset: `WangYipu2002/CrossPoint-Bench`. The JSONL is loaded via `load_dataset`; image files live alongside under `image/` and are fetched once via `snapshot_download` (cached on subsequent calls).
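One way the loading described above could look is sketched below. The helper names are hypothetical, and restricting the download with `allow_patterns=["image/*"]` is an assumption; the PR may simply snapshot the whole repository.

```python
import os

REPO_ID = "WangYipu2002/CrossPoint-Bench"


def load_docs(repo_id: str = REPO_ID):
    """Load the JSONL annotations through the datasets library."""
    from datasets import load_dataset  # imported lazily; requires `datasets`
    return load_dataset(repo_id)


def fetch_image_root(repo_id: str = REPO_ID) -> str:
    """Download the image files once and return the local snapshot folder.

    snapshot_download reuses the local Hugging Face cache, so later calls
    do not download the files again.
    """
    from huggingface_hub import snapshot_download  # requires `huggingface_hub`
    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=["image/*"],  # assumption: only the image folder is needed
    )


def resolve_image(root: str, relative_path: str) -> str:
    """Join the snapshot root with a repo-relative path such as image/foo.png."""
    return os.path.join(root, relative_path)
```

Keeping the snapshot root separate from the per-document path means the download cost is paid once per process rather than once per document.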

Paper: https://arxiv.org/abs/2512.04686

Files

`lmms_eval/tasks/crosspoint_bench/crosspoint_bench.yaml` — task config.
`lmms_eval/tasks/crosspoint_bench/utils.py` — doc transforms, answer extraction (JSON + regex fallbacks for coords, multi-pattern letter extraction for MCQ), scoring (mask decoding + per-task-type branch), and aggregation (overall + per-type + per-level + per-type/level).
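The extraction step could be sketched along these lines. The exact patterns in `utils.py` are not shown in this description, so the `{"x": ..., "y": ...}` JSON shape, the A-D option range, and the specific regexes below are assumptions chosen for illustration.

```python
import json
import re
from typing import Optional, Tuple

_NUM = r"-?\d+(?:\.\d+)?"

# Tried in order; the first pattern that matches wins.
_LETTER_PATTERNS = [
    r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-D])\)?\b",  # "The answer is B", "Answer: (B)"
    r"^\s*\(?([A-D])\)?[.:)]?\s*$",                     # a bare letter on its own line
    r"\(([A-D])\)",                                     # "(B)" anywhere in the text
]


def extract_point(text: str) -> Optional[Tuple[float, float]]:
    """Parse a point from model output: JSON first, then a regex fallback."""
    for match in re.finditer(r"\{[^{}]*\}", text):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "x" in obj and "y" in obj:
            try:
                return float(obj["x"]), float(obj["y"])
            except (TypeError, ValueError):
                continue
    # Fallback: the first "number, number" pair, e.g. "(300, 410)" or "[300, 410]".
    pair = re.search(rf"({_NUM})\s*,\s*({_NUM})", text)
    if pair:
        return float(pair.group(1)), float(pair.group(2))
    return None


def extract_letter(text: str) -> Optional[str]:
    """Return the first option letter found by any pattern, else None."""
    for pattern in _LETTER_PATTERNS:
        found = re.search(pattern, text, flags=re.MULTILINE)
        if found:
            return found.group(1).upper()
    return None
```

Returning `None` on failure, rather than guessing, lets the scorer count an unparseable response as a miss.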

Coordinate format env var

Models output coordinates in different conventions: absolute pixels, normalized [0, 1], or scaled to [0, 1000] (e.g. Qwen-VL family). The task respects `CROSSPOINT_COORD_FORMAT`:

`absolute` (default) — raw model output is already in pixels.
`relative_1` — multiply by image (w, h).
`relative_1000` — divide by 1000 then multiply by image (w, h).
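The three conventions amount to the conversion below. This is a minimal sketch: the function name `to_pixels` and its signature are hypothetical, and only the environment variable name and its three values come from this PR.

```python
import os
from typing import Optional, Tuple


def to_pixels(x: float, y: float, width: int, height: int,
              coord_format: Optional[str] = None) -> Tuple[float, float]:
    """Convert a model-output point to absolute pixel coordinates."""
    # An explicit argument wins; otherwise read the env var, defaulting to absolute.
    fmt = coord_format or os.environ.get("CROSSPOINT_COORD_FORMAT", "absolute")
    if fmt == "absolute":
        return float(x), float(y)
    if fmt == "relative_1":
        return x * width, y * height
    if fmt == "relative_1000":
        return x / 1000.0 * width, y / 1000.0 * height
    raise ValueError(f"unsupported CROSSPOINT_COORD_FORMAT: {fmt!r}")
```

For a 640x480 image, `(0.5, 0.25)` under `relative_1` and `(500, 250)` under `relative_1000` both map to pixel `(320.0, 120.0)`.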

Parity vs. our local vlmevalkit implementation

Qwen3-VL-2B-Instruct, full 1,000 items, 6x H100 (`CROSSPOINT_COORD_FORMAT=relative_1000`), greedy decoding.

| Metric | Fork (vlmevalkit) | Upstream (lmms-eval) | Δ |
| --- | --- | --- | --- |
| Overall | 21.8 | 21.5 | -0.3 |
| Visibility Reasoning | 47.3 | 46.8 | -0.5 |
| Correspondence-Judgement | 41.0 | 39.1 | -1.9 |
| Fine-grained Grounding | 26.7 | 27.3 | +0.6 |
| Correspondence-Pointing | 1.5 | 1.5 | 0.0 |
| level/object | 22.4 | 23.1 | +0.7 |
| level/part | 21.3 | 20.1 | -1.2 |

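All sub-metrics agree within 2 pp. The largest gap is Correspondence-Judgement at -1.9 pp, which corresponds to 3 items out of 156. We treat the two implementations as essentially identical given binary-MCQ and small-sample noise, plus the `qwen3_vl` model-class drift we have observed on prior ports.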
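As a rough sanity check on the small-sample argument, the binomial standard error of each sub-task accuracy can be estimated as below. The helper name is hypothetical, and the estimate treats items as independent draws, which is only an approximation since both runs score the same items.

```python
import math


def binomial_se_pp(accuracy_pct: float, n: int) -> float:
    """Standard error of an accuracy, in percentage points, for n items."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Fork accuracies and sub-task sizes taken from the tables in this PR.
se_judgement = binomial_se_pp(41.0, 156)   # Correspondence-Judgement
se_visibility = binomial_se_pp(47.3, 220)  # Visibility Reasoning
```

Under this approximation the standard errors are roughly 3.9 pp and 3.4 pp, so the observed gaps of 1.9 pp and 0.5 pp each fall within one standard error.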

Test plan

`uv run lmms-eval --tasks crosspoint_bench --limit 10` smoke test (4 task types exercised)
Full 1,000-doc run on 6x H100 with Qwen3-VL-2B-Instruct; scores match the fork's vlmevalkit numbers within 2pp on every sub-metric
Per-doc scoring verified: coordinate-in-mask hit detection works against the dataset's base64-encoded PNG ground-truth masks
Image cache via `huggingface_hub.snapshot_download` works (1,445 files cached on first call, reused on subsequent calls)

CrossPoint-Bench is a 1,000-sample cross-view point correspondence benchmark covering four sub-tasks at two granularity levels (object / part):

- Fine-grained Grounding (161, coordinate output, in-mask hit)
- Visibility Reasoning (220, binary MCQ)
- Correspondence-Judgement (156, MCQ)
- Correspondence-Pointing (463, coordinate output, in-mask hit)

Dataset: WangYipu2002/CrossPoint-Bench on HuggingFace. The JSONL is loaded via load_dataset; image files live alongside under image/ and are fetched once via snapshot_download (cached on subsequent calls).

Metric: crosspoint_accuracy, with task-type-aware scoring (MCQ letter match for the two MCQ subtypes, point-in-mask hit for the coordinate subtypes). Per-type and per-level breakdowns are printed at aggregation time.

The CROSSPOINT_COORD_FORMAT env var (absolute|relative_1|relative_1000) controls how coordinate outputs are interpreted before being checked against the ground-truth mask. Default: absolute.

Reference: https://arxiv.org/abs/2512.04686
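The per-type and per-level breakdowns could be computed with an aggregation along these lines. This is a sketch only: the record keys `type`, `level`, and `score` and the bucket naming are hypothetical, and the real aggregation in `utils.py` may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def aggregate(results: Iterable[Mapping]) -> Dict[str, float]:
    """Average per-document scores into overall and breakdown accuracies (%)."""
    buckets = defaultdict(list)
    for record in results:
        value = float(record["score"])
        task_type, level = record["type"], record["level"]
        # Each document contributes to four buckets.
        buckets["overall"].append(value)
        buckets[f"type/{task_type}"].append(value)
        buckets[f"level/{level}"].append(value)
        buckets[f"type/{task_type}/level/{level}"].append(value)
    return {name: 100.0 * sum(vals) / len(vals) for name, vals in buckets.items()}
```

Because the overall figure is averaged over documents rather than over sub-tasks, larger sub-tasks such as Correspondence-Pointing weigh more heavily in it.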

njb-nvidia wants to merge 1 commit into EvolvingLMMs-Lab:main from njb-nvidia:add-crosspoint_bench-task
