[Benchmark Backfill] Integrate Point-Bench into lmms-eval #1157

Merged
Luodian merged 1 commit into dev-v0d7 from feat/lmm-293-point-bench on Feb 23, 2026

Luodian commented Feb 22, 2026

Summary

  • Integrate a new pointbench task (YAML config plus utils) following existing task conventions.
  • Load PointArena metadata from data.json, fetch per-sample images through the HF datasets-server rows API, and resolve masks from selected_masks.zip.
  • Add pointbench_acc point-in-mask scoring and register the task's docs mapping in docs/current_tasks.md.
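
For reviewers, the point-in-mask idea behind pointbench_acc can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (point_in_mask, score_sample, mean_accuracy), the nested-list mask layout, and the rule that a sample counts as a hit only when at least one point is predicted and every point lands inside the mask are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]           # (x, y) in pixel coordinates
Mask = Sequence[Sequence[bool]]   # mask[y][x] is True inside the target region


def point_in_mask(point: Point, mask: Mask) -> bool:
    """Return True if the point is within bounds and on a True mask pixel."""
    x, y = point
    height = len(mask)
    width = len(mask[0]) if height else 0
    if not (0 <= x < width and 0 <= y < height):
        return False
    return bool(mask[y][x])


def score_sample(points: List[Point], mask: Mask) -> float:
    """Score one sample as 1.0 (hit) or 0.0 (miss).

    Assumed rule (hypothetical): a hit needs at least one predicted point,
    and every predicted point must lie inside the mask.
    """
    if not points:
        return 0.0
    return 1.0 if all(point_in_mask(p, mask) for p in points) else 0.0


def mean_accuracy(scores: List[float]) -> float:
    """Average per-sample scores; an empty list yields 0.0."""
    return sum(scores) / len(scores) if scores else 0.0
```

Under this sketch, an unparseable or empty model response scores 0.0 for that sample, and the reported metric is the plain mean over samples.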

Validation Evidence

  • /Users/luodian/Github/lmms-eval/.venv/bin/python -m lmms_eval --tasks list -> includes pointbench in available tasks.
  • /Users/luodian/Github/lmms-eval/.venv/bin/python -m lmms_eval --model dummy_video_reader --model_args response="[(500,500)]",allow_remote=true,fail_on_missing=false --tasks pointbench --limit 8 --batch_size 1 --output_path ./logs/lmm-293-pointbench-smoke -> succeeds with pointbench_acc=0.125.
  • /Users/luodian/Github/lmms-eval/.venv/bin/python -m unittest discover -s test/eval -p "test_cli_parse_args.py" -> OK.

Tracking

Smoke Validation (limit=8)

Status: PASS (LMM-293 / pointbench)

Output Table

| Metric | Value |
| --- | --- |
| pointbench_acc | 0.03125 |

Sample Output

Sample 1 (doc_id: 0)

  • Input: Point to the free space between the person in a black shirt and the car. Your answer should be formatted as a list of tuples, i.e. [(x1, y1), (x2, y2), ...], where each tuple contains the x and y coordinates of a point satisfying the conditions above. The coordinates should be integers between 0 and…
  • Model Output: [LMMS_EVAL_REQUEST_FAILED after 5 retries] Error code: 404 - {'error': {'message': 'No endpoints found for google/gemini-flash-1.5.', 'code': 404}, 'user_id': 'user_2sYkuU3dimruZBqpDO0almnSIBN'}
  • Reference:
  • Scores: N/A
  • Tokens: output=0, reasoning=0

Sample 2 (doc_id: 1)

  • Input: Point to the tool used for cutting wood. Your answer should be formatted as a list of tuples, i.e. [(x1, y1), (x2, y2), ...], where each tuple contains the x and y coordinates of a point satisfying the conditions above. The coordinates should be integers between 0 and 999, representing the pixel loc…
  • Model Output: [LMMS_EVAL_REQUEST_FAILED after 5 retries] Error code: 404 - {'error': {'message': 'No endpoints found for google/gemini-flash-1.5.', 'code': 404}, 'user_id': 'user_2sYkuU3dimruZBqpDO0almnSIBN'}
  • Reference:
  • Scores: N/A
  • Tokens: output=0, reasoning=0
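
The prompts above ask for a list of integer tuples such as [(x1, y1), (x2, y2), ...] with coordinates between 0 and 999. A minimal sketch of how such a response could be parsed is shown below. The function name parse_points and the choice to silently drop out-of-range tuples are assumptions for illustration, not necessarily what the task's utils do.

```python
import re
from typing import List, Tuple

# Matches "(x, y)" with integer coordinates and optional whitespace.
_TUPLE_RE = re.compile(r"\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)")


def parse_points(response: str, max_coord: int = 999) -> List[Tuple[int, int]]:
    """Extract (x, y) integer tuples from a model response.

    Assumption (hypothetical): tuples with a coordinate outside
    [0, max_coord] are dropped rather than clamped.
    """
    points = []
    for x_str, y_str in _TUPLE_RE.findall(response):
        x, y = int(x_str), int(y_str)
        if 0 <= x <= max_coord and 0 <= y <= max_coord:
            points.append((x, y))
    return points
```

With this sketch, a failed-request string like the Model Output shown in the samples above contains no coordinate tuples and parses to an empty list.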

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks pointbench --batch_size 1 --limit 8 --log_samples

Luodian merged commit 17bf443 into dev-v0d7 on Feb 23, 2026
2 checks passed
Luodian deleted the feat/lmm-293-point-bench branch on February 23, 2026 08:25