
[Benchmark Backfill] Integrate OSWorld-Verified parity into lmms-eval#1165

Merged: Luodian merged 1 commit into dev-v0d7 from feat/lmm-298-osworld-verified on Feb 23, 2026

@Luodian Luodian commented Feb 22, 2026

Summary

  • Align osworld_g scoring with OSWorld-Verified annotation format: box_type=polygon entries in MMInstruction/OSWorld-G are 4-value xyxy, so they are now evaluated as rectangles for click-in-region correctness.
  • Preserve ray-casting polygon handling for true polygons while safely rejecting malformed odd-length coordinate arrays.
  • Add OSWorld-Verified mapping in docs/current_tasks.md under Spatial & Grounding -> Referring Expression Comprehension.
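The region check described in the first two bullets can be sketched roughly as follows. This is a minimal illustration, not the code in lmms_eval/tasks/osworld_g/utils.py: the function names (`point_in_rect`, `point_in_polygon`, `click_in_region`) are hypothetical, and it assumes coordinates arrive as a flat list of numbers in the same space as the click point.

```python
from typing import Sequence


def point_in_rect(x: float, y: float, box: Sequence[float]) -> bool:
    """Return True if (x, y) lies inside a 4-value xyxy rectangle (edges inclusive)."""
    x1, y1, x2, y2 = box
    left, right = min(x1, x2), max(x1, x2)
    top, bottom = min(y1, y2), max(y1, y2)
    return left <= x <= right and top <= y <= bottom


def point_in_polygon(x: float, y: float, coords: Sequence[float]) -> bool:
    """Ray-casting test over a flat [x0, y0, x1, y1, ...] vertex list."""
    n = len(coords) // 2
    inside = False
    j = n - 1  # index of the previous vertex, starting with the last one
    for i in range(n):
        xi, yi = coords[2 * i], coords[2 * i + 1]
        xj, yj = coords[2 * j], coords[2 * j + 1]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def click_in_region(x: float, y: float, coords: Sequence[float]) -> bool:
    """Hypothetical dispatcher: 4 values are a rectangle, otherwise a polygon."""
    if len(coords) == 4:
        # Assumption from the PR summary: box_type=polygon entries with
        # 4 values are xyxy rectangles, not two-vertex polygons.
        return point_in_rect(x, y, coords)
    if len(coords) % 2 != 0 or len(coords) < 6:
        # Malformed input (odd length or fewer than 3 vertices) is rejected.
        return False
    return point_in_polygon(x, y, coords)
```

Rejecting malformed arrays by returning False (rather than raising) keeps a single bad annotation from aborting a run, at the cost of counting that sample as incorrect.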

Validation

  • uv run pre-commit run --files lmms_eval/tasks/osworld_g/utils.py docs/current_tasks.md
  • uv run python -c "from datasets import load_dataset; from lmms_eval.tasks.osworld_g.utils import osworld_g_process_results; ds=load_dataset('MMInstruction/OSWorld-G', split='test'); poly=next(d for d in ds if d['box_type']=='polygon'); x1,y1,x2,y2=poly['mimo_bbox']; pred=f'[{(x1+x2)/2:.2f}, {(y1+y2)/2:.2f}]'; print(osworld_g_process_results(poly,[pred])['osworld_g_polygon_acc']['correct'])" -> True
  • uv run python -m lmms_eval --tasks list -> includes osworld_g
  • OPENAI_API_KEY="$OPENROUTER_API_KEY" OPENAI_API_BASE="https://openrouter.ai/api/v1" uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks osworld_g --batch_size 1 --limit 1 --output_path ./logs/lmm_298_osworld_smoke --log_samples --verbosity INFO -> completed and wrote outputs under logs/lmm_298_osworld_smoke/...
  • lsp_diagnostics clean for lmms_eval/tasks/osworld_g/utils.py

Smoke Validation (limit=8)

Status: PASS (LMM-298 / osworld_g)

Output Table

| Metric | Value |
| --- | --- |
| osworld_g_acc | 0.0 |
| osworld_g_bbox_acc | 0.0 |
| osworld_g_polygon_acc | 0.0 |
| osworld_g_refusal_acc | 0.0 |

Sample Output

Sample 1 (doc_id: 0)

  • Input: Identify the UI target for the instruction and output exactly one click point as [x, y]. You may use either normalized coordinates in [0, 1] or absolute pixel coordinates. If the target does not exist, output [-1, -1]. ↵ Instruction: Open the filter function for search settings.
  • Model Output: [678, 322]
  • Reference: Open the filter function for search settings.
  • Scores: N/A
  • Tokens: output=1579, reasoning=1569

Sample 2 (doc_id: 1)

  • Input: Identify the UI target for the instruction and output exactly one click point as [x, y]. You may use either normalized coordinates in [0, 1] or absolute pixel coordinates. If the target does not exist, output [-1, -1]. ↵ Instruction: Unfold the drop-down bar of Auto Save settings.
  • Model Output: [603 485]
  • Reference: Unfold the drop-down bar of Auto Save settings.
  • Scores: N/A
  • Tokens: output=1259, reasoning=1250
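For reference, the prompt's output contract (one [x, y] point, normalized or absolute pixels, [-1, -1] for a missing target) could be parsed along these lines. This is a hedged sketch only: `parse_click` is a hypothetical helper, not the parser used by osworld_g, and two of its behaviors are assumptions: it tolerates a space-separated pair such as `[603 485]`, and it treats any point with both values in [0, 1] as normalized.

```python
import re
from typing import Optional, Tuple

# One bracketed pair of numbers, separated by a comma or whitespace.
_POINT = re.compile(
    r"\[\s*(-?\d+(?:\.\d+)?)\s*[,\s]\s*(-?\d+(?:\.\d+)?)\s*\]"
)


def parse_click(
    text: str, width: int, height: int
) -> Optional[Tuple[float, float, bool]]:
    """Return (x_px, y_px, is_refusal), or None if no point is found.

    Hypothetical helper: width and height are the screenshot size in pixels.
    """
    match = _POINT.search(text)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    if x == -1 and y == -1:
        # The prompt reserves [-1, -1] for "target does not exist".
        return (x, y, True)
    if 0 <= x <= 1 and 0 <= y <= 1:
        # Assumption: values inside [0, 1] are normalized coordinates.
        return (x * width, y * height, False)
    return (x, y, False)
```

With this sketch both sample outputs above would yield a pixel point, so it cannot by itself explain the 0.0 metrics in the smoke table.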

Test Params

uv run python -m lmms_eval --model openai_compatible --model_args "model_version=bytedance-seed/seed-1.6-flash" --tasks osworld_g --batch_size 1 --limit 8 --log_samples

@Luodian Luodian merged commit 8f6931b into dev-v0d7 Feb 23, 2026
2 checks passed
@Luodian Luodian deleted the feat/lmm-298-osworld-verified branch February 23, 2026 08:25