screenspot_pro

A tiny baseline plus mock eval for ScreenSpot Pro, implemented as a Prime Environment. The repo is prepared for the Prime Environments bounty submission and is fully self-contained: a synthetic mock dataset, a simple baseline and evaluation with metrics, and CI that produces per-example outputs plus a calibration PNG.

Source / Fork Link: https://github.com/stuck-inadream/screenspot_pro


Quickstart (local)

python -m venv .venv && source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
# or: pip install -e .  (if using pyproject.toml)
PYTHONPATH=. pytest -q

# run the tiny smoke eval on generated mock data
PYTHONPATH=. python -m screenspot_pro.eval \
  --annotations data/mock_screenspot_pro/annotations.jsonl \
  --root . --subset 4 --max_resolution 1200 \
  --baseline text \
  --per_example_file out_text_scaled.json \
  --calibration_png calib_text_scaled.png

Outputs
out_text_scaled.json — JSONL with one record per example (success, IoU, etc.)
calib_text_scaled.png — qualitative calibration image
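
For a quick sanity check, the per-example file can be read line by line as JSONL; a minimal sketch, assuming the records carry success and iou fields as described above (the exact schema may differ):

import json

records = []
with open("out_text_scaled.json") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

n = max(len(records), 1)
success_rate = sum(1 for r in records if r.get("success")) / n
mean_iou = sum(r.get("iou", 0.0) for r in records) / n
print(f"examples={len(records)} success_rate={success_rate:.2f} mean_iou={mean_iou:.3f}")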

CI
GitHub Actions builds a minimal environment, generates 4 mock screenshots + annotations, runs the smoke eval, summarizes results, and uploads artifacts:
/tmp/out_text_scaled_ci.json
/tmp/summary.json
calib_text_scaled.png

See latest artifacts in Actions → eval-smoke.

Mock Dataset
During CI (and in the quickstart), we synthesize four 1200×337 images with colored UI bars and a single labeled target box each. The paired annotations.jsonl contains rows like:
{"image_path":"data/mock_screenspot_pro/mock_0.png","instruction":"click the File menu","bbox":[10,10,110,40],"target_type":"text"}

Metrics
screenspot_pro/metrics.py implements:
iou(a, b) — intersection over union of two boxes
center_in_box(pred, gold) — auxiliary check that the center of the predicted box falls inside the gold box
summarize(per) → {"success_rate": ..., "text_success_rate": ..., "icon_success_rate": ...}

On the mock smoke test we typically see ~75% success (3/4) with the trivial baseline.
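
A minimal sketch of the metric logic, assuming boxes are [x0, y0, x1, y1] in pixels; the real implementations in screenspot_pro/metrics.py may handle edge cases differently:

def iou(a, b):
    # Intersection over union of two [x0, y0, x1, y1] boxes.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def center_in_box(pred, gold):
    # True if the center of the predicted box falls inside the gold box.
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    return gold[0] <= cx <= gold[2] and gold[1] <= cy <= gold[3]

def summarize(per):
    # per: list of per-example dicts, assumed to carry "success" and "target_type".
    def rate(rows):
        return sum(1 for r in rows if r["success"]) / len(rows) if rows else 0.0
    return {
        "success_rate": rate(per),
        "text_success_rate": rate([r for r in per if r["target_type"] == "text"]),
        "icon_success_rate": rate([r for r in per if r["target_type"] == "icon"]),
    }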

Structure
screenspot_pro/
  __init__.py
  eval.py          # cli entry: python -m screenspot_pro.eval ...
  metrics.py       # iou + summarize
data/
  mock_screenspot_pro/  # created on the fly
tests/
  ...              # a couple of tiny unit tests
.github/workflows/ci.yml

Notes for Prime Reviewers
Self-contained; no external datasets required for the smoke test.
Works with Python 3.10+. No API keys needed.
Produces per-example outputs + a calibration PNG on each CI run.
Stylistic conformance via ruff (config in pyproject.toml).
Contact / Credit: @stuck-inadream


ScreenSpot Pro – Eval Results
Model: gpt-4o-mini
Images max width: 768 px
Examples: 10
Avg IoU (vision): 0.054
Avg IoU (heuristic): 0.054

Notes: The mock UI dataset is simple; a rule-based prior (menus top-left, status bar bottom, etc.) already captures most of the signal. The vision pipeline runs end-to-end with rate-limit backoff and saves artifacts to outputs/evals/final/.
Artifacts: screenspot_eval_results.tgz (contains predictions.jsonl, summary.txt).
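
The repo's backoff logic is not reproduced here; an illustrative retry wrapper with exponential backoff and jitter might look like this:

import random
import time

def with_backoff(call, max_tries=5, base=1.0):
    # Retry a flaky API call, sleeping longer after each failure.
    for attempt in range(max_tries):
        try:
            return call()
        except Exception:  # e.g. a rate-limit error surfaced by the client
            if attempt == max_tries - 1:
                raise
            time.sleep(base * (2 ** attempt) + random.random())
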
Conclusion (Mock ScreenSpot-Pro)
Using gpt-4o-mini at MAX_W=768 on K=10 examples, the vision baseline achieved Avg IoU = 0.054, which matches a simple UI-prior heuristic (0.054). Many model outputs were truncated (e.g., vision_raw: "[6, 6, 66"), leading to oversized default boxes and the heuristic dominating. On this tiny mock set, the heuristic is sufficient; for real screenshots, expect larger gains from (a) higher image resolution (MAX_W=1024–1280), (b) a slightly larger output budget (MAX_OUT_TOK≈12), and/or (c) a stronger model (MODEL=gpt-4o). Reproducible artifacts are in outputs/evals/final/summary.txt and outputs/evals/final/predictions.jsonl.
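
To make the truncation failure mode concrete, a forgiving parser can recover whatever integers survive and fall back to a default box otherwise; this is a sketch, and the eval's actual parsing and default-box logic may differ:

import re

def parse_bbox(raw, default):
    # Pull up to four integers out of possibly truncated text like "[6, 6, 66".
    nums = [int(n) for n in re.findall(r"-?\d+", raw)][:4]
    return nums if len(nums) == 4 else default

print(parse_bbox("[6, 6, 66", default=[0, 0, 1200, 337]))  # -> [0, 0, 1200, 337]
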
Verifiers quickstart
Install verifiers (if needed)
uv add verifiers

Install this environment into verifiers
vf-install screenspot_pro --from-repo

Run a small eval and save outputs
vf-eval screenspot_pro -s --env-args '{"annotations":"environments/screenspot_pro/data/mock_screenspot_pro/annotations.jsonl","root":"environments/screenspot_pro","subset":4,"baseline":"text"}'

Open the saved run
vf-tui
