This PR implements ScreenSpot Pro as a Prime Environment: a fully self-contained mock eval with synthetic dataset, baseline, metrics, and CI artifacts.
Tiny baseline and mock eval for ScreenSpot Pro. The repository is prepared for the Prime Environments bounty submission: a self-contained mock dataset, a simple evaluation, and CI that produces per-example outputs plus a calibration PNG.
Source / Fork Link: https://github.com/stuck-inadream/screenspot_pro
Quickstart
python -m venv .venv && source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
# or: pip install -e . (if using pyproject.toml)
PYTHONPATH=. pytest -q
# run the tiny smoke eval on generated mock data
PYTHONPATH=. python -m screenspot_pro.eval \
--annotations data/mock_screenspot_pro/annotations.jsonl \
--root . --subset 4 --max_resolution 1200 \
--baseline text \
--per_example_file out_text_scaled.json \
--calibration_png calib_text_scaled.png
Outputs
out_text_scaled.json — per-example results, one JSON record per line (success, IoU, etc.)
calib_text_scaled.png — qualitative calibration image
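A minimal sketch of inspecting the per-example file, assuming each line is a JSON object with success, iou, and target_type fields (names inferred from the description above, not guaranteed to match the exact schema):

import json

records = []
with open("out_text_scaled.json") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

# Overall success rate across all examples
success_rate = sum(bool(r.get("success")) for r in records) / max(len(records), 1)
print(f"{len(records)} examples, success_rate = {success_rate:.2f}")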
CI
GitHub Actions builds a minimal environment, generates 4 mock screenshots + annotations, runs the smoke eval, summarizes results, and uploads artifacts:
/tmp/out_text_scaled_ci.json
/tmp/summary.json
calib_text_scaled.png
See latest artifacts in Actions → eval-smoke.
Mock Dataset
During CI (and in the quickstart), we synthesize 4 × 1200×337 images with colored UI bars and a single labeled target box each. The paired annotations.jsonl contains rows like:
{"image_path":"data/mock_screenspot_pro/mock_0.png","instruction":"click the File menu","bbox":[10,10,110,40],"target_type":"text"}
Metrics
screenspot_pro/metrics.py implements:
iou(a, b) — intersection over union
center_in_box(pred, gold) — auxiliary check that the center of the predicted box falls inside the gold box
summarize(per) → {"success_rate": ..., "text_success_rate": ..., "icon_success_rate": ...}
On the mock smoke test we typically see ~75% success (3/4) with the trivial baseline.
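A minimal sketch of these functions, assuming axis-aligned [x1, y1, x2, y2] boxes and per-example records shaped like the output rows above; the repository's metrics.py is the source of truth:

def iou(a, b):
    # a, b: [x1, y1, x2, y2] axis-aligned boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def summarize(per):
    # per: list of records like {"success": bool, "target_type": "text" | "icon", ...}
    def rate(rows):
        return sum(r["success"] for r in rows) / len(rows) if rows else 0.0
    return {
        "success_rate": rate(per),
        "text_success_rate": rate([r for r in per if r["target_type"] == "text"]),
        "icon_success_rate": rate([r for r in per if r["target_type"] == "icon"]),
    }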
Structure
screenspot_pro/
    __init__.py
    eval.py        # CLI entry: python -m screenspot_pro.eval ...
    metrics.py     # iou + summarize
data/
    mock_screenspot_pro/   # created on the fly
tests/
    ...                    # a couple of tiny unit tests
.github/workflows/ci.yml
Notes for Prime Reviewers
Self-contained; no external datasets required for smoke test.
Works with Python 3.10+. No API keys needed.
Produces per-example outputs + a calibration PNG on each CI run.
Stylistic conformance via ruff (config in pyproject.toml).
Contact / Credit: @stuck-inadream
ScreenSpot Pro – Eval Results
Model: gpt-4o-mini
Images max width: 768 px
Examples: 10
Avg IoU (vision): 0.054
Avg IoU (heuristic): 0.054
Notes: Mock UI dataset is simple; a rule-based prior (menus top-left, status bar bottom, etc.) already captures most signal. Vision pipeline runs end-to-end with rate-limit backoff and saves artifacts to outputs/evals/final/.
Artifacts: screenspot_eval_results.tgz (contains predictions.jsonl, summary.txt).
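The rate-limit backoff mentioned above is not spelled out here; a generic sketch of the pattern, with call standing in for any rate-limited API invocation (hypothetical helper, not the repo's implementation):

import random, time

def with_backoff(call, max_retries=5, base_delay=1.0):
    # Retry `call` with exponential backoff plus jitter on failures.
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # in practice, catch the client's rate-limit error class
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
    return call()  # final attempt; let any exception propagate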
Conclusion (Mock ScreenSpot-Pro)
Using gpt-4o-mini at MAX_W=768 on K=10 examples, the vision baseline achieved Avg IoU = 0.054, which matches a simple UI-prior heuristic (0.054). Many model outputs were truncated (e.g., vision_raw: "[6, 6, 66"), leading to oversized default boxes and the heuristic dominating. On this tiny mock set, the heuristic is sufficient; for real screenshots, expect larger gains from (a) higher image resolution (MAX_W=1024–1280), (b) a slightly larger output budget (MAX_OUT_TOK≈12), and/or (c) a stronger model (MODEL=gpt-4o). Reproducible artifacts are in outputs/evals/final/summary.txt and outputs/evals/final/predictions.jsonl.
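To make the truncation failure mode concrete, here is a hedged sketch of the kind of parse-with-fallback logic it implies, assuming the model is prompted to answer with [x1, y1, x2, y2]; the function and parameter names are illustrative, not the repo's API:

import re

def parse_bbox(raw, default_box):
    # Extract up to four integers from the model text; fall back if fewer than four.
    nums = [int(n) for n in re.findall(r"-?\d+", raw or "")]
    if len(nums) >= 4:
        x1, y1, x2, y2 = nums[:4]
        if x2 > x1 and y2 > y1:
            return [x1, y1, x2, y2]
    return default_box  # e.g. a large image-sized box, which is why truncation hurts IoU

With this logic, a truncated reply such as "[6, 6, 66" yields only three integers and falls back to the default box, matching the oversized-default-box behavior described above.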
Verifiers quickstart
Install verifiers (if needed)
uv add verifiers
Install this environment into verifiers
vf-install screenspot_pro --from-repo
Run a small eval and save outputs
vf-eval screenspot_pro -s --env-args '{"annotations":"environments/screenspot_pro/data/mock_screenspot_pro/annotations.jsonl","root":"environments/screenspot_pro","subset":4,"baseline":"text"}'
Open the saved run
vf-tui