This PR implements ScreenSpot Pro as a Prime Environment: a fully self-contained mock eval with synthetic dataset, baseline, metrics, and CI artifacts.
Tiny baseline and mock eval for ScreenSpot Pro. The repository is prepared for the Prime Environments bounty submission: a self-contained mock dataset, a simple evaluation, and CI that produces per-example outputs plus a calibration PNG.
Source / Fork Link: https://github.com/stuck-inadream/screenspot_pro
Quickstart
python -m venv .venv && source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
# or: pip install -e . (if using pyproject.toml)
PYTHONPATH=. pytest -q
# run the tiny smoke eval on generated mock data
PYTHONPATH=. python -m screenspot_pro.eval \
--annotations data/mock_screenspot_pro/annotations.jsonl \
--root . --subset 4 --max_resolution 1200 \
--baseline text \
--per_example_file out_text_scaled.json \
--calibration_png calib_text_scaled.png
Outputs
out_text_scaled.json — per-example results, one JSON record per line (success, IoU, etc.)
calib_text_scaled.png — qualitative calibration image
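A minimal sketch of inspecting the per-example file, assuming each line is a JSON object with success, iou, and target_type fields (names inferred from the description above, not guaranteed to match the exact schema):

import json

records = []
with open("out_text_scaled.json") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

# Overall success rate across all examples
success_rate = sum(bool(r.get("success")) for r in records) / max(len(records), 1)
print(f"{len(records)} examples, success_rate = {success_rate:.2f}")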
CI
GitHub Actions builds a minimal environment, generates 4 mock screenshots + annotations, runs the smoke eval, summarizes results, and uploads artifacts:
/tmp/out_text_scaled_ci.json
/tmp/summary.json
calib_text_scaled.png
See latest artifacts in Actions → eval-smoke.
Mock Dataset
During CI (and in the quickstart), we synthesize 4 × 1200×337 images with colored UI bars and a single labeled target box each. The paired annotations.jsonl contains rows like:
{"image_path":"data/mock_screenspot_pro/mock_0.png","instruction":"click the File menu","bbox":[10,10,110,40],"target_type":"text"}
Metrics
screenspot_pro/metrics.py implements:
iou(a, b) — intersection over union
center_in_box(pred, gold) — auxiliary check that the center of the predicted box falls inside the gold box
summarize(per) → {"success_rate": ..., "text_success_rate": ..., "icon_success_rate": ...}
On the mock smoke test we typically see ~75% success (3/4) with the trivial baseline.
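A minimal sketch of these functions, assuming axis-aligned [x1, y1, x2, y2] boxes and per-example records shaped like the output rows above; the repository's metrics.py is the source of truth:

def iou(a, b):
    # a, b: [x1, y1, x2, y2] axis-aligned boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def summarize(per):
    # per: list of records like {"success": bool, "target_type": "text" | "icon", ...}
    def rate(rows):
        return sum(r["success"] for r in rows) / len(rows) if rows else 0.0
    return {
        "success_rate": rate(per),
        "text_success_rate": rate([r for r in per if r["target_type"] == "text"]),
        "icon_success_rate": rate([r for r in per if r["target_type"] == "icon"]),
    }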
Structure
screenspot_pro/
    __init__.py
    eval.py        # CLI entry: python -m screenspot_pro.eval ...
    metrics.py     # iou + summarize
data/
    mock_screenspot_pro/   # created on the fly
tests/
    ...                    # a couple of tiny unit tests
.github/workflows/ci.yml
Notes for Prime Reviewers
Self-contained; no external datasets required for smoke test.
Works with Python 3.10+. No API keys needed.
Produces per-example outputs + a calibration PNG on each CI run.
Stylistic conformance via ruff (config in pyproject.toml).
Contact / Credit: @stuck-inadream
ScreenSpot Pro – Eval Results
Model: gpt-4o-mini
Images max width: 768 px
Examples: 10
Avg IoU (vision): 0.054
Avg IoU (heuristic): 0.054
Notes: Mock UI dataset is simple; a rule-based prior (menus top-left, status bar bottom, etc.) already captures most signal. Vision pipeline runs end-to-end with rate-limit backoff and saves artifacts to outputs/evals/final/.
Artifacts: screenspot_eval_results.tgz (contains predictions.jsonl, summary.txt).
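The rate-limit backoff mentioned above is not spelled out here; a generic sketch of the pattern, with call standing in for any rate-limited API invocation (hypothetical helper, not the repo's implementation):

import random, time

def with_backoff(call, max_retries=5, base_delay=1.0):
    # Retry `call` with exponential backoff plus jitter on failures.
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # in practice, catch the client's rate-limit error class
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
    return call()  # final attempt; let any exception propagate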
Conclusion (Mock ScreenSpot-Pro)
Using gpt-4o-mini at MAX_W=768 on K=10 examples, the vision baseline achieved Avg IoU = 0.054, which matches a simple UI-prior heuristic (0.054). Many model outputs were truncated (e.g., vision_raw: "[6, 6, 66"), leading to oversized default boxes and the heuristic dominating. On this tiny mock set, the heuristic is sufficient; for real screenshots, expect larger gains from (a) higher image resolution (MAX_W=1024–1280), (b) a slightly larger output budget (MAX_OUT_TOK≈12), and/or (c) a stronger model (MODEL=gpt-4o). Reproducible artifacts are in outputs/evals/final/summary.txt and outputs/evals/final/predictions.jsonl.
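To make the truncation failure mode concrete, here is a hedged sketch of the kind of parse-with-fallback logic it implies, assuming the model is prompted to answer with [x1, y1, x2, y2]; the function and parameter names are illustrative, not the repo's API:

import re

def parse_bbox(raw, default_box):
    # Extract up to four integers from the model text; fall back if fewer than four.
    nums = [int(n) for n in re.findall(r"-?\d+", raw or "")]
    if len(nums) >= 4:
        x1, y1, x2, y2 = nums[:4]
        if x2 > x1 and y2 > y1:
            return [x1, y1, x2, y2]
    return default_box  # e.g. a large image-sized box, which is why truncation hurts IoU

With this logic, a truncated reply such as "[6, 6, 66" yields only three integers and falls back to the default box, matching the oversized-default-box behavior described above.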
Verifiers quickstart
Install verifiers (if needed)
uv add verifiers
Install this environment into verifiers
vf-install screenspot_pro --from-repo
Run a small eval and save outputs
vf-eval screenspot_pro -s --env-args '{"annotations":"environments/screenspot_pro/data/mock_screenspot_pro/annotations.jsonl","root":"environments/screenspot_pro","subset":4,"baseline":"text"}'
Open the saved run
vf-tui