Skip to content

Commit ab2edf7

Browse files
iphone: coordinate-space handling for tap/swipe + screenshot metadata (#1317)
## Why `iphone.tap`/`swipe` take **points**; `iphone.screenshot` returns **pixels** (points × display scale, 3× on this device). Reading a coordinate off a screenshot and passing it to `tap` lands `scale`× too far — often off-screen entirely. Hit this live driving a real iPhone: taps read off screenshots silently missed. ## What - `tap(x, y, *, space="points")` and `swipe(..., space="points")` accept `"points"` (default, back-compat), `"pixels"` (screenshot coords), or `"fraction"` (0..1, robust to image rescaling). Conversion factored into a pure, unit-tested `_to_points`. - `screen()` and `window_size()` helpers expose geometry (points, pixels, scale). - `screenshot()` stamps `img.info` with `scale` / `point_size` / `pixel_size` so an image is self-describing. - UI actions (`tap`/`swipe`/`press`/`type_text`/`home`/`tap_element`) raise a one-line `IphoneError` naming `wda_start` when WDA is down, instead of a deep urllib stack. - Docstrings (the `api()`/`help()` source of truth) steer toward `tap_element(label=...)` and `space="fraction"` over hand-converted pixels. ## Verification Local: `nix build .#mcp.tests.iphoneTests` → 18 passed; `.#mcp.tests.iphoneBundled` → ok; `nix run .#lint` clean. Live experiment (real iPhone, scale 3, 402×874 pt), 8 on-screen targets: | arm | hit rate | |---|---| | baseline (pixels-as-points) | **0/8** | | `space="pixels"` | **8/8** | | `space="fraction"` | **8/8** | End-to-end: tapping a dock icon's pixel center as points did **not** launch the app (reproduces the bug); `space="pixels"` launched it. `tap_element(label=...)` and `screenshot().info` stamping confirmed live. <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will append its summary at the bottom of the description. --> > [!NOTE] > ### Add coordinate-space handling for `tap`/`swipe` and screenshot metadata to iPhone module > - `tap` and `swipe` now accept float coordinates and a `space` parameter (`'points'`, `'pixels'`, or `'fraction'`), resolving coordinates via new helpers `_to_points` and `_resolve_points` in [__init__.py](https://github.com/indexable-inc/index/pull/1317/files#diff-f92c2ac3b6f1cee30bc3c534269f16357522a2d23c4c9e1e984a8f32030227b7). > - New `screen` and `window_size` async helpers expose screen geometry (points, pixels, scale) by querying WDA. > - `screenshot` now populates `Image.info` with `scale`, `pixel_size`, and `point_size` when captured via WDA. > - All UI actions (`tap`, `swipe`, `press`, `type_text`, `home`, `tap_element`) now call `_require_wda()` and raise `IphoneError` immediately if WDA is unreachable. > - Cached display scale (`_wda_scale`) is reset on `wda_start` and `wda_stop` to prevent stale values across sessions or devices. > - Behavioral Change: `tap` and `swipe` signatures now accept `float` instead of `int`; passing an unknown space or out-of-range fraction raises `IphoneError`. > > <!-- Macroscope's review summary starts here --> > > <sup><a href="https://app.macroscope.com">Macroscope</a> summarized 1720daa.</sup> > <!-- Macroscope's review summary ends here --> > <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here -->
1 parent 1b47010 commit ab2edf7

2 files changed

Lines changed: 226 additions & 15 deletions

File tree

packages/mcp/src/iphone/iphone/__init__.py

Lines changed: 158 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -29,7 +29,9 @@
2929
full Xcode + an Apple ID in Xcode) and started (``wda_start(sudo=True)``);
3030
``source()`` returns the live accessibility tree (with rects) as a polars
3131
frame, which is ground truth for where to tap even when a GPU/Metal view
32-
screenshots black.
32+
screenshots black. Taps take POINTS; screenshots come back in PIXELS (``scale``×
33+
bigger). Prefer ``tap_element(label=...)`` or ``tap(px, py, space="pixels")`` /
34+
``space="fraction"`` over hand-converting (see ``screen()``).
3335
3436
Run ``iphone.doctor()`` to see exactly which prerequisite is missing.
3537
@@ -76,6 +78,7 @@
7678
"kill",
7779
"launch",
7880
"press",
81+
"screen",
7982
"screenshot",
8083
"simulate_location",
8184
"source",
@@ -92,6 +95,7 @@
9295
"wda_start",
9396
"wda_status",
9497
"wda_stop",
98+
"window_size",
9599
]
96100

97101
_TUNNELD_HOST = "127.0.0.1"
@@ -242,6 +246,11 @@ async def screenshot(udid: str | None = None) -> Image.Image:
242246
running, captures via WDA instead: while an XCUITest session holds the
243247
display, the DVT screenshot returns a black frame, so WDA is the reliable
244248
source during automation.
249+
250+
The image is in PIXELS, which is ``scale``× the point grid taps use. When WDA
251+
is up the returned image's ``.info`` carries ``scale``, ``point_size`` and
252+
``pixel_size`` so coordinates can be reconciled; to tap a spot you see here,
253+
pass ``tap(px, py, space="pixels")`` rather than the raw numbers.
245254
"""
246255
import base64
247256
import io
@@ -254,7 +263,12 @@ async def screenshot(udid: str | None = None) -> Image.Image:
254263
value = await _wda("GET", "/screenshot")
255264
png = base64.b64decode(str(value))
256265
with Image.open(io.BytesIO(png)) as img:
257-
return img.copy()
266+
shot = img.copy()
267+
scale = await _screen_scale()
268+
shot.info["scale"] = scale
269+
shot.info["pixel_size"] = shot.size
270+
shot.info["point_size"] = (round(shot.width / scale), round(shot.height / scale))
271+
return shot
258272

259273
target = await _resolve(udid)
260274
with tempfile.TemporaryDirectory() as tmp:
@@ -542,6 +556,9 @@ async def clear_location(udid: str | None = None) -> None:
542556
_wda_xcuitest_proc: asyncio.subprocess.Process | None = None
543557
_wda_forward_proc: asyncio.subprocess.Process | None = None
544558
_wda_session: str | None = None
559+
# Display scale (pixels per point) of the bound device. Fixed per device, so
560+
# cache it after the first lookup; cleared whenever a fresh runner is bound.
561+
_wda_scale: float | None = None
545562
# The device the running WDA runner/forward target, so wda_start does not silently
546563
# report "already running" for a different udid than the caller asked for.
547564
_wda_device: str | None = None
@@ -676,8 +693,111 @@ def walk(node: object) -> None:
676693
return pl.DataFrame(rows, schema=schema)
677694

678695

679-
async def tap(x: int, y: int) -> None:
680-
"""Tap at screen coordinates (points) on the active WDA device, via W3C actions."""
696+
async def _require_wda() -> None:
697+
"""Raise a one-line, actionable error when WDA is not reachable.
698+
699+
UI actions otherwise fail deep inside a urllib stack on the first HTTP call,
700+
which reads as a crash rather than a missing precondition.
701+
"""
702+
if not await _wda_up():
703+
raise IphoneError(
704+
"WebDriverAgent is not reachable; run iphone.wda_start(sudo=True) "
705+
"first (one-time install via iphone.wda_build_install())."
706+
)
707+
708+
709+
# Coordinate spaces a caller may supply. Taps/swipes/element rects are in POINTS
710+
# (what WDA wants); screenshots come back in PIXELS (points × display scale); a
711+
# FRACTION is a 0..1 share of the screen, invariant to how an image was rendered.
712+
_COORD_SPACES = ("points", "pixels", "fraction")
713+
714+
715+
def _to_points(
716+
x: float,
717+
y: float,
718+
space: str,
719+
scale: float,
720+
size: tuple[int, int],
721+
) -> tuple[int, int]:
722+
"""Convert an (x, y) in ``space`` to integer device points.
723+
724+
Pure (no I/O) so it is unit-testable: ``scale`` is pixels-per-point and
725+
``size`` is the screen's (width, height) in points.
726+
"""
727+
if space == "points":
728+
return round(x), round(y)
729+
if space == "pixels":
730+
return round(x / scale), round(y / scale)
731+
if space == "fraction":
732+
if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
733+
raise IphoneError(f"fraction coordinates must be in 0..1, got ({x}, {y})")
734+
return round(x * size[0]), round(y * size[1])
735+
raise IphoneError(f"unknown coordinate space {space!r}; use one of {_COORD_SPACES}")
736+
737+
738+
async def _screen_scale() -> float:
739+
"""Display scale (pixels per point) of the bound device, cached."""
740+
global _wda_scale
741+
if _wda_scale is None:
742+
info = await _wda_in_session("GET", "/wda/screen")
743+
scale = info.get("scale") if isinstance(info, dict) else None
744+
if not isinstance(scale, (int, float)) or scale <= 0:
745+
raise IphoneError(f"WDA reported no usable display scale: {info!r}")
746+
_wda_scale = float(scale)
747+
return _wda_scale
748+
749+
750+
async def window_size() -> tuple[int, int]:
751+
"""Screen size in points (width, height) for the active orientation."""
752+
value = await _wda_in_session("GET", "/window/size")
753+
if not isinstance(value, dict) or "width" not in value or "height" not in value:
754+
raise IphoneError(f"WDA window size returned no dimensions: {value!r}")
755+
return int(value["width"]), int(value["height"])
756+
757+
758+
async def screen() -> dict[str, object]:
759+
"""Screen geometry: points (w, h), pixels (w, h), and the pixels/point scale.
760+
761+
Use this to reconcile a screenshot (pixels) with taps (points): a control at
762+
pixel (px, py) in a screenshot is at point (px/scale, py/scale). Prefer
763+
``tap_element`` or ``tap(..., space="fraction")`` so you never do the math by
764+
hand.
765+
"""
766+
pts = await window_size()
767+
scale = await _screen_scale()
768+
return {
769+
"points": pts,
770+
"pixels": (round(pts[0] * scale), round(pts[1] * scale)),
771+
"scale": scale,
772+
}
773+
774+
775+
async def _resolve_points(x: float, y: float, space: str) -> tuple[int, int]:
776+
"""Turn caller coordinates in ``space`` into device points (fetching geometry
777+
only when the space actually needs it)."""
778+
if space == "points":
779+
return _to_points(x, y, space, 1.0, (0, 0))
780+
scale = await _screen_scale() if space == "pixels" else 1.0
781+
size = await window_size() if space == "fraction" else (0, 0)
782+
return _to_points(x, y, space, scale, size)
783+
784+
785+
async def tap(x: float, y: float, *, space: str = "points") -> None:
786+
"""Tap the screen via WDA's W3C actions.
787+
788+
``space`` selects the coordinate system of ``x``/``y``:
789+
``"points"`` (default, what WDA and ``source()`` rects use), ``"pixels"``
790+
(as read off a ``screenshot()``), or ``"fraction"`` (0..1 share of the
791+
screen, robust to image rescaling).
792+
793+
Prefer ``tap_element(label=...)`` when the control has an accessibility
794+
label, or ``space="fraction"`` when eyeballing a screenshot. Never feed pixel
795+
coordinates read off a screenshot as default (points): a screenshot is
796+
``scale``× larger than the point grid (see ``screen()``), so they overshoot
797+
and land off-screen.
798+
"""
799+
await _require_wda()
800+
px, py = await _resolve_points(x, y, space)
681801
await _wda_in_session(
682802
"POST",
683803
"/actions",
@@ -688,7 +808,7 @@ async def tap(x: int, y: int) -> None:
688808
"id": "finger1",
689809
"parameters": {"pointerType": "touch"},
690810
"actions": [
691-
{"type": "pointerMove", "duration": 0, "x": x, "y": y},
811+
{"type": "pointerMove", "duration": 0, "x": px, "y": py},
692812
{"type": "pointerDown", "button": 0},
693813
{"type": "pause", "duration": 90},
694814
{"type": "pointerUp", "button": 0},
@@ -700,7 +820,13 @@ async def tap(x: int, y: int) -> None:
700820

701821

702822
async def tap_element(*, name: str | None = None, label: str | None = None) -> None:
703-
"""Find an element by accessibility name or label and tap its center."""
823+
"""Find an element by accessibility name or label and tap its center.
824+
825+
The most reliable way to tap: rects come straight from the accessibility tree
826+
in points, so there is no coordinate-space guesswork. Prefer this over raw
827+
``tap`` whenever the target has a name/label.
828+
"""
829+
await _require_wda()
704830
frame = await source()
705831
import polars as pl
706832

@@ -720,14 +846,22 @@ async def tap_element(*, name: str | None = None, label: str | None = None) -> N
720846

721847

722848
async def swipe(
723-
start_x: int,
724-
start_y: int,
725-
end_x: int,
726-
end_y: int,
849+
start_x: float,
850+
start_y: float,
851+
end_x: float,
852+
end_y: float,
727853
*,
728854
duration: float = 0.3,
855+
space: str = "points",
729856
) -> None:
730-
"""Swipe between two screen coordinates (points) via the W3C Actions API."""
857+
"""Swipe between two screen coordinates via the W3C Actions API.
858+
859+
``space`` is the coordinate system of both endpoints — ``"points"`` (default),
860+
``"pixels"`` (screenshot coordinates), or ``"fraction"`` (0..1) — see ``tap``.
861+
"""
862+
await _require_wda()
863+
sx, sy = await _resolve_points(start_x, start_y, space)
864+
ex, ey = await _resolve_points(end_x, end_y, space)
731865
await _wda_in_session(
732866
"POST",
733867
"/actions",
@@ -738,9 +872,9 @@ async def swipe(
738872
"id": "finger1",
739873
"parameters": {"pointerType": "touch"},
740874
"actions": [
741-
{"type": "pointerMove", "duration": 0, "x": start_x, "y": start_y},
875+
{"type": "pointerMove", "duration": 0, "x": sx, "y": sy},
742876
{"type": "pointerDown", "button": 0},
743-
{"type": "pointerMove", "duration": int(duration * 1000), "x": end_x, "y": end_y},
877+
{"type": "pointerMove", "duration": int(duration * 1000), "x": ex, "y": ey},
744878
{"type": "pointerUp", "button": 0},
745879
],
746880
}
@@ -751,16 +885,19 @@ async def swipe(
751885

752886
async def type_text(text: str) -> None:
753887
"""Type into the focused field (tap a field first) via WDA."""
888+
await _require_wda()
754889
await _wda_in_session("POST", "/wda/keys", {"value": list(text)})
755890

756891

757892
async def press(button: str) -> None:
758893
"""Press a hardware button via WDA (home / volumeUp / volumeDown)."""
894+
await _require_wda()
759895
await _wda_in_session("POST", "/wda/pressButton", {"name": button})
760896

761897

762898
async def home() -> None:
763899
"""Go to the home screen via WDA (the home-button gesture)."""
900+
await _require_wda()
764901
await _wda_in_session("POST", "/wda/pressButton", {"name": "home"})
765902

766903

@@ -912,7 +1049,7 @@ async def wda_start(*, sudo: bool = False, udid: str | None = None) -> str:
9121049
the developer tunnel, ``sudo=True`` (same root-daemon opt-in as
9131050
``start_tunneld``). Returns once WDA answers, with a session ready for taps.
9141051
"""
915-
global _wda_xcuitest_proc, _wda_forward_proc, _wda_session, _wda_device
1052+
global _wda_xcuitest_proc, _wda_forward_proc, _wda_session, _wda_device, _wda_scale
9161053

9171054
target = await _resolve(udid)
9181055
if await _wda_up():
@@ -935,7 +1072,9 @@ async def wda_start(*, sudo: bool = False, udid: str | None = None) -> str:
9351072
)
9361073
if _wda_xcuitest_proc is None or _wda_xcuitest_proc.returncode is not None:
9371074
# A fresh runner means any cached session id is dead; force a new one.
1075+
# The scale belongs to the device being (re)bound, so re-read it too.
9381076
_wda_session = None
1077+
_wda_scale = None
9391078
_wda_device = target
9401079
_wda_xcuitest_proc = await asyncio.create_subprocess_exec(
9411080
*_pmd3_argv(),
@@ -983,10 +1122,14 @@ async def _terminate(proc: asyncio.subprocess.Process | None) -> None:
9831122

9841123
async def wda_stop() -> None:
9851124
"""Stop the WDA runner and port-forward started by this module."""
986-
global _wda_xcuitest_proc, _wda_forward_proc, _wda_session, _wda_device
1125+
global _wda_xcuitest_proc, _wda_forward_proc, _wda_session, _wda_device, _wda_scale
9871126

9881127
_wda_session = None
9891128
_wda_device = None
1129+
# Drop the cached scale too: a later wda_start(udid=None) against an
1130+
# externally-reachable WDA takes the "already running" early return and never
1131+
# re-reads it, so a stale scale would mis-convert taps on a different device.
1132+
_wda_scale = None
9901133
await _terminate(_wda_forward_proc)
9911134
await _terminate(_wda_xcuitest_proc)
9921135
_wda_forward_proc = None

packages/mcp/tests/test_iphone.py

Lines changed: 68 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -86,6 +86,74 @@ def test_tap_is_coordinate_based() -> None:
8686
# tap moved from a (broken) WDA selector to W3C coordinate taps.
8787
params = list(inspect.signature(iphone.tap).parameters)
8888
assert params[:2] == ["x", "y"], params
89+
# The coordinate space is opt-in and defaults to points (back-compat).
90+
assert inspect.signature(iphone.tap).parameters["space"].default == "points"
91+
92+
93+
def test_to_points_identity_for_points() -> None:
94+
# Points pass through unchanged (only rounded to ints).
95+
assert iphone._to_points(10.4, 20.6, "points", 3.0, (402, 874)) == (10, 21)
96+
97+
98+
def test_to_points_pixels_divides_by_scale() -> None:
99+
# A screenshot pixel (1206, 2622) on a scale-3 device is point (402, 874):
100+
# exactly the bug we hit (pixels read as points landed 3× too far).
101+
assert iphone._to_points(1206, 2622, "pixels", 3.0, (402, 874)) == (402, 874)
102+
103+
104+
def test_to_points_fraction_scales_to_size() -> None:
105+
assert iphone._to_points(0.5, 0.5, "fraction", 3.0, (402, 874)) == (201, 437)
106+
assert iphone._to_points(0.0, 1.0, "fraction", 3.0, (402, 874)) == (0, 874)
107+
108+
109+
def test_to_points_pixels_roundtrip_over_grid() -> None:
110+
# center(points) -> pixels(×scale) -> back must recover the point, for a
111+
# spread of real-looking coordinates and scales.
112+
for scale in (2.0, 3.0):
113+
for px in range(0, 402, 37):
114+
for py in range(0, 874, 53):
115+
bx, by = px * scale, py * scale
116+
assert iphone._to_points(bx, by, "pixels", scale, (402, 874)) == (px, py)
117+
118+
119+
def test_to_points_fraction_out_of_range_raises() -> None:
120+
with pytest.raises(iphone.IphoneError, match=r"0\.\.1"):
121+
iphone._to_points(1.5, 0.5, "fraction", 3.0, (402, 874))
122+
123+
124+
def test_to_points_unknown_space_raises() -> None:
125+
with pytest.raises(iphone.IphoneError, match="coordinate space"):
126+
iphone._to_points(1, 2, "inches", 3.0, (402, 874))
127+
128+
129+
def test_wda_stop_clears_cached_scale(monkeypatch: pytest.MonkeyPatch) -> None:
130+
# The display scale is cached per device; wda_stop must drop it so a later
131+
# wda_start against a different-scale device cannot reuse a stale value.
132+
monkeypatch.setattr(iphone, "_wda_scale", 3.0)
133+
monkeypatch.setattr(iphone, "_wda_session", "sid")
134+
monkeypatch.setattr(iphone, "_wda_device", "udid")
135+
asyncio.run(iphone.wda_stop())
136+
assert iphone._wda_scale is None
137+
assert iphone._wda_session is None
138+
assert iphone._wda_device is None
139+
140+
141+
def test_ui_actions_require_wda(monkeypatch: pytest.MonkeyPatch) -> None:
142+
# When WDA is down, UI actions must raise a one-line precondition error
143+
# (naming wda_start) rather than failing deep in a urllib stack.
144+
async def _down() -> bool:
145+
return False
146+
147+
monkeypatch.setattr(iphone, "_wda_up", _down)
148+
for call in (
149+
iphone.tap(1, 2),
150+
iphone.swipe(1, 2, 3, 4),
151+
iphone.press("home"),
152+
iphone.type_text("x"),
153+
iphone.home(),
154+
):
155+
with pytest.raises(iphone.IphoneError, match="wda_start"):
156+
asyncio.run(call)
89157

90158

91159
def test_wda_down_raises_cleanly(monkeypatch: pytest.MonkeyPatch) -> None:

0 commit comments

Comments
 (0)