|
2 | 2 |
|
3 | 3 | `macos-cua` is a macOS-only low-level computer-use runtime for agents. |
4 | 4 |
|
5 | | -It is designed around three defaults: |
| 5 | +Its default path is simple: |
6 | 6 |
|
7 | | -- Coordinates default to the frontmost-window coordinate space. |
8 | | -- The frontmost window is first-class. |
9 | | -- Human-readable stdout is the default; add `--json` for structured output. |
| 7 | +- coordinates are absolute |
| 8 | +- coordinates are window-first |
| 9 | +- stdout is human-readable by default |
10 | 10 |
|
11 | | -## Permissions |
| 11 | +## Install |
12 | 12 |
|
13 | | -`macos-cua` relies on standard macOS permissions: |
| 13 | +Build locally: |
14 | 14 |
|
15 | | -- `Accessibility`: required for synthetic mouse, keyboard, and window actions. |
16 | | -- `Screen Recording`: required for screenshots. |
17 | | - |
18 | | -Use `macos-cua doctor` to inspect current readiness. |
| 15 | +```bash |
| 16 | +swift build |
| 17 | +``` |
19 | 18 |
|
20 | | -Use `macos-cua onboard` to trigger the native prompts, open the relevant System Settings panes, and guide a human through granting both permissions. In a tty session it waits by default; in non-tty mode it triggers the flow and returns immediately unless you pass `--wait`. When Screen Recording appears to have been granted but the process has not yet been restarted, `onboard` surfaces a targeted restart hint rather than a generic enable instruction. Add `--json` for structured output including per-permission `granted`, `waited`, and `likelyNeedsRestart` fields. |
| 19 | +Run with either: |
21 | 20 |
|
22 | | -## Commands |
| 21 | +```bash |
| 22 | +swift run macos-cua <command> |
| 23 | +``` |
23 | 24 |
|
24 | | -```text |
25 | | -macos-cua [--json] <command> [args...] |
| 25 | +or: |
26 | 26 |
|
27 | | -onboard [--wait|--no-wait] [--timeout <seconds>] [--no-request] [--no-open] |
28 | | -doctor |
29 | | -state |
30 | | -record enable|disable|status |
31 | | -screenshot [--screen] [--region x y w h] <path.png> |
32 | | -move <x> <y> [--screen] [--fast|--precise] |
33 | | -click <x> <y> [left|right|middle] [--screen] [--fast|--precise] |
34 | | -double-click <x> <y> [left|right|middle] [--screen] [--fast|--precise] |
35 | | -scroll <dx> <dy> |
36 | | -keypress <key[+key...]> |
37 | | -type [--fast] <text> |
38 | | -wait <ms> |
39 | | -clipboard get|set|copy|paste |
40 | | -app list|frontmost|activate |
41 | | -window frontmost|list|activate|minimize|maximize|close |
| 27 | +```bash |
| 28 | +./.build/debug/macos-cua <command> |
42 | 29 | ``` |
43 | 30 |
|
44 | | -## Examples |
| 31 | +## Authorize |
| 32 | + |
| 33 | +`macos-cua` needs two standard macOS permissions: |
| 34 | + |
| 35 | +- `Accessibility` for mouse, keyboard, and window actions |
| 36 | +- `Screen Recording` for screenshots |
| 37 | + |
| 38 | +First check readiness: |
45 | 39 |
|
46 | 40 | ```bash |
47 | 41 | swift run macos-cua doctor |
| 42 | +``` |
| 43 | + |
| 44 | +Then start the built-in onboarding flow: |
| 45 | + |
| 46 | +```bash |
48 | 47 | swift run macos-cua onboard |
| 48 | +``` |
| 49 | + |
| 50 | +If you are running in a terminal and want it to wait for you: |
| 51 | + |
| 52 | +```bash |
49 | 53 | swift run macos-cua onboard --wait --timeout 180 |
50 | | -swift run macos-cua --json onboard --no-wait |
51 | | -swift run macos-cua record enable |
52 | | -swift run macos-cua --json state |
| 54 | +``` |
| 55 | + |
| 56 | +## Happy Path |
| 57 | + |
| 58 | +Daily use should start with absolute coordinates. |
| 59 | + |
| 60 | +1. Inspect the current coordinate space: |
| 61 | + |
| 62 | +```bash |
| 63 | +swift run macos-cua state |
| 64 | +``` |
| 65 | + |
| 66 | +2. Capture the frontmost window: |
| 67 | + |
| 68 | +```bash |
53 | 69 | swift run macos-cua screenshot /tmp/frontmost.png |
54 | | -swift run macos-cua screenshot --screen /tmp/screen.png |
| 70 | +``` |
| 71 | + |
| 72 | +3. Or capture a region in the active coordinate space: |
| 73 | + |
| 74 | +```bash |
55 | 75 | swift run macos-cua screenshot --region 100 100 300 200 /tmp/region.png |
| 76 | +``` |
| 77 | + |
| 78 | +4. Move and click with absolute coordinates: |
| 79 | + |
| 80 | +```bash |
56 | 81 | swift run macos-cua move 800 400 --precise |
57 | | -swift run macos-cua move 800 400 --screen --precise |
58 | 82 | swift run macos-cua click 800 400 --fast |
| 83 | +``` |
| 84 | + |
| 85 | +5. Use screen-global coordinates only when needed: |
| 86 | + |
| 87 | +```bash |
| 88 | +swift run macos-cua screenshot --screen /tmp/screen.png |
| 89 | +swift run macos-cua move 800 400 --screen --precise |
59 | 90 | swift run macos-cua click 800 400 --screen --fast |
60 | | -swift run macos-cua keypress cmd+n |
61 | | -swift run macos-cua type "hello from macos-cua" |
62 | | -swift run macos-cua clipboard set "hello" |
63 | | -swift run macos-cua clipboard paste |
64 | | -swift run macos-cua app activate Code |
65 | | -swift run macos-cua window list |
66 | 91 | ``` |
67 | 92 |
|
68 | | -## Notes |
69 | | - |
70 | | -- Default `screenshot`, `move`, `click`, and `double-click` use frontmost-window coordinates. |
71 | | -- `--screen` switches `screenshot`, `move`, `click`, and `double-click` to screen-global coordinates. |
72 | | -- If no usable frontmost window is available, default coordinate-taking commands fall back to screen coordinates and report that fallback in output. |
73 | | -- `window list` is AX-first when Accessibility is available, then falls back to CoreGraphics window discovery. |
74 | | -- `window list`, `window frontmost`, and `state.frontmostWindow.bounds` remain screen-global diagnostics; they are not window-local action coordinates. |
75 | | -- Missing permission errors point back to `macos-cua onboard` so agent and human flows land on the same recovery path. |
76 | | -- Browser DOM/ref actions are intentionally out of scope for this repo. |
77 | | -- `record enable` starts a persistent session under `~/Library/Application Support/macos-cua/records/`; each subsequent command appends an action log entry, a full-screen timeline screenshot, failure-only snapshots, and a replayable `replay.sh` trace until `record disable`. |
78 | | -- A shareable VS Code debug example lives at `.vscode/launch.example.json`; local `.vscode/launch.json` stays ignored. |
79 | | -- GitHub Actions can be triggered manually to build release CLI archives for both `arm64` and `x86_64` macOS runners. |
80 | | -- Pointer movement anti-bot research notes live in [`docs/research/movement-anti-bot.md`](docs/research/movement-anti-bot.md). |
81 | | -- `move`, `click`, and `double-click` use humanized pointer motion profiles; default is `--fast`, with `--precise` available for tighter target acquisition. |
82 | | - |
83 | | -## State Output |
84 | | - |
85 | | -- `state.defaultCoordinateSpace` reports whether default coordinate-taking commands currently resolve to `window` or `screen`. |
86 | | -- `state.defaultCoordinateFallback` is `true` when default commands had to fall back from window coordinates to screen coordinates. |
87 | | -- `state.pointerScreen` is the current pointer in screen-global coordinates. |
88 | | -- `state.pointerWindow` is the current pointer relative to the frontmost window when available, otherwise `null`. |
89 | | -- `state.frontmostWindow.bounds` stays in screen-global coordinates to make window-local to screen-global translation explicit. |
90 | | - |
91 | | -## Screenshot Resolution |
92 | | - |
93 | | -- Screenshot output is normalized to logical action-space dimensions, not Retina/native pixel dimensions. |
94 | | -- This keeps screenshot coordinates aligned with the active coordinate space for `move` and `click` without requiring callers to divide by `scale`. |
95 | | -- `actionSpace.width` and `actionSpace.height` still describe the main-screen logical action space. |
96 | | -- For `screenshot --screen`, `image.width` and `image.height` match the main-screen action space. |
97 | | -- For default window screenshots, `image.width` and `image.height` match the frontmost window size and the returned `bounds` are window-local. |
98 | | -- For default region screenshots, `bounds` are interpreted in the active coordinate space and align with the returned raster. |
99 | | -- Window screenshots are captured without the macOS drop shadow so the image edges line up with the reported window bounds. |
100 | | -- Native pixel fidelity is intentionally discarded during screenshot export to preserve direct coordinate compatibility for agent actions. |
| 93 | +## Dense UI Fallback |
| 94 | + |
| 95 | +When a page is visually dense and the target is a small icon, URL, or toolbar |
| 96 | +item, treat `screenshot --region` as the fallback inspection step instead of |
| 97 | +guessing a final click from the full screenshot alone. |
| 98 | + |
| 99 | +Recommended pattern: |
| 100 | + |
| 101 | +1. Capture the full frontmost window for global context. |
| 102 | +2. Take a second local crop tightly around the likely target area. |
| 103 | +3. Re-read the local crop, then issue the final click. |
| 104 | + |
| 105 | +Example: |
| 106 | + |
| 107 | +```bash |
| 108 | +swift run macos-cua screenshot /tmp/frontmost.png |
| 109 | +swift run macos-cua screenshot --region 720 88 220 96 /tmp/toolbar-crop.png |
| 110 | +swift run macos-cua click 812 132 --fast |
| 111 | +``` |
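The arithmetic behind step 3 can be sketched as follows. The helper `crop_to_action` is hypothetical and not part of `macos-cua`. It assumes the crop is exported at one image pixel per coordinate unit, so a point read from the crop only needs the region origin added back. The local point `(92, 44)` is an illustrative value chosen to match the click in the example above.

```bash
# Hypothetical helper, not a macos-cua command.
# Translates a point read from a region crop back into the coordinate
# space the crop was taken in.
# Usage: crop_to_action <region_x> <region_y> <local_x> <local_y>
# Assumption: the crop is 1:1 with coordinate units (no resizing).
crop_to_action() {
  echo "$(( $1 + $3 )) $(( $2 + $4 ))"
}

# Region origin (720, 88) comes from the --region call in the example;
# the target was seen at (92, 44) inside the crop.
crop_to_action 720 88 92 44   # prints "812 132"
```

The printed pair is what gets passed to `click`. If the crop is resized before the model reads it, this plain offset no longer holds.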
| 112 | + |
| 113 | +## Model Resolution |
| 114 | + |
| 115 | +Use absolute coordinates when your screenshot can be consumed at full useful resolution by the model. If the image must be resized or compressed before inference, switch to the relative-mode workflow in the docs. |
| 116 | + |
| 117 | +As of 2026-04-18, public vendor docs indicate the following: |
| 118 | + |
| 119 | +| Model | Documented vision resolution support | |
| 120 | +| --- | --- | |
| 121 | +| [`gpt-5.4`](https://developers.openai.com/api/docs/models/gpt-5.4) | Up to 6000 px max dimension with `detail: "original"`; 2048 px max dimension with `detail: "high"`. | |
| 122 | +| [`gpt-5.4-mini`](https://developers.openai.com/api/docs/models/gpt-5.4-mini) | Up to 2048 px max dimension with `detail: "high"`. | |
| 123 | +| [Claude Opus 4.7](https://platform.claude.com/docs/en/build-with-claude/vision) | Up to 2576 px on the long edge. | |
| 124 | +| [Claude Sonnet 4.5 / 4.6 and other current Claude vision models](https://platform.claude.com/docs/en/build-with-claude/vision) | Up to 1568 px on the long edge. See the [relative-mode workflow](docs/README.md#relative-mode-and-resized-images). | |
| 125 | + |
| 126 | +Older or weaker vision models are not recommended for coordinate-driven desktop control. |
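As a rough sketch of the decision described in this section, the check below compares the longer edge of a screenshot with a model's documented limit from the table. The helpers `fits_model_limit` and `coordinate_mode` are hypothetical, and the screenshot sizes are illustrative. Vendor docs may impose further constraints that this simple rule does not model.

```bash
# Hypothetical helpers, not macos-cua commands.
# fits_model_limit <width> <height> <max_long_edge>
# Succeeds when the longer image edge is within the documented limit.
fits_model_limit() {
  [ "$(( $1 > $2 ? $1 : $2 ))" -le "$3" ]
}

# coordinate_mode <width> <height> <max_long_edge>
# Prints "absolute" when the image fits unresized, otherwise "relative".
coordinate_mode() {
  if fits_model_limit "$1" "$2" "$3"; then
    echo absolute
  else
    echo relative
  fi
}

coordinate_mode 1440 900 1568    # prints "absolute"
coordinate_mode 1728 1117 1568   # prints "relative"
coordinate_mode 1728 1117 2576   # prints "absolute"
```

The limits `1568` and `2576` are the long-edge values from the table above. Take the width and height from the actual screenshot before applying the check.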
| 127 | + |
| 128 | +## More |
| 129 | + |
| 130 | +Advanced workflows and reference material live under [docs/README.md](docs/README.md). |