Skip to content

Commit 7c6d1d8

Browse files
authored
Merge pull request #3 from timzhong1024/feat/relative-coordinates
Add relative coordinates and improve modal/window handling
2 parents 1353ffe + d3018f1 commit 7c6d1d8

24 files changed

Lines changed: 2418 additions & 323 deletions

.github/workflows/build-cli.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -53,7 +53,7 @@ jobs:
5353
echo "ARCHIVE_BASENAME=${ARCHIVE_BASENAME}" >> "${GITHUB_ENV}"
5454
5555
- name: Upload packaged CLI
56-
uses: actions/upload-artifact@v4
56+
uses: actions/upload-artifact@v6
5757
with:
5858
name: ${{ env.ARCHIVE_BASENAME }}
5959
path: |

.gitignore

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,14 @@
11
.DS_Store
22
/.build
3+
/.build-npm-arm64
4+
/.build-npm-x86_64
35
/Packages
46
xcuserdata/
57
DerivedData/
68
.swiftpm/configuration/registries.json
79
.swiftpm/xcode/package.xcworkspace/contents.xcworkspacedata
810
.netrc
911
.vscode/launch.json
12+
/npm/bin/macos-cua
13+
/npm/LICENSE
14+
/npm/*.tgz

README.md

Lines changed: 104 additions & 74 deletions
Original file line numberDiff line numberDiff line change
@@ -2,99 +2,129 @@
22

33
`macos-cua` is a macOS-only low-level computer-use runtime for agents.
44

5-
It is designed around three defaults:
5+
Its default path is simple:
66

7-
- Coordinates default to the frontmost-window coordinate space.
8-
- The frontmost window is first-class.
9-
- Human-readable stdout is the default; add `--json` for structured output.
7+
- coordinates are absolute
8+
- coordinates are window-first
9+
- stdout is human-readable by default
1010

11-
## Permissions
11+
## Install
1212

13-
`macos-cua` relies on standard macOS permissions:
13+
Build locally:
1414

15-
- `Accessibility`: required for synthetic mouse, keyboard, and window actions.
16-
- `Screen Recording`: required for screenshots.
17-
18-
Use `macos-cua doctor` to inspect current readiness.
15+
```bash
16+
swift build
17+
```
1918

20-
Use `macos-cua onboard` to trigger the native prompts, open the relevant System Settings panes, and guide a human through granting both permissions. In a tty session it waits by default; in non-tty mode it triggers the flow and returns immediately unless you pass `--wait`. When Screen Recording appears to have been granted but the process has not yet been restarted, `onboard` surfaces a targeted restart hint rather than a generic enable instruction. Add `--json` for structured output including per-permission `granted`, `waited`, and `likelyNeedsRestart` fields.
19+
Run with either:
2120

22-
## Commands
21+
```bash
22+
swift run macos-cua <command>
23+
```
2324

24-
```text
25-
macos-cua [--json] <command> [args...]
25+
or:
2626

27-
onboard [--wait|--no-wait] [--timeout <seconds>] [--no-request] [--no-open]
28-
doctor
29-
state
30-
record enable|disable|status
31-
screenshot [--screen] [--region x y w h] <path.png>
32-
move <x> <y> [--screen] [--fast|--precise]
33-
click <x> <y> [left|right|middle] [--screen] [--fast|--precise]
34-
double-click <x> <y> [left|right|middle] [--screen] [--fast|--precise]
35-
scroll <dx> <dy>
36-
keypress <key[+key...]>
37-
type [--fast] <text>
38-
wait <ms>
39-
clipboard get|set|copy|paste
40-
app list|frontmost|activate
41-
window frontmost|list|activate|minimize|maximize|close
27+
```bash
28+
./.build/debug/macos-cua <command>
4229
```
4330

44-
## Examples
31+
## Authorize
32+
33+
`macos-cua` needs two standard macOS permissions:
34+
35+
- `Accessibility` for mouse, keyboard, and window actions
36+
- `Screen Recording` for screenshots
37+
38+
First check readiness:
4539

4640
```bash
4741
swift run macos-cua doctor
42+
```
43+
44+
Then start the built-in onboarding flow:
45+
46+
```bash
4847
swift run macos-cua onboard
48+
```
49+
50+
If you are running in a terminal and want it to wait for you:
51+
52+
```bash
4953
swift run macos-cua onboard --wait --timeout 180
50-
swift run macos-cua --json onboard --no-wait
51-
swift run macos-cua record enable
52-
swift run macos-cua --json state
54+
```
55+
56+
## Happy Path
57+
58+
Daily use should start with absolute coordinates.
59+
60+
1. Inspect the current coordinate space:
61+
62+
```bash
63+
swift run macos-cua state
64+
```
65+
66+
2. Capture the frontmost window:
67+
68+
```bash
5369
swift run macos-cua screenshot /tmp/frontmost.png
54-
swift run macos-cua screenshot --screen /tmp/screen.png
70+
```
71+
72+
3. Or Capture a region in the active coordinate space:
73+
74+
```bash
5575
swift run macos-cua screenshot --region 100 100 300 200 /tmp/region.png
76+
```
77+
78+
4. Move and click with absolute coordinates:
79+
80+
```bash
5681
swift run macos-cua move 800 400 --precise
57-
swift run macos-cua move 800 400 --screen --precise
5882
swift run macos-cua click 800 400 --fast
83+
```
84+
85+
5. Use screen-global coordinates only when needed:
86+
87+
```bash
88+
swift run macos-cua screenshot --screen /tmp/screen.png
89+
swift run macos-cua move 800 400 --screen --precise
5990
swift run macos-cua click 800 400 --screen --fast
60-
swift run macos-cua keypress cmd+n
61-
swift run macos-cua type "hello from macos-cua"
62-
swift run macos-cua clipboard set "hello"
63-
swift run macos-cua clipboard paste
64-
swift run macos-cua app activate Code
65-
swift run macos-cua window list
6691
```
6792

68-
## Notes
69-
70-
- Default `screenshot`, `move`, `click`, and `double-click` use frontmost-window coordinates.
71-
- `--screen` switches `screenshot`, `move`, `click`, and `double-click` to screen-global coordinates.
72-
- If no usable frontmost window is available, default coordinate-taking commands fall back to screen coordinates and report that fallback in output.
73-
- `window list` is AX-first when Accessibility is available, then falls back to CoreGraphics window discovery.
74-
- `window list`, `window frontmost`, and `state.frontmostWindow.bounds` remain screen-global diagnostics; they are not window-local action coordinates.
75-
- Missing permission errors point back to `macos-cua onboard` so agent and human flows land on the same recovery path.
76-
- Browser DOM/ref actions are intentionally out of scope for this repo.
77-
- `record enable` starts a persistent session under `~/Library/Application Support/macos-cua/records/`; each subsequent command appends an action log entry, a full-screen timeline screenshot, failure-only snapshots, and a replayable `replay.sh` trace until `record disable`.
78-
- A shareable VS Code debug example lives at `.vscode/launch.example.json`; local `.vscode/launch.json` stays ignored.
79-
- GitHub Actions can be triggered manually to build release CLI archives for both `arm64` and `x86_64` macOS runners.
80-
- Pointer movement anti-bot research notes live in [`docs/research/movement-anti-bot.md`](docs/research/movement-anti-bot.md).
81-
- `move`, `click`, and `double-click` use humanized pointer motion profiles; default is `--fast`, with `--precise` available for tighter target acquisition.
82-
83-
## State Output
84-
85-
- `state.defaultCoordinateSpace` reports whether default coordinate-taking commands currently resolve to `window` or `screen`.
86-
- `state.defaultCoordinateFallback` is `true` when default commands had to fall back from window coordinates to screen coordinates.
87-
- `state.pointerScreen` is the current pointer in screen-global coordinates.
88-
- `state.pointerWindow` is the current pointer relative to the frontmost window when available, otherwise `null`.
89-
- `state.frontmostWindow.bounds` stays in screen-global coordinates to make window-local to screen-global translation explicit.
90-
91-
## Screenshot Resolution
92-
93-
- Screenshot output is normalized to logical action-space dimensions, not Retina/native pixel dimensions.
94-
- This keeps screenshot coordinates aligned with the active coordinate space for `move` and `click` without requiring callers to divide by `scale`.
95-
- `actionSpace.width` and `actionSpace.height` still describe the main-screen logical action space.
96-
- For `screenshot --screen`, `image.width` and `image.height` match the main-screen action space.
97-
- For default window screenshots, `image.width` and `image.height` match the frontmost window size and the returned `bounds` are window-local.
98-
- For default region screenshots, `bounds` are interpreted in the active coordinate space and align with the returned raster.
99-
- Window screenshots are captured without the macOS drop shadow so the image edges line up with the reported window bounds.
100-
- Native pixel fidelity is intentionally discarded during screenshot export to preserve direct coordinate compatibility for agent actions.
93+
## Dense UI Fallback
94+
95+
When a page is visually dense and the target is a small icon, URL, or toolbar
96+
item, treat `screenshot --region` as the fallback inspection step instead of
97+
guessing a final click from the full screenshot alone.
98+
99+
Recommended pattern:
100+
101+
1. Capture the full frontmost window for global context.
102+
2. Take a second local crop tightly around the likely target area.
103+
3. Re-read the local crop, then issue the final click.
104+
105+
Example:
106+
107+
```bash
108+
swift run macos-cua screenshot /tmp/frontmost.png
109+
swift run macos-cua screenshot --region 720 88 220 96 /tmp/toolbar-crop.png
110+
swift run macos-cua click 812 132 --fast
111+
```
112+
113+
## Model Resolution
114+
115+
Use absolute coordinates when your screenshot can be consumed at full useful resolution by the model. If the image must be resized or compressed before inference, switch to the relative-mode workflow in the docs.
116+
117+
As of 2026-04-18, public vendor docs indicate the following:
118+
119+
| Model | Documented vision resolution support |
120+
| --- | --- |
121+
| [`gpt-5.4`](https://developers.openai.com/api/docs/models/gpt-5.4) | Up to 6000 px max dimension with `detail: "original"`; 2048 px max dimension with `detail: "high"`. |
122+
| [`gpt-5.4-mini`](https://developers.openai.com/api/docs/models/gpt-5.4-mini) | Up to 2048 px max dimension with `detail: "high"`. |
123+
| [Claude Opus 4.7](https://platform.claude.com/docs/en/build-with-claude/vision) | Up to 2576 px on the long edge. |
124+
| [Claude Sonnet 4.5 / 4.6 and other current Claude vision models](https://platform.claude.com/docs/en/build-with-claude/vision) | Up to 1568 px on the long edge. See the [relative-mode workflow](docs/README.md#relative-mode-and-resized-images). |
125+
126+
Older or weaker vision models are not recommended for coordinate-driven desktop control.
127+
128+
## More
129+
130+
Advanced workflows and reference material live under [docs/README.md](docs/README.md).

0 commit comments

Comments
 (0)