replacing the floating pill bar with an AI cursor #7453
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Introduces an AI “computer use” capability for the macOS desktop app, including prompt instructions, parsing <computer_use> plans from model output, and executing those plans via Accessibility/CGEvent automation while shifting PTT UX to a cursor-following overlay rather than showing the floating bar.
Changes:
- Append computer-use instructions to the model system prompt and parse <computer_use> JSON blocks from assistant responses.
- Add an execution pipeline (plan model, context substitution, element resolution, action driver/executor, plan progress window, Escape-to-cancel monitor).
- Replace/augment PTT UI flows with a full-screen cursor overlay (idle/listening/processing/responding/notification/executing states) and keep the floating bar hidden during PTT.
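As a rough illustration of the parsing step listed above, the following sketch extracts one <computer_use> block from a finished response and returns the decoded plan plus the remaining prose. The schema (`description`, `steps`, `action`, `target`, `value`) and the names `SketchPlan`, `SketchStep`, and `parseComputerUse` are assumptions made for this example; the PR's actual OmiWorkflowPlan and OmiComputerUseTool may be shaped differently.

```swift
import Foundation

// Hypothetical plan schema for illustration; the real OmiWorkflowPlan may differ.
struct SketchStep: Codable, Equatable {
    let action: String
    let target: String?
    let value: String?
}

struct SketchPlan: Codable, Equatable {
    let description: String
    let steps: [SketchStep]
}

/// Finds the first <computer_use>...</computer_use> block, decodes its JSON body,
/// and returns the plan together with the text left once the block is removed.
/// Returns nil when there is no complete block or the JSON does not decode.
func parseComputerUse(from text: String) -> (plan: SketchPlan, cleanText: String)? {
    guard let open = text.range(of: "<computer_use>"),
          let close = text.range(of: "</computer_use>",
                                 range: open.upperBound..<text.endIndex)
    else { return nil }

    let json = String(text[open.upperBound..<close.lowerBound])
    guard let data = json.data(using: .utf8),
          let plan = try? JSONDecoder().decode(SketchPlan.self, from: data)
    else { return nil }

    // Remove the whole tagged block so only the prose preamble remains.
    var clean = text
    clean.removeSubrange(open.lowerBound..<close.upperBound)
    return (plan, clean.trimmingCharacters(in: .whitespacesAndNewlines))
}
```

Returning nil for malformed JSON lets the caller fall back to showing the message as ordinary text instead of firing the executor with a partial plan.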
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| desktop/Desktop/Sources/Providers/ChatProvider.swift | Appends computer-use prompt fragment; parses <computer_use> blocks and triggers plan execution. |
| desktop/Desktop/Sources/MainWindow/DesktopHomeView.swift | Initializes cursor overlay idle state during PTT setup; stops auto-showing the floating bar. |
| desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift | Routes PTT UI state changes to the cursor overlay and updates transcripts used by the overlay. |
| desktop/Desktop/Sources/FloatingControlBar/OmiWorkflowPlan.swift | Defines the Codable plan/step schema for computer-use execution. |
| desktop/Desktop/Sources/FloatingControlBar/OmiPlanWindow.swift | Adds a standalone floating plan window showing step progress and failure states. |
| desktop/Desktop/Sources/FloatingControlBar/OmiEscapeMonitor.swift | Adds Escape key monitoring to cancel an executing plan. |
| desktop/Desktop/Sources/FloatingControlBar/OmiElementResolver.swift | Resolves human labels to UI element coordinates via the AX tree. |
| desktop/Desktop/Sources/FloatingControlBar/OmiContextResolver.swift | Substitutes context variables (selection/clipboard/transcript/app) into plan step values. |
| desktop/Desktop/Sources/FloatingControlBar/OmiComputerUseTool.swift | Provides the system prompt fragment plus parsing/stripping logic for <computer_use> blocks. |
| desktop/Desktop/Sources/FloatingControlBar/OmiActionExecutor.swift | Executes plans step-by-step, updates overlay + plan window, and handles cancellation. |
| desktop/Desktop/Sources/FloatingControlBar/OmiActionDriver.swift | Introduces the driver protocol and error types for UI automation. |
| desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift | Keeps the bar hidden for PTT; routes notifications to the cursor overlay; adjusts timings. |
| desktop/Desktop/Sources/FloatingControlBar/CursorPTTOverlayState.swift | Adds state model for the cursor overlay phases and content. |
| desktop/Desktop/Sources/FloatingControlBar/CursorPTTOverlayManager.swift | Adds full-screen overlay management, cursor tracking, streaming subscriptions, and auto-dismiss logic. |
| desktop/Desktop/Sources/FloatingControlBar/CursorBubbleView.swift | Renders the cursor-following UI for all overlay phases including execution progress. |
| desktop/Desktop/Sources/FloatingControlBar/CuaActionDriver.swift | Implements click/type/shortcut/scroll/open-app automation using AX + CGEvents. |
| // Strip computer_use tag from final text and fire executor | ||
| if let computeResult = OmiComputerUseTool.parse(from: messageText) { | ||
| messageText = computeResult.cleanText | ||
| OmiActionExecutor.shared.execute(plan: computeResult.plan, transcript: "") | ||
| log("OmiComputerUseTool: fired executor for plan '\(computeResult.plan.description)'") | ||
| } |
| monitor = NSEvent.addLocalMonitorForEvents(matching: .keyDown) { [weak self] event in | ||
| // Escape keyCode is 53 | ||
| if event.keyCode == 53 { | ||
| self?.fire() | ||
| return nil // consume the event | ||
| } | ||
| return event | ||
| } |
| let hosting = NSHostingView( | ||
| rootView: CursorBubbleView(state: overlayState) | ||
| .onTapGesture { [weak self] in self?.dismiss() } | ||
| ) |
| let source = DispatchSource.makeTimerSource(queue: .main) | ||
| source.schedule(deadline: .now(), repeating: .milliseconds(16)) | ||
| source.setEventHandler { [weak self] in | ||
| guard let self, let panel = self.panel else { return } | ||
| let globalLoc = NSEvent.mouseLocation | ||
| let localX = globalLoc.x - panel.frame.minX | ||
| let localY = panel.frame.height - (globalLoc.y - panel.frame.minY) | ||
| let newPos = CGPoint(x: localX, y: localY) |
| FloatingControlBarManager.shared.setup( | ||
| appState: appState, chatProvider: viewModelContainer.chatProvider) | ||
| if FloatingControlBarManager.shared.isEnabled { | ||
| FloatingControlBarManager.shared.show() | ||
| } | ||
| // Set up push-to-talk voice input | ||
| if let barState = FloatingControlBarManager.shared.barState { | ||
| PushToTalkManager.shared.setup(barState: barState) | ||
| CursorPTTOverlayManager.shared.showIdle() | ||
| } |
| let pasteboard = NSPasteboard.general | ||
| let previous = pasteboard.string(forType: .string) | ||
| pasteboard.clearContents() | ||
| pasteboard.setString(text, forType: .string) |
| if let previous { | ||
| pasteboard.clearContents() | ||
| pasteboard.setString(previous, forType: .string) | ||
| } |
| let screenHeight = NSScreen.screens.first?.frame.height ?? 0 | ||
| let cgPoint = CGPoint(x: mouseLocation.x, y: screenHeight - mouseLocation.y) |
Greptile Summary
This PR replaces the floating pill bar with a cursor-anchored PTT overlay and adds computer-use automation for the Omi desktop app, allowing the assistant to drive macOS via CGEvents and the Accessibility API.
Confidence Score: 2/5
Three functional defects in the core execution path cause silently wrong behavior on every computer-use invocation. The ESC cancel monitor is wired as a local (app-focused) event handler, but the app loses focus the instant automation starts driving another app, so cancel never fires. The {{transcript}} context variable always resolves to an empty string because ChatProvider passes "" to the executor. The clipboard-based type fallback restores the original clipboard after a fixed 100 ms sleep, which does not guarantee that the target app has processed the Cmd+V paste, so the wrong content can silently be pasted. All three issues are in the hot path of every computer-use action. OmiEscapeMonitor.swift, ChatProvider.swift, and CuaActionDriver.swift each contain a distinct defect in the execution path.
Sequence Diagram
sequenceDiagram
participant U as User (PTT)
participant PTT as PushToTalkManager
participant CP as ChatProvider
participant OV as CursorPTTOverlayManager
participant CUT as OmiComputerUseTool
participant EX as OmiActionExecutor
participant CR as OmiContextResolver
participant ER as OmiElementResolver
participant DRV as CuaActionDriver
participant PW as OmiPlanWindow
participant ESC as OmiEscapeMonitor
U->>PTT: Option key down
PTT->>OV: startListening(barState)
PTT->>CP: send voice query
CP->>OV: startResponding(barState)
CP-->>OV: stream tokens (sanitized, no JSON block)
CP->>CUT: parse(from: finalText)
CUT-->>CP: (plan, cleanText)
CP->>EX: execute(plan, transcript:"") ⚠️ always empty
EX->>ESC: arm(onCancel) ⚠️ local monitor only
EX->>OV: startExecution(description, steps)
EX->>PW: startExecution(steps)
loop Each step
EX->>CR: resolve(plan, transcript)
CR-->>EX: substituted plan
EX->>OV: updateExecutingStep(index)
EX->>PW: updateStep(index)
alt click
EX->>ER: resolve(label) — AX tree walk
ER-->>EX: (point, app)
EX->>DRV: click(at:targetApp:)
else type
EX->>DRV: type(text:targetApp:nil)
DRV-->>DRV: AX insert or clipboard+Cmd+V ⚠️ 100ms race
else open_app / shortcut / scroll
EX->>DRV: openApp/pressShortcut/scroll
end
end
EX->>ESC: disarm
EX->>PW: finishExecution
EX->>OV: finishExecution → dismiss after 0.8s
Reviews (1): Last reviewed commit: "feat(desktop): add computer-use action e..."
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7d2e6c565f
| // Strip computer_use tag from final text and fire executor | ||
| if let computeResult = OmiComputerUseTool.parse(from: messageText) { | ||
| messageText = computeResult.cleanText | ||
| OmiActionExecutor.shared.execute(plan: computeResult.plan, transcript: "") |
Pass voice transcript into computer-use executor
OmiActionExecutor is always invoked with transcript: "", but OmiContextResolver uses that value to substitute {{transcript}} in action steps. Any plan that relies on that token (for example, voice flows like “open Notes and add this”) will type an empty string instead of the user’s utterance, so the advertised voice-to-action path loses user content.
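A minimal sketch of the substitution this comment describes, to show why an empty transcript produces an empty typed value. `SketchContext` and `substitute` are hypothetical names for this example and are not taken from OmiContextResolver; only the four token names come from the PR.

```swift
import Foundation

// Hypothetical context snapshot; the real resolver gathers these values from the system.
struct SketchContext {
    var selection = ""
    var clipboard = ""
    var transcript = ""
    var app = ""
}

/// Replaces the {{selection}}, {{clipboard}}, {{transcript}} and {{app}} tokens in a step value.
func substitute(_ value: String, with context: SketchContext) -> String {
    // An ordered list keeps the replacement order deterministic.
    let table: [(token: String, replacement: String)] = [
        ("{{selection}}", context.selection),
        ("{{clipboard}}", context.clipboard),
        ("{{transcript}}", context.transcript),
        ("{{app}}", context.app),
    ]
    var result = value
    for entry in table {
        result = result.replacingOccurrences(of: entry.token, with: entry.replacement)
    }
    return result
}

// With a forwarded transcript the step carries the user's words;
// with the default empty transcript the same step resolves to "".
let spoken = SketchContext(transcript: "buy milk", app: "Notes")
print(substitute("{{transcript}}", with: spoken))
print(substitute("{{transcript}}", with: SketchContext()).isEmpty)
```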
| // Disarm any existing monitor first (safety) | ||
| disarm() | ||
| self.onCancel = onCancel | ||
| monitor = NSEvent.addLocalMonitorForEvents(matching: .keyDown) { [weak self] event in |
Capture Escape globally while plan drives other apps
The cancel path arms NSEvent.addLocalMonitorForEvents, which only observes events routed through this app, but action execution repeatedly activates external target apps before input events are sent. In that common case, pressing Escape is delivered to the target app, so the local monitor never fires and in-flight plans cannot be canceled as promised.
| } else { | ||
| OmiPlanWindow.shared.finishExecution() | ||
| } | ||
| CursorPTTOverlayManager.shared.finishExecution() |
Avoid marking failed plans as done in cursor overlay
When a step throws, the code records failedIndex and marks the plan window step as failed, but it still unconditionally calls CursorPTTOverlayManager.finishExecution(), which sets the executing label to "done" and auto-dismisses. This presents a success state for failed executions and hides actionable failure feedback from users.
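One possible shape for the fix, sketched under the assumption that the executor can hand the overlay an explicit outcome instead of calling an unconditional finish. `PlanOutcome`, `run`, and `overlayLabel` are illustrative names, not identifiers from the PR.

```swift
// Hypothetical outcome type; names are illustrative and not taken from the PR.
enum PlanOutcome: Equatable {
    case completed
    case failed(stepIndex: Int, reason: String)
    case cancelled
}

/// Runs each step in order and reports the first failure instead of always reporting success.
func run(steps: [() throws -> Void], isCancelled: () -> Bool = { false }) -> PlanOutcome {
    for (index, step) in steps.enumerated() {
        if isCancelled() { return .cancelled }
        do {
            try step()
        } catch {
            return .failed(stepIndex: index, reason: String(describing: error))
        }
    }
    return .completed
}

/// Label the overlay could show; only a completed plan reads "done".
func overlayLabel(for outcome: PlanOutcome) -> String {
    switch outcome {
    case .completed:
        return "done"
    case .failed(let index, _):
        return "step \(index + 1) failed"
    case .cancelled:
        return "cancelled"
    }
}
```

Driving both the plan window and the cursor overlay from the same outcome value would keep the two surfaces from disagreeing about whether a run succeeded.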
| // US-ANSI key code map for a-z and 0-9 | ||
| private let charKeyCodeMap: [UInt32: CGKeyCode] = [ | ||
| // a-z |
Add keycode support for prompted Cmd+, shortcut
The shortcut parser only recognizes keys present in charKeyCodeMap (letters, digits, -, =), so punctuation shortcuts like Cmd+, resolve to nil and throw unparseableShortcut. Since the system prompt explicitly recommends Cmd+, for settings, this causes valid model outputs to fail at runtime.
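A sketch of how the key map could be extended to cover punctuation. The key codes are the US-ANSI virtual key codes from Carbon's kVK_ANSI_* constants; the parser itself (`parseShortcut`, `ParsedShortcut`, the accepted modifier spellings) is an assumption for illustration and is not the PR's implementation.

```swift
// US-ANSI virtual key codes for punctuation (values of Carbon's kVK_ANSI_* constants).
let punctuationKeyCodes: [Character: UInt16] = [
    ",": 0x2B, ".": 0x2F, "/": 0x2C, ";": 0x29, "'": 0x27,
    "[": 0x21, "]": 0x1E, "\\": 0x2A, "`": 0x32, "-": 0x1B, "=": 0x18,
]

// A few letters only, to keep the sketch short; a real map would cover a-z and 0-9.
let letterKeyCodes: [Character: UInt16] = ["a": 0x00, "c": 0x08, "v": 0x09, "n": 0x2D]

struct ParsedShortcut: Equatable {
    var command = false
    var shift = false
    var option = false
    var control = false
    var keyCode: UInt16
}

/// Parses strings such as "Cmd+," or "cmd+shift+n". Returns nil for unknown keys or modifiers.
func parseShortcut(_ text: String) -> ParsedShortcut? {
    let parts = text.lowercased().split(separator: "+").map(String.init)
    guard let keyName = parts.last, keyName.count == 1, let key = keyName.first,
          let code = punctuationKeyCodes[key] ?? letterKeyCodes[key]
    else { return nil }

    var shortcut = ParsedShortcut(keyCode: code)
    for modifier in parts.dropLast() {
        switch modifier {
        case "cmd", "command": shortcut.command = true
        case "shift": shortcut.shift = true
        case "opt", "option", "alt": shortcut.option = true
        case "ctrl", "control": shortcut.control = true
        default: return nil
        }
    }
    return shortcut
}
```

A shortcut whose key is the plus sign itself (for example Cmd++) is not handled by this split-based sketch and would need a special case.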
| monitor = NSEvent.addLocalMonitorForEvents(matching: .keyDown) { [weak self] event in | ||
| // Escape keyCode is 53 | ||
| if event.keyCode == 53 { | ||
| self?.fire() | ||
| return nil // consume the event | ||
| } | ||
| return event | ||
| } |
ESC cancel silently broken during plan execution
NSEvent.addLocalMonitorForEvents only intercepts key events when the Omi app itself is frontmost. However, during plan execution the automation driver activates the target app (Notes, Spotify, etc.), making Omi a background process. Any ESC key the user presses in the target app to abort the plan is never delivered to this monitor, so the escape handler never fires. An addGlobalMonitorForEvents(matching: .keyDown) monitor is needed so that the ESC handler fires regardless of which app has focus.
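A hedged sketch of a monitor that arms both a local and a global key-down monitor. `SketchEscapeMonitor` is a hypothetical class written for this example, not the PR's OmiEscapeMonitor. Two caveats apply: a global monitor only observes events and cannot consume them, so the Escape press still reaches the target app, and global key monitoring requires the app to be trusted for Accessibility.

```swift
#if canImport(AppKit)
import AppKit

/// Sketch of a cancel monitor that observes Escape both inside and outside the app.
final class SketchEscapeMonitor {
    private var localMonitor: Any?
    private var globalMonitor: Any?
    private var onCancel: (() -> Void)?

    func arm(onCancel: @escaping () -> Void) {
        disarm()
        self.onCancel = onCancel

        // Fires while this app is frontmost; returning nil consumes the key press.
        localMonitor = NSEvent.addLocalMonitorForEvents(matching: .keyDown) { [weak self] event in
            guard event.keyCode == 53 else { return event }  // 53 is the Escape key code
            self?.fire()
            return nil
        }

        // Fires while another app is frontmost; the event is observed, not consumed.
        globalMonitor = NSEvent.addGlobalMonitorForEvents(matching: .keyDown) { [weak self] event in
            if event.keyCode == 53 { self?.fire() }
        }
    }

    func disarm() {
        if let localMonitor { NSEvent.removeMonitor(localMonitor) }
        if let globalMonitor { NSEvent.removeMonitor(globalMonitor) }
        localMonitor = nil
        globalMonitor = nil
        onCancel = nil
    }

    private func fire() {
        // Capture the callback first so disarming cannot drop it before it runs.
        let callback = onCancel
        disarm()
        callback?()
    }
}
#endif
```

Keeping the local monitor alongside the global one matters because a global monitor does not receive events sent to the app that installed it.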
| if let index = messages.firstIndex(where: { $0.id == aiMessageId }) { | ||
| // Message still in memory — update it in-place | ||
| messageText = messages[index].text.isEmpty ? queryResult.text : messages[index].text | ||
| // Strip computer_use tag from final text and fire executor |
{{transcript}} context variable never substitutes
OmiActionExecutor.shared.execute(plan: computeResult.plan, transcript: "") always passes an empty string for transcript. OmiContextResolver maps "{{transcript}}" to this value, so any plan step containing {{transcript}} (e.g., "value": "{{transcript}}" to paste what the user just said) will always resolve to an empty string. The voice transcript is already stored in barState.voiceTranscript — that value should be forwarded here.
| let pasteboard = NSPasteboard.general | ||
| let previous = pasteboard.string(forType: .string) | ||
| pasteboard.clearContents() | ||
| pasteboard.setString(text, forType: .string) | ||
| // Post Cmd+V (key code 9) | ||
| let src = CGEventSource(stateID: .hidSystemState) | ||
| let keyDown = CGEvent(keyboardEventSource: src, virtualKey: 9, keyDown: true) | ||
| let keyUp = CGEvent(keyboardEventSource: src, virtualKey: 9, keyDown: false) | ||
| keyDown?.flags = .maskCommand | ||
| keyUp?.flags = .maskCommand | ||
| keyDown?.post(tap: .cghidEventTap) | ||
| keyUp?.post(tap: .cghidEventTap) | ||
| try await Task.sleep(for: .milliseconds(100)) | ||
| if let previous { | ||
| pasteboard.clearContents() | ||
| pasteboard.setString(previous, forType: .string) | ||
| } |
Clipboard restore race can paste wrong content
CGEvent.post(tap:) enqueues the Cmd+V into the HID event stream but does not block until the target application has read the pasteboard. The 100 ms sleep is a best-effort delay; if the target app is mid-animation, loading, or the system is under load, it may not have consumed the paste before setString(_:forType:) restores the previous clipboard value. When that race is lost, the app pastes the user's original clipboard content instead of the text argument, silently writing the wrong string.
| app.activate(options: .activateIgnoringOtherApps) | ||
| try await Task.sleep(for: .milliseconds(80)) | ||
| } | ||
| let mouseLocation = NSEvent.mouseLocation | ||
| let screenHeight = NSScreen.screens.first?.frame.height ?? 0 | ||
| let cgPoint = CGPoint(x: mouseLocation.x, y: screenHeight - mouseLocation.y) | ||
Multi-monitor scroll lands at wrong coordinates
NSScreen.screens.first?.frame.height is the height of the primary display, not the display that currently contains the cursor. In a multi-monitor setup where the cursor is on a secondary screen, the AppKit-to-CG flip (screenHeight - mouseLocation.y) uses the wrong height, producing a cgPoint that lands on a different screen entirely. The correct height to use is the frame height of the screen that contains mouseLocation (the same screen lookup already used in updatePanelScreenIfNeeded).
| if focusResult == .success, let focused = focusedElement { | ||
| let axElement = focused as! AXUIElement | ||
| let setResult = AXUIElementSetAttributeValue(axElement, kAXSelectedTextAttribute as CFString, text as CFTypeRef) |
Force-casting focused to AXUIElement after a successful AXUIElementCopyAttributeValue call is safe in practice, but the as! will crash if the API ever returns a different type. A conditional cast (as? AXUIElement) keeps the same safe-falling-back-to-clipboard behavior without the crash risk.
| if focusResult == .success, let focused = focusedElement { | |
| let axElement = focused as! AXUIElement | |
| let setResult = AXUIElementSetAttributeValue(axElement, kAXSelectedTextAttribute as CFString, text as CFTypeRef) | |
| if focusResult == .success, let axElement = focusedElement as? AXUIElement { | |
| let setResult = AXUIElementSetAttributeValue(axElement, kAXSelectedTextAttribute as CFString, text as CFTypeRef) |
Summary
Adds computer use to the Omi desktop app on top of a new cursor-anchored PTT overlay. The assistant can now drive the Mac directly: "open Notes and add this", etc.
Demo
What's in it
Cursor PTT overlay — replaces the floating pill bar with a compact indicator that lives next to the mouse pointer. Idle dot → pulsing listening dot + live transcript → spinner → streaming response bubble. Doesn't steal focus, doesn't fight for screen space.
Computer use: the assistant emits a <computer_use> JSON plan; the app parses it and executes each step.
- Actions: click, type, shortcut, scroll, open_app
- OmiElementResolver resolves click targets against the frontmost app's macOS Accessibility tree (batched reads, role-aware scoring, label normalization, Jaro-Winkler fuzzy fallback, 500 ms / 3000-node deadline)
- CuaActionDriver posts CGEvents; typing prefers AX direct insertion with a clipboard-staged Cmd+V fallback
- Context variables: {{selection}}, {{clipboard}}, {{transcript}}, {{app}}
Plan window: standalone floating panel listing every step with done / active / pending / failed indicators. Auto-dismisses on completion.
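For readers unfamiliar with the Jaro-Winkler fuzzy fallback mentioned for OmiElementResolver, here is a self-contained sketch of the standard algorithm applied to label matching. The function names, the normalization rule, and the 0.85 threshold are assumptions for illustration; the PR's resolver may normalize and score differently.

```swift
/// Lowercases a label and keeps only letters, digits, and spaces (assumed normalization).
func normalize(_ label: String) -> String {
    String(label.lowercased().filter { $0.isLetter || $0.isNumber || $0 == " " })
}

/// Jaro similarity in 0...1 between two strings.
func jaro(_ a: String, _ b: String) -> Double {
    let s = Array(a), t = Array(b)
    if s.isEmpty && t.isEmpty { return 1 }
    if s.isEmpty || t.isEmpty { return 0 }

    // Characters match only if equal and within this distance of each other.
    let window = max(0, max(s.count, t.count) / 2 - 1)
    var sMatched = [Bool](repeating: false, count: s.count)
    var tMatched = [Bool](repeating: false, count: t.count)
    var matches = 0

    for i in 0..<s.count {
        let lo = max(0, i - window)
        let hi = min(t.count - 1, i + window)
        if lo > hi { continue }
        for j in lo...hi where !tMatched[j] && s[i] == t[j] {
            sMatched[i] = true
            tMatched[j] = true
            matches += 1
            break
        }
    }
    if matches == 0 { return 0 }

    // Count matched characters that line up out of order (half-transpositions).
    var k = 0
    var halfTranspositions = 0
    for i in 0..<s.count where sMatched[i] {
        while !tMatched[k] { k += 1 }
        if s[i] != t[k] { halfTranspositions += 1 }
        k += 1
    }

    let m = Double(matches)
    let transpositions = Double(halfTranspositions / 2)
    return (m / Double(s.count) + m / Double(t.count) + (m - transpositions) / m) / 3
}

/// Jaro-Winkler: boosts the Jaro score for a shared prefix of up to four characters.
func jaroWinkler(_ a: String, _ b: String, prefixScale: Double = 0.1) -> Double {
    let base = jaro(a, b)
    let prefix = zip(a, b).prefix(4).prefix(while: { pair in pair.0 == pair.1 }).count
    return base + Double(prefix) * prefixScale * (1 - base)
}

/// Returns the best-scoring label at or above the threshold, or nil if none qualifies.
func bestLabel(for query: String, in labels: [String], threshold: Double = 0.85) -> String? {
    let q = normalize(query)
    let scored = labels.map { (label: $0, score: jaroWinkler(q, normalize($0))) }
    guard let best = scored.max(by: { $0.score < $1.score }),
          best.score >= threshold else { return nil }
    return best.label
}
```

The threshold trades missed clicks against wrong clicks: a lower value resolves more labels but raises the chance of clicking a similarly named control.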
Streaming hygiene: the cursor bubble strips the <computer_use> block (including partial mid-stream openers) so the user only sees the brief prose preamble, not the JSON.
Design notes
- AXIsProcessTrusted() is not used as a gate: it returns a stale false on macOS 26 and after re-signs.
- AXManualAccessibility = true is set per-app so Electron apps that respect it (VS Code, Slack, Notion) populate their AX tree.
Known limitations
Time-boxed demo, so a few rough edges remain — all solvable, just out of scope here:
- Apps that do not respect AXManualAccessibility (notably Spotify) expose almost nothing through AX, so click steps miss. Solvable with a vision fallback (CoreML/OCR refinement of the LLM's coordinate hint).
How to try it
Push to talk, then:
Watch the plan window list each step as it runs.