replacing the floating pill bar with an AI cursor #7453

Open: vendz wants to merge 4 commits into BasedHardware:main from vendz:feat/ai-cursor

Conversation

@vendz vendz commented May 22, 2026

Summary

Adds computer use to the Omi desktop app on top of a new cursor-anchored PTT overlay. The assistant can now drive the Mac directly: "open Notes and add this", etc.

Demo

Watch Demo Here

What's in it

Cursor PTT overlay — replaces the floating pill bar with a compact indicator that lives next to the mouse pointer. Idle dot → pulsing listening dot + live transcript → spinner → streaming response bubble. Doesn't steal focus, doesn't fight for screen space.

Computer use — the assistant emits a <computer_use> JSON plan; the app parses it and executes each step.

  • Action types: click, type, shortcut, scroll, open_app
  • OmiElementResolver resolves click targets against the frontmost app's macOS Accessibility tree (batched reads, role-aware scoring, label normalization, Jaro-Winkler fuzzy fallback, 500ms / 3000-node deadline)
  • CuaActionDriver posts CGEvents; typing prefers AX direct insertion with a clipboard-staged Cmd+V fallback
  • Context variables: {{selection}}, {{clipboard}}, {{transcript}}, {{app}}
  • ESC cancels in-flight plans
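
To make the parse step concrete, here is a minimal sketch of extracting a <computer_use> block from a model reply and decoding it. The JSON field names (description, steps, action, target, value) and the Sketch-prefixed types are assumptions for illustration only; the real schema is defined in OmiWorkflowPlan.swift and may differ.

```swift
import Foundation

// Hypothetical schema, not the PR's actual types.
enum SketchAction: String, Codable {
    case click, scroll, shortcut
    case typeText = "type"
    case openApp = "open_app"
}

struct SketchStep: Codable {
    let action: SketchAction
    let target: String?   // e.g. an AX label or an app name
    let value: String?    // e.g. text to type or a shortcut string
}

struct SketchPlan: Codable {
    let description: String
    let steps: [SketchStep]
}

/// Returns the decoded plan plus the reply with the tagged block removed,
/// or nil when there is no complete, decodable block.
func parsePlan(from reply: String) -> (plan: SketchPlan, cleanText: String)? {
    guard let open = reply.range(of: "<computer_use>"),
          let close = reply.range(of: "</computer_use>",
                                  range: open.upperBound..<reply.endIndex)
    else { return nil }

    let json = String(reply[open.upperBound..<close.lowerBound])
    guard let plan = try? JSONDecoder().decode(SketchPlan.self, from: Data(json.utf8))
    else { return nil }

    var clean = reply
    clean.removeSubrange(open.lowerBound..<close.upperBound)
    return (plan, clean.trimmingCharacters(in: .whitespacesAndNewlines))
}
```

Returning nil on malformed JSON keeps the reply visible as ordinary text instead of executing a half-understood plan.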

Plan window — standalone floating panel listing every step with done / active / pending / failed indicators. Auto-dismisses on completion.

Streaming hygiene — the cursor bubble strips the <computer_use> block (including partial mid-stream openers) so the user only sees the brief prose preamble, not the JSON.
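
A rough sketch of the stripping described above, under the assumption that the bubble re-sanitizes the whole accumulated text on every token. The function name sanitizeForBubble is hypothetical and is not the PR's sanitizeForCursorBubble implementation.

```swift
import Foundation

let openTag = "<computer_use>"
let closeTag = "</computer_use>"

/// Hides complete blocks, an opened-but-unclosed block, and a partial
/// opener at the very end of the stream (for example "<comp").
func sanitizeForBubble(_ streamed: String) -> String {
    var text = streamed

    while let open = text.range(of: openTag) {
        if let close = text.range(of: closeTag, range: open.upperBound..<text.endIndex) {
            // Complete block: cut it out and keep any prose after it.
            text.removeSubrange(open.lowerBound..<close.upperBound)
        } else {
            // Closer has not streamed in yet: hide everything from the opener on.
            text.removeSubrange(open.lowerBound..<text.endIndex)
        }
    }

    // A tail that is a proper prefix of the opener might still grow into one,
    // so it is hidden until the next token arrives. Trade-off: a reply that
    // genuinely ends in "<" loses that character in this sketch.
    for length in stride(from: min(openTag.count - 1, text.count), through: 1, by: -1) {
        if text.hasSuffix(String(openTag.prefix(length))) {
            text.removeLast(length)
            break
        }
    }
    return text.trimmingCharacters(in: .whitespacesAndNewlines)
}
```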

Design notes

  • AXIsProcessTrusted() is not used as a gate — it returns stale false on macOS 26 / after re-signs.
  • AXManualAccessibility = true is set per-app so Electron apps that respect it (VS Code, Slack, Notion) populate their AX tree.
  • Coordinates flow end-to-end in CG screen space — no AppKit round-trip, so multi-monitor clicks land correctly.
  • System prompt steers the model toward short, AX-style click targets and a brief plain-English preamble (no narrated step lists, no echoing typed contents).

Known limitations

Time-boxed demo, so a few rough edges remain — all solvable, just out of scope here:

  • Electron apps that ignore AXManualAccessibility (notably Spotify) expose almost nothing through AX, so click steps miss. Solvable with a vision fallback (CoreML/OCR refinement of the LLM's coordinate hint).
  • No retry on a failed step. Plan aborts on first error; a one-shot re-resolve after a short settle would handle most transient misses.
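
The one-shot retry mentioned above could look roughly like this. It is a synchronous sketch for brevity (the PR's executor is async and would use Task.sleep instead of blocking); runWithOneRetry, ElementNotFound, and the 0.3 s settle time are all hypothetical.

```swift
import Foundation

/// Hypothetical error standing in for a transient resolver miss.
struct ElementNotFound: Error {}

/// Runs `step`; if it throws, waits for the UI to settle and tries exactly once more.
/// A second failure propagates to the caller, so the plan still aborts on real errors.
func runWithOneRetry<T>(settle: TimeInterval = 0.3, _ step: () throws -> T) throws -> T {
    do {
        return try step()
    } catch {
        Thread.sleep(forTimeInterval: settle)
        return try step()
    }
}
```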

How to try it

Push to talk, then:

  • "Open Notes and write down: meeting with Sofia tomorrow at 3"
  • "Open Spotify"

Watch the plan window list each step as it runs.

Copilot AI review requested due to automatic review settings May 22, 2026 05:44
@vendz changed the title from "replaceing the floating pill bar with an AI cursor" to "replacing the floating pill bar with an AI cursor" May 22, 2026

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Introduces an AI “computer use” capability for the macOS desktop app, including prompt instructions, parsing <computer_use> plans from model output, and executing those plans via Accessibility/CGEvent automation while shifting PTT UX to a cursor-following overlay rather than showing the floating bar.

Changes:

  • Append computer-use instructions to the model system prompt and parse <computer_use> JSON blocks from assistant responses.
  • Add an execution pipeline (plan model, context substitution, element resolution, action driver/executor, plan progress window, Escape-to-cancel monitor).
  • Replace/augment PTT UI flows with a full-screen cursor overlay (idle/listening/processing/responding/notification/executing states) and keep the floating bar hidden during PTT.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| desktop/Desktop/Sources/Providers/ChatProvider.swift | Appends computer-use prompt fragment; parses <computer_use> blocks and triggers plan execution. |
| desktop/Desktop/Sources/MainWindow/DesktopHomeView.swift | Initializes cursor overlay idle state during PTT setup; stops auto-showing the floating bar. |
| desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift | Routes PTT UI state changes to the cursor overlay and updates transcripts used by the overlay. |
| desktop/Desktop/Sources/FloatingControlBar/OmiWorkflowPlan.swift | Defines the Codable plan/step schema for computer-use execution. |
| desktop/Desktop/Sources/FloatingControlBar/OmiPlanWindow.swift | Adds a standalone floating plan window showing step progress and failure states. |
| desktop/Desktop/Sources/FloatingControlBar/OmiEscapeMonitor.swift | Adds Escape key monitoring to cancel an executing plan. |
| desktop/Desktop/Sources/FloatingControlBar/OmiElementResolver.swift | Resolves human labels to UI element coordinates via the AX tree. |
| desktop/Desktop/Sources/FloatingControlBar/OmiContextResolver.swift | Substitutes context variables (selection/clipboard/transcript/app) into plan step values. |
| desktop/Desktop/Sources/FloatingControlBar/OmiComputerUseTool.swift | Provides the system prompt fragment plus parsing/stripping logic for <computer_use> blocks. |
| desktop/Desktop/Sources/FloatingControlBar/OmiActionExecutor.swift | Executes plans step-by-step, updates overlay + plan window, and handles cancellation. |
| desktop/Desktop/Sources/FloatingControlBar/OmiActionDriver.swift | Introduces the driver protocol and error types for UI automation. |
| desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift | Keeps the bar hidden for PTT; routes notifications to the cursor overlay; adjusts timings. |
| desktop/Desktop/Sources/FloatingControlBar/CursorPTTOverlayState.swift | Adds state model for the cursor overlay phases and content. |
| desktop/Desktop/Sources/FloatingControlBar/CursorPTTOverlayManager.swift | Adds full-screen overlay management, cursor tracking, streaming subscriptions, and auto-dismiss logic. |
| desktop/Desktop/Sources/FloatingControlBar/CursorBubbleView.swift | Renders the cursor-following UI for all overlay phases including execution progress. |
| desktop/Desktop/Sources/FloatingControlBar/CuaActionDriver.swift | Implements click/type/shortcut/scroll/open-app automation using AX + CGEvents. |

Comment on lines +2766 to +2771
// Strip computer_use tag from final text and fire executor
if let computeResult = OmiComputerUseTool.parse(from: messageText) {
    messageText = computeResult.cleanText
    OmiActionExecutor.shared.execute(plan: computeResult.plan, transcript: "")
    log("OmiComputerUseTool: fired executor for plan '\(computeResult.plan.description)'")
}
Comment on lines +14 to +21
monitor = NSEvent.addLocalMonitorForEvents(matching: .keyDown) { [weak self] event in
    // Escape keyCode is 53
    if event.keyCode == 53 {
        self?.fire()
        return nil // consume the event
    }
    return event
}
Comment on lines +199 to +202
let hosting = NSHostingView(
    rootView: CursorBubbleView(state: overlayState)
        .onTapGesture { [weak self] in self?.dismiss() }
)
Comment on lines +220 to +227
let source = DispatchSource.makeTimerSource(queue: .main)
source.schedule(deadline: .now(), repeating: .milliseconds(16))
source.setEventHandler { [weak self] in
    guard let self, let panel = self.panel else { return }
    let globalLoc = NSEvent.mouseLocation
    let localX = globalLoc.x - panel.frame.minX
    let localY = panel.frame.height - (globalLoc.y - panel.frame.minY)
    let newPos = CGPoint(x: localX, y: localY)
Comment on lines 205 to 212
FloatingControlBarManager.shared.setup(
    appState: appState, chatProvider: viewModelContainer.chatProvider)
if FloatingControlBarManager.shared.isEnabled {
    FloatingControlBarManager.shared.show()
}

// Set up push-to-talk voice input
if let barState = FloatingControlBarManager.shared.barState {
    PushToTalkManager.shared.setup(barState: barState)
    CursorPTTOverlayManager.shared.showIdle()
}
Comment on lines +53 to +57
let pasteboard = NSPasteboard.general
let previous = pasteboard.string(forType: .string)

pasteboard.clearContents()
pasteboard.setString(text, forType: .string)
Comment on lines +70 to +73
if let previous {
    pasteboard.clearContents()
    pasteboard.setString(previous, forType: .string)
}
Comment on lines +130 to +131
let screenHeight = NSScreen.screens.first?.frame.height ?? 0
let cgPoint = CGPoint(x: mouseLocation.x, y: screenHeight - mouseLocation.y)

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR replaces the floating pill bar with a cursor-anchored PTT overlay and adds computer-use automation for the Omi desktop app, allowing the assistant to drive macOS via CGEvents and the Accessibility API.

  • Cursor overlay (CursorPTTOverlayManager, CursorBubbleView, CursorPTTOverlayState): a full-screen transparent NSPanel tracks the cursor at ~60 fps and renders idle dot → listening → processing → responding → executing states using SwiftUI, with streaming-text sanitization to strip the <computer_use> JSON block before it reaches the bubble.
  • Computer-use pipeline (OmiComputerUseTool → OmiActionExecutor → CuaActionDriver / OmiElementResolver): the LLM emits a <computer_use> JSON plan that is parsed on stream completion; each step (click, type, shortcut, scroll, open_app) is executed sequentially via CGEvent injection, with AX tree resolution for click targets (batched reads, role-aware scoring, Jaro-Winkler fuzzy fallback, 500 ms / 3 000-node deadline).
  • Supporting infra: OmiEscapeMonitor arms a key-down handler for ESC cancel; OmiPlanWindow shows a floating step-progress panel; OmiContextResolver substitutes {{selection}}, {{clipboard}}, {{transcript}}, and {{app}} into step values before execution.
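
For readers unfamiliar with the fuzzy fallback named above, here is a self-contained sketch of Jaro-Winkler scoring over candidate labels. It is the textbook algorithm, not the PR's resolver; the normalize step and the 0.85 threshold are assumptions.

```swift
import Foundation

/// Jaro similarity in 0...1.
func jaro(_ a: String, _ b: String) -> Double {
    let s1 = Array(a), s2 = Array(b)
    if s1.isEmpty && s2.isEmpty { return 1 }
    if s1.isEmpty || s2.isEmpty { return 0 }
    let window = max(0, max(s1.count, s2.count) / 2 - 1)
    var matched1 = [Bool](repeating: false, count: s1.count)
    var matched2 = [Bool](repeating: false, count: s2.count)
    var matches = 0
    for i in s1.indices {
        let lo = max(0, i - window), hi = min(s2.count - 1, i + window)
        for j in stride(from: lo, through: hi, by: 1) where !matched2[j] && s1[i] == s2[j] {
            matched1[i] = true
            matched2[j] = true
            matches += 1
            break
        }
    }
    if matches == 0 { return 0 }
    // Count matched characters that appear in a different order.
    var outOfOrder = 0, k = 0
    for i in s1.indices where matched1[i] {
        while !matched2[k] { k += 1 }
        if s1[i] != s2[k] { outOfOrder += 1 }
        k += 1
    }
    let m = Double(matches), t = Double(outOfOrder / 2)
    return (m / Double(s1.count) + m / Double(s2.count) + (m - t) / m) / 3
}

/// Jaro-Winkler: boosts the Jaro score for a shared prefix of up to 4 characters.
func jaroWinkler(_ a: String, _ b: String, prefixScale: Double = 0.1) -> Double {
    let base = jaro(a, b)
    var prefix = 0
    for (x, y) in zip(a, b).prefix(4) {
        guard x == y else { break }
        prefix += 1
    }
    return base + Double(prefix) * prefixScale * (1 - base)
}

/// Assumed normalization: lowercase and trim surrounding whitespace.
func normalize(_ label: String) -> String {
    label.lowercased().trimmingCharacters(in: .whitespacesAndNewlines)
}

/// Picks the best-scoring label, or nil if nothing clears the (assumed) threshold.
func bestMatch(for target: String, in labels: [String], threshold: Double = 0.85) -> String? {
    let wanted = normalize(target)
    let scored = labels.map { (label: $0, score: jaroWinkler(wanted, normalize($0))) }
    guard let best = scored.max(by: { $0.score < $1.score }), best.score >= threshold
    else { return nil }
    return best.label
}
```

A threshold matters here because a fuzzy match that is only weakly similar would turn into a click on the wrong control.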

Confidence Score: 2/5

Three functional defects in the core execution path cause silent wrong-behavior on every computer-use invocation.

The ESC cancel monitor is wired as a local (app-focused) event handler, but the app loses focus the instant automation starts driving another app — cancel never fires. The {{transcript}} context variable always resolves to an empty string because ChatProvider passes "" to the executor. The clipboard-based type fallback restores the original clipboard after a fixed 100 ms sleep, which is not enough time to guarantee the target app has processed the Cmd+V paste, meaning the wrong content can silently be pasted. All three issues are in the hot path of every computer-use action.

OmiEscapeMonitor.swift, ChatProvider.swift, and CuaActionDriver.swift each contain a distinct defect in the execution path.

Important Files Changed

| Filename | Overview |
| --- | --- |
| desktop/Desktop/Sources/FloatingControlBar/OmiEscapeMonitor.swift | Local event monitor never fires when a target app is frontmost, making ESC cancel non-functional during plan execution. |
| desktop/Desktop/Sources/Providers/ChatProvider.swift | Executor always receives an empty transcript string, breaking {{transcript}} context variable substitution in all computer-use plans. |
| desktop/Desktop/Sources/FloatingControlBar/CuaActionDriver.swift | Clipboard fallback has a timing race (100 ms restore sleep) and the scroll action uses the wrong screen height on multi-monitor setups. |
| desktop/Desktop/Sources/FloatingControlBar/OmiElementResolver.swift | Batched AX attribute reads, role-aware scoring, and Jaro-Winkler fallback look solid; deadline/budget guards prevent hangs. |
| desktop/Desktop/Sources/FloatingControlBar/OmiComputerUseTool.swift | JSON parsing and tag-stripping logic are correct; system prompt fragment is well-structured. |
| desktop/Desktop/Sources/FloatingControlBar/CursorPTTOverlayManager.swift | State machine transitions and sanitizeForCursorBubble mid-stream tag stripping are implemented correctly; cursor timer logic is sound. |
| desktop/Desktop/Sources/FloatingControlBar/OmiActionExecutor.swift | Sequential step execution with cancellation checks and inter-step settle delay is correct; passes empty transcript to executor. |
| desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift | Overlay state transitions look correct; old FloatingControlBarManager resize paths cleanly replaced. |

Sequence Diagram

sequenceDiagram
    participant U as User (PTT)
    participant PTT as PushToTalkManager
    participant CP as ChatProvider
    participant OV as CursorPTTOverlayManager
    participant CUT as OmiComputerUseTool
    participant EX as OmiActionExecutor
    participant CR as OmiContextResolver
    participant ER as OmiElementResolver
    participant DRV as CuaActionDriver
    participant PW as OmiPlanWindow
    participant ESC as OmiEscapeMonitor

    U->>PTT: Option key down
    PTT->>OV: startListening(barState)
    PTT->>CP: send voice query
    CP->>OV: startResponding(barState)
    CP-->>OV: stream tokens (sanitized, no JSON block)
    CP->>CUT: parse(from: finalText)
    CUT-->>CP: (plan, cleanText)
    CP->>EX: execute(plan, transcript:"") ⚠️ always empty
    EX->>ESC: arm(onCancel) ⚠️ local monitor only
    EX->>OV: startExecution(description, steps)
    EX->>PW: startExecution(steps)
    loop Each step
        EX->>CR: resolve(plan, transcript)
        CR-->>EX: substituted plan
        EX->>OV: updateExecutingStep(index)
        EX->>PW: updateStep(index)
        alt click
            EX->>ER: resolve(label) — AX tree walk
            ER-->>EX: (point, app)
            EX->>DRV: click(at:targetApp:)
        else type
            EX->>DRV: type(text:targetApp:nil)
            DRV-->>DRV: AX insert or clipboard+Cmd+V ⚠️ 100ms race
        else open_app / shortcut / scroll
            EX->>DRV: openApp/pressShortcut/scroll
        end
    end
    EX->>ESC: disarm
    EX->>PW: finishExecution
    EX->>OV: finishExecution → dismiss after 0.8s

Reviews (1): Last reviewed commit: "feat(desktop): add computer-use action e..."

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7d2e6c565f

// Strip computer_use tag from final text and fire executor
if let computeResult = OmiComputerUseTool.parse(from: messageText) {
    messageText = computeResult.cleanText
    OmiActionExecutor.shared.execute(plan: computeResult.plan, transcript: "")

P1: Pass voice transcript into computer-use executor

OmiActionExecutor is always invoked with transcript: "", but OmiContextResolver uses that value to substitute {{transcript}} in action steps. Any plan that relies on that token (for example, voice flows like “open Notes and add this”) will type an empty string instead of the user’s utterance, so the advertised voice-to-action path loses user content.

// Disarm any existing monitor first (safety)
disarm()
self.onCancel = onCancel
monitor = NSEvent.addLocalMonitorForEvents(matching: .keyDown) { [weak self] event in

P1: Capture Escape globally while plan drives other apps

The cancel path arms NSEvent.addLocalMonitorForEvents, which only observes events routed through this app, but action execution repeatedly activates external target apps before input events are sent. In that common case, pressing Escape is delivered to the target app, so the local monitor never fires and in-flight plans cannot be canceled as promised.

} else {
    OmiPlanWindow.shared.finishExecution()
}
CursorPTTOverlayManager.shared.finishExecution()

P2: Avoid marking failed plans as done in cursor overlay

When a step throws, the code records failedIndex and marks the plan window step as failed, but it still unconditionally calls CursorPTTOverlayManager.finishExecution(), which sets the executing label to "done" and auto-dismisses. This presents a success state for failed executions and hides actionable failure feedback from users.
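
One way to keep the overlay honest is to pass the outcome into the finish call rather than finishing unconditionally. The sketch below is illustrative only: SketchPlanOutcome, the label strings, and the choice to auto-dismiss only on success are assumptions, not the PR's API.

```swift
/// Hypothetical outcome type; the PR tracks failure via failedIndex instead.
enum SketchPlanOutcome: Equatable {
    case succeeded
    case failed(stepIndex: Int)   // zero-based index of the step that threw
    case cancelled
}

/// Label the cursor overlay would show when execution ends.
func overlayLabel(for outcome: SketchPlanOutcome) -> String {
    switch outcome {
    case .succeeded:
        return "done"
    case .failed(let stepIndex):
        return "failed at step \(stepIndex + 1)"
    case .cancelled:
        return "cancelled"
    }
}

/// Only a successful run dismisses itself, so failures stay visible.
func shouldAutoDismiss(_ outcome: SketchPlanOutcome) -> Bool {
    outcome == .succeeded
}
```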

Comment on lines +223 to +225
// US-ANSI key code map for a-z and 0-9
private let charKeyCodeMap: [UInt32: CGKeyCode] = [
    // a-z

P2: Add keycode support for prompted Cmd+, shortcut

The shortcut parser only recognizes keys present in charKeyCodeMap (letters, digits, -, =), so punctuation shortcuts like Cmd+, resolve to nil and throw unparseableShortcut. Since the system prompt explicitly recommends Cmd+, for settings, this causes valid model outputs to fail at runtime.
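
A sketch of what punctuation support could look like. The key codes are the US-ANSI virtual key codes from Carbon's Events.h (kVK_ANSI_Comma is 0x2B, and so on); the "Cmd+Shift+X" string format, the SketchShortcut type, and parseShortcut are assumptions, and the map below covers punctuation only.

```swift
import Foundation

/// US-ANSI virtual key codes for punctuation keys.
let sketchPunctuationKeyCodes: [Character: UInt16] = [
    ",": 0x2B, ".": 0x2F, "/": 0x2C, ";": 0x29, "'": 0x27,
    "[": 0x21, "]": 0x1E, "\\": 0x2A, "`": 0x32, "-": 0x1B, "=": 0x18,
]

/// Hypothetical parsed form: lowercase modifier names plus one key code.
struct SketchShortcut: Equatable {
    var modifiers: Set<String>
    var keyCode: UInt16
}

/// Parses strings such as "Cmd+," or "Cmd+Shift+/".
/// Returns nil for keys outside the map instead of guessing.
func parseShortcut(_ text: String) -> SketchShortcut? {
    let parts = text
        .split(separator: "+", omittingEmptySubsequences: true)
        .map { $0.trimmingCharacters(in: .whitespaces).lowercased() }
    guard let key = parts.last, key.count == 1,
          let character = key.first,
          let code = sketchPunctuationKeyCodes[character]
    else { return nil }
    return SketchShortcut(modifiers: Set(parts.dropLast()), keyCode: code)
}
```

Because "+" is the separator, this sketch cannot express a shortcut whose key is "+" itself; a real parser would need an escape or a named key for that case.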

Comment on lines +14 to +21
monitor = NSEvent.addLocalMonitorForEvents(matching: .keyDown) { [weak self] event in
    // Escape keyCode is 53
    if event.keyCode == 53 {
        self?.fire()
        return nil // consume the event
    }
    return event
}

P1 ESC cancel silently broken during plan execution

NSEvent.addLocalMonitorForEvents only intercepts key events when the Omi app itself is frontmost. However, during plan execution the automation driver activates the target app (Notes, Spotify, etc.), making Omi the background process. Any ESC key the user presses in the target app to abort the plan is never delivered to this monitor, so the escape handler never fires. An addGlobalMonitorForEvents(matching: .keyDown) monitor is needed so that ESC fires regardless of which app has focus.
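
A minimal sketch of arming both monitors, assuming a class shaped like the PR's OmiEscapeMonitor (the name EscapeCancelMonitor and its layout are hypothetical). Two caveats from the NSEvent documentation apply: a global monitor only observes events and cannot consume them, so the target app still receives the ESC; and global key monitoring requires the app to be trusted for accessibility. The local monitor is kept because global monitors do not see events sent to the app itself.

```swift
/// Key code 53 is Escape, the same value the PR's local monitor checks.
func isEscape(keyCode: UInt16) -> Bool { keyCode == 53 }

#if canImport(AppKit)
import AppKit

final class EscapeCancelMonitor {
    private var localMonitor: Any?
    private var globalMonitor: Any?
    private var onCancel: (() -> Void)?

    func arm(onCancel: @escaping () -> Void) {
        disarm()
        self.onCancel = onCancel

        // Fires while this app is frontmost; returning nil consumes the event.
        localMonitor = NSEvent.addLocalMonitorForEvents(matching: .keyDown) { [weak self] event in
            guard isEscape(keyCode: event.keyCode) else { return event }
            self?.fire()
            return nil
        }

        // Fires while another app is frontmost; observe-only.
        globalMonitor = NSEvent.addGlobalMonitorForEvents(matching: .keyDown) { [weak self] event in
            if isEscape(keyCode: event.keyCode) { self?.fire() }
        }
    }

    func disarm() {
        if let localMonitor { NSEvent.removeMonitor(localMonitor) }
        if let globalMonitor { NSEvent.removeMonitor(globalMonitor) }
        localMonitor = nil
        globalMonitor = nil
        onCancel = nil
    }

    private func fire() {
        let callback = onCancel
        disarm()          // one-shot: a second ESC does nothing
        callback?()
    }
}
#endif
```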

Comment on lines 2762 to +2766
if let index = messages.firstIndex(where: { $0.id == aiMessageId }) {
// Message still in memory — update it in-place
messageText = messages[index].text.isEmpty ? queryResult.text : messages[index].text

// Strip computer_use tag from final text and fire executor

P1 {{transcript}} context variable never substitutes

OmiActionExecutor.shared.execute(plan: computeResult.plan, transcript: "") always passes an empty string for transcript. OmiContextResolver maps "{{transcript}}" to this value, so any plan step containing {{transcript}} (e.g., "value": "{{transcript}}" to paste what the user just said) will always resolve to an empty string. The voice transcript is already stored in barState.voiceTranscript — that value should be forwarded here.
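
To illustrate why the empty argument matters, here is a small stand-in for the substitution step. The substitute function and its dictionary-based interface are hypothetical, not OmiContextResolver's actual API; it replaces {{name}} tokens in a single pass so substituted values are never re-scanned.

```swift
import Foundation

/// Replaces {{name}} tokens using `values`; unknown tokens are left as written.
func substitute(_ template: String, values: [String: String]) -> String {
    var output = ""
    var rest = Substring(template)
    while let open = rest.range(of: "{{"),
          let close = rest.range(of: "}}", range: open.upperBound..<rest.endIndex) {
        let name = String(rest[open.upperBound..<close.lowerBound])
        output.append(contentsOf: rest[rest.startIndex..<open.lowerBound])
        output.append(contentsOf: values[name] ?? String(rest[open.lowerBound..<close.upperBound]))
        rest = rest[close.upperBound...]
    }
    output.append(contentsOf: rest)
    return output
}

// With the utterance forwarded, the step types what the user said.
let withTranscript = substitute("{{transcript}}", values: ["transcript": "meeting with Sofia tomorrow at 3"])

// With transcript: "" the same step silently types nothing.
let withEmptyTranscript = substitute("{{transcript}}", values: ["transcript": ""])
```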

Comment on lines +53 to +73
let pasteboard = NSPasteboard.general
let previous = pasteboard.string(forType: .string)

pasteboard.clearContents()
pasteboard.setString(text, forType: .string)

// Post Cmd+V (key code 9)
let src = CGEventSource(stateID: .hidSystemState)
let keyDown = CGEvent(keyboardEventSource: src, virtualKey: 9, keyDown: true)
let keyUp = CGEvent(keyboardEventSource: src, virtualKey: 9, keyDown: false)
keyDown?.flags = .maskCommand
keyUp?.flags = .maskCommand
keyDown?.post(tap: .cghidEventTap)
keyUp?.post(tap: .cghidEventTap)

try await Task.sleep(for: .milliseconds(100))

if let previous {
    pasteboard.clearContents()
    pasteboard.setString(previous, forType: .string)
}

P1 Clipboard restore race can paste wrong content

CGEvent.post(tap:) enqueues the Cmd+V into the HID event stream but does not block until the target application has read the pasteboard. The 100 ms sleep is a best-effort delay; if the target app is mid-animation, loading, or the system is under load, it may not have consumed the paste before setString(_:forType:) restores the previous clipboard value. When that race is lost, the app pastes the user's original clipboard content instead of the text argument, silently writing the wrong string.

Comment on lines +125 to +132
    app.activate(options: .activateIgnoringOtherApps)
    try await Task.sleep(for: .milliseconds(80))
}

let mouseLocation = NSEvent.mouseLocation
let screenHeight = NSScreen.screens.first?.frame.height ?? 0
let cgPoint = CGPoint(x: mouseLocation.x, y: screenHeight - mouseLocation.y)


P1 Multi-monitor scroll lands at wrong coordinates

NSScreen.screens.first?.frame.height is the height of the primary display, not the display that currently contains the cursor. In a multi-monitor setup where the cursor is on a secondary screen, the AppKit-to-CG flip (screenHeight - mouseLocation.y) uses the wrong height, producing a cgPoint that lands on a different screen entirely. The correct height to use is the frame height of the screen that contains mouseLocation (the same screen lookup already used in updatePanelScreenIfNeeded).
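
Before changing this, it may be worth re-checking the coordinate convention with a worked example. The sketch assumes the documented behavior that NSEvent.mouseLocation is in AppKit's global space (origin at the bottom-left of the primary screen, which is NSScreen.screens[0]) and that CG global space has its origin at the top-left of that same screen. Under that assumption the flip uses the primary screen's height for a point on any display, and substituting the containing screen's height only works if that screen's origin offset is also accounted for. The display sizes below are made up.

```swift
/// Flips a y coordinate from AppKit global space to CG global space.
func cgY(fromAppKitY y: Double, primaryScreenHeight: Double) -> Double {
    primaryScreenHeight - y
}

// Hypothetical layout: a 1440-point-tall primary display, plus a 1080-point-tall
// secondary display to its right with bottom edges aligned (AppKit origin.y == 0).
let primaryHeight = 1440.0
let secondaryHeight = 1080.0

// Cursor 100 points above the bottom edge of the secondary display.
let appKitY = 100.0

// Ground truth in CG space: the secondary display's top edge sits at
// primaryHeight - secondaryHeight, and the cursor is
// secondaryHeight - appKitY below that edge.
let expectedCGY = (primaryHeight - secondaryHeight) + (secondaryHeight - appKitY)

let usingPrimaryHeight = cgY(fromAppKitY: appKitY, primaryScreenHeight: primaryHeight)
let usingSecondaryHeight = secondaryHeight - appKitY   // off by the 360-point offset
```

If this holds on real hardware, a multi-monitor scroll miss would more likely come from somewhere other than the height used in the flip.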

Comment on lines +42 to +44
if focusResult == .success, let focused = focusedElement {
    let axElement = focused as! AXUIElement
    let setResult = AXUIElementSetAttributeValue(axElement, kAXSelectedTextAttribute as CFString, text as CFTypeRef)

P2 Force-casting focused to AXUIElement after a successful AXUIElementCopyAttributeValue call is safe in practice, but the as! will crash if the API ever returns a different type. A conditional cast (as? AXUIElement) keeps the same safe-falling-back-to-clipboard behavior without the crash risk.

Suggested change
- if focusResult == .success, let focused = focusedElement {
-     let axElement = focused as! AXUIElement
-     let setResult = AXUIElementSetAttributeValue(axElement, kAXSelectedTextAttribute as CFString, text as CFTypeRef)
+ if focusResult == .success, let axElement = focusedElement as? AXUIElement {
+     let setResult = AXUIElementSetAttributeValue(axElement, kAXSelectedTextAttribute as CFString, text as CFTypeRef)
