Skip to content
Merged
Show file tree
Hide file tree
Changes from 24 commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
3b52ce1
Add 'New browser': live streamed Chromium with human-in-the-loop + br…
MT-GoCode Jun 22, 2026
8d31853
Anthropic-key-only + instant 'Interrupt & take control'
MT-GoCode Jun 23, 2026
74b9771
Add agentic-browser-fleet: many agents, many browsers, one atomic own…
MT-GoCode Jun 23, 2026
0ede6a0
Fix browser fleet "+" menu label for free browsers
MT-GoCode Jun 23, 2026
316df9b
Pivot agentic-browser-fleet to direct control (Claude drives the brow…
MT-GoCode Jun 23, 2026
1f5c202
Remove the Anthropic-key gate on browser sessions (direct control is …
MT-GoCode Jun 24, 2026
ceaedec
Auto-open browser panes, uniform tab resolution, viewer nav + ownersh…
MT-GoCode Jun 24, 2026
9e1973a
Release browser when done; keep browser work on the foreground agent;…
MT-GoCode Jun 24, 2026
c56e6c0
Control bar lists the waiting queue; hand-back button only when an ag…
MT-GoCode Jun 24, 2026
a607038
Make human take-control a true handoff: queue the agent and auto-resu…
MT-GoCode Jun 24, 2026
31429f1
Fix handoff/queue bugs found by code review (double-queue re-grant, l…
MT-GoCode Jun 24, 2026
b98c219
Detect browser crashes and report them cleanly (no more silent freeze)
MT-GoCode Jun 24, 2026
a09c838
Add `agentic-browser-fleet close <id>` to shut down a whole browser
MT-GoCode Jun 24, 2026
06a2ed9
Persist the browser fleet across a workspace restart (profiles + mani…
MT-GoCode Jun 24, 2026
4d27454
Fix persistence/restore durability bugs found by the dead-browser cod…
MT-GoCode Jun 24, 2026
ed0cfb4
Always-on control indicator, one-pane-per-browser, clearer non-primar…
MT-GoCode Jun 24, 2026
6a0f039
Open a new browser as its own pane; let any user-started agent show p…
MT-GoCode Jun 24, 2026
eb7b0a0
Sticky human control + agent CAPTCHA handoff + cross-modality sandbox
MT-GoCode Jun 25, 2026
fe83e04
Fix Lima 504 (root sandbox), gate New browser on fleet readiness
MT-GoCode Jun 25, 2026
e6b6a71
Merge remote-tracking branch 'origin/main' into feat/agentic-browser-…
MT-GoCode Jun 25, 2026
f874e42
Give real-Chromium integration tests a 120s timeout (CI green)
MT-GoCode Jun 25, 2026
3991b55
Skip real-Chromium integration tests under GitHub Actions CI
MT-GoCode Jun 25, 2026
a7c56b0
test_e2e: open New terminal (not removed 'New URL') for the split-pla…
MT-GoCode Jun 25, 2026
a1041c1
Rewrite agentic-browser-fleet skill: directly instructional, less sys…
MT-GoCode Jun 26, 2026
fc62b05
Refactor browser daemon to Flask + sync web layer; quarantine async b…
MT-GoCode Jun 26, 2026
6ea712e
Named browser fleet: no default browser-0, word-name ids, serialized …
MT-GoCode Jun 26, 2026
67ba6e8
Fix stale "Starting browser..." pane + make crash overlay full-cover
MT-GoCode Jun 26, 2026
1f6c523
Merge remote-tracking branch 'origin/main' into feat/agentic-browser-…
MT-GoCode Jun 26, 2026
5807ffc
Explicit browser lifecycle state (init/running/crashed); deterministi…
MT-GoCode Jun 26, 2026
565bea0
Fix code-review findings; surface create failures; close acquire-disc…
MT-GoCode Jun 26, 2026
6ebd29f
Fix: remove key from CreateBrowserModal vnode (broke all "+" modals)
MT-GoCode Jun 26, 2026
b616a9d
Make the sync web / async engine boundary legible (reviewer feedback)
MT-GoCode Jun 29, 2026
361e716
Harden ownership/queuing across every control edge case
MT-GoCode Jun 29, 2026
29cbd61
Consistent task/lock busy messages (promise the wake + re-check fallb…
MT-GoCode Jun 29, 2026
246b285
Merge origin/main into feat/agentic-browser-fleet
MT-GoCode Jun 30, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
266 changes: 266 additions & 0 deletions .agents/skills/agentic-browser-fleet/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,266 @@
---
name: agentic-browser-fleet
description: Drive a fleet of shared Chromium browsers yourself, one command at a time, from your shell. Use when the user wants you to do something on the web (log in somewhere, fill a form, click through a flow, read a page that needs interaction) rather than just fetch a URL. YOU run the `agentic-browser-fleet` commands, look at the page, decide what to click, and click it -- in this same chat, with your own reasoning.
---

# Driving the browser fleet

You operate the browser by running `agentic-browser-fleet` commands directly: ask the page what's on it, decide what to do, do it, look again. Run every command from the repo root via `uv run`:

```bash
uv run agentic-browser-fleet <command> ...
```

## The loop

1. `state <id>` -- prints the page as a numbered list of clickable elements.
2. Read it, decide which element you want.
3. Act: `click <id> <index>` (or `input` / `select` / `scroll` / `keys` / `open`).
4. `state <id>` again to see what changed. Repeat.

Worked end to end:

```text
uv run agentic-browser-fleet open 0 https://example.com -> ok: navigate
uv run agentic-browser-fleet state 0
browser 0 @ https://example.com/ (Example Domain)
Example Domain
This domain is for use in illustrative examples...
[18]<a /> Learn more
uv run agentic-browser-fleet click 0 18 -> ok: click
uv run agentic-browser-fleet state 0 # re-state: page is now iana.org
browser 0 @ https://www.iana.org/help/example-domains (Example Domains)
...
```

Every clickable thing gets a `[number]`; that number is what you pass to `click` / `input` / `select`.

### Requery before you act

The `[number]` indices come from the **latest** `state` and are **ephemeral** -- the page re-numbers its elements whenever it changes. So:

- **Always `state <id>` before you `click`.**
- **Re-run `state <id>` after every `open`, `click`, `select`, `scroll`, or `tab`, and after you regain control of a browser.**

If you click against a stale index the CLI refuses rather than mis-clicking:

```text
uv run agentic-browser-fleet click 0 18
that element index is stale -- run `state 0` again first (exit 1)
```

Treat that as "look first": run `state 0`, find the element under its new number, click that.

### No API key needed

`state` / `open` / `click` / `input` / `select` / `scroll` / `keys` / `screenshot` / `tab` are deterministic mechanical operations -- no LLM, no API key. (Only the optional `task` fallback at the end uses an LLM and needs a key.)

## Commands

Every command's **first argument is the browser id** (`0` is the default browser). The fleet-level commands `ls` and `new` are the exception.

### Picking and making browsers

```bash
uv run agentic-browser-fleet ls
```

```text
browser 0: you -- 2 tab(s), active: https://example.com/invoices
browser 1: agent alice -- 1 tab(s), active: https://news.example.com
browser 2: human (took control) -- 1 tab(s), active: https://bank.example.com
browser 3: free -- 1 tab(s), active: (no tab)
```

`ls` shows the whole fleet: each browser's id, who controls it (`you`, `agent <name>`, `human (took control)`, or `free`), tab count, and active tab URL -- so you can pick one.

- `ls --include-tabs` lists every tab of every browser:

```text
[0]* Invoices https://example.com/invoices
[1] Dashboard https://example.com/home
```

- `new` starts another browser and prints its id (`-> started browser 4`).
- `close <id>` closes an entire browser (all its tabs) and retires its id (never reused). Use when permanently done with a browser. For a single tab, use `tab <id> close`.
- The fleet is **capped (5 by default)**. `new` past the cap returns `Too many open browsers (5/5)` -- `release` or `close` one you're done with first.
- If there are no browsers yet, `ls` says so; running any command on browser 0 (e.g. `state 0`) starts it.

### Looking at the page

```bash
uv run agentic-browser-fleet state 0
```

Prints `browser 0 @ <url> (<title>)`, a `tabs:` line if more than one tab is open, then the numbered interactive elements. A page with none prints `(no interactive elements -- try screenshot)`. This is your eyes -- run it constantly.

```bash
uv run agentic-browser-fleet screenshot 0
# -> screenshot saved: /path/to/shot.png (Read it to view)
```

`screenshot` saves a PNG and prints its path. **Read that path** with your Read tool to see the page -- use it for visual layouts, canvas, charts, captchas, or anything `state`'s text list can't convey.

### Acting on the page

| Command | What it does |
|---|---|
| `open <id> <url>` | Navigate the browser to a URL. (`-> ok: navigate`) |
| `click <id> <index>` | Click the element with that index from the last `state`. (`-> ok: click`) |
| `input <id> <index> "text"` | Type text into the field at that index. (`-> ok: input`) |
| `select <id> <index> "value"` | Choose an option in a `<select>` dropdown by its visible text. (`-> ok: select`) |
| `scroll <id> [down\|up] [--amount N]` | Scroll the page. Direction defaults to `down`; `--amount` is pixels (default 500). |
| `keys <id> "Enter"` | Send keyboard keys, e.g. `"Enter"`, `"Control+a"`, `"Tab"`. |

A typical fill-and-submit:

```bash
uv run agentic-browser-fleet state 0 # find the field indices
uv run agentic-browser-fleet input 0 5 "alice@example.com" # email field
uv run agentic-browser-fleet input 0 6 "hunter2" # password field
uv run agentic-browser-fleet click 0 7 # the "Log in" button
uv run agentic-browser-fleet state 0 # re-state: landed on the dashboard?
```

(Or `keys 0 "Enter"` to submit the focused form instead of clicking the button.)

### Tabs within one browser

```bash
uv run agentic-browser-fleet tab 0 list # list this browser's tabs
uv run agentic-browser-fleet tab 0 new --url https://example.com/help
uv run agentic-browser-fleet tab 0 switch 1 # make tab index 1 active
uv run agentic-browser-fleet tab 0 close 2 # close tab index 2
```

`tab` with no action defaults to `list`. After `switch` / `new` / `close` the active page changed, so **`state <id>` again** before clicking.

### Ownership commands

```bash
uv run agentic-browser-fleet acquire 0 # reserve browser 0 across commands
uv run agentic-browser-fleet acquire 0 --reclaim # take it back from a human -- ONLY on their say-so
uv run agentic-browser-fleet release 0 # let it go (alias: unlock 0)
uv run agentic-browser-fleet handoff 0 "solve the CAPTCHA" # hand to the human (alias: request-human)
```

You usually don't need `acquire`: your first command on a browser auto-acquires it and you keep a sticky lease across the commands that follow. Use `acquire` to explicitly reserve a browser, or to queue behind / reclaim one that's held. `release` (alias `unlock`) hands it back; releasing one that wasn't yours prints `browser <id> was not yours to release` and still exits `0`.

## Ownership rules

Every browser has exactly one controller; every command's output names the owner.

- **You auto-acquire and hold a sticky lease.** No manual `acquire` needed for normal driving.
- **Release when a browser leaves your active work** (`release <id>`), so control returns to the human immediately rather than after the ~90s idle timeout:
- Task finished on that browser -> release it.
- User tells you to stop -> release every browser you were driving.
- You switch to a different browser for the rest of the task -> release the one you're leaving.
- Driving several at once -> keep them until fully done, then release each.
- If you forget, an idle lease auto-frees after ~90s; if a later command says you no longer hold it, just acquire it again.
- **The human always wins.** If a human takes control, your next command comes back with status `busy_human`/`lost_control` (exit 2). You lost control: **stop, tell the user the human took the wheel, and end your turn.** Do not retry, poll, or `--reclaim` on your own. You're queued to resume first; you'll be messaged when they hand it back. On resume, **re-run `state <id>` first** (the page changed), then continue. Resume early only on an explicit "keep going": `acquire <id> --reclaim`, then `state <id>`.
- **Agents never preempt each other.** A browser another agent holds returns (exit 3):

```text
browser 1 is held by another agent -- you're queued for it...
```

Default to a **different** browser (or `new`) for unrelated work; only `acquire 1` to queue when you specifically need that browser. (Contrast with a human taking *your* browser: there your work is on it, so you wait and resume the same one.)
- **A browser can crash.** If Chromium is killed, your next command on it returns (exit 1):

```text
browser 0 crashed (Chromium was killed -- e.g. out of memory) and is gone.
Start a fresh one with `new` (it gets a new number).
```

That browser is gone for good -- don't retry or re-`state` it; run `new` and carry on there.
- **Browsers persist across a restart** (browsers, tabs, cookies/logins/history are saved), so a site you logged into earlier is probably still logged in. Right after a restart the fleet is restoring; a state-changing command may briefly return (exit 3):

```text
the browser fleet is still starting up (restoring your saved browsers) -- try again in a few seconds.
```

Wait a moment and retry (`ls` and `state` work during this window).

## Hitting a wall a human must clear (CAPTCHA / 2FA / login)

When you hit a CAPTCHA, reCAPTCHA / hCaptcha / Cloudflare "verify you're human" challenge, an "I'm not a robot" checkbox, an SMS / 2FA / OTP code you don't have, or a login needing the user's own credentials -- **do not try to solve it yourself** (you'll fail and may get the account flagged). Hand it off:

```bash
uv run agentic-browser-fleet handoff 0 "solve the CAPTCHA on the sign-in page"
```

`handoff` (alias `request-human`) puts you at the **front** of that browser's resume queue, hands control to the human (pinned -- won't pass to another agent), and surfaces the pane so they can see it. In the **same turn**: tell the user exactly what to do and on which page, then **end your turn** (exit 2). You're woken first when they hand control back -- **re-run `state 0`** to confirm the challenge cleared, then carry on.

## Live view vs. your output

The browser shows up live in a UI pane next to your chat so the human can watch you operate it. That pane is **viewer only** -- your actual output (the `state` listings, the `ok:`/error lines, the screenshot paths) is in your CLI output here in the chat. Read and relay the CLI output; don't tell the user to "check the tab" for results.

## Multiple browsers, tabs, sub-agents

- **Multiple browsers:** use different ids; they're independent and don't queue against each other. Drive several at once just by varying the id.
- **Tabs:** `tab <id> ...` manages tabs within one browser.
- **Drive the browser yourself, here in this chat.** A `launch-task` sub-agent runs in a separate, isolated container with no access to this workspace's browser fleet, so do web/browser work yourself. If a sub-agent needs something from the web, have it tell you what it needs and you do the browsing. (A parent passing its chat to a sub-agent can set `BROWSER_FLEET_ANCHOR` so panes anchor to that chat, but the daemon/fleet still isn't reachable from an isolated sub-agent.)

## Exit codes -- branch on these

| Code | Meaning | What to do |
|---|---|---|
| `0` | ok | Read the output; for `state`, decide your next action. |
| `1` | error / stale index / crashed browser | Stale index: `state <id>`, find the new number, retry. Crashed: `new`. Else read the message. |
| `2` | preempted (human took control, or you ran `handoff`) | **Stop and end your turn.** Tell the user; you'll be messaged to resume (re-run `state <id>` first). Don't poll or `--reclaim` on your own. |
| `3` | busy (another agent holds it, or fleet full / still restoring) | Use a different browser (or `new`); you're queued and will be messaged when it frees. For "restoring", wait and retry. |
| `4` | timed out (waited via `--max-wait` and another agent still held it) | Try later, or pick a different browser. |
| `64` | usage (`MNGR_AGENT_ID` unset / bad arguments) | Run from inside an agent shell; fix the command. |
| `69` | no daemon (can't reach the browser service) | The service isn't running -- report it; don't blindly retry. |

## Quick recipes

```bash
# Look, then act.
uv run agentic-browser-fleet state 0
uv run agentic-browser-fleet click 0 12
uv run agentic-browser-fleet state 0 # always re-state after acting

# Read a page by eye when the text list isn't enough.
uv run agentic-browser-fleet open 0 https://example.com/pricing
uv run agentic-browser-fleet screenshot 0 # then Read the printed PNG path

# Search and submit with the keyboard.
uv run agentic-browser-fleet open 0 https://news.ycombinator.com
uv run agentic-browser-fleet state 0
uv run agentic-browser-fleet input 0 3 "browser automation"
uv run agentic-browser-fleet keys 0 "Enter"
uv run agentic-browser-fleet state 0

# Two browsers, independently (no queueing -- different ids).
uv run agentic-browser-fleet open 0 https://site-a.com
uv run agentic-browser-fleet open 1 https://site-b.com

# Hit a CAPTCHA -- hand it to the user, then STOP.
uv run agentic-browser-fleet handoff 0 "solve the CAPTCHA on the sign-in page"
# -> tell the user what to do, end your turn; you resume first when they hand back.
uv run agentic-browser-fleet state 0 # (on resume) confirm the challenge cleared

# Human took over, then said "keep going" -- and ONLY then:
uv run agentic-browser-fleet acquire 1 --reclaim
uv run agentic-browser-fleet state 1 # re-state after regaining control
```

## Fallback only: `task <id> "<goal>"`

When a page is genuinely beyond step-by-step control -- a `<canvas>` app, a drag-heavy visual editor, a flow where `state` shows nothing useful even with a screenshot -- you can hand the whole goal to an autonomous browser-use agent instead of driving it yourself:

```bash
uv run agentic-browser-fleet task 0 "log into example.com and download last month's invoice"
```

This streams the agent's `[thinking]`/`[action]` trace into your output and ends with a `done:` line you relay. It **uses an LLM and needs an API key**, and it takes the wheel away from your direct control for its duration. Flags: `--reclaim` (resume a human-held browser, same rules as above), `--no-wait` (fail fast instead of queueing behind another agent), `--max-wait S` (bound the queue wait, then exit `4`), `--no-pane` (don't pull it into a UI pane). **Prefer driving it yourself** -- reach for `task` only when direct control truly can't see or manipulate the page.

## Don'ts

- Don't `click <index>` without a fresh `state` first -- indices go stale the moment the page changes.
- Don't "take control" -- that's a human-only UI action. You drive by issuing commands.
- Don't pass `--reclaim` unless the human explicitly told you to resume a browser they took over.
- Don't auto-retry on exit `2` (preempted). Stop and wait for the human.
- Don't tell the user to "look in the tab" for results -- your CLI output is the source of truth; the tab is just the live picture.
- Don't jump to `task` for ordinary pages. Drive them yourself.
38 changes: 38 additions & 0 deletions apps/system_interface/changelog/feat-agentic-browser-fleet.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
Integrated the agentic browser fleet into the workspace UI.

- The "+" menu now lists the currently-active browsers (from the browser
daemon's `GET /browsers`) alongside "New browser". Clicking an already-open
browser focuses its existing pane instead of opening a duplicate; "New browser"
starts one and opens its pane, surfacing a clear error if the fleet is full.

- The agent-driven layout system can address a specific browser as a pane
(`service:browser?session=<id>`), so an agent can pull the exact browser it is
working on into a split-pane next to its chat, and panes for different browsers
are treated as distinct (focus-if-open, no collisions).

- Updated the frontend to the browser daemon's new fleet endpoints (`/browsers`
and `/browsers/{id}/cast`) in place of the previous single-session routes.

- "New browser" is no longer gated on an Anthropic API key. Direct control is
keyless, so a browser can always be started; the old menu item that disabled
itself and showed a "Browser sessions need an Anthropic API key" dialog was a
leftover from the delegation model and has been removed.

- Dropped the placeholder "web" example server AND the "New URL" item from the
"+" menu -- "New browser" is the real web surface / replacement for opening an
ad-hoc URL. (The split-placement E2E that drove the old "New URL" item now
exercises the same placement path via "New terminal".) Also removed the per-tab
Refresh button from
browser panes -- reloading the pane only reconnects the live view, which read
as "restart the browser"; the browser viewer has its own in-page Reload button
for the actual page.

- "New browser" is now gated on the fleet being ready to start one. The "+" menu
reads `can_create` / `create_reason` from `GET /browsers`: while the fleet is
still starting up / restoring saved browsers, or is at its cap, the item is shown
disabled with the reason in parentheses (e.g. "New browser (browsers are still
starting up)" / "New browser (5/5 open -- close one first)"), and clicking it pops
a modal with the reason instead of firing a create that would fail. This prevents
a "New browser" click during startup from racing the fleet's restore (e.g. while
browser 0 or several saved browsers are still launching). Disabled "+" menu items
no longer run their action on click (previously they were only greyed out).
3 changes: 3 additions & 0 deletions apps/system_interface/changelog/feat-new-browser-cdp.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
The "+" tab menu's "New URL" item is replaced with "New browser", which opens a live streamed Chromium session (see the `browser` service) in a new tab.

The item is enabled only when an Anthropic API key is resolvable inside the compute (required by the browser-use agent); otherwise it is greyed out with a message explaining which workspace providers supply a key. The key status is re-checked each time the menu opens, so a key added after boot enables the item without a reload.
Loading