Skip to content

fix(cockpit): auto-spawn ACP workers for sessions added while serve runs#953

Open
Seluj78 wants to merge 3 commits intonjbrake:nativefrom
Seluj78:fix/cockpit-reconcile-spawn
Open

fix(cockpit): auto-spawn ACP workers for sessions added while serve runs#953
Seluj78 wants to merge 3 commits intonjbrake:nativefrom
Seluj78:fix/cockpit-reconcile-spawn

Conversation

@Seluj78
Copy link
Copy Markdown

@Seluj78 Seluj78 commented May 7, 2026

Description

Stacks on top of #868 (the native branch). Fixes the "session started but nothing in the webui" report from #868's review thread, plus the surrounding silent-failure UX cliffs that surface from the same root cause.

1. Daemon-side reconciler — d8c21bf

Bug. Cockpit-mode sessions added via aoe add . --cmd <tool> --cockpit while aoe serve was already running never got an ACP worker. The dashboard showed the session (the 2s status poller picks it up from disk) but the agent stayed silent because no code path called cockpit_supervisor.spawn(...) for it. aoe session start <name> looked like it worked but is a no-op for cockpit sessions: Instance::start_with_size_opts returns early on cockpit_mode == true.

Worker spawning previously fired only at three entry points:

  • serve-startup auto-scan (src/server/mod.rs)
  • POST /api/sessions (web wizard create)
  • POST /api/cockpit/sessions/:id/enable (substrate switch)

Nothing watched for cockpit sessions appearing in the on-disk list after startup.

Fix. Move auto-spawn out of the one-shot startup block and into the existing status_poll_loop. Each tick reconciles cockpit-mode sessions on disk against the supervisor's worker map and spawns any that are missing.

  • attempted_cockpit_spawns: HashSet<String> guards against retry storms when a spawn permanently fails (e.g. claude-agent-acp not installed).
  • The set is pruned to the live id set each tick so a delete + recreate of the same id can spawn again.
  • Cold-startup latency is unchanged: tokio::time::interval's first tick fires immediately, so the reconciler runs at the same point the old startup loop did.

2. CLI bail for cockpit lifecycle commands — ca4c112

aoe session start <name> (and stop/restart/attach) on a cockpit-mode session previously printed ✓ Started session: ... while doing nothing, because Instance::start_with_size_opts early-returns on cockpit_mode. The CLI now refuses these commands with a message pointing at aoe serve + the dashboard. aoe session restart --all skips cockpit sessions instead of erroring on each one. cockpit_mode is gated on the serve feature, so TUI-only builds compile to a no-op shim.

3. aoe add --cockpit next-steps redirect — ed0c90f

The post-add summary printed the same aoe session start <name> / aoe (TUI attach) hint regardless of substrate, sending cockpit users straight into the lifecycle commands that now bail. The cockpit branch now points at aoe serve + the dashboard. For --launch, surface that the flag is a no-op so the user understands why the run didn't open a terminal.

4. aoe add --cockpit precondition check — 1bf3eb6

Persisting a --cockpit session whose ACP adapter binary isn't on $PATH used to fail silently end-to-end: aoe add returned 0, the dashboard listed the session, the supervisor's reconciler tried to spawn the worker, AcpClient::spawn errored with "No such file or directory", and the user only learned something was wrong when the first prompt POST returned 404 ("session has no running cockpit").

Verify the resolved adapter binary is on $PATH at add-time and bail loudly with the install hint, the bundled-fallback (--agent aoe-agent) suggestion, and the tmux-passthrough escape hatch (--no-cockpit). Only sessions the user explicitly opted into cockpit for hit this check; the implicit default-for-claude branch keeps falling back to tmux, so users without cockpit tooling on $PATH aren't suddenly blocked.

Drive-by fix in command_present (src/cli/cockpit.rs): the placeholder branch was shadowed by the /-branch, so any agent whose command embeds a ${aoe_data_dir}/... placeholder (notably aoe-agent, our bundled fallback) was reported as missing in aoe cockpit doctor and would have failed this new precondition. Reorder the checks so placeholders resolve at runtime as documented.

PR Type

  • Bug Fix

Checklist

  • I understand the code I am submitting
  • New and existing tests pass
    • cargo test --features serve --lib server::tests 16/16
    • cargo test --features serve --lib cockpit 39/39
    • cargo build --features serve --profile dev-release, cargo build
    • cargo clippy --features serve -- -D warnings, cargo clippy -- -D warnings
    • cargo fmt --check
  • Documentation was updated where necessary (no user-facing surface changes; aoe cockpit doctor output is now correct for placeholder-based agents)
  • For UI changes: included screenshot or recording (n/a — CLI + daemon-internal)

AI Usage

  • AI was used for drafting/refactoring

AI Model/Tool used: Claude Opus 4.7 via Claude Code

Any Additional AI Details you'd like to share:
The original investigation comment on #868 was authored by me with Claude's help; this PR implements option #1 from that comment (commit 1) and option #3 from the same comment, expanded to cover the surrounding silent-failure UX cliffs (commits 2-4). I read every change and verified the build/clippy/fmt/unit-tests locally.

  • I am an AI Agent filling out this form (check box if true)

Test plan

  • cargo build --features serve --profile dev-release
  • Two-terminal repro from the investigation comment (validates commit 1):
    Term A: aoe serve
    Term B: aoe add . --cmd claude --cockpit  # adapter installed
    
    Confirm the cockpit conversation comes online within ~2s without an aoe serve restart.
  • Without claude-agent-acp on $PATH (validates commit 4):
    aoe add . --cmd claude --cockpit
    
    Confirm it fails fast with the install hint instead of persisting a broken session.
  • On a cockpit session, run aoe session start <name> and aoe session restart <name> (validates commit 2). Confirm both bail with the "managed by aoe serve" message.
  • aoe session restart --all in a profile mixing cockpit and tmux sessions (validates commit 2). Confirm the cockpit ones are skipped silently and tmux ones restart.
  • aoe add . --cmd claude --cockpit (validates commit 3). Confirm the next-steps block points at aoe serve + the dashboard, not aoe session start.
  • aoe cockpit doctor (validates the drive-by). Confirm aoe-agent is reported as [OK].

Out of scope

A separate (pre-existing) sidebar bug surfaced while testing this PR: when multiple sessions share the same (project_path, branch) key, useWorkspaces collapses them into one workspace and the sidebar renders only workspace.sessions[0]. Reproduces with tmux-only sessions too. I'll open a separate issue.

🤖 Generated with Claude Code

Cockpit-mode sessions added via `aoe add . --cmd <tool> --cockpit` while
`aoe serve` was already running never got an ACP worker. The dashboard
showed the session (the 2s status poller picks it up from disk) but the
agent was silent because no code path called `cockpit_supervisor.spawn(...)`
for it. `aoe session start <name>` looked like it worked but is a no-op for
cockpit sessions: `Instance::start_with_size_opts` returns early on
`cockpit_mode == true`.

Worker spawning previously fired only at three entry points: serve-
startup auto-scan, `POST /api/sessions` (web wizard create), and
`POST /api/cockpit/sessions/:id/enable` (substrate switch). Nothing
watched for cockpit sessions appearing in the on-disk list after
startup.

Move the auto-spawn out of the one-shot startup block and into the
existing `status_poll_loop`. Each tick reconciles cockpit-mode sessions
on disk against the supervisor's worker map and spawns any that are
missing. An `attempted_cockpit_spawns` HashSet guards against retry
storms when a spawn permanently fails (e.g. `claude-agent-acp` not
installed) and is pruned to the live id set each tick so a delete +
recreate of the same id can spawn again. Cold startup latency is
unchanged: tokio interval's first tick fires immediately, so the
reconciler runs at the same point the old startup loop did.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Seluj78 Seluj78 requested a review from njbrake as a code owner May 7, 2026 16:50
…'ing

`aoe session start <name>` (and stop/restart/attach) on a cockpit-mode
session previously printed `✓ Started session: ...` while doing nothing:
`Instance::start_with_size_opts` returns early on `cockpit_mode == true`
because cockpit sessions aren't backed by tmux. The CLI never checked
the flag, so the success line was misleading and the agent stayed silent
until the user noticed (or the daemon-side reconciler kicked in).

Bail loudly from the CLI with the actual remediation: cockpit lifecycle
is owned by `aoe serve`'s supervisor; use the dashboard or REST API to
control individual sessions, or restart serve to force a re-spawn.

`aoe session restart --all` skips cockpit sessions instead of erroring
on each one, so the batch command stays usable in mixed-substrate
profiles.

Both `cockpit_mode` field and the `bail_if_cockpit` helper are gated on
the `serve` feature; TUI-only builds compile to a no-op shim and behave
exactly as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Seluj78
Copy link
Copy Markdown
Author

Seluj78 commented May 7, 2026

Pushed a follow-up commit (ca4c11) that adds option #3 from the investigation comment: aoe session start|stop|restart|attach on a cockpit session now bails with a clear "managed by aoe serve" message instead of silently no-op'ing. aoe session restart --all skips cockpit sessions so the batch command stays usable in mixed-substrate profiles. The cockpit_mode field and bail helper are both feature-gated on serve; TUI-only builds compile to a no-op shim.

Two commits total now — happy to split into a separate PR if you'd rather review them independently.

🤖 Generated with Claude Code

The post-add summary printed the same `aoe session start <name>` /
`aoe` (TUI attach) hint regardless of substrate, which sent cockpit
users straight into the lifecycle commands that now bail
(`fix(cli): refuse cockpit lifecycle commands ...`). Tmux next-steps
were also flat-out wrong for a cockpit session that has no pane to
attach to.

For cockpit sessions, point at `aoe serve` + the dashboard. For
`--launch`, surface that the flag is a no-op so the user understands
why the run didn't open a terminal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Seluj78 Seluj78 marked this pull request as draft May 7, 2026 17:27
@Seluj78
Copy link
Copy Markdown
Author

Seluj78 commented May 7, 2026

Converted back to draft because I'm still fixing bugs :D But I'm having loads of fun

Comment thread src/cli/session.rs
Comment on lines +200 to +212
#[cfg(feature = "serve")]
fn bail_if_cockpit(inst: &crate::session::Instance, verb: &str) -> Result<()> {
if inst.cockpit_mode {
bail!(
"cockpit sessions are managed by `aoe serve`; \
cannot `aoe session {verb}` from the CLI.\n\
The ACP worker is auto-spawned within ~2s of `aoe add --cockpit` \
while serve is running, or on next `aoe serve` startup.\n\
To control a cockpit session, use the web dashboard or the REST API."
);
}
Ok(())
}
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is quite opinionated, and the "easy" way to do things. I went this way because the fix, according to claude, to do it properly was at least 200 more lines. Should I do a follow up issue for this?

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this means that cockpit would be a web dashboard only feature to start? I think in order to keep the PR tractable I'm ok with that as long as in the short term we have some sort of icon in the TUI to tell someone that a session exists but must be loaded via the web dash

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for me, or I can do the long fix that reconciles correctly the cockpit sessions to the TUI. You're the boss boss

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just tell me what you want me to do)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol. I think to make this less risky lets just say that cockpit sessions are only started outside the TUI, and that in the TUI it should track that the session exists, but it should mark it as 'cockpit session' and not let you attach to it. Then we'll worry about that piece later

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you want to do any more cleanup now that we're working under this assumption, or if you want me to merge this into my cockpit pr branch as it is now 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants