Fix stop button leaving agents stuck on "Delegating to sub-agent…" #157

Draft
gnguralnick wants to merge 5 commits into main from gabriel/interrupt-subagent

@gnguralnick

Clicking the chat stop button while a sub-agent was running left the agent in a permanently "working" state: the activity indicator kept reading "Delegating to sub-agent…" (with the stop button stuck visible) even after new messages, and the delegation's card could read "Running…" forever. Root-caused against a live agent's transcript: the stop endpoint restarts Claude mid-delegation, the abandoned Agent tool_use never receives a tool_result (not even on a later resume), and nothing downstream ever retires it.

Activity indicator (activity_state.py / agent_manager.py)

has_unmatched_tool_use scans the whole transcript, so one dangling tool_use pinned the derived state at TOOL_RUNNING for the rest of the transcript's life. The existing guards each cover only a window: the interrupt endpoint's explicit reset lasts until the next watcher recompute, and the stale-tail guard (#146) stops applying the moment a new message lands — which is exactly when the dangling call re-pins the state.

The tracker now also caches the timestamp of the newest unmatched tool_use and fences it against the claude_process_started marker (same positive-evidence principle as the stale-tail guard): a tool call issued before the current Claude process started cannot still be running. The newest-pending semantics keep in-flight turns correct — a fresh turn's running tool still derives TOOL_RUNNING; once it resolves, only the abandoned call remains and the fence retires it. Missing marker or timestamp leaves behavior unchanged.
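The fence described above can be sketched as a small pure function. This is a minimal illustration only: the function name `tool_still_running` and its parameters are hypothetical, not the names used in activity_state.py, and the timestamps are made up.

```python
from datetime import datetime, timezone
from typing import Optional


def tool_still_running(
    has_unmatched: bool,
    newest_unmatched_at: Optional[datetime],
    process_started_at: Optional[datetime],
) -> bool:
    """Decide whether an unmatched tool_use should still count as running.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    if not has_unmatched:
        return False
    # Missing marker or timestamp: no positive evidence, so behavior is unchanged.
    if newest_unmatched_at is None or process_started_at is None:
        return True
    # A call issued before the current Claude process started cannot still be running.
    return newest_unmatched_at >= process_started_at


# Illustrative timestamps (not taken from a real transcript).
started = datetime(2026, 6, 11, 12, 0, tzinfo=timezone.utc)
abandoned = datetime(2026, 6, 11, 11, 55, tzinfo=timezone.utc)
fresh = datetime(2026, 6, 11, 12, 5, tzinfo=timezone.utc)

print(tool_still_running(True, abandoned, started))  # abandoned call is retired
print(tool_still_running(True, fresh, started))      # in-flight call still counts
```

Because only the newest unmatched timestamp is compared, a fresh in-flight call keeps the state at TOOL_RUNNING even while an older abandoned call is still present in the transcript.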

Sub-agent card (session_watcher.py + frontend)

A delegation killed before its session linkage ever landed (no meta.json toolUseId discovered, and the closing tool_result will never arrive) showed a "Running…" card forever, and the watcher retried re-linking its parent on every poll cycle, indefinitely. The enrichment pass now marks such never-linked dead calls is_interrupted using the same marker fence, the re-broadcast pass re-emits the parent once so the card flips live (terminal, so the retry loop also ends), and the frontend renders "Stopped". A card whose tool_result exists but never linked to a transcript (denied, errored, or cleaned-up session) now also shows "Stopped"/"Finished" by error state instead of the previous permanent "Running…". Linked delegations are untouched — "View conversation" is already a terminal state.
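The card states above amount to a small decision table. The sketch below writes it in Python for consistency with the backend, although the real rendering lives in the frontend. The `Delegation` type, its field names, and `card_label` are hypothetical, and the mapping of an errored result to "Stopped" and a clean one to "Finished" is an assumption about what "by error state" means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delegation:
    """Hypothetical view of one sub-agent card; field names are assumptions."""

    linked: bool          # a sub-agent transcript was linked to this call
    has_result: bool      # a closing tool_result exists
    is_error: bool        # the tool_result reported an error
    is_interrupted: bool  # set by the enrichment pass for never-linked dead calls


def card_label(d: Delegation) -> str:
    # Linked delegations are untouched: "View conversation" is already terminal.
    if d.linked:
        return "View conversation"
    # A result without a linked transcript is terminal; the label follows the error state.
    if d.has_result:
        return "Stopped" if d.is_error else "Finished"
    # No result and no linkage, but fenced off by the process-started marker.
    if d.is_interrupted:
        return "Stopped"
    return "Running…"


print(card_label(Delegation(linked=False, has_result=False, is_error=False, is_interrupted=True)))
```

Every branch except the last returns a terminal label, which is why the watcher can stop retrying the parent once `is_interrupted` is set.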

Both fixes are retroactive for already-stuck agents: the marker is already newer than the dead calls, so the next recompute/enrichment clears them.

Testing

  • Root-caused against a live agent before coding: its session JSONL has two Agent tool_uses from stop clicks with no matching tool_result anywhere, no interrupt sentinel, and a normal turn following — exactly the shape the new regression tests encode.
  • system_interface: uv run pytest — 505 passed, 10 skipped, coverage 86.06%; ratchets pass. Two failures are pre-existing on the base commit and environmental to the local machine (mngr's is_allowed_in_pytest opt-in, and the e2e page-title test needing the built frontend), verified by re-running both on a clean tree.
  • Frontend: npm run test, 145 passed; npm run lint (tsc + eslint) clean.

Gabriel Guralnick added 5 commits June 11, 2026 17:26
Clicking stop during a sub-agent run kills Claude mid-delegation: the
Agent tool_use never receives a tool_result (not even on resume), so
has_unmatched_tool_use stayed true for the rest of the transcript's
life. The stale-tail guard only covers the window before the next
message; after that the activity state re-pinned at TOOL_RUNNING and
the indicator read 'Delegating to sub-agent' forever, with the stop
button stuck visible.

Backend: track the timestamp of the newest unmatched tool_use and fence
it against the claude_process_started marker (same principle as the
stale-tail guard) -- a tool call issued before the current Claude
process cannot still be running.

The sub-agent card had the sibling problem: a delegation stopped before
its linkage ever landed showed 'Running...' forever, and its parent
was retried for re-linking on every poll cycle. The watcher now marks
such never-linked dead delegations is_interrupted, re-broadcasts the
parent once (terminal, so the retry stops), and the card renders
'Stopped' -- or, when a tool_result exists without a linked transcript,
'Stopped'/'Finished' by error state -- instead of 'Running...'.

The committed lock predated the vendor/mngr sync's version bumps
(imbue-common/concurrency-group 0.1.19, imbue-mngr/mngr-claude 0.2.12,
resource-guards 0.1.8); any uv run regenerates it to match the
worktree's pyprojects.

Architecture-review cleanups: extract read_process_started_at /
CLAUDE_PROCESS_STARTED_MARKER into activity_state.py (replacing the
duplicated private readers in agent_manager and session_watcher),
rename mergeLateSubagentMetadata to mergeLateSubagentState to match its
widened role, and simplify the update_session_events short-circuit to a
tuple comparison.

Problem: newest_unmatched_tool_use_timestamp rebuilt the matched
tool_result id set that has_unmatched_tool_use also builds, so the
matching rule was maintained twice in the same file.
Fix: extract a shared pure matched_tool_result_ids helper and use it
from both scanners; semantics are unchanged (has_unmatched_tool_use
additionally gains an early return).
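A minimal sketch of the shared-helper shape this last commit describes. The function names come from the commit message, but the bodies, the signatures, and the flat event dicts (`type`, `id`, `tool_use_id`, numeric `timestamp`) are simplifying assumptions, not the real transcript format.

```python
from typing import Optional


def matched_tool_result_ids(events: list[dict]) -> set[str]:
    """Ids of tool_use calls that have a tool_result anywhere in the transcript."""
    return {e["tool_use_id"] for e in events if e.get("type") == "tool_result"}


def has_unmatched_tool_use(events: list[dict]) -> bool:
    matched = matched_tool_result_ids(events)
    for e in events:
        if e.get("type") == "tool_use" and e["id"] not in matched:
            return True  # early return: one dangling call is enough
    return False


def newest_unmatched_tool_use_timestamp(events: list[dict]) -> Optional[float]:
    matched = matched_tool_result_ids(events)
    stamps = [
        e["timestamp"]
        for e in events
        if e.get("type") == "tool_use"
        and e["id"] not in matched
        and e.get("timestamp") is not None
    ]
    return max(stamps, default=None)


# Hypothetical transcript: call "a" was abandoned, call "b" completed.
events = [
    {"type": "tool_use", "id": "a", "timestamp": 100.0},
    {"type": "tool_use", "id": "b", "timestamp": 200.0},
    {"type": "tool_result", "tool_use_id": "b"},
]
print(has_unmatched_tool_use(events))               # True
print(newest_unmatched_tool_use_timestamp(events))  # 100.0
```

Both scanners take their matching rule from the one helper, so a change to how results are matched only has to be made in one place.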