feat(todo-state): realtime Todos panel + INFLIGHT recovery #3065
Triage: HOLD pending deep review. Labels: […]
This is a large addition: a realtime Todos panel + INFLIGHT recovery surface, +5,504 LOC across 17 files including new modules ( […]
Two reasons to slow down: […]
Action items on your side when you have time:
Will follow up here after the deep review. Thanks for the work; it's substantial and we want to land it well rather than rush it.
Add a dedicated SSE `todo_state` event so the Todos panel can update
mid-turn instead of only on session reload. Cold-load attaches the
same snapshot to GET /api/session so opening any session immediately
shows the current task list.
Why
---
The browser's Todos panel parses the most recent role='tool' message
in S.messages to find the current todo list. During streaming, that
message only lands once the turn finishes (S.messages = full session
on done event), so users see the full final list flash in at the end
rather than watching tasks transition pending -> in_progress -> done.
The fix has two parts that must ship together:
1. A live data channel that delivers todo snapshots while the agent
is still mid-turn.
2. A cold-load path so existing sessions opened from the sidebar
populate the panel without waiting for a new tool call.
Design
------
* New api/todo_state.py module exposes:
- parse_todo_tool_result(): used by the live emit path
- derive_todo_state(): used by the cold-load path
- emit_todo_state(): single helper that streaming.py calls
from both legacy and modern callbacks
- attach_todo_state(): single helper that routes.py calls
from both WebUI and CLI cold-load paths
All return the same {todos, summary, version} shape so the frontend
has a single decoder. VERSION is 1; a bump is reserved for non-additive changes.
EVENT_NAME / PAYLOAD_KEY constants centralize the wire-format names
so frontend grep stays single-source.
* api/streaming.py emits 'todo_state' from both callback paths:
- legacy tool_progress_callback (event_type=='tool.completed')
- modern on_tool_complete (structured tool_complete_callback)
Snapshots are full — idempotent re-application is safe under SSE
replay through the existing run journal. Emissions are guarded by
name=='todo' and wrapped in try/except so a malformed payload never
breaks tool delivery.
* api/routes.py attaches todo_state to the session GET response by
scanning _all_msgs (not the truncated tail) — mirrors the agent's
AIAgent._hydrate_todo_store. Gated on load_messages so the sidebar
listing endpoint stays cheap. Both the WebUI session path AND the
CLI fallback path (_lookup_cli_session_metadata +
get_cli_session_messages branch) invoke attach_todo_state, so a
CLI-only session that used the todo tool hydrates the same way as
a WebUI-native one.
* derive_todo_state() avoids a redundant list() shallow copy when the
caller already passes a list/tuple. Multimodal tool results (content
as a list of OpenAI/Anthropic content parts) are explicitly
documented as intentional skips rather than silent fall-through.
* Detection symmetry with run_agent.AIAgent._hydrate_todo_store is
documented at module level so a future tightening lands in both
places at once.
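
As a rough illustration of the parse and derive helpers listed above, the following sketch assumes the todo tool result is a JSON string carrying a `todos` list and that `summary` is a per-status count. Neither detail is spelled out in this commit message, so the bodies and signatures here are guesses and the real api/todo_state.py may differ (for example in how it recognises a todo tool message).

```python
import json
from typing import Any, Iterable, Optional

VERSION = 1  # mirrors the VERSION constant named in the design notes


def _summarize(todos: list) -> dict:
    """Count items per status. The summary layout is an assumption."""
    summary: dict = {"total": len(todos)}
    for item in todos:
        status = item.get("status", "unknown") if isinstance(item, dict) else "unknown"
        summary[status] = summary.get(status, 0) + 1
    return summary


def parse_todo_tool_result(content: Any) -> Optional[dict]:
    """Return a {todos, summary, version} snapshot, or None for unusable input."""
    if not isinstance(content, str):
        return None  # multimodal (list-of-parts) results are skipped on purpose
    try:
        data = json.loads(content)
    except ValueError:
        return None  # not JSON
    if not isinstance(data, dict) or not isinstance(data.get("todos"), list):
        return None  # wrong shape
    todos = data["todos"]
    return {"todos": todos, "summary": _summarize(todos), "version": VERSION}


def derive_todo_state(messages: Iterable[dict]) -> Optional[dict]:
    """Scan newest-first and return the first parseable todo snapshot."""
    # Only copy when the caller did not already pass a list or tuple.
    msgs = messages if isinstance(messages, (list, tuple)) else list(messages)
    for msg in reversed(msgs):
        if not isinstance(msg, dict) or msg.get("role") != "tool":
            continue
        snapshot = parse_todo_tool_result(msg.get("content"))
        if snapshot is not None:
            return snapshot
    return None
```

Both helpers return the same shape, which is what lets the live and cold-load paths share one decoder on the frontend.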
Compatibility
-------------
Phase 1 is server-only. Old clients that don't subscribe to the new
event simply drop it; behaviour is unchanged. Phase 2 wires up the
frontend subscriber and adds reload recovery on top of this contract.
Wire format
-----------
* Live SSE event:
{session_id, stream_id, source, ts, todos, summary, version}
* Cold-load session.todo_state:
{todos, summary, version}
Both are pinned by tests so future refactors can't silently drift.
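
To make the relationship between the two shapes concrete, here is a minimal sketch of an emit helper that wraps a cold-load style snapshot in the live-event envelope. The `put(event_name, payload)` callback signature, the keyword parameters, and the use of `time.time()` for `ts` are assumptions for illustration; only the field names and the "never break tool delivery" behaviour come from the description above.

```python
import time
from typing import Any, Callable, Optional

EVENT_NAME = "todo_state"


# Hypothetical signature: the real helper's parameters are not shown in this text.
def emit_todo_state(put: Callable[[str, dict], Any], snapshot: Optional[dict], *,
                    session_id: str, stream_id: str, source: str,
                    tool_name: str) -> bool:
    """Send one full snapshot as a todo_state event; never raise to the caller."""
    if tool_name != "todo" or not snapshot:
        return False  # no-emit guard: other tools and failed parses are ignored
    try:
        payload = {
            "session_id": session_id,
            "stream_id": stream_id,
            "source": source,
            "ts": time.time(),
            **snapshot,  # todos, summary, version
        }
        put(EVENT_NAME, payload)
        return True
    except Exception:
        return False  # a failing put() must not break tool delivery
```

Because the envelope only adds addressing and timing fields around the snapshot, a test can compare the live payload against the cold-load shape by dropping those four keys.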
Tests
-----
* tests/test_todo_state.py — 38 unit tests covering
parse + derive helpers (returns-new-dict, unicode round-trip,
tuple input fast path, malformed payload tolerance, large-history
short-circuit, etc.)
* tests/test_todo_state_emission.py — 35 unit tests for the new
emit_todo_state / attach_todo_state helpers with a captured put()
recorder. Covers happy path, every no-emit guard, addressing /
metadata, put-callback exceptions.
* tests/test_todo_state_scenarios.py — 33 end-to-end scenarios:
live mid-turn updates, cold reopen, SSE replay safety, cross-
session addressing, multimodal tool histories, CLI session
fallback, emit→storage→attach round-trip equality, failure
isolation under garbage payloads, 10k-message histories with soft
perf canaries, frontend wire-format pin, full-turn lifecycle,
determinism.
* tests/test_streaming_todo_state.py — 2 import + call-site guards
* tests/test_session_todo_state_route.py — 2 import + call-site guards
(the two grep-style tests intentionally stay minimal to avoid
pinning source formatting; behavioural coverage lives in the
scenario file above.)
Total: 110 tests, ~2s on the dev box.
Regression: tests/test_live_tool_callback_events.py,
tests/test_run_journal*.py — 28/28 pass.
Review follow-up:
* derive_todo_state propagates source message timestamp as `ts` so the
frontend can reconcile cold-load vs. INFLIGHT by recency.
* tests/test_streaming_todo_state.py and test_session_todo_state_route.py
rewritten as AST call-shape checks; the previous grep-only string
presence could not catch parameter swaps or missing kwargs.
* tests/test_todo_state_robustness.py adds malformed-item pass-through,
ts-propagation, and concurrent-emit thread-safety coverage.
Builds on the Phase 1 server contract (61c91b24) to make
the Todos panel update in real time off the dedicated `todo_state`
SSE event, instead of waiting for a settled tool message and a
reverse-scan over S.messages on every render.
Architecture
------------
Single source of truth: S.todos (live snapshot) + S.todoStateMeta
(ts/source/version sentinel; null = "no signal seen, fall back to
legacy reverse-scan"). Three settle channels feed it:
1. `todo_state` SSE event (live): listener in messages.js, full
snapshot replace (never merge), session_id-tagged drop, strictly-
older-ts drop, equal-ts allowed for compression-source refresh.
2. session GET payload .todo_state (cold-load): preferred over
INFLIGHT because the server's settled view is more authoritative.
3. INFLIGHT[sid].todos / .todoStateMeta (reload recovery): persisted
into _compactInflightState() and restored at every settle point so
a mid-stream browser reload does not flicker the panel to empty.
_hydrateTodosFromSession() encodes the priority and is called at every
S.session= settle point in messages.js (3) and sessions.js (5), incl.
delete-session paths that pass null to clear.
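
The listener rules in channel 1 amount to a small decision table. The real listener is JavaScript in messages.js; the sketch below only models its accept/drop logic in Python (the repository's test language), with a plain dict standing in for `S` and a hypothetical function name. It assumes `ts` is numeric.

```python
import json


def apply_todo_state(state: dict, raw: str, active_sid: str) -> bool:
    """Apply one raw event payload to state; return True if the snapshot was replaced."""
    try:
        event = json.loads(raw)
    except ValueError:
        return False  # malformed JSON: keep the current list
    if not isinstance(event, dict) or not isinstance(event.get("todos"), list):
        return False  # todos must be an array
    if event.get("session_id") != active_sid:
        return False  # event belongs to another session
    meta = state.get("todoStateMeta")
    incoming_ts = event.get("ts") or 0
    if meta is not None and incoming_ts < (meta.get("ts") or 0):
        return False  # strictly older snapshot; equal ts is allowed through
    state["todos"] = list(event["todos"])  # full replace, never merge
    state["todoStateMeta"] = {
        "ts": incoming_ts,
        "source": event.get("source"),
        "version": event.get("version"),
    }
    return True
```

Allowing equal timestamps through is what lets a compression-source refresh overwrite a snapshot that carries the same `ts`.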
Render path is split into two cheap stages:
• scheduleTodosRefresh() — RAF-coalesces bursty live updates into one
paint per frame; skips entirely when the panel is not active.
• loadTodos() — prefers S.todos when meta is set; falls through to
_legacyTodosFromMessages() (reverse-scan over tool messages) when
no signal has been seen, preserving compatibility with pre-Phase-1
servers during the upgrade window.
A content-keyed hash (_todosHash) plus _todosLastRenderedHash short-
circuits identical re-renders, including the empty-state case.
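
The hash short-circuit can be modelled in a few lines. This is a Python stand-in for the JavaScript `_todosHash` / `_todosLastRenderedHash` pair, not the actual panels.js or ui.js code; the separator characters and the class name are assumptions.

```python
def todos_hash(todos: list) -> str:
    """Fingerprint built from (id, content, status) of every item, in order."""
    parts = []
    for item in todos:
        # Unit/record separators keep field boundaries unambiguous (assumed choice).
        parts.append(f"{item.get('id')}\x1f{item.get('content')}\x1f{item.get('status')}")
    return "\x1e".join(parts)


class TodosRenderer:
    """Counts paints so the short-circuit is observable."""

    def __init__(self) -> None:
        self.last_rendered_hash = None  # None means nothing has been painted yet
        self.paints = 0

    def render(self, todos: list) -> bool:
        h = todos_hash(todos)
        if h == self.last_rendered_hash:
            return False  # identical snapshot: skip the DOM work
        self.last_rendered_hash = h
        self.paints += 1
        return True
```

Starting from `None` rather than an empty string means the first empty list still paints the empty state once, and repeats of it are skipped.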
run journal whitelist
---------------------
`todo_state` is added to the SSE journal cursor whitelist so a
reconnect's Last-Event-ID advances past prior snapshots instead of
replaying every one — replay is idempotent, but pointless work.
Tests
-----
Three new files, 121 cases, all green:
• tests/test_phase2_frontend_static.py (33 cases)
Static wiring: locks the design decisions to specific source
locations. Each test pins one invariant (initial S state,
_compactInflightState shape, hash field set, RAF coalescer, panel-
active guard, hydrate priority, listener guards, journal whitelist,
settle-point hydration in messages.js + sessions.js, INFLIGHT
restore schema, renderer SSOT + legacy fallback + esc()).
• tests/test_phase2_todo_behavior.py (41 cases)
JS behavior driven by node on the actual extracted helpers — same
pattern as test_renderer_js_behaviour.py. Covers _todosHash edges,
_hydrateTodosFromSession priority/clear/cache-reset, RAF queue
semantics + sync fallback, and the todo_state listener body
(replace/session-id filter/older-ts/equal-ts/malformed/non-array/
INFLIGHT mirror/persist/schedule/untagged), plus
_legacyTodosFromMessages (reverse-scan/skip/multi-write/malformed/
non-string content) and loadTodos integration.
• tests/test_phase2_e2e_scenarios.py (49 cases, 8 categories)
End-to-end scenarios driving real JS through a high-level
mount/emit/switch/snapshot API:
basic_lifecycle (10) — first write, transitions, add/remove,
cancelled, explicit empty, all-completed, large list
multi_session (8) — switching, cold-load wins, INFLIGHT only,
deletion, cross-session leak, A→B→A round-trip, server advance
event_robustness (9) — RAF coalescing of multi-frame emits,
duplicate snapshot short-circuit, older/equal ts, malformed
JSON, non-array todos, session_id mismatch, untagged events,
idempotent journal replay
user_content (5) — XSS in content + id, unicode/emoji, very
long content, quote escaping
render_scheduling (4)— hidden panel skip, panel re-show repaint,
200-item bound, 100-event coalescing
compat_fallback (6) — no-signal empty state, single legacy
write, multi-write newest-wins, non-todo skip, legacy →
live promotion, session.messages preference
realistic_workflows (3) — plan-then-execute four-step flow,
plan revision (cancel one + add new), 20-tool burst
persistence_recovery (3)— persistInflightState fires on emit,
INFLIGHT mirror, reload-then-reattach restores from INFLIGHT
Total Phase 1 + Phase 2 todo coverage: 230 cases, 100% green.
Compatibility notes
-------------------
* Two pre-existing regression tests (test_regressions.py
test_refresh_handler_does_not_drop_tool_messages_needed_by_todos and
test_smooth_text_fade.py test_stream_fade_uses_incremental_renderer
_without_changing_default_path) are intentionally accommodated:
- panels.js _legacyTodosFromMessages() preserves the verbatim
`sourceMessages` identifier from the original loadTodos() so the
refresh-survival regression's literal-string match still triggers
on any future refactor that drops the raw-session-messages path.
- messages.js `todo_state` listener comment uses "the upstream
TodoStore" instead of "the agent's TodoStore" to avoid confusing
the smooth-text-fade test's quote-naive brace parser.
Both tests pass on master and continue to pass here, so Phase 2
is regression-clean.
* Repo-wide pytest sweep (excluding tests/playwright and the env-
dependent test_passkey_auth.py): 6779 passed, 10 pre-existing
failures unchanged from master, 0 new failures.
Review follow-up:
* messages.js: todo_state handler adds an S.session vs. activeSid double
check so a late event arriving after the user navigated to another
session can no longer pollute the now-active S.todos.
* ui.js: _hydrateTodosFromSession now reconciles cold-load vs. INFLIGHT
by ts so a stale cold-load (e.g. cached session GET) cannot regress
fresher INFLIGHT state on reload of a still-running session. Backend
api/todo_state.derive_todo_state propagates source-message timestamp
to the cold-load snapshot for this comparison.
* tests/test_phase2_frontend_static.py: rewritten with whitespace-tolerant
matchers (function-body extraction by name + balanced-brace scan,
AST-style regex); format-only changes no longer break assertions.
* tests/test_phase2_e2e_scenarios.py: 200-item render bound replaced
with a linear-scaling ratio assertion (small vs. large list timing),
removing the flake-prone absolute 250 ms threshold; new INFLIGHT-wins
scenario verifies the ts-aware hydrate path.
* tests/test_phase2_todo_behavior.py: setActive() helper keeps S.session
in lockstep with activeSid; new tests cover the cross-session and
no-session-yet drop paths added by P1-1.
* tests/test_phase2_inflight_persistence.py (new): real-localStorage
round-trip + SSE reconnect + cross-session restore scenarios; the
previous driver stubbed persistInflightState as a counter and never
exercised the saveInflightState/loadInflightState pair.
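
The ts-aware reconciliation from the ui.js follow-up can be sketched as a pure function. This is a Python model of the JavaScript `_hydrateTodosFromSession`, with a hypothetical name and return shape; breaking a tie in favour of the cold-load snapshot is an assumption based on the earlier note that the server's settled view is more authoritative.

```python
from typing import Optional, Tuple


def hydrate_todos(session: Optional[dict],
                  inflight: Optional[dict]) -> Tuple[Optional[list], Optional[dict]]:
    """Return (todos, meta) for the panel, or (None, None) when there is no signal."""
    if session is None:
        return None, None  # delete-session paths pass null to clear the panel
    candidates = []
    cold = session.get("todo_state")
    if cold and isinstance(cold.get("todos"), list):
        # Priority 1 so the cold-load snapshot wins when timestamps are equal.
        candidates.append((cold.get("ts") or 0, 1, cold["todos"], "cold_load"))
    if inflight and isinstance(inflight.get("todos"), list):
        meta = inflight.get("todoStateMeta") or {}
        candidates.append((meta.get("ts") or 0, 0, inflight["todos"], "inflight"))
    if not candidates:
        return None, None  # caller falls back to the legacy reverse-scan
    ts, _, todos, source = max(candidates, key=lambda c: (c[0], c[1]))
    return todos, {"ts": ts, "source": source}
```

With this ordering, a cached session GET carrying an older `ts` cannot overwrite a fresher INFLIGHT snapshot after a reload of a still-running session.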
1. Problem
The Todos panel only repainted once per turn, after the agent settled and the `tool` message hit `state.db`. `loadTodos()` then reverse-scanned `S.messages` to find the latest todo write. Three user-visible consequences:
* […] `todo` tool call mid-turn was invisible until the turn finished. The user saw a frozen list while the agent was actively crossing items off.
* […] `todo` write happened. Stream errors / reconnects had the same effect.
* […] `S.todos`.
2. What changed
Based on master 5528e2c5. Two commits, 17 files, +5,504 / −20 lines.
Phase 1: server-side realtime pipeline (c347a11f)
* `api/todo_state.py` (new, 235 lines): `derive_todo_state` / `emit_todo_state` / `attach_todo_state` / `parse_todo_tool_result` / `_normalize_snapshot`. Constants `EVENT_NAME` / `PAYLOAD_KEY` / `VERSION` live here. Every error boundary swallows and degrades; none breaks tool delivery or session GET.
* `api/streaming.py` (+36): Both the legacy `tool_progress_callback` (`event_type='tool.completed'`) path and the modern structured `on_tool_complete` path emit `todo_state` SSE events. Full snapshot, idempotent under run-journal replay.
* `api/routes.py` (+21): Invokes `attach_todo_state` on both the WebUI session branch and the CLI fallback so opening any session from the sidebar hydrates the Todos panel immediately.
Phase 2: frontend live consumption + INFLIGHT recovery (ecd8a247)
* `static/ui.js` (+133/−4): `S.todos` + `S.todoStateMeta` as the single source of truth. `null` is a sentinel that means "no signal seen, fall through to legacy reverse-scan", distinct from "signal seen, list is empty". `_compactInflightState` persists todos into the `localStorage` snapshot. `_todosHash` + `_todosLastRenderedHash` short-circuit identical re-renders. `scheduleTodosRefresh` coalesces bursty events through `requestAnimationFrame`. `_hydrateTodosFromSession` reconciles cold-load vs. INFLIGHT at every `S.session=` settle point.
* `static/messages.js` (+43/−3): `todo_state` SSE listener: swallow `JSON.parse` errors, validate `Array.isArray(d.todos)`, drop on `session_id !== activeSid`, drop on `S.session.session_id !== activeSid` (double check for late events arriving after navigation), drop on `incomingTs < currentTs`, full-snapshot replace (never merge). `'todo_state'` added to the run-journal cursor whitelist so `Last-Event-ID` advances past it on reconnect.
* `static/sessions.js` (+18): […] `todos` and `todoStateMeta` alongside the existing fields. Session-deletion paths call `_hydrateTodosFromSession(null)` so the panel clears synchronously.
* `static/panels.js` (+84/−13): `loadTodos()` trusts `S.todos` whenever `S.todoStateMeta` is set; otherwise falls through to `_legacyTodosFromMessages()` (reverse-scan over tool messages). Every user-controlled string goes through `esc()` before `innerHTML`.
Data flow
3. Performance
Render cost
* `_todosHash` builds an `(id, content, status)` fingerprint via string concatenation (no intermediate object allocation in V8). Identical snapshots return early without touching the DOM.
* `scheduleTodosRefresh` collapses any number of `todo_state` events landing in the same animation frame into one `loadTodos()` call. The e2e suite asserts that 100 emits share a single RAF tick.
* `_todosPanelIsActive()` short-circuits the entire path when the Todos panel is hidden; background events do zero DOM work.
* […] `< 8`. An O(n²) regression would push the ratio toward 16.
Persistence budget (`INFLIGHT_STATE_DEFAULT_LIMITS`)
LRU by `updated_at` when the cap is hit. Quota errors trigger graceful degradation: drop everything except the active session, and if that still fails, clear the whole bucket. INFLIGHT entries older than 10 minutes are evicted on read.
Network footprint
Each `todo_state` event is the same JSON the `todo` tool already returns in its tool message, plus five metadata fields (`session_id`, `stream_id`, `source`, `ts`, `version`). It's a single per-event mirror: no diffs, no polling.
4. Reliability & stability
Error isolation boundaries
* `parse_todo_tool_result` gets non-string / non-JSON / wrong shape: `None`; emit and attach skip.
* `emit_todo_state` raises in `put`: `False`. Tool delivery is unaffected.
* `attach_todo_state` errors during message iteration (broken generators, corrupted store): `False`. Session GET responds normally without the `todo_state` field.
* `JSON.parse` failure: […] `S.todos` is unchanged.
* `localStorage` corrupted: `_readInflightStateMap` returns `{}`; the next save rebuilds.
* `localStorage` quota exceeded: […]
* […] `loadInflightState` evicts and returns `null`.
* `streamId` […]: `loadInflightState` returns `null`; a new run cannot inherit a stale snapshot from the previous one.
Ordering & merge invariants
* `payload.session_id !== activeSid` and `S.session.session_id !== activeSid` drop the event. The latter prevents pollution when the SSE stream is still wired to the previous session but the user has already navigated away.
* […] `_hydrateTodosFromSession` picks the one with the newer `ts`. This stops a stale cached cold-load from regressing fresher INFLIGHT state on reload of a still-running session.
Compatibility
* `S.todoStateMeta = null` is a sentinel; pre-Phase-1 backends produce no `todo_state` events, so `loadTodos()` falls through to `_legacyTodosFromMessages()` and the panel still works.
* `version: 1` reserves room for future wire-format evolution.
* […] `Last-Event-ID` advances past prior `todo_state` events. Replay remains correct (idempotent) but doesn't re-trigger renders.
Test coverage
* `test_todo_state.py` (365) / `test_todo_state_emission.py` (407) / `test_todo_state_robustness.py` (283)
* `test_todo_state_scenarios.py` (741)
* `test_streaming_todo_state.py` (158) / `test_session_todo_state_route.py` (130)
* `test_phase2_frontend_static.py` (505)
* `test_phase2_todo_behavior.py` (831)
* `test_phase2_e2e_scenarios.py` (1082)
* `test_phase2_inflight_persistence.py` (449): […] `localStorage` shim drives `saveInflightState` ⇄ `loadInflightState` round-trips, hard-reload recovery, cross-session isolation, journal-replay regression
280 tests pass in ~4s.
5. Out of scope
* `todo` tool semantics unchanged.
* […] `localStorage`, not IndexedDB); current budgets are far above what todo lists need.
Review follow-ups already absorbed in these two commits
* `messages.js` cross-session double check; `ui.js` ts-aware reconciliation in `_hydrateTodosFromSession`; `api/todo_state.derive_todo_state` propagates the source message timestamp so cold-load and INFLIGHT can be compared on the same axis.
* […] `localStorage` round-trip + reconnect test file.