feat(todo-state): realtime Todos panel + INFLIGHT recovery by v2psv · Pull Request #3065 · nesquena/hermes-webui

v2psv · 2026-05-28T10:36:33Z

1. Problem

The Todos panel only repainted once per turn, after the agent settled and the tool message hit state.db. loadTodos() then reverse-scanned S.messages to find the latest todo write. Three user-visible consequences:

Not realtime. A todo tool call mid-turn was invisible until the turn finished. The user saw a frozen list while the agent was actively crossing items off.
State lost on reload / reconnect. Refreshing during a running stream wiped the panel until the next todo write happened. Stream errors / reconnects had the same effect.
Occasional cross-session pollution. A late SSE event tagged for session A could land after the user had already navigated to session B and overwrite S.todos.

2. What changed

ecd8a247 feat(todo-state): frontend live updates + INFLIGHT recovery (Phase 2)
c347a11f feat(todo-state): server-side realtime todo state (Phase 1)

Based on master 5528e2c5. Two commits, 17 files, +5,504 / −20 lines.

Production code: 7 files / +573 / −20.
Tests: 10 files / +4,931 (8 new files).

Phase 1 — server-side realtime pipeline (`c347a11f`)

File	Change
`api/todo_state.py` (new, 235 lines)	Single source of truth for the wire protocol: `derive_todo_state` / `emit_todo_state` / `attach_todo_state` / `parse_todo_tool_result` / `_normalize_snapshot`. Constants `EVENT_NAME` / `PAYLOAD_KEY` / `VERSION` live here. Every error boundary swallows-and-degrades — never breaks tool delivery or session GET.
`api/streaming.py` (+36)	Both the legacy `tool_progress_callback` (`event_type='tool.completed'`) path and the modern structured `on_tool_complete` path emit `todo_state` SSE events. Full snapshot, idempotent under run-journal replay.
`api/routes.py` (+21)	The session GET handler calls `attach_todo_state` on both the WebUI session branch and the CLI fallback so opening any session from the sidebar hydrates the Todos panel immediately.
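
To make the parse and derive contracts above concrete, here is a minimal sketch. It is not the code from `api/todo_state.py`. The message fields (`role`, `name`, `content`, `timestamp`) and the assumption that the todo tool result is a JSON object with a `todos` list are illustrative guesses, not confirmed by this PR text.

```python
import json
from typing import Any, Iterable, Optional

VERSION = 1  # mirrors the VERSION constant named in the PR


def parse_todo_tool_result(content: Any) -> Optional[dict]:
    """Return a {todos, summary, version} snapshot, or None for any bad input."""
    if not isinstance(content, str):
        return None  # e.g. multimodal content parts are skipped on purpose
    try:
        data = json.loads(content)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return None
    if not isinstance(data, dict) or not isinstance(data.get("todos"), list):
        return None
    return {"todos": data["todos"], "summary": data.get("summary"), "version": VERSION}


def derive_todo_state(messages: Iterable[Any]) -> Optional[dict]:
    """Reverse-scan a history and return the newest parseable todo snapshot."""
    # Avoid a redundant copy when the caller already passed a list or tuple.
    msgs = messages if isinstance(messages, (list, tuple)) else list(messages)
    for msg in reversed(msgs):
        if not isinstance(msg, dict):
            continue
        # Hypothetical detection rule: a tool message produced by the todo tool.
        if msg.get("role") != "tool" or msg.get("name") != "todo":
            continue
        snapshot = parse_todo_tool_result(msg.get("content"))
        if snapshot is None:
            continue  # unparseable last write: fall back to an earlier one
        if "timestamp" in msg:
            snapshot["ts"] = msg["timestamp"]  # lets the client compare recency
        return snapshot
    return None
```

Returning `None` rather than raising is what lets both the emit path and the session GET path skip silently when a payload is malformed.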

Phase 2 — frontend live consumption + INFLIGHT recovery (`ecd8a247`)

File	Change
`static/ui.js` (+133/−4)	`S.todos` + `S.todoStateMeta` as the single source of truth. `null` is a sentinel that means "no signal seen — fall through to legacy reverse-scan", distinct from "signal seen, list is empty". `_compactInflightState` persists todos into the `localStorage` snapshot. `_todosHash` + `_todosLastRenderedHash` short-circuit identical re-renders. `scheduleTodosRefresh` coalesces bursty events through `requestAnimationFrame`. `_hydrateTodosFromSession` reconciles cold-load vs. INFLIGHT at every `S.session=` settle point.
`static/messages.js` (+43/−3)	`todo_state` SSE listener: swallow `JSON.parse` errors, validate `Array.isArray(d.todos)`, drop on `session_id !== activeSid`, drop on `S.session.session_id !== activeSid` (double check — late events arriving after navigation), drop on `incomingTs < currentTs`, full-snapshot replace (never merge). `'todo_state'` added to the run-journal cursor whitelist so `Last-Event-ID` advances past it on reconnect.
`static/sessions.js` (+18)	INFLIGHT restore on tab return / hard reload carries `todos` and `todoStateMeta` alongside the existing fields. Session-deletion paths call `_hydrateTodosFromSession(null)` so the panel clears synchronously.
`static/panels.js` (+84/−13)	`loadTodos()` trusts `S.todos` whenever `S.todoStateMeta` is set; otherwise falls through to `_legacyTodosFromMessages()` (reverse-scan over tool messages). Every user-controlled string goes through `esc()` before `innerHTML`.

Data flow

agent.todo()
   │
   ▼
streaming.py  ──emit_todo_state──▶  SSE: 'todo_state' (full snapshot)
   │                                          │
   ▼                                          ▼
state.db (tool message)              messages.js listener
   │                                          │
   ▼                                          ▼
routes.py session GET                S.todos / S.todoStateMeta
   │   ──attach_todo_state──▶ payload.todo_state    │
   ▼                                                 │
ui.js  _hydrateTodosFromSession  ◀── reconcile ─────┤
   │       (cold-load vs INFLIGHT, by ts)            │
   ▼                                                 │
panels.js loadTodos ──▶ DOM         localStorage ◀───┘

3. Performance

Render cost

Hash short-circuit. _todosHash builds an (id, content, status) fingerprint by plain string concatenation, without building intermediate arrays or objects. Identical snapshots return early without touching the DOM.
RAF coalescing. scheduleTodosRefresh collapses any number of todo_state events landing in the same animation frame into one loadTodos() call. The e2e suite asserts that 100 emits share a single RAF tick.
Inactive-panel skip. _todosPanelIsActive() short-circuits the entire path when the Todos panel is hidden — background events do zero DOM work.
Linear-scaling regression guard. Instead of a fragile absolute timeout, the test compares 50-item vs. 200-item render time and asserts the ratio stays < 8. The larger list has 4x the items, so a linear renderer yields a ratio near 4, while an O(n²) regression would push the ratio toward 16.
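
The threshold in the regression guard can be checked with a few lines of arithmetic. The helper names below are hypothetical and only illustrate the reasoning; the actual test measures real render timings through the JS driver.

```python
def expected_ratio(small: int, large: int, exponent: float) -> float:
    """Idealized time ratio for an O(n**exponent) renderer."""
    return (large / small) ** exponent


def scales_acceptably(t_small: float, t_large: float, max_ratio: float = 8.0) -> bool:
    """Ratio-based check: compares two measured timings instead of an absolute bound."""
    return t_large / t_small < max_ratio


linear = expected_ratio(50, 200, 1)      # 4.0
quadratic = expected_ratio(50, 200, 2)   # 16.0

# A threshold of 8 leaves 2x headroom over linear and still catches quadratic growth.
assert linear < 8 < quadratic
```

Because only the ratio matters, the check is largely insensitive to how fast the machine running the suite is.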

Persistence budget (`INFLIGHT_STATE_DEFAULT_LIMITS`)

maxSessions: 8
messages:    24
toolCalls:   48
stringChars: 60,000   per string field
jsonChars:   1,500,000 per full snapshot

When the session cap is hit, entries are evicted LRU by updated_at. Quota errors trigger graceful degradation: first drop everything except the active session, and if the write still fails, clear the whole bucket. INFLIGHT entries older than 10 minutes are evicted on read.
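
The eviction and degradation rules can be modeled in a few lines. The real logic lives in the frontend JavaScript and writes to `localStorage`; this Python model is only an illustration. The function names, the use of seconds for `updated_at`, and `OSError` as a stand-in for the browser's quota error are all assumptions.

```python
MAX_SESSIONS = 8            # maxSessions from the budget above
MAX_AGE_SECONDS = 10 * 60   # entries older than 10 minutes are stale


def evict_lru(entries: dict, max_sessions: int = MAX_SESSIONS) -> dict:
    """Keep only the most recently updated sessions."""
    newest_first = sorted(entries.items(), key=lambda kv: kv[1]["updated_at"], reverse=True)
    return dict(newest_first[:max_sessions])


def load_entry(entries: dict, sid: str, now: float):
    """Return the entry for sid, evicting it on read if it is stale."""
    entry = entries.get(sid)
    if entry is None:
        return None
    if now - entry["updated_at"] > MAX_AGE_SECONDS:
        del entries[sid]
        return None
    return entry


def save_with_degradation(write, entries: dict, active_sid: str) -> dict:
    """Try the full map, then the active session only, then an empty bucket."""
    active_only = {k: v for k, v in entries.items() if k == active_sid}
    for candidate in (entries, active_only):
        try:
            write(candidate)
            return candidate
        except OSError:  # stand-in for a storage quota error
            continue
    write({})  # last resort: clear the whole bucket
    return {}
```

Each step gives up more state than the last, so the active session's snapshot is the final thing to be dropped.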

Network footprint

Each todo_state event is the same JSON the todo tool already returns in its tool message, plus five metadata fields (session_id, stream_id, source, ts, version). It's a single per-event mirror — no diffs, no polling.

4. Reliability & stability

Error isolation boundaries

Boundary	Behavior
`parse_todo_tool_result` gets non-string / non-JSON / wrong shape	Returns `None`; emit and attach skip.
`emit_todo_state` raises in `put`	Swallowed, debug-logged, returns `False`. Tool delivery is unaffected.
`attach_todo_state` errors during message iteration (broken generators, corrupted store)	Swallowed, returns `False`. Session GET responds normally without the `todo_state` field.
Frontend `JSON.parse` failure	Listener returns; `S.todos` is unchanged.
`localStorage` corrupted	`_readInflightStateMap` returns `{}`; the next save rebuilds.
`localStorage` quota exceeded	Falls back to active-session-only; if still over, clears the bucket.
INFLIGHT entry stale (>10 min)	`loadInflightState` evicts and returns `null`.
Stream reconnects with a different `streamId`	`loadInflightState` returns `null` — a new run cannot inherit a stale snapshot from the previous one.
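
The emit boundary in the table above can be sketched as follows. This is an illustration under assumptions, not the module's code: the `put(event_name, payload)` callback signature, the keyword arguments, and the default `source` value are hypothetical.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)
EVENT_NAME = "todo_state"  # event name as given in the PR


def emit_todo_state(
    put: Callable[[str, dict], None],
    snapshot: Optional[dict],
    *,
    session_id: str,
    stream_id: str,
    source: str = "tool",  # hypothetical default
) -> bool:
    """Emit a full snapshot; never let a failure reach the tool-delivery path."""
    if snapshot is None:
        return False  # nothing parseable to send, so skip quietly
    # Metadata fields plus the snapshot's own todos / summary / version.
    payload = {
        "session_id": session_id,
        "stream_id": stream_id,
        "source": source,
        "ts": time.time(),
        **snapshot,
    }
    try:
        put(EVENT_NAME, payload)
    except Exception:
        # Swallow and degrade: log at debug level and report failure.
        logger.debug("todo_state emit failed", exc_info=True)
        return False
    return True
```

The boolean return lets callers and tests observe a dropped emission without any exception crossing the boundary.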

Ordering & merge invariants

Full-snapshot protocol. Every emission carries the entire list, so SSE replay is idempotent — no merge, no diff reconciliation.
Strict-older-ts drop, equal-ts allowed. A compression-source refresh can legitimately land on the same wall-clock second as the tool emission it follows.
Cross-session double gate. Both payload.session_id !== activeSid and S.session.session_id !== activeSid drop the event. The latter prevents pollution when the SSE stream is still wired to the previous session but the user has already navigated away.
Recency-aware hydrate. When both cold-load (server settled view) and INFLIGHT (locally persisted snapshot) are present, _hydrateTodosFromSession picks the one with the newer ts. This stops a stale cached cold-load from regressing fresher INFLIGHT state on reload of a still-running session.
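
The invariants above reduce to two small decisions. The actual checks live in `static/messages.js` and `static/ui.js`; the Python model below only illustrates the logic. The function names are hypothetical, and two behaviors are assumptions not stated in this description: events with no `session_id` are dropped, and the cold-load snapshot wins a `ts` tie.

```python
from typing import Optional


def should_apply(event: dict, active_sid: str, current_session_id: Optional[str],
                 current_ts: Optional[float]) -> bool:
    """Decide whether a todo_state event may replace the current snapshot."""
    if not isinstance(event.get("todos"), list):
        return False  # wrong shape
    if event.get("session_id") != active_sid:
        return False  # gate 1: event is tagged for another session
    if current_session_id != active_sid:
        return False  # gate 2: the user has already navigated away
    ts = event.get("ts")
    if ts is not None and current_ts is not None and ts < current_ts:
        return False  # strictly older snapshots are dropped; equal ts is allowed
    return True


def pick_snapshot(cold: Optional[dict], inflight: Optional[dict]) -> Optional[dict]:
    """Choose between the server cold-load view and the local INFLIGHT copy."""
    if cold is None:
        return inflight
    if inflight is None:
        return cold
    # Newer ts wins; on a tie this sketch keeps the server's settled view.
    return inflight if inflight.get("ts", 0) > cold.get("ts", 0) else cold
```

Since every accepted event replaces the whole list, applying the same event twice leaves the state unchanged, which is why replay needs no merge step.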

Compatibility

S.todoStateMeta = null is a sentinel; pre-Phase-1 backends produce no todo_state events, so loadTodos() falls through to _legacyTodosFromMessages() and the panel still works.
version: 1 reserves room for future wire-format evolution.
The run-journal whitelist entry means a reconnect's Last-Event-ID advances past prior todo_state events. Replay remains correct (idempotent) but doesn't re-trigger renders.

Test coverage

Category	File	Purpose
Protocol unit	`test_todo_state.py` (365) / `test_todo_state_emission.py` (407) / `test_todo_state_robustness.py` (283)	parse / derive / emit / attach contracts; malformed items, non-dict summary, concurrent-emit thread safety, ts propagation
Real scenarios	`test_todo_state_scenarios.py` (741)	Multi-write history, last-write-truncated fallback, large-history performance sentinel
Wiring (AST)	`test_streaming_todo_state.py` (158) / `test_session_todo_state_route.py` (130)	AST call-shape checks: pin call sites' kwargs and positional shape, not source strings
Frontend wiring	`test_phase2_frontend_static.py` (505)	Balanced-brace function-body extraction + AST-style regex; tolerates whitespace and harmless reformatting
Frontend behavior	`test_phase2_todo_behavior.py` (831)	Node driver loads real JS function bodies and runs unit-level assertions against them
Frontend scenarios	`test_phase2_e2e_scenarios.py` (1082)	Lifecycle / multi-session / event-robustness / render-scheduling end-to-end through the real JS
Persistence round-trip	`test_phase2_inflight_persistence.py` (449)	Real in-memory `localStorage` shim drives `saveInflightState` ⇄ `loadInflightState` round-trips, hard-reload recovery, cross-session isolation, journal-replay regression

280 tests pass in ~4s.

5. Out of scope

No upstream Hermes Agent protocol change; todo tool semantics unchanged.
No SSE infrastructure change; one new event name plus one journal-whitelist entry.
No persistence-layer change (still localStorage, not IndexedDB); current budgets are far above what todo lists need.

Review follow-ups already absorbed in these two commits

P1: messages.js cross-session double check; ui.js ts-aware reconciliation in _hydrateTodosFromSession; api/todo_state.derive_todo_state propagates the source message timestamp so cold-load and INFLIGHT can be compared on the same axis.
P2: replaced grep-only structural tests with AST call-shape checks; added malformed-item / concurrent-emit / extra malformed-payload coverage; reduced brittleness in static-shape assertions; replaced flake-prone absolute timing with linear-scaling ratio; added the real-localStorage round-trip + reconnect test file.

nesquena-hermes · 2026-05-28T16:16:42Z

Triage: HOLD pending deep review — labels: hold, maintainer-review

This is a large addition: a realtime Todos panel + INFLIGHT recovery surface, +5,504 LOC across 17 files including new modules (api/todo_state.py), api/routes.py cold-load hooks, and api/streaming.py integration.

Two reasons to slow down:

New always-visible UI surface. A realtime Todos panel is a major UI addition. We want multi-viewport screenshots (390/1280/1440/1920), behavior with no todos, behavior mid-stream, behavior after reload, behavior with INFLIGHT recovery scenarios, and dark vs. light theme.
New architectural module + streaming.py integration. api/streaming.py is in our sensitive-paths list and api/todo_state.py adds a parallel store concept. We need our deep-streaming reviewer to walk this carefully. INFLIGHT recovery is also one of the sharpest paths in the codebase — the test coverage looks present but we want to spend some time on the recovery contract before this lands.

Action items on your side when you have time:

Multi-viewport screenshots for the Todos panel (empty / streaming / hydrated / inflight-recovery)
Short README in docs/ (or in the PR body) explaining the todo_state contract and how it interacts with the existing run-journal / partial-output recovery
Confirm whether the agent side already has a todo store this could consume vs. being WebUI-only

Will follow up here after the deep review. Thanks for the work — it's substantial and we want to land it well rather than rush it.

Add a dedicated SSE `todo_state` event so the Todos panel can update mid-turn instead of only on session reload. Cold-load attaches the same snapshot to GET /api/session so opening any session immediately shows the current task list.

Why
---
The browser's Todos panel parses the most recent role='tool' message in S.messages to find the current todo list. During streaming, that message only lands once the turn finishes (S.messages = full session on done event), so users see the full final list flash in at the end rather than watching tasks transition pending -> in_progress -> done.

The fix has two parts that must ship together:

1. A live data channel that delivers todo snapshots while the agent is still mid-turn.
2. A cold-load path so existing sessions opened from the sidebar populate the panel without waiting for a new tool call.

Design
------
* New api/todo_state.py module exposes:
  - parse_todo_tool_result(): used by the live emit path
  - derive_todo_state(): used by the cold-load path
  - emit_todo_state(): single helper that streaming.py calls from both legacy and modern callbacks
  - attach_todo_state(): single helper that routes.py calls from both WebUI and CLI cold-load paths

  All return the same {todos, summary, version} shape so the frontend has a single decoder. VERSION=1 is reserved for non-additive changes. EVENT_NAME / PAYLOAD_KEY constants centralize the wire-format names so frontend grep stays single-source.

* api/streaming.py emits 'todo_state' from both callback paths:
  - legacy tool_progress_callback (event_type=='tool.completed')
  - modern on_tool_complete (structured tool_complete_callback)

  Snapshots are full, so idempotent re-application is safe under SSE replay through the existing run journal. Emissions are guarded by name=='todo' and wrapped in try/except so a malformed payload never breaks tool delivery.

* api/routes.py attaches todo_state to the session GET response by scanning _all_msgs (not the truncated tail), mirroring the agent's AIAgent._hydrate_todo_store. Gated on load_messages so the sidebar listing endpoint stays cheap. Both the WebUI session path AND the CLI fallback path (_lookup_cli_session_metadata + get_cli_session_messages branch) invoke attach_todo_state, so a CLI-only session that used the todo tool hydrates the same way as a WebUI-native one.

* derive_todo_state() avoids a redundant list() shallow copy when the caller already passes a list/tuple. Multimodal tool results (content as a list of OpenAI/Anthropic content parts) are explicitly documented as intentional skips rather than silent fall-through.

* Detection symmetry with run_agent.AIAgent._hydrate_todo_store is documented at module level so a future tightening lands in both places at once.

Compatibility
-------------
Phase 1 is server-only. Old clients that don't subscribe to the new event simply drop it; behaviour is unchanged. Phase 2 wires up the frontend subscriber and adds reload recovery on top of this contract.

Wire format
-----------
* Live SSE event: {session_id, stream_id, source, ts, todos, summary, version}
* Cold-load session.todo_state: {todos, summary, version}

Both are pinned by tests so future refactors can't silently drift.

Tests
-----
* tests/test_todo_state.py: 38 unit tests covering parse + derive helpers (returns-new-dict, unicode round-trip, tuple input fast path, malformed payload tolerance, large-history short-circuit, etc.)
* tests/test_todo_state_emission.py: 35 unit tests for the new emit_todo_state / attach_todo_state helpers with a captured put() recorder. Covers happy path, every no-emit guard, addressing / metadata, put-callback exceptions.
* tests/test_todo_state_scenarios.py: 33 end-to-end scenarios: live mid-turn updates, cold reopen, SSE replay safety, cross-session addressing, multimodal tool histories, CLI session fallback, emit→storage→attach round-trip equality, failure isolation under garbage payloads, 10k-message histories with soft perf canaries, frontend wire-format pin, full-turn lifecycle, determinism.
* tests/test_streaming_todo_state.py: 2 import + call-site guards
* tests/test_session_todo_state_route.py: 2 import + call-site guards (the two grep-style tests intentionally stay minimal to avoid pinning source formatting; behavioural coverage lives in the scenario file above.)

Total: 110 tests, ~2s on the dev box.

Regression: tests/test_live_tool_callback_events.py, tests/test_run_journal*.py: 28/28 pass.

Review follow-up:
* derive_todo_state propagates source message timestamp as `ts` so the frontend can reconcile cold-load vs. INFLIGHT by recency.
* tests/test_streaming_todo_state.py and test_session_todo_state_route.py rewritten as AST call-shape checks; the previous grep-only string presence could not catch parameter swaps or missing kwargs.
* tests/test_todo_state_robustness.py adds malformed-item pass-through, ts-propagation, and concurrent-emit thread-safety coverage.

Builds on the Phase 1 server contract (61c91b24) to make the Todos panel update in real time off the dedicated `todo_state` SSE event, instead of waiting for a settled tool message and a reverse-scan over S.messages on every render.

Architecture
------------
Single source of truth: S.todos (live snapshot) + S.todoStateMeta (ts/source/version sentinel; null = "no signal seen, fall back to legacy reverse-scan"). Three settle channels feed it:

1. `todo_state` SSE event (live): listener in messages.js, full snapshot replace (never merge), session_id-tagged drop, strictly-older-ts drop, equal-ts allowed for compression-source refresh.
2. session GET payload .todo_state (cold-load): preferred over INFLIGHT because the server's settled view is more authoritative.
3. INFLIGHT[sid].todos / .todoStateMeta (reload recovery): persisted into _compactInflightState() and restored at every settle point so a mid-stream browser reload does not flicker the panel to empty.

_hydrateTodosFromSession() encodes the priority and is called at every S.session= settle point in messages.js (3) and sessions.js (5), incl. delete-session paths that pass null to clear.

Render path is split into two cheap stages:

• scheduleTodosRefresh(): RAF-coalesces bursty live updates into one paint per frame; skips entirely when the panel is not active.
• loadTodos(): prefers S.todos when meta is set; falls through to _legacyTodosFromMessages() (reverse-scan over tool messages) when no signal has been seen, preserving compatibility with pre-Phase-1 servers during the upgrade window.

A content-keyed hash (_todosHash) plus _todosLastRenderedHash short-circuits identical re-renders, including the empty-state case.

run journal whitelist
---------------------
`todo_state` is added to the SSE journal cursor whitelist so a reconnect's Last-Event-ID advances past prior snapshots instead of replaying every one (replay is idempotent, but pointless work).

Tests
-----
Three new files, 121 cases, all green:

• tests/test_phase2_frontend_static.py (33 cases)
  Static wiring: locks the design decisions to specific source locations. Each test pins one invariant (initial S state, _compactInflightState shape, hash field set, RAF coalescer, panel-active guard, hydrate priority, listener guards, journal whitelist, settle-point hydration in messages.js + sessions.js, INFLIGHT restore schema, renderer SSOT + legacy fallback + esc()).

• tests/test_phase2_todo_behavior.py (41 cases)
  JS behavior driven by node on the actual extracted helpers, the same pattern as test_renderer_js_behaviour.py. Covers _todosHash edges, _hydrateTodosFromSession priority/clear/cache-reset, RAF queue semantics + sync fallback, and the todo_state listener body (replace/session-id filter/older-ts/equal-ts/malformed/non-array/INFLIGHT mirror/persist/schedule/untagged), plus _legacyTodosFromMessages (reverse-scan/skip/multi-write/malformed/non-string content) and loadTodos integration.

• tests/test_phase2_e2e_scenarios.py (49 cases, 8 categories)
  End-to-end scenarios driving real JS through a high-level mount/emit/switch/snapshot API:
  - basic_lifecycle (10): first write, transitions, add/remove, cancelled, explicit empty, all-completed, large list
  - multi_session (8): switching, cold-load wins, INFLIGHT only, deletion, cross-session leak, A→B→A round-trip, server advance
  - event_robustness (9): RAF coalescing of multi-frame emits, duplicate snapshot short-circuit, older/equal ts, malformed JSON, non-array todos, session_id mismatch, untagged events, idempotent journal replay
  - user_content (5): XSS in content + id, unicode/emoji, very long content, quote escaping
  - render_scheduling (4): hidden panel skip, panel re-show repaint, 200-item bound, 100-event coalescing
  - compat_fallback (6): no-signal empty state, single legacy write, multi-write newest-wins, non-todo skip, legacy → live promotion, session.messages preference
  - realistic_workflows (3): plan-then-execute four-step flow, plan revision (cancel one + add new), 20-tool burst
  - persistence_recovery (3): persistInflightState fires on emit, INFLIGHT mirror, reload-then-reattach restores from INFLIGHT

Total Phase 1 + Phase 2 todo coverage: 230 cases, 100% green.

Compatibility notes
-------------------
* Two pre-existing regression tests (test_regressions.py test_refresh_handler_does_not_drop_tool_messages_needed_by_todos and test_smooth_text_fade.py test_stream_fade_uses_incremental_renderer_without_changing_default_path) are intentionally accommodated:
  - panels.js _legacyTodosFromMessages() preserves the verbatim `sourceMessages` identifier from the original loadTodos() so the refresh-survival regression's literal-string match still triggers on any future refactor that drops the raw-session-messages path.
  - messages.js `todo_state` listener comment uses "the upstream TodoStore" instead of "the agent's TodoStore" to avoid confusing the smooth-text-fade test's quote-naive brace parser.

  Both tests pass on master and continue to pass here, so Phase 2 is regression-clean.

* Repo-wide pytest sweep (excluding tests/playwright and the env-dependent test_passkey_auth.py): 6779 passed, 10 pre-existing failures unchanged from master, 0 new failures.

Review follow-up:
* messages.js: todo_state handler adds an S.session vs. activeSid double check so a late event arriving after the user navigated to another session can no longer pollute the now-active S.todos.
* ui.js: _hydrateTodosFromSession now reconciles cold-load vs. INFLIGHT by ts so a stale cold-load (e.g. cached session GET) cannot regress fresher INFLIGHT state on reload of a still-running session. Backend api/todo_state.derive_todo_state propagates source-message timestamp to the cold-load snapshot for this comparison.
* tests/test_phase2_frontend_static.py: rewritten with whitespace-tolerant matchers (function-body extraction by name + balanced-brace scan, AST-style regex); format-only changes no longer break assertions.
* tests/test_phase2_e2e_scenarios.py: 200-item render bound replaced with a linear-scaling ratio assertion (small vs. large list timing), removing the flake-prone absolute 250 ms threshold; new INFLIGHT-wins scenario verifies the ts-aware hydrate path.
* tests/test_phase2_todo_behavior.py: setActive() helper keeps S.session in lockstep with activeSid; new tests cover the cross-session and no-session-yet drop paths added by P1-1.
* tests/test_phase2_inflight_persistence.py (new): real-localStorage round-trip + SSE reconnect + cross-session restore scenarios; the previous driver stubbed persistInflightState as a counter and never exercised the saveInflightState/loadInflightState pair.

nesquena-hermes added the hold and maintainer-review (Maintainer fit-assessment needed; may not merge even with fixes) labels May 28, 2026

nesquena-hermes mentioned this pull request May 28, 2026

stage-batch35: v0.51.153 / Release DY — 11-PR low-risk cleanup #3081

Merged

v2psv force-pushed the feat/todo-realtime branch from ecd8a24 to 0a772cd Compare May 28, 2026 16:51

This was referenced May 28, 2026

stage-batch36: v0.51.154 / Release DZ — 9-PR medium-risk cleanup #3088

Merged

stage-batch39: v0.51.157 / Release EC — 5-PR mixed-risk cleanup #3096

Merged

v2psv added 2 commits May 28, 2026 23:14

v2psv force-pushed the feat/todo-realtime branch from 0a772cd to fa128c7 Compare May 28, 2026 23:54

v2psv wants to merge 2 commits into nesquena:master from v2psv:feat/todo-realtime

2 participants