feat(todo-state): realtime Todos panel + INFLIGHT recovery #3065

Open
v2psv wants to merge 2 commits into nesquena:master from v2psv:feat/todo-realtime

@v2psv (Contributor) commented May 28, 2026

1. Problem

The Todos panel only repainted once per turn, after the agent settled and the tool message hit state.db. loadTodos() then reverse-scanned S.messages to find the latest todo write. Three user-visible consequences:

  • Not realtime. A todo tool call mid-turn was invisible until the turn finished. The user saw a frozen list while the agent was actively crossing items off.
  • State lost on reload / reconnect. Refreshing during a running stream wiped the panel until the next todo write happened. Stream errors / reconnects had the same effect.
  • Occasional cross-session pollution. A late SSE event tagged for session A could land after the user had already navigated to session B and overwrite S.todos.

2. What changed

ecd8a247 feat(todo-state): frontend live updates + INFLIGHT recovery (Phase 2)
c347a11f feat(todo-state): server-side realtime todo state (Phase 1)

Based on master 5528e2c5. Two commits, 17 files, +5,504 / −20 lines.

  • Production code: 7 files / +573 / −20.
  • Tests: 10 files / +4,931 (8 new files).

Phase 1 — server-side realtime pipeline (c347a11f)

| File | Change |
| --- | --- |
| api/todo_state.py (new, 235 lines) | Single source of truth for the wire protocol: derive_todo_state / emit_todo_state / attach_todo_state / parse_todo_tool_result / _normalize_snapshot. Constants EVENT_NAME / PAYLOAD_KEY / VERSION live here. Every error boundary swallows and degrades; none can break tool delivery or session GET. |
| api/streaming.py (+36) | Both the legacy tool_progress_callback (event_type='tool.completed') path and the modern structured on_tool_complete path emit todo_state SSE events. Full snapshot, idempotent under run-journal replay. |
| api/routes.py (+21) | The session GET handler calls attach_todo_state on both the WebUI session branch and the CLI fallback so opening any session from the sidebar hydrates the Todos panel immediately. |
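
A minimal sketch of the cold-load derivation described in the table, for readers who want to see the shape of the reverse-scan. The function names come from this PR, but the signatures, the `timestamp` field name, and the shape-based detection rule are assumptions; the real code in api/todo_state.py may differ.

```python
import json
from typing import Any, Iterable, Optional

VERSION = 1  # mirrors the VERSION constant named in the PR


def parse_todo_tool_result(content: Any) -> Optional[dict]:
    """Return {todos, summary, version}, or None for anything that is not a todo result."""
    if not isinstance(content, str):
        return None  # multimodal (list-of-parts) content is skipped on purpose
    try:
        data = json.loads(content)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(data, dict) or not isinstance(data.get("todos"), list):
        return None
    return {"todos": data["todos"], "summary": data.get("summary"), "version": VERSION}


def derive_todo_state(messages: Iterable[dict]) -> Optional[dict]:
    """Reverse-scan the history for the newest tool message that parses as a todo result."""
    # Avoid a redundant copy when the caller already passes a list or tuple.
    msgs = messages if isinstance(messages, (list, tuple)) else list(messages)
    for msg in reversed(msgs):
        if not isinstance(msg, dict) or msg.get("role") != "tool":
            continue
        state = parse_todo_tool_result(msg.get("content"))
        if state is not None:
            state["ts"] = msg.get("timestamp")  # assumed field name for the message time
            return state
    return None
```

Because the scan stops at the first parseable todo write from the end, a malformed or multimodal tool message that arrives later does not hide an earlier valid snapshot.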

Phase 2 — frontend live consumption + INFLIGHT recovery (ecd8a247)

| File | Change |
| --- | --- |
| static/ui.js (+133/−4) | S.todos + S.todoStateMeta as the single source of truth. null is a sentinel that means "no signal seen, fall through to legacy reverse-scan", distinct from "signal seen, list is empty". _compactInflightState persists todos into the localStorage snapshot. _todosHash + _todosLastRenderedHash short-circuit identical re-renders. scheduleTodosRefresh coalesces bursty events through requestAnimationFrame. _hydrateTodosFromSession reconciles cold-load vs. INFLIGHT at every S.session= settle point. |
| static/messages.js (+43/−3) | todo_state SSE listener: swallow JSON.parse errors, validate Array.isArray(d.todos), drop on session_id !== activeSid, drop on S.session.session_id !== activeSid (a second check that catches late events arriving after navigation), drop on incomingTs < currentTs, full-snapshot replace (never merge). 'todo_state' added to the run-journal cursor whitelist so Last-Event-ID advances past it on reconnect. |
| static/sessions.js (+18) | INFLIGHT restore on tab return / hard reload carries todos and todoStateMeta alongside the existing fields. Session-deletion paths call _hydrateTodosFromSession(null) so the panel clears synchronously. |
| static/panels.js (+84/−13) | loadTodos() trusts S.todos whenever S.todoStateMeta is set; otherwise falls through to _legacyTodosFromMessages() (reverse-scan over tool messages). Every user-controlled string goes through esc() before innerHTML. |

Data flow

agent.todo()
   │
   ▼
streaming.py  ──emit_todo_state──▶  SSE: 'todo_state' (full snapshot)
   │                                          │
   ▼                                          ▼
state.db (tool message)              messages.js listener
   │                                          │
   ▼                                          ▼
routes.py session GET                S.todos / S.todoStateMeta
   │   ──attach_todo_state──▶ payload.todo_state    │
   ▼                                                 │
ui.js  _hydrateTodosFromSession  ◀── reconcile ─────┤
   │       (cold-load vs INFLIGHT, by ts)            │
   ▼                                                 │
panels.js loadTodos ──▶ DOM         localStorage ◀───┘

3. Performance

Render cost

  • Hash short-circuit. _todosHash builds an (id, content, status) fingerprint via string concatenation (no intermediate object allocation in V8). Identical snapshots return early without touching the DOM.
  • RAF coalescing. scheduleTodosRefresh collapses any number of todo_state events landing in the same animation frame into one loadTodos() call. The e2e suite asserts that 100 emits share a single RAF tick.
  • Inactive-panel skip. _todosPanelIsActive() short-circuits the entire path when the Todos panel is hidden — background events do zero DOM work.
  • Linear-scaling regression guard. Instead of a fragile absolute timeout, the test compares 50-item vs. 200-item render time and asserts the ratio stays < 8. An O(n²) regression would push the ratio toward 16.
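
The arithmetic behind the ratio guard can be checked directly. This is an illustrative sketch, not the test from the PR: `expected_ratio` and `passes_guard` are hypothetical helper names, and only the 50 / 200 list sizes and the threshold of 8 come from the description above.

```python
def expected_ratio(small: int, large: int, exponent: float) -> float:
    """Ideal timing ratio between two list sizes for an O(n**exponent) renderer."""
    return (large / small) ** exponent


LINEAR = expected_ratio(50, 200, 1)     # 4.0: a linear renderer is about 4x slower on 200 items
QUADRATIC = expected_ratio(50, 200, 2)  # 16.0: a quadratic renderer is about 16x slower
THRESHOLD = 8                           # sits between the two, leaving headroom for timing noise


def passes_guard(t_small: float, t_large: float, threshold: float = THRESHOLD) -> bool:
    """True when the measured large/small timing ratio stays under the threshold."""
    return t_large / t_small < threshold
```

A threshold of 8 is twice the ideal linear ratio and half the ideal quadratic one, so the assertion tolerates a noisy run without letting a quadratic regression through.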

Persistence budget (INFLIGHT_STATE_DEFAULT_LIMITS)

maxSessions: 8
messages:    24
toolCalls:   48
stringChars: 60,000   per string field
jsonChars:   1,500,000 per full snapshot

LRU by updated_at when the cap is hit. Quota errors trigger graceful degradation: drop everything except the active session, and if that still fails, clear the whole bucket. INFLIGHT entries older than 10 minutes are evicted on read.
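
The eviction and degradation rules above can be modelled in a few lines. This is a Python model of behavior that the PR implements in browser JavaScript, so every name here is hypothetical; it assumes `updated_at` is in milliseconds and stands in for the localStorage quota with a simple character capacity.

```python
import json

MAX_SESSIONS = 8               # maxSessions from the budget above
STALE_MS = 10 * 60 * 1000      # entries older than 10 minutes are evicted


def prune(state_map: dict, now_ms: int, max_sessions: int = MAX_SESSIONS) -> dict:
    """Drop stale entries, then keep only the most recently updated sessions."""
    fresh = {
        sid: entry
        for sid, entry in state_map.items()
        if now_ms - entry["updated_at"] <= STALE_MS
    }
    newest_first = sorted(fresh.items(), key=lambda kv: kv[1]["updated_at"], reverse=True)
    return dict(newest_first[:max_sessions])


def save_with_degradation(state_map: dict, active_sid: str, capacity_chars: int) -> dict:
    """Return what would be stored: the full map, else the active session only, else nothing."""
    active_only = {sid: e for sid, e in state_map.items() if sid == active_sid}
    for candidate in (state_map, active_only, {}):
        if len(json.dumps(candidate)) <= capacity_chars:
            return candidate
    return {}
```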

Network footprint

Each todo_state event is the same JSON the todo tool already returns in its tool message, plus five metadata fields (session_id, stream_id, source, ts, version). It's a single per-event mirror — no diffs, no polling.
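
To make the payload shape concrete, here is a sketch of how one event could be assembled from a tool result. The seven field names are taken from this PR; the helper name `build_todo_state_event`, its signature, and the default `source="tool"` value are assumptions made for illustration.

```python
import time
from typing import Optional

VERSION = 1


def build_todo_state_event(
    tool_result: dict,
    *,
    session_id: str,
    stream_id: str,
    source: str = "tool",          # assumed label; the PR does not list the allowed values
    ts: Optional[float] = None,
) -> dict:
    """Mirror the todo tool result and add the five metadata fields."""
    return {
        "session_id": session_id,
        "stream_id": stream_id,
        "source": source,
        "ts": time.time() if ts is None else ts,
        "version": VERSION,
        # Full snapshot: the whole list is sent every time, never a diff.
        "todos": tool_result["todos"],
        "summary": tool_result.get("summary"),
    }
```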

4. Reliability & stability

Error isolation boundaries

| Boundary | Behavior |
| --- | --- |
| parse_todo_tool_result gets non-string / non-JSON / wrong shape | Returns None; emit and attach skip. |
| emit_todo_state raises in put | Swallowed, debug-logged, returns False. Tool delivery is unaffected. |
| attach_todo_state errors during message iteration (broken generators, corrupted store) | Swallowed, returns False. Session GET responds normally without the todo_state field. |
| Frontend JSON.parse failure | Listener returns; S.todos is unchanged. |
| localStorage corrupted | _readInflightStateMap returns {}; the next save rebuilds. |
| localStorage quota exceeded | Falls back to active-session-only; if still over, clears the bucket. |
| INFLIGHT entry stale (>10 min) | loadInflightState evicts and returns null. |
| Stream reconnects with a different streamId | loadInflightState returns null; a new run cannot inherit a stale snapshot from the previous one. |
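
The swallow-and-degrade pattern in the emit row might look roughly like this. It is a sketch under assumptions: the `put(event_name, payload)` callback shape and the logger name are hypothetical, and the real emit_todo_state takes more arguments than shown here.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("todo_state_sketch")  # hypothetical logger name
EVENT_NAME = "todo_state"


def emit_todo_state(put: Callable[[str, dict], Any], payload: Any) -> bool:
    """Try to enqueue a todo_state event; report failure instead of raising."""
    if not isinstance(payload, dict) or not isinstance(payload.get("todos"), list):
        return False  # nothing valid to emit, so skip quietly
    try:
        put(EVENT_NAME, payload)
    except Exception:
        # The caller is in the middle of delivering a tool result, so an
        # error here is logged at debug level and never propagated.
        logger.debug("todo_state emit failed", exc_info=True)
        return False
    return True
```

Returning a boolean rather than raising keeps the todo side channel strictly optional: the tool result still reaches the client through the normal path even when the mirror event cannot be sent.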

Ordering & merge invariants

  • Full-snapshot protocol. Every emission carries the entire list, so SSE replay is idempotent — no merge, no diff reconciliation.
  • Strict-older-ts drop, equal-ts allowed. A compression-source refresh can legitimately land on the same wall-clock second as the tool emission it follows.
  • Cross-session double gate. Both payload.session_id !== activeSid and S.session.session_id !== activeSid drop the event. The latter prevents pollution when the SSE stream is still wired to the previous session but the user has already navigated away.
  • Recency-aware hydrate. When both cold-load (server settled view) and INFLIGHT (locally persisted snapshot) are present, _hydrateTodosFromSession picks the one with the newer ts. This stops a stale cached cold-load from regressing fresher INFLIGHT state on reload of a still-running session.
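
The invariants above reduce to two small decisions, modelled here in Python even though the PR implements them in static/messages.js and static/ui.js. The function names are hypothetical, and two details are assumptions: an event with no session_id is dropped, and an exact ts tie between cold-load and INFLIGHT goes to the cold-load view.

```python
from typing import Optional


def should_apply(event: dict, active_sid: str, current_session_id: Optional[str],
                 current_ts: Optional[float]) -> bool:
    """Decide whether a live todo_state event may replace the current snapshot."""
    if not isinstance(event.get("todos"), list):
        return False                      # malformed payload
    if event.get("session_id") != active_sid:
        return False                      # first gate: event tagged for another session
    if current_session_id != active_sid:
        return False                      # second gate: user already navigated away
    incoming = event.get("ts")
    if incoming is not None and current_ts is not None and incoming < current_ts:
        return False                      # strictly older is dropped; equal ts is allowed
    return True


def pick_hydrate_source(cold: Optional[dict], inflight: Optional[dict]) -> Optional[dict]:
    """Choose between the server's cold-load snapshot and the local INFLIGHT one."""
    if cold is None:
        return inflight
    if inflight is None:
        return cold
    newer_inflight = (inflight.get("ts") or 0) > (cold.get("ts") or 0)
    return inflight if newer_inflight else cold
```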

Compatibility

  • S.todoStateMeta = null is a sentinel; pre-Phase-1 backends produce no todo_state events, so loadTodos() falls through to _legacyTodosFromMessages() and the panel still works.
  • version: 1 reserves room for future wire-format evolution.
  • The run-journal whitelist entry means a reconnect's Last-Event-ID advances past prior todo_state events. Replay remains correct (idempotent) but doesn't re-trigger renders.

Test coverage

| Category | File (lines) | Purpose |
| --- | --- | --- |
| Protocol unit | test_todo_state.py (365) / test_todo_state_emission.py (407) / test_todo_state_robustness.py (283) | parse / derive / emit / attach contracts; malformed items, non-dict summary, concurrent-emit thread safety, ts propagation |
| Real scenarios | test_todo_state_scenarios.py (741) | Multi-write history, last-write-truncated fallback, large-history performance sentinel |
| Wiring (AST) | test_streaming_todo_state.py (158) / test_session_todo_state_route.py (130) | AST call-shape checks: pin call sites' kwargs and positional shape, not source strings |
| Frontend wiring | test_phase2_frontend_static.py (505) | Balanced-brace function-body extraction + AST-style regex; tolerates whitespace and harmless reformatting |
| Frontend behavior | test_phase2_todo_behavior.py (831) | Node driver loads real JS function bodies and runs unit-level assertions against them |
| Frontend scenarios | test_phase2_e2e_scenarios.py (1082) | Lifecycle / multi-session / event-robustness / render-scheduling end-to-end through the real JS |
| Persistence round-trip | test_phase2_inflight_persistence.py (449) | Real in-memory localStorage shim drives saveInflightState / loadInflightState round-trips, hard-reload recovery, cross-session isolation, journal-replay regression |
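
For context on the "Wiring (AST)" row, this is roughly what an AST call-shape check looks like with the standard `ast` module. The helper names and the sample call site (including its `session_id` / `stream_id` keywords) are invented for illustration and are not the actual call shape in api/streaming.py.

```python
import ast


def calls_to(source: str, func_name: str) -> list:
    """Collect every call node whose callee is named func_name (bare or as an attribute)."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        if isinstance(callee, ast.Name):
            name = callee.id
        elif isinstance(callee, ast.Attribute):
            name = callee.attr
        else:
            name = None
        if name == func_name:
            found.append(node)
    return found


def keyword_names(call: ast.Call) -> set:
    """Names of the explicit keyword arguments at a call site (ignores **kwargs)."""
    return {kw.arg for kw in call.keywords if kw.arg is not None}


# Hypothetical call site; only the function name comes from the PR.
SAMPLE = '''
def on_tool_complete(name, result):
    if name == "todo":
        emit_todo_state(put, result, session_id=sid, stream_id=stream_id)
'''
```

Unlike a grep for the function name, a check built on these helpers fails when an argument is dropped or renamed and still passes when the call is merely reflowed across lines.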

280 tests pass in ~4s.

5. Out of scope

  • No upstream Hermes Agent protocol change; todo tool semantics unchanged.
  • No SSE infrastructure change; one new event name plus one journal-whitelist entry.
  • No persistence-layer change (still localStorage, not IndexedDB); current budgets are far above what todo lists need.

Review follow-ups already absorbed in these two commits

  • P1: messages.js cross-session double check; ui.js ts-aware reconciliation in _hydrateTodosFromSession; api/todo_state.derive_todo_state propagates the source message timestamp so cold-load and INFLIGHT can be compared on the same axis.
  • P2: replaced grep-only structural tests with AST call-shape checks; added malformed-item / concurrent-emit / extra malformed-payload coverage; reduced brittleness in static-shape assertions; replaced flake-prone absolute timing with linear-scaling ratio; added the real-localStorage round-trip + reconnect test file.

@nesquena-hermes (Collaborator)

Triage: HOLD pending deep review — labels: hold, maintainer-review

This is a large addition: a realtime Todos panel + INFLIGHT recovery surface, +5,504 LOC across 17 files including new modules (api/todo_state.py), api/routes.py cold-load hooks, and api/streaming.py integration.

Two reasons to slow down:

  1. New always-visible UI surface. A realtime Todos panel is a major UI addition. We want multi-viewport screenshots (390/1280/1440/1920), behavior with no todos, behavior mid-stream, behavior after reload, behavior with INFLIGHT recovery scenarios, and dark vs. light theme.

  2. New architectural module + streaming.py integration. api/streaming.py is in our sensitive-paths list and api/todo_state.py adds a parallel store concept. We need our deep-streaming reviewer to walk this carefully. INFLIGHT recovery is also one of the sharpest paths in the codebase — the test coverage looks present but we want to spend some time on the recovery contract before this lands.

Action items on your side when you have time:

  • Multi-viewport screenshots for the Todos panel (empty / streaming / hydrated / inflight-recovery)
  • Short README in docs/ (or in the PR body) explaining the todo_state contract and how it interacts with the existing run-journal / partial-output recovery
  • Confirm whether the agent side already has a todo store this could consume vs. being WebUI-only

Will follow up here after the deep review. Thanks for the work — it's substantial and we want to land it well rather than rush it.

v2psv added 2 commits May 28, 2026 23:14

feat(todo-state): server-side realtime todo state (Phase 1)

Add a dedicated SSE `todo_state` event so the Todos panel can update
mid-turn instead of only on session reload. Cold-load attaches the
same snapshot to GET /api/session so opening any session immediately
shows the current task list.

Why
---
The browser's Todos panel parses the most recent role='tool' message
in S.messages to find the current todo list. During streaming, that
message only lands once the turn finishes (S.messages = full session
on done event), so users see the full final list flash in at the end
rather than watching tasks transition pending -> in_progress -> done.

The fix has two parts that must ship together:
  1. A live data channel that delivers todo snapshots while the agent
     is still mid-turn.
  2. A cold-load path so existing sessions opened from the sidebar
     populate the panel without waiting for a new tool call.

Design
------
* New api/todo_state.py module exposes:
    - parse_todo_tool_result(): used by the live emit path
    - derive_todo_state():      used by the cold-load path
    - emit_todo_state():        single helper that streaming.py calls
                                from both legacy and modern callbacks
    - attach_todo_state():      single helper that routes.py calls
                                from both WebUI and CLI cold-load paths
  All return the same {todos, summary, version} shape so the frontend
  has a single decoder. VERSION=1 is reserved for non-additive changes.
  EVENT_NAME / PAYLOAD_KEY constants centralize the wire-format names
  so frontend grep stays single-source.

* api/streaming.py emits 'todo_state' from both callback paths:
    - legacy tool_progress_callback (event_type=='tool.completed')
    - modern on_tool_complete (structured tool_complete_callback)
  Snapshots are full — idempotent re-application is safe under SSE
  replay through the existing run journal. Emissions are guarded by
  name=='todo' and wrapped in try/except so a malformed payload never
  breaks tool delivery.

* api/routes.py attaches todo_state to the session GET response by
  scanning _all_msgs (not the truncated tail) — mirrors the agent's
  AIAgent._hydrate_todo_store. Gated on load_messages so the sidebar
  listing endpoint stays cheap. Both the WebUI session path AND the
  CLI fallback path (_lookup_cli_session_metadata +
  get_cli_session_messages branch) invoke attach_todo_state, so a
  CLI-only session that used the todo tool hydrates the same way as
  a WebUI-native one.

* derive_todo_state() avoids a redundant list() shallow copy when the
  caller already passes a list/tuple. Multimodal tool results (content
  as a list of OpenAI/Anthropic content parts) are explicitly
  documented as intentional skips rather than silent fall-through.

* Detection symmetry with run_agent.AIAgent._hydrate_todo_store is
  documented at module level so a future tightening lands in both
  places at once.

Compatibility
-------------
Phase 1 is server-only. Old clients that don't subscribe to the new
event simply drop it; behaviour is unchanged. Phase 2 wires up the
frontend subscriber and adds reload recovery on top of this contract.

Wire format
-----------
* Live SSE event:
    {session_id, stream_id, source, ts, todos, summary, version}
* Cold-load session.todo_state:
    {todos, summary, version}
Both are pinned by tests so future refactors can't silently drift.

Tests
-----
* tests/test_todo_state.py             — 38 unit tests covering
  parse + derive helpers (returns-new-dict, unicode round-trip,
  tuple input fast path, malformed payload tolerance, large-history
  short-circuit, etc.)
* tests/test_todo_state_emission.py    — 35 unit tests for the new
  emit_todo_state / attach_todo_state helpers with a captured put()
  recorder. Covers happy path, every no-emit guard, addressing /
  metadata, put-callback exceptions.
* tests/test_todo_state_scenarios.py   — 33 end-to-end scenarios:
  live mid-turn updates, cold reopen, SSE replay safety, cross-
  session addressing, multimodal tool histories, CLI session
  fallback, emit→storage→attach round-trip equality, failure
  isolation under garbage payloads, 10k-message histories with soft
  perf canaries, frontend wire-format pin, full-turn lifecycle,
  determinism.
* tests/test_streaming_todo_state.py     — 2 import + call-site guards
* tests/test_session_todo_state_route.py — 2 import + call-site guards
  (the two grep-style tests intentionally stay minimal to avoid
  pinning source formatting; behavioural coverage lives in the
  scenario file above.)

Total: 110 tests, ~2s on the dev box.

Regression: tests/test_live_tool_callback_events.py,
tests/test_run_journal*.py — 28/28 pass.

Review follow-up:
* derive_todo_state propagates source message timestamp as `ts` so the
  frontend can reconcile cold-load vs. INFLIGHT by recency.
* tests/test_streaming_todo_state.py and test_session_todo_state_route.py
  rewritten as AST call-shape checks; the previous grep-only string
  presence could not catch parameter swaps or missing kwargs.
* tests/test_todo_state_robustness.py adds malformed-item pass-through,
  ts-propagation, and concurrent-emit thread-safety coverage.

feat(todo-state): frontend live updates + INFLIGHT recovery (Phase 2)

Builds on the Phase 1 server contract (61c91b24) to make
the Todos panel update in real time off the dedicated `todo_state`
SSE event, instead of waiting for a settled tool message and a
reverse-scan over S.messages on every render.

Architecture
------------
Single source of truth: S.todos (live snapshot) + S.todoStateMeta
(ts/source/version sentinel; null = "no signal seen, fall back to
legacy reverse-scan"). Three settle channels feed it:

  1. `todo_state` SSE event (live): listener in messages.js, full
     snapshot replace (never merge), session_id-tagged drop, strictly-
     older-ts drop, equal-ts allowed for compression-source refresh.

  2. session GET payload .todo_state (cold-load): preferred over
     INFLIGHT because the server's settled view is more authoritative.

  3. INFLIGHT[sid].todos / .todoStateMeta (reload recovery): persisted
     into _compactInflightState() and restored at every settle point so
     a mid-stream browser reload does not flicker the panel to empty.

_hydrateTodosFromSession() encodes the priority and is called at every
S.session= settle point in messages.js (3) and sessions.js (5), incl.
delete-session paths that pass null to clear.

Render path is split into two cheap stages:

  • scheduleTodosRefresh() — RAF-coalesces bursty live updates into one
    paint per frame; skips entirely when the panel is not active.
  • loadTodos() — prefers S.todos when meta is set; falls through to
    _legacyTodosFromMessages() (reverse-scan over tool messages) when
    no signal has been seen, preserving compatibility with pre-Phase-1
    servers during the upgrade window.

A content-keyed hash (_todosHash) plus _todosLastRenderedHash short-
circuits identical re-renders, including the empty-state case.

Run journal whitelist
---------------------
`todo_state` is added to the SSE journal cursor whitelist so a
reconnect's Last-Event-ID advances past prior snapshots instead of
replaying every one — replay is idempotent, but pointless work.

Tests
-----
Three new files, 121 cases, all green:

  • tests/test_phase2_frontend_static.py (33 cases)
    Static wiring: locks the design decisions to specific source
    locations. Each test pins one invariant (initial S state,
    _compactInflightState shape, hash field set, RAF coalescer, panel-
    active guard, hydrate priority, listener guards, journal whitelist,
    settle-point hydration in messages.js + sessions.js, INFLIGHT
    restore schema, renderer SSOT + legacy fallback + esc()).

  • tests/test_phase2_todo_behavior.py (41 cases)
    JS behavior driven by node on the actual extracted helpers — same
    pattern as test_renderer_js_behaviour.py. Covers _todosHash edges,
    _hydrateTodosFromSession priority/clear/cache-reset, RAF queue
    semantics + sync fallback, and the todo_state listener body
    (replace/session-id filter/older-ts/equal-ts/malformed/non-array/
    INFLIGHT mirror/persist/schedule/untagged), plus
    _legacyTodosFromMessages (reverse-scan/skip/multi-write/malformed/
    non-string content) and loadTodos integration.

  • tests/test_phase2_e2e_scenarios.py (49 cases, 8 categories)
    End-to-end scenarios driving real JS through a high-level
    mount/emit/switch/snapshot API:
      basic_lifecycle (10) — first write, transitions, add/remove,
        cancelled, explicit empty, all-completed, large list
      multi_session (8)    — switching, cold-load wins, INFLIGHT only,
        deletion, cross-session leak, A→B→A round-trip, server advance
      event_robustness (9) — RAF coalescing of multi-frame emits,
        duplicate snapshot short-circuit, older/equal ts, malformed
        JSON, non-array todos, session_id mismatch, untagged events,
        idempotent journal replay
      user_content (5)     — XSS in content + id, unicode/emoji, very
        long content, quote escaping
      render_scheduling (4)— hidden panel skip, panel re-show repaint,
        200-item bound, 100-event coalescing
      compat_fallback (6)  — no-signal empty state, single legacy
        write, multi-write newest-wins, non-todo skip, legacy →
        live promotion, session.messages preference
      realistic_workflows (3) — plan-then-execute four-step flow,
        plan revision (cancel one + add new), 20-tool burst
      persistence_recovery (3)— persistInflightState fires on emit,
        INFLIGHT mirror, reload-then-reattach restores from INFLIGHT

Total Phase 1 + Phase 2 todo coverage: 230 cases, 100% green.

Compatibility notes
-------------------
* Two pre-existing regression tests are intentionally accommodated
  (test_regressions.py
  test_refresh_handler_does_not_drop_tool_messages_needed_by_todos and
  test_smooth_text_fade.py
  test_stream_fade_uses_incremental_renderer_without_changing_default_path):
  - panels.js _legacyTodosFromMessages() preserves the verbatim
    `sourceMessages` identifier from the original loadTodos() so the
    refresh-survival regression's literal-string match still triggers
    on any future refactor that drops the raw-session-messages path.
  - messages.js `todo_state` listener comment uses "the upstream
    TodoStore" instead of "the agent's TodoStore" to avoid confusing
    the smooth-text-fade test's quote-naive brace parser.
  Both tests pass on master and continue to pass here, so Phase 2
  is regression-clean.

* Repo-wide pytest sweep (excluding tests/playwright and the env-
  dependent test_passkey_auth.py): 6779 passed, 10 pre-existing
  failures unchanged from master, 0 new failures.

Review follow-up:
* messages.js: todo_state handler adds an S.session vs. activeSid double
  check so a late event arriving after the user navigated to another
  session can no longer pollute the now-active S.todos.
* ui.js: _hydrateTodosFromSession now reconciles cold-load vs. INFLIGHT
  by ts so a stale cold-load (e.g. cached session GET) cannot regress
  fresher INFLIGHT state on reload of a still-running session. Backend
  api/todo_state.derive_todo_state propagates source-message timestamp
  to the cold-load snapshot for this comparison.
* tests/test_phase2_frontend_static.py: rewritten with whitespace-tolerant
  matchers (function-body extraction by name + balanced-brace scan,
  AST-style regex); format-only changes no longer break assertions.
* tests/test_phase2_e2e_scenarios.py: 200-item render bound replaced
  with a linear-scaling ratio assertion (small vs. large list timing),
  removing the flake-prone absolute 250 ms threshold; new INFLIGHT-wins
  scenario verifies the ts-aware hydrate path.
* tests/test_phase2_todo_behavior.py: setActive() helper keeps S.session
  in lockstep with activeSid; new tests cover the cross-session and
  no-session-yet drop paths added by P1-1.
* tests/test_phase2_inflight_persistence.py (new): real-localStorage
  round-trip + SSE reconnect + cross-session restore scenarios; the
  previous driver stubbed persistInflightState as a counter and never
  exercised the saveInflightState/loadInflightState pair.

v2psv force-pushed the feat/todo-realtime branch from 0a772cd to fa128c7 May 28, 2026 23:54
Labels: hold, maintainer-review. Maintainer fit-assessment needed; may not merge even with fixes.
