fix(chat): keep transcript following the live tail (scrollHeight-shrink clamp) — launch-to-msg flake [+ investigation diagnostics]#210

Draft
weishi-imbue wants to merge 13 commits into main from wz/fix-sse-all

@weishi-imbue

Contributor

Summary

Fixes the intermittent launch-to-msg mac-runner flake (~12–25%): the agent's Slack reply (CI MOCK: greetings from the localhost slack mock.) never appeared in the live chat, only after a reload.

Root cause (frontend — proven by instrumentation, server stack fully exonerated):
The agent reads Slack and reports the message correctly every 30s; the server broadcasts and SSE-forwards each report in real time; the client receives and appends them. But ChatPanel virtualizes the transcript, and it stops following the live tail once its scroll-follow logic latches userScrolledUp = true. Trigger: while the transcript is pinned to the bottom, async row measurement settles a row shorter than its estimate → scrollHeight shrinks → the browser clamps scrollTop down → that emits a scroll event, which the naive scrollTop < previousScrollTop check misread as the user scrolling up. Once latched, following never resumes: new content keeps moving the bottom further away, so the viewport never returns to it. New messages therefore render outside the virtualized window and never enter the DOM until a reload re-anchors at the tail. (This is why canned_body_after_reload was always true, and why it never reproduced manually: a real session keeps following the tail.)

✅ The actual fix — review these

  • models/scrollFollow.ts: new pure isUserScrollUp({scrollTop, previousScrollTop, scrollHeight, previousScrollHeight}). A scrollTop decrease that coincides with a scrollHeight shrink is treated as a browser clamp, not user intent.
  • views/ChatPanel.ts — track previousScrollHeight in lockstep with previousScrollTop at every programmatic scroll site; use isUserScrollUp in handleScrollEvent.
  • models/scrollFollow.test.ts — unit tests for the clamp case (+ shrink/grow/down cases).
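
As a rough illustration of the clamp rule in the list above, the predicate could look like the sketch below. Only the function name and its four input fields come from this PR; the ScrollSample type name and the exact comparisons are assumptions, and the real models/scrollFollow.ts may differ.

```typescript
// Hypothetical sketch of the clamp-aware check; not the actual file contents.
interface ScrollSample {
  scrollTop: number;
  previousScrollTop: number;
  scrollHeight: number;
  previousScrollHeight: number;
}

function isUserScrollUp(sample: ScrollSample): boolean {
  // scrollTop decreased since the last recorded position.
  const movedUp = sample.scrollTop < sample.previousScrollTop;
  // The scrollable content got shorter in the same step, so the browser
  // may have clamped scrollTop to the new maximum on its own.
  const contentShrank = sample.scrollHeight < sample.previousScrollHeight;
  // Only a decrease without a coinciding shrink counts as user intent.
  return movedUp && !contentShrank;
}
```

One consequence of this form, under the same assumptions: a genuine upward scroll that lands in the same event as a shrink is also ignored, so the latch would only be set by the user's next scroll event.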

🟡 Other real fixes found during the hunt (kept; NOT the flake cause)

Legitimate SSE-correctness fixes surfaced while chasing this; the flake persisted after each, so none was the cause. Worth keeping but please sanity-review:

  • server.py _get_events — make the tail snapshot self-consistent (offset/len/total from one read): fixes a tail-anchor snapshot race.
  • session_watcher.py is_main_session_event — default an unknown session to MAIN instead of dropping it: fixes a main-session-rotation live-filter drop.
  • Response.ts / event_queues.py: appendEvents returns a bool; queue replay-on-register.
  • StreamingMessage.ts — SSE staleness watchdog (force-reconnect a half-open/zombie stream after 25s) + visibility/focus reconnect. The visibility-reconnect did not fix the flake; the staleness watchdog is a real robustness improvement. Decide whether to keep.
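
To make the staleness watchdog from the last bullet concrete, here is a minimal sketch of the decision it has to make. The 25s threshold is from this PR; StreamState, isStale, and streamsToReconnect are hypothetical names, and the real StreamingMessage.ts presumably wires this into a timer and into visibility/focus handlers rather than exposing pure functions.

```typescript
// Hypothetical sketch: decide which SSE streams have been quiet too long.
const STALE_AFTER_MS = 25000; // "no frame for 25s", per this PR

interface StreamState {
  // Timestamp (ms) of the last frame of any kind, keepalives included.
  lastFrameAtMs: number;
}

function isStale(stream: StreamState, nowMs: number, staleAfterMs: number = STALE_AFTER_MS): boolean {
  return nowMs - stream.lastFrameAtMs >= staleAfterMs;
}

// Returns the ids of streams that should be force-reconnected.
function streamsToReconnect(streams: Record<string, StreamState>, nowMs: number): string[] {
  return Object.keys(streams).filter((id) => isStale(streams[id], nowMs));
}
```

Keeping the check separate from the timer means the same function can be called from both a periodic tick and a foreground handler, which matters if the timer is throttled while the window is hidden.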

🔴 Diagnostics — TO BE STRIPPED before merge

Pure instrumentation added to capture the failure mechanism; not for production:

  • hang_watchdog.py + main.py wiring (faulthandler all-thread dump on :8000 unreachable)
  • server.py: [diag-gen] per-event logging, GET /diag/threads, GET /diag/sessions/<agent_id>
  • session_watcher.py: [diag-poll] heartbeat/crash logging
  • event_queues.py: [diag-sse] broadcast/register logging (+ event_id/session in the broadcast log)
  • StreamingMessage.ts: __sseDiag/__sseDiagMeta counters
  • ChatPanel.ts: __chatDiag/__chatScrollDiag (incl. a clampsSuppressed counter used to verify the fix)

Verification

In progress: a 12-run launch-to-msg batch against current main + this fix, with __chatScrollDiag giving mechanistic confirmation — the clamp actually occurs (clampsSuppressed > 0) and userScrolledUp never falsely latches (latchedUp = 0) — not just a pass count (the flake is too low-frequency for pass-count alone). Paired diagnostics live in mngr branch wz/diagnose-si-hang-probe.
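
The per-run check described above could be classified roughly as in the sketch below. The counter names clampsSuppressed and latchedUp come from this PR; the ChatScrollDiag shape, the classifyRun helper, and the three verdict labels are assumptions made for illustration.

```typescript
// Hypothetical sketch of interpreting one run's __chatScrollDiag counters.
interface ChatScrollDiag {
  clampsSuppressed: number; // clamp-induced scroll events that were ignored
  latchedUp: number;        // times userScrolledUp latched during the run
}

type Verdict = "confirmed" | "inconclusive" | "regressed";

function classifyRun(diag: ChatScrollDiag): Verdict {
  if (diag.latchedUp > 0) {
    // No user scrolls in the e2e run, so any latch is a false one.
    return "regressed";
  }
  // The clamp has to actually occur for the run to say anything about the fix.
  return diag.clampsSuppressed > 0 ? "confirmed" : "inconclusive";
}
```

A run that passes without ever hitting the clamp lands in "inconclusive", which is the distinction a plain pass count cannot make.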

🤖 Generated with Claude Code

weishi-imbue and others added 13 commits June 24, 2026 03:04
… wedge

Temporary instrumentation to root-cause the intermittent launch-to-msg slack
hang. Self-probes the server port; on repeated failure dumps all thread stacks
to /tmp/system_interface_hang_dump.txt + stderr so we see the exact stuck stack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Captures whether the conversation SSE stream keeps receiving events (broadcast)
and whether the chat client disconnects without reconnecting (unregister with no
later register) -- to localize the chat-UI freeze in the output path.
Root cause of the intermittent launch-to-msg chat freeze: _get_events reads
get_tail_events(), get_total_event_count() and the offset non-atomically. During
active streaming a new event can land between the tail read and the total read,
so total > offset+len. The client's TranscriptStore.append then treats the window
as 'not at the tail' (hasMoreAfter) and SILENTLY DROPS every subsequent live
event -- the chat freezes until a reload re-fetches a consistent snapshot.

A tail load defines the live end as of that read; newer events arrive over the
SSE stream (buffered during the fetch). So for a tail load, total is now
offset+len(events), keeping the client tail-anchored. Paging loads still report
the real total. Includes a [diag-tail] log when the race is detected, to confirm
the mechanism on CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
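
The tail-anchoring rule in the commit message above can be illustrated with a small client-side sketch. hasMoreAfter is named in the message, but its body here is inferred from the "total > offset+len" description; EventWindow and tailWindow are hypothetical names, and the actual change is in the Python server.py.

```typescript
// Hypothetical sketch of why an inconsistent snapshot un-anchors the client.
interface EventWindow {
  offset: number;     // index of the first event in the window
  eventCount: number; // number of events in the window
  total: number;      // total events the server reported
}

// Assumed client rule: the window is "not at the tail" if the server
// reports more events than the window covers.
function hasMoreAfter(w: EventWindow): boolean {
  return w.total > w.offset + w.eventCount;
}

// Tail load after the fix: total is derived from the same read as the
// events, so the window is tail-anchored by construction.
function tailWindow(offset: number, eventCount: number): EventWindow {
  return { offset, eventCount, total: offset + eventCount };
}

// Racy snapshot: one event landed between the tail read and the total read.
const racy: EventWindow = { offset: 90, eventCount: 10, total: 101 };
const consistent: EventWindow = tailWindow(90, 10);
```

With these values, hasMoreAfter(racy) is true and hasMoreAfter(consistent) is false, which is the difference between a client that drops live events and one that keeps appending them.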
…E diag

Combines the two robustness fixes for the chat-freeze flake:
- server _get_events keeps the tail snapshot tail-anchored (the snapshot race).
- StreamingMessage staleness watchdog reconnects a zombie SSE (no frame for 25s),
  with the server keepalive as a visible data event so the watchdog sees liveness.

Adds window.__sseDiag per-agent counters (received/buffered/rendered/dropped/
reconnects/errors/inFlightActive/lastId) so the e2e can read, on a freeze, exactly
what the client did with live events -- ground truth for any remaining failure.
…ent __sseDiag

Single branch carrying both proven fixes, the staleness watchdog, and the
window.__sseDiag client instrumentation, to capture the residual chat-freeze with
complete ground truth (client received/rendered/dropped + server [diag-*] logs).
… recovery)

The staleness watchdog is a setInterval, which Chromium throttles/suspends while
the chat WebContents is hidden/occluded -- so a stream that dies while the window
is backgrounded (e.g. while the requests panel is driven) is never recovered by
the timer. Add a visibilitychange/focus handler that force-reconnects any quiet
stream the moment the window is foregrounded again. Adds __sseDiagMeta.ticks to
confirm whether the timer was running.
…atScrollDiag) to confirm tail rows virtualized out
A row measuring shorter than its estimate shrinks scrollHeight while the
transcript is pinned to the bottom; the browser clamps scrollTop down and
emits a scroll event that the naive scrollTop<previousScrollTop check read as
the user scrolling up. That latched userScrolledUp, so the panel stopped
following the live tail and new messages rendered outside the virtualized
window (visible only after a reload). Track previousScrollHeight in lockstep
and ignore a downward move that coincides with a shrink (new isUserScrollUp).