fix(chat): keep transcript following the live tail (scrollHeight-shrink clamp) — launch-to-msg flake [+ investigation diagnostics]#210
Draft
weishi-imbue wants to merge 13 commits into
… wedge Temporary instrumentation to root-cause the intermittent launch-to-msg slack hang. Self-probes the server port; on repeated failure dumps all thread stacks to /tmp/system_interface_hang_dump.txt + stderr so we see the exact stuck stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Captures whether the conversation SSE stream keeps receiving events (broadcast) and whether the chat client disconnects without reconnecting (unregister with no later register) -- to localize the chat-UI freeze in the output path.
Root cause of the intermittent launch-to-msg chat freeze: _get_events reads get_tail_events(), get_total_event_count() and the offset non-atomically. During active streaming a new event can land between the tail read and the total read, so total > offset+len. The client's TranscriptStore.append then treats the window as 'not at the tail' (hasMoreAfter) and SILENTLY DROPS every subsequent live event -- the chat freezes until a reload re-fetches a consistent snapshot. A tail load defines the live end as of that read; newer events arrive over the SSE stream (buffered during the fetch). So for a tail load, total is now offset+len(events), keeping the client tail-anchored. Paging loads still report the real total. Includes a [diag-tail] log when the race is detected, to confirm the mechanism on CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
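The race described in this commit message can be illustrated with a minimal TypeScript model. It is a sketch only: `EventWindow`, `racyTailWindow`, and `consistentTailWindow` are hypothetical names, and the real `_get_events` lives in the Python server. Only the `hasMoreAfter` condition and the `total = offset + len(events)` rule are taken from the message above.

```typescript
// Hypothetical window descriptor; the field names are assumptions for illustration.
interface EventWindow {
  offset: number; // index of the first event in the window
  count: number;  // number of events returned in the window
  total: number;  // total event count reported alongside the window
}

// The client treats the window as "not at the tail" when it ends before total.
function hasMoreAfter(w: EventWindow): boolean {
  return w.offset + w.count < w.total;
}

// Racy shape: total comes from a separate read made after the tail read,
// so an event that lands in between makes total exceed offset + count.
function racyTailWindow(offset: number, count: number, totalReadLater: number): EventWindow {
  return { offset, count, total: totalReadLater };
}

// Tail-anchored shape: total is derived from the same read as the events.
function consistentTailWindow(offset: number, count: number): EventWindow {
  return { offset, count, total: offset + count };
}

// One event arrived between the two reads: 10 + 3 = 13, but total says 14.
const raced = racyTailWindow(10, 3, 14);
const anchored = consistentTailWindow(10, 3);
console.log(hasMoreAfter(raced));    // true: live events would be dropped
console.log(hasMoreAfter(anchored)); // false: client stays tail-anchored
```

In this model the event that arrived during the fetch is not lost by the tail-anchored window; per the commit message it is expected to arrive over the SSE stream instead.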
…E diag Combines the two robustness fixes for the chat-freeze flake:
- server _get_events keeps the tail snapshot tail-anchored (the snapshot race).
- StreamingMessage staleness watchdog reconnects a zombie SSE (no frame for 25s), with the server keepalive as a visible data event so the watchdog sees liveness.
Adds window.__sseDiag per-agent counters (received/buffered/rendered/dropped/reconnects/errors/inFlightActive/lastId) so the e2e can read, on a freeze, exactly what the client did with live events -- ground truth for any remaining failure.
…ent __sseDiag Single branch carrying both proven fixes, the staleness watchdog, and the window.__sseDiag client instrumentation, to capture the residual chat-freeze with complete ground truth (client received/rendered/dropped + server [diag-*] logs).
… recovery) The staleness watchdog is a setInterval, which Chromium throttles/suspends while the chat WebContents is hidden/occluded -- so a stream that dies while the window is backgrounded (e.g. while the requests panel is driven) is never recovered by the timer. Add a visibilitychange/focus handler that force-reconnects any quiet stream the moment the window is foregrounded again. Adds __sseDiagMeta.ticks to confirm whether the timer was running.
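A rough sketch of the staleness check behind the watchdog and the foreground handler, written as a DOM-free class so the logic is easy to test. `StalenessWatchdog`, `onFrame`, and `check` are hypothetical names, not the ones in `StreamingMessage.ts`; only the 25s threshold comes from the commit messages above.

```typescript
// 25s threshold taken from the commit message; everything else is assumed.
const STALE_AFTER_MS = 25000;

class StalenessWatchdog {
  private lastFrameAt: number;
  private readonly reconnect: () => void;
  private readonly staleAfterMs: number;

  constructor(reconnect: () => void, now: number, staleAfterMs: number = STALE_AFTER_MS) {
    this.reconnect = reconnect;
    this.lastFrameAt = now;
    this.staleAfterMs = staleAfterMs;
  }

  // Call for every SSE frame, including server keepalives.
  onFrame(now: number): void {
    this.lastFrameAt = now;
  }

  // Call from the periodic timer AND from the foreground handler.
  // Returns true when a reconnect was triggered.
  check(now: number): boolean {
    if (now - this.lastFrameAt < this.staleAfterMs) {
      return false;
    }
    // Restart the clock so a dead stream does not reconnect on every tick.
    this.lastFrameAt = now;
    this.reconnect();
    return true;
  }
}
```

In a browser, `check(Date.now())` would presumably be driven both by a `setInterval` callback and by `visibilitychange`/`focus` listeners, so a stream that went quiet while the timer was throttled is still caught as soon as the window returns to the foreground.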
…n for SSE delivery correlation
…]) to pin where live broadcast stops
…ocked watcher poll thread)
…ubagents) for orchestration analysis
…atScrollDiag) to confirm tail rows virtualized out
A row measuring shorter than its estimate shrinks scrollHeight while the transcript is pinned to the bottom; the browser clamps scrollTop down and emits a scroll event that the naive scrollTop<previousScrollTop check read as the user scrolling up. That latched userScrolledUp, so the panel stopped following the live tail and new messages rendered outside the virtualized window (visible only after a reload). Track previousScrollHeight in lockstep and ignore a downward move that coincides with a shrink (new isUserScrollUp).
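The clamp check described in the last commit message could look roughly like the following. This is a minimal sketch, not the code in `models/scrollFollow.ts`: the `ScrollSample` interface name and the exact comparison are assumptions; only the function name and its four fields come from the PR.

```typescript
// Hypothetical parameter type; the PR lists these four fields but not their types.
interface ScrollSample {
  scrollTop: number;
  previousScrollTop: number;
  scrollHeight: number;
  previousScrollHeight: number;
}

// Returns true only when scrollTop decreased without the content shrinking.
// A decrease that coincides with a scrollHeight shrink is treated as the
// browser clamping scrollTop, not as the user scrolling up.
function isUserScrollUp(s: ScrollSample): boolean {
  const movedUp = s.scrollTop < s.previousScrollTop;
  const contentShrank = s.scrollHeight < s.previousScrollHeight;
  return movedUp && !contentShrank;
}

// Clamp case: a row settled 100px shorter, so both values dropped together.
console.log(isUserScrollUp({
  scrollTop: 900, previousScrollTop: 1000,
  scrollHeight: 1900, previousScrollHeight: 2000,
})); // false

// Real user scroll: scrollTop dropped while the content height was unchanged.
console.log(isUserScrollUp({
  scrollTop: 900, previousScrollTop: 1000,
  scrollHeight: 2000, previousScrollHeight: 2000,
})); // true
```

One trade-off of this form: a genuine upward scroll that happens in the same event as a shrink is ignored. The caller has to update `previousScrollHeight` together with `previousScrollTop`, otherwise the shrink comparison runs against a stale height.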
Summary
Fixes the intermittent `launch-to-msg` mac-runner flake (~12–25%): the agent's Slack reply (`CI MOCK: greetings from the localhost slack mock.`) never appeared in the live chat, only after a reload.
Root cause (frontend; proven by instrumentation, server stack fully exonerated):
The agent reads Slack and reports the message correctly every 30s; the server broadcasts and SSE-forwards each report in real time; the client receives and appends them. But `ChatPanel` virtualizes the transcript and stops following the live tail when its scroll-follow logic latches `userScrolledUp = true`. Trigger: while pinned to the bottom, async row measurement settles a row shorter than its estimate → `scrollHeight` shrinks → the browser clamps `scrollTop` down → that emits a scroll event, which the naive `scrollTop < previousScrollTop` check misreads as the user scrolling up. Once latched, following never resumes (the viewport never returns to the bottom because new content keeps pushing the bottom further down), so new messages fall outside the virtualized window and never enter the DOM until a reload re-anchors at the tail. (This is why `canned_body_after_reload` was always true, and why it never reproduced manually: a real session keeps following the tail.)
✅ The actual fix (review these)
- `models/scrollFollow.ts`: new pure `isUserScrollUp({scrollTop, previousScrollTop, scrollHeight, previousScrollHeight})`. A downward `scrollTop` move that coincides with a `scrollHeight` shrink is a browser clamp, not user intent.
- `views/ChatPanel.ts`: track `previousScrollHeight` in lockstep with `previousScrollTop` at every programmatic scroll site; use `isUserScrollUp` in `handleScrollEvent`.
- `models/scrollFollow.test.ts`: unit tests for the clamp case (+ shrink/grow/down cases).
🟡 Other real fixes found during the hunt (kept; NOT the flake cause)
Legitimate SSE-correctness fixes surfaced while chasing this; the flake persisted after each, so none was the cause. Worth keeping but please sanity-review:
- `server.py` `_get_events`: make the tail snapshot self-consistent (offset/len/total from one read); fixes a tail-anchor snapshot race.
- `session_watcher.py` `is_main_session_event`: default an unknown session to MAIN instead of dropping it; fixes a main-session-rotation live-filter drop.
- `Response.ts` / `event_queues.py`: `appendEvents` returns a bool; queue replay-on-register.
- `StreamingMessage.ts`: SSE staleness watchdog (force-reconnect a half-open/zombie stream after 25s) + visibility/focus reconnect. The visibility-reconnect did not fix the flake; the staleness watchdog is a real robustness improvement. Decide whether to keep.
🔴 Diagnostics: TO BE STRIPPED before merge
Pure instrumentation added to capture the failure mechanism; not for production:
- `hang_watchdog.py` + `main.py` wiring (faulthandler all-thread dump on `:8000` unreachable)
- `server.py`: `[diag-gen]` per-event logging, `GET /diag/threads`, `GET /diag/sessions/<agent_id>`
- `session_watcher.py`: `[diag-poll]` heartbeat/crash logging
- `event_queues.py`: `[diag-sse]` broadcast/register logging (+ event_id/session in the broadcast log)
- `StreamingMessage.ts`: `__sseDiag` / `__sseDiagMeta` counters
- `ChatPanel.ts`: `__chatDiag` / `__chatScrollDiag` (incl. a `clampsSuppressed` counter used to verify the fix)
Verification
In progress: a 12-run `launch-to-msg` batch against current `main` + this fix, with `__chatScrollDiag` giving mechanistic confirmation, not just a pass count (the flake is too low-frequency for pass-count alone): the clamp actually occurs (`clampsSuppressed > 0`) and `userScrolledUp` never falsely latches (`latchedUp = 0`). Paired diagnostics live in mngr branch `wz/diagnose-si-hang-probe`.
🤖 Generated with Claude Code