fix(chat): keep transcript following the live tail (scrollHeight-shrink clamp) — launch-to-msg flake [+ investigation diagnostics] by weishi-imbue · Pull Request #210 · imbue-ai/forever-claude-template

weishi-imbue · 2026-06-25T23:39:40Z

Summary

Fixes the intermittent launch-to-msg mac-runner flake (~12–25%): the agent's Slack reply (CI MOCK: greetings from the localhost slack mock.) did not appear in the live chat until the page was reloaded.

Root cause (frontend — proven by instrumentation, server stack fully exonerated):
The agent reads Slack and reports the message correctly every 30s; the server broadcasts and SSE-forwards each report in real time; the client receives and appends them. But ChatPanel virtualizes the transcript and stops following the live tail when its scroll-follow logic latches userScrolledUp = true. Trigger: while pinned to the bottom, async row measurement settles a row shorter than its estimate → scrollHeight shrinks → the browser clamps scrollTop down → that emits a scroll event the naive scrollTop < previousScrollTop check misread as the user scrolling up. Once latched, following never resumes (the viewport never returns to the bottom because new content keeps pushing it down), so new messages render outside the virtualized window and never enter the DOM until a reload re-anchors at the tail. (This is why canned_body_after_reload was always true, and why it never reproduced manually — a real session keeps following the tail.)

✅ The actual fix — review these

models/scrollFollow.ts — new pure isUserScrollUp({scrollTop, previousScrollTop, scrollHeight, previousScrollHeight}): a downward scrollTop move that coincides with a scrollHeight shrink is a browser clamp, not user intent.
views/ChatPanel.ts — track previousScrollHeight in lockstep with previousScrollTop at every programmatic scroll site; use isUserScrollUp in handleScrollEvent.
models/scrollFollow.test.ts — unit tests for the clamp case (+ shrink/grow/down cases).
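
A minimal sketch of the clamp check described above, assuming the four fields named in the summary are plain pixel numbers. The function name and argument fields come from this PR; the body and the ScrollSample type name are illustrative assumptions, not the code in models/scrollFollow.ts.

```typescript
// Hypothetical type name; the four fields are the ones listed in the PR summary.
interface ScrollSample {
  scrollTop: number;
  previousScrollTop: number;
  scrollHeight: number;
  previousScrollHeight: number;
}

// Returns true only when scrollTop decreased AND the content did not shrink.
// A decrease that coincides with a scrollHeight shrink is treated as the
// browser clamping scrollTop to the new maximum, not as user intent.
function isUserScrollUp(s: ScrollSample): boolean {
  const scrollTopDecreased = s.scrollTop < s.previousScrollTop;
  const contentShrank = s.scrollHeight < s.previousScrollHeight;
  return scrollTopDecreased && !contentShrank;
}

// Clamp case: a row measured 100px shorter than estimated while pinned to the bottom.
const clampCase = isUserScrollUp({
  scrollTop: 900,
  previousScrollTop: 1000,
  scrollHeight: 1900,
  previousScrollHeight: 2000,
});

// Genuine scroll-up: scrollTop decreased while scrollHeight stayed the same.
const userCase = isUserScrollUp({
  scrollTop: 900,
  previousScrollTop: 1000,
  scrollHeight: 2000,
  previousScrollHeight: 2000,
});
```

In this sketch a user who scrolls up in the same frame as a shrink is also ignored; that trade-off seems acceptable because the next scroll event without a shrink would still be detected.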

🟡 Other real fixes found during the hunt (kept; NOT the flake cause)

Legitimate SSE-correctness fixes surfaced while chasing this; the flake persisted after each, so none was the cause. Worth keeping but please sanity-review:

server.py _get_events — make the tail snapshot self-consistent (offset/len/total from one read): fixes a tail-anchor snapshot race.
session_watcher.py is_main_session_event — default an unknown session to MAIN instead of dropping it: fixes a main-session-rotation live-filter drop.
Response.ts / event_queues.py — appendEvents returns a bool; queue replay-on-register.
StreamingMessage.ts — SSE staleness watchdog (force-reconnect a half-open/zombie stream after 25s) + visibility/focus reconnect. The visibility-reconnect did not fix the flake; the staleness watchdog is a real robustness improvement. Decide whether to keep.

🔴 Diagnostics — TO BE STRIPPED before merge

Pure instrumentation added to capture the failure mechanism; not for production:

hang_watchdog.py + main.py wiring (faulthandler all-thread dump on :8000 unreachable)
server.py: [diag-gen] per-event logging, GET /diag/threads, GET /diag/sessions/<agent_id>
session_watcher.py: [diag-poll] heartbeat/crash logging
event_queues.py: [diag-sse] broadcast/register logging (+ event_id/session in the broadcast log)
StreamingMessage.ts: __sseDiag/__sseDiagMeta counters
ChatPanel.ts: __chatDiag/__chatScrollDiag (incl. a clampsSuppressed counter used to verify the fix)

Verification

In progress: a 12-run launch-to-msg batch against current main + this fix, with __chatScrollDiag giving mechanistic confirmation — the clamp actually occurs (clampsSuppressed > 0) and userScrolledUp never falsely latches (latchedUp = 0) — not just a pass count (the flake is too low-frequency for pass-count alone). Paired diagnostics live in mngr branch wz/diagnose-si-hang-probe.

🤖 Generated with Claude Code

… wedge Temporary instrumentation to root-cause the intermittent launch-to-msg slack hang. Self-probes the server port; on repeated failure dumps all thread stacks to /tmp/system_interface_hang_dump.txt + stderr so we see the exact stuck stack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Captures whether the conversation SSE stream keeps receiving events (broadcast) and whether the chat client disconnects without reconnecting (unregister with no later register) -- to localize the chat-UI freeze in the output path.

Root cause of the intermittent launch-to-msg chat freeze: _get_events reads get_tail_events(), get_total_event_count() and the offset non-atomically. During active streaming a new event can land between the tail read and the total read, so total > offset+len. The client's TranscriptStore.append then treats the window as 'not at the tail' (hasMoreAfter) and SILENTLY DROPS every subsequent live event -- the chat freezes until a reload re-fetches a consistent snapshot. A tail load defines the live end as of that read; newer events arrive over the SSE stream (buffered during the fetch). So for a tail load, total is now offset+len(events), keeping the client tail-anchored. Paging loads still report the real total. Includes a [diag-tail] log when the race is detected, to confirm the mechanism on CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
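
The arithmetic behind this race can be sketched as follows. This is an illustrative model only: the real fix lives in the Python server (_get_events), and the EventWindow type and tailSnapshot helper are hypothetical names. Only hasMoreAfter and the total = offset + len(events) rule come from the commit message.

```typescript
// Hypothetical shape of what a tail load reports to the client.
interface EventWindow {
  offset: number; // index of the first event in the window
  count: number;  // number of events in the window
  total: number;  // total event count reported alongside the window
}

// Client-side view: the window is "not at the tail" when the reported total
// extends past the end of the window.
function hasMoreAfter(w: EventWindow): boolean {
  return w.total > w.offset + w.count;
}

// Tail-load rule from the commit message: derive total from the same read
// that produced the window, so the snapshot is self-consistent.
function tailSnapshot(offset: number, count: number): EventWindow {
  return { offset, count, total: offset + count };
}

// Racy snapshot: one event landed between the tail read and the total read.
const racy: EventWindow = { offset: 90, count: 10, total: 101 };
// Consistent snapshot built from a single read.
const consistent = tailSnapshot(90, 10);
```

With the racy window the client concludes there is more after the window and stops appending live events; with the consistent one it stays tail-anchored and the extra event arrives over the SSE stream instead.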

…E diag
Combines the two robustness fixes for the chat-freeze flake:
- server _get_events keeps the tail snapshot tail-anchored (the snapshot race).
- StreamingMessage staleness watchdog reconnects a zombie SSE (no frame for 25s), with the server keepalive as a visible data event so the watchdog sees liveness.
Adds window.__sseDiag per-agent counters (received/buffered/rendered/dropped/reconnects/errors/inFlightActive/lastId) so the e2e can read, on a freeze, exactly what the client did with live events -- ground truth for any remaining failure.

…ent __sseDiag Single branch carrying both proven fixes, the staleness watchdog, and the window.__sseDiag client instrumentation, to capture the residual chat-freeze with complete ground truth (client received/rendered/dropped + server [diag-*] logs).

… recovery) The staleness watchdog is a setInterval, which Chromium throttles/suspends while the chat WebContents is hidden/occluded -- so a stream that dies while the window is backgrounded (e.g. while the requests panel is driven) is never recovered by the timer. Add a visibilitychange/focus handler that force-reconnects any quiet stream the moment the window is foregrounded again. Adds __sseDiagMeta.ticks to confirm whether the timer was running.
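
A rough sketch of the watchdog logic with the timer factored out, so the same check can be driven by setInterval and by a visibilitychange/focus handler. The class and method names are hypothetical and the clock is injected for testability; this is not the StreamingMessage.ts implementation. The 25s threshold is the value stated in this PR.

```typescript
// Hypothetical helper: tracks the time of the last SSE frame and triggers a
// reconnect when the stream has been quiet for at least thresholdMs.
class StalenessWatchdog {
  private lastFrameAt: number;
  private readonly thresholdMs: number;
  private readonly reconnect: () => void;
  private readonly now: () => number;

  constructor(thresholdMs: number, reconnect: () => void, now: () => number = Date.now) {
    this.thresholdMs = thresholdMs;
    this.reconnect = reconnect;
    this.now = now;
    this.lastFrameAt = now();
  }

  // Call on every received frame, including server keepalives.
  noteFrame(): void {
    this.lastFrameAt = this.now();
  }

  // Call from the periodic timer and when the window is foregrounded again.
  // Returns true when a reconnect was triggered.
  check(): boolean {
    if (this.now() - this.lastFrameAt < this.thresholdMs) {
      return false;
    }
    this.reconnect();
    this.lastFrameAt = this.now(); // restart the quiet period after reconnecting
    return true;
  }
}
```

In a browser, check() would be called from a setInterval callback and again from visibilitychange and focus listeners, so a stream that went quiet while the timer was throttled is recovered as soon as the window returns to the foreground.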


A row measuring shorter than its estimate shrinks scrollHeight while the transcript is pinned to the bottom; the browser clamps scrollTop down and emits a scroll event that the naive scrollTop<previousScrollTop check read as the user scrolling up. That latched userScrolledUp, so the panel stopped following the live tail and new messages rendered outside the virtualized window (visible only after a reload). Track previousScrollHeight in lockstep and ignore a downward move that coincides with a shrink (new isUserScrollUp).

weishi-imbue and others added 13 commits June 24, 2026 03:04

diag: log per-event generator got/forward + broadcast event_id/session for SSE delivery correlation

50ec365

diag: log watcher poll-thread heartbeat + crash traceback ([diag-poll]) to pin where live broadcast stops

6b667e5

diag: add /diag/threads endpoint to dump all thread stacks (locate blocked watcher poll thread)

343a05c

diag: add /diag/sessions/<agent_id> to dump raw transcripts (main + subagents) for orchestration analysis

253f025

diag: expose chat virtualization/scroll-follow state (__chatDiag/__chatScrollDiag) to confirm tail rows virtualized out

4594b16

Merge remote-tracking branch 'origin/main' into wz/fix-sse-all

71de209

weishi-imbue wants to merge 13 commits into main from wz/fix-sse-all
