
test(ui,app,agent): onboarding/voice/background/interaction QA follow-ups + de-larp (#11083 follow-up) #11103

Merged
lalalune merged 10 commits into develop from shaw/fervent-knuth-55d14b
Jul 2, 2026

Conversation

@lalalune

@lalalune lalalune commented Jul 2, 2026

Member

Summary

Follow-up hardening for the onboarding/chat/voice/gesture/background epic
(PR #11083, merged as a9be4f48c70a). A 6-agent audit of that epic surfaced
a set of residual gaps + test-honesty issues; this PR closes the
locally-achievable, verifiable ones. Branched fresh off develop (9 commits, 0
behind), so it layers cleanly on top of the merged epic and develop's
subsequent shader refinement (#11088/#11102).

What's in it

Voice / #10700 — converse-mode fuzz dimension. The shell send/voice/new-chat
fuzz declared converse (VAD/semantic end-of-turn commit→send) a deferred
dimension. Added it: the fuzz now drives converse capture through the real
TurnAggregator (a complete final commits synchronously → VOICE_DM send) and
asserts lastTurnVoice clears on every new-chat; a dedicated test proves a
complete converse final sends a VOICE_DM (not a plain DM) + a negative (pure
disfluency commits but the respond-gate drops it). Also corrected the header's
mock disclosure (sendChatText is the separately-pinned send-queue leaf).

Background / #10694 residuals.

  • Redo now persists across reload (deliverable was "undo + redo, bounded,
    persisted" — redo was in-memory only).
  • Killed the e2e store-mirror larp: extracted one pure, browser-safe reducer
    (state/background-history.ts, applyBackgroundSet/Undo/Redo) used by BOTH
    useDisplayPreferences and the e2e fixture — the fixture no longer hand-mirrors
    the history semantics, so drift is impossible. Added a direct reducer unit test.
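A minimal sketch of what a shared, persistence-free history reducer of this kind could look like. The function names come from this PR; the `BackgroundConfig` and `BackgroundHistory` shapes, the equality check, and the bound of 3 are assumptions made for illustration, not the contents of state/background-history.ts.

```typescript
// Hypothetical shapes: the real module's types are not shown in this PR.
type BackgroundConfig = { kind: string; value: string };

interface BackgroundHistory {
  current: BackgroundConfig;
  undo: BackgroundConfig[]; // oldest first, most recent last
  redo: BackgroundConfig[]; // oldest first, most recent last
}

// Illustrative bound only; the real MAX_BACKGROUND_HISTORY value is not stated here.
const MAX_BACKGROUND_HISTORY = 3;

const sameConfig = (a: BackgroundConfig, b: BackgroundConfig): boolean =>
  a.kind === b.kind && a.value === b.value;

function applyBackgroundSet(h: BackgroundHistory, next: BackgroundConfig): BackgroundHistory {
  if (sameConfig(h.current, next)) return h; // no-op: identity is preserved
  const undo = [...h.undo, h.current].slice(-MAX_BACKGROUND_HISTORY);
  return { current: next, undo, redo: [] }; // a fresh edit clears redo
}

function applyBackgroundUndo(h: BackgroundHistory): BackgroundHistory {
  if (h.undo.length === 0) return h; // empty stack: no-op
  const previous = h.undo[h.undo.length - 1];
  return {
    current: previous,
    undo: h.undo.slice(0, -1),
    redo: [...h.redo, h.current].slice(-MAX_BACKGROUND_HISTORY),
  };
}

function applyBackgroundRedo(h: BackgroundHistory): BackgroundHistory {
  if (h.redo.length === 0) return h; // empty stack: no-op
  const next = h.redo[h.redo.length - 1];
  return {
    current: next,
    undo: [...h.undo, h.current].slice(-MAX_BACKGROUND_HISTORY),
    redo: h.redo.slice(0, -1),
  };
}
```

Because the functions are pure and import nothing, both a store hook and a browser fixture can call them directly, which is the property the bullet above relies on.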

Voice-test honesty / #10726. Retired the tautological WER assertion in the
Chromium voice lanes (the mock ASR echoes the expected phrase → WER structurally
0, can never regress). The load-bearing "a real WAV reached ASR" assert stays;
WER accuracy is scored only in the real-recognizer tiers.
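To illustrate why the retired assertion could never fail, here is a small word-error-rate sketch. `wordErrorRate` and `echoMock` are hypothetical helpers written for this note, not the repository's scoring code: when the mock returns the expected phrase verbatim, the edit distance is zero regardless of the audio.

```typescript
// Word-level Levenshtein distance divided by the reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // prev[j] = distance between the first i-1 reference words and first j hypothesis words.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + substitution,
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Hypothetical stand-in for a mock ASR that echoes the expected phrase.
const echoMock = (expectedPhrase: string): string => expectedPhrase;

const expected = "open the launcher and start a new chat";
const werAgainstMock = wordErrorRate(expected, echoMock(expected)); // always 0
```

Any threshold such as `wer <= 0.34` is satisfied trivially by `werAgainstMock`, so the check only carries information when the hypothesis comes from a real recognizer.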

Chat UI regression gates.

Interaction de-larp / #10722.

  • Real drag-to-reorder launcher e2e: a genuine Framer Reorder.Item pointer
    drag that fires reorder telemetry, changes the tile order, persists to
    LAUNCHER_STORAGE_KEY, and drops/duplicates no ids (verified live: 0→23
    reorder events, 25 unique ids). The mock-based Launcher.gestures.test and
    use-pull-gesture.test are relabelled as explicitly logic-only (they no
    longer overstate gesture-pipeline coverage).
  • View-capability gate de-larped: it was vacuously true (passed on any
    VIEW_ACTION_MAP entry — every view has one) and blind to the 8 spatial views
    (which instrument via a spatial agent= prop → data-agent-id, not useAgentElement, so they
    passed as 0-control for free — documents actually has 8 registrations counted
    as 0). Replaced with a proportional density gate over both DOM + spatial
    dialects (>= ceil(controls / 4) registrations), calibrated with headroom over
    the densest real view, plus a teeth/positive-control test that FAILS an
    8-control/1-registration source. Render-based coverage stays with the running-
    shell crawler (@elizaos/agent must not import 14 leaf view packages).
  • WebKit lane: opt-in (PLAYWRIGHT_WEBKIT=1) Safari-engine project in the
    ui-smoke config, scoped to the permission-free pointer/focus/text-input specs.

CI. app-real-e2e.yml now exports ELIZA_CHROME_PATH (resolved from the
chromium it already installs) so the nightly live-streaming lane stops
self-skipping forever.

Evidence (verified against this develop base)

  • Unit: packages/ui touched suites 38/38 (background-history reducer,
    redo-persist round-trip, converse fuzz) + packages/agent view-capability
    20/20. ui + app typecheck clean (one pre-existing develop-wide
    plugin-local-inference dist-staleness error, untouched here).
  • E2E (real browser, green on this base): launcher drag-reorder
    (run-launcher-e2e — real Framer drag → telemetry + persistence + no dup ids),
    home-screen pull-down (run-home-screen-e2e), chat-sheet no-fill
    (run-chat-sheet-e2e). Committed screenshots + walkthrough webms.

Honest deferrals (N/A with reason — not larped)

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Shaw and others added 10 commits July 1, 2026 22:47
The shell fuzz's own header declared converse (VAD/semantic end-of-turn
commit→send) a deferred follow-up dimension — a residual against the no-residuals
standard. Added it:
- the interleaved fuzz now drives converse capture (a complete final routes
  through the REAL TurnAggregator → synchronous commit → VOICE_DM send) alongside
  dictation, and asserts lastTurnVoice is cleared after every new-chat (invariant
  (d));
- a dedicated test proves a complete converse final sends a VOICE_DM (not a plain
  DM), sets lastTurnVoice, and a new-chat mid-converse clears the flag without
  orphaning the capture; plus a negative — pure disfluency commits but the
  respond-gate drops it, so nothing sends.
lastTurnVoice is internal (not on the public controller return), so it's observed
through its real consumer boundary (the useShellVoiceOutput arg), not by exposing
new public state. Also corrected the header's mock disclosure: sendChatText is
stubbed here and the send-QUEUE race is pinned separately in
useChatSend.send-voice-newchat.race — this suite proves the controller lifecycle,
not that leaf. 9/9 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The #10694 deliverable is "undo + redo, bounded, persisted", but the redo stack
was in-memory only. Persist it symmetrically with the undo history (same
bound + data-URL quota cap) via loadBackgroundRedo/saveBackgroundRedo, so "step
forward" survives a reload just like "step back" does. New test: edit→edit→undo,
remount (reload), redo restores the undone config.
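A rough sketch of the symmetric persistence described above, assuming a localStorage-like key/value store. The `loadBackgroundRedo`/`saveBackgroundRedo` names come from this commit; the storage key, the bound, the entry type, and the in-memory store are hypothetical, and the data-URL quota cap is left out.

```typescript
// Minimal storage surface (a subset of the Web Storage API).
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const REDO_KEY = "example:background-redo"; // hypothetical key
const MAX_ENTRIES = 20; // hypothetical bound

function saveBackgroundRedo(store: KeyValueStore, redo: string[]): void {
  // Keep only the newest entries so the persisted stack stays bounded.
  store.setItem(REDO_KEY, JSON.stringify(redo.slice(-MAX_ENTRIES)));
}

function loadBackgroundRedo(store: KeyValueStore): string[] {
  const raw = store.getItem(REDO_KEY);
  if (raw === null) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    if (!Array.isArray(parsed)) return [];
    return parsed
      .filter((entry): entry is string => typeof entry === "string")
      .slice(-MAX_ENTRIES);
  } catch {
    return []; // corrupt payload: degrade to an empty redo stack
  }
}

// In-memory stand-in for localStorage, used to simulate a reload.
function createMemoryStore(): KeyValueStore {
  const data = new Map<string, string>();
  return {
    getItem: (key) => data.get(key) ?? null,
    setItem: (key, value) => {
      data.set(key, value);
    },
  };
}
```

A reload is then just a second `loadBackgroundRedo` call against the same store, which is the round trip the new test exercises.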

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nes (#10726)

voice-realaudio.spec asserted asr.detail.wer <= 0.34 against a Chromium page.route
ASR mock that echoes the expected phrase verbatim — WER is structurally 0, so the
assertion could never catch a regression (a real-accuracy claim made against a
mock standing in for the thing under test). Removed it; the load-bearing proof in
this lane stays (a real captured WAV reached ASR + the stage passed). Documented
in voice-selftest that its transcript-content check proves pipeline PROPAGATION,
not accuracy. WER accuracy is scored only in the real-recognizer tiers
(plugin-local-inference *.real.test.ts + voice:matrix hardware lanes).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e2e mirror-larp, #10694)

The background e2e fixture hand-mirrored useDisplayPreferences' set/undo/redo
push-pop semantics, so mirror-vs-real drift was invisible (audit larp finding).
Extracted the semantics into a pure, persistence-free module
(state/background-history.ts: applyBackgroundSet/Undo/Redo + MAX) that BOTH the
real store (useDisplayPreferences) and the browser e2e fixture now call — one
implementation, no drift, and it stays browser-safe for esbuild (no persistence
import graph). Added a direct reducer unit test (set/undo/redo, no-op identity,
redo-cleared-by-edit, empty-stack no-ops, bound). MAX_BACKGROUND_HISTORY now
lives in the reducer module (re-exported from persistence for existing sites).
ui typecheck clean; background history/persistence 29/29; background integration
e2e green (regenerated screenshots + walkthrough.webm).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… + fix a rebase-orphaned swipeRight

- #10706: added a REAL CDP-touch pull-DOWN on home-notification-pull-zone that
  opens the NotificationCenter sheet (asserts closed→open→closed), and re-settles
  home before the rail swipe. Previously only jsdom synthetic pointer events
  covered the pull-down.
- Fixed a rebase artifact: the inner-pager mouse-drag test (develop #11065) called
  a `swipeRight(locator)` helper that no longer had a definition after the rebase
  onto develop — only `swipeLeft` survived. Added the mirrored `swipeRight` so the
  runner (and its CI lane) stops crashing with ReferenceError. Full home-screen
  e2e green (7 screenshots).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The backdrop-blur gate bans blur but not a background fill, so a re-added
bg-black*/bg-white/10 on the floating transcript bubbles would slip past it
(audit gap). Added a computed-style assertion in the chat-sheet e2e: with the
populated thread MAXIMIZED, every message bubble's computed backgroundColor must
be transparent (implementation-agnostic — catches a fill re-added by any class,
not just a known class name). 12 bubbles asserted; full chat-sheet e2e green.
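The computed-style check could be expressed along these lines. This is a sketch, not the code in run-chat-sheet-e2e: `isTransparentColor` and `filledBubbleIndexes` are hypothetical helpers, and it assumes computed colors arrive in the legacy comma-separated `rgb()`/`rgba()` serialization.

```typescript
// True when a computed background-color string has no visible fill.
function isTransparentColor(computed: string): boolean {
  const value = computed.trim().toLowerCase();
  if (value === "transparent") return true;
  const rgba = /^rgba\(\s*[\d.]+\s*,\s*[\d.]+\s*,\s*[\d.]+\s*,\s*([\d.]+)\s*\)$/.exec(value);
  return rgba !== null && Number(rgba[1]) === 0;
}

// Returns the indexes of bubbles whose computed background is a visible fill.
function filledBubbleIndexes(computedColors: string[]): number[] {
  return computedColors
    .map((color, index) => (isTransparentColor(color) ? -1 : index))
    .filter((index) => index >= 0);
}
```

In a browser runner the input list would come from `getComputedStyle(bubble).backgroundColor` for each bubble, so the gate stays independent of which class name re-added the fill.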

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…the mock gesture tests (#10722)

The audit flagged Launcher.gestures.test.tsx and use-pull-gesture.test.ts as
gesture-pipeline larp — they mock motion/react and fabricate PointerEvents, so
they cannot catch drag/reorder/pointer-capture breakage yet presented as gesture
coverage.

- Real coverage: extended run-launcher-e2e.mjs with a GENUINE pointer drag on a
  Framer Reorder.Item (in edit mode) and assert it fires `reorder` telemetry,
  actually changes the tile order, PERSISTS the new order to
  LAUNCHER_STORAGE_KEY, and drops/duplicates no ids. Verified live: real drag
  0→23 reorder events, order views→activity, 25 unique persisted ids.
- Honest labels: Launcher.gestures.test is now explicitly the onReorder/onDragEnd
  BRIDGE-LOGIC suite (what the Launcher does with a gesture result), and
  use-pull-gesture.test is explicitly LOGIC-ONLY (pure resolvePull/resolveSwipe +
  the rAF-coalescing #9141 contract) — both point at the real CDP-touch runners
  for the actual pointer pipeline. No more overstating.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…10722), live-e2e chrome path

- #10713: the per-message COPY test asserted only the "Copied" affordance; now it
  reads navigator.clipboard.readText() back and asserts it equals the assistant
  text (the context already grants clipboard-read) — proving bytes reached the
  clipboard, not just that a label flipped.
- #10722 WebKit: added an opt-in (PLAYWRIGHT_WEBKIT=1) WebKit/Safari-engine lane
  to the ui-smoke config, scoped to the keyless, permission-free pointer/focus/
  text-input specs (chat-overlay-controls-interactions, conversation-management,
  slash-commands) so iOS/Safari pointer regressions are catchable; gated so a
  machine without the WebKit browser download never reds the default lane.
- CI: the nightly app-real-e2e ubuntu job set ELIZA_LIVE_TEST=1 but never
  ELIZA_CHROME_PATH, so the live streaming suite self-skipped forever. Resolve the
  chromium the job already installs via playwright-core and export
  ELIZA_CHROME_PATH (with a test -x guard so a mismatch fails loudly).
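One way to express the opt-in WebKit gating is to build the project list from the environment. The sketch below uses plain objects shaped like Playwright project entries (`name`, `use.browserName`, `testMatch`); the `buildProjects` helper and the glob patterns are assumptions for illustration, and the real ui-smoke config may be structured differently.

```typescript
type BrowserName = "chromium" | "webkit";

interface ProjectSketch {
  name: string;
  use: { browserName: BrowserName };
  testMatch?: string[];
}

function buildProjects(env: Record<string, string | undefined>): ProjectSketch[] {
  const projects: ProjectSketch[] = [
    { name: "chromium", use: { browserName: "chromium" } },
  ];
  // Opt-in only: a machine without the WebKit download never sees this project.
  if (env.PLAYWRIGHT_WEBKIT === "1") {
    projects.push({
      name: "webkit",
      use: { browserName: "webkit" },
      // Hypothetical globs for the three permission-free specs named above.
      testMatch: [
        "**/chat-overlay-controls-interactions.spec.ts",
        "**/conversation-management.spec.ts",
        "**/slash-commands.spec.ts",
      ],
    });
  }
  return projects;
}
```

A config file would pass `process.env` to `buildProjects` and hand the result to the `projects` field, so the default lane is unchanged unless the variable is set.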

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…over both dialects (#10722)

The static view-capability audit was vacuous: `isReachable` passed if a view had
any VIEW_ACTION_MAP entry (every audited view does → the assertion was
unconditionally true), and the DOM-only regex was blind to the 8 spatial views
(documents/inbox/goals/health/finances/relationships/todos/focus) that
instrument via a spatial `agent=` prop → `data-agent-id`, not `useAgentElement`,
so they passed as 0-control "cosmetic" for free (documents actually has 8
registrations the old grep counted as 0).

Replaced it with a proportional DENSITY gate: a control-bearing view must
register >= ceil(controls / 4) agent-addressable elements, counting controls +
registrations across BOTH dialects (DOM handlers/buttons + spatial agent= props).
Cap calibrated against the densest real view (orchestrator ~2.7 controls/reg) for
~1.5x headroom (no false fails) while still failing an under-instrumented view.
Added a teeth/positive-control test (an 8-control/1-registration source FAILS —
the exact case the old check let through) and honest describe/header labels
stating it proves static registration density, not runtime hittability. Render-
based coverage stays with the running-shell crawler (scripts/view-audit) — see
the agent's rationale: @elizaos/agent must not import 14 leaf view packages
(dependency inversion) and a bare jsdom render would false-fail. 20/20 green.
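The proportional gate reduces to a small piece of arithmetic. The sketch below assumes controls and registrations have already been counted across both dialects; `ViewCounts`, `requiredRegistrations`, and `passesDensityGate` are hypothetical names, and the sample counts are illustrative rather than measured.

```typescript
interface ViewCounts {
  controls: number; // DOM handlers/buttons plus spatial controls
  registrations: number; // useAgentElement plus spatial agent= props
}

const CONTROLS_PER_REGISTRATION = 4; // the cap stated in this commit

function requiredRegistrations(controls: number): number {
  return Math.ceil(controls / CONTROLS_PER_REGISTRATION);
}

function passesDensityGate(view: ViewCounts): boolean {
  if (view.controls === 0) return true; // not a control-bearing view
  return view.registrations >= requiredRegistrations(view.controls);
}

// Teeth case from the commit: 8 controls with 1 registration needs ceil(8 / 4) = 2.
const underInstrumented = passesDensityGate({ controls: 8, registrations: 1 }); // false

// Illustrative dense view at about 2.7 controls per registration.
const denseButCovered = passesDensityGate({ controls: 27, registrations: 10 }); // true
```

With a cap of 4 and the densest real view at roughly 2.7 controls per registration, the ratio 4 / 2.7 gives the approximately 1.5x headroom mentioned above.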

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… deterministic BACKGROUND scenario (#10722, #10694)

Completes the follow-up CI wiring so the new lanes actually run, and lands the
deferred deterministic BACKGROUND scenario:

- test.yml: install WebKit and run the opt-in `webkit` project over the keyless
  chat pointer/focus/composer specs (PLAYWRIGHT_WEBKIT=1) — without this step the
  WebKit lane never ran anywhere.
- ui-e2e-gate.yml: gate the launcher real-drag reorder e2e (test:launcher-e2e),
  add the components/pages/** path trigger, and upload its output-launcher
  artifacts.
- deterministic-background-actions.scenario.ts: the pr-deterministic lane
  coverage of the REAL plugin-app-control BACKGROUND handler — named-color + hex
  set, GLSL shader preset (text + explicit `preset`), a live-shader uniform
  tweak, undo, redo, reset — asserting the exact ordered `background:apply`
  broadcast ledger. Verified green locally (86ms). README updated.
- background-set-color / background-shader-undo-redo (plugin-app-control,
  lane:"live-only"): NL→BACKGROUND routing variants for the live lane, matching
  the existing app-control live-scenario convention (excluded from PR CI; need
  the designated live model — gpt-oss-120b under-routes them, same as the sibling
  app-list live scenario).
- run-chat-sheet-e2e: strengthen the #10698 no-fill gate to walk the WHOLE
  per-message wrapper chain (not just the immediate parent), so a fill re-added at
  any wrapper level is caught. Verified: 24 wrapper entries, all transparent.
- Launcher.gestures.test comment: correct the runner filename
  (run-launcher-e2e.mjs section 2b, gated in ui-e2e-gate.yml).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lalalune lalalune force-pushed the shaw/fervent-knuth-55d14b branch from 8463e24 to c32e433 Compare July 2, 2026 02:50

@lalalune lalalune merged commit 03f9cf6 into develop Jul 2, 2026
32 of 65 checks passed
@lalalune lalalune deleted the shaw/fervent-knuth-55d14b branch July 2, 2026 02:58
lalalune pushed a commit that referenced this pull request Jul 2, 2026
…ion (#11112 WebKit lane → 9/9)

The 'transcript text is selectable' spec had a SECOND
toHaveCSS('user-select', 'text') that #11103 missed when it fixed the first:
WebKit's getComputedStyle reports only the prefixed -webkit-user-select and
returns '' for the unprefixed property, so the assert failed on WebKit even
though the app correctly emits BOTH (base.css select-text). Probe the prefixed
property with an unprefixed fallback; the behavioral range-selection assert
below is the real proof. Full WebKit pointer/focus lane now 9/9 (was 3/9),
Chromium unaffected.
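The prefixed-first probe can be sketched as a tiny resolver. `StyleLike` mirrors only the `getPropertyValue` method of `CSSStyleDeclaration`; `resolveUserSelect` is a hypothetical helper, not the spec's actual code.

```typescript
// Subset of CSSStyleDeclaration needed for the probe.
interface StyleLike {
  getPropertyValue(name: string): string;
}

// Prefer the prefixed property, fall back to the unprefixed one.
function resolveUserSelect(style: StyleLike): string {
  const prefixed = style.getPropertyValue("-webkit-user-select");
  return prefixed !== "" ? prefixed : style.getPropertyValue("user-select");
}
```

In a page context the argument would be `getComputedStyle(element)`, and the spec would compare the result with `'text'` instead of asserting the unprefixed property directly.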

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
lalalune added a commit that referenced this pull request Jul 2, 2026
…t slash-menu/reload (lane 3/9 → 9/9) (#11225)

* fix(ui/chat): focusing the composer opens the overlay again — boot-race in expand()'s reveal gate (#11112)

[MAJOR, live regression on develop, both engines] Focusing the chat composer
textarea no longer flipped the overlay to data-open="true". Root cause (not a
stale suppress-ref, not an element divergence — aria-label="message" and
data-testid="chat-composer-textarea" are the same textarea): expand() early-
returns when hasRevealableThread is false (visibleMessages empty && not
loading). On /chat the overlay becomes focusable BEFORE the restored
conversation's messages arrive, so a focus→expand() no-op'd — and focus is a
one-shot event, so the sheet never opened even after the 34 messages loaded.
Playwright trace confirmed: locator.focus fired while /api/conversations and
.../messages were still in flight. The jsdom test passed because it renders
the controller with messages already present, so the gate never tripped.

Fix: park the open-intent (pendingExpandOnRevealRef) when there's nothing to
reveal yet; a reveal-edge effect consumes it (one-shot) when the thread
becomes showable — but only if the composer is STILL focused, so an abandoned
focus can't pop the sheet open later. The suppressExpandOnFocusRef contract is
untouched (a pill-open keyboard-raise consumes the suppress flag before expand
runs, so it never parks an intent). Focusing a genuinely empty new chat still
doesn't open an empty sheet.
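The parked-intent idea can be modelled as a small pure state machine. This is a simplified sketch under stated assumptions: the state shape and transition names are hypothetical, the real code uses refs and an effect rather than reducers, and the suppressExpandOnFocusRef path is omitted.

```typescript
interface OverlayState {
  open: boolean;
  hasRevealableThread: boolean;
  composerFocused: boolean;
  pendingExpandOnReveal: boolean;
}

const initialState: OverlayState = {
  open: false,
  hasRevealableThread: false,
  composerFocused: false,
  pendingExpandOnReveal: false,
};

// Focus either opens the sheet or parks the intent when nothing is showable yet.
function onComposerFocus(s: OverlayState): OverlayState {
  const focused = { ...s, composerFocused: true };
  return s.hasRevealableThread
    ? { ...focused, open: true }
    : { ...focused, pendingExpandOnReveal: true };
}

function onComposerBlur(s: OverlayState): OverlayState {
  return { ...s, composerFocused: false };
}

// Reveal edge: consume the parked intent once, and only if focus is still held.
function onThreadRevealable(s: OverlayState): OverlayState {
  const next = { ...s, hasRevealableThread: true, pendingExpandOnReveal: false };
  return s.pendingExpandOnReveal && s.composerFocused ? { ...next, open: true } : next;
}
```

In this model a focus that lands before the messages load still opens the sheet once they arrive, while a focus that was abandoned in the meantime does not.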

Reproduced on real Chromium (chat-overlay-controls-interactions 'long
transcript scrolls': 16.5s timeout-fail → 2.5s pass, 4/4). +3 jsdom regression
tests; overlay 126/126, fuzz 119/119; run-chat-sheet-e2e PASSED.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(ui/chat): surface slash-catalog fetch failures instead of swallowing them (#11112 diagnosis)

The slash-command controller degraded a failed catalog / custom-actions fetch
to [] with a silent .catch(() => []), making a fetch failure
indistinguishable from a genuinely empty catalog — the menu just never mounts.
That is exactly what made #11112's WebKit slash-menu failure hard to diagnose
(the real cause: the service worker wasn't bypassed for Playwright routes on
WebKit, so /api/* hit the real stub serving commands:[] — fixed in the ui-smoke
config alongside the reload-persistence bug). Now both catches console.error a
[useSlashCommandController]-prefixed message + the error before degrading; the
composer still works catalog-less. filterCommandsForSurface authorization
gating untouched. +5 controller unit tests (engine-agnostic: commands resolve
whenever the catalog resolves incl. requiresAuth/requiresElevated under trusted
defaults; unauthorized senders still lose gated commands; empty resolves
silently; failed fetch degrades AND surfaces). 36/36.
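A minimal sketch of degrading while still surfacing the failure. The `[useSlashCommandController]` prefix is taken from this commit; `loadCatalogOrEmpty`, the injected `logError` callback, and the message wording are assumptions made so the example runs without a browser.

```typescript
type ErrorLogger = (message: string, error: unknown) => void;

// Resolve the catalog, or log the failure and fall back to an empty list.
async function loadCatalogOrEmpty<T>(
  fetchCatalog: () => Promise<T[]>,
  logError: ErrorLogger,
): Promise<T[]> {
  try {
    return await fetchCatalog();
  } catch (error) {
    // Surface the failure so it is distinguishable from an empty catalog.
    logError("[useSlashCommandController] failed to load slash-command catalog", error);
    return [];
  }
}
```

Passing `console.error` as `logError` gives the behavior described above; an empty but successful response still resolves silently because the catch branch never runs.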

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test(app): block the service worker on the WebKit ui-smoke lane (#11112 findings 1 & 2)

The WebKit pointer/focus lane's slash-menu (finding 1) and conversation-
reload-persistence (finding 2) failures share ONE root cause: the ui-smoke
stack serves the PROD renderer, which registers /sw.js (skipWaiting +
clients.claim). WebKit — unlike Chromium — does NOT bypass a controlling
service worker when page.route interception is active, so once the SW claims
the page every /api/* fetch goes AROUND the per-spec route fixtures to the real
stub server (verified via an in-page probe: a route-fulfilled
/api/conversations returned the stub server's conversations, not the fixture's).
So slash listCommands resolved the stub's empty catalog (menu never mounted)
and the reload rehydrated a foreign thread (timeout). Added
serviceWorkers: 'block' to the webkit project — parity with the existing
desktop-webkit lane that already documents this exact hazard. Config-only; both
specs stay pristine. WebKit: slash 4/4 + conversation 3/3 (previously 0/4, 0/3);
Chromium unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test(app): probe -webkit-user-select on the sibling selectable assertion (#11112 WebKit lane → 9/9)

The 'transcript text is selectable' spec had a SECOND
toHaveCSS('user-select', 'text') that #11103 missed when it fixed the first:
WebKit's getComputedStyle reports only the prefixed -webkit-user-select and
returns '' for the unprefixed property, so the assert failed on WebKit even
though the app correctly emits BOTH (base.css select-text). Probe the prefixed
property with an unprefixed fallback; the behavioral range-selection assert
below is the real proof. Full WebKit pointer/focus lane now 9/9 (was 3/9),
Chromium unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: moon <stupidlybadadvice@gmail.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

❌ PR title does not match the required pattern. Please use one of these formats:

  • 'type: description' (e.g., 'feat: add new feature')
  • 'type(scope): description' (e.g., 'chore(core): update dependencies')
    Valid types: feat, fix, docs, style, refactor, perf, test, build, ci, chore, revert, release

@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

LifeOps Benchmark — eliza

Run ID: lifeops-eliza-28561873721

LifeOps Benchmark

Model: gpt-oss-120b
Judge: claude-opus-4-7
Scenarios: 25
pass@1: 0.000
pass@k: 0.000
Total cost: $0.0000

Full artifacts: see the lifeops-run-eliza-28561873721 upload on this run.

@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

LifeOps Benchmark — hermes

Run ID: lifeops-hermes-28561873721

LifeOps Benchmark

Model: gpt-oss-120b
Judge: claude-opus-4-7
Scenarios: 25
pass@1: 0.240
pass@k: 0.240
Total cost: $0.9128

Full artifacts: see the lifeops-run-hermes-28561873721 upload on this run.
