docs: Add UI performance optimizations design document #801

Open
swaroopvarma1 wants to merge 1 commit into release from claude/buddy-widget-ui-perf-hNBgu

Conversation

@swaroopvarma1 swaroopvarma1 commented Jun 1, 2026

Collaborator

Summary

Add a comprehensive design document outlining performance optimization strategies for the Breeze Buddy widget's UI generation pipeline. This document surveys current architecture, benchmarks against industry standards (Vercel json-render, Google A2UI, Tambo), and proposes a phased rollout of optimizations targeting token economy, perceived TTFR, and healer simplification.

Key Changes

  • New document: docs/widget/UI_PERF_OPTIMIZATIONS.md (343 lines)
    • Current architecture recap with ranked cost breakdown (LLM token throughput is the dominant bottleneck at 2-4s per carousel)
    • Competitive analysis of 6 open-source generative UI frameworks and their techniques
    • 16 optimization recommendations grouped into 4 tiers (A–D) by impact and effort
    • Phased rollout sequence with effort estimates and latency projections
    • Measurement plan with specific metrics to track (ttft_ms, ttfui_ms, cache_read_input_tokens, etc.)
    • Verification that all recommendations preserve the generic spec-stream contract

Notable Implementation Details

  • Tier A (Token Economy): Proposes adopting json-render's state + elements split with $item binding and repeat directives to achieve 5–10× LLM output reduction on list-rendering turns; server-side speculative emission to drop TTFR to ~100ms on tool-result-driven turns; compact wire form shorthand to save ~60% tokens per op.

  • Tier B (Per-Prop Streaming): Tambo-style per-prop status and partial-JSON parsing to drop perceived TTFR to ~200–400ms by rendering skeletons before full props arrive.

  • Tier C (Constrained Decoding): Tool-call escape hatch and JSON Schema mode to eliminate malformed-JSON and unknown-op healer drops entirely.

  • Tier D (Free Wins): LRU caching of primitives section, Anthropic prompt cache markers, moving _ui_examples to cached system prompt, short-circuit validation, and emits_ui flag to save 50–150ms TTFT on prose-only turns.

  • Rollout sequence: 6 phases from 2-day "free wins" baseline through 2-week data/structure split, with per-phase effort, latency win, and goal clearly stated.

  • Measurement plan: Defines 6 key metrics (ttft_ms, ttfui_ms, ttlui_ms, llm_output_tokens, cache_read_input_tokens, ui_op_dropped) with p50/p95/p99 tracking and per-phase tagging for A/B analysis.

  • Preservation of contract: Explicitly verifies that every recommendation is additive and maintains the catalog as single source of truth, template allowlists, and per-template UI customization.
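The Tier A state + elements split can be pictured with a minimal sketch. The op shape, the `repeat` key, and the `$item.<field>` binding syntax used here are assumptions for illustration only; they do not reproduce the json-render wire format or the contract proposed in the document. The idea is that the model emits one templated element, and a small resolver expands it into one flat `add` op per row of state.

```python
from typing import Any


def resolve(value: Any, item: dict[str, Any]) -> Any:
    """Replace "$item.<field>" strings with the matching field of `item`."""
    if isinstance(value, str) and value.startswith("$item."):
        return item[value[len("$item."):]]
    if isinstance(value, dict):
        return {key: resolve(inner, item) for key, inner in value.items()}
    if isinstance(value, list):
        return [resolve(inner, item) for inner in value]
    return value


def expand_repeat(template: dict[str, Any], state: dict[str, Any]) -> list[dict[str, Any]]:
    """Expand one templated element into one flat "add" op per state row."""
    rows = state[template["repeat"]]
    return [
        {
            "op": "add",
            "id": f"{template['id']}-{index}",
            "type": template["type"],
            "parent": template["parent"],
            "props": resolve(template["props"], row),
        }
        for index, row in enumerate(rows)
    ]


# Hypothetical state and template; field names are placeholders.
state = {"products": [{"name": "Kettle", "price": 1299}, {"name": "Toaster", "price": 2499}]}
template = {
    "id": "tile",
    "type": "Tile",
    "parent": "root",
    "repeat": "products",
    "props": {"title": "$item.name", "price": "$item.price"},
}
ops = expand_repeat(template, state)
```

Because the expansion yields ordinary flat ops, a sketch like this would leave the downstream op-application path unchanged, which is one way to read the "additive" guarantee above.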

This document serves as a reference for prioritizing UI performance work and aligns the team on industry best practices before implementation begins.

https://claude.ai/code/session_01PXYLybfLwxL3jrR3mPvoY9

Summary by CodeRabbit

  • Documentation
    • Added technical documentation outlining performance optimization strategies for the Buddy Widget, including pipeline analysis, industry best practices survey, tiered improvement recommendations, rollout phases, and measurement methodology.

Survey of vercel-labs/json-render, google/A2UI v0.9+v0.10, tambo-ai/tambo,
constrained decoding (XGrammar/OpenAI strict), and prompt caching. Maps each
external technique to a tiered set of optimisations for our SpecStream
pipeline — from free wins (catalog lru_cache, Anthropic cache_control,
validate_props short-circuit) through wire compaction and per-prop
streaming, up to the structural state+elements split that the modern
frameworks all converge on.

https://claude.ai/code/session_01PXYLybfLwxL3jrR3mPvoY9

coderabbitai Bot commented Jun 1, 2026

Walkthrough

A new documentation file outlines a draft optimization plan to reduce Buddy Widget UI generation latency through tiered improvements: token economy, per-prop streaming, constrained generation, and caching strategies. The plan includes a phased rollout sequence, measurement metrics for TTFT/latency/token efficiency, and explicit guarantees that the generic SpecStream contract remains intact.

Changes

UI Performance Optimizations RFC

| Layer | File(s) | Summary |
|---|---|---|
| Current Architecture and Problem Statement | docs/widget/UI_PERF_OPTIMIZATIONS.md | Document header, scope, and overview of the current SpecStream-driven UI generation pipeline identifying major time costs in prompt construction, JIT instruction injection, op-line healing/parsing, and rendering. |
| Open-Source Survey and Related Approaches | docs/widget/UI_PERF_OPTIMIZATIONS.md | Survey of existing UI streaming and generative UI protocols (json-render, A2UI, tambo, constrained decoding, prompt caching, partial JSON parsing) with patterns applicable to widget generation. |
| Tiered Optimization Recommendations | docs/widget/UI_PERF_OPTIMIZATIONS.md | Four tiers of detailed optimization strategies (token economy, per-prop streaming, constrained generation, caching/short-circuits) with proposed contract changes, pipeline impact, risk/effort assessments, and SpecStream contract preservation constraints. |
| Rollout Phases and Measurement Plan | docs/widget/UI_PERF_OPTIMIZATIONS.md | Phased implementation sequence (phases 0–6) mapping optimizations to latency wins; measurement metrics for TTFT/TTFUI/TTLUI, output token breakdown, cache read tracking, and ui_op_dropped telemetry with p50/p95/p99 comparisons. |
| Preserved Guarantees and References | docs/widget/UI_PERF_OPTIMIZATIONS.md | Statement of preserved generic functionality (fallback behaviors, overridability, additive changes, catalog/templates as source of truth) and complete source references for frameworks and performance specifications. |

Estimated Code Review Effort

🎯 1 (Trivial) | ⏱️ ~8 minutes

Poem

🐰 A document for the Widget's race,
Where latency we'll embrace,
Token streams and caches tight,
Constrained decoding in the night—
The SpecStream keeps its contract true,
Hop on through! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'docs: Add UI performance optimizations design document' accurately and specifically describes the main change: adding a new design document focused on UI performance optimizations for the Buddy widget. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (5)
docs/widget/UI_PERF_OPTIMIZATIONS.md (5)

37-42: 💤 Low value

Add language specifier to code block.

The fenced code block starting at line 37 should specify a language (likely jsonl or json) for proper syntax highlighting:

-```
+```jsonl
 {"op":"add","path":"/root","value":"dashboard"}

As per coding guidelines, fenced code blocks should have a language specified.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/widget/UI_PERF_OPTIMIZATIONS.md` around lines 37 - 42, The fenced code
block containing the JSON Patch lines (the objects with "op":"add",
"path":"/root", etc.) needs a language specifier for syntax highlighting; update
the block that currently starts with ``` to ```jsonl (or ```json if you prefer)
so the snippet is rendered with proper JSON/JSONL highlighting—locate the block
containing the JSON Patch entries and replace the opening backticks with
```jsonl.

224-227: 💤 Low value

Verify effort estimate for D1 caching implementation.

The "30 minutes" effort estimate for adding @lru_cache seems optimistic. While adding the decorator is trivial, properly validating that:

  • The frozenset keying correctly handles all template allowlist variations
  • Cache hit rates are acceptable in production
  • Cache invalidation works correctly when templates change
  • Memory usage is bounded appropriately

...typically requires more than 30 minutes of work, including testing and validation.

Consider revising to "2-4 hours" to account for proper testing and validation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/widget/UI_PERF_OPTIMIZATIONS.md` around lines 224 - 227, Update the
effort estimate for the D1 caching change from "30 minutes" to "2-4 hours" and
annotate that this time covers adding and verifying the `@lru_cache` usage,
validating the frozenset keying for all template allowlist permutations,
measuring cache hit rates in production-like scenarios, implementing and testing
cache invalidation when templates change, and confirming memory usage bounds;
reference the D1 caching change, `@lru_cache` decorator, frozenset keying, and
template allowlist in the note so reviewers know what validation tasks are
expected.

275-288: ⚡ Quick win

Clarify cumulative latency improvements and phase dependencies.

The "Latency win" column lists per-phase improvements, but it's unclear whether these are:

  • Additive (Phase 2's 30-40% reduction + Phase 3's TTFR improvement = ~X total)
  • Independent (each measured against baseline)
  • Multiplicative (Phase 4's 5-10× reduction applies to the output remaining after Phase 2's 30-40% reduction)

Consider adding a "Cumulative TTFT target" column showing the expected total latency after each phase completes. For example:

  • Baseline: ~3s TTFT
  • After Phase 0: ~2.6s (-400ms)
  • After Phase 2: ~1.8s (additional 30% reduction on output tokens)
  • After Phase 4: ~600ms (5-10× reduction on lists)

Also, explicitly note dependencies: Phase 5 (A2) is listed as "stacked on phase 4," but what about Phase 3's widget-side changes? Does Phase 4 depend on Phase 3, or are they independent?
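The difference between additive and multiplicative composition can be made concrete with a small calculation. This is a sketch only: the phase labels and figures are placeholders echoing the example above, not measurements, and the `fixed_ms` / `scale` distinction is an assumed way of modelling the two kinds of saving.

```python
from typing import Literal

Phase = tuple[str, Literal["fixed_ms", "scale"], float]


def cumulative_latency_ms(baseline_ms: float, phases: list[Phase]) -> list[tuple[str, float]]:
    """Apply each phase in order and record the running latency after it."""
    current = baseline_ms
    timeline = [("baseline", current)]
    for label, kind, amount in phases:
        if kind == "fixed_ms":
            current -= amount  # additive saving, independent of what remains
        elif kind == "scale":
            current *= amount  # multiplicative saving on the remaining latency
        else:
            raise ValueError(f"unknown phase kind: {kind}")
        timeline.append((label, current))
    return timeline


# Placeholder figures; they are not measurements.
phases: list[Phase] = [("phase 0", "fixed_ms", 400.0), ("phase 2", "scale", 0.7)]
for label, latency in cumulative_latency_ms(3000.0, phases):
    print(f"{label}: {latency:.0f} ms")
```

Swapping the order of a fixed and a scaled phase changes the result, which is why the table would need to state both the composition rule and the phase dependencies.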

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/widget/UI_PERF_OPTIMIZATIONS.md` around lines 275 - 288, The table's
"Latency win" entries are ambiguous about whether improvements are additive,
independent, or multiplicative; update the rollout table to add a "Cumulative
TTFT target" column that computes expected total latency after each phase (use
the Baseline ~3s TTFT example and show numeric outcomes after Phase 0, Phase 1,
Phase 2, Phase 3, Phase 4, etc.), and explicitly annotate phase dependencies
(e.g., mark Phase 5/A2 as "requires Phase 4", specify whether Phase 4 depends on
Phase 3's widget-side changes or is independent, and note which phases compound
vs. measured against baseline). Ensure you update the Phase rows (Phase 0..6)
and add short dependency flags like "stacked on", "independent", or "requires"
next to unique identifiers such as D1, D2, D3, A1, A2, A3, B1, B2, C1, C2 so
readers can see cumulative effects and prerequisite relationships.

136-140: ⚡ Quick win

Quantify the token reduction claim with concrete examples.

The claim "5-10× output token reduction on list-rendering turns" is compelling but needs supporting calculation. Consider adding a worked example showing:

  • Current: 8 full Tile ops with token count
  • Proposed: 1 template + 1 data array with token count
  • Actual reduction ratio

This would strengthen the case for prioritizing A1 and help validate the effort estimate.
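A worked example along these lines could start from a rough script such as the one below. It is a sketch under stated assumptions: the Tile props, the `repeat` key, and the `$item` bindings are hypothetical, and serialized character count is used as a crude proxy for tokens (a real comparison would use the provider's tokenizer). The data array is excluded from the templated total on the assumption that it is emitted server-side.

```python
import json

# Hypothetical tile data for a carousel of 8 items.
tiles = [
    {
        "title": f"Product {i}",
        "subtitle": f"Category {i}",
        "image": f"https://example.com/img/{i}.png",
        "price": 100 + i,
    }
    for i in range(8)
]


def compact(obj: dict) -> str:
    """Serialize without whitespace, as one op line."""
    return json.dumps(obj, separators=(",", ":"))


# Current form: the model writes one full op per tile.
flat_ops = [
    compact({"op": "add", "id": f"p{i}", "type": "Tile", "parent": "root", "props": tile})
    for i, tile in enumerate(tiles)
]

# Assumed templated form: the model writes one op with bindings.
template_op = compact({
    "op": "add",
    "id": "tile",
    "type": "Tile",
    "parent": "root",
    "repeat": "items",
    "props": {key: f"$item.{key}" for key in tiles[0]},
})

flat_chars = sum(len(line) for line in flat_ops)
template_chars = len(template_op)
ratio = flat_chars / template_chars
print(f"flat={flat_chars} chars, templated={template_chars} chars, ratio={ratio:.1f}x")
```

The ratio grows with the number of rows and with the length of the per-row values, so any quoted multiplier should state the list length and prop sizes it assumes.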

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/widget/UI_PERF_OPTIMIZATIONS.md` around lines 136 - 140, Add a short
worked example that quantifies the "5-10× output token reduction" claim by
showing concrete token counts for both formats: (1) current approach: 8 full
Tile ops (show per-Tile token count and total), and (2) proposed approach: 1
template op + 1 data array (show template token count, data-array token count
and total), then compute the reduction ratio; place this example in the "Massive
optimisation" paragraph that discusses emitting the `set_data` op server-side
and reference the carousel / list-rendering scenario and the
`set_data`/`$item`/`repeat` symbols so readers can see the assumptions behind
the calculation.

124-132: 💤 Low value

Add language specifier to code block.

The fenced code block starting at line 124 should specify a language (likely jsonl or json):

-```
+```jsonl
 <ui_stream>
 {"op":"add","id":"root","type":"Carousel"}

As per coding guidelines, fenced code blocks should have a language specified.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/widget/UI_PERF_OPTIMIZATIONS.md` around lines 124 - 132, The fenced code
block that begins with the <ui_stream> operations is missing a language
specifier; update the opening triple-backtick for that block to include a
language (e.g., "jsonl" or "json") so the block reads ```jsonl (or ```json)
before the <ui_stream> line to satisfy the coding guideline for specifying
fenced-code languages.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/widget/UI_PERF_OPTIMIZATIONS.md`:
- Around line 308-320: Add explicit contract tests to the rollout measurement
plan (Section 5) that validate the fallback behaviors described in Section 6:
create tests for A1 ensuring the LLM can still emit flat-op form for
heterogeneous lists and one-off UI elements; for A2 ensure speculative emission
is overridden by LLM final ops including edge cases like identical id/different
type and partial property overrides; for C1/C2 validate constrained generation
schemas permit every case allowed by the existing healer (no false negatives);
and for D4 add a short-circuit validation test that confirms background Pydantic
validation still catches all issues. Include these as a "Contract Tests"
subsection and reference the specific guarantees (A1, A2, C1, C2, D4) so they
run in CI during each phase rollout.
- Around line 143-148: Update the speculative UI emission proposal to mitigate
jarring UX by: add a rate-limit/suppression threshold (e.g. skip speculative
emit if refinement < ~200ms) around the emission logic that currently uses
ToolUiHint.examples[0]; surface a clear visual indicator/placeholder state for
speculative content so users know it's a refinement in progress; implement
interaction guards that debounce or disable actions on speculative elements and
queue user actions to be re-applied after the LLM <ui_stream> arrives; ensure
the widget op-merge semantics for ids follow the described replace/add/remove
behavior and tag emitted ops with a "speculative" flag so analytics can compute
a refinement delta metric (how often final <ui_stream> changes ids or values).
- Around line 293-304: Add explicit baseline measurements and new user-perceived
and cache-effectiveness metrics to the Measurement plan: record p50/p95/p99
baselines for each existing metric (ttft_ms, ttfui_ms, ttlui_ms,
llm_output_tokens, cache_read_input_tokens, ui_op_dropped) before Phase 0, and
add speculative_refinement_delta_ms, speculative_change_rate, and
layout_shift_score to capture user-perceived impact of speculative emits; also
add cache_hit_rate and cache_miss_rate alongside cache_read_input_tokens for
relative cache effectiveness, and expand the SSE tagging proposal to include
per-request flags for which specific optimizations (e.g., D1, D2, D3, D4) are
active in addition to the overall optimisation phase tag.
- Around line 194-202: Summarize and mitigate vendor lock-in for the C1
"render_ui" tool-call approach: assess parity of tool-call streaming semantics
across providers (Anthropic partial_json vs OpenAI/Gemini), list required
fallback behavior for the existing <ui_stream> marker path, and estimate added
maintenance/migration cost if Anthropic changes; update docs in
UI_PERF_OPTIMIZATIONS.md (section C1) to either (a) define a clear provider
abstraction strategy with interfaces/feature-detection to switch between
render_ui tool-call and <ui_stream> marker handling, or (b) explicitly mark C1
as Anthropic-only with rationale and ongoing maintenance estimate, and adjust
the "Effort: 1 week" estimate accordingly.
- Around line 13-27: Update the UI_PERF_OPTIMIZATIONS doc to correct the
inaccuracies: mention that template/builder.py calls _splice_ui_primitives(...)
at the noted location and mcp/__init__.py:_maybe_inject_ui_instructions injects
_ui_instructions/_ui_examples (already present), change the UiStreamExtractor
carry buffer description to reference _CARRY_MAX = max(len("<ui_stream>"),
len("</ui_stream>")) - 1 instead of a fixed "16-char" value, clarify that
validate_props (Pydantic) is applied to "add" ops while "replace" only undergoes
weaker checks in parse_op_line with stronger validation deferred to the widget,
and make the cost table estimates explicit by labeling timings as approximate
(or add profiling methodology/conditions and a "last verified" timestamp) rather
than asserting exact seconds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 72f828d3-bf19-4813-918a-ac85bbf4ac2e

📥 Commits

Reviewing files that changed from the base of the PR and between 311ae9b and 0147693.

📒 Files selected for processing (1)
  • docs/widget/UI_PERF_OPTIMIZATIONS.md

Comment on lines +13 to +27
1. `template/builder.py:351` splices the rendered `## Available primitives` section (`ui_prompt.py:render_primitives_section`) into the system prompt. Built every turn, no cache.
2. Tool results are mutated post-hoc in `mcp/__init__.py:_maybe_inject_ui_instructions` — `_ui_instructions` + `_ui_examples` keys spliced into the JSON envelope (JIT pattern).
3. LLM streams text. Inside `<ui_stream>…</ui_stream>` markers it emits one JSON op per line: `{"op":"add","id":"p1","type":"Tile","parent":"root","props":{...}}`.
4. `chat/ui_stream.py:UiStreamExtractor` is a stateful FSM with a 16-char carry buffer. Each complete line → `heal_op_line` (deterministic rule pass) → `parse_op_line` → `validate_props` (full Pydantic) → `ui_op` SSE event to the widget.
5. Widget applies ops to a session-stateful tree and renders.

**Where the seconds actually go (ranked):**

| Source | Cost | Mitigable? |
|---|---|---|
| LLM token throughput on verbose Tile JSON | **~2-4s for a typical carousel of 8 tiles** | Yes — biggest lever |
| `_ui_examples` re-injected per tool call, never cached | 1-10KB tokens per turn | Yes |
| Primitives section rebuilt per turn | Tokens + ~5-20ms CPU | Yes — trivial |
| Per-line full Pydantic validation | 1-5ms × N lines | Yes — short-circuit |
| No prompt-cache markers | Re-tokenised every turn | Yes — provider feature |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify that the referenced implementation files and symbols exist

echo "=== Checking referenced files ==="
fd -t f "builder.py|ui_stream.py|ui_healer.py|agent.py|ui_prompt.py|ui_catalog.py" --exec echo "Found: {}"

echo -e "\n=== Checking for referenced symbols ==="
rg -n "def _maybe_inject_ui_instructions" --type py
rg -n "class UiStreamExtractor" --type py
rg -n "def render_primitives_section" --type py
rg -n "def heal_op_line" --type py

Repository: juspay/clairvoyance

Length of output: 1013


🏁 Script executed:

#!/bin/bash
set -euo pipefail

cd "$(git rev-parse --show-toplevel)"

echo "=== Confirming exact line numbers mentioned in docs ==="
# Show definition line numbers and surrounding context
python3 - <<'PY'
import pathlib, re
paths = [
  "app/ai/voice/agents/breeze_buddy/template/builder.py",
  "app/ai/voice/agents/breeze_buddy/mcp/__init__.py",
  "app/ai/voice/agents/breeze_buddy/chat/ui_stream.py",
  "app/ai/voice/agents/breeze_buddy/template/ui_prompt.py",
  "app/ai/voice/agents/breeze_buddy/chat/ui_healer.py",
  "app/ai/voice/agents/breeze_buddy/chat/agent.py",
]
for p in paths:
    path = pathlib.Path(p)
    if not path.exists():
        print("MISSING", p)
        continue
    lines = path.read_text().splitlines()
    def find(patterns):
        for i,l in enumerate(lines, start=1):
            for pat in patterns:
                if re.search(pat, l):
                    return i,l
        return None,None
    if p.endswith("builder.py"):
        i,l = find([r"render_primitives_section", r"Available primitives", r"system prompt", r"splic"])
        if i:
            print(p, "first_match_line", i, l.strip()[:160])
        # also print around line 351 specifically
        target=351
        print("\n--", p, "around line", target, "--")
        lo=max(1,target-20); hi=min(len(lines),target+20)
        for k in range(lo,hi+1):
            print(f"{k:5d}: {lines[k-1]}")
    elif p.endswith("mcp/__init__.py"):
        # find definition
        for i,l in enumerate(lines, start=1):
            if re.search(r"def _maybe_inject_ui_instructions", l):
                print(p, "def_line", i)
                print("\n-- around", i, "--")
                lo=max(1,i-10); hi=min(len(lines),i+10)
                for k in range(lo,hi+1):
                    print(f"{k:5d}: {lines[k-1]}")
                break
    elif p.endswith("ui_stream.py"):
        for i,l in enumerate(lines, start=1):
            if re.search(r"class UiStreamExtractor", l):
                print(p, "class_line", i)
                print("\n-- around", i, "--")
                lo=max(1,i-10); hi=min(len(lines),i+10)
                for k in range(lo,hi+1):
                    print(f"{k:5d}: {lines[k-1]}")
                break
    elif p.endswith("ui_prompt.py"):
        for i,l in enumerate(lines, start=1):
            if re.search(r"def render_primitives_section", l):
                print(p, "def_line", i)
                break
    elif p.endswith("ui_healer.py"):
        for i,l in enumerate(lines, start=1):
            if re.search(r"def heal_op_line", l):
                print(p, "def_line", i)
                break
PY

echo
echo "=== Checking docs table line 21-27 content ==="
nl -ba docs/widget/UI_PERF_OPTIMIZATIONS.md | sed -n '1,120p'

Repository: juspay/clairvoyance

Length of output: 4513


🏁 Script executed:

#!/bin/bash
set -euo pipefail
cd "$(git rev-parse --show-toplevel)"

DOC="docs/widget/UI_PERF_OPTIMIZATIONS.md"
echo "=== Docs excerpt (lines 1-120) ==="
python3 - <<PY
from pathlib import Path
p=Path("$DOC")
lines=p.read_text().splitlines()
for i in range(1, min(120, len(lines))+1):
    print(f"{i:4d}: {lines[i-1]}")
PY

echo
echo "=== Find _splice_ui_primitives usage + implementation ==="
rg -n "_splice_ui_primitives" app/ai/voice/agents/breeze_buddy/template/builder.py
python3 - <<'PY'
import pathlib, re
path="app/ai/voice/agents/breeze_buddy/template/builder.py"
lines=pathlib.Path(path).read_text().splitlines()
for i,l in enumerate(lines, start=1):
    if re.search(r"def _splice_ui_primitives", l):
        lo=max(1,i-25); hi=min(len(lines), i+80)
        for k in range(lo, hi+1):
            print(f"{k:5d}: {lines[k-1]}")
        break
else:
    print("def _splice_ui_primitives not found in builder.py")
PY

echo
echo "=== Verify render_primitives_section linkage ==="
rg -n "render_primitives_section" app/ai/voice/agents/breeze_buddy/template/builder.py
rg -n "render_primitives_section" app/ai/voice/agents/breeze_buddy/template/ui_prompt.py | head -n 20

echo
echo "=== UiStreamExtractor: carry buffer + JSONL per-line handling ==="
python3 - <<'PY'
import pathlib, re
path="app/ai/voice/agents/breeze_buddy/chat/ui_stream.py"
lines=pathlib.Path(path).read_text().splitlines()
# print around carry/buffer mentions and parse/validate pipeline
need_patterns=[
    r"carry",
    r"16",
    r"JsonlOpLine",
    r"heal_op_line",
    r"parse_op_line",
    r"validate_props",
    r"Pydantic",
    r"SSE",
    r"yield",
]
hits=[]
for i,l in enumerate(lines, start=1):
    for pat in need_patterns:
        if re.search(pat, l):
            hits.append((i,l.strip()))
            break
for i,l in hits[:40]:
    print(f"{i:5d}: {l}")
PY

echo
echo "=== Print UiStreamExtractor feed/flush and pipeline region ==="
python3 - <<'PY'
import pathlib, re
path="app/ai/voice/agents/breeze_buddy/chat/ui_stream.py"
lines=pathlib.Path(path).read_text().splitlines()
# locate feed() and flush()
def print_around(pattern, context=60):
    for i,l in enumerate(lines, start=1):
        if re.search(pattern, l):
            lo=max(1,i-context); hi=min(len(lines), i+context)
            for k in range(lo,hi+1):
                print(f"{k:5d}: {lines[k-1]}")
            return True
    return False

ok = print_around(r"def feed\(") or print_around(r"def flush\(")
if not ok:
    print("Could not locate feed()/flush() via regex")
PY

echo
echo "=== ui_healer: heal_op_line / validate_props (Pydantic) ==="
python3 - <<'PY'
import pathlib, re
path="app/ai/voice/agents/breeze_buddy/chat/ui_healer.py"
lines=pathlib.Path(path).read_text().splitlines()
# locate heal_op_line and parse_op_line/validate_props if present
for target in [r"def heal_op_line", r"def parse_op_line", r"validate_props", r"Pydantic", r"BaseModel", r"SSE", r"jsonschema"]:
    for i,l in enumerate(lines, start=1):
        if re.search(target, l):
            print("\n--- match:", target, "at", i, "---")
            lo=max(1,i-40); hi=min(len(lines), i+120)
            for k in range(lo,hi+1):
                print(f"{k:5d}: {lines[k-1]}")
            break
PY

Repository: juspay/clairvoyance

Length of output: 25253


🏁 Script executed:

#!/bin/bash
set -euo pipefail
cd "$(git rev-parse --show-toplevel)"

echo "=== Inspect UiStreamExtractor carry logic (_CARRY_MAX, _tail_marker_prefix) ==="
python3 - <<'PY'
import pathlib, re
path="app/ai/voice/agents/breeze_buddy/chat/ui_stream.py"
lines=pathlib.Path(path).read_text().splitlines()
# print region where _CARRY_MAX/_tail_marker_prefix/_process is defined
for i,l in enumerate(lines, start=1):
    if re.search(r"_CARRY_MAX", l):
        lo=max(1,i-20); hi=min(len(lines), i+80)
        for k in range(lo,hi+1):
            print(f"{k:5d}: {lines[k-1]}")
        break
else:
    print("No _CARRY_MAX found")
PY

echo
echo "=== Inspect parse_op_line + process_op_line pipeline (does it call healer then validate) ==="
python3 - <<'PY'
import pathlib, re
path="app/ai/voice/agents/breeze_buddy/chat/ui_stream.py"
lines=pathlib.Path(path).read_text().splitlines()
targets=["def parse_op_line","def process_op_line","def validate_props","ui_op_event","yield ui_op"]
for t in targets:
    for i,l in enumerate(lines, start=1):
        if re.search(t.replace("(","\\("), l):
            print("\n--- around", t, "at", i, "---")
            lo=max(1,i-40); hi=min(len(lines), i+120)
            for k in range(lo,hi+1):
                print(f"{k:5d}: {lines[k-1]}")
            break
PY

echo
echo "=== Find validate_props definition across repo ==="
rg -n "def validate_props|validate_props\s*=" app/ai/voice/agents/breeze_buddy --type py

echo
echo "=== Inspect mcp/_maybe_inject_ui_instructions keys ==="
python3 - <<'PY'
import pathlib, re
path="app/ai/voice/agents/breeze_buddy/mcp/__init__.py"
lines=pathlib.Path(path).read_text().splitlines()
# find function block and print relevant key names
start=None
for i,l in enumerate(lines, start=1):
    if re.search(r"def _maybe_inject_ui_instructions", l):
        start=i; break
if not start:
    print("function not found")
    raise SystemExit
lo=max(1,start); hi=min(len(lines), start+160)
for k in range(lo,hi+1):
    print(f"{k:5d}: {lines[k-1]}")
PY

echo
echo "=== Inspect render_primitives_section for caching ==="
python3 - <<'PY'
import pathlib, re
path="app/ai/voice/agents/breeze_buddy/template/ui_prompt.py"
lines=pathlib.Path(path).read_text().splitlines()
# print around render_primitives_section
for i,l in enumerate(lines, start=1):
    if re.search(r"def render_primitives_section", l):
        lo=max(1,i-40); hi=min(len(lines), i+200)
        for k in range(lo,hi+1):
            print(f"{k:5d}: {lines[k-1]}")
        break
PY

echo
echo "=== Search for prompt caching / lru_cache around render_primitives_section ==="
rg -n "lru_cache|cache|memoiz|functools" app/ai/voice/agents/breeze_buddy/template/ui_prompt.py app/ai/voice/agents/breeze_buddy/template/builder.py --type py

Repository: juspay/clairvoyance

Length of output: 37181


🏁 Script executed:

#!/bin/bash
set -euo pipefail
cd "$(git rev-parse --show-toplevel)"

echo "=== ui_catalog.validate_props (confirm Pydantic / validation scope) ==="
python3 - <<'PY'
import pathlib
path="app/ai/voice/agents/breeze_buddy/template/ui_catalog.py"
lines=pathlib.Path(path).read_text().splitlines()
# find validate_props definition line
for i,l in enumerate(lines, start=1):
    if "def validate_props" in l:
        start=max(1,i-40); end=min(len(lines), i+180)
        for k in range(start, end+1):
            print(f"{k:5d}: {lines[k-1]}")
        break
else:
    print("validate_props not found")
PY

echo
echo "=== builder.py around system_prompt splice (caching context) ==="
python3 - <<'PY'
import pathlib
path="app/ai/voice/agents/breeze_buddy/template/builder.py"
lines=pathlib.Path(path).read_text().splitlines()
# print around line ~351 and around where templates L2-cached noted
for target in [241, 330, 351, 500, 520, 530, 540]:
    if target<1 or target>len(lines): 
        continue
    lo=max(1,target-20); hi=min(len(lines), target+60)
    print(f"\n--- around {path}:{target} ---")
    for k in range(lo,hi+1):
        print(f"{k:5d}: {lines[k-1]}")
PY

echo
echo "=== Template cache module references ==="
rg -n "template/cache|L2-cache|l2_cached|cache" app/ai/voice/agents/breeze_buddy/template --type py

Repository: juspay/clairvoyance

Length of output: 34127


Fix UI_PERF_OPTIMIZATIONS recap inaccuracies and clarify perf numbers

  • The referenced files/symbols currently exist (template/builder.py:351 calls _splice_ui_primitives(...), and mcp/__init__.py:_maybe_inject_ui_instructions injects _ui_instructions / _ui_examples into JSON dict results).
  • chat/ui_stream.py:UiStreamExtractor carry isn’t a fixed “16-char” buffer; it’s derived as _CARRY_MAX = max(len("<ui_stream>"), len("</ui_stream>")) - 1.
  • The pipeline description overstates validation: validate_props (Pydantic) is applied for add ops, while replace only has weak checks in parse_op_line (strong type validation is handled on the widget side).
  • The cost breakdown table claims concrete timings (e.g., “~2-4s for … 8 tiles”) without citations/measurement methodology; label as estimates or add profiling details/conditions. Consider adding a “last verified” timestamp for these claims.
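For reference, the carry derivation in the second bullet can be illustrated with a minimal sketch. Only the `_CARRY_MAX` expression is taken from the finding above; the `split_safe` helper is hypothetical and only approximates what a tail-marker check might do, so it should not be read as the repository's implementation.

```python
OPEN_MARKER = "<ui_stream>"
CLOSE_MARKER = "</ui_stream>"
MARKERS = (OPEN_MARKER, CLOSE_MARKER)

# Same derivation as the finding above: the longest marker minus one char (12 - 1 = 11).
CARRY_MAX = max(len(OPEN_MARKER), len(CLOSE_MARKER)) - 1


def split_safe(buffer: str) -> tuple[str, str]:
    """Hypothetical helper: split buffer into (emit_now, carry).

    carry is the longest suffix that is a proper prefix of either marker,
    i.e. text that might turn into a marker once the next chunk arrives.
    Complete markers are assumed to be handled before this is called.
    """
    for n in range(min(CARRY_MAX, len(buffer)), 0, -1):
        tail = buffer[-n:]
        if any(m.startswith(tail) and tail != m for m in MARKERS):
            return buffer[:-n], tail
    return buffer, ""
```

The carry never needs to exceed `CARRY_MAX` characters because a suffix of that length plus one more character is already a complete marker.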
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/widget/UI_PERF_OPTIMIZATIONS.md` around lines 13 - 27, Update the
UI_PERF_OPTIMIZATIONS doc to correct the inaccuracies: mention that
template/builder.py calls _splice_ui_primitives(...) at the noted location and
mcp/__init__.py:_maybe_inject_ui_instructions injects
_ui_instructions/_ui_examples (already present), change the UiStreamExtractor
carry buffer description to reference _CARRY_MAX = max(len("<ui_stream>"),
len("</ui_stream>")) - 1 instead of a fixed "16-char" value, clarify that
validate_props (Pydantic) is applied to "add" ops while "replace" only undergoes
weaker checks in parse_op_line with stronger validation deferred to the widget,
and make the cost table estimates explicit by labeling timings as approximate
(or add profiling methodology/conditions and a "last verified" timestamp) rather
than asserting exact seconds.

Comment on lines +143 to +148

After a tool returns, immediately emit a default UI from `ToolUiHint.examples[0]` (or a deterministic mapping) **while the LLM is still generating**. Treat the LLM's eventual `<ui_stream>` as a refinement — same `id`s → `replace` ops, new ids → `add` ops, missing ids → `remove`.

- Effort: 2-3 days
- Latency: TTFR drops from ~1-2s to ~100ms on tool-result-driven turns
- Risk: low if op-merge semantics are correctly defined on the widget; needs explicit "speculative" tagging so analytics can measure refinement deltas
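The merge rule described above (same ids become `replace`, new ids become `add`, missing ids become `remove`) could be sketched roughly as follows. The element shape and the `refinement_ops` name are assumptions made for illustration, and skipping unchanged elements is an added choice that the proposal does not state.

```python
from typing import Any

# Hypothetical element shape: {"type": "...", "props": {...}}, keyed by element id.
Element = dict[str, Any]


def refinement_ops(
    speculative: dict[str, Element], final: dict[str, Element]
) -> list[dict[str, Any]]:
    """Diff the speculative UI against the LLM's final UI and return the ops to apply."""
    ops: list[dict[str, Any]] = []
    for element_id, element in final.items():
        if element_id not in speculative:
            ops.append({"op": "add", "id": element_id, **element})
        elif speculative[element_id] != element:
            # Assumption: identical elements are skipped to avoid needless re-renders.
            ops.append({"op": "replace", "id": element_id, **element})
    for element_id in speculative:
        if element_id not in final:
            ops.append({"op": "remove", "id": element_id})
    return ops
```

An empty result means the speculative UI was already correct, which is one way to count turns where the refinement changed nothing.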


⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoff

Consider UX implications of speculative emission.

A2 proposes emitting a default UI immediately while the LLM generates a refinement. This optimization could create a jarring user experience if the speculative UI differs significantly from the LLM's final output (flickering, layout shifts, confusing interactions if the user clicks on a speculative element that then changes).

The "speculative tagging" for analytics is mentioned, but consider also:

  • Rate-limiting or suppressing speculative emission if the refinement typically arrives within ~200ms
  • Visual indicators to the user that content is being refined
  • Handling user interactions with speculative elements that are then removed/replaced
  • Measuring the "refinement delta" (how often does the LLM override the default?) to assess if this optimization is worth the complexity
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/widget/UI_PERF_OPTIMIZATIONS.md` around lines 143 - 148, Update the
speculative UI emission proposal to mitigate jarring UX by: add a
rate-limit/suppression threshold (e.g. skip speculative emit if refinement <
~200ms) around the emission logic that currently uses ToolUiHint.examples[0];
surface a clear visual indicator/placeholder state for speculative content so
users know it's a refinement in progress; implement interaction guards that
debounce or disable actions on speculative elements and queue user actions to be
re-applied after the LLM <ui_stream> arrives; ensure the widget op-merge
semantics for ids follow the described replace/add/remove behavior and tag
emitted ops with a "speculative" flag so analytics can compute a refinement
delta metric (how often final <ui_stream> changes ids or values).

Comment on lines +194 to +202
#### C1. Tool-call escape hatch for SpecStream

Define a tool called `render_ui` whose JSON-Schema-constrained input *is* the ops list. Anthropic guarantees the tool input validates against the schema. Replaces `<ui_stream>` markers entirely — the streaming SDK emits `content_block_delta(partial_json=…)` events that we accumulate into a tool call. Healer's malformed-JSON and unknown-type drops disappear because the schema enforces them.

Trade-off: tool-call streaming is per-arg, not per-op-line. We'd accumulate the full ops list before processing. Compatible with A1's data-binding pattern; conflicts with B2's partial-JSON streaming unless we use the SDK's `partial_json` events directly.

- Effort: 1 week
- Latency: small TTFT gain (no marker parsing), large healer simplification
- Risk: medium — locks us into Anthropic-shaped tool-call semantics; Gemini / OpenAI parity needs verification
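To make the C1 trade-off concrete, here is a rough sketch of what a `render_ui` tool definition and the accumulation step might look like. The schema fields (`op`, `id`, `type`, `props`) are assumed for illustration and are not taken from the catalog; the fragments passed to `accumulate_tool_input` stand in for the `partial_json` payloads mentioned above.

```python
import json

# Hypothetical tool definition in the name / description / input_schema shape.
RENDER_UI_TOOL = {
    "name": "render_ui",
    "description": "Emit the UI ops for this turn.",
    "input_schema": {
        "type": "object",
        "properties": {
            "ops": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "op": {"type": "string", "enum": ["add", "replace", "remove"]},
                        "id": {"type": "string"},
                        "type": {"type": "string"},
                        "props": {"type": "object"},
                    },
                    "required": ["op", "id"],
                },
            }
        },
        "required": ["ops"],
    },
}


def accumulate_tool_input(fragments: list[str]) -> dict:
    """Join streamed JSON fragments and parse once the tool call is complete."""
    return json.loads("".join(fragments))
```

Parsing only after the last fragment is exactly the "accumulate the full ops list before processing" cost noted in the trade-off; per-op rendering would need an incremental parser instead.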


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Assess vendor lock-in risk for C1 tool-call approach.

C1 proposes using Anthropic's tool-call mechanism for SpecStream rendering. While the risk section notes this "locks us into Anthropic-shaped tool-call semantics," this deserves deeper consideration:

  • Gemini/OpenAI parity: Tool-call streaming behavior differs across providers. Anthropic streams partial_json within tool calls; OpenAI and Gemini may have different semantics.
  • Fallback complexity: If C1 is implemented, maintaining both the tool-call path (Anthropic) and the <ui_stream> marker path (other providers) adds significant maintenance burden.
  • Migration cost: If Anthropic's tool-call semantics change in future API versions, this could require significant rework.

Consider explicitly documenting a provider abstraction strategy or accepting that this optimization may be Anthropic-specific only. The "1 week" effort estimate may not account for maintaining dual code paths.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/widget/UI_PERF_OPTIMIZATIONS.md` around lines 194 - 202, Summarize and
mitigate vendor lock-in for the C1 "render_ui" tool-call approach: assess parity
of tool-call streaming semantics across providers (Anthropic partial_json vs
OpenAI/Gemini), list required fallback behavior for the existing <ui_stream>
marker path, and estimate added maintenance/migration cost if Anthropic changes;
update docs in UI_PERF_OPTIMIZATIONS.md (section C1) to either (a) define a
clear provider abstraction strategy with interfaces/feature-detection to switch
between render_ui tool-call and <ui_stream> marker handling, or (b) explicitly
mark C1 as Anthropic-only with rationale and ongoing maintenance estimate, and
adjust the "Effort: 1 week" estimate accordingly.

Comment on lines +293 to +304
## 5. Measurement plan

Before phase 0 ships, instrument:

- `ttft_ms` — from `/message` POST to first `assistant_token` / first `ui_op` SSE event (split metric)
- `ttfui_ms` — from `/message` POST to first `ui_op` SSE event
- `ttlui_ms` — from `/message` POST to last `ui_op` SSE event in the turn
- `llm_output_tokens` — break down by `<ui_stream>` vs prose
- `cache_read_input_tokens` — Anthropic cache hit rate after D2 lands
- `ui_op_dropped` count + reason — healer success rate; should drop after C1/C2

Compare p50/p95/p99 across phases. Tag every SSE stream with the active optimisation phase so we can A/B in production.
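A minimal sketch of the p50/p95/p99 comparison, assuming latency samples (for example `ttfui_ms`) are collected per phase as plain lists of numbers. The `latency_percentiles` name and the choice of the inclusive interpolation method are illustrative, not part of the plan.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return p50/p95/p99 for one metric; needs at least two samples."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Running this per phase tag gives directly comparable rows for the baseline and for each rollout phase.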


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add baseline measurements and user-perceived metrics.

The measurement plan is comprehensive for backend metrics, but consider adding:

  1. Baseline measurements: Before Phase 0 ships, capture current p50/p95/p99 for all six metrics to establish a quantitative baseline. The document references "~2-4s" and "~1-2s" but these should be backed by actual measurements.

  2. User-perceived quality metrics (especially for Phase 5's speculative emission):

    • speculative_refinement_delta_ms: Time between speculative UI render and LLM refinement arrival
    • speculative_change_rate: Percentage of speculative UIs that are modified by LLM
    • layout_shift_score: CLS-like metric for UI jank during refinement

  3. Cache effectiveness (for Phases 0-1), as relative rates rather than only cache_read_input_tokens, which is an absolute count:

    • cache_hit_rate: Percentage of requests hitting prompt cache
    • cache_miss_rate: Percentage missing cache

  4. Per-optimization attribution: The plan mentions "tag every SSE stream with the active optimisation phase," but consider also tagging which specific optimizations are active (e.g., a request might have D1+D2+D3 enabled but not D4).
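One possible reading of the cache-effectiveness item, sketched under the assumption that each request's usage record is available as a dict carrying a `cache_read_input_tokens` count. The function name and the "any cached tokens read counts as a hit" rule are illustrative choices.

```python
def cache_hit_rate(usages: list[dict[str, int]]) -> float:
    """Percentage of requests that read at least one token from the prompt cache."""
    if not usages:
        return 0.0
    hits = sum(1 for usage in usages if usage.get("cache_read_input_tokens", 0) > 0)
    return 100.0 * hits / len(usages)


def cache_miss_rate(usages: list[dict[str, int]]) -> float:
    """Complement of the hit rate over the same set of requests."""
    return 100.0 - cache_hit_rate(usages) if usages else 0.0
```

A token-weighted variant (cached tokens over total input tokens) would answer a different question, namely how much of the prompt was served from cache rather than how often.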

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/widget/UI_PERF_OPTIMIZATIONS.md` around lines 293 - 304, Add explicit
baseline measurements and new user-perceived and cache-effectiveness metrics to
the Measurement plan: record p50/p95/p99 baselines for each existing metric
(ttft_ms, ttfui_ms, ttlui_ms, llm_output_tokens, cache_read_input_tokens,
ui_op_dropped) before Phase 0, and add speculative_refinement_delta_ms,
speculative_change_rate, and layout_shift_score to capture user-perceived impact
of speculative emits; also add cache_hit_rate and cache_miss_rate alongside
cache_read_input_tokens for relative cache effectiveness, and expand the SSE
tagging proposal to include per-request flags for which specific optimizations
(e.g., D1, D2, D3, D4) are active in addition to the overall optimisation phase
tag.

Comment on lines +308 to +320
## 6. Generic functionality preserved

Every recommendation is additive:

- **A1's data-binding split** keeps the flat-op form as a fallback. LLM can still emit a hand-crafted tree when needed.
- **A2's speculative emission** is fully overridable — the LLM's final ops win.
- **A3's compact wire form** is a pure encoding alias; canonical ops downstream are unchanged.
- **B1/B2** are skeleton/progressive enhancement; final state matches today's.
- **C1/C2** narrow what's *allowed* but the existing healer rules already enforce this — we're just moving enforcement to the right layer.
- **D1-D5** are pure caching/short-circuiting; no contract change.

The catalog remains the single source of truth. Templates still declare allowlists and JIT instructions. Merchants still get per-template UI customisation.


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Ensure fallback behaviors are tested.

This section makes important claims about preserved contracts and fallback mechanisms. These should be explicitly tested as part of each phase rollout:

  • A1: Verify that LLM can still emit flat-op form when data-binding is inappropriate (e.g., heterogeneous lists, one-off UI elements)
  • A2: Test that LLM's ops correctly override speculative emission (especially edge cases like same id but different type, or partial property override)
  • C1/C2: Validate that constrained generation schemas permit everything the existing healer allows (no false negatives)
  • D4: Confirm that short-circuit validation doesn't introduce correctness regressions (background Pydantic still catches all issues)

Consider adding a "Contract Tests" subsection to the measurement plan (Section 5) that explicitly validates these guarantees don't regress across phases.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/widget/UI_PERF_OPTIMIZATIONS.md` around lines 308 - 320, Add explicit
contract tests to the rollout measurement plan (Section 5) that validate the
fallback behaviors described in Section 6: create tests for A1 ensuring the LLM
can still emit flat-op form for heterogeneous lists and one-off UI elements; for
A2 ensure speculative emission is overridden by LLM final ops including edge
cases like identical id/different type and partial property overrides; for C1/C2
validate constrained generation schemas permit every case allowed by the
existing healer (no false negatives); and for D4 add a short-circuit validation
test that confirms background Pydantic validation still catches all issues.
Include these as a "Contract Tests" subsection and reference the specific
guarantees (A1, A2, C1, C2, D4) so they run in CI during each phase rollout.
