DO NOT MERGE: Evaluate figma implementer subagent (WEB-2442) #1560

Open: Marcosld wants to merge 1 commit into master from WEB-2442

@Marcosld Marcosld commented Jun 3, 2026

Does using a specialized subagent improve the mistica-react skill in any way?

A/B experiment — 6 headless Claude Code runs, identical prompt, same Figma design.
Model: claude-opus-4-8 (all runs) · n = 3 per arm.


TL;DR

| Metric | Verdict |
| --- | --- |
| Output quality (visual fidelity) | Tie. Both arms produce high-fidelity, near-pixel-faithful pages (~4.3–4.4 / 5). The subagent did not produce visibly better screens. |
| Mistica compliance | Tie (both near-perfect). Every run used Mistica primitives almost exclusively: 0 raw HTML / 0 inline styles in 4 of 6 runs. The skill is what enforces this, in both arms. |
| Total token volume | ≈ Tie (−3%). Total tokens are dominated (~99.5%) by input-side context (mostly cache reads), which is near-equal across arms. The subagent does not consume dramatically fewer tokens overall, although at first glance it seemed to, because the main loop delegates the work to the subagent. |
| Output tokens | −40% with subagent (27.5k vs 46k), real and reproducible. But output is <1% of token volume; see decomposition below. The likely cause is that the subagent's system prompt asks it to be brief, whereas the main-loop agent is usually verbose. |
| Orchestrator turns | −92% to −95%, but this is largely a measurement artifact: num_turns counts only the main loop. Total agentic work (deduped messages) is actually +22% higher in the subagent arm. |
| Cost / latency | Modest subagent edge: ~10% cheaper ($4.95 vs $5.51), ~14% faster (632s vs 735s). ~82% of the cost gap is attributable to the output-token difference. The speed improvement may come from faster MCP and skill loading, since both are specified in the subagent's system prompt. |

Conclusion: the subagent does not make the result better (the mistica-react skill does the heavy lifting in both arms), and it neither does less total work nor uses dramatically fewer total tokens. Its two real, verified effects are: (1) it keeps the orchestrator transcript ~90% leaner by relocating the work into a child context (context hygiene), and (2) the subagent generates ~52% less text per turn than the verbose top-level agent, which yields a modest ~10% cost / ~14% latency edge at equal quality. Its value here is orchestration hygiene plus lower output verbosity, not output fidelity and not a large compute saving. We think it is not worth using a subagent as part of the mistica-react plugin: context is lost between the main loop and the subagent (for example, when iterating on an implementation), and the improvements are almost non-existent.


Methodology

  • Identical prompt (byte-for-byte) in all 6 runs:
    Implement this design from Figma using Mistica @https://www.figma.com/design/puOwn8pBJCrMYksvXeCiJO/AI-Test---Figma-MCP-2-code?node-id=1-67393&m=dev
  • Isolation: 6 separate workspaces outside the main repo (so no parent .claude/agents leaks in). Each = copy of the project + .claude (skill symlink), node_modules symlinked to the main repo.
    • with-1/2/3: .claude/agents/figma-mistica-implementer.md present.
    • no-1/2/3: .claude/agents removed — skill only.
  • Delegation: natural auto-delegation (identical prompt; no nudge). Detected per run from the event stream.
  • Runs: headless claude -p --output-format stream-json --verbose --model opus. Metrics (duration_ms, num_turns, total_cost_usd, per-model token usage) read straight from the terminal result event.
  • Rendering: each implementation booted on a Vite dev server and screenshotted full-page at the design's native 1368px width via Playwright/Chromium; console + Vite-overlay errors captured.
  • Compliance: static analysis of generated .tsx (excluding the identical main.tsx boilerplate).

Sanity checks passed: all 6 runs exited 0; all loaded the mistica-react skill; all 3 with-* runs delegated to figma-mistica-implementer and no no-* run did; all 6 compiled and rendered with zero console/overlay errors.
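
As a rough illustration of the metrics step above, the following sketch pulls the final result event out of a stream-json log. The field names duration_ms, num_turns and total_cost_usd come from this write-up; the one-JSON-object-per-line layout, the `type: "result"` discriminator, and the helper names `ResultEvent`, `isResultEvent` and `extractResult` are assumptions made for the example, not the script actually used in the experiment.

```typescript
// Fields this write-up reads from the final event. The `type: "result"`
// discriminator is an assumption about the event schema.
interface ResultEvent {
  type: "result";
  duration_ms: number;
  num_turns: number;
  total_cost_usd: number;
}

// Runtime guard so that malformed or unrelated events are ignored.
function isResultEvent(value: unknown): value is ResultEvent {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    v.type === "result" &&
    typeof v.duration_ms === "number" &&
    typeof v.num_turns === "number" &&
    typeof v.total_cost_usd === "number"
  );
}

// Scan a log with one JSON object per line (assumed layout) and return the
// last result event, or undefined if the run never emitted one.
function extractResult(streamJson: string): ResultEvent | undefined {
  let last: ResultEvent | undefined;
  for (const line of streamJson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      continue; // tolerate non-JSON noise in the log
    }
    if (isResultEvent(parsed)) last = parsed;
  }
  return last;
}
```

Reading only the last matching event keeps the extraction independent of how many intermediate events a run emits, which matters here because the with-* and no-* runs produce very different event streams.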


Original design (baseline)

[Image: figma-baseline]

Per-run results

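| Run | Figma calls | Wall (s) | Output tok | Cache-read tok |
| --- | --- | --- | --- | --- |
| with-1 | 15 | 527 | 29,556 | 6,861,084 |
| with-2 | 12 | 718 | 28,194 | 5,467,758 |
| with-3 | 15 | 651 | 24,794 | 5,485,159 |
| no-1 | 15 | 713 | 46,727 | 4,858,799 |
| no-2 | 16 | 727 | 43,302 | 6,710,654 |
| no-3 | 14 | 767 | 48,185 | 6,741,214 |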

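Mean wall time is lower in the with-* runs (632s vs 735s). However, the run with the fewest cache-read tokens is a no-* run (no-1, 4.86M) and the run with the most is a with-* run (with-1, 6.86M), which points to substantial within-arm variability.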
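
The arm-level figures quoted in this write-up can be recomputed from the per-run table. The sketch below does that for wall time, output tokens and cache-read tokens; the data is copied from the table, while the names `Run`, `armMean` and `relativeDiff` are made up for the example.

```typescript
interface Run {
  name: string;
  wallSeconds: number;
  outputTokens: number;
  cacheReadTokens: number;
}

// Values copied from the per-run results table.
const runs: Run[] = [
  { name: "with-1", wallSeconds: 527, outputTokens: 29556, cacheReadTokens: 6861084 },
  { name: "with-2", wallSeconds: 718, outputTokens: 28194, cacheReadTokens: 5467758 },
  { name: "with-3", wallSeconds: 651, outputTokens: 24794, cacheReadTokens: 5485159 },
  { name: "no-1", wallSeconds: 713, outputTokens: 46727, cacheReadTokens: 4858799 },
  { name: "no-2", wallSeconds: 727, outputTokens: 43302, cacheReadTokens: 6710654 },
  { name: "no-3", wallSeconds: 767, outputTokens: 48185, cacheReadTokens: 6741214 },
];

const mean = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Mean of one metric over the three runs of an arm ("with" or "no").
function armMean(arm: "with" | "no", pick: (r: Run) => number): number {
  return mean(runs.filter((r) => r.name.startsWith(arm + "-")).map(pick));
}

// Relative difference of the with-arm against the no-arm baseline.
function relativeDiff(pick: (r: Run) => number): number {
  const baseline = armMean("no", pick);
  return (armMean("with", pick) - baseline) / baseline;
}

const wallDiff = relativeDiff((r) => r.wallSeconds); // about -0.14
const outputDiff = relativeDiff((r) => r.outputTokens); // about -0.40
const cacheReadDiff = relativeDiff((r) => r.cacheReadTokens); // about -0.03
```

With n = 3 per arm these means are sensitive to a single outlier run, so the relative differences are best read as indicative rather than precise.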


Token economics — decomposed and verified

This is the core of the "why fewer tokens?" question. Three findings:

1. Total token volume is ≈ equal (−3%). Tokens are ~99.5% input-side (context fed in each step, mostly cache reads) and ~0.5% output (text generated):

| Run | Input-side | Output | Output % of all tokens |
| --- | --- | --- | --- |
| with-1 | 7.10M | 29,556 | 0.41% |
| with-2 | 5.66M | 28,194 | 0.50% |
| with-3 | 5.69M | 24,794 | 0.43% |
| no-1 | 5.09M | 46,727 | 0.91% |
| no-2 | 6.92M | 43,302 | 0.62% |
| no-3 | 6.95M | 48,185 | 0.69% |

Both arms read the same Mistica docs + Figma payloads once and re-read their growing context a comparable number of times, so the bulk (cache reads) is near-equal. The subagent doesn't avoid that context — it just holds it in a child session instead of the parent (same volume, different container).

2. The −40% output-token gap is driven by verbosity-per-turn, not less work. Despite doing +22% more total messages, the with-* runs emit −40% output, because the subagent generates ~533 tokens/turn vs the top-level agent's ~1,116 (the two clusters don't overlap: max-with 539 < min-no 902). Causes: the subagent runs under a rigid 5-step workflow and its output is treated as a return value (terse, little narration), whereas the no-* run is the top-level conversational agent (more planning prose, running commentary, a long final user-facing summary).

3. Cost composition is dominated by cache reads; output is where the arms differ. Shares are exact (uniform 3.00× discount, CHECK 4):

| Cost component | WITH share | NO share |
| --- | --- | --- |
| Cache-read | 59.9% | 55.1% |
| Cache-write | 23.7% | 21.7% |
| Output | 13.9% | 21.0% |
| Input (uncached) | 2.5% | 2.3% |

The mean output-token gap (18,557 tokens) is worth $0.46 at the realized output price — i.e. ~82% of the $0.56 total cost gap. The remaining ~$0.10 is cache-read run-to-run variance (note with-1 has the highest cache-read of all six → it cost $5.60 despite low output). So the ~10% cost edge is real but modest, and almost entirely an output-verbosity effect.
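
The arithmetic behind the three findings can be checked with a few lines. This sketch uses only figures quoted in this write-up ($4.95, $5.51, $0.46, 533 and 1,116 tokens per turn, and the per-run output counts); the implied output price it derives is a back-calculation for illustration, since the realized price itself is not stated here.

```typescript
// Output tokens per run, copied from the per-run results table.
const withOutput = [29556, 28194, 24794];
const noOutput = [46727, 43302, 48185];

const avg = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Mean output-token gap between the arms (about 18,557 tokens).
const outputGapTokens = avg(noOutput) - avg(withOutput);

// Cost figures as quoted in the text (USD, mean per run).
const meanCostWith = 4.95;
const meanCostNo = 5.51;
const outputGapUsd = 0.46;

const totalCostGapUsd = meanCostNo - meanCostWith; // about 0.56
const outputShareOfGap = outputGapUsd / totalCostGapUsd; // about 0.82

// Back-calculated, NOT stated in the write-up: the output price per
// million tokens that would make the token gap worth $0.46.
const impliedOutputPricePerMTok = (outputGapUsd / outputGapTokens) * 1e6;

// Per-turn verbosity: ~533 tokens/turn (subagent) vs ~1,116 (top-level).
const perTurnReduction = 1 - 533 / 1116; // about 0.52
```

The remaining share of the cost gap (roughly 18%) is what the text attributes to cache-read run-to-run variance.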


Mistica primitive compliance

Both arms are excellent and effectively tied — the skill enforces primitive usage regardless of the subagent:

  • Raw HTML elements: with-* = 0 / 0 / 0; no-* = 2 / 0 / 0 (only no-1 slipped in two <span>s).
  • Inline styles: with-* = 0 / 1 / 0; no-* = 2 / 0 / 0.
  • Hardcoded hex colors: 0 across all 6 runs.
  • Hardcoded px: only with-2 (6 occurrences); all others 0.
  • Every run composes from @telefonica/mistica components (MainNavigationBar, NavigationBreadcrumbs, Hero, Chip, GridLayout, MediaCard, Checkbox, RadioGroup, InfoRating, Text*, etc.) and pulls colors/spacing from skinVars tokens.

The subagent arm is marginally cleaner (0 raw HTML, more explicit skinVars usage), but the difference is within noise — both pass the "no raw divs / no raw styles" bar.
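
For readers who want to reproduce counts like these, here is a minimal regex-based sketch of a compliance check over a .tsx source string. It is a heuristic written for this example (the names `ComplianceCounts`, `countMatches` and `checkCompliance` are hypothetical) and not the static analysis used in the experiment; a robust version would walk the TypeScript AST instead, because regexes can miscount comparison expressions, comments and string contents.

```typescript
interface ComplianceCounts {
  rawHtmlElements: number;
  inlineStyles: number;
  hexColors: number;
  hardcodedPx: number;
}

// Number of non-overlapping matches of a global regex in the source.
function countMatches(source: string, pattern: RegExp): number {
  return (source.match(pattern) ?? []).length;
}

function checkCompliance(tsx: string): ComplianceCounts {
  return {
    // Opening JSX tags that start with a lowercase letter are intrinsic
    // HTML elements. Requiring a non-word character before "<" skips
    // generics such as useState<string>.
    rawHtmlElements: countMatches(tsx, /(^|[^\w)])<[a-z][a-z0-9]*(?=[\s>\/])/g),
    // style={...} props on any element or component.
    inlineStyles: countMatches(tsx, /\bstyle=\{/g),
    // #rgb, #rgba, #rrggbb or #rrggbbaa literals.
    hexColors: countMatches(
      tsx,
      /#(?:[0-9a-fA-F]{3,4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})\b/g
    ),
    // Literal pixel values such as 12px or 1.5px.
    hardcodedPx: countMatches(tsx, /\b\d+(?:\.\d+)?px\b/g),
  };
}
```

A file that composes only Mistica components and skinVars tokens should report zero in every field, which is the bar both arms are measured against above.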


Visual fidelity

Scored 1–5 against the baseline screenshot (layout, hero, sidebar, grid, pagination, type/colour). All six are strong; the arms are statistically indistinguishable.

| Run | Fidelity | Notes |
| --- | --- | --- |
| with-1 | 4.5 | All sections faithful; kept literal nav placeholders (matches the screenshot); hero slightly taller; ratings rendered as filled dots. |
| with-2 | 4.0 | Cleanest engineering (split into components/), but customized nav to real labels ("Tienda/Móvil/…", "Lo quiero"); semantically nicer yet less literal vs the screenshot; hero smaller/top-right. |
| with-3 | 4.3 | Faithful; ratings as real stars; good grid + pagination. |
| no-1 | 4.5 | Near-identical to with-1; faithful across the board; ratings as dots. |
| no-2 | 4.2 | Faithful; hero rendered as a wide 3-image panorama (aspect differs from baseline); ratings as stars. |
| no-3 | 4.5 | Faithful; best hero match (green bg, phones + face); ratings as stars; full 1–5 pagination. |
| WITH avg | 4.27 | |
| NO avg | 4.40 | Marginally higher, within subjective noise. |

Screenshot gallery

| WITH-subagent | NO-subagent |
| --- | --- |
| [Image: with-1] | [Image: no-1] |
| [Image: with-2] | [Image: no-2] |
| [Image: with-3] | [Image: no-3] |

Interpretation

  1. The skill is the quality driver, not the subagent. Both arms loaded mistica-react, and both produced near-pixel-faithful, primitive-compliant pages. Removing the subagent did not degrade output quality.
  2. The subagent does NOT do less work or use dramatically fewer tokens. Total token volume is within 3%, and the subagent arm runs more total agentic messages (+22%). The headline "−95% turns" is an artifact of num_turns measuring only the orchestrator loop (verified, CHECK 2).
  3. What the subagent actually changes is two-fold: (a) it keeps the orchestrator transcript ~90% leaner (work relocated into a child context), and (b) the subagent emits ~52% less text per turn than the verbose top-level agent. (b) is the entire source of the −40% output-token gap.
  4. Cost is modest and output-driven. Cache reads (~55–60% of cost) dominate and are near-equal across arms; the ~10% cost edge is ~82% explained by the output-token gap (≈$0.46 of $0.56), the rest cache-read noise. Latency edge ~14%.
  5. Architecture varied within both arms (monolithic App.tsx vs split component files) — run-to-run variance, not a subagent effect.

Recommendation

  • If the goal is better-looking / more compliant output, the subagent is not justified on this evidence — invest in the skill.
  • The subagent's defensible value is orchestration hygiene (a ~90% leaner main transcript) plus a modest ~10% cost / ~14% latency saving from lower output verbosity — not a large compute reduction. Worth keeping for batch/CI Figma→code where a clean orchestrator context and small per-job savings compound.
  • The leaner-orchestrator benefit should scale up on larger / multi-screen designs (where a bloated single context hurts more) — a worthwhile follow-up to test in the future, ideally with larger n to tighten the cost/latency estimates.

@@ -0,0 +1,128 @@
---

This is the evaluated subagent

github-actions Bot commented Jun 3, 2026

Size stats

| | master | this branch | diff |
| --- | --- | --- | --- |
| Total JS | 16.2 MB | 16.2 MB | 0 B |
| JS without icons | 2.07 MB | 2.07 MB | 0 B |
| Lib overhead | 92.5 kB | 92.5 kB | 0 B |
| Lib overhead (gzip) | 19.9 kB | 19.9 kB | 0 B |

github-actions Bot commented Jun 3, 2026

Deploy preview for mistica-web ready!

Project: mistica-web
Status: ✅ Deploy successful!
Preview URL: https://mistica-athkv550q-flows-projects-65bb050e.vercel.app
Latest Commit: d7c20bf

Deployed with vercel-action

github-actions Bot commented Jun 3, 2026

Accessibility report

55 problems detected
welcome--welcome [O2-new] (1 violations)
welcome--welcome [Movistar-new] (1 violations)
welcome--welcome [Vivo-new] (1 violations)
welcome--welcome [Blau] (1 violations)
components-accordions--boxed-accordion-story [Vivo-new] (1 violations)
components-accordions--boxed-accordion-story [Blau] (1 violations)
components-accordions--boxed-accordion-story [Movistar-new] (1 violations)
components-accordions--boxed-accordion-story [O2-new] (1 violations)
components-badge--default [Movistar-new] (1 violations)
components-badge--default [O2-new] (1 violations)
components-buttons--primary-button [Vivo-new] (1 violations)
components-buttons--danger-button [Movistar-new] (1 violations)
components-buttons--secondary-button [Blau] (1 violations)
components-buttons--danger-button [Vivo-new] (1 violations)
components-buttons--icon-button-story [Blau] (1 violations)
components-carousels-carousel--with-carousel-context-and-outside-controls [O2-new] (1 violations)
components-carousels-centeredcarousel--default [Vivo-new] (1 violations)
components-carousels-centeredcarousel--with-controls [Movistar-new] (1 violations)
components-carousels-centeredcarousel--with-controls [O2-new] (1 violations)
components-carousels-slideshow--with-carousel-context [O2-new] (1 violations)
components-checkbox--uncontrolled [Movistar-new] (1 violations)
components-checkbox--uncontrolled [O2-new] (1 violations)
components-checkbox--uncontrolled [Vivo-new] (1 violations)
components-chip--multiple-selection [Vivo-new] (1 violations)
components-headers-header--default [Movistar-new] (1 violations)
components-input-fields-autocomplete--controlled [O2-new] (1 violations)
components-input-fields-cvvfield--controlled [Blau] (1 violations)
components-input-fields-phonenumberfieldlite--uncontrolled [O2-new] (1 violations)
components-input-fields-searchfield--uncontrolled [O2-new] (1 violations)
components-input-fields-textfield--controlled [Movistar-new] (1 violations)
components-input-fields-searchfield--uncontrolled [Vivo-new] (1 violations)
components-modals-drawer--default [Movistar-new] (1 violations)
components-modals-drawer--default [O2-new] (1 violations)
components-popover--default [O2-new] (1 violations)
components-primitives-video--default [Blau] (1 violations)
components-progress-bars--progress-bar-story [Movistar-new] (1 violations)
components-radio-button--controlled [Vivo-new] (1 violations)
components-radio-button--uncontrolled [Vivo-new] (1 violations)
components-radio-button--uncontrolled [Blau] (1 violations)
components-radio-button--uncontrolled [O2-new] (1 violations)
components-switch--uncontrolled [Vivo-new] (1 violations)
components-text--text-wrapping [O2-new] (1 violations)
components-text--text-wrapping [Movistar-new] (1 violations)
components-timer--text-timer-story [Vivo-new] (1 violations)
patterns-loading--brand-loading-screen-story [Blau] (1 violations)
layout-align--default [Movistar-new] (1 violations)
layout-inline--wrap [O2-new] (1 violations)
community-advanceddatacard--default [Movistar-new] (1 violations)
private-components-inside-portals--default [Vivo-new] (1 violations)
private-components-inside-portals--default [O2-new] (1 violations)
private-deprecated-card-stories-nakedcard--default [Vivo-new] (1 violations)
private-fixedfooter--default [Blau] (1 violations)
private-image-image-sizes--default [Blau] (1 violations)
private-tooltip--moving-target [Vivo-new] (1 violations)
private-tooltip--moving-target [Blau] (1 violations)

ℹ️ You can run this locally by executing yarn audit-accessibility.

@Marcosld Marcosld marked this pull request as ready for review June 3, 2026 10:01
Copilot AI review requested due to automatic review settings June 3, 2026 10:01

Copilot AI left a comment

Pull request overview

Adds a dedicated “figma-mistica-implementer” agent definition intended to delegate Figma→Mistica React implementation work into a specialized subagent context.

Changes:

  • Introduces a new agent prompt/spec for implementing Figma designs with @telefonica/mistica.
  • Defines a required 5-step workflow (load skill → extract Figma → map → implement → verify) plus output requirements.

Comment on lines +1 to +5
---
name: 'figma-mistica-implementer'
description:
"Use this agent when you need to translate a Figma design into production-ready React code using the
@telefonica/mistica component library. This agent should be invoked whenever a user provides a Figma design
Comment on lines +2 to +22
name: 'figma-mistica-implementer'
description:
"Use this agent when you need to translate a Figma design into production-ready React code using the
@telefonica/mistica component library. This agent should be invoked whenever a user provides a Figma design
URL and wants it implemented as code, or when a design needs to be converted into Mistica-compliant
components. <example>\\nContext: The user wants to implement a Figma design into code using Mistica.\\nuser:
\"Here's the design for the new login screen: https://figma.com/file/abc123/login-screen. Can you implement
it?\"\\nassistant: \"I'm going to use the Agent tool to launch the figma-mistica-implementer agent to
translate this Figma design into Mistica-compliant React code.\"\\n<commentary>\\nSince the user provided a
Figma URL and wants it implemented, use the figma-mistica-implementer agent to extract the design via Figma
MCP and build it with @telefonica/mistica.\\n</commentary>\\n</example>\\n<example>\\nContext: The user
shares a Figma frame and asks for a component.\\nuser: \"Build this card component from Figma using our
design system: https://figma.com/file/xyz789/card\"\\nassistant: \"Let me use the Agent tool to launch the
figma-mistica-implementer agent to build a visually accurate, Mistica-compliant implementation of this
card.\"\\n<commentary>\\nThe user wants a Figma design implemented with the Telefonica design system, so the
figma-mistica-implementer agent is the right choice.\\n</commentary>\\n</example>\\n<example>\\nContext: The
user pastes a Figma node link mid-conversation while building a feature.\\nuser: \"Now add the settings
panel — here's the design: https://figma.com/file/def456/settings?node-id=12-345\"\\nassistant: \"I'll use
the Agent tool to launch the figma-mistica-implementer agent to implement the settings panel from this Figma
node using Mistica.\"\\n<commentary>\\nA Figma design link was provided for implementation; delegate to the
figma-mistica-implementer agent.\\n</commentary>\\n</example>"