DO NOT MERGE - Full Context Only - feat(skills): Add page-structure skill with eval-driven development#16

Draft
yan-xie-webflow wants to merge 13 commits into main from feat/page-structure-skill-and-evals
@yan-xie-webflow
Contributor

Summary

  • New plugin: webflow-designer-tools with page-structure skill for building and managing page elements, components, and layouts in Webflow Designer
  • New eval framework: pytest-based test suite using Claude CLI stream-json output to validate skill trigger accuracy and execution quality
  • Registered the new plugin in both Claude and Cursor marketplace configs

Methodology: Eval-Driven Skill Development

This PR follows an eval-first approach — all tests were written and validated before the skill implementation, similar to TDD but for LLM skill behavior:

  1. Define eval framework — Built shared fixtures (run_claude, extract_tool_calls, extract_skill_invocations) that spawn claude -p with --output-format stream-json and parse events to verify tool calls and skill invocations
  2. Write trigger tests first — 29 tests (14 positive, 15 negative) validating that natural language prompts route to the correct skill based on the skill description alone
  3. Write execution tests first — 15 tests verifying correct MCP tool usage, ordering (guide_tool → data_sites_tool → de_page_tool → element_tool), confirmation before mutations, and no hallucinated tools
  4. Implement the skill — Wrote SKILL.md following existing patterns (phased workflow, tool declarations, examples)
  5. Iterate on failures — Used test diagnostics to refine the skill description (trigger accuracy), add missing site discovery step, and adjust test assertions for non-interactive mode behavior
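
The parsing and ordering checks described in steps 1 and 3 could be sketched roughly as below. This is not the code from conftest.py: it assumes each stream-json line is a JSON object, that tool calls appear as content blocks of type "tool_use" inside "assistant" events, and that MCP tool names may carry a server prefix (the `mcp__webflow__` prefix in the usage note is hypothetical). `appears_in_order` is an illustrative helper name, not one from the PR.

```python
import json
from typing import Iterable, List


def extract_tool_calls(lines: Iterable[str]) -> List[str]:
    """Collect tool names from stream-json lines, in the order they were emitted."""
    calls: List[str] = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore any non-JSON noise in the stream
        # ASSUMPTION: tool calls live in "assistant" events as "tool_use" blocks.
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        message = event.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                calls.append(str(block.get("name", "")))
    return calls


def appears_in_order(calls: Iterable[str], expected: Iterable[str]) -> bool:
    """True if every expected tool occurs in `calls` in the given relative order.

    Other calls may be interleaved. Names are matched by suffix so that a
    server-prefixed name still matches the bare tool name.
    """
    remaining = iter(calls)
    # Each any() consumes the iterator up to and including its first match,
    # so later tools must appear after earlier ones.
    return all(any(name.endswith(want) for name in remaining) for want in expected)


# Ordering taken from the execution tests described above.
EXPECTED_ORDER = ["guide_tool", "data_sites_tool", "de_page_tool", "element_tool"]
```

An execution test might then assert `appears_in_order(extract_tool_calls(output_lines), EXPECTED_ORDER)`, where `output_lines` is the captured stdout of the claude -p run. Suffix matching keeps the check independent of whatever server prefix the tool names carry.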

Key design decisions driven by evals

| Finding | Fix |
|---|---|
| Skill description too vague → low trigger accuracy | Added specific verbs: "inspecting components", "viewing what's inside a component", "previewing page structure" |
| Missing data_sites_tool → model asks for site ID and stops | Added site discovery to Phase 1 (matching all other skills) |
| 50+ sites in workspace → model can't pick one | Test prompts specify site name; skill instructions say "if only one site, use it automatically" |
| Mutating ops correctly ask for confirmation in -p mode | Tests accept confirmation-and-stop as valid safety behavior |
| Claude uses short skill names (page-structure) vs full (webflow-designer-tools:page-structure) | extract_skill_invocations normalizes both forms using init event's skills list |
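
For the last finding, one possible shape of the name normalization is sketched below. It is an assumption rather than the PR's code: it resolves to the plugin-qualified form, whereas the real extract_skill_invocations may normalize in the other direction. `normalize_skill_name` is an illustrative helper name, and `known_skills` stands in for the skills list read from the init event.

```python
from typing import List


def normalize_skill_name(name: str, known_skills: List[str]) -> str:
    """Map a short skill name to its plugin-qualified form when unambiguous.

    ASSUMPTION: qualified names look like "<plugin>:<skill>" and
    `known_skills` comes from the init event's skills list.
    """
    if name in known_skills:
        return name  # already in a known form
    # Find qualified names whose final segment equals the short name.
    matches = [skill for skill in known_skills if skill.split(":")[-1] == name]
    # Resolve only when exactly one skill matches; otherwise leave the name as-is.
    return matches[0] if len(matches) == 1 else name
```

With this in place, a trigger test can compare against a single canonical name regardless of which form Claude emitted. Leaving ambiguous or unknown names untouched means a mismatch shows up as a test failure instead of being silently resolved.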

Test Results

| Suite | Pass | Total |
|---|---|---|
| Smoke (fixtures work) | 5 | 5 |
| Trigger accuracy | 29 | 29 |
| Direct execution | 15 | 15 |
| Total | 49 | 49 |

Files

Eval framework (evals/):

  • pytest.ini — config with custom markers (designer, data_api, trigger, direct, negative)
  • constants.py — central config (MCP tools, plugin dirs, model, max turns)
  • conftest.py: run_claude(), extract_tool_calls(), extract_skill_invocations(), get_result()
  • test_conftest_smoke.py — 5 smoke tests
  • test_page_structure_trigger.py — 29 trigger accuracy tests
  • test_page_structure_direct.py — 15 execution quality tests

Skill (plugins/webflow-designer-tools/):

  • .claude-plugin/plugin.json — plugin config
  • skills/page-structure/SKILL.md — 5-phase workflow (Discovery → Inspection → Planning → Execution → Verification)

Config:

  • .claude-plugin/marketplace.json — added webflow-designer-tools entry
  • .cursor-plugin/plugin.json — added skills path
  • .cursor-plugin/marketplace.json — updated description

Test plan

  • All 49 eval tests pass (pytest evals/ -v)
  • Smoke test with Designer open: /page-structure List all elements on the current page
  • Verify trigger routing: "Add a hero section to my Webflow page" → triggers page-structure
  • Verify negative routing: "Publish my Webflow site" → does NOT trigger page-structure

🤖 Generated with Claude Code

yan-xie-nk and others added 11 commits April 5, 2026 20:44
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tract_skill_invocations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TDD tests covering execution quality (tool ordering, correct tool calls)
and safety (confirmation before mutations, no hallucinated tools).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TDD tests covering 14 positive triggers (page/element/component
manipulation prompts) and 15 negative triggers (CMS, publish, audit,
and other non-structure prompts). Tests will fail until the
page-structure skill is implemented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Important Note section with all MCP tool declarations (matching existing pattern)
- 5-phase workflow: Discovery, Inspection, Planning, Execution, Verification
- Safety: snapshot before mutation, explicit confirmation required
- 5 examples covering list elements, build, update component, restructure, layout
- Fixed plugin config: moved to .claude-plugin/plugin.json, added skills field
…ger prompt

- extract_skill_invocations now normalizes both short and full skill names
- Moved 'What components does my site have?' to ambiguous cases (site-audit wins)
- Replaced with unambiguous 'List the components I can use on this page'
- Reverted description bloat
…nd relaxed negative tests

Update page-structure skill description with more specific verbs (inspecting,
viewing, creating pages, previewing) to improve trigger reliability. Loosen
two negative test assertions where alternative skills (brainstorming,
frontend-design) legitimately intercept prompts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rect tests

Add data_sites_tool to Discovery phase and all examples in SKILL.md,
matching the pattern used by all other skills. Update direct test prompts
to specify site name ("Yan's Test Case") to avoid ambiguity with 50+ sites
in workspace. Relax mutation test assertions to accept confirmation-and-stop
behavior in non-interactive mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rketplace configs

Add the new plugin to .claude-plugin/marketplace.json and
.cursor-plugin/plugin.json so it's discoverable in both Claude Code
and Cursor.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
yan-xie-webflow marked this pull request as draft April 6, 2026 02:31
yan-xie-webflow changed the title from "feat(skills): Add page-structure skill with eval-driven development" to "DO NOT MERGE - Full Context Only - feat(skills): Add page-structure skill with eval-driven development" Apr 6, 2026
The skill covers more than just page structure (components, styles,
snapshots, element building), so rename test files to match the
broader plugin scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename skill folder, frontmatter, test references, and class names
from page-structure to designer-tools to reflect the broader scope
(pages, elements, components, styles, snapshots). Update Cursor
marketplace description accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2 participants