DO NOT MERGE - Full Context Only - feat(skills): Add page-structure skill with eval-driven development by yan-xie-webflow · Pull Request #16 · webflow/webflow-skills

yan-xie-webflow · 2026-04-06T02:30:45Z

Summary

- New plugin: webflow-designer-tools, with a page-structure skill for building and managing page elements, components, and layouts in Webflow Designer
- New eval framework: pytest-based test suite that uses Claude CLI stream-json output to validate skill trigger accuracy and execution quality
- Registered the new plugin in both the Claude and Cursor marketplace configs

Methodology: Eval-Driven Skill Development

This PR follows an eval-first approach — all tests were written and validated before the skill implementation, similar to TDD but for LLM skill behavior:

1. Define eval framework: built shared fixtures (run_claude, extract_tool_calls, extract_skill_invocations) that spawn claude -p with --output-format stream-json and parse the events to verify tool calls and skill invocations.
2. Write trigger tests first: 29 tests (14 positive, 15 negative) validating that natural-language prompts route to the correct skill based on the skill description alone.
3. Write execution tests first: 15 tests verifying correct MCP tool usage, ordering (guide_tool → data_sites_tool → de_page_tool → element_tool), confirmation before mutations, and no hallucinated tools.
4. Implement the skill: wrote SKILL.md following existing patterns (phased workflow, tool declarations, examples).
5. Iterate on failures: used test diagnostics to refine the skill description (trigger accuracy), add the missing site discovery step, and adjust test assertions for non-interactive mode behavior.
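
The shared fixtures from step 1 could be sketched roughly as follows. This is a minimal illustration, not the PR's actual conftest.py: the event shape (one JSON object per line, with assistant messages carrying `tool_use` content blocks), the `--verbose` and `--max-turns` flags, and the helper `parse_events` are assumptions that should be checked against the installed Claude CLI.

```python
import json
import subprocess


def run_claude(prompt, max_turns=5, timeout=300):
    """Run `claude -p` non-interactively and return the parsed stream-json events.

    Assumption: print mode with stream-json output also needs --verbose.
    """
    cmd = [
        "claude", "-p", prompt,
        "--output-format", "stream-json",
        "--verbose",
        "--max-turns", str(max_turns),
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return parse_events(proc.stdout)


def parse_events(stdout):
    """Parse newline-delimited JSON, skipping blank lines and non-JSON noise."""
    events = []
    for line in stdout.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict):
            events.append(event)
    return events


def extract_tool_calls(events):
    """Return the names of tools the assistant called, in call order."""
    calls = []
    for event in events:
        if event.get("type") != "assistant":
            continue
        for block in event.get("message", {}).get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_use":
                calls.append(block.get("name"))
    return calls
```

Keeping parsing separate from process spawning means the parser can be exercised on canned output without a live Claude session.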

Key design decisions driven by evals

| Finding | Fix |
| --- | --- |
| Skill description too vague → low trigger accuracy | Added specific verbs: "inspecting components", "viewing what's inside a component", "previewing page structure" |
| Missing `data_sites_tool` → model asks for site ID and stops | Added site discovery to Phase 1 (matching all other skills) |
| 50+ sites in workspace → model can't pick one | Test prompts specify site name; skill instructions say "if only one site, use it automatically" |
| Mutating ops correctly ask for confirmation in `-p` mode | Tests accept confirmation-and-stop as valid safety behavior |
| Claude uses short skill names (`page-structure`) vs full (`webflow-designer-tools:page-structure`) | `extract_skill_invocations` normalizes both forms using init event's skills list |
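
The name normalization in the last row might look something like this. It is a hedged sketch only: `normalize_skill_name` is a hypothetical helper, and `known_skills` is assumed to be a list of strings taken from the init event's skills list.

```python
def normalize_skill_name(name, known_skills):
    """Map a short or fully qualified skill name onto the form in known_skills.

    known_skills is assumed to hold names such as
    "webflow-designer-tools:page-structure".
    """
    if name in known_skills:
        return name
    short = name.split(":")[-1]
    matches = [s for s in known_skills if s.split(":")[-1] == short]
    # Rewrite only when exactly one known skill matches; otherwise keep the input.
    return matches[0] if len(matches) == 1 else name
```

Leaving ambiguous or unknown names untouched keeps a test failure visible instead of silently mapping it to the wrong skill.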

Test Results

| Suite | Pass | Total |
| --- | --- | --- |
| Smoke (fixtures work) | 5 | 5 |
| Trigger accuracy | 29 | 29 |
| Direct execution | 15 | 15 |
| Total | 49 | 49 |

Files

Eval framework (evals/):

- pytest.ini: config with custom markers (designer, data_api, trigger, direct, negative)
- constants.py: central config (MCP tools, plugin dirs, model, max turns)
- conftest.py: run_claude(), extract_tool_calls(), extract_skill_invocations(), get_result()
- test_conftest_smoke.py: 5 smoke tests
- test_page_structure_trigger.py: 29 trigger accuracy tests
- test_page_structure_direct.py: 15 execution quality tests
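
As an illustration of what the execution quality tests check, the tool-ordering and no-hallucinated-tools assertions could be built on helpers like these. This is a sketch under assumptions: the helper names are hypothetical, and tool names are assumed to be already stripped of any MCP server prefix.

```python
# Expected order taken from the PR description.
EXPECTED_ORDER = ["guide_tool", "data_sites_tool", "de_page_tool", "element_tool"]


def first_use_order_ok(calls, expected=EXPECTED_ORDER):
    """True if the expected tools that were called first appear in the expected order.

    Tools missing from `calls` are tolerated; only relative order is checked.
    """
    first_seen = [calls.index(tool) for tool in expected if tool in calls]
    return first_seen == sorted(first_seen)


def unknown_tools(calls, allowed):
    """Return called tool names that are not in the allowed set, sorted."""
    return sorted(set(calls) - set(allowed))
```

Checking first-use order rather than exact equality allows repeated calls (for example several element_tool calls) without failing the test.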

Skill (plugins/webflow-designer-tools/):

- .claude-plugin/plugin.json: plugin config
- skills/page-structure/SKILL.md: 5-phase workflow (Discovery → Inspection → Planning → Execution → Verification)

Config:

- .claude-plugin/marketplace.json: added webflow-designer-tools entry
- .cursor-plugin/plugin.json: added skills path
- .cursor-plugin/marketplace.json: updated description

Test plan

- All 49 eval tests pass (pytest evals/ -v)
- Smoke test with Designer open: /page-structure List all elements on the current page
- Verify trigger routing: "Add a hero section to my Webflow page" → triggers page-structure
- Verify negative routing: "Publish my Webflow site" → does NOT trigger page-structure
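
The two routing checks above can be expressed as a small table-driven test. The sketch below is hypothetical: `check_trigger_cases` and its `get_invoked_skills` callback are illustrative names, with the callback standing in for whatever runs the prompt and returns the invoked skill names.

```python
# Prompts and expectations taken from the test plan.
TRIGGER_CASES = [
    ("Add a hero section to my Webflow page", True),
    ("Publish my Webflow site", False),
]


def check_trigger_cases(get_invoked_skills, skill="page-structure", cases=TRIGGER_CASES):
    """Return the prompts whose routing disagrees with the expectation."""
    failures = []
    for prompt, should_trigger in cases:
        invoked = get_invoked_skills(prompt)
        # Compare on short names so "plugin:skill" and "skill" are treated alike.
        short_names = {name.split(":")[-1] for name in invoked}
        if (skill in short_names) != should_trigger:
            failures.append(prompt)
    return failures
```

Passing the runner in as a callback lets the same check run against a live Claude session or a stub.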

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

TDD tests covering 14 positive triggers (page/element/component manipulation prompts) and 15 negative triggers (CMS, publish, audit, and other non-structure prompts). Tests will fail until the page-structure skill is implemented. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Important Note section with all MCP tool declarations (matching existing pattern)
- 5-phase workflow: Discovery, Inspection, Planning, Execution, Verification
- Safety: snapshot before mutation, explicit confirmation required
- 5 examples covering list elements, build, update component, restructure, layout
- Fixed plugin config: moved to .claude-plugin/plugin.json, added skills field

…ger prompt
- extract_skill_invocations now normalizes both short and full skill names
- Moved 'What components does my site have?' to ambiguous cases (site-audit wins)
- Replaced with unambiguous 'List the components I can use on this page'
- Reverted description bloat

…nd relaxed negative tests Update page-structure skill description with more specific verbs (inspecting, viewing, creating pages, previewing) to improve trigger reliability. Loosen two negative test assertions where alternative skills (brainstorming, frontend-design) legitimately intercept prompts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…rect tests Add data_sites_tool to Discovery phase and all examples in SKILL.md, matching the pattern used by all other skills. Update direct test prompts to specify site name ("Yan's Test Case") to avoid ambiguity with 50+ sites in workspace. Relax mutation test assertions to accept confirmation-and-stop behavior in non-interactive mode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…rketplace configs Add the new plugin to .claude-plugin/marketplace.json and .cursor-plugin/plugin.json so it's discoverable in both Claude Code and Cursor. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rename skill folder, frontmatter, test references, and class names from page-structure to designer-tools to reflect the broader scope (pages, elements, components, styles, snapshots). Update Cursor marketplace description accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

yan-xie-nk and others added 11 commits April 5, 2026 20:44

feat(evals): add pytest config and constants for skill eval framework

e1aaedb

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(evals): add shared fixtures — run_claude, extract_tool_calls, ex…

3bc479d

…tract_skill_invocations Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(evals): add direct invocation tests for page-structure skill

29385fb

TDD tests covering execution quality (tool ordering, correct tool calls) and safety (confirmation before mutations, no hallucinated tools). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(skills): add webflow-designer-tools plugin with page-structure stub

6a32e76

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: add eval framework spec and implementation plan

47550d6

yan-xie-webflow marked this pull request as draft April 6, 2026 02:31

yan-xie-webflow changed the title ~~feat(skills): Add page-structure skill with eval-driven development~~ DO NOT MERGE - Full Context Only - feat(skills): Add page-structure skill with eval-driven development Apr 6, 2026

refactor(evals): rename test files to test_webflow_designer_tools_*

adb7514

The skill covers more than just page structure (components, styles, snapshots, element building), so rename test files to match the broader plugin scope. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

yan-xie-webflow mentioned this pull request Apr 6, 2026

feat(skills): Add webflow-designer-tools plugin with designer-tools skill #17

Merged

yan-xie-webflow wants to merge 13 commits into main from feat/page-structure-skill-and-evals
