A CLI-based eval harness for testing AI coding agents' ability to build UI components with Storybook and MCP tools.
This eval harness runs automated trials where AI coding agents (Claude Code CLI or GitHub Copilot CLI) are given prompts to build UI components. Each trial:
- Prepares a fresh Vite + React + Storybook project
- Executes the agent with a prompt and optional context (MCP servers, component manifests, or extra prompts)
- Grades the results using automated metrics: build success, type checking, linting, tests, and accessibility
The goal is to measure how well agents can use Storybook's MCP tools to build production-quality components.
> **Note**
> All uploaded task results are publicly available in this Google Sheet (uploading can be opted out of).
- Node.js 24+
- pnpm 10.19.0+
- Playwright (`npx playwright install`)
- Claude Code CLI (`npm install -g @anthropic-ai/claude-code`) - for the `claude-code` agent
- GitHub Copilot CLI (`gh extension install github/gh-copilot`) - for the `copilot-cli` agent
```sh
# Run eval across variants (recommended for batch testing)
node eval.ts

# Run a single advanced eval (interactive mode)
node advanced-eval.ts

# With all options specified (advanced-eval)
node advanced-eval.ts --agent claude-code --model claude-sonnet-4.6 --context components.json --upload-id batch-1 100-flight-booking-plain
```

The following options apply to `advanced-eval.ts`:
| Option | Short | Type | Description |
|---|---|---|---|
| `--agent` | `-a` | string | Which agent to use (`claude-code` or `copilot-cli`) |
| `--model` | `-m` | string | Which model to use (see Model Selection below) |
| `--context` | `-c` | string | Context type: `false`, `storybook-dev`, `*.json` (manifest), `mcp.config.json`, or `*.md` (extra prompts) |
| `--verbose` | `-v` | boolean | Show detailed logs during execution |
| `--storybook` | `-s` | boolean | Auto-start Storybook after completion |
| `--upload-id` | `-u` | string | Upload results to Google Sheets with this ID for grouping/filtering |
| `--no-upload-id` | - | - | Skip uploading results (default if no upload ID provided) |
| `--run-id` | - | string | Run identifier to group uploads together |
| `--help` | `-h` | - | Display help information |
Positional argument: the task directory name (e.g., `100-flight-booking-plain`)
Different agents support different models:
| Model | Claude Code CLI | Copilot CLI |
|---|---|---|
| `claude-opus-4.6` | ✅ | ✅ |
| `claude-opus-4.5` | ❌ | ✅ |
| `claude-sonnet-4.6` | ✅ | ✅ |
| `claude-haiku-4.5` | ✅ | ✅ |
| `gpt-5.2` | ❌ | ✅ |
| `gpt-5.2-codex` | ❌ | ✅ |
| `gpt-5.1-codex-max` | ❌ | ✅ |
| `gemini-3-pro-preview` | ❌ | ✅ |
Example usage:
```sh
# Claude Code with Opus (advanced-eval)
node advanced-eval.ts --agent claude-code --model claude-opus-4.5 100-flight-booking-plain

# Copilot CLI with GPT-5.2 (advanced-eval)
node advanced-eval.ts --agent copilot-cli --model gpt-5.2 100-flight-booking-plain
```

**Important: GitHub Copilot CLI Model Configuration**
To use models other than `claude-sonnet-4.6` with the Copilot CLI, you must first enable them in your GitHub account settings:
- Go to GitHub Copilot Features Settings
- Enable the models you want to use (e.g., GPT-5.1 Codex Max, GPT-5.2, Claude Opus 4.5)
- Save your settings
- Wait up to 30 minutes
Without enabling these models, the Copilot CLI will fail when attempting to use them.
The harness supports five context modes:

- **No context** (`--no-context`): Agent uses only default tools
- **Storybook MCP - Dev** (`--context storybook-dev`): Sets up a local Storybook dev server with an MCP endpoint
- **Storybook MCP - Docs** (`--context components.json`): Provides component documentation via the `@storybook/mcp` package
- **MCP server config** (`--context mcp.config.json` or inline JSON): Custom MCP server setup (use this for fully custom MCP servers, not for Storybook MCP)
- **Extra prompts** (`--context extra-prompt-01.md,extra-prompt-02.md`): Additional markdown files appended to the main prompt
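As a rough illustration of how a `--context` value maps onto these five modes, the classification could be sketched as follows. This is a hedged sketch: the function and mode names are illustrative, not the harness's actual API.

```typescript
// Illustrative only: classify a --context value into one of the five
// context modes described above. The real harness logic may differ.
type ContextMode =
  | 'none'
  | 'storybook-dev'
  | 'storybook-mcp-docs'
  | 'mcp-config'
  | 'extra-prompts';

function resolveContextMode(value: string | false): ContextMode {
  if (value === false || value === 'false') return 'none';
  if (value === 'storybook-dev') return 'storybook-dev';
  // mcp.config.json or inline JSON selects the custom MCP server mode
  if (value === 'mcp.config.json' || value.trim().startsWith('{')) return 'mcp-config';
  // Any other *.json is treated as a component manifest for @storybook/mcp
  if (value.endsWith('.json')) return 'storybook-mcp-docs';
  // One or more comma-separated *.md files are extra prompts
  if (value.split(',').every((f) => f.trim().endsWith('.md'))) return 'extra-prompts';
  throw new Error(`Unrecognized context value: ${value}`);
}
```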
Use `eval.ts` to run multiple trials across context variants and compare results.

```sh
# Interactive eval
node eval.ts

# Eval via pnpm script
pnpm eval
```

Variant configs live under `eval/variant-configs/` and define a base setup plus variants:
```ts
// eval/variant-configs/storybook-mcp-comparison.ts
const base = {
  agent: 'claude-code',
  model: 'claude-sonnet-4.6',
};

export default {
  name: 'storybook-mcp-comparison',
  variants: [
    {
      ...base,
      id: 'with-mcp',
      label: 'With Storybook MCP',
      context: [{ type: 'storybook-mcp-docs' }],
    },
    {
      ...base,
      id: 'without-mcp',
      label: 'Without MCP',
      context: [{ type: false }],
    },
  ],
};
```

```
eval/
├── tasks/                         # Task definitions
│   └── 100-flight-booking-plain/
│       ├── prompt.md              # Main prompt for the agent
│       ├── manifests/             # Optional: manifest files directory
│       │   ├── components.json    # Component manifest for @storybook/mcp
│       │   └── docs.json          # Optional docs manifest for @storybook/mcp
│       ├── mcp.config.json        # Optional: MCP server config
│       ├── extra-prompt-*.md      # Optional: additional context
│       ├── hooks.ts               # Optional: lifecycle hooks
│       └── trials/                # Generated trial runs
│           └── {context}-{agent}-{timestamp}-{unique}/
│               ├── prompt.md      # Full prompt sent to agent
│               ├── project/       # Generated project code
│               └── results/       # Grading results
│                   ├── summary.json
│                   ├── transcript.json
│                   ├── build-output.txt
│                   ├── typecheck-output.txt
│                   ├── lint-output.txt
│                   └── test-results.json
├── templates/
│   ├── project/                   # Base Vite + React + Storybook template
│   └── grading/                   # Test/lint configs for grading
├── variant-configs/               # Variant configs
└── lib/
    ├── agents/                    # Agent implementations
    ├── eval/                      # Eval runner logic
    ├── graders/                   # Grading runners (build, test, lint, etc.)
    └── *.ts                       # Core harness logic
```
To create a new task:

1. Create a task directory:

   ```sh
   mkdir tasks/200-my-component
   ```

2. Write `prompt.md`:

   ```
   Build a SearchBar component with autocomplete...

   <technical_requirements>
   1. Component MUST be default export in src/components/SearchBar.tsx
   2. Component MUST have data-testid="search-bar"
   </technical_requirements>
   ```

3. Optional: add context files:

   - `manifests/components.json` - component manifest for Storybook MCP (in a `manifests/` subdirectory)
   - `mcp.config.json` - custom MCP server configuration
   - `extra-prompt-*.md` - supplementary instructions

4. Optional: create `hooks.ts`:

   ```ts
   import type { Hooks } from '../../types.ts';

   export default {
     async postPrepareTrial(args, log) {
       // Custom setup (e.g., copy fixtures)
     },
   } satisfies Hooks;
   ```
Each trial produces:
- Build success: Does the project build without errors?
- Type check: Number of TypeScript compilation errors
- Lint: Number of ESLint errors
- Tests: Storybook story results (passed/failed), including play functions
- Accessibility: Number of Axe violations
- Coverage: Vite/Vitest coverage summary (lines/statements/branches/functions)
- Cost: API usage cost in USD
- Duration: Total time and API time, in seconds
- Turns: Number of agent transcript turns
Complete metrics from execution and grading:
```json
{
  "cost": 0.1234,
  "duration": 45,
  "turns": 8,
  "buildSuccess": true,
  "typeCheckErrors": 0,
  "lintErrors": 0,
  "test": { "passed": 3, "failed": 0 },
  "a11y": { "violations": 1 },
  "coverage": {
    "lines": 87.5,
    "statements": 86.9,
    "branches": 75.0,
    "functions": 80.0
  }
}
```

Complete transcript log with:
- All assistant and user messages
- Tool calls with arguments
- Token counts and costs per message
- Todo list progress
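For example, a transcript log can be mined for simple statistics such as how often the agent invoked each tool. The message shape below is an assumption for illustration, not the harness's documented `transcript.json` schema:

```typescript
// Illustrative sketch: tally tool calls per tool name from a transcript.
// The TranscriptMessage shape is assumed; check transcript.json for the real one.
interface TranscriptMessage {
  role: string;
  toolCalls?: { name: string }[];
}

function countToolCalls(messages: TranscriptMessage[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const message of messages) {
    for (const call of message.toolCalls ?? []) {
      counts[call.name] = (counts[call.name] ?? 0) + 1;
    }
  }
  return counts;
}
```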
- `test-results.json` - Detailed test outcomes
- `build-output.txt` - Build logs
- `typecheck-output.txt` - TypeScript errors
- `lint-output.txt` - ESLint output
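When comparing variants, the per-trial `summary.json` files are the natural input for aggregation. A minimal sketch, assuming a simplified summary shape matching the example above (the `aggregate` helper is illustrative, not part of the harness):

```typescript
// Illustrative sketch: aggregate metrics across trial summary.json files.
// Only a subset of the summary fields is modeled here.
interface TrialSummary {
  cost: number;
  buildSuccess: boolean;
  test: { passed: number; failed: number };
  a11y: { violations: number };
}

function aggregate(summaries: TrialSummary[]) {
  const n = summaries.length;
  return {
    trials: n,
    // Fraction of trials whose project built successfully
    buildSuccessRate: summaries.filter((s) => s.buildSuccess).length / n,
    // Total API spend across all trials, in USD
    totalCost: summaries.reduce((sum, s) => sum + s.cost, 0),
    // Mean Axe violations per trial
    meanA11yViolations: summaries.reduce((sum, s) => sum + s.a11y.violations, 0) / n,
  };
}
```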
Customize trial behavior with `hooks.ts`:

```ts
import type { Hooks } from '../../types.ts';

export default {
  prePrepareTrial: async (args, log) => {
    // Before project template copy
  },
  postPrepareTrial: async (args, log) => {
    // After dependencies installed
  },
  preExecuteAgent: async (args, log) => {
    // Before agent starts
  },
  postExecuteAgent: async (args, log) => {
    // After agent completes
  },
  preGrade: async (args, log) => {
    // Before grading runs
  },
  postGrade: async (args, log) => {
    // After grading completes
  },
} satisfies Hooks;
```

Inspect the generated project:

```sh
cd tasks/100-flight-booking-plain/trials/{trial-name}/project
pnpm storybook
```

View transcript: open `results/transcript.json` to see agent activity.
- Use `--verbose` to see detailed agent activity and tool calls
- Check `transcript.json` to debug agent behavior
- Use extra prompts to guide the agent without modifying the main prompt
- Component manifests work best when agents need library documentation