A CLI-based eval harness for testing AI coding agents' ability to build UI components with Storybook and MCP tools.
This eval harness runs automated trials where AI coding agents (Claude Code CLI or GitHub Copilot CLI) are given prompts to build UI components. Each trial:
- Prepares a fresh Vite + React + Storybook project
- Executes the agent with a prompt and optional context (MCP servers, component manifests, or extra prompts)
- Grades the results using automated metrics: build success, type checking, linting, tests, and accessibility
The goal is to measure how well agents can use Storybook's MCP tools to build production-quality components.
> **Note**
> All uploaded task results are publicly available in this Google Sheet (uploading can be opted out of).
- Node.js 24+
- pnpm 10.19.0+
- Playwright (`npx playwright install`)
- Claude Code CLI (`npm install -g @anthropic-ai/claude-code`) - for the `claude-code` agent
- GitHub Copilot CLI (`gh extension install github/gh-copilot`) - for the `copilot-cli` agent
```sh
# Run eval across variants (recommended for batch testing)
node eval.ts

# Run a single advanced eval (interactive mode)
node advanced-eval.ts

# With all options specified (advanced-eval)
node advanced-eval.ts --agent claude-code --model claude-sonnet-4.6 --context components.json --upload-id batch-1 100-flight-booking-plain
```

The following options apply to `advanced-eval.ts`:
| Option | Short | Type | Description |
|---|---|---|---|
| `--agent` | `-a` | string | Which agent to use (`claude-code` or `copilot-cli`) |
| `--model` | `-m` | string | Which model to use (see Model Selection below) |
| `--context` | `-c` | string | Context type: `false`, `storybook-dev`, `*.json` (manifest), `mcp.config.json`, or `*.md` (extra prompts) |
| `--verbose` | `-v` | boolean | Show detailed logs during execution |
| `--storybook` | `-s` | boolean | Auto-start Storybook after completion |
| `--upload-id` | `-u` | string | Upload results to Google Sheets with this ID for grouping/filtering |
| `--no-upload-id` | - | - | Skip uploading results (default if no upload ID provided) |
| `--run-id` | - | string | Run identifier to group uploads together |
| `--help` | `-h` | - | Display help information |
Positional argument: the task directory name (e.g., `100-flight-booking-plain`)
Different agents support different models:
| Model | Claude Code CLI | Copilot CLI |
|---|---|---|
| `claude-opus-4.6` | ✅ | ✅ |
| `claude-opus-4.5` | ❌ | ✅ |
| `claude-sonnet-4.6` | ✅ | ✅ |
| `claude-haiku-4.5` | ✅ | ✅ |
| `gpt-5.2` | ❌ | ✅ |
| `gpt-5.2-codex` | ❌ | ✅ |
| `gpt-5.1-codex-max` | ❌ | ✅ |
| `gemini-3-pro-preview` | ❌ | ✅ |
Example usage:
```sh
# Claude Code with Opus (advanced-eval)
node advanced-eval.ts --agent claude-code --model claude-opus-4.5 100-flight-booking-plain

# Copilot CLI with GPT-5.2 (advanced-eval)
node advanced-eval.ts --agent copilot-cli --model gpt-5.2 100-flight-booking-plain
```

**Important: GitHub Copilot CLI Model Configuration**
To use models other than `claude-sonnet-4.6` with the Copilot CLI, you must first enable them in your GitHub account settings:
- Go to GitHub Copilot Features Settings
- Enable the models you want to use (e.g., GPT-5.1 Codex Max, GPT-5.2, Claude Opus 4.5)
- Save your settings
- Wait up to 30 minutes
Without enabling these models, the Copilot CLI will fail when attempting to use them.
The harness supports five context modes:

- **No context** (`--no-context`): Agent uses only default tools
- **Storybook MCP - Dev** (`--context storybook-dev`): Sets up a local Storybook dev server with an MCP endpoint
- **Storybook MCP - Docs** (`--context components.json`): Provides component documentation via the `@storybook/mcp` package
- **MCP server config** (`--context mcp.config.json` or inline JSON): Custom MCP server setup (use this for fully custom MCP servers, not for Storybook MCP)
- **Extra prompts** (`--context extra-prompt-01.md,extra-prompt-02.md`): Additional markdown files appended to the main prompt
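As a rough illustration of how a `--context` value maps onto these five modes, the classification could be sketched as follows. This is a hedged sketch: the function and mode names are illustrative, not the harness's actual API.

```typescript
// Illustrative only: classify a --context value into one of the five
// context modes described above. The real harness logic may differ.
type ContextMode =
  | 'none'
  | 'storybook-dev'
  | 'storybook-mcp-docs'
  | 'mcp-config'
  | 'extra-prompts';

function resolveContextMode(value: string | false): ContextMode {
  if (value === false || value === 'false') return 'none';
  if (value === 'storybook-dev') return 'storybook-dev';
  // mcp.config.json or inline JSON selects the custom MCP server mode
  if (value === 'mcp.config.json' || value.trim().startsWith('{')) return 'mcp-config';
  // Any other *.json is treated as a component manifest for @storybook/mcp
  if (value.endsWith('.json')) return 'storybook-mcp-docs';
  // One or more comma-separated *.md files are extra prompts
  if (value.split(',').every((f) => f.trim().endsWith('.md'))) return 'extra-prompts';
  throw new Error(`Unrecognized context value: ${value}`);
}
```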
Use `eval.ts` to run multiple trials across context variants and compare results.

```sh
# Interactive eval
node eval.ts

# Eval via pnpm script
pnpm eval
```

Variant configs live under `eval/variant-configs/` and define a base setup plus variants:
```ts
// eval/variant-configs/storybook-mcp-comparison.ts
const base = {
  agent: 'claude-code',
  model: 'claude-sonnet-4.6',
};

export default {
  name: 'storybook-mcp-comparison',
  variants: [
    {
      ...base,
      id: 'with-mcp',
      label: 'With Storybook MCP',
      context: [{ type: 'storybook-mcp-docs' }],
    },
    {
      ...base,
      id: 'without-mcp',
      label: 'Without MCP',
      context: [{ type: false }],
    },
  ],
};
```

```
eval/
├── tasks/                         # Task definitions
│   └── 100-flight-booking-plain/
│       ├── prompt.md              # Main prompt for the agent
│       ├── manifests/             # Optional: manifest files directory
│       │   ├── components.json    # Component manifest for @storybook/mcp
│       │   └── docs.json          # Optional docs manifest for @storybook/mcp
│       ├── mcp.config.json        # Optional: MCP server config
│       ├── extra-prompt-*.md      # Optional: additional context
│       ├── hooks.ts               # Optional: lifecycle hooks
│       └── trials/                # Generated trial runs
│           └── {context}-{agent}-{timestamp}-{unique}/
│               ├── prompt.md      # Full prompt sent to agent
│               ├── project/       # Generated project code
│               └── results/       # Grading results
│                   ├── summary.json
│                   ├── transcript.json
│                   ├── build-output.txt
│                   ├── typecheck-output.txt
│                   ├── lint-output.txt
│                   └── test-results.json
├── templates/
│   ├── project/                   # Base Vite + React + Storybook template
│   └── grading/                   # Test/lint configs for grading
├── variant-configs/               # Variant configs
└── lib/
    ├── agents/                    # Agent implementations
    ├── eval/                      # Eval runner logic
    ├── graders/                   # Grading runners (build, test, lint, etc.)
    └── *.ts                       # Core harness logic
```
To create a new task:

1. Create a task directory:

   ```sh
   mkdir tasks/200-my-component
   ```

2. Write `prompt.md`:

   ```
   Build a SearchBar component with autocomplete...

   <technical_requirements>
   1. Component MUST be default export in src/components/SearchBar.tsx
   2. Component MUST have data-testid="search-bar"
   </technical_requirements>
   ```

3. Optional: add context files:

   - `manifests/components.json` - component manifest for Storybook MCP (in a `manifests/` subdirectory)
   - `mcp.config.json` - custom MCP server configuration
   - `extra-prompt-*.md` - supplementary instructions

4. Optional: create `hooks.ts`:

   ```ts
   import type { Hooks } from '../../types.ts';

   export default {
     async postPrepareTrial(args, log) {
       // Custom setup (e.g., copy fixtures)
     },
   } satisfies Hooks;
   ```
Each trial produces:
- Build success: Does the project build without errors?
- Type check: Number of TypeScript compilation errors
- Lint: Number of ESLint errors
- Tests: Storybook story results (passed/failed), including play functions
- Accessibility: Number of Axe violations
- Coverage: Vite/Vitest coverage summary (lines/statements/branches/functions)
- Cost: API usage cost in USD
- Duration: Total time and API time, in seconds
- Turns: Number of agent transcript turns
Complete metrics from execution and grading:
```json
{
  "cost": 0.1234,
  "duration": 45,
  "turns": 8,
  "buildSuccess": true,
  "typeCheckErrors": 0,
  "lintErrors": 0,
  "test": { "passed": 3, "failed": 0 },
  "a11y": { "violations": 1 },
  "coverage": {
    "lines": 87.5,
    "statements": 86.9,
    "branches": 75.0,
    "functions": 80.0
  }
}
```

Complete transcript log with:
- All assistant and user messages
- Tool calls with arguments
- Token counts and costs per message
- Todo list progress
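For example, a transcript log can be mined for simple statistics such as how often the agent invoked each tool. The message shape below is an assumption for illustration, not the harness's documented `transcript.json` schema:

```typescript
// Illustrative sketch: tally tool calls per tool name from a transcript.
// The TranscriptMessage shape is assumed; check transcript.json for the real one.
interface TranscriptMessage {
  role: string;
  toolCalls?: { name: string }[];
}

function countToolCalls(messages: TranscriptMessage[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const message of messages) {
    for (const call of message.toolCalls ?? []) {
      counts[call.name] = (counts[call.name] ?? 0) + 1;
    }
  }
  return counts;
}
```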
- `test-results.json` - Detailed test outcomes
- `build-output.txt` - Build logs
- `typecheck-output.txt` - TypeScript errors
- `lint-output.txt` - ESLint output
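When comparing variants, the per-trial `summary.json` files are the natural input for aggregation. A minimal sketch, assuming a simplified summary shape matching the example above (the `aggregate` helper is illustrative, not part of the harness):

```typescript
// Illustrative sketch: aggregate metrics across trial summary.json files.
// Only a subset of the summary fields is modeled here.
interface TrialSummary {
  cost: number;
  buildSuccess: boolean;
  test: { passed: number; failed: number };
  a11y: { violations: number };
}

function aggregate(summaries: TrialSummary[]) {
  const n = summaries.length;
  return {
    trials: n,
    // Fraction of trials whose project built successfully
    buildSuccessRate: summaries.filter((s) => s.buildSuccess).length / n,
    // Total API spend across all trials, in USD
    totalCost: summaries.reduce((sum, s) => sum + s.cost, 0),
    // Mean Axe violations per trial
    meanA11yViolations: summaries.reduce((sum, s) => sum + s.a11y.violations, 0) / n,
  };
}
```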
Customize trial behavior with `hooks.ts`:

```ts
import type { Hooks } from '../../types.ts';

export default {
  prePrepareTrial: async (args, log) => {
    // Before project template copy
  },
  postPrepareTrial: async (args, log) => {
    // After dependencies installed
  },
  preExecuteAgent: async (args, log) => {
    // Before agent starts
  },
  postExecuteAgent: async (args, log) => {
    // After agent completes
  },
  preGrade: async (args, log) => {
    // Before grading runs
  },
  postGrade: async (args, log) => {
    // After grading completes
  },
} satisfies Hooks;
```

Inspect the generated project:

```sh
cd tasks/100-flight-booking-plain/trials/{trial-name}/project
pnpm storybook
```

View transcript: open `results/transcript.json` to see agent activity.
- Use `--verbose` to see detailed agent activity and tool calls
- Check `transcript.json` to debug agent behavior
- Use extra prompts to guide the agent without modifying the main prompt
- Component manifests work best when agents need library documentation