This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This repo is a generic framework for evaluating MCP server implementations side-by-side using Braintrust evals. It ships with Stripe and Increase eval suites but can be extended to benchmark any MCP server.
The harness runs an agent loop (Claude Code via the Agent SDK) against each MCP server defined in a suite config, then scores responses on factuality, completeness, and efficiency.
Each suite lives in src/suites/<name>/ as a directory containing:
- `suite.ts`: default-exports a `SuiteConfig` with servers, test cases, system prompt, and project name
- Optional supporting files (e.g. `fixtures.json` for test data seeding)
The SuiteConfig schema (src/suite.ts) includes:
- `projectName`: Braintrust project name
- `systemPrompt`: prompt given to the agent
- `servers[]`: MCP server configs (command, args, env, capabilities, mode)
- `testCases[]`: prompts with expected results (description, containsText, fieldValues)
- `setup`: optional shell command to seed test data (e.g. `stripe fixtures ...`)
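As a rough illustration of the fields listed above, a suite config might look like the sketch below. The interfaces are hypothetical stand-ins written for this example: the real types and Zod schema live in `src/suite.ts`, and the field types, the `name` and `prompt` fields, the nesting under `expected`, and every value shown are assumptions.

```typescript
// Hypothetical shapes for illustration only; see src/suite.ts for the real schema.
interface ServerConfig {
  name: string; // assumed identifier field
  command: string;
  args: string[];
  env?: Record<string, string>;
  capabilities?: string[]; // assumed to be a list of strings
  mode?: string; // assumed to be free-form
}

interface TestCase {
  prompt: string; // assumed field name
  expected: {
    description: string;
    containsText?: string[];
    fieldValues?: Record<string, string>;
  };
}

interface SuiteConfig {
  projectName: string;
  systemPrompt: string;
  servers: ServerConfig[];
  testCases: TestCase[];
  setup?: string; // optional shell command that seeds test data
}

// In a real suite.ts this object would be the default export.
const exampleSuite: SuiteConfig = {
  projectName: "example-mcp-evals",
  systemPrompt: "You are an assistant with access to MCP tools.",
  servers: [
    {
      name: "example-server",
      command: "npx",
      args: ["-y", "example-mcp-server"],
      env: { EXAMPLE_API_KEY: "placeholder" },
    },
  ],
  testCases: [
    {
      prompt: "How many customers exist?",
      expected: {
        description: "Reports the customer count",
        containsText: ["500"],
      },
    },
  ],
};
```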
All runner code, types, and the model registry live in src/agent/.
- `AnthropicRunner` (`anthropic-runner.ts`): Uses the Claude Agent SDK (`@anthropic-ai/claude-agent-sdk`) for standard Anthropic models.
- `AnthropicCodeRunner` (`anthropic-code-runner.ts`): Uses the raw `@anthropic-ai/sdk` for `-code` model aliases (e.g. `sonnet-code`, `opus-code`). Enables `defer_loading` on MCP tools, `tool_search`, and `code_execution` via the `advanced-tool-use` beta.
- `OpenAIRunner` (`openai-runner.ts`): Uses the OpenAI SDK for GPT/o-series models.
The createRunner(model) factory in src/agent/index.ts dispatches to the correct runner based on provider and codeMode.
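The dispatch rule can be pictured with the minimal sketch below. It is not the code in `src/agent/index.ts`: the `ModelConfig` shape, the provider string values, and the `pickRunner` helper are assumptions, and the function returns the runner's name rather than constructing a runner.

```typescript
// Hypothetical types; the real ones are defined in src/agent/types.ts.
type Provider = "anthropic" | "openai";

interface ModelConfig {
  provider: Provider;
  codeMode: boolean;
}

type RunnerName = "AnthropicRunner" | "AnthropicCodeRunner" | "OpenAIRunner";

// Returns the name of the runner that would handle the given model.
function pickRunner(model: ModelConfig): RunnerName {
  if (model.provider === "openai") {
    return "OpenAIRunner";
  }
  // Anthropic models split on codeMode: the -code aliases use the raw SDK runner.
  return model.codeMode ? "AnthropicCodeRunner" : "AnthropicRunner";
}
```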
- Completeness (`src/scorers/completeness.ts`): Heuristic; checks expected text strings and field values in output
- Efficiency (`src/scorers/efficiency.ts`): Heuristic; penalizes high turn count and token usage
- Correctness (`src/scorers/correctness.ts`): LLM-as-judge factuality via autoevals
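To make the completeness heuristic concrete, here is one way such a check could work. This is a sketch under assumptions, not the logic in `src/scorers/completeness.ts`: the `Expected` shape, the case-insensitive substring matching, and the "fraction of items found" formula are all illustrative choices.

```typescript
// Hypothetical expectation shape, loosely following the testCases fields.
interface Expected {
  containsText?: string[];
  fieldValues?: Record<string, string>;
}

// Returns the fraction of expected strings and field values that appear in the output.
// Returns 1 when nothing is expected.
function completenessScore(output: string, expected: Expected): number {
  const haystack = output.toLowerCase();
  const needles = [
    ...(expected.containsText ?? []),
    ...Object.values(expected.fieldValues ?? {}),
  ];
  if (needles.length === 0) {
    return 1;
  }
  const hits = needles.filter((needle) => haystack.includes(needle.toLowerCase())).length;
  return hits / needles.length;
}
```

A ratio keeps the score in the 0 to 1 range, so partially complete answers still earn partial credit.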
src/evals/e2e.eval.ts loads a suite via EVAL_SUITE env var, iterates over servers, and runs each test case through the agent runner. Results are scored and logged to Braintrust.
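The shape of that loop can be sketched as follows. The helper names and types here are hypothetical, the `stripe` fallback mirrors the default suite named in the commands, and the sketch pairs every server with every test case, whereas the real loop may filter cases per server (for example via `getTestCasesForServer()`).

```typescript
// Hypothetical minimal shapes for this sketch.
interface Server { name: string }
interface TestCase { prompt: string }
interface Suite { servers: Server[]; testCases: TestCase[] }
interface PlannedRun { server: string; prompt: string }

// Picks the suite name from an env-like object, falling back to "stripe".
function resolveSuiteName(env: Record<string, string | undefined>): string {
  return env.EVAL_SUITE?.trim() || "stripe";
}

// Expands a suite into one planned run per (server, test case) pair.
function planRuns(suite: Suite): PlannedRun[] {
  const runs: PlannedRun[] = [];
  for (const server of suite.servers) {
    for (const testCase of suite.testCases) {
      runs.push({ server: server.name, prompt: testCase.prompt });
    }
  }
  return runs;
}
```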
```bash
npm install

# Run the default suite (stripe)
npm run eval

# Run a specific suite
EVAL_SUITE=stripe npm run eval
EVAL_SUITE=increase npm run eval

# Convenience shortcuts
npm run eval:stripe
npm run eval:increase
```

Experiment records can be tagged from three sources:
- Test case tags: `tags` array on each test case, applied per-record via `EvalCase.tags`
- Server tags: optional `tags` array on each server config in `SharedServerFields`/`SharedServerConfig`, applied at the experiment level
- CLI tags: `EVAL_TAGS` env var (comma-separated), applied at the experiment level
Server and CLI tags are applied at the experiment level. Test case tags remain per-record. Tags must NOT be logged on child spans (Braintrust only allows tags on the root span).
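One way to combine the two experiment-level sources is sketched below. The helper names are hypothetical, and the whitespace trimming and de-duplication are assumptions rather than documented behavior of the harness.

```typescript
// Splits a comma-separated EVAL_TAGS value into trimmed, non-empty tags.
function parseCliTags(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);
}

// Merges server tags and CLI tags for the experiment, dropping duplicates
// while keeping first-seen order. Test case tags are not included here
// because they stay on individual records.
function experimentTags(serverTags: string[] | undefined, rawCliTags: string | undefined): string[] {
  return Array.from(new Set([...(serverTags ?? []), ...parseCliTags(rawCliTags)]));
}
```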
All API keys are configured via .env (gitignored). See .env.example for the full list.
- `src/suite.ts`: SuiteConfig type, Zod schema, loadSuite(), getTestCasesForServer()
- `src/suites/index.ts`: Auto-generated barrel file (do not edit; regenerated by `npm run generate:suites`)
- `src/suites/stripe/suite.ts`: Stripe eval suite (12 test cases, 2 servers)
- `src/suites/stripe/fixtures.json`: Stripe CLI fixtures for seeding 500 test customers
- `src/suites/increase/suite.ts`: Increase eval suite (30 test cases)
- `src/evals/e2e.eval.ts`: Generic eval loop
- `src/agent/anthropic-runner.ts`: Agent SDK runner (standard models)
- `src/agent/anthropic-code-runner.ts`: Raw SDK runner (code-mode models with defer_loading)
- `src/agent/openai-runner.ts`: OpenAI runner
- `src/agent/types.ts`: AgentRunner interface, AgentResult, ToolCallRecord, ModelConfig, Provider
- `src/agent/models.ts`: Model registry and resolveModel()
- `src/agent/index.ts`: Runner factory and re-exports
- `scripts/generate-suite-index.ts`: Scans `src/suites/*/suite.ts` and generates the barrel file
- `.env.example`: Required environment variables