A feedback loop for people building AI skills and MCP servers.
You're building a skill, an MCP server, or a custom prompt strategy that's supposed to make an AI coding assistant better at a specific job. But how do you know it actually works? How do you know your latest commit made things better and not worse?
Pitlane gives you the answer. Define the tasks your skill should help with, set up a baseline (assistant without your skill) and a challenger (assistant with your skill), and race them. The results tell you with numbers, not vibes, whether your work is paying off.
In motorsport, the pit lane is where engineers tune the car between laps. Swap a part, adjust the setup, check the telemetry, see if the next lap is faster.
Building skills and MCP servers works the same way:
- Tune: change your skill, update your MCP server, tweak your prompts
- Race: run the assistant with and without your changes against real coding tasks
- Check the telemetry: did pass rates go up? Did quality improve? Did it get faster or cheaper?
- Repeat: go back to the pit, make another adjustment, race again
Pitlane is the telemetry system. You build the skill, pitlane tells you if it's working.
- YAML-based benchmark definitions (easy to version and diff)
- Deterministic assertions (file checks, command execution, custom scripts)
- Similarity metrics (ROUGE, BLEU, BERTScore, cosine similarity)
- Metrics tracking (time, tokens, cost, file changes)
- JUnit XML output (junit.xml) for native CI test reporting
- Interactive HTML reports with side-by-side agent comparison
- Parallel execution and repeated runs with statistics
- Graceful interrupt handling (Ctrl+C generates partial reports)
- TDD workflow support (red-green-refactor)
- Quick Start
- Supported Assistants
- Usage
- Writing Benchmarks
- TDD Workflow
- Editor Integration
- Security Considerations
- Contributing
- License
You'll need uv, a fast Python package installer.
Install on macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
Install on Windows:
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Note: Windows should work but has not been extensively tested. If you encounter issues or can help validate Windows support, contributions are welcome!
Install pitlane:
uv tool install pitlane --from git+https://github.com/pitlane-ai/pitlane.git
Or run without installing:
uvx --from git+https://github.com/pitlane-ai/pitlane.git pitlane run examples/simple-codegen-eval.yaml
Before running your first evaluation, make sure an AI coding assistant is installed and authenticated on your machine. Choose one from Supported Assistants and install its CLI tool following the official documentation; most assistants prompt you to log in on first use.
The example below uses OpenCode because it's free and requires no API key, but you can use any supported assistant by editing the YAML file.
Initialize a project with example benchmarks:
pitlane init --with-examples
Run the evaluation:
pitlane run examples/simple-codegen-eval.yaml
Results appear in runs/ with an HTML report showing pass rates and metrics.
Want to use a different assistant? Edit examples/simple-codegen-eval.yaml and uncomment your preferred assistant configuration. See Supported Assistants for options.
Need help designing benchmarks? Install the pitlane skill for AI-guided assistance:
npx skills add pitlane-ai/pitlane
Your AI assistant can then help you create effective eval benchmarks. See Writing Benchmarks for details.
| Assistant | Type | Status |
|---|---|---|
| Claude Code | claude-code | ✅ Tested |
| Mistral Vibe | mistral-vibe | ✅ Tested |
| OpenCode | opencode | ✅ Tested |
| Bob | bob | ✅ Tested |
For the latest status and additional assistants in development, see PR #32.
Want to add support for another assistant? Contributions are welcome! See the Contributing Guide for instructions on implementing new assistants.
Run all tasks against all configured assistants:
pitlane run examples/simple-codegen-eval.yaml
Run specific tasks or assistants:
Run a single task:
pitlane run examples/simple-codegen-eval.yaml --task hello-world-python
Run specific assistants (comma-separated):
pitlane run examples/simple-codegen-eval.yaml --only-assistants claude-baseline
Skip assistants:
pitlane run examples/simple-codegen-eval.yaml --skip-assistants claude-baseline
Combine filters:
pitlane run examples/simple-codegen-eval.yaml --task hello-world-python --only-assistants claude-baseline
Speed up multi-task benchmarks:
pitlane run examples/simple-codegen-eval.yaml --parallel 4
Run tasks multiple times to measure consistency and get aggregated statistics:
pitlane run examples/simple-codegen-eval.yaml --repeat 5
This runs each task 5 times and reports avg/min/max/stddev for all metrics in the HTML report.
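To make the aggregated statistics concrete, here is a minimal sketch of how avg/min/max/stddev could be computed for one metric across repeated runs. The `summarize` helper and the sample durations are hypothetical, and the use of the sample standard deviation is an assumption; pitlane's own calculation may differ.

```python
import statistics


def summarize(values):
    """Return avg/min/max/stddev for one metric across repeated runs."""
    return {
        "avg": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
        # Assumption: sample standard deviation, which needs at least two runs.
        "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Hypothetical wall-clock times (seconds) from five repeats of one task.
durations = [12.0, 15.0, 11.0, 14.0, 13.0]
summary = summarize(durations)
print(summary)
```

A high stddev relative to the average suggests the task is flaky for that assistant, so a single run would not be a reliable comparison.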
Every run creates debug.log with detailed execution information. Stream output to terminal in real-time:
pitlane run examples/simple-codegen-eval.yaml --verbose
All assertions include detailed logging to help diagnose failures.
Press Ctrl+C to stop a run. You'll get a partial HTML report with results from completed tasks.
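The general pattern behind a partial report can be sketched as follows: keep results as tasks finish, catch the interrupt, and report whatever completed. This is an illustration of the idea only, not pitlane's implementation; `run_tasks` and the task callables are hypothetical.

```python
def run_tasks(tasks):
    """Run (name, callable) pairs in order, keeping results if interrupted."""
    completed = {}
    interrupted = False
    try:
        for name, task in tasks:
            completed[name] = task()
    except KeyboardInterrupt:
        # Ctrl+C arrives as KeyboardInterrupt; stop, but keep finished results.
        interrupted = True
    return completed, interrupted


def passing_task():
    return "pass"


def interrupted_task():
    # Simulates the user pressing Ctrl+C while this task is running.
    raise KeyboardInterrupt


tasks = [
    ("task-a", passing_task),
    ("task-b", interrupted_task),
    ("task-c", passing_task),
]
completed, interrupted = run_tasks(tasks)
print(completed, interrupted)
```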
By default, report.html opens in your browser after each run. To disable this:
pitlane run examples/simple-codegen-eval.yaml --no-open
The same flag works when regenerating a report:
pitlane report runs/2024-01-01_12-00-00 --no-open
Initialize new benchmark project:
pitlane init
Initialize with example benchmarks:
pitlane init --with-examples
Generate JSON Schema for YAML validation:
pitlane schema generate
Install VS Code YAML validation (safe, with preview):
pitlane schema install
Regenerate HTML report from existing junit.xml:
pitlane report runs/2024-01-01_12-00-00
Benchmarks are YAML files with two sections: assistants and tasks.
Need help designing effective benchmarks? Install the pitlane skill for AI-guided assistance:
npx skills add pitlane-ai/pitlane
Your AI assistant can help you design eval benchmarks that actually measure whether your skills or MCP servers improve performance.
assistants:
  claude-baseline:
    type: claude-code
    args:
      model: haiku
tasks:
  - name: hello-world-python
    prompt: "Create a Python script called hello.py that prints 'Hello, World!'"
    workdir: ./fixtures/empty
    timeout: 120
    assertions:
      - file_exists: "hello.py"
      - command_succeeds: "python3 hello.py"
      - file_contains: { path: "hello.py", pattern: "Hello, World!" }
Each assistant defines how to run a model:
assistants:
  # Baseline configuration
  claude-baseline:
    type: claude-code
    args:
      model: haiku
  # With skills/MCP
  claude-with-skill:
    type: claude-code
    args:
      model: haiku
    skills:
      # Remote: GitHub reference (installed via npx skills add)
      - source: org/repo
        skill: my-skill-name
      # Local: directory path (for development/testing)
      - source: ./path/to/my-skill
Each task specifies:
- name: Unique identifier
- prompt: Instructions for the assistant
- workdir: Fixture directory (copied for each run)
- timeout: Maximum seconds
- assertions: Checks to verify success
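As a rough illustration of what the deterministic assertions from the hello-world example check, the sketch below implements file_exists, file_contains, and command_succeeds as plain Python functions. The function signatures are hypothetical and this is not pitlane's code; it only shows the kind of check each assertion name implies, run against a temporary directory standing in for the copied workdir.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path


def file_exists(workdir, path):
    """True if the path exists inside the working directory."""
    return (Path(workdir) / path).exists()


def file_contains(workdir, path, pattern):
    """True if the file exists and the regex pattern matches its contents."""
    target = Path(workdir) / path
    return target.is_file() and re.search(pattern, target.read_text()) is not None


def command_succeeds(workdir, command, timeout=60):
    """True if the shell command exits with code 0 when run in the workdir."""
    try:
        result = subprocess.run(
            command, shell=True, cwd=workdir, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Stand-in for a workdir after the assistant has created hello.py.
with tempfile.TemporaryDirectory() as workdir:
    Path(workdir, "hello.py").write_text("print('Hello, World!')\n")
    results = {
        "file_exists": file_exists(workdir, "hello.py"),
        "file_contains": file_contains(workdir, "hello.py", r"Hello, World!"),
        # sys.executable keeps the demo independent of a python3 binary on PATH.
        "command_succeeds": command_succeeds(workdir, f'"{sys.executable}" hello.py'),
    }
print(results)
```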
assertions:
  # File exists
  - file_exists: "main.py"
  # Command succeeds (exit code 0)
  - command_succeeds: "python main.py"
  # Command fails (non-zero exit)
  - command_fails: "python main.py --invalid"
  # File contains pattern (regex)
  - file_contains:
      path: "main.py"
      pattern: "def main\\(\\):"
  # Custom script validation
  - custom_script:
      script: "./validate.sh"
      interpreter: "bash"
      timeout: 30
      expected_exit_code: 0
When you need more complex validation logic than simple commands provide, use custom_script to run a dedicated test script. This is useful for multi-step validation, complex parsing, or reusable test logic.
Simple form (expects exit code 0):
- custom_script: "scripts/validate_output.sh"
- custom_script: "python scripts/validate.py"
- custom_script: "node scripts/check.js"
Advanced form with options:
- custom_script:
    script: "python scripts/validate_output.py"
    args: ["--strict", "--format=json"]
    timeout: 30
    expected_exit_code: 0
Options:
- script: Shell command to execute (e.g., python script.py, node script.js, ./script.sh)
- args: List of arguments to pass to the script (optional)
- timeout: Maximum seconds to wait for completion (default: 60)
- expected_exit_code: Exit code that indicates success (default: 0)
The script field is executed as a shell command in the workdir, so you can use any interpreter:
- Python: python validate.py or python3 validate.py
- Node.js: node check.js
- Executable scripts: ./validate.sh (must have shebang and be executable)
- Any command: Works like command_succeeds but with more control over timeout and exit codes
Your script receives the workdir as its working directory, so it can access generated files directly. The assertion passes if the script exits with the expected code.
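The behaviour described above (run a shell command in the workdir, enforce a timeout, compare the exit code) can be sketched in a few lines of Python. `run_custom_script` is a hypothetical helper written for illustration; the defaults mirror the documented ones (timeout 60, expected exit code 0), but the way pitlane actually joins args onto the command is an assumption.

```python
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path


def run_custom_script(script, workdir, args=(), timeout=60, expected_exit_code=0):
    """Run a script as a shell command in workdir and compare its exit code."""
    # Assumption: args are appended to the command, quoted for a POSIX shell.
    command = " ".join([script, *(shlex.quote(arg) for arg in args)])
    try:
        result = subprocess.run(
            command, shell=True, cwd=workdir, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # A script that exceeds the timeout counts as a failed assertion.
        return False
    return result.returncode == expected_exit_code


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for files an assistant generated in the workdir.
    Path(workdir, "main.tf").write_text('resource "aws_s3_bucket" "b" {}\n')
    # Hypothetical validator: exit 0 if the resource is present, 2 otherwise.
    Path(workdir, "check.py").write_text(
        "import sys\n"
        "sys.exit(0 if 'aws_s3_bucket' in open('main.tf').read() else 2)\n"
    )
    script = f'"{sys.executable}" check.py'
    passed = run_custom_script(script, workdir)
    expecting_failure = run_custom_script(script, workdir, expected_exit_code=2)
print(passed, expecting_failure)
```

Setting expected_exit_code to a non-zero value is how a script-based check can assert that something is correctly rejected.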
Example validation script (scripts/validate_tf.sh):
#!/bin/bash
# Check if Terraform config is valid and contains required resources
terraform validate || exit 1
grep -q "aws_s3_bucket" main.tf || exit 2
exit 0
Use it in your eval:
- custom_script: "scripts/validate_tf.sh"
When exact matching isn't practical, use similarity metrics:
assertions:
  # ROUGE: topic coverage (good for docs)
  - rouge:
      actual: "README.md"
      expected: "./refs/golden.md"
      metric: "rougeL"
      min_score: 0.35
  # BLEU: phrase matching (good for docs, not code)
  - bleu:
      actual: "README.md"
      expected: "./refs/golden.md"
      min_score: 0.2
  # BERTScore: semantic similarity (good for docs/code)
  - bertscore:
      actual: "README.md"
      expected: "./refs/golden.md"
      min_score: 0.75
  # Cosine similarity: overall meaning (good for code/configs)
  - cosine_similarity:
      actual: "variables.tf"
      expected: "./refs/expected-vars.tf"
      min_score: 0.7
Choosing metrics:
| Metric | Question | Speed | Best For |
|---|---|---|---|
| rouge | Same topics? | Fast | Documentation coverage |
| bleu | Same phrases? | Fast | Documentation phrasing |
| bertscore | Same meaning? | Slow | Semantic preservation |
| cosine_similarity | Same subject? | Slow | Code/config similarity |
Use deterministic assertions first. Add similarity metrics when you need fuzzy matching.
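To give a feel for what a min_score threshold means, here is a simplified ROUGE-L computation: the F1 score over the longest common subsequence of tokens. It assumes plain lowercase whitespace tokenization with no stemming, so its scores will not match a full ROUGE library exactly, and it is not the code pitlane runs. The sample sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(actual, expected):
    """Simplified ROUGE-L F1 using lowercase whitespace tokens."""
    actual_tokens = actual.lower().split()
    expected_tokens = expected.lower().split()
    if not actual_tokens or not expected_tokens:
        return 0.0
    lcs = lcs_length(actual_tokens, expected_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(actual_tokens)
    recall = lcs / len(expected_tokens)
    return 2 * precision * recall / (precision + recall)


expected = "install the tool then run the benchmark"
actual = "install the tool and run the benchmark suite"
score = rouge_l_f1(actual, expected)
passed = score >= 0.35  # same threshold as min_score in the rouge example
print(round(score, 3), passed)
```

Because the score rewards shared word order rather than exact equality, small rewordings still pass, which is why thresholds for generated documentation are usually set well below 1.0.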
Make some assertions count more:
assertions:
  - file_exists: "main.tf"
  - command_succeeds: "terraform validate"
    weight: 3.0  # 3x more important
  - rouge:
      actual: "README.md"
      expected: "./refs/golden.md"
      metric: "rougeL"
      min_score: 0.3
    weight: 2.0
Results include both assertion_pass_rate (binary) and weighted_score (continuous).
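One plausible way to read the two numbers is sketched below: the pass rate counts assertions equally, while the weighted score divides the weight of passing assertions by the total weight. This formula, the default weight of 1.0, and the `score_assertions` helper are all assumptions for illustration; pitlane may, for example, give similarity assertions partial credit based on their raw score.

```python
def score_assertions(results):
    """Score a list of (passed, weight) pairs.

    Assumed formulas, for illustration only:
      assertion_pass_rate = passed assertions / all assertions
      weighted_score      = weight of passed assertions / total weight
    """
    total = len(results)
    total_weight = sum(weight for _, weight in results)
    passed_count = sum(1 for passed, _ in results if passed)
    passed_weight = sum(weight for passed, weight in results if passed)
    return {
        "assertion_pass_rate": passed_count / total if total else 0.0,
        "weighted_score": passed_weight / total_weight if total_weight else 0.0,
    }


# Hypothetical outcome for the three assertions in the weighted example:
# file_exists passes (assumed default weight 1.0), terraform validate
# fails (weight 3.0), and the rouge check passes (weight 2.0).
results = [(True, 1.0), (False, 3.0), (True, 2.0)]
scores = score_assertions(results)
print(scores)
```

Under these assumptions, failing the heavily weighted check pulls the weighted score down to 0.5 even though two of three assertions pass.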
The examples/ directory contains working benchmarks you can use as starting points:
- simple-codegen-eval.yaml: Minimal example with deterministic assertions
- similarity-codegen-eval.yaml: Demonstrates all similarity metrics
- terraform-module-eval.yaml: Real-world Terraform evaluation
- weighted-grading-eval.yaml: Weighted assertions and continuous scoring
Treat benchmarks like tests:
- Red: Add or tighten assertions
- Green: Update skills/MCP until assertions pass
- Refactor: Clean up without breaking tests
This lets you iterate on what "good" means without guessing.
Enable YAML validation:
pitlane schema install
This adds JSON Schema validation to .vscode/settings.json with preview and backup.
Manual setup:
{
  "yaml.schemas": {
    "./schemas/pitlane.schema.json": [
      "eval.yaml",
      "examples/*.yaml",
      "**/*eval*.y*ml"
    ]
  },
  "yaml.validate": true
}
Generate schema and docs:
pitlane schema generate
This outputs:
- schemas/pitlane.schema.json
- docs/schema.md
Execution is not sandboxed. Pitlane runs assistants directly on your system using their native CLIs. While this provides full functionality and realistic testing conditions, it means assistants have the same file system and network access as any other process you run.
Recommended precautions:
- Run evaluations in a Docker container or virtual machine when testing untrusted code or prompts
- Review benchmark tasks and assertions before running them
- Use dedicated test environments rather than production systems
- Be cautious with benchmarks that involve sensitive data or credentials
The native CLI approach is intentional—it ensures pitlane tests assistants in real-world conditions. But like any development tool that executes code, reasonable precautions are advisable.
See CONTRIBUTING.md for development setup, testing guidelines, and how to submit changes.
Apache 2.0