Evaluation harness for AI agents. Define structured test suites with rubrics, run your agent against them, get LLM-as-judge scores, inspect full execution traces, and diff runs over time to catch regressions.
Building an agent is easy. Knowing whether yesterday's prompt tweak made it better or worse is hard.
Most teams ship agents with vibes-based testing — a few manual prompts, no structured rubric, no regression catching. RubricLab is the missing dev-loop tool: write test cases once, score every change automatically, and see exactly where behavior drifted.
M1 — scaffold
- Monorepo: apps/api (FastAPI), apps/web (Next.js 15), packages/shared TypeScript types
- GET / and GET /health live; dev tooling (uv + ruff, pnpm + prettier + eslint)
M2 — data model + storage
- SQLModel entities: Suite, Case, Run, CaseResult, Trace, RubricScore
- SQLite bootstrap on startup; demo suite "Research Agent v1" (8 cases) seeded idempotently
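
A minimal sketch of what idempotent seeding could look like. It uses the standard-library sqlite3 module instead of SQLModel to stay dependency-free, and the table names, columns, and placeholder case prompts are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

DEMO_SUITE = "Research Agent v1"
# Placeholder prompts: the real demo cases are not listed in this README.
DEMO_CASES = [f"placeholder case {i}" for i in range(1, 9)]


def bootstrap(conn: sqlite3.Connection) -> None:
    """Create tables if missing and seed the demo suite at most once."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS suite (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS test_case (
            id INTEGER PRIMARY KEY,
            suite_id INTEGER NOT NULL REFERENCES suite(id),
            prompt TEXT NOT NULL,
            UNIQUE (suite_id, prompt)
        );
        """
    )
    # INSERT OR IGNORE skips rows that would violate a UNIQUE constraint,
    # so running this on every startup does not create duplicates.
    conn.execute("INSERT OR IGNORE INTO suite (name) VALUES (?)", (DEMO_SUITE,))
    (suite_id,) = conn.execute(
        "SELECT id FROM suite WHERE name = ?", (DEMO_SUITE,)
    ).fetchone()
    conn.executemany(
        "INSERT OR IGNORE INTO test_case (suite_id, prompt) VALUES (?, ?)",
        [(suite_id, prompt) for prompt in DEMO_CASES],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
bootstrap(conn)
bootstrap(conn)  # second call simulates an application restart
suite_count = conn.execute("SELECT COUNT(*) FROM suite").fetchone()[0]
case_count = conn.execute("SELECT COUNT(*) FROM test_case").fetchone()[0]
```

The uniqueness constraints carry the idempotency here; an equivalent SQLModel version would likely check for the suite by name before inserting.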
M3 — agent runner + trace capture
- AgentRunner protocol (apps/api/src/rubriclab/runner.py); ResearchAgent with web_search + calculator tools
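
To illustrate the idea of a runner protocol with trace capture, here is a hedged sketch. The method name `run`, the `TraceEvent` and `AgentResult` shapes, and the toy `EchoCalculatorAgent` are all hypothetical; the real definitions live in runner.py and may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class TraceEvent:
    kind: str  # "message" or "tool_call" (assumed event kinds)
    name: str
    payload: dict[str, Any]


@dataclass
class AgentResult:
    output: str
    trace: list[TraceEvent] = field(default_factory=list)


class AgentRunner(Protocol):
    """Any object with a matching run() method satisfies this protocol."""

    def run(self, prompt: str) -> AgentResult: ...


class EchoCalculatorAgent:
    """Toy stand-in agent: answers 'a + b' prompts via a calculator step."""

    def run(self, prompt: str) -> AgentResult:
        trace = [TraceEvent("message", "user", {"text": prompt})]
        left, right = (float(part) for part in prompt.split("+"))
        total = left + right
        # Record the tool call so a trace viewer could replay it later.
        trace.append(
            TraceEvent("tool_call", "calculator", {"args": [left, right], "result": total})
        )
        trace.append(TraceEvent("message", "assistant", {"text": str(total)}))
        return AgentResult(output=str(total), trace=trace)


def run_case(runner: AgentRunner, prompt: str) -> AgentResult:
    # The harness depends only on the protocol, not on a concrete agent class.
    return runner.run(prompt)


result = run_case(EchoCalculatorAgent(), "2 + 3")
```

Using a structural protocol means a user's agent does not need to inherit from any harness class; it only has to expose a compatible method.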
M4 — LLM-as-judge engine
- Anthropic-powered rubric scoring with per-dimension scores and written justifications
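
As a rough sketch of the scoring side, the snippet below parses a judge reply into per-dimension scores with justifications. It leaves out the model call itself. The JSON reply shape, the 1 to 5 scale, the dimension names, and the pass threshold are assumptions made for this example, not the project's documented format.

```python
import json
from dataclasses import dataclass


@dataclass
class RubricScore:
    dimension: str
    score: int
    justification: str


def parse_judge_reply(
    reply: str, dimensions: list[str], max_score: int = 5
) -> list[RubricScore]:
    """Validate a judge's JSON reply and return one score per dimension."""
    data = json.loads(reply)
    scores = []
    for dim in dimensions:
        entry = data.get(dim)
        if entry is None:
            raise ValueError(f"judge reply is missing dimension {dim!r}")
        score = int(entry["score"])
        if not 1 <= score <= max_score:
            raise ValueError(f"score for {dim!r} out of range: {score}")
        scores.append(RubricScore(dim, score, str(entry["justification"])))
    return scores


def passed(scores: list[RubricScore], threshold: float = 3.5) -> bool:
    """Pass when the mean score meets the (assumed) threshold."""
    return sum(s.score for s in scores) / len(scores) >= threshold


# Hand-written stand-in for a judge model's reply.
sample_reply = json.dumps(
    {
        "accuracy": {"score": 4, "justification": "Answer matches the reference."},
        "tool_use": {"score": 2, "justification": "Searched when a calculation was needed."},
    }
)
scores = parse_judge_reply(sample_reply, ["accuracy", "tool_use"])
verdict = passed(scores)
```

Validating the reply before storing it keeps malformed judge output from silently turning into scores.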
M5 — FastAPI routes
- /suites, /runs, /cases, /traces, /diff REST endpoints
M6 — Next.js dashboard
- Suite browser, run trigger, pass/fail results with score breakdown
M7 — Trace viewer + diff UI
- Timeline of tool calls/messages; side-by-side two-run diff with score deltas
M8 — demo packaging
- docker compose up --build runs everything end-to-end
- record_demo.sh triggers two runs with a system-prompt tweak between them, captures screenshots via Playwright (falls back to manual URLs), and saves them to docs/
- .env.example documents the only required secret (ANTHROPIC_API_KEY)
git clone https://github.com/your-org/rubriclab
cd rubriclab
cp .env.example .env # add your ANTHROPIC_API_KEY
docker compose up --build

To run the full demo and compare two runs:

./record_demo.sh

- Dashboard: http://localhost:3000
- API: http://localhost:8000
- API health: http://localhost:8000/health
Prerequisites: Python 3.12+, uv, Node 22+, pnpm
# Install JS dependencies
pnpm install
# Start the API
cd apps/api
uv sync
uv run uvicorn rubriclab.main:app --reload --port 8000
# In another terminal, start the web app
cd apps/web
pnpm dev

M8 shipped end-to-end: all components from AgentRunner through dashboard and diff UI are live.
┌─────────────────────────┐   ┌──────────────────────────┐
│    Next.js Dashboard    │   │  CLI: rubriclab run ...  │
│ (suites, runs, traces)  │   └────────────┬─────────────┘
└────────────┬────────────┘                │
             │ REST/JSON                   │
             ▼                             ▼
┌─────────────────────────────────────────────┐
│               FastAPI backend               │
│   /suites  /runs  /cases  /traces  /diff    │
└──┬───────────────┬────────────────┬─────────┘
   │               │                │
   ▼               ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ AgentRunner  │ │ JudgeEngine  │ │    SQLite    │
│ (calls user  │ │ (Anthropic   │ │  (SQLModel)  │
│  agent +     │ │  rubric      │ │  suites,     │
│  captures    │ │  scoring)    │ │  cases,      │
│  trace)      │ │              │ │  runs,       │
└──────┬───────┘ └──────┬───────┘ │  traces,     │
       │                │         │  scores      │
       ▼                ▼         └──────────────┘
┌──────────────────────────────┐
│    Sample Research Agent     │
│     (Anthropic + tools)      │
└──────────────────────────────┘
Core data model: Suite 1—* Case, Run 1—* CaseResult, CaseResult 1—1 Trace, CaseResult 1—* RubricScore. Diff = join two Runs on Case and compute per-dimension deltas.
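
The diff described above can be sketched as an inner join on case followed by per-dimension subtraction. This is an illustrative sketch only: the dictionary layout, the case ids, and the improved/regressed rule (sign of the summed deltas) are assumptions, not the project's implementation.

```python
from collections import defaultdict

# Each run maps (case_id, dimension) -> score. Hypothetical in-memory shape.
Scores = dict[tuple[str, str], int]


def diff_runs(base: Scores, candidate: Scores) -> dict[str, dict[str, int]]:
    """Join two runs on case and compute per-dimension score deltas."""
    deltas: dict[str, dict[str, int]] = defaultdict(dict)
    for (case_id, dimension), new_score in candidate.items():
        old_score = base.get((case_id, dimension))
        if old_score is not None:  # inner join: skip pairs missing from base
            deltas[case_id][dimension] = new_score - old_score
    return dict(deltas)


def classify(deltas: dict[str, dict[str, int]]) -> dict[str, list[str]]:
    """Bucket cases by the sign of their summed deltas (an assumed rule)."""
    buckets: dict[str, list[str]] = {"improved": [], "regressed": [], "unchanged": []}
    for case_id, dims in sorted(deltas.items()):
        total = sum(dims.values())
        key = "improved" if total > 0 else "regressed" if total < 0 else "unchanged"
        buckets[key].append(case_id)
    return buckets


run_a: Scores = {
    ("c1", "accuracy"): 2, ("c1", "tool_use"): 3,
    ("c2", "accuracy"): 5, ("c2", "tool_use"): 4,
}
run_b: Scores = {
    ("c1", "accuracy"): 4, ("c1", "tool_use"): 3,
    ("c2", "accuracy"): 4, ("c2", "tool_use"): 4,
    ("c3", "accuracy"): 5,  # present only in run_b, so excluded from the diff
}
deltas = diff_runs(run_a, run_b)
summary = classify(deltas)
```

In a SQL-backed version the same join would presumably be expressed as a query over CaseResult and RubricScore filtered to the two run ids.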
Fully runnable end-to-end as of M8. Every step below works with docker compose up --build or local dev.

- docker compose up (or local Python + pnpm dev)
- Open dashboard → see preloaded "Research Agent v1" suite with 8 cases
- Click Run → cases stream in as they complete; pass/fail badges + scores appear live
- Open a failed case → trace viewer shows the agent making a wrong tool call; judge's justification is shown inline
- Edit the agent's system prompt in agents/research/prompt.md, hit Run again
- Open Compare runs → side-by-side diff highlights which 3 cases improved, which 1 regressed, with score deltas per rubric dimension
| Milestone | Description | Status |
|---|---|---|
| M1 | Scaffold + README | ✅ Done |
| M2 | SQLite data model (SQLModel) | ✅ Done |
| M3 | Agent runner + trace capture | ✅ Done |
| M4 | LLM-as-judge engine | ✅ Done |
| M5 | FastAPI routes (/suites, /runs, /cases, /traces, /diff) | ✅ Done |
| M6 | Next.js dashboard (suite browser, run trigger, results) | ✅ Done |
| M7 | Trace viewer + two-run diff UI | ✅ Done |
| M8 | Demo packaging (docker compose, record_demo.sh) | ✅ Done |
| M9 | CLI (rubriclab run --suite=demo) | ⬜ Planned |
Contributions welcome! Please open an issue before submitting a large PR.
- Fork the repo
- Create a feature branch: git checkout -b feat/my-feature
- Commit with Conventional Commits: git commit -m "feat: add thing"
- Open a PR
MIT © 2024 RubricLab contributors. See LICENSE.

