Voice AI has eaten phone calls. The eval tooling has not caught up.
voice-eval-harness is the open-source eval harness for voice agents: Retell-first, Vapi next, every other platform via plugin. Lint your agent config before it crashes Retell's importer. Replay last week's failed prod calls as deterministic regression cases. Stress-test with adversarial caller personas. Verify your knowledge base is actually wired and answerable. Gate CI on pass-rate, latency, and cost.
🚀 Live demo: voice-eval-harness.vercel.app — two live demos:
- Vapi live demo ▶: 5-case live evaluation against a Vapi front-desk assistant (2 PASS / 3 FAIL; the persona simulator catches a premature transfer, the LLM judge catches semantic drift, and a tool-call assertion confirms `get_available_slots` fires)
- Retell prod audit ▶: `voxeval audit` run read-only against 18 real production calls from an ENT scheduling agent's last 7 days (3 PASS / 15 FAIL across off-topic-wander, latency, and AI-disclosure assertions). Clinic names anonymised, no real PHI.
Click any case to inspect transcript + assertion-level results.
If you've built voice agents on Retell, you have already lost time to:
- `is_transfer_cf` missing → "Cannot read properties of undefined" on import
- `parameters.required: []` → HTTP 400 on import
- `tool_id` refs that look right but point at nothing
- KB IDs left empty while the global prompt cheerfully references "KB doc 02"
- `language: "en-US"` silently blocking Spanish callers
- Hand-written test cases sitting in a JSON file that nothing ever runs
- ngrok dev tunnels rotting in production agent JSON
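Two of the failure modes above are mechanical enough to sketch as lint rules. The snippet below is a minimal illustration, not the project's linter: the `lint_agent` helper and the agent layout it reads (`tools[].url`, `knowledge_base_ids`, `global_prompt`) are assumptions made for this example, and the rule IDs are borrowed from the feature table only as labels.

```python
import re

# Hypothetical agent layout: these key names are assumptions for illustration,
# not Retell's documented schema.
NGROK_RE = re.compile(r"https?://[\w.-]*ngrok[\w.-]*", re.IGNORECASE)
KB_REF_RE = re.compile(r"\bKB\b|knowledge base", re.IGNORECASE)


def lint_agent(agent: dict) -> list[tuple[str, str]]:
    """Return (rule_id, message) findings for two illustrative checks."""
    findings = []

    # RTL-016-style check: dev tunnel URLs left in tool definitions.
    for tool in agent.get("tools", []):
        url = tool.get("url", "")
        if NGROK_RE.search(url):
            findings.append(
                ("RTL-016", f"tool {tool.get('name')!r} points at ngrok URL {url}")
            )

    # RTL-017-style check: the prompt mentions a KB but no KB IDs are attached.
    prompt = agent.get("global_prompt", "")
    if KB_REF_RE.search(prompt) and not agent.get("knowledge_base_ids"):
        findings.append(
            ("RTL-017", "prompt references a KB but knowledge_base_ids is empty")
        )

    return findings


agent = {
    "global_prompt": "Answer insurance questions using KB doc 02.",
    "knowledge_base_ids": [],
    "tools": [
        {"name": "get_available_slots", "url": "https://abc123.ngrok-free.app/slots"}
    ],
}
for rule_id, message in lint_agent(agent):
    print(rule_id, message)
```

Returning findings as data rather than raising on the first problem lets a caller decide which rule IDs are fatal and which are warnings.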
Existing closed-source platforms (Hamming, Cekura, Coval, Bluejay) solve some of this — for money, on their cloud. The only OSS option (LiveKit's RunResult.expect.judge()) only works for LiveKit-native agents. This project is the Promptfoo-equivalent for voice, and it runs locally.

    pip install voice-eval-harness
    voxeval --version

    voxeval init --provider retell # scaffold voxeval.yaml + .env.example
    voxeval generate --agent agents/eva.json # 🪄 auto-generate 20+ healthcare test cases from one agent JSON
    voxeval lint agents/eva.json # 17-rule structural linter
    voxeval pin-urls agents/eva.json --lock # probe tool/webhook URLs; fails on ngrok rot
    voxeval run --max-cost 0.50 --junit out.xml # full eval suite with budget cap + CI report
    voxeval diff main.json feature.json # per-case regression diff (exits 1 on regression)
    voxeval kb-coverage --kb 'kb/*.md' # auto-Q&A your KB and verify agent answers
    voxeval replay --since 7d # regression fixtures from last week's failed prod calls
    voxeval audit --since 24h # score yesterday's prod calls against assert_* contracts
    voxeval drift-watch --sample 20 # check cached LLM-judge verdicts for model drift

`voxeval generate` turns one agent JSON into a complete eval suite covering every healthcare scenario you'd otherwise script by hand:

    voxeval generate --agent agents/eva-scheduling.json --out voxeval.yaml

Output: a 20-30 case suite with one happy-path test per tool the agent declares (auto-derived `assert_tool_called` + `assert_tool_shape` from the JSON Schema parameters) plus 19 curated healthcare scenarios:
- New patient happy-path booking, returning-patient flow (no redundant intake)
- Urgent symptom triage — chest pain, stroke symptoms (FAST) → must escalate, not schedule
- Insurance verification — known plan (KB-driven), unknown plan (must hedge, not fabricate)
- Provider preference — caller asks for a doctor; agent must verify they exist
- Reschedule/cancel with tool-call enforcement
- Wrong-number callers — agent MUST NOT collect PHI
- After-hours queries, prescription refills (must defer)
- Transfer-to-human, referral inbound capture
- Spanish language drift detection
- 4 persona stress tests (impatient, accented, code-switching, KB-probing)
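To make the "one happy-path test per tool" idea concrete, here is a rough sketch of how such cases could be derived from a tool's JSON Schema parameters. It is an illustration under assumptions, not the generator's actual code: the agent layout (`tools`, `name`, `parameters`), the case dict shape, and the `happy_path_cases` and `sample_value` helpers are all hypothetical. Only the assertion names `assert_tool_called` and `assert_tool_shape` come from the description above.

```python
def sample_value(schema: dict):
    """Pick a placeholder value that satisfies a simple JSON Schema property."""
    if "enum" in schema:
        return schema["enum"][0]
    kind = schema.get("type", "string")
    if kind == "integer":
        return schema.get("minimum", 1)
    if kind == "number":
        return float(schema.get("minimum", 1))
    if kind == "boolean":
        return True
    return "example"


def happy_path_cases(agent: dict) -> list[dict]:
    """Build one case per declared tool (hypothetical case format)."""
    cases = []
    for tool in agent.get("tools", []):
        params = tool.get("parameters", {})
        props = params.get("properties", {})
        required = params.get("required", [])
        cases.append({
            "name": f"happy-path-{tool['name']}",
            "assert_tool_called": tool["name"],
            # Expected argument shape, derived only from the required properties.
            "assert_tool_shape": {key: props.get(key, {}) for key in required},
            "example_args": {key: sample_value(props.get(key, {})) for key in required},
        })
    return cases


agent = {"tools": [{
    "name": "get_available_slots",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string"},
            "visit_type": {"type": "string", "enum": ["new", "follow_up"]},
            "limit": {"type": "integer", "minimum": 1},
        },
        "required": ["date", "visit_type"],
    },
}]}
cases = happy_path_cases(agent)
print(cases[0]["name"], cases[0]["example_args"])
```

Optional properties such as `limit` are left out of the derived shape here, so a case does not fail just because the agent omitted an argument the schema does not require.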
Real numbers from this repo's examples/healthcare-clinic/:
| Agent | Auto-generated cases |
|---|---|
| eva-scheduling (ENT) | 24 (5 tool-calls + 19 scenarios) |
| linda-scheduling (Endoscopy) | 22 (3 tool-calls + 19 scenarios) |
| iris-en (Ophthalmology) | 20 (1 tool-call + 19 scenarios) |
| iris-en.prod (Cardiology) | 20 (1 tool-call + 19 scenarios) |
| router-agent (multilingual router) | 17 (0 tool-calls + 17 scenarios) |
Total: 103 test cases generated from 5 production agents — 5 commands, 0 hand-written YAML.
| Feature | What it catches |
|---|---|
| Retell JSON linter (RTL-001 – RTL-017) | 15 rules ported from a battle-tested validator + RTL-016 (ngrok URL rot) + RTL-017 (KB empty but referenced in prompts) |
| Persona simulator | 4 adversarial callers (impatient, accented, code-switching, KB-probing) with overridable Jinja prompts |
| KB coverage analyzer | Auto-generates Q&A from your markdown KB, verifies the agent can actually answer (LLM-judge or sentence-transformers backend) |
| Production-call replay + audit | Pulls failed calls from Retell logs, scrubs PHI (regex + optional Presidio), turns them into regression cases. audit scores yesterday's calls against your contracts. |
| LLM-judge with budget guardrail | Semantic intent checks; a `--max-cost` USD ceiling refuses further judge calls once the limit is reached instead of overspending |
| `assert_tool_shape` | Runtime tool-args contract validator (type/enum/min/max/regex) |
| `voxeval diff` | Per-case regression diff between two runs or two YAMLs (exits 1 on regression; usable as a CI gate) |
| `voxeval pin-urls` | HEADs every tool/webhook URL, writes a lock file, fails on rot |
| CI integration | JUnit XML output, report.json, pre-commit hook, GitHub Action template |
| Dashboard | TypeScript / Next 15 / Recharts / Supabase under dashboard/ |
| Connectors | Retell (text + audio), Vapi (full), Mock (deterministic), LiveKit / Pipecat / Bland (stubs) |
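The tool-args contract row in the table lists five kinds of constraint: type, enum, min, max, and regex. A minimal sketch of what checking them could look like follows. The `check_tool_args` function and the contract dict format are assumptions for illustration; the real `assert_tool_shape` syntax may differ.

```python
import re

# Maps contract type names to Python types; the names are an assumed convention.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def check_tool_args(args: dict, contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the args pass."""
    errors = []
    for field, rule in contract.items():
        if field not in args:
            errors.append(f"{field}: missing")
            continue
        value = args[field]
        expected = TYPE_MAP.get(rule.get("type"))
        if expected is not None:
            # bool is a subclass of int in Python, so reject it for numeric types.
            wrong_bool = isinstance(value, bool) and rule["type"] != "boolean"
            if wrong_bool or not isinstance(value, expected):
                errors.append(
                    f"{field}: expected {rule['type']}, got {type(value).__name__}"
                )
                continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{field}: {value!r} not in {rule['enum']}")
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            if "min" in rule and value < rule["min"]:
                errors.append(f"{field}: {value} is below min {rule['min']}")
            if "max" in rule and value > rule["max"]:
                errors.append(f"{field}: {value} is above max {rule['max']}")
        if "regex" in rule and isinstance(value, str):
            if not re.fullmatch(rule["regex"], value):
                errors.append(f"{field}: {value!r} does not match {rule['regex']}")
    return errors


contract = {
    "date": {"type": "string", "regex": r"\d{4}-\d{2}-\d{2}"},
    "visit_type": {"type": "string", "enum": ["new", "follow_up"]},
    "party_size": {"type": "integer", "min": 1, "max": 4},
}
good = {"date": "2025-03-01", "visit_type": "new", "party_size": 2}
bad = {"date": "tomorrow", "visit_type": "walk_in", "party_size": 9}
print(check_tool_args(good, contract))
print(check_tool_args(bad, contract))
```

Collecting every violation instead of stopping at the first one gives a per-field report, which is more useful in an assertion-level result view.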
The dashboard lives at dashboard/ and is built on Next.js 15 + React 19 + Recharts + Supabase. Drag and drop a report.json produced by `voxeval run --json out.json` and it lights up: pass-rate sparkline per suite, per-case grid with transcript drawer, ingest API (POST /api/runs with Bearer auth). One-time setup: paste dashboard/supabase/schema.sql into your Supabase SQL editor, copy three env vars into dashboard/.env.local, and run `pnpm dev`. About 10 minutes from clone to live.
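For pushing a report from a script instead of dragging it in, the ingest endpoint mentioned above can be called with a plain HTTP POST. The sketch below only builds the request. The base URL, the token value, and the placeholder payload are assumptions; the real report.json schema and token configuration are not shown in this README.

```python
import json
import urllib.request


def build_ingest_request(report: dict, base_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST /api/runs request carrying a report payload."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/runs",
        data=json.dumps(report).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder payload and local dev URL, both assumed for illustration.
request = build_ingest_request(
    {"suite": "demo", "cases": []}, "http://localhost:3000", "dev-token"
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=10)
```

In practice the payload would be the parsed contents of the report.json file, and the token would come from an environment variable rather than a literal.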
- Insurance-acceptance fact matrix (healthcare-vertical assertion)
- Multi-judge cross-model evaluation
- Carbon-cost report
- Drift-watch with prompt replay (v1.0 ships verdict-distribution baseline only)
- LiveKit, Pipecat, Bland connectors with their native test framework integrations
- Diff view in the dashboard UI
Results from pointing the linter at the 8 most recent production Retell agents in a healthcare voice-AI shop (15+ live agents across ENT, ophthalmology, cardiology):
| Agent | RTL fatals | RTL warnings | Notes |
|---|---|---|---|
| linda-scheduling.json | 0 | 0 | known-good baseline ✅ |
| eva-scheduling.local.json | 0 | 6 | 6× ngrok dev URLs baked in (would rot in prod) |
| iris-en.json | 2 | 0 | missing is_transfer_cf + KB empty but referenced |
| iris-es.json | 2 | 0 | same |
| iris-zh.json | 2 | 0 | same |
| router-agent.json | 4 | 0 | missing response_engine + CF required keys |
| stockton iris-en.prod | 1 | 0 | KB empty but prompt references KB |
| stockton iris-en.dev | 1 | 0 | same |
That's 12 fatal bugs across 6 of 8 agents that the linter would have caught before any Retell import attempt. The exact "Cannot read properties of undefined" crash from the team's bug history shows up in three Iris agents right now.
- Text-mode requires a Retell agent registered with `channel=chat`. Voice agents return HTTP 422 ("Cannot start a chat session with selected agent") against `/create-chat`. In practice you either (a) create a parallel chat-channel agent in the Retell dashboard with the same prompt + tools for testing, or (b) wait for v0.2 audio-mode, which calls the real PSTN number with cost guardrails.
- LLM judge and KB generator are model-cost-bearing. Default cache keeps repeat runs near-free; first run on a new suite + KB costs a few cents at gpt-4o-mini rates.
- PHI scrubbing in `voxeval replay` is regex-only by default. The optional `[phi]` extra adds Microsoft Presidio for stronger NER-based redaction. Always inspect `replay_cases/*.yaml` before committing them.
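To show why regex-only scrubbing deserves a manual review, here is a small sketch of the general approach. The patterns and the `scrub` helper are illustrative assumptions, not the rules `voxeval replay` ships with.

```python
import re

# Illustrative patterns only; a real scrubber needs far broader coverage.
PHI_PATTERNS = [
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
    ("[DATE]", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
]


def scrub(text: str) -> str:
    """Replace each pattern match with its placeholder, in list order."""
    for placeholder, pattern in PHI_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


line = "My number is 555-867-5309 and my birthday is 4/12/1961, Jane Doe."
print(scrub(line))
```

The phone number and date are replaced, but the caller's name passes through untouched because no pattern can describe arbitrary names. That is the gap NER-based redaction is meant to close.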
Apache 2.0 — see LICENSE.