Open (opened May 1, 2026)
Due by May 7, 2026
Agentic eval framework that validates GAIA agent quality using Claude Code as user simulator + judge. The foundation for the eval → fine-tuning quality flywheel.
What Ships
Automated eval harness, ground truth test suite, quality metrics dashboard.
Use Cases Enabled
- Validated agent reliability — every release tested against real workflows before shipping
- Quality regression detection — catch tool-calling failures, hallucinations, and format errors before they ship (a minimal gating sketch follows this list)
- Training data generation — eval results feed directly into the v0.19.0 fine-tuning pipeline
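To make the regression-detection use case concrete, here is a minimal sketch of a release gate, assuming a hypothetical per-capability JSON report format (files under `reports/` with `capability` and `pass_rate` fields). None of these names come from the GAIA harness itself.

```python
# Hypothetical regression gate: compare a candidate run's per-capability
# pass rates against a baseline and fail the release on any drop.
# The report schema and file paths are illustrative assumptions.
import json
from pathlib import Path


def load_pass_rates(report_path: Path) -> dict[str, float]:
    """Read {capability: pass_rate} from an eval report JSON file."""
    report = json.loads(report_path.read_text())
    return {c["capability"]: c["pass_rate"] for c in report["capabilities"]}


def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """Flag capabilities whose pass rate dropped by more than `tolerance`."""
    return [
        f"{cap}: {base:.0%} -> {candidate.get(cap, 0.0):.0%}"
        for cap, base in baseline.items()
        if base - candidate.get(cap, 0.0) > tolerance
    ]


if __name__ == "__main__":
    regressions = find_regressions(
        load_pass_rates(Path("reports/baseline.json")),
        load_pass_rates(Path("reports/candidate.json")),
    )
    if regressions:
        raise SystemExit("Quality regression detected:\n" + "\n".join(regressions))
    print("No regressions; release gate passed.")
```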
Value Proposition
"Agents you can trust — every release is tested against real workflows. When something breaks, it's caught before it reaches you."
The Quality Flywheel
Eval runs → identifies failures → failures become training data (v0.19.0)
→ GRPO fine-tuning improves model → re-eval confirms improvement → repeat
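The loop above maps almost directly onto code. A minimal sketch, assuming hypothetical hooks for the eval harness and the v0.19.0 fine-tuning pipeline — `run_eval`, `to_training_examples`, and `grpo_finetune` are stand-ins injected as callables, not real GAIA APIs:

```python
# A minimal sketch of the quality flywheel. All hooks are hypothetical
# stand-ins injected as callables; EvalResult is an assumed record shape.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EvalResult:
    case_id: str
    passed: bool


def quality_flywheel(model: Any,
                     test_suite: list,
                     run_eval: Callable[[Any, list], list[EvalResult]],
                     to_training_examples: Callable[[list[EvalResult]], list],
                     grpo_finetune: Callable[[Any, list], Any],
                     target_pass_rate: float = 0.95,
                     max_rounds: int = 5) -> Any:
    """Eval run -> failures -> training data -> GRPO -> re-eval -> repeat."""
    for _ in range(max_rounds):
        results = run_eval(model, test_suite)            # eval run
        failures = [r for r in results if not r.passed]  # identify failures
        if 1 - len(failures) / len(results) >= target_pass_rate:
            break                                        # quality bar met
        examples = to_training_examples(failures)        # failures -> data
        model = grpo_finetune(model, examples)           # GRPO improves model
    return model                                         # next round re-evals
```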
Key Deliverables
- Claude Code-based eval harness (user simulator + judge; sketched after this list)
- Ground truth test suite for core agent capabilities
- Pass/fail metrics with tool trace analysis
- See docs/plans/agent-ui-eval-benchmark.md for the full spec
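For orientation before reading the full spec, here is a sketch of the harness shape: one Claude Code session plays the user, another judges the transcript and tool trace. The `ask_claude` wrapper, the verdict JSON schema, and the case fields are assumptions for illustration, not the API defined in agent-ui-eval-benchmark.md.

```python
# Sketch of one eval case: simulator turn -> agent turn -> judge verdict.
# `ask_claude` abstracts however Claude Code is invoked; `agent` is the
# agent under test, returning (reply, tool_trace). All names are assumed.
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    case_id: str
    passed: bool
    failure_kind: str | None  # e.g. "tool_calling", "hallucination", "format"
    tool_trace: list[dict]    # tool calls the agent made, for trace analysis


def run_case(case: dict, agent: Callable,
             ask_claude: Callable[[str], str]) -> Verdict:
    # 1. User simulator: turn the ground-truth task into an opening user message.
    user_turn = ask_claude(f"Act as a user. Start this task: {case['task']}")
    # 2. Agent under test responds; capture its tool calls alongside the reply.
    reply, tool_trace = agent(user_turn)
    # 3. Judge: grade the transcript and tool trace against the expected outcome.
    verdict = json.loads(ask_claude(
        "Judge this agent transcript against the expected outcome. Reply with "
        'JSON only: {"passed": bool, "failure_kind": str or null}.\n'
        f"Expected: {case['expected']}\nAgent reply: {reply}\nTools: {tool_trace}"
    ))
    return Verdict(case["case_id"], verdict["passed"],
                   verdict["failure_kind"], tool_trace)
```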
50% complete
Tracked Issues
- #573 in amd/gaia: Open
- #671 in amd/gaia: Open
- #672 in amd/gaia: Open
- #673 in amd/gaia: Open
- #724 in amd/gaia: Open
- #854 in amd/gaia: Open (in progress)
- #541 in amd/gaia: Open (in progress)
- #641 in amd/gaia: Draft (not ready)
- #868 in amd/gaia: Open
- #937 in amd/gaia: Open