Commit ae20e63

Add What is, Why, and How It Works sections to README
Adds three overview sections based on aevals.ai site content to help new visitors quickly understand the project's purpose and workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 0d34fab commit ae20e63

1 file changed

Lines changed: 39 additions & 1 deletion

README.md

@@ -31,17 +31,55 @@ AgentEvals scores performance and inference quality from OpenTelemetry traces
 
 ---
 
-Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others). Supports Jaeger JSON and OTLP trace formats, built-in and custom evaluators, and LLM-based judges.
+## What is AgentEvals?
+
+AgentEvals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork.
+
+It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
 
 - **CLI** for scripting and CI pipelines
 - **Web UI** for visual inspection and local developer experience
 - **MCP server** so MCP clients can run evaluations from a conversation
 
+## Why AgentEvals?
+
+Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. AgentEvals takes a different approach:
+
+- **No re-execution** — score agents from existing traces without replaying expensive LLM calls
+- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans
+- **Golden eval sets** — compare actual behavior against defined expected behaviors for deterministic pass/fail gating
+- **Custom evaluators** — write scoring logic in Python, JavaScript, or any language
+- **CI/CD ready** — gate deployments on quality thresholds directly in your pipeline
+- **Local-first** — no cloud dependency required; everything runs on your machine
+
+## How It Works
+
+AgentEvals follows three simple steps:
+
+1. **Collect traces** — Instrument your agent with OpenTelemetry (or export Jaeger JSON). Point the OTLP exporter at the AgentEvals receiver, or load trace files directly.
+2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like.
+3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
+
+```
+┌─────────────────┐     ┌───────────────┐     ┌──────────────────┐
+│   Your Agent    │────▶│  OTel Traces  │────▶│    AgentEvals    │
+│ (any framework) │     │ (OTLP/Jaeger) │     │  CLI · UI · MCP  │
+└─────────────────┘     └───────────────┘     └──────────────────┘
+
+                                               ┌───────┴────────┐
+                                               │   Eval Sets    │
+                                               │ (golden refs)  │
+                                               └────────────────┘
+```
+
 > [!IMPORTANT]
 > This project is under active development. Expect breaking changes.
 
 ## Contents
 
+- [What is AgentEvals?](#what-is-agentevals)
+- [Why AgentEvals?](#why-agentevals)
+- [How It Works](#how-it-works)
 - [Installation](#installation)
 - [Quick Start](#quick-start)
 - [Integration](#integration)
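
The three steps described under "How It Works" in the diff above can be illustrated with a minimal Python sketch. It does not use the AgentEvals API or its eval-set format, neither of which appears in this commit. The trace contents, the `EVAL_SET` layout, the `gen_ai.tool.name` tag key, the greedy in-order scoring rule, and every function name are assumptions made for illustration only. The sketch reads a small trace in the Jaeger JSON export shape, extracts the tool calls in start-time order, and scores them against a golden list with a pass/fail threshold.

```python
import json

# Hypothetical trace in the Jaeger JSON export shape; span contents are invented.
JAEGER_JSON = """
{"data": [{"traceID": "abc123", "spans": [
  {"operationName": "execute_tool", "startTime": 200,
   "tags": [{"key": "gen_ai.tool.name", "value": "fetch_page"}]},
  {"operationName": "execute_tool", "startTime": 100,
   "tags": [{"key": "gen_ai.tool.name", "value": "search"}]},
  {"operationName": "chat", "startTime": 50, "tags": []}
]}]}
"""

# Hypothetical golden eval set: expected tool order plus a pass threshold.
EVAL_SET = {
    "expected_tools": ["search", "fetch_page", "summarize"],
    "threshold": 0.6,
}


def tool_calls(trace: dict, tag_key: str = "gen_ai.tool.name") -> list:
    """Return tool names found in the trace, ordered by span start time."""
    calls = []
    for span in sorted(trace["spans"], key=lambda s: s["startTime"]):
        for tag in span.get("tags", []):
            if tag["key"] == tag_key:
                calls.append(tag["value"])
    return calls


def in_order_score(actual: list, expected: list) -> float:
    """Greedy in-order match: fraction of expected tools found, in sequence."""
    if not expected:
        return 1.0
    matched, pos = 0, 0
    for tool in expected:
        try:
            # Search only after the previous match so order is respected.
            pos = actual.index(tool, pos) + 1
            matched += 1
        except ValueError:
            continue  # expected tool missing or out of order
    return matched / len(expected)


def evaluate(trace: dict, eval_set: dict) -> dict:
    """Score one recorded trace against one golden eval set."""
    actual = tool_calls(trace)
    score = in_order_score(actual, eval_set["expected_tools"])
    return {
        "tools": actual,
        "score": score,
        "passed": score >= eval_set["threshold"],
    }


result = evaluate(json.loads(JAEGER_JSON)["data"][0], EVAL_SET)
print(result)
```

Because the trace is recorded data, `evaluate` can be called again with a different eval set or threshold without invoking the agent or any LLM, which is the property the README's "No re-execution" bullet describes.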
