Add What is, Why, and How It Works sections to README
Adds three overview sections based on aevals.ai site content to help
new visitors quickly understand the project's purpose and workflow.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
README.md: 39 additions & 1 deletion
@@ -31,17 +31,55 @@ AgentEvals scores performance and inference quality from OpenTelemetry traces
 
 ---
 
-Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others). Supports Jaeger JSON and OTLP trace formats, built-in and custom evaluators, and LLM-based judges.
+## What is AgentEvals?
+
+AgentEvals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork.
+
+It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
 
 - **CLI** for scripting and CI pipelines
 - **Web UI** for visual inspection and local developer experience
 - **MCP server** so MCP clients can run evaluations from a conversation
 
+## Why AgentEvals?
+
+Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. AgentEvals takes a different approach:
+
+- **No re-execution** — score agents from existing traces without replaying expensive LLM calls
+- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans
+- **Golden eval sets** — compare actual behavior against defined expected behaviors for deterministic pass/fail gating
+- **Custom evaluators** — write scoring logic in Python, JavaScript, or any language
+- **CI/CD ready** — gate deployments on quality thresholds directly in your pipeline
+- **Local-first** — no cloud dependency required; everything runs on your machine
+
+## How It Works
+
+AgentEvals follows three simple steps:
+
+1. **Collect traces** — Instrument your agent with OpenTelemetry (or export Jaeger JSON). Point the OTLP exporter at the AgentEvals receiver, or load trace files directly.
+2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like.
+3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
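
The three steps above can be pictured with a small, standalone sketch. It is not the AgentEvals API: the function names, the eval-set shape (`expected_tools`), the positional scoring rule, and the use of a `gen_ai.tool.name` span tag to identify tool calls are all assumptions made for illustration. It only shows the general idea of scoring an already-recorded Jaeger-style JSON trace against a golden tool sequence without re-running the agent.

```python
import json

# Hypothetical Jaeger-style trace export, recorded once by an instrumented agent.
# Assumption: tool-call spans carry a "gen_ai.tool.name" tag.
JAEGER_JSON = """
{"data": [{"traceID": "abc123", "spans": [
  {"operationName": "execute_tool", "startTime": 200,
   "tags": [{"key": "gen_ai.tool.name", "type": "string", "value": "book_flight"}]},
  {"operationName": "chat", "startTime": 50, "tags": []},
  {"operationName": "execute_tool", "startTime": 100,
   "tags": [{"key": "gen_ai.tool.name", "type": "string", "value": "search_flights"}]}
]}]}
"""

TOOL_TAG = "gen_ai.tool.name"  # assumed tag key, not confirmed by the README


def tool_calls(trace: dict) -> list[str]:
    """Return tool names in call order (spans sorted by start time)."""
    calls = []
    for span in sorted(trace["spans"], key=lambda s: s["startTime"]):
        for tag in span.get("tags", []):
            if tag["key"] == TOOL_TAG:
                calls.append(tag["value"])
    return calls


def trajectory_score(actual: list[str], expected: list[str]) -> float:
    """Fraction of positions where actual and expected agree,
    relative to the longer of the two sequences."""
    if not expected:
        return 1.0 if not actual else 0.0
    matched = sum(a == e for a, e in zip(actual, expected))
    return matched / max(len(actual), len(expected))


def evaluate(trace: dict, eval_case: dict, threshold: float = 1.0) -> dict:
    """Score one recorded trace against one golden eval case."""
    actual = tool_calls(trace)
    score = trajectory_score(actual, eval_case["expected_tools"])
    return {
        "trace_id": trace["traceID"],
        "actual_tools": actual,
        "score": score,
        "passed": score >= threshold,
    }


# Step 1: load a recorded trace. Step 2: define the golden eval case.
trace = json.loads(JAEGER_JSON)["data"][0]
golden = {"expected_tools": ["search_flights", "book_flight"]}

# Step 3: score the trace; this can be repeated without re-running the agent.
result = evaluate(trace, golden)
print(result)
```

In a CI pipeline, a wrapper script along these lines could exit with a non-zero status whenever `passed` is false, which is one way to gate a deployment on a quality threshold.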