|
1 | 1 | <p align="center"> |
2 | | - <img src="docs/assets/logo-color.png" alt="agentevals" width="420" /> |
| 2 | + <picture> |
| 3 | + <source media="(prefers-color-scheme: dark)" srcset="docs/assets/logo-color-on-transparent.svg"> |
| 4 | + <source media="(prefers-color-scheme: light)" srcset="docs/assets/logo-dark-on-transparent.svg"> |
| 5 | + <img src="docs/assets/logo-color-on-transparent.svg" alt="agentevals" width="420" /> |
| 6 | + </picture> |
3 | 7 | </p> |
4 | 8 |
|
5 | | -`agentevals` evaluates AI agent behavior from OpenTelemetry traces, without re-running the agent. Record once, score as many times as you want. |
| 9 | +<h1 align="center">Ship Agents Reliably</h1> |
6 | 10 |
|
7 | | -Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others). Supports Jaeger JSON and OTLP trace formats, built-in and custom evaluators, and LLM-based judges. |
| 11 | +<p align="center"> |
| 12 | +Benchmark your agents before they hit production.<br> |
| 13 | +agentevals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork. |
| 14 | +</p> |
| 15 | + |
| 16 | +<p align="center"> |
| 17 | + <a href="https://github.com/agentevals-dev/agentevals/stargazers"><img src="https://img.shields.io/github/stars/agentevals-dev/agentevals?style=social" alt="GitHub Stars"></a> |
| 18 | + |
| 19 | + <a href="https://discord.gg/cpveEn8Ah2"><img src="https://img.shields.io/discord/1435836734666707190?label=Discord&logo=discord&logoColor=white&color=5865F2" alt="Discord"></a> |
| 20 | + |
| 21 | + <a href="https://github.com/agentevals-dev/agentevals/releases"><img src="https://img.shields.io/github/v/release/agentevals-dev/agentevals?label=Release" alt="Release"></a> |
| 22 | + |
| 23 | + <a href="https://github.com/agentevals-dev/agentevals/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License"></a> |
| 24 | + |
| 25 | + <a href="https://pypi.org/project/agentevals-cli/"><img src="https://img.shields.io/pypi/v/agentevals-cli?label=PyPI&color=blue" alt="PyPI"></a> |
| 26 | +</p> |
| 27 | + |
| 28 | +<p align="center"> |
| 29 | + <a href="#installation">Install</a> · <a href="#quick-start">Quick Start</a> · <a href="https://github.com/agentevals-dev/agentevals/releases">Releases</a> · <a href="CONTRIBUTING.md">Contributing</a> · <a href="https://discord.gg/cpveEn8Ah2">Discord</a> |
| 30 | +</p> |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## What is agentevals? |
| 35 | + |
| 36 | +agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork. |
| 37 | + |
| 38 | +It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges. |
8 | 39 |
|
9 | 40 | - **CLI** for scripting and CI pipelines |
10 | 41 | - **Web UI** for visual inspection and local developer experience |
11 | 42 | - **MCP server** so MCP clients can run evaluations from a conversation |
12 | 43 |
|
| 44 | +## Why agentevals? |
| 45 | + |
| 46 | +Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. agentevals takes a different approach: |
| 47 | + |
| 48 | +- **No re-execution** — score agents from existing traces without replaying expensive LLM calls |
| 49 | +- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans |
| 50 | +- **Golden eval sets** — compare actual behavior against defined expected behaviors for deterministic pass/fail gating |
| 51 | +- **Custom evaluators** — write scoring logic in Python, JavaScript, or any language |
| 52 | +- **CI/CD ready** — gate deployments on quality thresholds directly in your pipeline |
| 53 | +- **Local-first** — no cloud dependency required; everything runs on your machine |
| 54 | + |
| 55 | +## How It Works |
| 56 | + |
| 57 | +agentevals follows three simple steps: |
| 58 | + |
| 59 | +1. **Collect traces** — Instrument your agent with OpenTelemetry (or export traces from your tracing backend). Point the OTLP exporter at the agentevals receiver, or load trace files directly. |
| 60 | +2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like. |
| 61 | +3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns. |
| 62 | + |
| 63 | + |
13 | 64 | > [!IMPORTANT] |
14 | 65 | > This project is under active development. Expect breaking changes. |
15 | 66 |
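The "How It Works" steps added in this diff (collect traces, define eval sets, run evaluations) can be illustrated with a small standalone sketch. This is not the agentevals API: `EvalCase`, `tool_calls_from_spans`, `score_trajectory`, the `tool.name` attribute key, and the span dictionary layout are all hypothetical names chosen for illustration. It only shows the general idea of scoring an already-recorded trace against a golden tool-call sequence without re-running the agent.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """Hypothetical golden eval case: tools the agent should call, in order."""
    name: str
    expected_tools: list[str]


def tool_calls_from_spans(spans: list[dict]) -> list[str]:
    """Return tool names from recorded spans, ordered by start time.

    Assumes (for this sketch only) that tool spans carry a 'tool.name'
    attribute; spans without it, such as LLM calls, are ignored.
    """
    tool_spans = [s for s in spans if "tool.name" in s.get("attributes", {})]
    tool_spans.sort(key=lambda s: s["start_time"])
    return [s["attributes"]["tool.name"] for s in tool_spans]


def score_trajectory(case: EvalCase, spans: list[dict]) -> dict:
    """Score a recorded trace against a golden tool-call sequence."""
    actual = tool_calls_from_spans(spans)
    # Count positions where the expected and actual tool names agree.
    matched = sum(1 for e, a in zip(case.expected_tools, actual) if e == a)
    # Dividing by the longer sequence penalizes missing and extra calls.
    denom = max(len(case.expected_tools), len(actual), 1)
    score = matched / denom
    return {
        "case": case.name,
        "score": score,
        "passed": score == 1.0,
        "actual_tools": actual,
    }


# Step 1: a recorded trace (made-up data, deliberately out of order).
SPANS = [
    {"start_time": 3, "attributes": {"tool.name": "book_flight"}},
    {"start_time": 2, "attributes": {"llm.model": "example-model"}},
    {"start_time": 1, "attributes": {"tool.name": "search_flights"}},
]

# Step 2: a golden eval case describing the expected behavior.
CASE = EvalCase(name="book-a-flight", expected_tools=["search_flights", "book_flight"])

# Step 3: score the recorded trace; it can be re-scored any number of times.
if __name__ == "__main__":
    print(score_trajectory(CASE, SPANS))
```

Because the trace is plain recorded data, the same `SPANS` can be scored again against a different `EvalCase` with no further LLM calls, which is the property the README text emphasizes.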
|
|