Commit fcbf65a

Merge pull request #55 from agentevals-dev/docs/refresh-readme
Update README and docs
2 parents ff6cd3a + 39561b6 commit fcbf65a

3 files changed

Lines changed: 57 additions & 28 deletions

CONTRIBUTING.md

Lines changed: 2 additions & 1 deletion
@@ -17,7 +17,7 @@ Thank you for your interest in contributing to agentevals! This document covers

 - Python 3.11+
 - [uv](https://docs.astral.sh/uv/) (Python package manager)
-- Node.js 18+ and npm (for the UI)
+- Node.js 20+ and npm (for the UI)
 - Optionally, [Nix](https://nixos.org/) — the project includes a `flake.nix` devshell

 ### Getting Started
@@ -143,6 +143,7 @@ test: add coverage for OTLP trace parsing
 src/agentevals/ # Python backend (FastAPI, CLI, evaluation engine)
 ui/src/ # React frontend (Vite, Ant Design, TypeScript)
 tests/ # Python tests (pytest)
+examples/ # Agent examples (zero-code, SDK, custom evaluators)
 samples/ # Example traces and eval sets
 docs/ # Documentation
 ```

DEVELOPMENT.md

Lines changed: 15 additions & 15 deletions
@@ -2,14 +2,14 @@

 ## Distribution tiers

-agentevals ships as three distinct configurations from a single codebase:
+agentevals ships as two distribution variants from a single codebase:

-| Tier | Install | Serve behavior |
-|------|---------|----------------|
-| **Core** | `pip install agentevals` | REST API only — stateless batch evaluation endpoints |
-| **Bundle** | `pip install agentevals` (bundled wheel) | REST API + WebSocket streaming + session management + embedded React UI |
+| Tier | Install | What you get |
+|------|---------|--------------|
+| **Core** | `pip install agentevals` | CLI + REST API + live mode (WebSocket streaming, sessions, SSE) |
+| **Bundle** | `pip install agentevals` (bundled wheel) | Everything in Core + embedded React UI |

-Live mode (WebSocket streaming, session management, SSE) is enabled automatically when `--dev` is passed or when the bundled UI is detected — no extra dependencies required.
+Live mode (WebSocket streaming, session management, SSE) is always enabled when running `agentevals serve`. The `--dev` flag adds hot reload and dev-friendly console output but does not change what features are active.

 The optional `[live]` extra (`pip install "agentevals[live]"`) adds `mcp` and `httpx`, which are only needed for the MCP server (`agentevals mcp`). The bundled wheel is built with `make build-bundle` and includes compiled UI assets baked into the package.

@@ -43,7 +43,7 @@ Both `build` and `build-bundle` produce `dist/agentevals-*.whl` with the same pa
 make test # run all tests (unit + integration, excludes e2e)
 make test-unit # unit tests only (fast, no server startup)
 make test-integration # integration tests — OTLP pipeline, session grouping, timing (no API keys)
-make test-e2e # E2E tests — real agents as subprocesses (requires OPENAI_API_KEY)
+make test-e2e # E2E tests — real agents as subprocesses (requires OPENAI_API_KEY and/or GOOGLE_API_KEY)
 ```

 ### Cleanup
@@ -62,15 +62,15 @@ Tests are organized into three tiers with different trade-offs:
 |------|----------|-----------|----------|------------------|
 | **Unit** | `tests/` (excl. integration) | `TestClient` / mocks | None | Business logic, route handlers, converters |
 | **Integration** | `tests/integration/` | ASGI in-process | None | OTLP session grouping, timing, concurrent batches, eval pipeline |
-| **E2E** | `tests/integration/test_live_agents.py` | Real uvicorn servers | `OPENAI_API_KEY` | Full pipeline — real agent → OTLP export → session creation → invocation extraction → API visibility |
+| **E2E** | `tests/integration/test_live_agents.py` | Real uvicorn servers | `OPENAI_API_KEY`, `GOOGLE_API_KEY` | Full pipeline — real agent → OTLP export → session creation → invocation extraction → API visibility |

 Integration tests use `httpx.ASGITransport` to hit the OTLP and streaming API routes in-process (no ports, no real HTTP). Timers are configured fast (0.1s grace, 0.5s idle) for quick deterministic tests.
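
To make "in-process" concrete, the sketch below calls a tiny ASGI app directly with `asyncio`, which is roughly the job `httpx.ASGITransport` performs for the integration tests. It is a stand-in and not code from the repository: `app`, `call_asgi`, and the `/v1/traces` path are hypothetical names chosen for illustration, and only the standard library is used so the snippet runs on its own.

```python
import asyncio
import json


async def app(scope, receive, send):
    """Hypothetical ASGI app: reports the request path and body size as JSON."""
    assert scope["type"] == "http"
    body = b""
    while True:
        message = await receive()
        body += message.get("body", b"")
        if not message.get("more_body", False):
            break
    payload = json.dumps({"path": scope["path"], "bytes": len(body)}).encode()
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"application/json")],
    })
    await send({"type": "http.response.body", "body": payload})


async def call_asgi(asgi_app, method, path, body=b""):
    """Invoke an ASGI app in-process: no port is bound and no HTTP is sent."""
    scope = {
        "type": "http",
        "asgi": {"version": "3.0"},
        "http_version": "1.1",
        "method": method,
        "path": path,
        "raw_path": path.encode(),
        "query_string": b"",
        "headers": [],
    }
    sent = []
    request_delivered = False

    async def receive():
        # Deliver the request body once, then report a disconnect.
        nonlocal request_delivered
        if not request_delivered:
            request_delivered = True
            return {"type": "http.request", "body": body, "more_body": False}
        return {"type": "http.disconnect"}

    async def send(message):
        sent.append(message)

    await asgi_app(scope, receive, send)
    status = next(m["status"] for m in sent if m["type"] == "http.response.start")
    data = b"".join(m.get("body", b"") for m in sent if m["type"] == "http.response.body")
    return status, json.loads(data)


status, result = asyncio.run(call_asgi(app, "POST", "/v1/traces", b"\x00" * 16))
```

Because the app is awaited directly, the test never waits on sockets, which is what makes short timer settings practical.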

 E2E tests start real uvicorn servers on ephemeral ports in a background thread, then run example agent scripts as subprocesses that emit real OTLP traces with `BatchSpanProcessor`/`BatchLogRecordProcessor` flush timing.
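
The ephemeral-port pattern can be sketched with the standard library alone. This is an illustration under assumptions, not the project's fixture: `http.server` stands in for uvicorn, and `HealthHandler` and `start_server` are hypothetical names. Binding to port 0 asks the OS for any free port, which the test then reads back from the listening socket.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler standing in for the real server under test."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_server():
    """Bind to port 0 so the OS picks a free port, then serve in a daemon thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


server, thread = start_server()
port = server.server_address[1]  # the port the OS actually assigned
try:
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", "/")
    resp = conn.getresponse()
    status, text = resp.status, resp.read().decode()
    conn.close()
finally:
    server.shutdown()      # stop serve_forever
    server.server_close()  # release the listening socket
    thread.join(timeout=5)
```

In a real E2E run, the discovered port would be passed to the agent subprocess, for example through its OTLP endpoint setting.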

 ### Running E2E tests

-E2E tests require `OPENAI_API_KEY` (used by LangChain and Strands agents). They are skipped automatically when the key is not set.
+E2E tests require `OPENAI_API_KEY` (LangChain and Strands agents) and/or `GOOGLE_API_KEY` (ADK agents). Each test class is skipped automatically when its required key is not set.
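
Per-class skipping on a missing key might look like the sketch below. It uses `unittest` so it runs without extra dependencies; in a pytest suite such as this project's, `pytest.mark.skipif` fills the same role. The helper names (`skip_reason`, `requires_key`) and the test class names are hypothetical and are not taken from `test_live_agents.py`.

```python
import os
import unittest
from typing import Mapping, Optional


def skip_reason(required_key: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a skip message when the key is missing or empty, else None."""
    if env.get(required_key):
        return None
    return f"{required_key} is not set"


def requires_key(required_key: str):
    """Class decorator: skip the whole test class unless the key is present."""
    reason = skip_reason(required_key)
    return unittest.skipIf(reason is not None, reason or "")


@requires_key("OPENAI_API_KEY")
class TestOpenAIBackedAgents(unittest.TestCase):  # hypothetical class name
    def test_key_present(self):
        self.assertTrue(os.environ["OPENAI_API_KEY"])


@requires_key("GOOGLE_API_KEY")
class TestGoogleBackedAgents(unittest.TestCase):  # hypothetical class name
    def test_key_present(self):
        self.assertTrue(os.environ["GOOGLE_API_KEY"])
```

With only one key exported, one class runs and the other is reported as skipped, which matches the "and/or" wording in the changed line.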

 ```bash
 # Source your .env and run
@@ -81,7 +81,7 @@ set -a && source .env && set +a && make test-e2e

 When adding a new example agent to `examples/`, add corresponding E2E tests to ensure the full OTLP pipeline works:

-1. Add a test class in `tests/integration/test_live_agents.py` following the existing pattern (`TestLangchainZeroCode`, `TestStrandsZeroCode`)
+1. Add a test class in `tests/integration/test_live_agents.py` following the existing pattern (`TestLangchainZeroCode`, `TestStrandsZeroCode`, `TestAdkZeroCode`)
 2. Each agent should have at minimum three tests:
    - **Session creation** — agent runs successfully, session is created with spans (and logs if applicable)
    - **Invocation extraction** — invocations are extracted with user/agent content
@@ -93,13 +93,13 @@ When adding a new example agent to `examples/`, add corresponding E2E tests to e

 ## Runtime behavior

-The serve command auto-detects the active mode:
+The serve command always enables live mode (WebSocket, streaming, sessions). The flags control UI serving and reload behavior:

-- `agentevals serve`: REST-only if no bundled UI present; full experience if bundled wheel
-- `agentevals serve --dev`: always enables live mode (WebSocket + streaming + sessions)
-- `agentevals serve --headless`: disables UI serving even in bundled builds (API-only)
+- `agentevals serve`: live mode + REST API; UI served if bundled `_static/` is present
+- `agentevals serve --dev`: same as above + hot reload on source changes + dev console output
+- `agentevals serve --headless`: live mode + REST API, UI suppressed even if bundled

-Controlled by environment variables `AGENTEVALS_LIVE=1` and `AGENTEVALS_HEADLESS=1`, which the CLI sets automatically based on flags and detected `_static/` presence.
+Controlled by environment variables `AGENTEVALS_LIVE=1` (always set by the CLI) and `AGENTEVALS_HEADLESS=1` (set when `--headless` is passed).
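
The new flag-to-environment mapping can be summarized in a few lines. The sketch below is an assumption-laden illustration, not the project's CLI code: it uses `argparse`, and `build_parser` and `serve_env` are hypothetical helpers. Only the flag names and the two environment variables come from the changed lines.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the two serve flags described in the diff."""
    parser = argparse.ArgumentParser(prog="agentevals serve")
    parser.add_argument("--dev", action="store_true", help="hot reload + dev console output")
    parser.add_argument("--headless", action="store_true", help="do not serve the UI")
    return parser


def serve_env(argv: List[str]) -> Dict[str, str]:
    """Map serve flags to the environment variables named in the diff."""
    args = build_parser().parse_args(argv)
    env = {"AGENTEVALS_LIVE": "1"}  # always set, regardless of flags
    if args.headless:
        env["AGENTEVALS_HEADLESS"] = "1"
    # --dev only affects reload and logging, so it contributes nothing here
    return env
```

Under these assumptions, `serve_env(["--dev"])` and `serve_env([])` return the same mapping, which reflects the point of the change: `--dev` no longer toggles features.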

 ## NixOS / Nix devshell

README.md

Lines changed: 40 additions & 12 deletions
@@ -2,21 +2,35 @@
 <img src="docs/assets/logo-color.png" alt="agentevals" width="420" />
 </p>

-`agentevals` scores agent behavior from OpenTelemetry traces without re-running the agent. It parses OTLP streams and Jaeger JSON traces, then evaluates them against golden eval sets using ADK's evaluation framework.
+`agentevals` evaluates AI agent behavior from OpenTelemetry traces, without re-running the agent. Record once, score as many times as you want.

-Ships as a **CLI** for scripting and CI, a **web UI** for visual inspection and interactive evaluation, and an **MCP server** so Claude Code can run evaluations directly from a conversation.
+Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others). Supports Jaeger JSON and OTLP trace formats, built-in and custom evaluators, and LLM-based judges.
+
+- **CLI** for scripting and CI pipelines
+- **Web UI** for visual inspection and local developer experience
+- **MCP server** so MCP clients can run evaluations from a conversation

 > [!IMPORTANT]
 > This project is under active development. Expect breaking changes.

-## Installation
+## Contents
+
+- [Installation](#installation)
+- [Quick Start](#quick-start)
+- [Integration](#integration)
+- [CLI](#cli)
+- [Custom Evaluators](#custom-evaluators)
+- [Web UI](#web-ui)
+- [REST API Reference](#rest-api-reference)
+- [MCP Server](#mcp-server)
+- [Claude Code Skills](#claude-code-skills)
+- [Docs](#docs)
+- [Development](#development)
+- [FAQ](#faq)

-Download a release wheel from the [releases page](../../releases):
+## Installation

-| Variant | Description |
-|---------|-------------|
-| **core** | CLI + REST API, batch evaluation only |
-| **bundle** | CLI + REST API + Streaming + embedded web UI |
+Grab a wheel from the [releases page](../../releases). The **core** wheel has the CLI and REST API. The **bundle** wheel adds streaming and the embedded web UI.

 ```bash
 pip install agentevals-<version>-py3-none-any.whl
@@ -28,7 +42,7 @@ pip install "agentevals-<version>-py3-none-any.whl[live]"
 **From source** with `uv` or Nix:

 ```bash
-uv sync
+uv sync
 # or: nix develop .
 ```

@@ -44,10 +58,10 @@ uv run agentevals run samples/helm.json \
   -m tool_trajectory_avg_score
 ```

-List available metrics:
+List available evaluators:

 ```bash
-uv run agentevals list-metrics
+uv run agentevals evaluator list
 ```

 ## Integration
@@ -102,6 +116,12 @@ uv run agentevals run samples/helm.json samples/k8s.json \
 uv run agentevals run samples/helm.json \
   --eval-set samples/eval_set_helm.json \
   --output json
+
+# List available evaluators (builtin + community)
+uv run agentevals evaluator list
+
+# List only builtin evaluators
+uv run agentevals evaluator list --source builtin
 ```

 ## Custom Evaluators
@@ -112,7 +132,14 @@ Beyond the built-in metrics, you can write your own evaluators in Python, JavaSc
 agentevals evaluator init my_evaluator
 ```

-This scaffolds a directory with boilerplate and a manifest. Implement your scoring logic, then reference it in an eval config:
+This scaffolds a directory with boilerplate and a manifest. You can also list supported runtimes and generate config snippets:
+
+```bash
+agentevals evaluator runtimes # show supported languages
+agentevals evaluator config my_evaluator --path ./evaluators/my_evaluator.py
+```
+
+Implement your scoring logic, then reference it in an eval config:

 ```yaml
 # eval_config.yaml
@@ -195,6 +222,7 @@ Two slash-command workflows in `.claude/skills/`, available automatically in thi
 |-------|-------------|
 | [Eval Set Format](docs/eval-set-format.md) | Schema, field reference, and examples for golden eval set JSON files |
 | [Custom Evaluators](docs/custom-evaluators.md) | Write your own scoring logic in Python, JavaScript, or any language |
+| [Live Streaming](docs/streaming.md) | Real-time trace streaming, dev server setup, and session management |
 | [OpenTelemetry Compatibility](docs/otel-compatibility.md) | Supported OTel conventions, message delivery mechanisms, and OTLP receiver |

 ## Development
