-Live mode (WebSocket streaming, session management, SSE) is enabled automatically when `--dev` is passed or when the bundled UI is detected; no extra dependencies required.
+Live mode (WebSocket streaming, session management, SSE) is always enabled when running `agentevals serve`. The `--dev` flag adds hot reload and dev-friendly console output but does not change what features are active.

The optional `[live]` extra (`pip install "agentevals[live]"`) adds `mcp` and `httpx`, which are only needed for the MCP server (`agentevals mcp`). The bundled wheel is built with `make build-bundle` and includes compiled UI assets baked into the package.
@@ -43,7 +43,7 @@ Both `build` and `build-bundle` produce `dist/agentevals-*.whl` with the same pa
make test # run all tests (unit + integration, excludes e2e)
make test-unit # unit tests only (fast, no server startup)
make test-integration # integration tests: OTLP pipeline, session grouping, timing (no API keys)
-make test-e2e # E2E tests: real agents as subprocesses (requires OPENAI_API_KEY)
+make test-e2e # E2E tests: real agents as subprocesses (requires OPENAI_API_KEY and/or GOOGLE_API_KEY)
```
### Cleanup
@@ -62,15 +62,15 @@ Tests are organized into three tiers with different trade-offs:
-| **E2E** | `tests/integration/test_live_agents.py` | Real uvicorn servers | `OPENAI_API_KEY` | Full pipeline: real agent → OTLP export → session creation → invocation extraction → API visibility |
+| **E2E** | `tests/integration/test_live_agents.py` | Real uvicorn servers | `OPENAI_API_KEY`, `GOOGLE_API_KEY` | Full pipeline: real agent → OTLP export → session creation → invocation extraction → API visibility |

Integration tests use `httpx.ASGITransport` to hit the OTLP and streaming API routes in-process (no ports, no real HTTP). Timers are configured fast (0.1s grace, 0.5s idle) for quick deterministic tests.
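
The in-process approach described above can be illustrated without any third-party package. The sketch below is not `httpx.ASGITransport` itself; it drives a stand-in ASGI app directly with `asyncio` to show what "no ports, no real HTTP" means in practice. The `app` callable, the `call_in_process` helper, and the `/v1/traces` path are assumptions made for illustration and are not taken from the agentevals code base.

```python
import asyncio
import json


async def app(scope, receive, send):
    """Hypothetical stand-in ASGI app; the real routes live in agentevals."""
    assert scope["type"] == "http"
    body = json.dumps({"path": scope["path"], "status": "ok"}).encode()
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"application/json")],
    })
    await send({"type": "http.response.body", "body": body})


async def call_in_process(asgi_app, method, path, body=b""):
    """Call an ASGI app directly: no socket is opened and no port is bound."""
    scope = {
        "type": "http",
        "asgi": {"version": "3.0"},
        "http_version": "1.1",
        "method": method,
        "path": path,
        "raw_path": path.encode(),
        "query_string": b"",
        "headers": [],
    }
    messages = []

    async def receive():
        # Deliver the whole request body in a single ASGI event.
        return {"type": "http.request", "body": body, "more_body": False}

    async def send(message):
        # Collect response events instead of writing them to a socket.
        messages.append(message)

    await asgi_app(scope, receive, send)
    status = messages[0]["status"]
    payload = b"".join(m.get("body", b"") for m in messages[1:])
    return status, payload


# Assumed path for illustration only.
status, payload = asyncio.run(call_in_process(app, "POST", "/v1/traces"))
```

Because the app is awaited like any other coroutine, a test built this way needs no server startup and leaves nothing to tear down, which is likely why this tier can stay fast and deterministic.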

E2E tests start real uvicorn servers on ephemeral ports in a background thread, then run example agent scripts as subprocesses that emit real OTLP traces with `BatchSpanProcessor`/`BatchLogRecordProcessor` flush timing.

### Running E2E tests

-E2E tests require `OPENAI_API_KEY` (used by LangChain and Strands agents). They are skipped automatically when the key is not set.
+E2E tests require `OPENAI_API_KEY` (LangChain and Strands agents) and/or `GOOGLE_API_KEY` (ADK agents). Each test class is skipped automatically when its required key is not set.
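
One way the per-class skipping could be wired up is sketched below with the standard library's `unittest`. This is an assumption for illustration: the project's tests may use pytest markers or a different helper, and `skip_reason` and `requires_key` are hypothetical names. Only the class names and the key names come from the text above, and the test bodies are placeholders.

```python
import os
import unittest


def skip_reason(required_key, env=None):
    """Return a skip message if the key is missing or empty, else None."""
    env = os.environ if env is None else env
    if env.get(required_key):
        return None
    return f"{required_key} is not set"


def requires_key(required_key):
    """Class decorator (hypothetical): skip the whole class without its key."""
    reason = skip_reason(required_key)
    if reason is None:
        return lambda cls: cls  # key present: leave the class untouched
    return unittest.skip(reason)


@requires_key("OPENAI_API_KEY")
class TestLangchainZeroCode(unittest.TestCase):
    def test_session_creation(self):
        # Placeholder body; the real test runs an agent subprocess.
        self.assertTrue(os.environ.get("OPENAI_API_KEY"))


@requires_key("GOOGLE_API_KEY")
class TestAdkZeroCode(unittest.TestCase):
    def test_session_creation(self):
        # Placeholder body; the real test runs an ADK agent subprocess.
        self.assertTrue(os.environ.get("GOOGLE_API_KEY"))
```

With neither key exported, both classes are reported as skipped. Exporting only `GOOGLE_API_KEY` runs `TestAdkZeroCode` alone, which matches the "and/or" wording in the updated text.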
```bash
# Source your .env and run
@@ -81,7 +81,7 @@ set -a && source .env && set +a && make test-e2e
When adding a new example agent to `examples/`, add corresponding E2E tests to ensure the full OTLP pipeline works:

-1. Add a test class in `tests/integration/test_live_agents.py` following the existing pattern (`TestLangchainZeroCode`, `TestStrandsZeroCode`)
+1. Add a test class in `tests/integration/test_live_agents.py` following the existing pattern (`TestLangchainZeroCode`, `TestStrandsZeroCode`, `TestAdkZeroCode`)
2. Each agent should have at minimum three tests:
   - **Session creation**: agent runs successfully, session is created with spans (and logs if applicable)
   - **Invocation extraction**: invocations are extracted with user/agent content
@@ -93,13 +93,13 @@ When adding a new example agent to `examples/`, add corresponding E2E tests to e
## Runtime behavior

-The serve command auto-detects the active mode:
+The serve command always enables live mode (WebSocket, streaming, sessions). The flags control UI serving and reload behavior:

-- `agentevals serve`: REST-only if no bundled UI present; full experience if bundled wheel
-- `agentevals serve --headless`: disables UI serving even in bundled builds (API-only)
+- `agentevals serve`: live mode + REST API; UI served if bundled `_static/` is present
+- `agentevals serve --dev`: same as above + hot reload on source changes + dev console output
+- `agentevals serve --headless`: live mode + REST API, UI suppressed even if bundled

-Controlled by environment variables `AGENTEVALS_LIVE=1` and `AGENTEVALS_HEADLESS=1`, which the CLI sets automatically based on flags and detected `_static/` presence.
+Controlled by environment variables `AGENTEVALS_LIVE=1` (always set by the CLI) and `AGENTEVALS_HEADLESS=1` (set when `--headless` is passed).
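
The flag-to-environment mapping described in the updated text can be summarised in a few lines. The sketch below is a reading of the prose, not the CLI's actual code: `serve_env` and `ui_enabled` are hypothetical helper names, and the real implementation may differ in how it detects `_static/` or passes state to the server.

```python
def serve_env(headless: bool = False) -> dict:
    """Environment the CLI is described as setting for `agentevals serve`."""
    env = {"AGENTEVALS_LIVE": "1"}  # always set, per the updated docs
    if headless:
        env["AGENTEVALS_HEADLESS"] = "1"  # set only when --headless is passed
    return env


def ui_enabled(env: dict, static_present: bool) -> bool:
    """UI is served only if bundled assets exist and headless is not set."""
    return static_present and env.get("AGENTEVALS_HEADLESS") != "1"


bundled_default = ui_enabled(serve_env(), static_present=True)
bundled_headless = ui_enabled(serve_env(headless=True), static_present=True)
core_default = ui_enabled(serve_env(), static_present=False)
```

Note that `--dev` does not appear in the mapping: according to the text it only adds hot reload and console output, so it should not change which variables are set.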
-`agentevals` scores agent behavior from OpenTelemetry traces without re-running the agent. It parses OTLP streams and Jaeger JSON traces, then evaluates them against golden eval sets using ADK's evaluation framework.
+`agentevals` evaluates AI agent behavior from OpenTelemetry traces, without re-running the agent. Record once, score as many times as you want.

-Ships as a **CLI** for scripting and CI, a **web UI** for visual inspection and interactive evaluation, and an **MCP server** so Claude Code can run evaluations directly from a conversation.
+Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others). Supports Jaeger JSON and OTLP trace formats, built-in and custom evaluators, and LLM-based judges.
+
+- **CLI** for scripting and CI pipelines
+- **Web UI** for visual inspection and local developer experience
+- **MCP server** so MCP clients can run evaluations from a conversation

> [!IMPORTANT]
> This project is under active development. Expect breaking changes.

-## Installation
+## Contents
+
+- [Installation](#installation)
+- [Quick Start](#quick-start)
+- [Integration](#integration)
+- [CLI](#cli)
+- [Custom Evaluators](#custom-evaluators)
+- [Web UI](#web-ui)
+- [REST API Reference](#rest-api-reference)
+- [MCP Server](#mcp-server)
+- [Claude Code Skills](#claude-code-skills)
+- [Docs](#docs)
+- [Development](#development)
+- [FAQ](#faq)

-Download a release wheel from the [releases page](../../releases):
+## Installation

-| Variant | Description |
-|---------|-------------|
-| **core** | CLI + REST API, batch evaluation only |
-| **bundle** | CLI + REST API + Streaming + embedded web UI |
+Grab a wheel from the [releases page](../../releases). The **core** wheel has the CLI and REST API. The **bundle** wheel adds streaming and the embedded web UI.