Merge pull request #55 from agentevals-dev/docs/refresh-readme

krisztianfekete · web-flow · commit fcbf65ac9e73 · 2026-03-20T12:36:24.000+01:00
Update README and docs
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -17,7 +17,7 @@ Thank you for your interest in contributing to agentevals! This document covers
 
 - Python 3.11+
 - [uv](https://docs.astral.sh/uv/) (Python package manager)
-- Node.js 18+ and npm (for the UI)
+- Node.js 20+ and npm (for the UI)
 - Optionally, [Nix](https://nixos.org/) — the project includes a `flake.nix` devshell
 
 ### Getting Started
@@ -143,6 +143,7 @@ test: add coverage for OTLP trace parsing
 src/agentevals/       # Python backend (FastAPI, CLI, evaluation engine)
 ui/src/               # React frontend (Vite, Ant Design, TypeScript)
 tests/                # Python tests (pytest)
+examples/             # Agent examples (zero-code, SDK, custom evaluators)
 samples/              # Example traces and eval sets
 docs/                 # Documentation
 ```
diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
@@ -2,14 +2,14 @@
 
 ## Distribution tiers
 
-agentevals ships as three distinct configurations from a single codebase:
+agentevals ships as two distribution variants from a single codebase:
 
-| Tier | Install | Serve behavior |
-|------|---------|----------------|
-| **Core** | `pip install agentevals` | REST API only — stateless batch evaluation endpoints |
-| **Bundle** | `pip install agentevals` (bundled wheel) | REST API + WebSocket streaming + session management + embedded React UI |
+| Tier | Install | What you get |
+|------|---------|--------------|
+| **Core** | `pip install agentevals` | CLI + REST API + live mode (WebSocket streaming, sessions, SSE) |
+| **Bundle** | `pip install agentevals` (bundled wheel) | Everything in Core + embedded React UI |
 
-Live mode (WebSocket streaming, session management, SSE) is enabled automatically when `--dev` is passed or when the bundled UI is detected — no extra dependencies required.
+Live mode (WebSocket streaming, session management, SSE) is always enabled when running `agentevals serve`. The `--dev` flag adds hot reload and dev-friendly console output but does not change which features are active.
 
 The optional `[live]` extra (`pip install "agentevals[live]"`) adds `mcp` and `httpx`, which are only needed for the MCP server (`agentevals mcp`). The bundled wheel is built with `make build-bundle` and includes compiled UI assets baked into the package.
 
@@ -43,7 +43,7 @@ Both `build` and `build-bundle` produce `dist/agentevals-*.whl` with the same pa
 make test              # run all tests (unit + integration, excludes e2e)
 make test-unit         # unit tests only (fast, no server startup)
 make test-integration  # integration tests — OTLP pipeline, session grouping, timing (no API keys)
-make test-e2e          # E2E tests — real agents as subprocesses (requires OPENAI_API_KEY)
+make test-e2e          # E2E tests — real agents as subprocesses (requires OPENAI_API_KEY and/or GOOGLE_API_KEY)
 ```
 
 ### Cleanup
@@ -62,15 +62,15 @@ Tests are organized into three tiers with different trade-offs:
 |------|----------|-----------|----------|------------------|
 | **Unit** | `tests/` (excl. integration) | `TestClient` / mocks | None | Business logic, route handlers, converters |
 | **Integration** | `tests/integration/` | ASGI in-process | None | OTLP session grouping, timing, concurrent batches, eval pipeline |
-| **E2E** | `tests/integration/test_live_agents.py` | Real uvicorn servers | `OPENAI_API_KEY` | Full pipeline — real agent → OTLP export → session creation → invocation extraction → API visibility |
+| **E2E** | `tests/integration/test_live_agents.py` | Real uvicorn servers | `OPENAI_API_KEY`, `GOOGLE_API_KEY` | Full pipeline — real agent → OTLP export → session creation → invocation extraction → API visibility |
 
 Integration tests use `httpx.ASGITransport` to hit the OTLP and streaming API routes in-process (no ports, no real HTTP). Timers are configured fast (0.1s grace, 0.5s idle) for quick deterministic tests.
 
 E2E tests start real uvicorn servers on ephemeral ports in a background thread, then run example agent scripts as subprocesses that emit real OTLP traces with `BatchSpanProcessor`/`BatchLogRecordProcessor` flush timing.
 
 ### Running E2E tests
 
-E2E tests require `OPENAI_API_KEY` (used by LangChain and Strands agents). They are skipped automatically when the key is not set.
+E2E tests require `OPENAI_API_KEY` (LangChain and Strands agents) and/or `GOOGLE_API_KEY` (ADK agents). Each test class is skipped automatically when its required key is not set.
 
 ```bash
 # Source your .env and run
@@ -81,7 +81,7 @@ set -a && source .env && set +a && make test-e2e
 
 When adding a new example agent to `examples/`, add corresponding E2E tests to ensure the full OTLP pipeline works:
 
-1. Add a test class in `tests/integration/test_live_agents.py` following the existing pattern (`TestLangchainZeroCode`, `TestStrandsZeroCode`)
+1. Add a test class in `tests/integration/test_live_agents.py` following the existing pattern (`TestLangchainZeroCode`, `TestStrandsZeroCode`, `TestAdkZeroCode`)
 2. Each agent should have at minimum three tests:
    - **Session creation** — agent runs successfully, session is created with spans (and logs if applicable)
    - **Invocation extraction** — invocations are extracted with user/agent content
@@ -93,13 +93,13 @@ When adding a new example agent to `examples/`, add corresponding E2E tests to e
 
 ## Runtime behavior
 
-The serve command auto-detects the active mode:
+The serve command always enables live mode (WebSocket streaming, session management, SSE). The flags control only UI serving and reload behavior:
 
-- `agentevals serve` — REST-only if no bundled UI present; full experience if bundled wheel
-- `agentevals serve --dev` — always enables live mode (WebSocket + streaming + sessions)
-- `agentevals serve --headless` — disables UI serving even in bundled builds (API-only)
+- `agentevals serve` — live mode + REST API; UI served if bundled `_static/` is present
+- `agentevals serve --dev` — same as above + hot reload on source changes + dev console output
+- `agentevals serve --headless` — live mode + REST API, UI suppressed even if bundled
 
-Controlled by environment variables `AGENTEVALS_LIVE=1` and `AGENTEVALS_HEADLESS=1`, which the CLI sets automatically based on flags and detected `_static/` presence.
+This behavior is controlled by the environment variables `AGENTEVALS_LIVE=1` (always set by the CLI) and `AGENTEVALS_HEADLESS=1` (set when `--headless` is passed).
 
 ## NixOS / Nix devshell
 
diff --git a/README.md b/README.md
@@ -2,21 +2,35 @@
   <img src="docs/assets/logo-color.png" alt="agentevals" width="420" />
 </p>
 
-`agentevals` scores agent behavior from OpenTelemetry traces without re-running the agent. It parses OTLP streams and Jaeger JSON traces, then evaluates them against golden eval sets using ADK's evaluation framework.
+`agentevals` evaluates AI agent behavior from OpenTelemetry traces, without re-running the agent. Record once, score as many times as you want.
 
-Ships as a **CLI** for scripting and CI, a **web UI** for visual inspection and interactive evaluation, and an **MCP server** so Claude Code can run evaluations directly from a conversation.
+Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others). Supports Jaeger JSON and OTLP trace formats, built-in and custom evaluators, and LLM-based judges.
+
+- **CLI** for scripting and CI pipelines
+- **Web UI** for visual inspection and local development
+- **MCP server** so MCP clients can run evaluations from a conversation
 
 > [!IMPORTANT]
 > This project is under active development. Expect breaking changes.
 
-## Installation
+## Contents
+
+- [Installation](#installation)
+- [Quick Start](#quick-start)
+- [Integration](#integration)
+- [CLI](#cli)
+- [Custom Evaluators](#custom-evaluators)
+- [Web UI](#web-ui)
+- [REST API Reference](#rest-api-reference)
+- [MCP Server](#mcp-server)
+- [Claude Code Skills](#claude-code-skills)
+- [Docs](#docs)
+- [Development](#development)
+- [FAQ](#faq)
 
-Download a release wheel from the [releases page](../../releases):
+## Installation
 
-| Variant | Description |
-|---------|-------------|
-| **core** | CLI + REST API, batch evaluation only |
-| **bundle** | CLI + REST API + Streaming + embedded web UI |
+Grab a wheel from the [releases page](../../releases). The **core** wheel has the CLI, the REST API, and live mode (WebSocket streaming, sessions, SSE). The **bundle** wheel adds the embedded web UI.
 
 ```bash
 pip install agentevals-<version>-py3-none-any.whl
@@ -28,7 +42,7 @@ pip install "agentevals-<version>-py3-none-any.whl[live]"
 **From source** with `uv` or Nix:
 
 ```bash
-uv sync              
+uv sync
 # or: nix develop .
 ```
 
@@ -44,10 +58,10 @@ uv run agentevals run samples/helm.json \
   -m tool_trajectory_avg_score
 ```
 
-List available metrics:
+List available evaluators:
 
 ```bash
-uv run agentevals list-metrics
+uv run agentevals evaluator list
 ```
 
 ## Integration
@@ -102,6 +116,12 @@ uv run agentevals run samples/helm.json samples/k8s.json \
 uv run agentevals run samples/helm.json \
   --eval-set samples/eval_set_helm.json \
   --output json
+
+# List available evaluators (built-in + community)
+uv run agentevals evaluator list
+
+# List only built-in evaluators
+uv run agentevals evaluator list --source builtin
 ```
 
 ## Custom Evaluators
@@ -112,7 +132,14 @@ Beyond the built-in metrics, you can write your own evaluators in Python, JavaSc
 agentevals evaluator init my_evaluator
 ```
 
-This scaffolds a directory with boilerplate and a manifest. Implement your scoring logic, then reference it in an eval config:
+This scaffolds a directory with boilerplate and a manifest. You can also list supported runtimes and generate config snippets:
+
+```bash
+agentevals evaluator runtimes           # show supported languages
+agentevals evaluator config my_evaluator --path ./evaluators/my_evaluator.py
+```
+
+Implement your scoring logic, then reference it in an eval config:
 
 ```yaml
 # eval_config.yaml
@@ -195,6 +222,7 @@ Two slash-command workflows in `.claude/skills/`, available automatically in thi
 |-------|-------------|
 | [Eval Set Format](docs/eval-set-format.md) | Schema, field reference, and examples for golden eval set JSON files |
 | [Custom Evaluators](docs/custom-evaluators.md) | Write your own scoring logic in Python, JavaScript, or any language |
+| [Live Streaming](docs/streaming.md) | Real-time trace streaming, dev server setup, and session management |
 | [OpenTelemetry Compatibility](docs/otel-compatibility.md) | Supported OTel conventions, message delivery mechanisms, and OTLP receiver |
 
 ## Development