AGeval

The autopilot for AI agents — not a flight recorder.

Every other eval tool (LangSmith, Langfuse, Braintrust, Arize) is retrospective: observe → ingest → score later. By the time you see the number, the bad output already shipped. AGeval renders a trustworthy verdict mid-run — scoring each step as it happens against four layers of evaluation memory — and hands the agent an actionable allow / warn / escalate / block so it can repair, route to review, or stop before bad output reaches a user.

It's also a full episodic evaluation framework: one import traces every tool call, scores every run with 28 built-in metrics + 3 scorers, and remembers every episode. Works with any framework — LangGraph, CrewAI, AutoGen, MCP, OpenAI, Anthropic, or fully custom. Zero to your first evaluated episode in under 5 minutes.

Proven on a real fleet of 142 agents across 20 industry verticals hitting live public APIs (SEC EDGAR, openFDA, ClinicalTrials, FX, transit, …) plus 20 elaborate multi-stage workflows — not toy demos.

Why AGeval?

Problem	AGeval Solution
"Can I stop a bad run before it ships output?"	Live in-the-loop verdict — `session.evaluate_step()` scores a step against memory in-process (no LLM) and returns `allow/warn/escalate/block` to act on
"Is my agent getting better or worse?"	28 built-in metrics + 3 independent scorers — every run gets a reliability number
"What happened in that failed run?"	Full trace — every tool call, input, output, latency, reasoning
"Has my agent seen a task like this before?"	Episodic memory — pgvector similarity search across all past runs
"Is this failure new or recurring?"	Failure-pattern memory — signatures cluster known failure modes and track recurrence
"How does this run compare to peers?"	Peer-relative scoring — percentile bands within semantic cluster
"Which framework do I need to use?"	Any framework — LangGraph, OpenAI, CrewAI, AutoGen, Anthropic, MCP, or fully custom
"I can't change my agent code"	Zero-code auto-instrumentation — `import ageval.auto` patches everything globally

Install

pip install ageval-sdk

# With framework-specific extras:
pip install ageval-sdk[openai]      # For OpenAI function-calling agents
pip install ageval-sdk[langchain]   # For LangGraph / LangChain agents
pip install ageval-sdk[anthropic]   # For Anthropic (Claude) agents
pip install ageval-sdk[otel]        # For OpenTelemetry export
pip install ageval-sdk[all]         # Everything

Live, in-the-loop evaluation (the wedge)

Ask for a verdict before a step runs, and branch on it:

from ageval import AgentSession

s = AgentSession(agent_id="credit_analyst_v1")
s.start()

# BEFORE you execute a tool, score the proposed step against memory:
v = s.evaluate_step("process_payment", {"amount": 4200},
                    reasoning="charge the order total")

if v.action == "escalate":        # matches a known failure / big outlier
    route_to_human(v.explain())   # "matches known failure · charge before verify"
elif v.action == "warn":          # off the golden path, mild anomaly
    log.warning(v.explain())
else:
    process_payment(amount=4200)  # allow → proceed

evaluate_step() is in-process and LLM-free (cosine vs failure signatures, z-score vs baselines, golden-path adherence) so it sits in the hot path. It fails open (returns allow) if AGeval is unreachable. Shadow-first by default: verdicts are advisory until you opt a policy into enforce mode, and a policy can only ever make actions stricter, never looser.

Want zero code changes? Set AGEVAL_GUARD=1 and import ageval.auto — each LangChain tool call is evaluated in on_tool_start and an enforce-mode block raises AgevalBlocked before the tool runs.

Quick Start — 5 Ways to Integrate

1. Zero-Code Auto-Instrumentation

import ageval.auto  # patches OpenAI + Anthropic SDKs and LangChain callbacks globally
                    # idempotent, framework-agnostic, silent no-op if AGEVAL_API_KEY is unset

2. Any Agent (Universal)

from ageval import AgentSession

with AgentSession(agent_id="my_agent_v1", task="book a flight to Paris") as session:
    result = search_flights("Paris")
    session.record_step(
        tool_name="search_flights",
        tool_input={"destination": "Paris"},
        tool_output=result,
        success=True,
        reasoning="User wants to go to Paris",
        latency_ms=120,
    )
    # Or auto-wrap any callable:
    hotels = session.traced(search_hotels, reasoning="Finding hotels")("Paris", budget="moderate")

3. OpenAI Function Calling

from ageval import trace_openai
from openai import OpenAI

result = trace_openai(
    client=OpenAI(),
    messages=[{"role": "user", "content": "Plan a trip to Paris"}],
    tools=my_tool_definitions,
    tool_functions={"search_flights": search_flights, "search_hotels": search_hotels},
    agent_id="trip_planner_v1",
    task="Plan a trip to Paris",
)
# result["episode_id"] → query scores later

4. Anthropic (Claude) Tool Use

from ageval import trace_anthropic
from anthropic import Anthropic

result = trace_anthropic(
    client=Anthropic(),
    messages=[{"role": "user", "content": "Plan a trip to Paris"}],
    tools=my_tool_definitions,           # Anthropic format (name/description/input_schema)
    tool_functions={"search_flights": search_flights, "search_hotels": search_hotels},
    agent_id="trip_planner_v1",
    task="Plan a trip to Paris",
    model="claude-haiku-4-5-20251001",
)
# result["episode_id"], result["usage"] → tokens captured for cost metrics

5. LangGraph / LangChain (Zero Changes)

from ageval import trace_agent

result = trace_agent(
    agent    = your_langgraph_app,
    input    = {"messages": [("user", "Plan a trip to Paris")]},
    agent_id = "trip_planner_v1",
    task     = "Plan a trip to Paris",
)

What Gets Captured

For every tool call your agent makes:

Field	What it means
`tool_name`	Which tool was called
`tool_input`	What was passed in (any JSON)
`tool_output`	What came back (any JSON)
`success`	Did it work
`error_category`	`agent_error` / `env_error` / `unknown`
`is_recoverable`	Should the agent retry
`reasoning`	Why the agent made this call
`latency_ms`	How long it took

Scoring

Three independent scorers run automatically after every episode.

Rule-based scorer (`eval/rules.py`)

Deterministic, always available, no LLM required:

Metric	Weight	What it measures
`success_rate`	0.25	Fraction of tool calls that succeeded
`recovery_rate`	0.25	Env errors followed by a successful step
`reasoning_coverage`	0.25	Steps with reasoning provided
`efficiency_score`	0.25	Penalises back-to-back duplicate tool calls

LLM judge (`eval/llm_judge.py`)

Configurable model (default: gpt-4o-mini, override with AGEVAL_JUDGE_MODEL):

Dimension	Weight	What it measures
`task_completion`	0.20	Did the agent achieve the stated goal?
`hallucination_free`	0.17	Did the agent avoid facts not grounded in tool outputs?
`output_quality`	0.13	Is the final output useful and accurate?
`reasoning_quality`	0.12	Was the chain-of-thought coherent?
`instruction_following`	0.12	Did the agent stay within user's stated constraints?
`error_handling`	0.10	Did the agent recover gracefully?
`efficiency`	0.08	Did the agent avoid unnecessary steps?
`tool_appropriateness`	0.08	Were the right tools used for each sub-task?

28 built-in deterministic metrics (no LLM required)

All 28 auto-computed metrics are persisted as the custom scorer after every episode. List them via GET /metrics/catalogue or ageval.list_metrics().

Reliability

Metric	What it measures
`agent_error_rate`	Fraction of steps failing due to agent mistakes
`env_error_rate`	Fraction failing due to environment / transient errors
`fatal_error_rate`	Fraction of failures that are non-recoverable
`first_call_success`	Success on the first tool call (upfront understanding)
`last_call_success`	Final step succeeded (clean landing)

Efficiency

Metric	What it measures
`step_economy`	Penalises long episodes (1.0 at ≤3 steps, 0.0 at ≥20)
`p95_step_latency`	Per-call responsiveness (1.0 at ≤1s, 0.0 at ≥15s)
`retry_overhead`	Fraction of steps that repeat the same tool/input after failure

Agentic Behaviour

Metric	What it measures
`tool_call_precision`	Successful unique-purpose tool calls / total calls
`goal_progress`	Forward momentum — fraction of transitions to new tools
`reasoning_depth`	Average reasoning string length (chain-of-thought quality)

Cost & Backtracking

Metric	What it measures
`backtrack_rate`	Repeated `(tool, input)` pairs anywhere in trajectory
`token_economy`	Token efficiency (1.0 at ≤2k tokens, 0.0 at ≥50k)
`reasoning_action_alignment`	Tool calls preceded by reasoning AND succeeded

Observability

Metric	What it measures
`tool_diversity`	Unique tools used / total steps
`multi_tool_usage`	Binary: agent used 2+ distinct tools
`output_richness`	Information density of tool outputs (avg JSON length)
`latency_budget`	Decays from 1.0 at 5s to 0.0 at 60s total episode time
`error_recovery_speed`	Steps to recover from an env_error

Deep Evaluation v2

Metric	What it measures
`recovery_success_rate`	Steps after failure that ultimately succeed
`failure_clustering`	Failures grouped or isolated (1.0 = isolated blips)
`tool_selection_entropy`	Balanced exploration across tools (normalized Shannon entropy)
`progress_monotonicity`	No rework on already-solved sub-tasks
`cost_per_success`	Tokens per successful step (1.0 at ≤1k, 0.0 at ≥20k)
`latency_consistency`	Stability of per-step timing (1 − coefficient of variation)
`error_concentration`	Errors in one tool (diagnosable) vs spread (systemic)

Optional advanced scorers

Scorer	Description
`trajectory`	Edit-distance adherence to the golden tool path mined per task cluster
`pairwise`	LCS trajectory diff + LLM comparative judgment between two episodes
`reference`	RAG-grounded faithfulness and answer relevance scoring (`POST /episodes/{id}/score/reference`)

Custom metrics

from ageval import register_metric

@register_metric("cost_efficiency", weight=0.3)
def cost_efficiency(steps, episode):
    prices = [s["tool_output"].get("price", 999) for s in steps if s.get("success")]
    return 1.0 if min(prices, default=999) < 500 else 0.5

Evaluation Memory

AGeval builds four memory layers across every episode. Each layer compounds value over time.

Layer 1 — Failure-Pattern Memory

Clusters failed steps by (error_category, tool_name, position_band, error_embedding) into named signatures. New episodes are triaged against known failure patterns — you see "this failure appeared in 14 runs over 3 days" before opening a single trace. Signatures seed auto-generated regression evals.

Layer 2 — Semantic Cluster Baselines (Peer-Relative Scoring)

Embeds every episode by task and groups with K-means. Maintains running score distributions (mean, p10, p50, p90, stddev) per cluster. Every GET /episodes/{id} response includes a percentile and band ("bottom 10%", "typical", "top 10%") relative to the agent's peer group. Cold-start safe: falls back to absolute scores below n=20.

Layer 3 — Procedural Memory (Golden Trajectories)

Mines the highest-scoring episodes per cluster to extract canonical tool sequences and expected step counts. Enables trajectory_adherence scoring — catches agents that reach the right answer via a wasteful or fragile path.

Layer 4 — Regression & Drift Detection

Compares the last 7 days of runs vs the prior 7-day baseline per metric, per scorer, per agent. Surfaces score deltas, new failure signatures, step-count drift, and trajectory shape changes. Online drift alerts fire when a cluster's recent mean drops more than k·σ below baseline (DRIFT_K=2.0 by default).

BudgetGuard

A hard-cap spend proxy for Anthropic that protects against runaway costs:

from examples.agents.budget_guard import BudgetGuard
from anthropic import Anthropic

guarded = BudgetGuard(Anthropic(), cap_usd=0.50)
# Works like the native client — raises BudgetExceeded before hitting the API
# if the worst-case cost of the call would breach the cap.
response = guarded.messages.create(model="claude-haiku-4-5-20251001", ...)

BudgetGuard pre-estimates cost using a token-counter + max_tokens assumption and reconciles with actual usage after each call. Pricing table covers Haiku, Sonnet, and Opus.

The real fleet — 142 agents on live APIs

Beyond the local-toolkit reference agents below, examples/agents/fleet/ is a fleet of 142 real business agents across 20 industry verticals, each running a real OpenAI brain against live external APIs — and recorded as scored AGeval episodes. No toy demos: a credit analyst pulling SEC EDGAR 10-Ks, a pharmacovigilance bot scanning openFDA recalls, a logistics planner hitting live transit data.

real_tools.py — ~40 real read tools (SEC EDGAR, openFDA, ClinicalTrials, USGS, arXiv, Crossref, CoinGecko, World Bank, NHTSA, CityBikes, …) behind a polite HTTP client (declared User-Agent, per-host rate limiting, backoff, no caching).
sideeffects.py — real side-effecting tools (Supabase write, is.gd short links, QR, webhooks; gated email/Slack) behind a master AGEVAL_LIVE_SIDE_EFFECTS=1 switch + honest per-tool credential gate.
fleet/registry.py + factory.py — 142 declarative AgentSpecs → real traced episodes.
fleet/run_fleet.py — sweeps the fleet under a hard USD cap (OpenAIBudgetGuard) with per-host rate limiting; prints a vertical × framework coverage matrix.
fleet/flagships/ — hand-written framework-depth agents (LangGraph StateGraph + ReAct, MCP, zero-code import ageval.auto).

python -m examples.agents.real_tools                              # prove the tools are live (re-run → values change)
python -m examples.agents.fleet.run_fleet --only finance_banking  # one vertical, real episodes
python -m examples.agents.fleet.run_fleet --sample 1 --cap 1.50   # ~1 agent/vertical under a budget

20 elaborate multi-stage workflows

examples/agents/fleet/workflows/ holds 20 real-life workflows — several live tool stages that feed each other, an LLM synthesis, and sometimes a real side-effect action. Each run is a ≥4-step trajectory scored against the golden path (M&A diligence, property underwriting, drug-safety triage, supply-chain incident response, compliance monitor, …).

# --explain streams the live verdict + rationale per stage ("watch the eval think")
python -m examples.agents.fleet.workflows.run_workflows --only wf.finance.ma_diligence --explain
python -m examples.agents.fleet.workflows.run_workflows --cap 2.00   # all 20, under budget

See examples/agents/fleet/README.md and examples/agents/fleet/workflows/README.md.

Example Agents (22 reference implementations)

All examples live in examples/agents/ and share a common toolkit (toolkit.py) with 20+ realistic production tools — HTTP, SQL, vector search, payments, file I/O, code execution, calendar, Slack, and more.

#	Agent	Framework	What it demonstrates
01	Customer Support Agent	OpenAI	`get_customer`, `vector_search`, `sql_query`, `create_ticket`, `send_email` — tier-1 support with knowledge-base lookup and ticket escalation
02	E-commerce Order Agent	OpenAI	Multi-step order processing — product lookup, inventory check, payment flow
03	Travel Concierge Agent	OpenAI	`get_weather`, `search_flights`, `currency_convert`, `book_calendar` — cross-tool data flow
04	DevOps Incident Agent	OpenAI	Infrastructure incident response — health checks, alerting, mitigation
05	Financial Analyst Agent	OpenAI	Market analysis and structured reporting
06	Research Assistant	Anthropic	Knowledge synthesis from multiple sources
07	Sales Outreach Agent	Anthropic	`sql_query` → `get_customer` → `send_email` → `book_calendar` — full outreach pipeline
08	Coding Agent	Anthropic	`write_file`, `read_file`, `run_python` — write function, save, execute in sandbox
09	LangGraph Support Router	LangGraph	Real `StateGraph` with tool node — traces MCP-served tools via `AgentSession`
10	LangGraph Data Pipeline	LangGraph	Multi-step ETL workflow
11	LangGraph Human-in-Loop	LangGraph	Agent with approval gate checkpoints
12	CrewAI Marketing Crew	CrewAI	Two-agent crew (Researcher + Copywriter) generating launch copy
13	CrewAI Recruiting Crew	CrewAI	Multi-agent recruitment process
14	MCP Server Agent	MCP	Real MCP server/client transport — proves AGeval traces MCP tool calls
15	MCP + Anthropic Agent	MCP + Anthropic	Anthropic client calling an MCP server
16	AutoGen Group Chat	AutoGen	Multi-agent group conversation
17	RPA Back Office Agent	Custom	File processing and form-filling — robotic process automation
18	LangGraph Long Research	LangGraph	15–25 tool calls — stresses step volume and aggregation
19	LangGraph Supervisor Team	LangGraph	Supervisor-routed multi-agent team
20	Failing Retry Loop (intentional)	LangGraph	`reconcile_inventory` always fails — demonstrates failure detection for stuck retry loops
21	Bad Arguments (intentional)	LangGraph	Argument validation failures — shows `agent_error` vs `env_error` classification
22	BudgetGuard	Anthropic	Hard USD spend cap with worst-case cost estimation

Run all examples:

python examples/agents/run_all.py

Supported Frameworks

Framework	Integration	Effort
Any custom agent	`AgentSession` + `record_step()`	Wrap each tool call
OpenAI function calling	`trace_openai()` — full tool loop	One function call
Anthropic (Claude) tool use	`trace_anthropic()` — full tool loop	One function call
LangGraph / LangChain	`trace_agent()` — StateGraph drop-in	Zero changes
CrewAI	`AgentSession` + `traced()` (via LangChain)	Wrap each tool call
AutoGen	`AgentSession` + `traced()` (via OpenAI/Anthropic)	Wrap each tool call
MCP servers	`AgentSession` + `record_step()`	Wrap MCP tool calls
Any framework (zero-code)	`import ageval.auto`	Zero changes

Dashboard

The Next.js 14 frontend leads with the live-evaluation surface.

Guardrails (live evaluation)

Page	Description
`/`	Landing — autopilot hero with a live verdict-stream demo, the wedge (flight-recorder vs autopilot), animated 142-agent fleet grid + 20-workflow showcase
`/live-eval`	Live Eval — two tabs: Recorded (verdict trail + score provenance per episode, via `GET /episodes/{id}/explain`) and Run live (stream a verdict per step from `POST /evaluate/stream`; "Deviate from golden path" to watch a `warn` fire)

Overview & Traces

Page	Description
`/dashboard`	KPI aggregate — outcomes, avg scores per scorer, failure rate, latency
`/traces`	Full episode list with real-time search (ID / task / agent ID) and outcome filter
`/episodes/[id]`	Single episode — step timeline, reasoning, tool inputs/outputs, all scorer breakdowns, plus a "Why this score →" deep-link to the provenance view

Evaluation Memory

Page	Description
`/clusters`	Semantic task clusters with drift indicators
`/failures`	Failure-pattern signatures — recurrence counts, first/last seen, triage view
`/regression`	Score trajectory across versions — per-metric delta vs baseline

Evaluation Tools

Page	Description
`/compare`	A/B episode comparison — LCS trajectory diff + LLM pairwise verdict
`/recall`	Semantic similarity search across all past runs (pgvector)
`/datasets`	Golden dataset management — Supabase-backed, user-scoped test cases
`/red-teaming`	Adversarial probe runner — prompt injection, data exfiltration, jailbreak
`/test-suites`	Test suite collections
`/playground`	Live scoring sandbox

Admin

Page	Description
`/settings`	Self-service API key generation, rotation, and revocation
`/team`	Multi-user management

API Reference

Ingestion (SDK → Server)

POST /episodes            — create a stub episode
POST /steps               — write one step
POST /steps/batch         — write multiple steps in one request
POST /jobs                — trigger scoring for a completed episode
POST /webhooks            — register a webhook for score / anomaly alerts

Query

GET  /overview                          — KPI aggregate: outcomes, avg score, metric breakdown
GET  /episodes                          — list episodes (filter: ?agent_id= &outcome=)
GET  /episodes/{id}                     — full detail + steps + scores + relative_scores
GET  /episodes/{id}/steps               — paginated steps
GET  /episodes/{id}/explain             — score provenance: metrics ranked by shortfall, step evidence, live verdict trail
GET  /agents                            — distinct agent_ids for the authenticated user
GET  /trends?agent_id=X                 — score time-series (scorer = rules | custom | llm_judge)
GET  /metrics/catalogue                 — list all 28 built-in metrics + descriptions
GET  /similar?episode_id=X              — find similar episodes (pgvector kNN)
GET  /recall?task=...                   — semantic search by free-text task
GET  /compare?episode_a=X&episode_b=Y   — trajectory diff + pairwise judgment
GET  /jobs/{id}/status                  — poll merge/scoring job
GET  /health                            — liveness probe
GET  /metrics                           — operational metrics (requests, latencies)

Live evaluation (the wedge)

POST /evaluate                          — in-the-loop verdict for one proposed step (allow/warn/escalate/block)
POST /evaluate/stream                   — SSE: stream a verdict per step for a proposed trajectory
GET  /agents/{id}/policies              — append-only enforcement policies (highest version = active)
POST /agents/{id}/policies              — publish a new policy version (log_only | enforce)
GET  /agents/{id}/live/verdicts         — shadow-mode verdict audit trail + summary

Evaluation Memory

GET  /clusters                          — list semantic task clusters
GET  /drift                             — clusters whose recent mean is drifting below baseline
GET  /drift/alerts                      — online drift alert feed
GET  /clusters/{id}/failures            — failure pattern aggregate for a cluster
GET  /agents/{id}/regression            — regression report (score deltas, new failure signatures, drift)
POST /episodes/{id}/score/reference     — RAG-grounded faithfulness + relevance scoring

Evaluation Services

POST /redteam/run                       — run adversarial probes, get a security scorecard
POST /synthetic/generate                — LLM-bootstrap a dataset from seed examples
GET  /v1/datasets?project_id=X          — list golden datasets
POST /v1/datasets                       — create a golden dataset + test cases

Key Management

POST /register          — create API key (admin only, requires AGEVAL_ADMIN_SECRET)
POST /keys              — self-service key generation
POST /keys/rotate       — rotate your key
GET  /keys              — list your keys
DELETE /keys/{id}       — revoke a key

All requests require Authorization: Bearer <token>. Dashboard users send a Supabase JWT (email + password login at /login). SDK agents send an ageval-sk-<48-hex-chars> key generated at /settings.

Run the Server

# Required:
AGEVAL_SUPABASE_URL=...
AGEVAL_SUPABASE_SERVICE_KEY=...
AGEVAL_ADMIN_SECRET=...            # no default — registration disabled without this

# Optional:
OPENAI_API_KEY=...                 # for pgvector embeddings + LLM judge
AGEVAL_JUDGE_MODEL=gpt-4o-mini     # override the LLM judge model
LANGSMITH_API_KEY=...              # only for LangChain / LangGraph agents
REDIS_URL=...                      # distributed rate limiting (falls back to in-memory)
DRIFT_K=2.0                        # drift alert sensitivity (std devs below baseline)
RECENT_DAYS=7                      # lookback window for regression + drift
POLL_INTERVAL=5                    # merger worker poll interval in seconds

# Start the API:
uvicorn main:app --reload

# Start the background merger worker:
python -m merger.worker

# Or with Docker:
docker-compose up

Background Worker

The merger worker (python -m merger.worker) runs the full evaluation pipeline:

Polls episode_jobs via Supabase SELECT FOR UPDATE SKIP LOCKED — no Redis needed, scales to many workers
Derives outcome, latency, total_steps, and episode_fingerprint (stable hash of tool sequence + outcome)
Runs all three scorers: rules + LLM judge + 28 deterministic metrics
Embeds tasks and clusters episodes with K-means (~every 5 min, configurable)
Updates cluster score baselines for peer-relative scoring
Folds failing steps into failure-pattern memory
Mines golden trajectories from top-scoring episodes per cluster
Runs regression comparison (last 7d vs prior 7d baseline per scorer and metric)
Fires drift alerts when a cluster's recent mean drops k·σ below baseline
Delivers webhooks for low-score or anomaly events (with HMAC-SHA256 signatures and retries)

Graceful degradation: missing scikit-learn → clustering skipped; missing OpenAI key → embeddings and LLM judge skipped; everything else continues.

Tests

pip install -r requirements-dev.txt
pytest tests/ -v

Test file	Covers
`test_session.py`	`AgentSession`, `trace_callable`
`test_tracers.py`	`trace_agent`, `trace_openai`, `trace_anthropic`
`test_auto_instrumentation.py`	Monkeypatching OpenAI / Anthropic / LangChain globally
`test_metrics.py`	All 28 built-in metrics + custom metric registry
`test_property_metrics.py`	Property-based metric testing (hypothesis)
`test_property_tracers.py`	Property tests for tracer logic
`test_eval_rules.py`	Rule-based scorer
`test_eval_memory.py`	Memory system integration
`test_phase2_memory.py`	Regression detection, procedural memory
`test_phase3_eval.py`	Pairwise, drift alerts, reference-grounded scoring
`test_langgraph_agents_live.py`	Live LangGraph integration
`test_edge_cases.py`	Error conditions, malformed inputs
`test_api_contract.py`	API schema validation
`test_sdk.py`	SDK integration
`test_datasets_router.py`	Dataset API endpoints
`test_failures_router.py`	Failure analysis endpoints
`test_redteam_synthetic.py`	Red-team + synthetic data generation
`test_rate_limiter.py`	Rate limiting
`test_key_management.py`	API key gen / rotate / revoke

Graceful Degradation

If AGEVAL_API_KEY is not set:

trace_agent() falls back to plain agent.invoke()
trace_openai() falls back to plain chat.completions.create()
trace_anthropic() falls back to plain messages.create()
import ageval.auto patches silently no-op
AgentSession records steps locally but doesn't send them
Zero crashes, zero overhead, zero exceptions

Security

API keys stored as SHA-256 hashes — raw key never persisted
SSRF protection on webhook URLs at registration and delivery time (DNS re-check)
Row Level Security (RLS) at Postgres layer — complete multi-tenant isolation
HMAC-SHA256 webhook signatures
Registration disabled unless AGEVAL_ADMIN_SECRET is explicitly set
Per-key rate limiting (Redis or in-memory fallback)
OpenTelemetry span export for production observability

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.github		.github
ageval		ageval
api		api
eval		eval
examples		examples
frontend		frontend
github-action		github-action
merger		merger
sdk		sdk
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.vercelignore		.vercelignore
Dockerfile		Dockerfile
Dockerfile.worker		Dockerfile.worker
LAUNCH_POST.md		LAUNCH_POST.md
README.md		README.md
docker-compose.yml		docker-compose.yml
main.py		main.py
pyproject.toml		pyproject.toml
railway.toml		railway.toml
rate_limiter.py		rate_limiter.py
real_agent_test.py		real_agent_test.py
render.yaml		render.yaml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

AGeval

Why AGeval?

Install

Live, in-the-loop evaluation (the wedge)

Quick Start — 5 Ways to Integrate

1. Zero-Code Auto-Instrumentation

2. Any Agent (Universal)

3. OpenAI Function Calling

4. Anthropic (Claude) Tool Use

5. LangGraph / LangChain (Zero Changes)

What Gets Captured

Scoring

Rule-based scorer (eval/rules.py)

LLM judge (eval/llm_judge.py)

28 built-in deterministic metrics (no LLM required)

Optional advanced scorers

Custom metrics

Evaluation Memory

Layer 1 — Failure-Pattern Memory

Layer 2 — Semantic Cluster Baselines (Peer-Relative Scoring)

Layer 3 — Procedural Memory (Golden Trajectories)

Layer 4 — Regression & Drift Detection

BudgetGuard

The real fleet — 142 agents on live APIs

20 elaborate multi-stage workflows

Example Agents (22 reference implementations)

Supported Frameworks

Dashboard

Guardrails (live evaluation)

Overview & Traces

Evaluation Memory

Evaluation Tools

Admin

API Reference

Ingestion (SDK → Server)

Query

Live evaluation (the wedge)

Evaluation Memory

Evaluation Services

Key Management

Run the Server

Background Worker

Tests

Graceful Degradation

Security

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Rule-based scorer (`eval/rules.py`)

LLM judge (`eval/llm_judge.py`)

Packages