Evaluation CLI for ADK agents — run interactions, evaluate with metrics, and analyze results.
Move from trial-and-error prompt engineering to systematic, measurable agent optimization.
agent-eval is a command-line tool that acts as the glue between:
- OpenTelemetry traces from ADK
- Vertex AI Evaluation service for LLM-as-judge metrics
- Gemini for analyzing results
It consolidates the full evaluation workflow into a single CLI: generate interactions, extract deterministic metrics from traces, run LLM-as-judge scoring, and produce AI-powered analysis reports.
```
   Interact             Evaluate              Analyze
 ──────────          ────────────          ────────────
│ simulate │  ──▶  │ Deterministic │ ──▶ │  Gemini AI  │
│ interact │       │  + LLM Judge  │     │  Diagnosis  │
 ──────────          ────────────          ────────────
  Generate           Score traces          Root cause
agent traces         with metrics           analysis
```
1. Interact — Generate agent traces via ADK User Sim (multi-turn) or DIY queries (single-turn)
2. Evaluate — Score traces with two types of metrics:
| Type | What it measures | How it works |
|---|---|---|
| Deterministic | Latency, tokens, cost, cache efficiency, tool reliability | Auto-extracted from OpenTelemetry traces — no configuration needed |
| LLM-as-Judge | Response quality, trajectory accuracy, tool use, safety | Scored by Vertex AI using customizable rubrics you define |
3. Analyze — Gemini reads all metrics and generates a diagnosis report with root cause analysis and optimization recommendations.
Deterministic (automatic, same for all agents):
| Metric | What you learn |
|---|---|
| `token_usage.*` | Total tokens, prompt vs completion, estimated cost in USD |
| `latency_metrics.*` | Total time, LLM time, tool time, time to first response |
| `cache_efficiency.*` | KV-cache hit rate — are your prompts structured for caching? |
| `tool_success_rate.*` | How often tool calls succeed vs fail |
| `thinking_metrics.*` | How much the model "thinks" before responding |
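To make the cost figure in `token_usage.*` concrete, here is a back-of-the-envelope sketch of how a USD estimate can be derived from token counts. The per-million-token prices below are placeholders for illustration, not actual model pricing:

```shell
# Rough USD cost from token counts.
# $0.15/M prompt and $0.60/M completion are PLACEHOLDER rates, not real pricing.
estimate_cost() {
  awk -v p="$1" -v c="$2" 'BEGIN { printf "%.4f\n", p/1e6*0.15 + c/1e6*0.60 }'
}

estimate_cost 12000 3000   # 12k prompt + 3k completion tokens -> prints 0.0036
```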
LLM-as-Judge (configurable per agent in `metric_definitions.json`):
| Metric | What it scores |
|---|---|
| `general_quality` | Overall response quality (Vertex AI managed) |
| `trajectory_accuracy` | Did the agent take the right path through tools? |
| `tool_use_quality` | Were tool arguments correct and efficient? |
| `safety` | Safety compliance (Vertex AI managed) |
| Custom metrics | Anything you define with a scoring rubric |
Metrics are defined in `eval/metrics/metric_definitions.json`. The `init` command can create starter metrics manually or generate tailored metrics with AI (`--ai-metrics`). See `docs/reference.md` for the full metrics glossary and custom metric creation guide.
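For orientation only, a custom metric entry might look roughly like the sketch below. The field names (`name`, `description`, `rubric`, `scale`) are illustrative assumptions, not the tool's actual schema — check `docs/reference.md` for the real format:

```json
{
  "name": "citation_grounding",
  "description": "Does the response cite the tool results it relied on?",
  "rubric": "Score 1 if every factual claim is supported by a tool result in the trace; 0.5 if most are; 0 otherwise.",
  "scale": [0, 0.5, 1]
}
```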
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.10–3.12 | Python 3.13+ not yet supported |
| uv | Latest | Package manager |
| gcloud CLI | Latest | Google Cloud authentication |
| Vertex AI API | Enabled | Required for evaluation metrics |
+------------------------------------------------------------------+
| You MUST use Vertex AI for the evaluation pipeline to work. |
| |
| Set: GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_LOCATION |
| Do NOT use: GOOGLE_API_KEY (metrics will be empty) |
+------------------------------------------------------------------+
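The constraint above can be enforced with a small pre-flight guard in your own scripts. This is a sketch, not part of agent-eval; it just checks the variables named in the box:

```shell
# Fail fast if the Vertex AI environment is misconfigured (not part of agent-eval).
check_vertex_env() {
  if [ -z "$GOOGLE_CLOUD_PROJECT" ] || [ -z "$GOOGLE_CLOUD_LOCATION" ]; then
    echo "missing GOOGLE_CLOUD_PROJECT or GOOGLE_CLOUD_LOCATION" >&2
    return 1
  fi
  if [ -n "$GOOGLE_API_KEY" ]; then
    echo "GOOGLE_API_KEY is set; metrics would come back empty" >&2
    return 1
  fi
  echo "vertex env ok"
}

# Example invocation with dummy values:
GOOGLE_CLOUD_PROJECT=demo GOOGLE_CLOUD_LOCATION=us-central1 GOOGLE_API_KEY="" check_vertex_env
```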
```bash
# Clone the repository
git clone REPO_URL_PLACEHOLDER agent-eval
cd agent-eval

# Install dependencies
uv sync

# Verify installation
uv run agent-eval --version
```

```bash
# Authenticate with Google Cloud
gcloud auth login
gcloud auth application-default login

# Set your project
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GOOGLE_CLOUD_LOCATION="us-central1"
gcloud auth application-default set-quota-project $GOOGLE_CLOUD_PROJECT

# Enable required APIs
gcloud services enable aiplatform.googleapis.com --project=$GOOGLE_CLOUD_PROJECT
```

Grant Vertex AI autorater model permissions:

```bash
export GOOGLE_CLOUD_PROJECT_NUMBER=$(gcloud projects describe $GOOGLE_CLOUD_PROJECT --format="value(projectNumber)")
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
  --member="serviceAccount:service-$GOOGLE_CLOUD_PROJECT_NUMBER@gcp-sa-aiplatform.iam.gserviceaccount.com" \
  --role="roles/aiplatform.serviceAgent"
```

All `uv run agent-eval` commands run from the agent-eval repository root. You point the CLI at your agent using `--agent-dir` with the path to the folder containing your `agent.py`.
Don't have an agent yet? Create one with Agent Starter Pack:
```bash
uvx agent-starter-pack create my-agent
```

Or to follow along with the tutorial, use one of the example agents.
```bash
uv run agent-eval init
```

The CLI scans for `agent.py` files and lets you select which agent to scaffold evaluation for. It creates an `eval/` folder inside the agent module directory with starter metrics, scenario files, and a golden dataset.
AI-powered metric generation: In Step 3, choose "Generate with AI" (the default) and Gemini will analyze your agent's source code to create tailored LLM-as-judge metrics, plus recommendations for scenarios and test queries. You can provide evaluation priorities to guide generation.
```bash
# Non-interactive with AI-generated metrics
uv run agent-eval init -y --ai-metrics
```

The fastest way to evaluate is the `run` command — it executes all four phases in sequence with interactive prompts at each step:

```bash
uv run agent-eval run --agent-dir path/to/your/agent_module
```

This runs: simulate → interact → evaluate → analyze, prompting you for configuration at each phase.
Note: The interact phase sends queries to a running agent. Make sure your agent is serving (e.g., `adk web` or `make playground`) before running. If the agent isn't reachable, interact is skipped gracefully.
You can also run each phase independently:
| Method | Best for | Command |
|---|---|---|
| ADK User Sim | Multi-turn conversational agents | `uv run agent-eval simulate --agent-dir path/to/agent` |
| DIY Interactions | Single-turn agents, deployed endpoints | `uv run agent-eval interact --agent-dir path/to/agent` |
After generating interactions, evaluate and analyze:
```bash
uv run agent-eval evaluate \
  --interaction-file <path-to-interactions.jsonl> \
  --metrics-files <path-to-metric_definitions.json> \
  --results-dir <path-to-run-dir>

uv run agent-eval analyze \
  --results-dir <path-to-run-dir> \
  --agent-dir <path-to-agent-module>
```

Tip: Both `simulate` and `interact` print the exact `evaluate` and `analyze` commands with the correct paths pre-filled at the end of their output. Just copy and paste them.
| Command | Purpose |
|---|---|
| `uv run agent-eval init` | Scaffold eval folder with optional AI metric generation |
| `uv run agent-eval run` | Full pipeline: simulate + interact + evaluate + analyze |
| `uv run agent-eval simulate` | Run ADK User Sim + convert traces (multi-turn) |
| `uv run agent-eval interact` | Run interactions against a live agent endpoint (single-turn) |
| `uv run agent-eval evaluate` | Run deterministic + LLM-as-judge metrics |
| `uv run agent-eval analyze` | Generate reports and AI-powered analysis |
| `uv run agent-eval convert` | Convert ADK traces to evaluation format (used by simulate) |
| `uv run agent-eval create-dataset` | Convert ADK test files to golden dataset format |
Run `uv run agent-eval --help` or `uv run agent-eval <command> --help` for detailed usage.
The `tutorial/example_agents/` folder contains two complete agent projects with pre-configured evaluation setups:
| Example | Type | Description |
|---|---|---|
| customer-service | Multi-turn | ADK conversational agent evaluated with User Sim |
| retail-ai-location-strategy | Single-turn | ADK pipeline agent evaluated with DIY Interactions |
See the tutorial for a guided walkthrough using these examples.
| Document | Contents |
|---|---|
| Reference Guide | CLI reference, metrics deep dive, data formats, custom metrics, troubleshooting |
| Tutorial | Step-by-step walkthrough using the example agents |
| Examples | Overview of the example agent projects |
To use agent-eval in another project without copying the full repo, build a wheel:
```bash
./build_wheel.sh
```

This produces `dist/agent_eval-<version>-py3-none-any.whl`. To install it in a downstream project:
```bash
# Copy the wheel to your project
cp dist/agent_eval-*.whl your-project/vendor/

# Install separately (do NOT add to pyproject.toml — use pip install)
uv pip install ./vendor/agent_eval-*.whl
```

To build from source in another repo, copy these files:

- `build_wheel.sh`
- `pyproject.toml`
- `src/agent_eval/`
- `uv.lock` (optional but recommended)