Comprehensive evaluation framework for the Dexter AI Agent, featuring automated quality assessment, benchmark datasets, and Prometheus metrics integration.
- Overview
- Quick Start
- Architecture
- Datasets
- Running Evaluations
- Metrics and Monitoring
- Evaluation Criteria
- Reports
- Extending the Framework
- Troubleshooting
The evaluation system provides:
- Benchmark Datasets: 100+ test cases covering all agent capabilities
- LLM-as-Judge: Automated quality assessment using GPT-3.5-turbo
- Prometheus Metrics: Real-time monitoring of agent performance
- Comprehensive Reports: Detailed analysis of evaluation results
- Multi-dimensional Scoring: Quality, tool usage, safety, and conversation metrics
All required dependencies are in the main requirements.txt:
pip install -r requirements.txt

# Run the benchmark dataset
python evaluation/run_evaluation.py --dataset benchmark_v1
# Run with first 10 cases only (for testing)
python evaluation/run_evaluation.py --dataset benchmark_v1 --max-cases 10

Results are saved in evaluation/results/:
# View latest results
ls -lt evaluation/results/
# Generate a report from latest run
python evaluation/run_evaluation.py --dataset benchmark_v1 --generate-report

# Start Prometheus metrics endpoint
python evaluation/metrics_server.py
# Metrics available at http://localhost:9091/metrics

The evaluation package is laid out as follows:
evaluation/
├── config.py # Configuration settings
├── evaluator.py # Core evaluation engine
├── criteria.py # Evaluation dimensions and scoring
├── judge_prompts.py # LLM-as-judge prompts
├── metrics.py # Prometheus metrics definitions
├── metrics_collector.py # Metrics collection logic
├── metrics_server.py # Metrics HTTP server
├── report_generator.py # Report generation
├── run_evaluation.py # CLI tool
├── datasets/ # Test case datasets
│ ├── benchmark_v1.json
│ ├── edge_cases.json
│ └── multi_turn_conversations.json
└── results/ # Evaluation results
Test Case → Agent Execution → Collect Metadata → LLM Judge → Scoring → Metrics & Reports
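In code terms, the flow looks roughly like the loop below. This is only a sketch: run_agent and judge_response are hypothetical stand-ins for the agent call and the LLM judge, and the real logic lives in evaluator.py.

# Illustrative sketch of the evaluation loop; run_agent and judge_response
# are hypothetical callables, not the actual evaluator.py API.
import json

def evaluate_dataset(path, run_agent, judge_response, pass_threshold=6.0):
    with open(path) as f:
        test_cases = json.load(f)

    results = []
    for case in test_cases:
        # 1. Agent execution: run the test query through the agent
        response, metadata = run_agent(case["user_message"])
        # 2. LLM-as-judge: score the response against the expected behavior
        scores = judge_response(case, response, metadata)
        # 3. Scoring: aggregate dimension scores and compare to the threshold
        overall = sum(scores.values()) / len(scores)
        results.append({"id": case["id"], "overall": overall,
                        "passed": overall >= pass_threshold})
    return results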
The main benchmark, benchmark_v1.json, contains 100 test cases covering all agent capabilities:
- Product Search (25 cases): Price ranges, categories, features, complex queries
- Appointments (25 cases): Booking, cancellation, rescheduling, complex scheduling
- Knowledge Retrieval (25 cases): Policy queries, FAQs, troubleshooting
- Web Search (25 cases): Current events, comparisons, information lookup
Difficulty Levels:
- Easy: 40 cases - Basic single-intent queries
- Medium: 40 cases - Multi-parameter or contextual queries
- Hard: 20 cases - Complex, ambiguous, or multi-step queries
edge_cases.json contains 20 challenging scenarios:
- Ambiguous Queries: Unclear intent, missing context
- Contradictory Requests: Conflicting requirements
- Out-of-Scope: Requests beyond agent capabilities
- Error Handling: Invalid inputs, impossible requests
- Safety: Privacy, security, ethical boundaries
multi_turn_conversations.json contains 10 conversation scenarios (3-5 turns each); an example entry is sketched after this list:
- Progressive refinement
- Context switching
- Information correction
- Multi-intent conversations
- Complex negotiations
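The entry below is illustrative only: the field values, the tool name, and the exact shape of context_messages are made up, but the fields follow the test-case schema shown later in this README (see multi_turn_conversations.json for real entries).

{
  "id": "multi_turn_example_01",
  "category": "appointments",
  "difficulty": "medium",
  "user_message": "Actually, can we move that to Friday instead?",
  "expected_behavior": "Reschedules the previously discussed appointment to Friday",
  "expected_tool": "appointment_tool",
  "expected_parameters": {},
  "evaluation_criteria": ["context_maintenance", "tool_selection"],
  "multi_turn": true,
  "context_messages": [
    {"role": "user", "content": "Book a dental cleaning for Tuesday at 10am."},
    {"role": "assistant", "content": "Your cleaning is booked for Tuesday at 10am."}
  ]
}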
# Evaluate all test cases in a dataset
python evaluation/run_evaluation.py --dataset benchmark_v1
# Evaluate specific dataset
python evaluation/run_evaluation.py --dataset edge_cases
# Limit number of cases
python evaluation/run_evaluation.py --dataset benchmark_v1 --max-cases 25

# Use different judge model
python evaluation/run_evaluation.py --dataset benchmark_v1 --model gpt-4
# Custom output directory
python evaluation/run_evaluation.py --dataset benchmark_v1 --output custom_results/
# Skip report generation
python evaluation/run_evaluation.py --dataset benchmark_v1 --no-report
# Verbose logging
python evaluation/run_evaluation.py --dataset benchmark_v1 --verbose

# Compare multiple evaluation runs
python evaluation/run_evaluation.py --compare benchmark_v1 edge_cases
# Save comparison report
python evaluation/run_evaluation.py --compare benchmark_v1 edge_cases --output comparison.md

# List all datasets
python evaluation/run_evaluation.py --list-datasets
# List recent results
python evaluation/run_evaluation.py --list-results

The evaluation system exposes metrics on port 9091 by default.
python evaluation/metrics_server.py
# Custom port
python evaluation/metrics_server.py --port 9092

The monitoring/prometheus.yml is already configured to scrape evaluation metrics:
scrape_configs:
  - job_name: 'ai_agent_evaluation'
    static_configs:
      - targets: ['localhost:9091']

Evaluation Metrics:
- agent_evaluation_runs_total - Total evaluation runs
- agent_evaluation_test_cases_total - Test cases by result
- agent_evaluation_duration_seconds - Evaluation run duration
- agent_evaluation_quality_score - Quality scores by dimension
- agent_evaluation_pass_rate - Pass rate by category
Agent Performance Metrics:
- agent_tool_usage_total - Tool usage counts
- agent_tool_success_rate - Success rate per tool
- agent_tool_latency_seconds - Tool execution latency
- agent_response_latency_seconds - End-to-end response time
- agent_response_quality_score - Quality scores
- agent_memory_retrieval_time_seconds - Memory operation timing
Error Metrics:
- agent_evaluation_errors_total - Evaluation errors
- agent_tool_errors_total - Tool execution errors
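To confirm the exporter is serving these series, you can scrape the endpoint directly. The snippet below is a minimal standard-library check assuming the default port 9091:

# Quick sanity check of the metrics endpoint (default port 9091).
# Prints only the evaluation/agent series listed above.
from urllib.request import urlopen

with urlopen("http://localhost:9091/metrics") as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    if line.startswith(("agent_evaluation_", "agent_tool_", "agent_response_", "agent_memory_")):
        print(line)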
Create custom dashboards in Grafana to visualize:
- Pass rate trends over time
- Quality score distributions
- Tool usage patterns
- Latency percentiles
- Error rates
Responses are scored along weighted dimensions (weight in parentheses):

Response Quality:
- Relevance (1.0): Addresses user's query
- Accuracy (1.2): Factually correct information
- Completeness (0.9): Covers all aspects
- Coherence (0.8): Logically structured
- Clarity (0.8): Easy to understand

Tool Usage:
- Tool Selection (1.0): Right tool chosen
- Parameter Extraction (0.9): Correct parameters
- Tool Success (0.8): Successful execution

Safety:
- Hallucination Detection (1.1): No fabricated info
- Uncertainty Expression (0.7): Admits uncertainty

Conversation and Memory:
- Context Maintenance (0.8): Maintains conversation context
- Coherence (0.8): Natural flow
- Memory Retrieval (0.7): Retrieves relevant memories
- Memory Usage (0.7): Uses memories effectively
Each dimension is scored on a 0-10 scale:
- 0-3: Poor/Unacceptable
- 4-5: Below Average
- 6-7: Good/Acceptable ✓
- 8-9: Excellent
- 10: Perfect
Pass Threshold: 6.0 (configurable in config.py)
The overall score is calculated as a weighted average across all applicable dimensions, with higher weights for critical dimensions like accuracy and hallucination detection.
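As a concrete illustration of the weighting (a sketch, not the exact code in criteria.py), an overall score for a response judged only on the response-quality dimensions could be computed like this:

# Illustrative weighted-average calculation using the response-quality
# weights listed above; the real logic lives in criteria.py/evaluator.py.
weights = {"relevance": 1.0, "accuracy": 1.2, "completeness": 0.9,
           "coherence": 0.8, "clarity": 0.8}
scores = {"relevance": 8, "accuracy": 7, "completeness": 6,
          "coherence": 9, "clarity": 8}  # hypothetical judge scores (0-10)

overall = sum(weights[d] * scores[d] for d in scores) / sum(weights.values())
print(f"Overall: {overall:.2f}")   # ~7.53 for this example
print("Passed:", overall >= 6.0)   # compared against PASS_THRESHOLD

Dimensions that do not apply to a given test case are simply left out of both sums.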
Automatically generated after each evaluation:
evaluation/results/
├── results_benchmark_v1_20240108_143052.json # Detailed results
├── summary_benchmark_v1_20240108_143052.json # Summary statistics
├── failures_benchmark_v1_20240108_143052.json # Failed cases only
└── report_benchmark_v1_20240108_143052.md # Human-readable report

The markdown report includes:
- Summary Statistics: Pass rate, average scores, counts
- Dimension Scores: Visual bars showing scores by dimension
- Category Performance: Breakdown by tool/category
- Failed Cases: Details of failures with scores and errors
- Top Performers: Best-scoring test cases
Generate comparison reports across multiple runs:
python evaluation/run_evaluation.py --compare benchmark_v1 edge_cases

Comparison reports include:
- Side-by-side metrics
- Score trends across runs
- Dimension-by-dimension comparison
- Category performance comparison
To add test cases, edit the dataset JSON files. Each test case follows this schema:
{
  "id": "unique_id",
  "category": "product_search",
  "difficulty": "medium",
  "user_message": "Your test query",
  "expected_behavior": "What should happen",
  "expected_tool": "tool_name",
  "expected_parameters": {
    "param": "value"
  },
  "evaluation_criteria": ["criteria1", "criteria2"],
  "multi_turn": false,
  "context_messages": []
}

To create a new dataset:
- Create a new JSON file in evaluation/datasets/ (or generate one programmatically, as sketched below)
- Follow the schema from existing datasets
- Run evaluation:
python evaluation/run_evaluation.py --dataset your_dataset
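If you prefer to generate a dataset file programmatically, here is a minimal sketch. The case contents, tool name, and file name are placeholders, and it assumes datasets are stored as a JSON list of cases like the existing files:

# Minimal sketch: write a one-case dataset following the schema above.
# All values below are placeholders; adjust paths and fields as needed.
import json
from pathlib import Path

cases = [{
    "id": "custom_001",
    "category": "product_search",
    "difficulty": "easy",
    "user_message": "Show me laptops under $800",
    "expected_behavior": "Searches products filtered by price",
    "expected_tool": "product_search",
    "expected_parameters": {"max_price": 800},
    "evaluation_criteria": ["relevance", "tool_selection"],
    "multi_turn": False,
    "context_messages": [],
}]

Path("evaluation/datasets/your_dataset.json").write_text(json.dumps(cases, indent=2))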
To add a new evaluation dimension, edit evaluation/criteria.py:
NEW_DIMENSION = EvaluationDimension(
    name="your_dimension",
    description="What it measures",
    weight=1.0
)

Edit evaluation/judge_prompts.py to customize evaluation prompts.
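For orientation, a judge prompt generally presents the query, the expected behavior, and the agent's response, and asks for per-dimension scores. The template below is purely illustrative and does not reproduce the actual prompts:

# Purely illustrative; the actual prompts live in evaluation/judge_prompts.py.
JUDGE_PROMPT_TEMPLATE = """You are evaluating an AI agent's response.

User query: {user_message}
Expected behavior: {expected_behavior}
Agent response: {agent_response}

Score the response from 0 to 10 on each dimension (relevance, accuracy,
completeness, coherence, clarity) and reply with a JSON object mapping
each dimension name to its score."""

# Usage (hypothetical field names):
# prompt = JUDGE_PROMPT_TEMPLATE.format(user_message=..., expected_behavior=..., agent_response=...)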
Add new metrics in evaluation/metrics.py:
from prometheus_client import Counter, Gauge, Histogram

your_metric = Counter(
    'your_metric_name',
    'Description',
    ['label1', 'label2']
)

Edit evaluation/config.py to customize:
class EvaluationConfig:
    # LLM Judge
    JUDGE_MODEL = "gpt-3.5-turbo"  # or "gpt-4"
    JUDGE_TEMPERATURE = 0.1

    # Evaluation
    MAX_RETRIES = 3
    TIMEOUT_SECONDS = 60
    PASS_THRESHOLD = 6.0

    # Metrics
    ENABLE_METRICS = True
    METRICS_PORT = 9091

If a dataset is not found:
# Check available datasets
python evaluation/run_evaluation.py --list-datasets
# Use correct path
python evaluation/run_evaluation.py --dataset evaluation/datasets/benchmark_v1.json

The evaluation system includes delays between test cases. If you hit rate limits:
- Use --max-cases to limit evaluation size
- Increase delays in evaluator.py (the line with asyncio.sleep; see the sketch below)
- Use GPT-3.5-turbo instead of GPT-4
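The delay is simply a pause between test cases; a minimal sketch of the idea (not the actual code in evaluator.py) looks like this:

# Illustrative only; the real delay lives in evaluator.py.
import asyncio

DELAY_BETWEEN_CASES = 2.0  # seconds; increase this if you hit rate limits

async def paced(run_case, case):
    result = await run_case(case)              # evaluate one test case
    await asyncio.sleep(DELAY_BETWEEN_CASES)   # back off before the next one
    return result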
If the metrics port is already in use:
# Use different port
python evaluation/metrics_server.py --port 9092
# Update prometheus.yml accordingly

If all scores are low:
- Check judge model is working (--model gpt-4 for better judgments)
- Verify agent is properly initialized
- Review judge prompts in judge_prompts.py
- Check agent logs for errors

If evaluation is slow:
# Test with subset first
python evaluation/run_evaluation.py --dataset benchmark_v1 --max-cases 10
# Use faster judge model
python evaluation/run_evaluation.py --dataset benchmark_v1 --model gpt-3.5-turbo

Enable verbose logging:
python evaluation/run_evaluation.py --dataset benchmark_v1 --verbose

Validate your dataset structure:
import json
with open('evaluation/datasets/your_dataset.json') as f:
    data = json.load(f)
print(f"Loaded {len(data)} test cases")

Running evaluations:
- Start Small: Test with --max-cases 10 first
- Baseline: Run evaluations before making changes to establish a baseline
- Regular Runs: Evaluate after significant code changes
- Version Control: Commit evaluation results for comparison
- Monitor Metrics: Watch Prometheus metrics during evaluation

Creating test cases:
- Cover Edge Cases: Include ambiguous, contradictory, and error scenarios
- Realistic Queries: Use natural language users would actually use
- Diverse Difficulty: Mix easy, medium, and hard cases
- Clear Expectations: Specify expected behavior precisely
- Meaningful Criteria: Choose relevant evaluation criteria

Analyzing results:
- Look at Trends: Compare across multiple runs
- Category Analysis: Identify which tools/categories need improvement
- Failed Cases: Focus on understanding why cases fail
- Dimension Scores: Identify specific weaknesses (e.g., tool selection)
- Context Matters: Consider test difficulty when evaluating scores

For help:
- Check this README first
- Review example datasets for reference
- Check logs with --verbose flag
- Review code comments in evaluation modules
To contribute new test cases or improvements:
- Create test cases following existing schema
- Test locally with --max-cases first
- Document any new evaluation criteria
- Update this README if adding major features

Complete file structure:
evaluation/
├── __init__.py # Package initialization
├── README.md # This file
├── config.py # Configuration
├── criteria.py # Evaluation dimensions
├── evaluator.py # Core evaluation engine
├── judge_prompts.py # LLM judge prompts
├── metrics.py # Prometheus metrics
├── metrics_collector.py # Metrics collection
├── metrics_server.py # Metrics HTTP server
├── report_generator.py # Report generation
├── run_evaluation.py # CLI entry point
├── datasets/ # Test datasets
│ ├── benchmark_v1.json # Main benchmark (100 cases)
│ ├── edge_cases.json # Edge cases (20 cases)
│ └── multi_turn_conversations.json # Conversations (10 scenarios)
└── results/ # Evaluation results
└── .gitkeep

Configuration can also be set via environment variables:
# Judge model configuration
export EVAL_JUDGE_MODEL="gpt-3.5-turbo"
export EVAL_JUDGE_TEMPERATURE="0.1"
# Evaluation settings
export EVAL_MAX_RETRIES="3"
export EVAL_TIMEOUT_SECONDS="60"
# Metrics
export EVAL_ENABLE_METRICS="true"
export EVAL_METRICS_PORT="9091"

Example output from an evaluation run:
============================================================
Starting Agent Evaluation
============================================================
Initializing evaluator with judge model: gpt-3.5-turbo
Evaluating dataset: benchmark_v1.json
Progress: 1/100
Progress: 2/100
...
============================================================
EVALUATION SUMMARY
============================================================
Dataset: benchmark_v1
Total Cases: 100
Passed: 85 (85.0%)
Failed: 15
Errors: 0
Average Overall Score: 7.82/10
============================================================
Markdown report generated: evaluation/results/report_benchmark_v1_20240108_143052.md
Version: 0.1.0
Last Updated: 2024
Maintainer: Dexter AI Team