An evaluation and observability framework for AI agents. Features real-time trace visualization, "Golden Path" trajectory comparison, and LLM-based evaluation scoring.
Try it by running:
npx @goyamegh/agent-health@latest
This opens http://localhost:4001 for the web UI.
- Evals: Real-time agent evaluation with trajectory streaming
- Experiments: Batch evaluation runs with configurable parameters
- Compare: Side-by-side trace comparison with aligned and merged views
- Agent Traces: Table-based trace view with latency histogram, filtering, and detailed flyout with input/output display
- Live Traces: Real-time trace monitoring with auto-refresh and filtering
- Trace Views: Timeline and Flow visualizations for debugging
- Reports: Evaluation reports with LLM judge reasoning
- Connectors: Pluggable protocol adapters for different agent types
For a detailed walkthrough, see Getting Started.
| Connector | Protocol | Description |
|---|---|---|
| `agui-streaming` | AG-UI SSE | ML-Commons agents (default) |
| `rest` | HTTP POST | Non-streaming REST APIs |
| `subprocess` | CLI | Command-line tools |
| `claude-code` | Claude CLI | Claude Code agent comparison |
| `mock` | In-memory | Demo and testing |
For creating custom connectors, see docs/CONNECTORS.md.
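As a rough illustration of what "pluggable" means here, the connector table can be modeled as a registry keyed by connector name. This is only a sketch: the `ConnectorInfo` type and the `describeConnector` helper are hypothetical names invented for illustration, not the project's connector API (docs/CONNECTORS.md covers that). Treating `agui-streaming` as the fallback is a reading of the "(default)" note in the table.

```typescript
// Hypothetical model of the connector table; not the project's actual API.
type ConnectorName = "agui-streaming" | "rest" | "subprocess" | "claude-code" | "mock";

interface ConnectorInfo {
  protocol: string;
  description: string;
}

// Entries copied from the connector table.
const CONNECTORS: Record<ConnectorName, ConnectorInfo> = {
  "agui-streaming": { protocol: "AG-UI SSE", description: "ML-Commons agents (default)" },
  rest: { protocol: "HTTP POST", description: "Non-streaming REST APIs" },
  subprocess: { protocol: "CLI", description: "Command-line tools" },
  "claude-code": { protocol: "Claude CLI", description: "Claude Code agent comparison" },
  mock: { protocol: "In-memory", description: "Demo and testing" },
};

// Falls back to the default connector when no name is given; throws on unknown names.
function describeConnector(name: string = "agui-streaming"): ConnectorInfo {
  if (!Object.prototype.hasOwnProperty.call(CONNECTORS, name)) {
    throw new Error(`Unknown connector: ${name}`);
  }
  return CONNECTORS[name as ConnectorName];
}
```

Failing loudly on an unknown name is a design choice of this sketch; a real registry might instead list the available connectors in the error.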
# Start the web UI
npx @opensearch-project/agent-health
# Open http://localhost:4001
# Check configuration
npx @opensearch-project/agent-health doctor
# List available agents and connectors
npx @opensearch-project/agent-health list agents
npx @opensearch-project/agent-health list connectors
# Run a test case against an agent
npx @opensearch-project/agent-health run -t demo-otel-001 -a demo
# Initialize a new project
npx @opensearch-project/agent-health init
For full CLI documentation, see docs/CLI.md.
AWS credentials are required for the Bedrock LLM Judge to score evaluations.
Create a .env file:
cp .env.example .env
Add your AWS credentials:
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_SESSION_TOKEN=your_session_token # if using temporary credentials
All optional settings have sensible defaults. Configure only what you need.
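Since the Bedrock LLM Judge depends on the AWS variables listed above, a quick check for missing values can catch a misconfigured `.env` early. The following is a minimal sketch under stated assumptions: `missingAwsVars` and the `Env` type are hypothetical helpers, not part of the package, and only the variable names come from the example above.

```typescript
// Variable names come from the .env example; the helper itself is hypothetical.
const REQUIRED_AWS_VARS = ["AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"] as const;

type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
// AWS_SESSION_TOKEN is deliberately not checked: it is only needed for temporary credentials.
function missingAwsVars(env: Env): string[] {
  return REQUIRED_AWS_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

In a Node script this could be called as `missingAwsVars(process.env)` before starting the server, printing any names it returns.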
Agent endpoints default to localhost. Override if your agent runs elsewhere:
LANGGRAPH_ENDPOINT=http://localhost:3000
HOLMESGPT_ENDPOINT=http://localhost:5050/api/agui/chat
MLCOMMONS_ENDPOINT=http://localhost:9200/_plugins/_ml/agents/{agent_id}/_execute/stream
OpenSearch storage is used for persisting test cases, experiments, and runs. Features degrade gracefully if it is not configured.
OPENSEARCH_STORAGE_ENDPOINT=https://your-cluster.opensearch.amazonaws.com
OPENSEARCH_STORAGE_USERNAME=admin
OPENSEARCH_STORAGE_PASSWORD=your_password
OPENSEARCH_STORAGE_TLS_SKIP_VERIFY=false # Set to true for self-signed certificates
The OpenSearch logs endpoint is used for agent execution traces. Features degrade gracefully if it is not configured.
OPENSEARCH_LOGS_ENDPOINT=https://your-logs-cluster.opensearch.amazonaws.com
OPENSEARCH_LOGS_USERNAME=admin
OPENSEARCH_LOGS_PASSWORD=your_password
OPENSEARCH_LOGS_TLS_SKIP_VERIFY=false # Set to true for self-signed certificates
See .env.example for all available options.
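One way to picture the "degrade gracefully" behavior for the optional OpenSearch settings is a loader that returns nothing when no endpoint is set, so the caller can switch the dependent feature off instead of failing. This is a sketch, not the project's implementation: `loadOpenSearchConfig` and `OpenSearchConfig` are hypothetical names, and it assumes that an unset endpoint means "not configured" and that only the literal string `true` enables TLS skip-verify.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical shape; field names are illustrative.
interface OpenSearchConfig {
  endpoint: string;
  username?: string;
  password?: string;
  tlsSkipVerify: boolean;
}

// prefix matches the variable groups above: OPENSEARCH_STORAGE_* or OPENSEARCH_LOGS_*.
function loadOpenSearchConfig(
  env: Env,
  prefix: "OPENSEARCH_STORAGE" | "OPENSEARCH_LOGS",
): OpenSearchConfig | undefined {
  const endpoint = env[`${prefix}_ENDPOINT`];
  if (!endpoint) return undefined; // not configured: caller disables the dependent feature
  return {
    endpoint,
    username: env[`${prefix}_USERNAME`],
    password: env[`${prefix}_PASSWORD`],
    tlsSkipVerify: env[`${prefix}_TLS_SKIP_VERIFY`] === "true",
  };
}
```

Returning `undefined` rather than throwing keeps the decision with the caller, which can hide storage-backed or trace-backed views when the matching config is absent.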
| Command | Description |
|---|---|
| `npm install` | Install dependencies |
| `npm run dev` | Start frontend dev server (port 4000) |
| `npm run dev:server` | Start backend server (port 4001) |
| `npm run build` | TypeScript compile + Vite production build |
| `npm test` | Run all tests |
| `npm run test:unit` | Run unit tests only |
| `npm run test:integration` | Run integration tests only |
| `npm run test:e2e` | Run E2E tests with Playwright |
| `npm run test:e2e:ui` | Run E2E tests with Playwright UI |
| `npm run test:all` | Run all tests (unit + integration + e2e) |
| `npm test -- --coverage` | Run tests with coverage report |
| `npm run build:all` | Build UI + server + CLI |
| `npm run build:cli` | Build CLI only |
npm run server # Build UI + start single server on port 4001
After publishing, run directly with npx:
npx @opensearch-project/agent-health # Start server on port 4001
npx @opensearch-project/agent-health --port 8080
npx @opensearch-project/agent-health --env-file .env

| Mode | Command | Port(s) |
|---|---|---|
| Dev (frontend) | `npm run dev` | 4000 |
| Dev (backend) | `npm run dev:server` | 4001 |
| Production | `npm run server` | 4001 |
| NPX | `npx @opensearch-project/agent-health` | 4001 (default) |
In development, the Vite dev server (4000) proxies /api requests to the backend (4001).
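The development routing rule can be modeled in a few lines. This is an illustrative sketch, not the project's Vite configuration (which is not shown here): `devTargetPort` is a hypothetical function, and it assumes a plain prefix match on `/api`, which is how Vite's `server.proxy` option treats a string key.

```typescript
// Ports come from the table above; the function is an illustrative model, not project code.
const FRONTEND_PORT = 4000;
const BACKEND_PORT = 4001;

// Returns the port that ultimately serves a request sent to the Vite dev server.
function devTargetPort(path: string): number {
  // Assumption: prefix match on "/api", as with a string key in Vite's server.proxy.
  return path.startsWith("/api") ? BACKEND_PORT : FRONTEND_PORT;
}
```

Under this model, `devTargetPort("/api/debug")` resolves to the backend and `devTargetPort("/")` stays on the dev server, so frontend code can use relative `/api/...` URLs in both development and production.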
The Agent Health test suite has three layers:
| Type | Location | Command | Description |
|---|---|---|---|
| Unit | `tests/unit/` | `npm run test:unit` | Fast, isolated function tests |
| Integration | `tests/integration/` | `npm run test:integration` | Tests with real backend server |
| E2E | `tests/e2e/` | `npm run test:e2e` | Browser-based UI tests with Playwright |
# All tests
npm test # Unit + integration
npm run test:all # Unit + integration + E2E
# By type
npm run test:unit # Unit tests only
npm run test:integration # Integration tests (starts server)
npm run test:e2e # E2E tests (starts servers)
npm run test:e2e:ui # E2E with Playwright UI for debugging
# With coverage
npm run test:unit -- --coverage
# Specific file
npm test -- path/to/file.test.ts
npx playwright test tests/e2e/dashboard.spec.ts
E2E tests use Playwright to test the UI in a real browser.
# First time: install browsers
npx playwright install
# Run all E2E tests
npm run test:e2e
# Interactive UI mode (recommended for debugging)
npm run test:e2e:ui
# View test report
npm run test:e2e:report
Writing E2E Tests:
- Place tests in `tests/e2e/*.spec.ts`
- Use `data-testid` attributes for reliable selectors
- Handle empty states gracefully (check if data exists before asserting)
- See existing tests for patterns
All PRs must pass these CI checks:
| Job | What it checks |
|---|---|
| `build-and-test` | Build + unit tests + 90% coverage |
| `lint-and-typecheck` | TypeScript compilation |
| `license-check` | SPDX headers on all source files |
| `integration-tests` | Backend integration tests with coverage |
| `e2e-tests` | Playwright browser tests with pass/fail tracking |
| `security-scan` | npm audit for vulnerabilities |
| `test-summary` | Consolidated test results summary |
| Test Type | Metric | Threshold |
|---|---|---|
| Unit | Lines | ≥ 90% |
| Unit | Branches | ≥ 80% |
| Unit | Functions | ≥ 80% |
| Unit | Statements | ≥ 90% |
| Integration | Lines | Informational (no threshold) |
| E2E | Pass Rate | 100% |
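To make the unit-test thresholds concrete, here is a small sketch of the gate they describe. The `CoverageSummary` shape and the `coverageFailures` function are assumptions for illustration; the project's CI presumably enforces these numbers through its test runner's own configuration, which is not shown here.

```typescript
// Hypothetical summary shape: one percentage per metric.
interface CoverageSummary {
  lines: number;
  branches: number;
  functions: number;
  statements: number;
}

// Unit-test thresholds from the table above (percentages).
const UNIT_THRESHOLDS: CoverageSummary = { lines: 90, branches: 80, functions: 80, statements: 90 };

// Returns one message per metric below its threshold; an empty array means the gate passes.
function coverageFailures(
  actual: CoverageSummary,
  thresholds: CoverageSummary = UNIT_THRESHOLDS,
): string[] {
  const metrics: (keyof CoverageSummary)[] = ["lines", "branches", "functions", "statements"];
  return metrics
    .filter((m) => actual[m] < thresholds[m])
    .map((m) => `${m}: ${actual[m]}% < ${thresholds[m]}%`);
}
```

A value exactly at the threshold passes, matching the "≥" in the table.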
Each CI run produces these artifacts (downloadable from Actions tab):
| Artifact | Contents |
|---|---|
| `coverage-report` | Unit test coverage (HTML, LCOV) |
| `integration-coverage-report` | Integration test coverage |
| `playwright-report` | E2E test report with screenshots/traces |
| `test-badges` | Badge data JSON for coverage visualization |
The E2E test suite includes tests for the complete evaluation flow using mock modes:
- Demo Agent (`mock://demo`) - Simulated AG-UI streaming responses
- Demo Model (`provider: "demo"`) - Simulated LLM judge evaluation
This allows testing the full Create Test Case → Create Benchmark → Run Evaluation → View Results flow without requiring AWS credentials or a live agent in CI.
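The condition under which a run needs neither AWS credentials nor a live agent can be written down as a small predicate. This is a sketch with hypothetical helper names; only the `mock://demo` endpoint and the `provider: "demo"` value come from the text above, and treating any `mock://` prefix as a mock agent is an assumption.

```typescript
// Hypothetical helpers; not the project's actual detection logic.
interface ModelConfig {
  provider: string;
}

// Assumption: any endpoint using the mock:// scheme is a simulated agent.
function isMockAgentEndpoint(endpoint: string): boolean {
  return endpoint.startsWith("mock://");
}

function isDemoModel(model: ModelConfig): boolean {
  return model.provider === "demo";
}

// A run is fully simulated only when both the agent and the judge are mocked.
function isFullyMocked(endpoint: string, model: ModelConfig): boolean {
  return isMockAgentEndpoint(endpoint) && isDemoModel(model);
}
```

A test setup could use a check like this to skip credential validation in CI while still requiring credentials whenever either side is real.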
Agent Health supports multiple agent types:
| Agent | Endpoint Variable | Setup |
|---|---|---|
| LangGraph | `LANGGRAPH_ENDPOINT` | Simple localhost agent |
| HolmesGPT | `HOLMESGPT_ENDPOINT` | AG-UI compatible RCA agent |
| ML-Commons | `MLCOMMONS_ENDPOINT` | See ML-Commons Setup |
Enable verbose debug logging to diagnose issues:
# Via environment variable
DEBUG=true npx @opensearch-project/agent-health
# Or toggle at runtime via API
curl -X POST http://localhost:4001/api/debug -H 'Content-Type: application/json' -d '{"enabled":true}'
Debug logging can also be toggled from the Settings page using the "Verbose Logging" switch, which syncs to both the browser console and server terminal.
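For scripts that toggle debug logging programmatically, the curl call above translates into a small request builder. This is a sketch: `buildDebugToggle` and `DebugToggleRequest` are hypothetical names, and only the path, method, header, and JSON body are taken from the curl command. The response format is not documented here, so the sketch does not assume one.

```typescript
// Hypothetical request description mirroring the curl command.
interface DebugToggleRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// baseUrl defaults to the documented port; a trailing slash is tolerated.
function buildDebugToggle(
  enabled: boolean,
  baseUrl: string = "http://localhost:4001",
): DebugToggleRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/api/debug`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ enabled }),
  };
}
```

In a runtime with `fetch` available, the result could be sent with `const { url, ...init } = buildDebugToggle(true); await fetch(url, init);`.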
| Issue | Solution |
|---|---|
| Cannot connect to backend | Run npm run dev:server, check curl http://localhost:4001/health |
| AWS credentials expired | Refresh credentials in .env |
| Storage/Traces not working | Check OpenSearch endpoint and credentials in .env |
| Need verbose logs | Set DEBUG=true in .env or toggle in Settings page |
We welcome contributions! See CONTRIBUTING.md for guidelines.
- Fork and clone the repository
- Install dependencies: `npm install`
- Create a feature branch: `git checkout -b feature/your-feature`
- Make changes and add tests
- Run tests: `npm test`
- Commit with DCO signoff: `git commit -s -m "feat: your message"`
- Push and create a Pull Request
All commits require DCO signoff and all PRs must pass CI checks (tests, coverage, linting).
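Because `git commit -s` appends a `Signed-off-by: Name <email>` trailer to the commit message, a missing signoff is easy to detect before pushing. The check below is a minimal sketch with a hypothetical `hasDcoSignoff` helper; how the project's CI actually verifies DCO is not described here.

```typescript
// Matches a trailer line of the form "Signed-off-by: Name <email>".
const SIGNOFF_PATTERN = /^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$/m;

// Returns true when the commit message contains at least one signoff trailer.
function hasDcoSignoff(commitMessage: string): boolean {
  return SIGNOFF_PATTERN.test(commitMessage);
}
```

This only checks the trailer's shape; it does not confirm that the name and email match the commit author.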
- Getting Started - Installation, demo mode, and usage walkthrough
- ML-Commons Agent Setup - Configure ML-Commons agent
- Development Guide - Architecture and coding conventions
- AG-UI Protocol
