harness-evals is an open-source AI evaluation framework for LLM agents, prompts, and structured outputs. It provides a pip-installable scoring engine with ~37 metrics across deterministic, structural, operational, reliability, RAG, safety, agent, and conversational categories.
Core principle: An eval always produces a Score. Every metric is a single class with a measure() method.
Data flow: Golden (authored) + agent output -> EvalCase (evaluated) -> Score (result)
Language: Python 3.10+
License: Apache 2.0
Package name: harness-evals
Import name: harness_evals
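The data flow above can be sketched in a few lines. This is only an illustration: it uses simplified stand-in dataclasses rather than the real harness_evals classes, and build_eval_case, exact_match, and the lambda agent are hypothetical names introduced for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


# Simplified stand-ins for the library's dataclasses (assumption: the real
# classes carry more fields, as listed later in this file).
@dataclass
class Golden:
    input: str
    expected: str | None = None


@dataclass
class EvalCase:
    input: str
    output: str
    expected: str | None = None


@dataclass
class Score:
    name: str
    value: float
    threshold: float

    @property
    def passed(self) -> bool:
        return self.value >= self.threshold


def build_eval_case(golden: Golden, agent: Callable[[str], str]) -> EvalCase:
    """Hypothetical helper: run the agent on the golden input."""
    return EvalCase(
        input=golden.input,
        output=agent(golden.input),
        expected=golden.expected,
    )


def exact_match(case: EvalCase, threshold: float = 1.0) -> Score:
    """Hypothetical metric: 1.0 when output equals expected, else 0.0."""
    value = 1.0 if case.output == case.expected else 0.0
    return Score(name="exact_match", value=value, threshold=threshold)


golden = Golden(input="2 + 2", expected="4")
case = build_eval_case(golden, agent=lambda prompt: "4")
score = exact_match(case)
print(score.passed)  # True
```

The point of the sketch is the direction of the flow: authored data never changes, the agent output is attached to produce an EvalCase, and a metric turns that case into a Score.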
pip install -e "." # Core only (Phase 1 metrics, no LLM key needed)
pip install -e ".[llm]" # + OpenAI, Anthropic for LLM-judged metrics
pip install -e ".[dev]" # + pytest, ruff, pre-commit
pip install -e ".[all,dev]" # Everything
Build tool: setuptools via pyproject.toml
No compiled extensions — pure Python.
pytest tests/ -v # All tests
pytest tests/ -v -m unit # Unit tests only
pytest tests/metrics/ -v # Specific directory
pytest tests/test_core.py -v # Specific file
pytest tests/test_core.py::test_evaluate -v # Specific function
pytest tests/ --cov=harness_evals --cov-report=html # With coverage
- Mark tests: @pytest.mark.unit, @pytest.mark.integration
- Test data: tests/data/
ALWAYS run pytest tests/ -v before committing.
ruff check src/ tests/ # Lint check
ruff format --check src/ tests/ # Format check
ruff format src/ tests/ # Auto-format
ruff check --fix src/ tests/ # Auto-fix lint issues
Ruff handles both formatting and linting (replaces black + flake8 + isort).
The package is published automatically via the Harness CI pipeline (.harness/publish.yaml) when a version change is detected on main.
How it works: The pipeline compares version in pyproject.toml at HEAD vs HEAD~1. If the version changed, it builds and publishes to harness-pip-internal.
You MUST bump the version in pyproject.toml whenever your changes should be released. If you don't bump the version, the package will NOT be published — even if code changes are merged.
# In pyproject.toml, update:
version = "X.Y.Z" # Bump this to trigger a publish
Follow semver: patch for fixes, minor for new metrics/features, major for breaking changes.
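The version comparison described above could look roughly like the following. This is a sketch under stated assumptions, not the contents of .harness/publish.yaml: read_version and should_publish are hypothetical helpers, and the version is extracted with a simple regex instead of a TOML parser.

```python
from __future__ import annotations

import re

# Matches a line such as: version = "1.2.3"
_VERSION_RE = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)


def read_version(pyproject_text: str) -> str | None:
    """Return the first version string found in pyproject.toml text, if any."""
    match = _VERSION_RE.search(pyproject_text)
    return match.group(1) if match else None


def should_publish(head_text: str, previous_text: str) -> bool:
    """True when the version at HEAD differs from the version at HEAD~1."""
    head_version = read_version(head_text)
    return head_version is not None and head_version != read_version(previous_text)


# In CI the two texts could come from `git show HEAD:pyproject.toml` and
# `git show HEAD~1:pyproject.toml`; here they are inline samples.
previous = '[project]\nname = "harness-evals"\nversion = "0.1.0"\n'
bumped = '[project]\nname = "harness-evals"\nversion = "0.1.1"\n'

print(should_publish(bumped, previous))    # True
print(should_publish(previous, previous))  # False
```

A merge that changes code but leaves the version line untouched falls into the second case, which is why nothing gets published without a bump.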
- Branch naming: feat/short-description or fix/short-description
- Commit format: type: description, where type is feat, fix, chore, refactor, test, docs
- Default branch: main
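These conventions can be checked mechanically, for example in a local hook. The sketch below is illustrative only: is_valid_commit and is_valid_branch are hypothetical helpers, and the branch pattern assumes that "short-description" means lowercase words joined by hyphens, which the conventions above do not state explicitly.

```python
import re

COMMIT_TYPES = ("feat", "fix", "chore", "refactor", "test", "docs")

# "type: description" with one of the allowed types and a non-empty description.
COMMIT_RE = re.compile(r"^(?:" + "|".join(COMMIT_TYPES) + r"): \S.*$")

# Assumption: descriptions are lowercase alphanumeric words joined by hyphens.
BRANCH_RE = re.compile(r"^(?:feat|fix)/[a-z0-9]+(?:-[a-z0-9]+)*$")


def is_valid_commit(message: str) -> bool:
    """Check only the first line (the subject) of a commit message."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    return COMMIT_RE.match(subject) is not None


def is_valid_branch(name: str) -> bool:
    return BRANCH_RE.match(name) is not None


print(is_valid_commit("feat: add regex metric"))  # True
print(is_valid_commit("added stuff"))             # False
print(is_valid_branch("fix/score-rounding"))      # True
```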
- Use type hints on all function signatures
- Follow existing patterns — look at any metric file as a template
- Use @dataclass for structured data (Golden, EvalCase, Score)
- Keep metrics as single-file, single-class modules
- Write a test file for every new metric
- Use async/await for I/O (LLM calls, HTTP); override a_measure() for async metrics
- Run ruff check and pytest before committing
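As an illustration of the async guideline, here is a minimal sketch of a metric that overrides a_measure(). The EvalCase, Score, and BaseMetric classes below are simplified stand-ins so the example runs on its own; JudgedMatch and fake_judge are hypothetical, with fake_judge standing in for an awaited LLM or HTTP call.

```python
from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class EvalCase:  # stand-in, reduced to the fields used here
    input: str
    output: str
    expected: str | None = None


@dataclass
class Score:  # stand-in
    name: str
    value: float
    threshold: float


class BaseMetric(ABC):  # stand-in mirroring the interface described in this file
    def __init__(self, name: str, threshold: float) -> None:
        self.name = name
        self.threshold = threshold

    @abstractmethod
    def measure(self, eval_case: EvalCase) -> Score: ...

    async def a_measure(self, eval_case: EvalCase) -> Score:
        return self.measure(eval_case)


async def fake_judge(output: str, expected: str | None) -> float:
    """Hypothetical placeholder for an awaited LLM or HTTP call."""
    await asyncio.sleep(0)
    return 1.0 if output == expected else 0.0


class JudgedMatch(BaseMetric):
    def __init__(self, threshold: float = 0.5) -> None:
        super().__init__(name="judged_match", threshold=threshold)

    async def a_measure(self, eval_case: EvalCase) -> Score:
        value = await fake_judge(eval_case.output, eval_case.expected)
        return Score(name=self.name, value=value, threshold=self.threshold)

    def measure(self, eval_case: EvalCase) -> Score:
        # Sync entry point; only valid when no event loop is already running.
        return asyncio.run(self.a_measure(eval_case))


score = asyncio.run(JudgedMatch().a_measure(EvalCase(input="q", output="a", expected="a")))
print(score.value)  # 1.0
```

Putting the real work in a_measure() and letting measure() delegate to it is one possible arrangement; the library's own async metrics may be structured differently.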
- Never force push to main
- Never commit secrets or .env files
- Never add heavy dependencies (torch, transformers) to core; use optional extras
- Never modify Golden, EvalCase, or Score fields without updating PLAN.md
- Never average safety scores into an overall score; report them separately
- Don't use print(); use the sink system for output
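The rule about safety scores can be made concrete with a small aggregation sketch. Everything here is hypothetical: summarize is not a library function, Score is a reduced stand-in, and "toxicity" is just an example name for a safety metric.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Score:  # stand-in with only the fields needed for aggregation
    name: str
    value: float
    threshold: float


def summarize(scores: list[Score], safety_names: set[str]) -> dict:
    """Average non-safety scores; list safety scores individually."""
    quality = [s for s in scores if s.name not in safety_names]
    safety = {s.name: s.value for s in scores if s.name in safety_names}
    overall = sum(s.value for s in quality) / len(quality) if quality else None
    return {"overall": overall, "safety": safety}


scores = [
    Score(name="exact_match", value=1.0, threshold=1.0),
    Score(name="latency", value=0.5, threshold=0.5),
    Score(name="toxicity", value=0.0, threshold=1.0),
]
summary = summarize(scores, safety_names={"toxicity"})
print(summary)  # {'overall': 0.75, 'safety': {'toxicity': 0.0}}
```

Had the failing safety score been averaged in, the overall value would drop to 0.5 and the failure would be indistinguishable from mediocre quality; keeping it separate makes it visible on its own.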
harness-evals/
├── pyproject.toml # Package config, dependencies, tool settings
├── README.md # User-facing documentation
├── AGENTS.md # This file
├── PLAN.md # Full vision spec with all phases
├── LICENSE # Apache 2.0
├── .gitignore
├── .pre-commit-config.yaml
├── .github/workflows/ci.yml
│
├── src/harness_evals/
│ ├── __init__.py # Public API: Golden, EvalCase, Score, evaluate, assert_test, etc.
│ ├── py.typed # PEP 561 marker
│ │
│ ├── core/
│ │ ├── __init__.py
│ │ ├── golden.py # Golden dataclass (authored data)
│ │ ├── eval_case.py # EvalCase dataclass (what metrics receive)
│ │ ├── score.py # Score dataclass (passed auto-computed)
│ │ ├── metric.py # BaseMetric, ReliabilityMetric ABCs
│ │ ├── sink.py # BaseSink ABC
│ │ └── runner.py # evaluate(), assert_test(), evaluate_cases(), evaluate_dataset()
│ │
│ ├── metrics/
│ │ ├── __init__.py # Re-exports all metrics
│ │ ├── deterministic/ # ExactMatch, Contains, Regex, NumericDiff
│ │ ├── structural/ # JsonDiff, SchemaValidation
│ │ ├── operational/ # Latency, TokenCost, CostEfficiency, RetryCount
│ │ └── reliability/ # OutcomeConsistency, ResourceConsistency
│ │
│ └── sinks/
│   ├── __init__.py
│   ├── stdout.py # StdoutSink
│   └── json_sink.py # JsonSink
│
├── tests/
│ ├── conftest.py # Shared fixtures
│ ├── test_core.py # Golden, EvalCase, Score, evaluate, assert_test, etc.
│ └── metrics/ # One test file per metric category
│
└── examples/
  └── basic_eval.py # Minimal working example
This is the most common task an AI agent will do. Follow these steps:
- Pick the category — deterministic, structural, operational, reliability, etc.
- Create the file: src/harness_evals/metrics/<category>/<metric_name>.py
- Implement the class by extending BaseMetric (or ReliabilityMetric for multi-run):
from harness_evals.core.metric import BaseMetric
from harness_evals.core.score import Score
from harness_evals.core.eval_case import EvalCase


class MyMetric(BaseMetric):
    def __init__(self, threshold: float = 1.0, **kwargs):
        super().__init__(name="my_metric", threshold=threshold, **kwargs)

    def measure(self, eval_case: EvalCase) -> Score:
        value = ... # compute 0.0–1.0
        return Score(
            name=self.name,
            value=value,
            threshold=self.threshold,
        )

- Export it: add to src/harness_evals/metrics/<category>/__init__.py and src/harness_evals/metrics/__init__.py
- Write the test in tests/metrics/test_<metric_name>.py:
import pytest

from harness_evals.core.eval_case import EvalCase
from harness_evals.metrics.<category>.<metric_name> import MyMetric


@pytest.mark.unit
def test_my_metric_perfect():
    ec = EvalCase(input="x", output="y", expected="y")
    score = MyMetric(threshold=0.8).measure(ec)
    assert score.passed
    assert score.value == 1.0


@pytest.mark.unit
def test_my_metric_failure():
    ec = EvalCase(input="x", output="wrong", expected="y")
    score = MyMetric(threshold=0.8).measure(ec)
    assert not score.passed

- Run tests: pytest tests/ -v
@dataclass
class Golden:
    input: str | dict | list
    expected: str | dict | list | None = None
    context: list[str] | None = None
    metadata: dict[str, Any] | None = None
    tags: dict[str, str] | None = None

@dataclass
class EvalCase:
    input: str | dict | list
    output: str | dict | list
    expected: str | dict | list | None = None
    context: list[str] | None = None
    latency_ms: float | None = None # typed operational fields
    token_count: int | None = None
    cost_usd: float | None = None
    retry_count: int | None = None
    confidence: float | None = None
    tags: dict[str, str] | None = None
    metadata: dict[str, Any] | None = None # extensible for custom keys
    runs: list["EvalCase"] | None = None # K runs for reliability metrics

@dataclass
class Score:
    name: str
    value: float # 0.0 to 1.0
    threshold: float # pass/fail threshold
    passed: bool # auto-computed: value >= threshold (not in constructor)
    reason: str | None = None
    metadata: dict[str, Any] | None = None
    created_at: datetime # auto-set to UTC now

class BaseMetric(ABC):
    name: str
    threshold: float

    @abstractmethod
    def measure(self, eval_case: EvalCase) -> Score: ...

    async def a_measure(self, eval_case: EvalCase) -> Score:
        """Async variant. Override for I/O-bound metrics. Default calls measure()."""
        return self.measure(eval_case)

class ReliabilityMetric(BaseMetric):
    k: int # number of runs expected

    @abstractmethod
    def measure_runs(self, eval_case: EvalCase) -> Score:
        """Evaluate across eval_case.runs. Called by measure()."""

    def measure(self, eval_case: EvalCase) -> Score:
        if eval_case.runs:
            return self.measure_runs(eval_case)
        return Score(name=self.name, value=0.0, threshold=self.threshold,
                     reason="No runs provided")

See PLAN.md for the full vision with 6 phases and ~37 metrics. Phase 1 (this skeleton) covers core framework + 12 metrics. Each subsequent phase adds metrics, capabilities, and directory structure as described in PLAN.md.
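The field listings above are schematic. As a runnable illustration of two behaviors they describe (a Score whose passed flag is derived from value and threshold, and a reliability metric that dispatches to measure_runs when runs are present), here is a minimal sketch. It is one plausible way to get that behavior, not necessarily how score.py or metric.py do it; OutcomeAgreement is a hypothetical metric, not the library's OutcomeConsistency, and EvalCase is reduced to the fields used.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class Score:
    name: str
    value: float
    threshold: float
    passed: bool = field(init=False)  # derived, so excluded from __init__
    reason: str | None = None
    metadata: dict[str, Any] | None = None
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        self.passed = self.value >= self.threshold


@dataclass
class EvalCase:  # stand-in, reduced to the fields used here
    input: str
    output: str
    runs: list[EvalCase] | None = None


class OutcomeAgreement:
    """Hypothetical metric: share of runs that produced the most common output."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.name = "outcome_agreement"
        self.threshold = threshold

    def measure_runs(self, eval_case: EvalCase) -> Score:
        outputs = [run.output for run in eval_case.runs or []]
        _, top_count = Counter(outputs).most_common(1)[0]
        return Score(name=self.name, value=top_count / len(outputs), threshold=self.threshold)

    def measure(self, eval_case: EvalCase) -> Score:
        # Same dispatch as ReliabilityMetric.measure() shown above.
        if eval_case.runs:
            return self.measure_runs(eval_case)
        return Score(name=self.name, value=0.0, threshold=self.threshold,
                     reason="No runs provided")


runs = [EvalCase(input="q", output=o) for o in ["a", "a", "b", "a"]]
case = EvalCase(input="q", output="a", runs=runs)
score = OutcomeAgreement(threshold=0.7).measure(case)
print(score.value, score.passed)  # 0.75 True
```

Marking passed with init=False keeps it out of the constructor, matching the "not in constructor" note above, while __post_init__ fills it in once value and threshold are known.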
Core (Phase 1): deepdiff>=7.0, jsonschema>=4.0 — two dependencies total.
LLM (Phase 2+): openai>=1.0, anthropic>=0.30 — optional.
Dev: pytest>=8.0, ruff>=0.4, pytest-cov, pytest-asyncio, pre-commit.