harness-evals

Open-source AI evaluation framework for LLM agents, prompts, and structured outputs.

Every metric produces a normalized Score (0.0–1.0). Pass/fail is determined by a configurable threshold. No magic, no hidden state.

Install

pip install harness-evals            # core only
pip install harness-evals[llm]       # + LLM-judged metrics (OpenAI, Anthropic)
pip install harness-evals[otlp]      # + OTLP metrics & traces export
pip install harness-evals[langfuse]  # + Langfuse source/sink
pip install harness-evals[similarity] # + BLEU metric (nltk)
pip install harness-evals[harness]   # + Harness AI Service LLM provider
pip install harness-evals[all]       # everything

Five Dimensions

Every metric belongs to one of five evaluation dimensions. Together they answer: where is my agent strong, and where is it weak?

Dimension	Question	Example Metrics
Correctness	Is it right?	ExactMatch, TaskCompletion, GEval, GoalAccuracy
Groundedness	Is it supported by evidence?	Faithfulness, ContextPrecision, AnswerRelevancy
Safety	Did it violate policy?	PII, Toxicity, PromptInjection, Hallucination
Trajectory	Did it take a good path?	PlanAdherence, StepEfficiency, ToolCorrectness
Performance	Was it fast and cheap?	Latency, TokenCost, CostEfficiency

Dimensions are set by the metric author — not user-configured. Any combination of metrics in a suite automatically produces a radar chart grouped by dimension.

Core Concepts

Golden — what you author: an input, the expected output, and optional context. Lives in your dataset files.

EvalCase — what metrics receive: a Golden enriched with the agent's actual output and runtime metadata (latency, tokens, cost).

BaseMetric — a scoring function. Takes an EvalCase, returns a Score. Each metric is a single class with a measure() method. Specialized base classes: ReliabilityMetric for multi-run metrics, SafetyMetric for safety metrics (reported separately, never averaged).

Message — a conversation turn: role, content, and optional tool calls. Maps to OpenAI chat messages, Langfuse generations, OTEL LLM spans.

ToolCall — a tool/function invocation: name, input, output. Maps to OpenAI function calls, Anthropic tool_use blocks, MCP invocations.

Score — the result: a value between 0.0 and 1.0, a threshold, and an auto-computed passed boolean.

evaluate() — runs multiple metrics on an eval case. Never raises — returns all scores including failures.

assert_test() — same as evaluate(), but raises AssertionError if any metric fails. Drop it into pytest.

Sources — adapters that hydrate EvalCase from production trace data. LangfuseSource and OTELSource map traces to typed fields automatically.
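The relationship between Score, evaluate(), and assert_test() can be sketched in plain Python. This is a minimal illustration, not the library's implementation: SketchScore, sketch_evaluate, sketch_assert_test, and exact_match are hypothetical stand-ins, and the `value >= threshold` comparison is an assumption (the text above only says that passed is computed from the threshold).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SketchScore:
    """Hypothetical stand-in for Score: a value in [0.0, 1.0] plus a threshold."""
    name: str
    value: float
    threshold: float = 0.5

    @property
    def passed(self) -> bool:
        # Assumption: a score passes when it meets or exceeds its threshold.
        return self.value >= self.threshold


def sketch_evaluate(case: dict, metrics: list[Callable[[dict], SketchScore]]) -> list[SketchScore]:
    """Run every metric and return all scores, whether they pass or not."""
    return [metric(case) for metric in metrics]


def sketch_assert_test(case: dict, metrics: list[Callable[[dict], SketchScore]]) -> None:
    """Same run as sketch_evaluate, but raise AssertionError if any score fails."""
    failed = [s for s in sketch_evaluate(case, metrics) if not s.passed]
    if failed:
        names = ", ".join(f"{s.name}={s.value:.2f}" for s in failed)
        raise AssertionError(f"metrics below threshold: {names}")


def exact_match(case: dict) -> SketchScore:
    """Toy metric: 1.0 when output equals expected, otherwise 0.0."""
    value = 1.0 if case["output"] == case["expected"] else 0.0
    return SketchScore(name="exact_match", value=value, threshold=1.0)
```

The split mirrors the two use cases: the non-raising form suits reporting and dashboards, while the raising form lets a test runner mark the case as failed.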

Data Flow

Golden (authored) + agent output → EvalCase → Score (result)
Production traces (Langfuse/OTEL) → Source → EvalCase → Score (result)

Usage

Evaluate a response

from harness_evals import EvalCase, evaluate
from harness_evals.metrics import ExactMatchMetric, LatencyMetric

ec = EvalCase(
    input="What is the capital of France?",
    output="Paris",
    expected="Paris",
    latency_ms=320,
)

scores = evaluate(ec, metrics=[
    ExactMatchMetric(),
    LatencyMetric(max_ms=2000, threshold=0.5),
])

for s in scores:
    print(f"{'PASS' if s.passed else 'FAIL'} {s.name}: {s.value:.2f}")
# PASS exact_match: 1.00
# PASS latency: 0.84
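
The 0.84 printed above is consistent with a linear score, 1 - latency_ms / max_ms = 1 - 320 / 2000. That formula is inferred from the printed output and is not stated in this README, so the sketch below is one plausible reading rather than LatencyMetric's actual code; linear_latency_score is a hypothetical helper.

```python
def linear_latency_score(latency_ms: float, max_ms: float) -> float:
    """Map latency to [0.0, 1.0]: 1.0 at 0 ms, 0.0 at or beyond max_ms (assumed formula)."""
    if max_ms <= 0:
        raise ValueError("max_ms must be positive")
    # Clamp so that latencies above max_ms do not produce negative scores.
    return max(0.0, min(1.0, 1.0 - latency_ms / max_ms))


score = linear_latency_score(320, 2000)
print(f"{score:.2f}")  # 0.84
```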

Evaluate structured output (JSON/YAML)

from harness_evals import EvalCase, assert_test
from harness_evals.metrics import JsonDiffMetric, SchemaValidationMetric

expected = {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "nginx"}}
schema = {"type": "object", "required": ["apiVersion", "kind"], "properties": {
    "apiVersion": {"type": "string"}, "kind": {"type": "string"}
}}

ec = EvalCase(
    input="Create a K8s deployment for nginx",
    output={"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "nginx"}},
    expected=expected,
)

assert_test(ec, metrics=[
    JsonDiffMetric(threshold=0.9),
    SchemaValidationMetric(schema=schema),
])
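
This README does not describe how JsonDiffMetric computes structural similarity. As a rough illustration of the idea, one option is the fraction of leaf paths on which two documents agree. The helpers flatten and leaf_similarity below are hypothetical and are not part of harness-evals.

```python
from typing import Any


def flatten(doc: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into {path: leaf_value}."""
    if isinstance(doc, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in doc.items()]
    elif isinstance(doc, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(doc)]
    else:
        return {prefix: doc}
    flat: dict[str, Any] = {}
    for path, value in items:
        flat.update(flatten(value, path))
    return flat


def leaf_similarity(actual: Any, expected: Any) -> float:
    """Share of leaf paths (from either document) with identical values."""
    a, e = flatten(actual), flatten(expected)
    paths = a.keys() | e.keys()
    if not paths:
        return 1.0  # two empty documents are treated as identical
    matching = sum(1 for p in paths if p in a and p in e and a[p] == e[p])
    return matching / len(paths)


expected = {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "nginx"}}
actual = {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "httpd"}}
print(f"{leaf_similarity(actual, expected):.2f}")  # 0.67
```

Under this definition a single wrong field out of three would fall below the 0.9 threshold used in the example above.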

Use with pytest

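from harness_evals import EvalCase, assert_test
from harness_evals.metrics import ExactMatchMetric

def test_agent_accuracy():
    ec = EvalCase(
        input="What is 2+2?",
        output=agent.run("What is 2+2?"),  # `agent` is your own agent under test
        expected="4",
    )
    assert_test(ec, metrics=[ExactMatchMetric()])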

assert_test() raises AssertionError on failure — works natively with pytest, unittest, and any CI system.

Evaluate a dataset with an agent

import asyncio
from harness_evals import Golden, EvalCase, evaluate_dataset
from harness_evals.metrics import ExactMatchMetric

goldens = [
    Golden(input="What is 2+2?", expected="4"),
    Golden(input="Capital of France?", expected="Paris"),
]

async def run_agent(golden: Golden) -> EvalCase:
    result = await agent.arun(golden.input)
    return EvalCase.from_golden(golden, output=result)

results = asyncio.run(evaluate_dataset(goldens, run_agent, metrics=[ExactMatchMetric()]))

Measure reliability across multiple runs

from harness_evals import EvalCase, evaluate
from harness_evals.metrics import OutcomeConsistencyMetric

runs = [
    EvalCase(input="task", output=agent.run("task"))
    for _ in range(5)
]

ec = EvalCase(input="task", output=runs[0].output, runs=runs)
scores = evaluate(ec, metrics=[OutcomeConsistencyMetric(threshold=0.8)])
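
One simple way to quantify agreement across repeated runs is the share of runs that produce the most common output. This is an illustrative definition only; the README does not say how OutcomeConsistencyMetric is computed, and modal_agreement is a hypothetical helper.

```python
from collections import Counter


def modal_agreement(outputs: list[str]) -> float:
    """Fraction of runs whose normalized output equals the most common output."""
    if not outputs:
        raise ValueError("need at least one run")
    # Light normalization so whitespace and case differences do not count as disagreement.
    normalized = [o.strip().lower() for o in outputs]
    _, count = Counter(normalized).most_common(1)[0]
    return count / len(normalized)


print(modal_agreement(["4", "4", "four", "4", "4"]))  # 0.8
```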

Evaluate with typed tool calls

from harness_evals import EvalCase, ToolCall, evaluate
from harness_evals.metrics import ToolCorrectnessMetric

ec = EvalCase(
    input="Check weather in Paris",
    output="It's 18C and sunny",
    tool_calls=[ToolCall(name="get_weather", input={"city": "Paris"})],
    expected_tools=["get_weather"],
)
scores = evaluate(ec, metrics=[ToolCorrectnessMetric()])

For deterministic argument checks, pair ToolCorrectnessMetric with ToolArgumentMatchMetric:

from harness_evals import EvalCase, ToolCall, evaluate
from harness_evals.metrics import ToolArgumentMatchMetric, ToolCorrectnessMetric

ec = EvalCase(
    input="Check weather in Paris",
    output="It's 18C and sunny",
    tool_calls=[ToolCall(name="get_weather", input={"city": "Paris", "units": "C"})],
    expected_tools=["get_weather"],
    expected_tool_calls=[ToolCall(name="get_weather", input={"city": "Paris"})],
)
scores = evaluate(
    ec,
    metrics=[
        ToolCorrectnessMetric(mode="exact"),
        ToolArgumentMatchMetric(arg_match="subset", ignore_keys={"trace_id"}),
    ],
)
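
In the example above the actual call carries an extra "units" argument that the expected call lacks. A subset match, read here as "every expected argument appears in the actual call with the same value", would accept that. The sketch below shows this reading in plain Python; args_match is a hypothetical helper, and both the subset semantics and the "exact" mode for arguments are assumptions, not the documented behavior of ToolArgumentMatchMetric.

```python
def args_match(
    actual: dict,
    expected: dict,
    mode: str = "subset",
    ignore_keys: frozenset = frozenset(),
) -> bool:
    """Compare tool-call arguments after dropping ignored keys."""
    a = {k: v for k, v in actual.items() if k not in ignore_keys}
    e = {k: v for k, v in expected.items() if k not in ignore_keys}
    if mode == "exact":
        return a == e
    if mode == "subset":
        # Extra keys in the actual call are tolerated; missing or differing ones are not.
        return all(k in a and a[k] == v for k, v in e.items())
    raise ValueError(f"unknown mode: {mode!r}")


actual = {"city": "Paris", "units": "C", "trace_id": "abc"}
expected = {"city": "Paris"}
print(args_match(actual, expected, mode="subset", ignore_keys=frozenset({"trace_id"})))  # True
print(args_match(actual, expected, mode="exact", ignore_keys=frozenset({"trace_id"})))   # False
```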

Evaluate conversation messages

from harness_evals import EvalCase, Message

ec = EvalCase(
    input="Help me debug this error",
    output="The issue is a null pointer...",
    messages=[
        Message(role="user", content="Help me debug this error"),
        Message(role="assistant", content="Can you share the stack trace?"),
        Message(role="user", content="Here it is: NullPointerException at..."),
        Message(role="assistant", content="The issue is a null pointer..."),
    ],
)

Evaluate production traces from Langfuse

from harness_evals.sources.langfuse import LangfuseSource
from harness_evals import evaluate
from harness_evals.metrics import FaithfulnessMetric, LatencyMetric, PIIMetric
from harness_evals.sinks.langfuse_sink import LangfuseSink

source = LangfuseSource(langfuse_client)
ec = source.from_trace("trace-id-123")

scores = evaluate(ec, metrics=[
    FaithfulnessMetric(llm=llm),
    LatencyMetric(max_ms=3000),
    PIIMetric(),
], sinks=[LangfuseSink()])  # scores written back to the same trace

Batch-evaluate Langfuse traces by filter

from harness_evals.sources.langfuse import LangfuseSource
from harness_evals import evaluate_cases
from harness_evals.metrics import LatencyMetric, PIIMetric
from harness_evals.sinks.langfuse_sink import LangfuseSink

source = LangfuseSource(langfuse_client)
cases = source.from_traces(tags=["production"], user_id="user_123", limit=50)

all_scores = evaluate_cases(cases, metrics=[
    LatencyMetric(max_ms=3000),
    PIIMetric(),
], sinks=[LangfuseSink()])

Evaluate production traces from OpenTelemetry

from harness_evals import evaluate
from harness_evals.sources.otel import OTELSource

ec = OTELSource.from_spans(collected_spans)  # collected_spans: spans you gathered for one trace
scores = evaluate(ec, metrics=[...])

Write results to a file

from harness_evals import evaluate
from harness_evals.sinks import StdoutSink, JsonSink
from harness_evals.sinks.langfuse_sink import LangfuseSink

scores = evaluate(ec, metrics=[...], sinks=[
    StdoutSink(),
    JsonSink("results/scores.jsonl"),
    LangfuseSink(),  # requires pip install harness-evals[langfuse]
])

Export to an OTLP-compatible backend

With protocol="http", set endpoint to the OTLP HTTP base URL; OtlpSink appends /v1/traces and /v1/metrics for the exporters. With protocol="grpc", use a host:port endpoint instead. Some integrations default to grpc, so check which protocol applies before choosing the endpoint form.

from harness_evals import evaluate
from harness_evals.sinks.otlp_sink import OtlpSink  # requires pip install harness-evals[otlp]

sink = OtlpSink(
    endpoint="http://collector:4317",
    run_id="my-eval-run-001",
    resource_attributes={"deployment.environment": "ci"},
    extra_attributes={"eval.suite_id": "nightly-regression"},
)

scores = evaluate(ec, metrics=[...], sinks=[sink])
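
The path-appending behavior described before the example can be pictured with a small helper. This is a sketch of the URL arithmetic only, not OtlpSink's code; otlp_http_urls is a hypothetical name. By OpenTelemetry convention, port 4317 is used for OTLP over gRPC and 4318 for OTLP over HTTP, so the sketch uses 4318 for the HTTP case.

```python
def otlp_http_urls(base_url: str) -> dict[str, str]:
    """Derive the per-signal OTLP/HTTP URLs from a base URL."""
    root = base_url.rstrip("/")  # avoid a double slash when the base ends with "/"
    return {
        "traces": f"{root}/v1/traces",
        "metrics": f"{root}/v1/metrics",
    }


urls = otlp_http_urls("http://collector:4318/")
print(urls["traces"])   # http://collector:4318/v1/traces
print(urls["metrics"])  # http://collector:4318/v1/metrics
```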

Attach eval spans to an existing trace

If your eval engine already creates OTel spans, pass a parent_context so the eval-run span becomes a child (same trace ID, unified view in Jaeger/Tempo):

from opentelemetry import trace
from harness_evals import evaluate_cases
from harness_evals.sinks.otlp_sink import OtlpSink

tracer = trace.get_tracer("my-engine")
with tracer.start_as_current_span("orchestration") as parent:
    ctx = trace.set_span_in_context(parent)
    sink = OtlpSink(endpoint="http://collector:4317", parent_context=ctx)
    evaluate_cases(cases, metrics=[...], sinks=[sink])

Share a TracerProvider (single export pipeline)

For full control, pass your own TracerProvider and/or MeterProvider. The sink won't flush or shut down providers it doesn't own; you retain lifecycle control:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from harness_evals import evaluate_cases
from harness_evals.sinks.otlp_sink import OtlpSink

provider = TracerProvider(resource=my_resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317")))
tracer = provider.get_tracer("my-engine")

with tracer.start_as_current_span("orchestration") as parent:
    ctx = trace.set_span_in_context(parent)
    sink = OtlpSink(tracer_provider=provider, parent_context=ctx, run_id="run-123")
    evaluate_cases(cases, metrics=[...], sinks=[sink])

provider.shutdown()  # caller owns lifecycle

Evaluate security remediations

from harness_evals import EvalCase, evaluate
from harness_evals.llm.openai import OpenAILLM  # or AnthropicLLM, HarnessAILLM
from harness_evals.metrics.security import (
    VulnerabilityCorrectnessMetric,
    SecurityCompletenessMetric,
    CodeSafetyMetric,
    CodeQualityMetric,
    ExplanationQualityMetric,
    RootCauseAnalysisMetric,
    ActionabilityMetric,
    remediation_quality_index,
)

llm = OpenAILLM()  # uses OPENAI_API_KEY env var

ec = EvalCase(
    input="CWE-79: Reflected XSS in user_profile.py line 42. User input rendered without escaping.",
    output="## Fix\n```python\nfrom markupsafe import escape\nname = escape(request.args.get('name', ''))\n```",
)

scores = evaluate(ec, metrics=[
    VulnerabilityCorrectnessMetric(llm=llm, threshold=0.5),
    SecurityCompletenessMetric(llm=llm, threshold=0.5),
    CodeSafetyMetric(llm=llm, threshold=0.5),
    CodeQualityMetric(llm=llm, threshold=0.5),
    ExplanationQualityMetric(llm=llm, threshold=0.5),
    RootCauseAnalysisMetric(llm=llm, threshold=0.5),
    ActionabilityMetric(llm=llm, threshold=0.5),
])

rqi = remediation_quality_index(scores)
print(f"RQI: {rqi.value:.3f} ({'PASS' if rqi.passed else 'FAIL'})")

Summarize results across a dataset

from harness_evals import evaluate_cases, summarize

all_scores = evaluate_cases(eval_cases, metrics=[...])
summary = summarize(all_scores)

for name, m in summary.by_metric.items():
    print(f"{name}: mean={m.mean:.2f} pass_rate={m.pass_rate:.0%} ({m.count} cases)")
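
The per-metric fields printed above (mean, pass_rate, count) correspond to a straightforward group-and-aggregate step. The sketch below reproduces that aggregation on plain tuples so the arithmetic is visible; summarize_by_metric is a hypothetical helper and not the library's summarize().

```python
from collections import defaultdict
from statistics import mean


def summarize_by_metric(rows: list[tuple[str, float, bool]]) -> dict[str, dict]:
    """Group (metric_name, value, passed) rows and compute mean, pass rate, and count."""
    grouped: dict[str, list[tuple[float, bool]]] = defaultdict(list)
    for name, value, passed in rows:
        grouped[name].append((value, passed))
    return {
        name: {
            "mean": mean(v for v, _ in items),
            "pass_rate": sum(1 for _, p in items if p) / len(items),
            "count": len(items),
        }
        for name, items in grouped.items()
    }


rows = [
    ("exact_match", 1.0, True),
    ("exact_match", 0.0, False),
    ("latency", 0.84, True),
]
for name, m in summarize_by_metric(rows).items():
    print(f"{name}: mean={m['mean']:.2f} pass_rate={m['pass_rate']:.0%} ({m['count']} cases)")
```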

Available Metrics

Category	Metrics	What They Measure
Deterministic	ExactMatch, Contains, Regex, NumericDiff, ListContains	Exact comparison against expected output
Structural	JsonDiff, SchemaValidation	Structural similarity and schema conformance for JSON/YAML
Operational	Latency, TokenCost, CostEfficiency, RetryCount	Performance and cost from typed fields
Reliability	OutcomeConsistency, ResourceConsistency, TrajectoryConsistency, PromptRobustness, EnvironmentRobustness, FaultRobustness, BrierScore	Consistency across repeated runs, trajectory similarity, robustness to prompt/environment/fault perturbations
Predictability	Calibration, Discrimination	Expected calibration error and AUC-ROC over confidence scores
MCP	ToolSelectionAccuracy, MCPTraceCompleteness	MCP tool selection accuracy and trace completeness
Similarity	Levenshtein, BLEU, EmbeddingSimilarity	String distance, n-gram overlap, and semantic vector similarity
LLM-Judged	GEval, RubricJudge, Pairwise	LLM scores output against criteria, rubric, or A/B comparison. `GEval` supports free-form criteria, numbered `evaluation_steps`, and integer score-band rubrics via `list[RubricLevel]`; `RubricJudge` uses a flat level → description rubric. (requires `[llm]`)
RAG	Faithfulness, AnswerRelevancy, ContextPrecision, ContextRecall, AnswerCorrectness, AnswerSimilarity, ContextEntityRecall, ContextRelevancy	Retrieval-augmented generation quality (requires `[llm]`)
Safety	PII, Toxicity, PromptInjection, Hallucination	PII leaks, toxic content, prompt injection, hallucination (reported separately, never averaged)
Agent	ToolCorrectness, ToolArgumentMatch, TaskCompletion, ArgumentCorrectness, PlanQuality, PlanAdherence, StepEfficiency	Tool call correctness, deterministic argument match, task completion, LLM-judged argument validation, plan quality/adherence, step efficiency (some require `[llm]`)
Conversation	ConversationCoherence, ConversationResolution, ConversationCompleteness, TurnEfficiency, TurnRelevancy, KnowledgeRetention, RoleAdherence, TopicAdherence, GoalAccuracy, ToolUse	Multi-turn coherence, resolution, completeness, efficiency, relevancy, memory, role/topic adherence, goal accuracy, tool usage (requires `[llm]`)
Security	VulnerabilityCorrectness, SecurityCompleteness, CodeSafety, CodeQuality, ExplanationQuality, RootCauseAnalysis, Actionability	LLM-as-Judge metrics for AI-generated security vulnerability remediations, with composite Remediation Quality Index (requires LLM provider: `[llm]` or `[harness]`)
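
For readers unfamiliar with the Predictability row, expected calibration error (ECE) bins predictions by confidence and takes the size-weighted average gap between mean confidence and observed accuracy in each bin. The sketch below is the standard equal-width-bin formulation written from scratch; the function name and the default of 10 bins are assumptions and may differ from what the Calibration metric does internally.

```python
def expected_calibration_error(
    confidences: list[float],
    outcomes: list[bool],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width bins over [0, 1]; 0.0 means perfectly calibrated."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Four answers at 90% confidence, three of them correct: |0.90 - 0.75| = 0.15
print(f"{expected_calibration_error([0.9] * 4, [True, True, True, False]):.2f}")  # 0.15
```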

EvalCase Fields

EvalCase(
    input="the prompt or task",                    # required
    output="what the agent produced",              # required
    expected="ground truth",                       # optional (not needed for LLM-judged metrics)
    context=["retrieved doc 1", "retrieved doc 2"],# optional (for RAG metrics)
    messages=[Message(role="user", content="...")], # optional (for conversation metrics)
    tool_calls=[ToolCall(name="fn", input={...})], # optional (for agent/tool metrics)
    expected_tools=["fn1", "fn2"],                 # optional (expected tool names)
    latency_ms=320,                                # optional (typed, for LatencyMetric)
    token_count=150,                               # optional (typed, for TokenCostMetric)
    cost_usd=0.003,                                # optional (typed, for CostEfficiencyMetric)
    retry_count=0,                                 # optional (typed, for RetryCountMetric)
    confidence=0.95,                               # optional (typed)
    tags={"env": "ci", "model": "gpt-4o"},         # optional (for filtering)
    metadata={"custom_key": "value"},              # optional (extensible)
    runs=[...],                                    # optional (for reliability metrics)
)

Golden also supports expected_tools for defining expected tool names in datasets.

Extending

Custom metric

from harness_evals.core.metric import BaseMetric
from harness_evals.core.score import Score
from harness_evals.core.eval_case import EvalCase

class MyMetric(BaseMetric):
    def __init__(self, threshold: float = 0.8):
        super().__init__(name="my_metric", threshold=threshold)

    def measure(self, eval_case: EvalCase) -> Score | None:
        # Return None instead of a Score to skip this case (excluded from aggregation).
        value = compute_something(eval_case)  # your scoring logic; must yield a value in 0.0–1.0
        return Score(name=self.name, value=value, threshold=self.threshold)

Custom sink

from harness_evals.core.sink import BaseSink

class MySink(BaseSink):
    def write(self, scores, eval_case):
        for s in scores:
            send_to_my_system(s.name, s.value, s.passed)

Documentation

Architecture — system design, data flow, extension points
Metrics Guide — how to write a new metric, templates for every category
Integration Guide — pytest, GitHub Actions, Harness CI, GitLab CI
Contributing — development workflow, code style, PR process
Architecture Decision Records — why we made key design choices
Changelog — version history

Development

git clone git@github.com:sunilgattupalle/harness-evals.git
cd harness-evals
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[all,dev]"
ruff check src/ tests/          # lint
ruff format --check src/ tests/ # format
pytest tests/ -v                # test

References

Rabanser, Kapoor, Kirgis, Liu, Utpala, Narayanan. "Towards a Science of AI Agent Reliability". Princeton, 2026. — Defines the 12 reliability metrics across 4 dimensions (consistency, robustness, predictability, safety) that inform this framework's reliability metric design.
DeepEval — LLM evaluation framework. Influenced the measure() / a_measure() metric interface pattern.
RAGAS — RAG evaluation toolkit. Informed the RAG metric decomposition (faithfulness, context precision/recall).
promptfoo — CLI-first eval tool. Validated the CI/CD-native approach (JUnit output, exit codes, baseline regression).

License

Apache 2.0 — see LICENSE.
