evalwire

Systematic, reproducible evaluation of LangGraph nodes and subgraphs against human-curated testsets, tracked in Arize Phoenix.

What it does

When iterating on a LangGraph agent, it is hard to know whether a change to a specific node improved or degraded its behaviour. Running the full graph end-to-end is expensive and makes it difficult to attribute a score change to a specific component.

evalwire solves this by:

Turning a human-curated CSV of queries and expected outputs into versioned Arize Phoenix datasets.
Letting you define a task that isolates and invokes individual LangGraph nodes independently of the rest of the graph.
Running those tasks against the stored datasets, scoring each output with one or more evaluators, and recording the results in Phoenix, so that every run is a reproducible experiment you can compare with earlier ones.

Installation

pip install evalwire
# With LangGraph node-isolation helpers:
pip install 'evalwire[langgraph]'
# With LLM-as-a-judge evaluator:
pip install 'evalwire[llm-judge]'
# Everything:
pip install 'evalwire[all]'

Quick start

1. Upload your testset

evalwire upload --csv data/testset.csv

The CSV must contain a tags column whose values name the target Phoenix dataset (multiple tags can be pipe-delimited: es_search|source_router).
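
To make the tag convention concrete, the sketch below splits a pipe-delimited tags cell and groups rows by dataset name. It only illustrates the routing rule described above and is not evalwire's upload code; the `user_query` and `expected_output` column names are assumptions, since only the tags column is documented here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical testset; only the `tags` column is required by the text above.
CSV_TEXT = """user_query,expected_output,tags
best pizza in Rome,doc_12,es_search
latest invoices,billing,es_search|source_router
"""


def group_rows_by_tag(csv_text: str) -> dict[str, list[dict[str, str]]]:
    """Assign each row to every dataset named in its pipe-delimited tags cell."""
    groups: dict[str, list[dict[str, str]]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for tag in row["tags"].split("|"):
            tag = tag.strip()
            if tag:  # ignore empty fragments such as a trailing pipe
                groups[tag].append(row)
    return dict(groups)


datasets = group_rows_by_tag(CSV_TEXT)
for name, rows in datasets.items():
    print(name, len(rows))
```

A row tagged with two names lands in both groups, which is how one CSV can feed several Phoenix datasets.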

2. Structure your experiments

experiments/
├── es_search/
│   ├── task.py        # defines: async def task(example) -> Any
│   └── top_k.py       # defines: def top_k(output, expected) -> float
└── source_router/
    ├── task.py
    └── accuracy.py
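
As a rough illustration of what the two files in one experiment directory might contain, here is a minimal sketch of a task and an evaluator that follow the signatures shown in the tree. The routing logic, the `expected_output` key, and the `SimpleNamespace` stand-in for an example object are all assumptions made for the demo; real tasks would call your own node.

```python
import asyncio
from types import SimpleNamespace
from typing import Any


# Would live in experiments/source_router/task.py
async def task(example: Any) -> str:
    """Stand-in task: route a query to a source label."""
    query = example.input["user_query"]
    return "billing" if "invoice" in query.lower() else "search"


# Would live in experiments/source_router/accuracy.py
def accuracy(output: str, expected: dict) -> float:
    """Return 1.0 when the predicted label equals the expected one (key name assumed)."""
    return 1.0 if output == expected.get("expected_output") else 0.0


# Hypothetical example object mimicking the `example.input[...]` access pattern.
example = SimpleNamespace(input={"user_query": "Where is my invoice?"})
predicted = asyncio.run(task(example))
score = accuracy(predicted, {"expected_output": "billing"})
print(predicted, score)
```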

3. Run experiments

evalwire run --experiments experiments/

Built-in evaluators

All factories are importable from evalwire.evaluators and return a callable with signature (output, expected: dict) -> float | bool.

Factory	Returns	Use case
`make_top_k_evaluator(K=20)`	`float`	Position-weighted retrieval scoring
`make_membership_evaluator()`	`bool`	Classification / routing label check
`make_exact_match_evaluator()`	`bool`	Extractive QA, single ground-truth string
`make_contains_evaluator()`	`bool`	Free-text generation, required phrase present
`make_regex_evaluator()`	`bool`	Structured format validation (dates, IDs, …)
`make_json_match_evaluator(keys)`	`float`	Tool-call / structured-output key matching
`make_schema_evaluator(schema)`	`bool`	JSON Schema conformance
`make_numeric_tolerance_evaluator(atol, rtol)`	`bool`	Math / calculation tasks with tolerance
`make_llm_judge_evaluator(model, prompt, schema)`	`float\|bool`	LLM-as-a-judge with structured output

Example

from evalwire.evaluators import make_top_k_evaluator, make_exact_match_evaluator

# Assign each factory's return value to a name in a file in your experiment directory; that callable is the evaluator
top_k = make_top_k_evaluator(K=5)
exact = make_exact_match_evaluator()
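
If none of the built-in factories fits, the same factory pattern can be written by hand. The sketch below is a hypothetical `make_reciprocal_rank_evaluator`, not part of evalwire and not necessarily the formula `make_top_k_evaluator` uses; it assumes the expected items are stored under an `expected_output` key.

```python
from typing import Callable


def make_reciprocal_rank_evaluator(k: int = 5) -> Callable[[list[str], dict], float]:
    """Hypothetical factory: score 1/rank of the first expected item in the top k, else 0.0."""

    def evaluator(output: list[str], expected: dict) -> float:
        wanted = set(expected.get("expected_output", []))  # key name is an assumption
        for rank, item in enumerate(output[:k], start=1):
            if item in wanted:
                return 1.0 / rank
        return 0.0

    return evaluator


reciprocal_rank = make_reciprocal_rank_evaluator(k=3)
print(reciprocal_rank(["a", "b", "c"], {"expected_output": ["b"]}))
```

The closure captures `k` once, so the returned callable keeps the `(output, expected)` signature described above.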

LLM judge

from pydantic import BaseModel
from langchain.chat_models import init_chat_model
from evalwire.evaluators import make_llm_judge_evaluator

class Verdict(BaseModel):
    explanation: str
    score: bool  # True = correct

llm_judge = make_llm_judge_evaluator(
    model=init_chat_model("gpt-4o-mini"),
    prompt_template=(
        "Output: {output}\n"
        "Expected: {expected_output}\n"
        "Is the output correct? Think step by step, then set score."
    ),
    output_schema=Verdict,
)

Requires pip install 'evalwire[llm-judge]'.

Node isolation

Use invoke_node to call a single LangGraph node without compiling a full graph:

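from evalwire.langgraph import invoke_node
# `retrieve` (the node function) and `RAGState` (the graph's state schema) come from your own graph code.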

async def task(example) -> list[str]:
    result = await invoke_node(retrieve, example.input["user_query"], RAGState)
    return result["retrieved_titles"]
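
The idea behind node isolation can be shown without any library: a LangGraph node is ordinarily a function that takes the state and returns a partial update, so it can be exercised on its own with a hand-built state. The sketch below uses a toy `retrieve` node and a toy `RAGState`; it does not use `invoke_node` and makes no claim about how that helper is implemented.

```python
import asyncio
from typing import TypedDict


class RAGState(TypedDict, total=False):
    """Toy state schema; field names mirror the snippet above."""
    user_query: str
    retrieved_titles: list[str]


async def retrieve(state: RAGState) -> dict:
    """Toy node: return a partial state update with titles matching any query word."""
    corpus = ["Intro to Phoenix", "LangGraph basics", "Phoenix datasets"]
    words = state["user_query"].lower().split()
    hits = [title for title in corpus if any(w in title.lower() for w in words)]
    return {"retrieved_titles": hits}


async def call_node_directly(query: str) -> list[str]:
    """Build the minimal state the node needs and call it outside any graph."""
    state: RAGState = {"user_query": query}
    update = await retrieve(state)
    return update["retrieved_titles"]


titles = asyncio.run(call_node_directly("phoenix"))
print(titles)
```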

CLI reference

Command	Description
`evalwire upload --csv PATH`	Upload CSV testset to Phoenix
`evalwire run --experiments DIR`	Discover and run all experiments
`evalwire run --name NAME`	Run a single named experiment
`evalwire run --dry-run N`	Run N examples without recording results
`evalwire run --concurrency N`	Run N experiments in parallel

Configuration

Create evalwire.toml in your project root to avoid repeating flags:

[dataset]
csv_path = "data/testset.csv"
on_exist = "skip"

[experiments]
dir = "experiments"
prefix = "eval"
concurrency = 4

Requirements

Python >= 3.10
arize-phoenix >= 13.0, < 14
A running Phoenix instance (local or cloud)
