AI Tool Evaluations

Evals are to AI tools what tests are to traditional code. This framework uses Promptfoo for declarative, CI-integrated evaluations of skills and plugins.

Philosophy

Why Evals Matter

Traditional software tests verify deterministic behavior: given input X, expect output Y. AI tools are probabilistic: the same prompt can produce different but equally valid outputs. Evals bridge this gap by:

Defining expected behaviors rather than exact outputs
Measuring quality across multiple dimensions (accuracy, completeness, safety)
Detecting regressions when prompts or models change
Comparing performance across different LLM backends

Eval vs Test

Aspect	Traditional Test	AI Eval
Output	Exact match	Semantic similarity
Pass/Fail	Binary	Scored (0-1)
Determinism	Always same result	Statistical confidence
What's Checked	Correctness	Quality, safety, helpfulness

Structure

evals/
├── promptfoo.yaml          # Root config with default providers
├── README.md               # This file
├── framework/
│   └── types.ts            # TypeScript types (for programmatic use)
├── rubrics/                # Shared evaluation rubrics
│   └── security-checklist.txt
├── scripts/
│   └── anthropic-provider.ts  # Custom provider for OAuth support
├── suites/
│   └── <skill-name>/
│       ├── promptfoo.yaml  # Suite-specific config
│       ├── prompt-wrapper.txt  # Optional prompt template (see Prompt Template Pattern)
│       ├── cases/          # Test case prompts (markdown)
│       │   └── *.md
│       └── rubrics/        # Skill-specific rubrics (must use .txt)
│           └── *.txt
└── templates/              # Templates for new suites
    └── suite/

Quick Start

1. Setup Authentication

# Option A: Use 1Password (recommended for team)
nx run evals:setup

# Option B: Set environment variable directly
export ANTHROPIC_API_KEY="sk-ant-..."
# OR
export CLAUDE_CODE_OAUTH_TOKEN="..."

2. Run Evals

# Run a specific suite
nx run evals:eval --suite=v4-security-foundations

# Run all suites
nx run evals:eval:all

# View results in browser
nx run evals:eval:view

# Clear eval cache
nx run evals:eval:cache-clear

Authentication

Evals support two authentication methods. If both variables are set, the API key takes priority:

Method	Environment Variable	Use Case
API Key	`ANTHROPIC_API_KEY`	CI, production
OAuth Token	`CLAUDE_CODE_OAUTH_TOKEN`	Local development
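
The priority rule above can be pictured with a small sketch. `resolveAuth` is a hypothetical helper written for illustration; it is not the code in scripts/anthropic-provider.ts, and the returned object shape is an assumption.

```javascript
// Hypothetical helper (not part of this repo): picks a credential from an
// env-like object, preferring the API key when both variables are set.
function resolveAuth(env) {
  if (env.ANTHROPIC_API_KEY) {
    return { method: 'api-key', value: env.ANTHROPIC_API_KEY };
  }
  if (env.CLAUDE_CODE_OAUTH_TOKEN) {
    return { method: 'oauth', value: env.CLAUDE_CODE_OAUTH_TOKEN };
  }
  throw new Error('Set ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN');
}
```

Calling `resolveAuth(process.env)` before a run would surface a missing credential early instead of at the first provider call.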

Setup with 1Password

The setup script fetches secrets from 1Password (requires 1Password CLI):

# One-time setup
eval $(op signin)
nx run evals:setup

# Then run evals
nx run evals:eval --suite=v4-security-foundations

Prompt Template Pattern

For skills that rely on loaded context (SKILL.md, reference materials), use a prompt template to inject that context into each test case. This mirrors how skills are loaded in production and produces more realistic evals.

How It Works

Create a prompt-wrapper.txt template with Promptfoo {{ variable }} placeholders
Set shared variables (skill content, references) in defaultTest.vars
Override per-test variables (e.g., case_content) in each test entry

# promptfoo.yaml
prompts:
  - file://prompt-wrapper.txt

defaultTest:
  vars:
    skill_content: file://../../../packages/plugins/<plugin>/skills/<skill>/SKILL.md
    reference_doc: file://../../../packages/plugins/<plugin>/skills/<skill>/references/doc.md

tests:
  - vars:
      case_content: file://cases/my-test-case.md
    assert:
      - type: llm-rubric
        value: file://rubrics/my-rubric.txt

# prompt-wrapper.txt
You are an AI assistant with the following skill loaded.

{{ skill_content }}

***

Reference material:

{{ reference_doc }}

***

User request:

{{ case_content }}
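
To make the variable layering concrete, here is a minimal sketch of how shared defaults and per-test variables might combine before rendering. It assumes per-test vars override same-named defaults, and it uses a simplified placeholder replacer as a stand-in for Nunjucks; `renderTemplate` and `buildPrompt` are hypothetical names, not Promptfoo APIs.

```javascript
// Simplified stand-in for Nunjucks: only replaces {{ name }} placeholders
// and leaves unknown placeholders untouched.
function renderTemplate(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    name in vars ? String(vars[name]) : match
  );
}

// Per-test vars are layered over the shared defaults (assumed merge order).
function buildPrompt(template, defaultVars, testVars) {
  return renderTemplate(template, { ...defaultVars, ...testVars });
}

const prompt = buildPrompt(
  'Skill:\n{{ skill_content }}\n\nUser request:\n{{ case_content }}',
  { skill_content: 'SKILL BODY', case_content: 'default case' },
  { case_content: 'Create a simple hook that logs swap events.' }
);
console.log(prompt);
```

Here `skill_content` comes from the defaults while `case_content` is replaced by the per-test value.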

Important: Use *** (not ---) as section separators in prompt templates. Promptfoo treats --- on its own line as a multi-prompt separator, which splits one template into multiple incomplete prompts. *** renders as a markdown horizontal rule without triggering this behavior.

See suites/v4-security-foundations/ for a working example.

Writing Evals

1. Create a Test Case

Create a markdown file in suites/<skill>/cases/:

# Basic Swap Hook

Create a simple hook that logs swap events.

## Context

- Pool: ETH/USDC
- Chain: Ethereum mainnet
- No custom fees required

## Requirements

1. Implement `afterSwap` callback
2. Emit an event with swap details
3. No state changes needed

2. Define Rubrics

Create rubric files in suites/<skill>/rubrics/ with .txt extension:

Important: Promptfoo's grader only supports .txt, .json, and .yaml file types for rubric file:// references. Use .txt for markdown-formatted rubrics.

correctness.txt:

# Correctness Rubric

Evaluate whether the generated code correctly implements the requirements.

## Required Elements

1. Inherits from BaseHook
2. Implements getHookPermissions()
3. Sets afterSwap to true
4. Emits event in afterSwap callback

## Scoring

- 4/4 elements: 1.0
- 3/4 elements: 0.75
- 2/4 elements: 0.5
- 1/4 elements: 0.25
- 0/4 elements: 0.0

3. Configure the Suite

Create promptfoo.yaml:

description: 'My Skill Evaluation'

prompts:
  - file://cases/basic.md

providers:
  - id: anthropic:claude-sonnet-4-5-20250929
    config:
      temperature: 0

tests:
  - vars:
      scenario: basic
    assert:
      - type: llm-rubric
        value: file://rubrics/correctness.txt
        threshold: 0.8
        provider: anthropic:claude-sonnet-4-5-20250929
      - type: contains
        value: 'BaseHook'
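
A quick arithmetic check shows how the sample rubric interacts with the configured threshold. This sketch assumes an assertion passes when its score is at least the threshold; that comparison is an assumption here, and the helper names are hypothetical.

```javascript
// Hypothetical helpers for illustration; not part of Promptfoo or this repo.
function rubricScore(elementsMet, totalElements) {
  return elementsMet / totalElements;
}

// Assumed pass rule: score must be at least the threshold.
function passesThreshold(score, threshold) {
  return score >= threshold;
}

const threshold = 0.8;
const results = [4, 3, 2].map((met) => ({
  met,
  score: rubricScore(met, 4),
  pass: passesThreshold(rubricScore(met, 4), threshold),
}));
console.log(results);
```

Because the rubric scores in steps of 0.25, a response meeting 3 of 4 elements scores 0.75 and falls below 0.8, so under this assumption only a 4/4 response passes.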

Assertion Types

LLM Rubrics (Qualitative)

Use for subjective evaluation:

- type: llm-rubric
  value: file://rubrics/correctness.txt
  threshold: 0.8

Deterministic Checks

Use for required patterns:

# Must contain
- type: contains
  value: 'getHookPermissions'

# Must not contain
- type: not-contains
  value: 'selfdestruct'

# Regex match
- type: regex
  value: 'function\s+beforeSwap'
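
As a rough mental model, the three checks above amount to simple string and pattern tests on the model output. The sketch below is an illustration only, not Promptfoo's implementation; `checks` and `runAssertions` are hypothetical names.

```javascript
// Illustrative equivalents of the three deterministic assertion types.
const checks = {
  contains: (output, value) => output.includes(value),
  'not-contains': (output, value) => !output.includes(value),
  regex: (output, value) => new RegExp(value).test(output),
};

// Evaluates each assertion against the output and records pass/fail.
function runAssertions(output, assertions) {
  return assertions.map(({ type, value }) => ({
    type,
    value,
    pass: checks[type](output, value),
  }));
}

const output =
  'contract MyHook is BaseHook { function beforeSwap() {} function getHookPermissions() {} }';
const results = runAssertions(output, [
  { type: 'contains', value: 'getHookPermissions' },
  { type: 'not-contains', value: 'selfdestruct' },
  // In a JS string literal the backslash is doubled; the pattern is function\s+beforeSwap.
  { type: 'regex', value: 'function\\s+beforeSwap' },
]);
console.log(results);
```

None of these checks call a model, which is why they are cheap to run on every case.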

Evaluation Criteria

Accuracy (0-1)

Does the output correctly implement the requested functionality?

Completeness (0-1)

Does the output include all required elements?

Safety (0-1)

Does the output avoid security vulnerabilities and follow best practices?

For smart contract code, this should always have a threshold of 1.0 (non-negotiable).

CI Integration

Evals run automatically on PRs that modify:

packages/plugins/**
evals/**

The pass rate must be ≥85% for the PR to pass. Results include:

Per-suite pass/fail counts
Inference cost tracking
PR comment with summary
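
The gate can be sketched as a simple ratio check. This README does not say whether the 85% is computed per suite or pooled across suites; the sketch below assumes a pooled rate over all test cases, and `passRate` and `meetsGate` are hypothetical helpers, not the CI workflow's actual code.

```javascript
// Pooled pass rate across suites (assumption: every test case counts equally).
function passRate(suites) {
  const totals = suites.reduce(
    (acc, suite) => ({
      passed: acc.passed + suite.passed,
      total: acc.total + suite.passed + suite.failed,
    }),
    { passed: 0, total: 0 }
  );
  return totals.total === 0 ? 0 : totals.passed / totals.total;
}

// The gate passes when the pooled rate is at least minRate.
function meetsGate(suites, minRate = 0.85) {
  return passRate(suites) >= minRate;
}

const suites = [
  { name: 'suite-a', passed: 9, failed: 1 },
  { name: 'suite-b', passed: 8, failed: 2 },
];
console.log(passRate(suites), meetsGate(suites));
```

With 17 of 20 cases passing, the pooled rate is exactly 0.85, which meets the gate.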

Creating New Eval Suites

Copy templates/suite/ to suites/<skill-name>/
Rename .template files (remove .template extension)
Replace {{SKILL_NAME}} placeholders
Add test cases in cases/ directory
Define rubrics in rubrics/ directory
Update promptfoo.yaml with your prompts and assertions
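
The first three steps could be scripted along the following lines. This is a sketch under assumptions: `scaffoldSuite` is a hypothetical helper, `.template` is treated as a trailing suffix on file names, and every template file is assumed to be UTF-8 text.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: copies a template directory, drops ".template"
// suffixes, and substitutes every {{SKILL_NAME}} placeholder.
function scaffoldSuite(templateDir, suiteDir, skillName) {
  fs.mkdirSync(suiteDir, { recursive: true });
  for (const entry of fs.readdirSync(templateDir, { withFileTypes: true })) {
    const source = path.join(templateDir, entry.name);
    const target = path.join(suiteDir, entry.name.replace(/\.template$/, ''));
    if (entry.isDirectory()) {
      scaffoldSuite(source, target, skillName);
    } else {
      const text = fs.readFileSync(source, 'utf-8');
      fs.writeFileSync(target, text.split('{{SKILL_NAME}}').join(skillName));
    }
  }
}
```

A call such as `scaffoldSuite('templates/suite', 'suites/my-skill', 'my-skill')` would leave the remaining steps (cases, rubrics, assertions) to be filled in by hand.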

Best Practices

Focus on outputs, not paths: Don't check for specific steps; check that the result is correct
Start with real failures: Build evals from actual issues found in usage
Test the edges: Include cases that stress the skill's capabilities
Use deterministic checks first: contains/not-contains are faster and cheaper than LLM rubrics
Set appropriate thresholds: Security = 1.0, correctness ≥ 0.8, completeness ≥ 0.85
Review transcripts: Regularly read eval outputs to validate rubric quality

Common Pitfalls

`---` Splits Prompts

Promptfoo treats --- on its own line in .txt prompt files as a multi-prompt separator. This silently splits one template into multiple incomplete prompts, causing evals to fail with confusing results (e.g., one prompt has skill context but no user request, another has the user request but no context).

Fix: Always use *** for visual separators in prompt template files.
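
A small lint step can catch this before an eval run. The sketch below flags any line that consists only of `---` after trimming whitespace; Promptfoo's exact matching rule may differ, and `findPromptSeparators` is a hypothetical helper.

```javascript
// Returns the 1-based line numbers of lines that are exactly "---"
// once surrounding whitespace is trimmed.
function findPromptSeparators(templateText) {
  return templateText
    .split(/\r?\n/)
    .map((line, index) => ({ line, number: index + 1 }))
    .filter(({ line }) => line.trim() === '---')
    .map(({ number }) => number);
}

const template = 'Skill content\n---\nReference material\n***\nUser request';
console.log(findPromptSeparators(template));
```

Running this over each prompt-wrapper.txt and failing on a non-empty result would turn the silent split into an explicit error.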

Nunjucks Renders Everything

Promptfoo runs Nunjucks template rendering on all prompt content, including:

.txt template files
Return values from JavaScript prompt functions
Content loaded via file:// in vars:

This means URL-encoded JSON patterns like {%22feeAmount%22} in skill content will be interpreted as Nunjucks {% %} block tags, causing Template render error: unknown block tag errors.

Fix: For skills containing {% patterns (common in URL-encoded JSON), use a JavaScript prompt function that reads the file via fs.readFileSync and wraps it in {% raw %}...{% endraw %} blocks:

// prompt-wrapper.js
const fs = require('fs');
const path = require('path');

const skillPath = path.resolve(
  __dirname,
  '../../../packages/plugins/<plugin>/skills/<skill>/SKILL.md'
);
const skillContent = fs.readFileSync(skillPath, 'utf-8');

module.exports = function ({ vars }) {
  return `You are an AI assistant with the following skill loaded.

{% raw %}${skillContent}{% endraw %}

***

User request:

${vars.case_content}`;
};

Then reference it in promptfoo.yaml:

prompts:
  - file://prompt-wrapper.js

See suites/liquidity-planner/ for a working example.
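
To decide whether a skill needs this treatment, a check like the following may help. Both helpers are hypothetical. The guard against an embedded `endraw` tag reflects the fact that such a tag inside the content would close the raw block early.

```javascript
// True when the content contains the opening of a Nunjucks block tag.
function containsBlockTagOpener(content) {
  return content.includes('{%');
}

// Wraps content in a raw block so Nunjucks leaves it untouched.
function wrapRaw(content) {
  if (/\{%-?\s*endraw\s*-?%\}/.test(content)) {
    throw new Error('Content already contains an endraw tag; it cannot be wrapped safely');
  }
  return `{% raw %}${content}{% endraw %}`;
}

const skillContent = 'https://example.test/?q={%22feeAmount%22:3000}';
console.log(containsBlockTagOpener(skillContent), wrapRaw(skillContent));
```

The example URL is made up for illustration; only the `{%22...%22}` shape matters.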

Troubleshooting

Eval not finding config

Ensure promptfoo.yaml exists in the suite directory.

Authentication errors

Set either ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN environment variable.

Rubric scoring seems off

Review the rubric instructions - LLM judges need clear scoring guidelines.

Evals show multiple prompt columns

If the results table shows two or more prompt columns instead of one, check for --- in your prompt template files. Replace with ***.

Template render error: unknown block tag

The skill content likely contains {% patterns (e.g., URL-encoded JSON {%22key%22}). Use a JavaScript prompt function with {% raw %} blocks instead of a .txt template. See "Common Pitfalls" above.

Cost concerns

Use claude-sonnet-4-5-20250929 instead of claude-opus-4-5-20251101 for routine evals
Use deterministic assertions where possible
Run specific suites instead of all suites during development
