Guide: Adding a New Evaluator to the OpenAgents Framework

This guide explains how to create, register, and test a new Evaluator inside the evaluation framework located at evals/framework. It focuses on validating agent behaviors without coupling to internal implementation details.

1. Design Principles

Layer separation: Do not read directly from disk or the SDK inside the evaluator. Use only the provided timeline and sessionInfo.
Logical purity: An evaluator transforms TimelineEvent[] into an EvaluationResult; avoid side effects.
Traceable evidence: Each check must produce concrete evidence (phrases, data, timestamps) for auditability.
Transparent scoring: Use explicit check weights; avoid hidden logic.
Safe extensibility: Do not modify existing evaluators when adding a new one—just register yours.

2. Evaluator Anatomy

Starting from existing examples (approval-gate-evaluator.ts, tool-usage-evaluator.ts), the minimal structure:

export class MyNewEvaluator extends BaseEvaluator {
  name = 'my-new-rule';
  description = 'Brief description of the rule enforced';

  async evaluate(timeline: TimelineEvent[], sessionInfo: SessionInfo): Promise<EvaluationResult> {
    const checks: Check[] = [];
    const violations: Violation[] = [];
    const evidence: Evidence[] = [];

    // 1. Collect relevant events
    const toolCalls = this.getToolCalls(timeline);

    // 2. Apply rule logic
    // Example: count usage of a forbidden tool
    const forbidden = toolCalls.filter(e => e.data?.tool === 'bash' && /* lógica */ false);

    // 3. Register checks
    checks.push({
      name: 'no-forbidden-bash',
      passed: forbidden.length === 0,
      weight: 40,
      evidence: [
        this.createEvidence('bash-usage-summary', 'Resumen de llamadas bash', { count: forbidden.length })
      ]
    });

    // 4. Register violations
    if (forbidden.length > 0) {
      violations.push(
        this.createViolation('forbidden-bash', 'error', 'Uso de bash no permitido', forbidden[0].timestamp, {
          occurrences: forbidden.length
        })
      );
    }

    // 5. Additional contextual evidence
    evidence.push(
      this.createEvidence('session-meta', 'Información básica de sesión', { title: sessionInfo.title })
    );

    // 6. Build result
    return this.buildResult(this.name, checks, violations, evidence, {
      forbiddenCount: forbidden.length
    });
  }
}

3. Available Utilities (BaseEvaluator)

Method	Purpose
`getToolCalls(timeline)`	Extracts `tool_call` events.
`getToolCallsByName(timeline, name)`	Filters by a specific tool.
`getExecutionTools(timeline)`	Execution tools: bash/write/edit/task.
`getReadTools(timeline)`	Read tools: read/glob/grep/list.
`getAssistantMessages(timeline)`	Assistant messages (includes type `text`).
`getUserMessages(timeline)`	User messages.
`getEventsBefore/After(timeline, ts)`	Temporal navigation.
`detectApprovalRequest(text)`	Enhanced approval language detector.
`createEvidence(id, description, data?, timestamp?)`	Standardizes evidence.
`createViolation(code, severity, message, timestamp, data?)`	Creates traceable violation.
`buildResult(name, checks, violations, evidence, meta?)`	Assembles `EvaluationResult`.

4. Check Best Practices

Semantic name: e.g. approval-before-write, context-loaded-before-test.
Proportional weight: Add up to 100 across checks or justify a different total.
Minimum evidence: At least one evidence entry per check.
Optional meta: Use meta in buildResult to return aggregated metrics (e.g. approvalLatencyMs).

5. Registering the Evaluator

Depending on how the runner is instantiated:

A. Manual Registration when Building `EvaluatorRunner`

import { MyNewEvaluator } from './evaluators/my-new-evaluator';

const runner = new EvaluatorRunner({
  sessionReader,
  timelineBuilder,
  evaluators: [
    new ApprovalGateEvaluator(),
    new ContextLoadingEvaluator(),
    new MyNewEvaluator(), // <-- aquí
  ]
});

B. Dynamic Registration (factory approach)

runner.register(new MyNewEvaluator());

Confirm execution:

npm run eval:sdk -- --agent=openagent --debug

Check console output: Running evaluator: my-new-rule....

6. Add YAML Tests

Use the advanced schema (see test-design-guide.md). Example:

id: my-new-evaluator-positive-001
name: "MyNewEvaluator: Positive case"
agent: openagent
prompt: |
  Explica el README sin ejecutar comandos.
behavior:
  mustNotUseTools: [bash]
expectedViolations:
  - rule: my-new-rule
    shouldViolate: false
    severity: error
approvalStrategy:
  type: auto-approve

Negativo:

id: my-new-evaluator-negative-001
name: "MyNewEvaluator: Violation"
agent: openagent
prompt: |
  Lista archivos y luego ejecuta un script sin pedir aprobación.
behavior:
  mustUseTools: [bash]
expectedViolations:
  - rule: my-new-rule
    shouldViolate: true
    severity: error

7. Local Validation

# Build
cd evals/framework
npm run build

# Run only your tests (pattern)
npm run eval:sdk -- --agent=openagent --pattern="developer/my-new-evaluator-*.yaml" --debug

Inspect violations and score in output. Adjust weights if global scoring feels unbalanced.

8. Quality Review Checklist (before PR)

Unique name without collision.
Concise description.
No direct fs / sdk access (only timeline + sessionInfo).
Evidence IDs consistent (approval-check, tool-execution, etc.).
Positive and negative tests added.
Reasonable total check weighting.
No duplicated logic already covered elsewhere.

9. Recommended Patterns

Latency: Measure timestamp deltas between request and execution (timeDiffMs).
Sequence: Verify ordering (e.g. context before write).
Frequency: Limit excess (e.g. > N grep in a row).
Quality: Detect suboptimal tool choice (bash vs read).
Safety: Flag destructive commands (rm -rf).

10. Example Metrics in `meta`

{
  "totalExecutionTools": 3,
  "approvalLatencyAvgMs": 1245,
  "forbiddenCount": 0,
  "readToWriteRatio": 2.5
}

Useful for dashboards or historical comparisons.

11. Future Extensions

Ideas:

Create a profiled compliance evaluator reading parameters (e.g. expected severities) from a policy file at .opencode/context/standards/policy.md detected in the timeline.
Add severity normalization (warning → error if repeated > X times).

12. Common Pitfalls

Error	Cause	Fix
Score 0	Empty `checks`	Add at least one base check.
Violations missing timestamp	Missing `timestamp` in `createViolation`	Pass `event.timestamp`.
Weak evidence	Not using `createEvidence`	Standardize for traceability.
Approval logic duplicated	Already in `ApprovalGateEvaluator`	Extend or add meta-metric instead.

13. Complete Example (Simple Evaluator)

import { BaseEvaluator } from './base-evaluator.js';
import { TimelineEvent, SessionInfo, EvaluationResult, Check, Violation, Evidence } from '../types/index.js';

export class ExecutionBalanceEvaluator extends BaseEvaluator {
  name = 'execution-balance';
  description = 'Evaluates balance between read and execution actions before modifying files';

  async evaluate(timeline: TimelineEvent[], sessionInfo: SessionInfo): Promise<EvaluationResult> {
    const checks: Check[] = [];
    const violations: Violation[] = [];
    const evidence: Evidence[] = [];

    const readCalls = this.getReadTools(timeline);
    const execCalls = this.getExecutionTools(timeline);

    const ratio = readCalls.length === 0 ? 0 : readCalls.length / Math.max(1, execCalls.length);

    checks.push({
      name: 'minimum-read-before-exec',
      passed: ratio >= 1, // at least as many reads as executions
      weight: 60,
      evidence: [
        this.createEvidence('read-exec-ratio', 'Read/exec ratio', { read: readCalls.length, exec: execCalls.length, ratio })
      ]
    });

    if (ratio < 1 && execCalls.length > 0) {
      violations.push(
        this.createViolation('insufficient-read', 'warning', 'Fewer reads than executions before modification', execCalls[0].timestamp, { read: readCalls.length, exec: execCalls.length })
      );
    }

    evidence.push(
      this.createEvidence('session-title', 'Session context', { title: sessionInfo.title })
    );

    return this.buildResult(this.name, checks, violations, evidence, { ratio, readCount: readCalls.length, execCount: execCalls.length });
  }
}

14. Next Steps

Implement your evaluator in src/evaluators/.
Register it when building the EvaluatorRunner.
Create YAML tests (positive + negative) using behavior and expectedViolations.
Run with a pattern to validate.
Adjust weights and severities.
Open a PR referencing this guide.

Real Example Added

This repository includes execution-balance-evaluator.ts exported from src/index.ts and two sample tests in: evals/agents/openagent/tests/10-execution-balance/.

Violation patterns used:

execution-before-read (error)
insufficient-read (warning)

To run only those tests (assuming environment and credentials are set up):

cd evals/framework
npm run eval:sdk -- --agent=openagent --pattern="10-execution-balance/*.yaml" --debug

If you want add more metricts to dashboard, add the meta (ratio, readBeforeExec) in your external metrics reports system.

End of the guide.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Guide: Adding a New Evaluator to the OpenAgents Framework

1. Design Principles

2. Evaluator Anatomy

3. Available Utilities (BaseEvaluator)

4. Check Best Practices

5. Registering the Evaluator

A. Manual Registration when Building `EvaluatorRunner`

B. Dynamic Registration (factory approach)

6. Add YAML Tests

7. Local Validation

8. Quality Review Checklist (before PR)

9. Recommended Patterns

10. Example Metrics in `meta`

11. Future Extensions

12. Common Pitfalls

13. Complete Example (Simple Evaluator)

14. Next Steps

Real Example Added

FilesExpand file tree

ADDING_EVALUATOR.md

Latest commit

History

ADDING_EVALUATOR.md

File metadata and controls

Guide: Adding a New Evaluator to the OpenAgents Framework

1. Design Principles

2. Evaluator Anatomy

3. Available Utilities (BaseEvaluator)

4. Check Best Practices

5. Registering the Evaluator

A. Manual Registration when Building EvaluatorRunner

B. Dynamic Registration (factory approach)

6. Add YAML Tests

7. Local Validation

8. Quality Review Checklist (before PR)

9. Recommended Patterns

10. Example Metrics in meta

11. Future Extensions

12. Common Pitfalls

13. Complete Example (Simple Evaluator)

14. Next Steps

Real Example Added

A. Manual Registration when Building `EvaluatorRunner`

10. Example Metrics in `meta`