feat: evaluation framework - score, compare, and regression-test agent sessions (v0.8.0) by Siddhant-K-code · Pull Request #16 · Siddhant-K-code/agent-trace

Siddhant-K-code · 2026-04-04T05:27:02Z

Closes #10

What

Adds `agent-strace eval`, a subcommand for scoring sessions, comparing them across runs, and integrating quality checks into CI.

Commands

# Score a session against configured scorers
agent-strace eval run abc123
agent-strace eval run abc123 --format json

# Compare two sessions side-by-side
agent-strace eval compare session-a session-b

# CI integration - exits 1 if any scorer fails
agent-strace eval ci abc123

# Dataset management
agent-strace eval dataset add --session abc123 --label "fix auth bug"
agent-strace eval dataset list
agent-strace eval dataset export > evals.jsonl
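The command surface above can be mirrored with nested `argparse` subparsers. This is only a sketch of the CLI shape listed in this PR: whether the project uses `argparse`, the `build_parser` name, the `dest` names, and the defaults for `--format` and `--label` are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the `agent-strace eval` commands listed above."""
    parser = argparse.ArgumentParser(prog="agent-strace")
    top = parser.add_subparsers(dest="command", required=True)

    ev = top.add_parser("eval", help="score, compare, and gate sessions")
    ev_sub = ev.add_subparsers(dest="eval_command", required=True)

    run = ev_sub.add_parser("run")
    run.add_argument("session_id")
    # Default of "table" is an assumption; the PR only lists table|json.
    run.add_argument("--format", choices=["table", "json"], default="table")

    compare = ev_sub.add_parser("compare")
    compare.add_argument("session_a")
    compare.add_argument("session_b")

    ci = ev_sub.add_parser("ci")
    ci.add_argument("session_id")

    dataset = ev_sub.add_parser("dataset")
    ds_sub = dataset.add_subparsers(dest="dataset_command", required=True)
    add = ds_sub.add_parser("add")
    add.add_argument("--session", required=True)
    add.add_argument("--label", default="")  # assumed optional
    ds_sub.add_parser("list")
    ds_sub.add_parser("export")
    return parser


args = build_parser().parse_args(["eval", "run", "abc123", "--format", "json"])
print(args.eval_command, args.session_id, args.format)  # run abc123 json
```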

Scorers

The built-in scorers add no new dependencies:

Scorer	What it checks
`no_errors`	No ERROR events in the session
`regex`	Pattern match against any event type
`cost_under`	Estimated cost <= max_dollars
`files_scoped`	All file ops within allowed path prefixes
`duration_under`	Session duration <= max_seconds
`custom`	Any callable returning float in [0, 1]
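As an illustration of the `custom` row, here is a minimal sketch of a callable that returns a float in [0, 1]. The PR does not show what arguments a custom scorer receives, so the event shape (a list of dicts with a `type` key), the `few_tool_calls` scorer, and the `run_custom` wrapper are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical event shape: the PR does not show the real event type.
Event = dict
CustomScorer = Callable[[Sequence[Event]], float]


def few_tool_calls(events: Sequence[Event]) -> float:
    """Illustrative scorer: 1.0 for up to 10 tool calls, then 10 / calls."""
    calls = sum(1 for e in events if e.get("type") == "tool_call")
    return 1.0 if calls <= 10 else 10 / calls


def run_custom(scorer: CustomScorer, events: Sequence[Event]) -> float:
    """Call a custom scorer and enforce the [0, 1] contract from the table."""
    score = float(scorer(events))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"custom scorer returned {score}, expected a value in [0, 1]")
    return score


events = [{"type": "tool_call"}] * 20 + [{"type": "message"}]
print(run_custom(few_tool_calls, events))  # 0.5
```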

Example output

Session: abc123
----------------------------------------------------------------------
  Scorer              Score  Threshold    Status    Reason
----------------------------------------------------------------------
  no_errors            1.00       1.00  v pass    no errors
  cost_under           0.82       1.00  x fail    $0.61 actual > $0.50 limit
  files_scoped         1.00       1.00  v pass    all files within allowed paths
  duration_under       1.00       1.00  v pass    4.0s <= 120.0s
----------------------------------------------------------------------
Overall: 3/4 passed
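The `cost_under` row is consistent with a proportional score of budget divided by actual cost (0.50 / 0.61 ≈ 0.82), with a scorer passing when its score reaches its threshold. The sketch below reproduces that row under those assumptions; the function names are hypothetical and the PR's actual formula is not shown here.

```python
def cost_under_score(actual_dollars: float, max_dollars: float) -> float:
    """Assumed formula: full score within budget, budget/actual once over it."""
    if actual_dollars <= max_dollars:
        return 1.0
    return max_dollars / actual_dollars


def status(score: float, threshold: float) -> str:
    """Assumed pass rule: a scorer passes when score >= threshold."""
    return "pass" if score >= threshold else "fail"


score = cost_under_score(0.61, 0.50)
print(f"{score:.2f} {status(score, 1.0)}")  # 0.82 fail
```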

Config (`.agent-evals.yaml`)

scorers:
  - type: no_errors
    threshold: 1.0
  - type: cost_under
    max_dollars: 0.50
    threshold: 1.0
  - type: files_scoped
    allowed_paths: ["src/", "tests/"]
    threshold: 1.0
  - type: duration_under
    max_seconds: 120
    threshold: 0.8

thresholds:
  pass: 0.85
  warn: 0.70

The config is parsed with a stdlib-only YAML parser, so there is no PyYAML dependency.
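To give a sense of what a stdlib-only parser for this config shape involves, here is a small sketch that handles just the subset used above: top-level sections, a list of flat mappings, a flat mapping, numeric and string scalars, and JSON-style inline lists. It is not the parser in `config.py`; `parse_config` and `parse_scalar` are hypothetical names, and comment stripping is deliberately naive.

```python
import json

SAMPLE = """\
scorers:
  - type: no_errors
    threshold: 1.0
  - type: files_scoped
    allowed_paths: ["src/", "tests/"]
    threshold: 1.0

thresholds:
  pass: 0.85
  warn: 0.70
"""


def parse_scalar(text: str):
    text = text.strip()
    if text.startswith("["):
        return json.loads(text)  # inline list; only JSON-compatible lists are handled
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text.strip("\"'")


def parse_config(src: str) -> dict:
    config, section, item = {}, None, None
    for raw in src.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # naive: breaks on '#' inside values
        if not line.strip():
            continue
        stripped = line.strip()
        if line == stripped:  # no indentation: a new top-level section
            section, item = stripped.rstrip(":"), None
            config[section] = None
            continue
        if stripped.startswith("- "):  # start of a new list item
            if config[section] is None:
                config[section] = []
            item = {}
            config[section].append(item)
            stripped = stripped[2:]
        if item is None and config[section] is None:
            config[section] = {}  # plain nested mapping
        target = item if item is not None else config[section]
        key, _, value = stripped.partition(":")
        target[key.strip()] = parse_scalar(value)
    return config


config = parse_config(SAMPLE)
print(config["thresholds"])  # {'pass': 0.85, 'warn': 0.7}
```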

Implementation

src/agent_trace/eval/
├── __init__.py     # cmd_eval dispatcher
├── scorers.py      # built-in scorer implementations
├── config.py       # .agent-evals.yaml loader (stdlib YAML parser)
├── dataset.py      # dataset CRUD (JSONL-backed)
└── runner.py       # eval execution, formatting, compare, CI

Dataset entries are stored as JSONL in .agent-traces/datasets/. No database, no external service.
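A JSONL-backed dataset needs little more than append and read-back. The sketch below assumes one JSON object per line with `session_id` and `label` fields and a file named `default.jsonl`; the record schema, the file name, and the function names are assumptions, not the contents of `dataset.py`.

```python
import json
import tempfile
from pathlib import Path


def add_entry(path: Path, session_id: str, label: str) -> None:
    """Append one record as a single JSON line (hypothetical schema)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"session_id": session_id, "label": label}) + "\n")


def list_entries(path: Path) -> list:
    """Read every non-empty line back as a dict."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    # Directory layout follows the PR text; the file name is an assumption.
    dataset = Path(tmp) / ".agent-traces" / "datasets" / "default.jsonl"
    add_entry(dataset, "abc123", "fix auth bug")
    entries = list_entries(dataset)
    print(entries)  # [{'session_id': 'abc123', 'label': 'fix auth bug'}]
```

Because each record is a line of its own, an export can be a plain copy of the file to stdout.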

41 new tests cover all scorers, dataset CRUD, runner formatting, compare output, and CI exit codes.

…ions (#10)

Add `agent-strace eval` subcommand with run, compare, ci, and dataset management.

Scorers (zero new dependencies):
- no_errors: 1.0 if no ERROR events
- regex: pattern match against any event type
- cost_under: proportional score against a dollar budget
- files_scoped: all file ops within allowed path prefixes
- duration_under: session duration within a time budget
- custom: any callable returning float in [0, 1]

Commands:
  agent-strace eval run <session-id> [--format table|json]
  agent-strace eval compare <session-a> <session-b>
  agent-strace eval ci <session-id>   # exits 1 on any failure
  agent-strace eval dataset add|list|export

Config via .agent-evals.yaml (stdlib-only YAML parser, no PyYAML).
Dataset storage is local JSONL in .agent-traces/datasets/.

41 new tests covering all scorers, dataset CRUD, runner formatting, compare output, and CI exit codes.

Co-authored-by: Ona <no-reply@ona.com>

config.py: remove unused 'import re' and unused 'first_indent' variable in _parse_block (first_content was the only one read).
dataset.py: remove unused TraceStore import.

Co-authored-by: Ona <no-reply@ona.com>

scorers.py:
- score_regex: wrap re.compile in try/except to return a ScoreResult instead of raising on invalid patterns
- score_files_scoped: replace asymptotic scoring formula with violations/total_ops ratio so 100% violations scores 0.0

config.py:
- ScorerConfig: remove redundant 'weight' field (threshold is the canonical name; weight was never read by the runner)

runner.py:
- EvalReport: make passed/failed properties derived from results list so they stay in sync if results is mutated after construction
- cmd_eval_ci: route table output to stderr for clean CI piping

Co-authored-by: Ona <no-reply@ona.com>
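The review fixes in this commit can be sketched roughly as follows. The names `ScoreResult`, `score_regex`, `score_files_scoped`, and `EvalReport` come from the commit message; their fields, signatures, the `ok` property, the count-valued `passed`/`failed`, and the `ci_exit_code` helper are assumptions made for illustration.

```python
import re
import sys
from dataclasses import dataclass, field


@dataclass
class ScoreResult:
    # Field names are assumptions; the commit only names the class.
    name: str
    score: float
    threshold: float
    reason: str = ""

    @property
    def ok(self) -> bool:
        return self.score >= self.threshold


def score_regex(pattern: str, texts: list, threshold: float = 1.0) -> ScoreResult:
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # An invalid pattern becomes a failing result instead of an exception.
        return ScoreResult("regex", 0.0, threshold, f"invalid pattern: {exc}")
    hit = any(compiled.search(t) for t in texts)
    return ScoreResult("regex", 1.0 if hit else 0.0, threshold, "match" if hit else "no match")


def score_files_scoped(paths: list, allowed: list, threshold: float = 1.0) -> ScoreResult:
    if not paths:
        return ScoreResult("files_scoped", 1.0, threshold, "no file ops")
    violations = [p for p in paths if not p.startswith(tuple(allowed))]
    # Ratio-based: 100% violations scores 0.0, partial violations are proportional.
    score = 1.0 - len(violations) / len(paths)
    return ScoreResult("files_scoped", score, threshold,
                       f"{len(violations)}/{len(paths)} ops outside allowed paths")


@dataclass
class EvalReport:
    results: list = field(default_factory=list)

    @property
    def passed(self) -> int:  # derived on access, so it tracks later mutation
        return sum(1 for r in self.results if r.ok)

    @property
    def failed(self) -> int:
        return len(self.results) - self.passed


def ci_exit_code(report: EvalReport, stream=sys.stderr) -> int:
    """Write the human-readable summary to stderr and return the exit status."""
    print(f"Overall: {report.passed}/{len(report.results)} passed", file=stream)
    return 1 if report.failed else 0


report = EvalReport([score_regex("(", ["x"]), score_files_scoped(["etc/passwd"], ["src/"])])
report.results.append(score_files_scoped(["src/a.py", "docs/b.md"], ["src/", "tests/"], threshold=0.5))
print(report.passed, report.failed)  # 1 2
```

Keeping the table on stderr leaves stdout free for machine-readable output when the command is piped in CI.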

- score_regex: invalid pattern returns ScoreResult (not raises)
- score_files_scoped: all-violations scores 0.0, partial is proportional
- EvalReport.passed/failed: live properties stay in sync after mutation

Co-authored-by: Ona <no-reply@ona.com>

Co-authored-by: Ona <no-reply@ona.com>

agent-strace share -- self-contained HTML replay (#15)
agent-strace postmortem -- failure analysis (#15)
agent-strace eval -- scoring and regression testing (#16)
agent-strace watch -- live monitoring with circuit breakers (#17)

Co-authored-by: Ona <no-reply@ona.com>

Siddhant-K-code mentioned this pull request Apr 4, 2026

feat: wire share, postmortem, eval, and watch into CLI #18

Merged

Siddhant-K-code force-pushed the feat/issue-10-eval-framework branch from bce3b4a to cb157f4 Compare April 5, 2026 09:06

fix: remove unused imports and dead variable in eval package

a6822c3


Siddhant-K-code changed the title ~~feat: evaluation framework — score, compare, and regression-test agent sessions (v0.8.0)~~ feat: evaluation framework - score, compare, and regression-test agent sessions (v0.8.0) Apr 5, 2026

Siddhant-K-code and others added 2 commits April 5, 2026 09:10

test: add coverage for review fixes

ef151d7


Siddhant-K-code added a commit that referenced this pull request Apr 5, 2026

Merge feat/issue-10-eval-framework: evaluation framework (#16)

5035079


Siddhant-K-code merged commit 635247e into main Apr 5, 2026
4 checks passed

Siddhant-K-code deleted the feat/issue-10-eval-framework branch April 5, 2026 09:11

Siddhant-K-code merged 4 commits into main from feat/issue-10-eval-framework