feat: evaluation framework - score, compare, and regression-test agent sessions (v0.8.0)#16

Merged
Siddhant-K-code merged 4 commits into main from feat/issue-10-eval-framework
Apr 5, 2026
@Siddhant-K-code Siddhant-K-code commented Apr 4, 2026

Closes #10

What

Adds agent-strace eval - a subcommand for scoring sessions, comparing them across runs, and integrating quality checks into CI.

Commands

# Score a session against configured scorers
agent-strace eval run abc123
agent-strace eval run abc123 --format json

# Compare two sessions side-by-side
agent-strace eval compare session-a session-b

# CI integration - exits 1 if any scorer fails
agent-strace eval ci abc123

# Dataset management
agent-strace eval dataset add --session abc123 --label "fix auth bug"
agent-strace eval dataset list
agent-strace eval dataset export > evals.jsonl

Scorers

The built-in scorers add no new dependencies:

| Scorer | What it checks |
| --- | --- |
| no_errors | No ERROR events in the session |
| regex | Pattern match against any event type |
| cost_under | Estimated cost <= max_dollars |
| files_scoped | All file ops within allowed path prefixes |
| duration_under | Session duration <= max_seconds |
| custom | Any callable returning float in [0, 1] |
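
To illustrate the custom scorer contract, here is a minimal sketch. The PR only states that a custom scorer is a callable returning a float in [0, 1]; the function name, the argument it receives, and the event shape (dicts with a "type" key) are assumptions made for this example.

```python
from typing import Iterable, Mapping


def non_error_ratio(events: Iterable[Mapping[str, object]]) -> float:
    """Hypothetical custom scorer: fraction of events that are not ERROR events.

    Assumes each event is a mapping with a "type" key; the real event
    schema used by agent-strace may differ.
    """
    events = list(events)
    if not events:
        # An empty session has nothing to penalize.
        return 1.0
    ok = sum(1 for event in events if event.get("type") != "ERROR")
    return ok / len(events)
```

The return value always falls in [0, 1], which is the only requirement the scorer table places on a custom callable.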

Example output

Session: abc123
----------------------------------------------------------------------
  Scorer              Score  Threshold    Status    Reason
----------------------------------------------------------------------
  no_errors            1.00       1.00  v pass    no errors
  cost_under           0.82       1.00  x fail    $0.61 actual > $0.50 limit
  files_scoped         1.00       1.00  v pass    all files within allowed paths
  duration_under       1.00       1.00  v pass    4.0s <= 120.0s
----------------------------------------------------------------------
Overall: 3/4 passed
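
The 0.82 score for cost_under in the output above is consistent with a budget-to-actual ratio ($0.50 / $0.61). The sketch below shows that arithmetic; the exact formula used in scorers.py is not shown in this PR, so treat the function and its name as an assumption.

```python
def proportional_cost_score(actual_dollars: float, max_dollars: float) -> float:
    """Hypothetical proportional score: 1.0 within budget, budget/actual above it."""
    if max_dollars < 0:
        raise ValueError("max_dollars must be non-negative")
    if actual_dollars <= max_dollars:
        return 1.0
    # actual_dollars > max_dollars >= 0 here, so the division is safe.
    return max_dollars / actual_dollars


score = proportional_cost_score(actual_dollars=0.61, max_dollars=0.50)
print(f"{score:.2f}")
```

With a threshold of 1.0, any overspend fails the check, while the score still reports how far over budget the session went.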

Config (.agent-evals.yaml)

scorers:
  - type: no_errors
    threshold: 1.0
  - type: cost_under
    max_dollars: 0.50
    threshold: 1.0
  - type: files_scoped
    allowed_paths: ["src/", "tests/"]
    threshold: 1.0
  - type: duration_under
    max_seconds: 120
    threshold: 0.8

thresholds:
  pass: 0.85
  warn: 0.70

Parsed with a stdlib-only YAML parser - no PyYAML dependency.
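
For readers curious how a stdlib-only parser for this kind of file might look, here is a rough sketch that handles just the subset used above: top-level keys, a list of flat mappings, a nested flat mapping, and scalar or inline-list values. It is not the parser in config.py; the function names are hypothetical and the approach (delegating scalars to `json.loads`) is an assumption chosen for brevity.

```python
import json


def parse_scalar(text: str):
    """Parse numbers, quoted strings and inline lists via JSON; else keep the raw string."""
    text = text.strip()
    try:
        return json.loads(text)
    except ValueError:
        return text


def parse_simple_yaml(source: str) -> dict:
    """Parse a small YAML subset: top-level keys holding lists of mappings or mappings."""
    result = {}
    top_key = None  # current top-level key
    item = None     # current list item (a dict), if inside a list
    for raw in source.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # skip blank lines and full-line comments
        indent = len(raw) - len(raw.lstrip())
        body = raw.strip()
        if indent == 0:
            key, _, value = body.partition(":")
            top_key = key.strip()
            result[top_key] = parse_scalar(value) if value.strip() else None
            item = None
            continue
        if top_key is None:
            raise ValueError("indented line before any top-level key")
        if body.startswith("- "):
            # A new list item starts a fresh mapping.
            if result[top_key] is None:
                result[top_key] = []
            item = {}
            result[top_key].append(item)
            body = body[2:]
            target = item
        elif item is not None:
            target = item  # continuation line of the current list item
        else:
            if result[top_key] is None:
                result[top_key] = {}
            target = result[top_key]
        key, _, value = body.partition(":")
        target[key.strip()] = parse_scalar(value)
    return result


CONFIG = """\
scorers:
  - type: cost_under
    max_dollars: 0.50
    threshold: 1.0
  - type: files_scoped
    allowed_paths: ["src/", "tests/"]
    threshold: 1.0

thresholds:
  pass: 0.85
  warn: 0.70
"""

cfg = parse_simple_yaml(CONFIG)
```

The sketch ignores inline comments, multi-line strings, and deeper nesting, which a production parser would need to handle or reject explicitly.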

Implementation

src/agent_trace/eval/
├── __init__.py     # cmd_eval dispatcher
├── scorers.py      # built-in scorer implementations
├── config.py       # .agent-evals.yaml loader (stdlib YAML parser)
├── dataset.py      # dataset CRUD (JSONL-backed)
└── runner.py       # eval execution, formatting, compare, CI

Dataset entries are stored as JSONL in .agent-traces/datasets/. No database or external service is required.
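
As a rough picture of what JSONL-backed storage involves, the sketch below appends and reads one JSON object per line. The record fields ("session", "label"), the helper names, and the file name are assumptions for illustration; dataset.py may store more or different fields.

```python
import json
from pathlib import Path


def add_entry(dataset_file: Path, session_id: str, label: str) -> None:
    """Append one record as a single JSON line (hypothetical record shape)."""
    dataset_file.parent.mkdir(parents=True, exist_ok=True)
    with dataset_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"session": session_id, "label": label}) + "\n")


def list_entries(dataset_file: Path) -> list[dict]:
    """Read all records back; a missing file is treated as an empty dataset."""
    if not dataset_file.exists():
        return []
    with dataset_file.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending a line per entry keeps writes cheap and makes the export step a plain file copy.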

41 new tests cover all scorers, dataset CRUD, runner formatting, compare output, and CI exit codes.

…ions (#10)

Add `agent-strace eval` subcommand with run, compare, ci, and dataset
management.

Scorers (zero new dependencies):
- no_errors: 1.0 if no ERROR events
- regex: pattern match against any event type
- cost_under: proportional score against a dollar budget
- files_scoped: all file ops within allowed path prefixes
- duration_under: session duration within a time budget
- custom: any callable returning float in [0, 1]

Commands:
  agent-strace eval run <session-id> [--format table|json]
  agent-strace eval compare <session-a> <session-b>
  agent-strace eval ci <session-id>   # exits 1 on any failure
  agent-strace eval dataset add|list|export

Config via .agent-evals.yaml (stdlib-only YAML parser, no PyYAML).
Dataset storage is local JSONL in .agent-traces/datasets/.

41 new tests covering all scorers, dataset CRUD, runner formatting,
compare output, and CI exit codes.

Co-authored-by: Ona <no-reply@ona.com>
@Siddhant-K-code Siddhant-K-code force-pushed the feat/issue-10-eval-framework branch from bce3b4a to cb157f4 Compare April 5, 2026 09:06
config.py: remove unused 'import re' and unused 'first_indent'
variable in _parse_block (first_content was the only one read).

dataset.py: remove unused TraceStore import.

Co-authored-by: Ona <no-reply@ona.com>
@Siddhant-K-code Siddhant-K-code changed the title feat: evaluation framework — score, compare, and regression-test agent sessions (v0.8.0) feat: evaluation framework - score, compare, and regression-test agent sessions (v0.8.0) Apr 5, 2026
Siddhant-K-code and others added 2 commits April 5, 2026 09:10
scorers.py:
- score_regex: wrap re.compile in try/except to return a ScoreResult
  instead of raising on invalid patterns
- score_files_scoped: replace asymptotic scoring formula with
  violations/total_ops ratio so 100% violations scores 0.0

config.py:
- ScorerConfig: remove redundant 'weight' field (threshold is the
  canonical name; weight was never read by the runner)

runner.py:
- EvalReport: make passed/failed properties derived from results list
  so they stay in sync if results is mutated after construction
- cmd_eval_ci: route table output to stderr for clean CI piping

Co-authored-by: Ona <no-reply@ona.com>
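
The two scorers.py changes described in this commit can be sketched as follows. The names score_regex, score_files_scoped, and ScoreResult come from the commit message, but the signatures, the ScoreResult fields, and the reason strings here are assumptions, not the actual code.

```python
import re
from dataclasses import dataclass


@dataclass
class ScoreResult:
    # Field names are hypothetical; the real ScoreResult may carry more data.
    score: float
    reason: str


def score_regex(pattern: str, texts: list[str]) -> ScoreResult:
    """Return a failing result for an invalid pattern instead of raising."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        return ScoreResult(0.0, f"invalid pattern: {exc}")
    matched = any(compiled.search(text) for text in texts)
    return ScoreResult(1.0 if matched else 0.0, "matched" if matched else "no match")


def score_files_scoped(paths: list[str], allowed_prefixes: list[str]) -> ScoreResult:
    """Score as 1 - violations/total_ops, so 100% violations scores 0.0."""
    if not paths:
        return ScoreResult(1.0, "no file operations")
    violations = [p for p in paths if not p.startswith(tuple(allowed_prefixes))]
    score = 1.0 - len(violations) / len(paths)
    reason = (
        "all files within allowed paths"
        if not violations
        else f"{len(violations)}/{len(paths)} file ops outside allowed paths"
    )
    return ScoreResult(score, reason)
```

A ratio-based score reaches exactly 0.0 when every operation is out of scope, which an asymptotic formula by definition never does.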
- score_regex: invalid pattern returns ScoreResult (not raises)
- score_files_scoped: all-violations scores 0.0, partial is proportional
- EvalReport.passed/failed: live properties stay in sync after mutation

Co-authored-by: Ona <no-reply@ona.com>
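
The "live properties" behaviour tested here, together with the CI exit code and stderr routing from the previous commit, can be sketched like this. EvalReport and its passed/failed properties are named in the commit messages; everything else (field names, whether passed/failed are counts, the ci_exit_code helper) is an assumption for illustration.

```python
import sys
from dataclasses import dataclass, field


@dataclass
class ScoreResult:
    # Hypothetical fields; the real class may differ.
    name: str
    score: float
    threshold: float

    @property
    def ok(self) -> bool:
        return self.score >= self.threshold


@dataclass
class EvalReport:
    results: list[ScoreResult] = field(default_factory=list)

    @property
    def passed(self) -> int:
        # Derived on every access, so it reflects later mutations of results.
        return sum(1 for r in self.results if r.ok)

    @property
    def failed(self) -> int:
        return len(self.results) - self.passed


def ci_exit_code(report: EvalReport) -> int:
    """Write the human-readable summary to stderr and return 1 on any failure."""
    print(f"Overall: {report.passed}/{len(report.results)} passed", file=sys.stderr)
    return 1 if report.failed else 0
```

Keeping the summary on stderr leaves stdout free for machine-readable output when the command is piped in CI.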
Siddhant-K-code added a commit that referenced this pull request Apr 5, 2026
@Siddhant-K-code Siddhant-K-code merged commit 635247e into main Apr 5, 2026
4 checks passed
@Siddhant-K-code Siddhant-K-code deleted the feat/issue-10-eval-framework branch April 5, 2026 09:11
Siddhant-K-code added a commit that referenced this pull request Apr 6, 2026
agent-strace share      -- self-contained HTML replay (#15)
agent-strace postmortem -- failure analysis (#15)
agent-strace eval       -- scoring and regression testing (#16)
agent-strace watch      -- live monitoring with circuit breakers (#17)

Co-authored-by: Ona <no-reply@ona.com>
Development

Successfully merging this pull request may close these issues.

v0.8.0: Evaluation framework — score, compare, and regression-test agent sessions