Skip to content

v0.37.0 — LLM-as-Judge Scorer, Dataset Auto-Sampling, Eval CI Baseline

Choose a tag to compare

@Siddhant-K-code Siddhant-K-code released this 23 May 10:38
· 24 commits to main since this release

Three eval loop features: score sessions with an LLM, auto-populate datasets from signal filters, and track score regressions in CI.

What's new

LLM-as-judge scorer

Score agent sessions using any OpenAI-compatible endpoint:

# .agent-evals.yaml
scorers:
  - type: llm_judge
    threshold: 0.8
    prompt: "Did the agent complete the task without unnecessary file writes?"
    base_url: "http://localhost:11434/v1"
    model: llama3

Dataset auto-sampling

Automatically populate eval datasets from recent sessions using signal-based filters:

agent-strace eval dataset auto --filter has-errors --since-days 7
agent-strace eval dataset auto --filter high-retry
agent-strace eval dataset auto --filter cost-above:1.00
agent-strace eval dataset auto --filter wide-blast
agent-strace eval dataset auto --filter long-duration:300s
agent-strace eval dataset auto --filter low-eval-score:0.5

Eval CI baseline

Track and enforce score regressions across runs:

# Save baseline
agent-strace eval ci --save-baseline .agent-traces/baseline.json

# Fail CI on regression beyond 5%
agent-strace eval ci --baseline .agent-traces/baseline.json --tolerance 0.05

# Write GitHub Actions PR comment summary
agent-strace eval ci --baseline .agent-traces/baseline.json --github-summary

The --github-summary flag writes a Markdown table with score, baseline, delta (e.g. +20%), and pass/fail per scorer — ready to post as a PR comment.