
Add delta comparison against LKG baseline in PR evaluation #412

@danmoseley

Description

Problem

Today, the skill evaluation in PRs measures "skill vs no-skill" — whether the skill helps compared to having no skill at all. This catches skills that are net-negative, but it doesn't answer a key question for PR review: did this PR make the skill better or worse than it was before?

A PR could regress a skill's improvement score from 40% to 15% and still pass evaluation, because 15% is above the threshold. Reviewers currently have no automated signal to catch this.

Proposal

Compare the PR's evaluation scores against the last known good (LKG) results from the dashboard data source (the JSON files from the most recent clean main run with no timeouts). Show the delta in the PR comment alongside the existing pass/fail verdict.

Example output:

Skill: aot-compat
  Improvement vs no-skill: 22% (✅ pass, threshold 10%)
  Delta vs main (LKG):     -8%  (✅ within tolerance, was 30%)

Skill: aot-compat
  Improvement vs no-skill: 22% (✅ pass, threshold 10%)
  Delta vs main (LKG):    -18%  (❌ regression beyond threshold, was 40%)

This should be a blocking gate with a generous threshold to account for cross-session variance (different runs, API conditions, LLM sampling noise). For example, block only if the delta exceeds -15% or falls outside the LKG's confidence interval. This catches real regressions while tolerating normal run-to-run noise. We can tune the threshold over time; for now we need at least some protection for PRs that modify skills.
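One possible reading of the gate (blocking only when the delta exceeds the tolerance and the PR score also falls outside the LKG confidence interval) could be sketched as below. `LkgResult`, the field names, and the -15% constant are all assumptions for illustration:

```python
from dataclasses import dataclass

# Assumed generous tolerance; block only below a -15% delta.
REGRESSION_THRESHOLD = -0.15


@dataclass
class LkgResult:
    improvement: float  # e.g. 0.40 for a 40% improvement vs no-skill
    ci_low: float       # lower bound of the LKG confidence interval
    ci_high: float      # upper bound of the LKG confidence interval


def is_regression(pr_improvement: float, lkg: LkgResult) -> bool:
    """Block only when the delta breaches the tolerance AND the PR score
    falls outside the LKG confidence interval, so ordinary run-to-run
    noise does not fail the gate."""
    delta = pr_improvement - lkg.improvement
    outside_ci = not (lkg.ci_low <= pr_improvement <= lkg.ci_high)
    return delta < REGRESSION_THRESHOLD and outside_ci
```

A stricter variant could block on either criterion alone; which reading to adopt is part of the tuning mentioned above.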

Why LKG rather than re-running the old skill

Re-running the previous version of the skill in the same eval pass would give the most accurate comparison, but:

  • It doubles API quota usage, which is already a constraint
  • LKG data from main already exists and requires zero additional agent runs
  • The comparison is still meaningful since eval scenarios and judge prompts are stable

Edge cases to consider

  • New skills (no LKG data): Skip delta, show "new skill — no baseline"
  • Changed eval scenarios: If tests/eval.yaml changed in the PR, the delta may not be apples-to-apples. Flag this in the comment. In the future, we might re-run evaluation for the old skill in this case.
  • LKG had timeouts: Use the most recent result set with no timeouts. If none exists, skip delta.

Context

Discussion in #366 identified this gap.
