Add delta comparison against LKG baseline in PR evaluation
Problem
Today, the skill evaluation in PRs measures "skill vs no-skill" — whether the skill helps compared to having no skill at all. This catches skills that are net-negative, but it doesn't answer a key question for PR review: did this PR make the skill better or worse than it was before?
A PR could regress a skill's improvement score from 40% to 15% and still pass evaluation, because 15% is above the threshold. Reviewers currently have no automated signal to catch this.
Proposal
Compare the PR's evaluation scores against the last known good (LKG) results from the dashboard data source (the JSON files from the most recent clean main run with no timeouts). Show the delta in the PR comment alongside the existing pass/fail verdict.
Example output:
Skill: aot-compat
Improvement vs no-skill: 22% (✅ pass, threshold 10%)
Delta vs main (LKG): -8% (✅ within tolerance, was 30%)
Skill: aot-compat
Improvement vs no-skill: 22% (✅ pass, threshold 10%)
Delta vs main (LKG): -18% (❌ regression beyond threshold, was 40%)
This should be a blocking gate with a generous threshold to account for cross-session variance (different runs, API conditions, LLM sampling noise). For example, block only if the delta exceeds -15% or falls outside the LKG's confidence interval. This catches real regressions while tolerating normal run-to-run noise. We can tune the threshold over time; for now we need at least some protection for PRs that modify skills.
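The gating rule above could be sketched as follows. This is a minimal sketch, not a settled design: the `LkgResult` shape, the `-15.0` constant, and the `ci_low` field are illustrative assumptions.

```python
# Sketch of the proposed delta gate. LkgResult, DELTA_THRESHOLD, and
# ci_low are illustrative assumptions, not existing names.
from dataclasses import dataclass
from typing import Optional, Tuple

DELTA_THRESHOLD = -15.0  # assumed generous tolerance, in percentage points


@dataclass
class LkgResult:
    improvement: float       # improvement vs no-skill from the last clean main run
    ci_low: Optional[float]  # lower bound of the LKG confidence interval, if recorded


def delta_gate(pr_improvement: float, lkg: Optional[LkgResult]) -> Tuple[bool, str]:
    """Return (blocking, message) for the PR comment."""
    if lkg is None:
        return (False, "new skill — no baseline")
    delta = pr_improvement - lkg.improvement
    outside_ci = lkg.ci_low is not None and pr_improvement < lkg.ci_low
    if delta < DELTA_THRESHOLD or outside_ci:
        return (True, f"Delta vs main (LKG): {delta:+.0f}% "
                      f"(❌ regression beyond threshold, was {lkg.improvement:.0f}%)")
    return (False, f"Delta vs main (LKG): {delta:+.0f}% "
                   f"(✅ within tolerance, was {lkg.improvement:.0f}%)")
```

With the two example skills above, `delta_gate(22.0, LkgResult(30.0, None))` does not block (-8%), while `delta_gate(22.0, LkgResult(40.0, None))` does (-18%).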
Why LKG rather than re-running the old skill
Re-running the previous version of the skill in the same eval pass would give the most accurate comparison, but:
- It doubles API quota usage, which is already a constraint
- LKG data from main already exists and requires zero additional agent runs
- The comparison is still meaningful since eval scenarios and judge prompts are stable
Edge cases to consider
- New skills (no LKG data): Skip delta, show "new skill — no baseline"
- Changed eval scenarios: If tests/eval.yaml changed in the PR, the delta may not be apples-to-apples. Flag this in the comment. In the future, we might re-run evaluation for the old skill in this case.
- LKG had timeouts: Use the most recent result set with no timeouts. If none exists, skip delta.
Context
Discussion in #366 identified this gap.