Add delta comparison against LKG baseline in PR evaluation
Problem
Today, the skill evaluation in PRs measures "skill vs no-skill" — whether the skill helps compared to having no skill at all. This catches skills that are net-negative, but it doesn't answer a key question for PR review: did this PR make the skill better or worse than it was before?
A PR could regress a skill's improvement score from 40% to 15% and still pass evaluation, because 15% is above the threshold. Reviewers currently have no automated signal to catch this.
Proposal
Compare the PR's evaluation scores against the last known good (LKG) results from the dashboard data source (the JSON files from the most recent clean main run with no timeouts). Show the delta in the PR comment alongside the existing pass/fail verdict.
Example output:
Skill: aot-compat
Improvement vs no-skill: 22% (✅ pass, threshold 10%)
Delta vs main (LKG): -8% (✅ within tolerance, was 30%)
Skill: aot-compat
Improvement vs no-skill: 22% (✅ pass, threshold 10%)
Delta vs main (LKG): -18% (❌ regression beyond threshold, was 40%)
This should be a blocking gate with a generous threshold to account for cross-session variance (different runs, API conditions, LLM sampling noise). For example, block only if the delta exceeds -15% or falls outside the LKG's confidence interval. This catches real regressions while tolerating normal run-to-run noise. We can tune the threshold over time; for now we need at least some protection for PRs that modify skills.
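The gating rule above could be sketched as follows. This is a minimal sketch, not a settled design: the `LkgResult` shape, the `-15.0` constant, and the `ci_low` field are illustrative assumptions.

```python
# Sketch of the proposed delta gate. LkgResult, DELTA_THRESHOLD, and
# ci_low are illustrative assumptions, not existing names.
from dataclasses import dataclass
from typing import Optional, Tuple

DELTA_THRESHOLD = -15.0  # assumed generous tolerance, in percentage points


@dataclass
class LkgResult:
    improvement: float       # improvement vs no-skill from the last clean main run
    ci_low: Optional[float]  # lower bound of the LKG confidence interval, if recorded


def delta_gate(pr_improvement: float, lkg: Optional[LkgResult]) -> Tuple[bool, str]:
    """Return (blocking, message) for the PR comment."""
    if lkg is None:
        return (False, "new skill — no baseline")
    delta = pr_improvement - lkg.improvement
    outside_ci = lkg.ci_low is not None and pr_improvement < lkg.ci_low
    if delta < DELTA_THRESHOLD or outside_ci:
        return (True, f"Delta vs main (LKG): {delta:+.0f}% "
                      f"(❌ regression beyond threshold, was {lkg.improvement:.0f}%)")
    return (False, f"Delta vs main (LKG): {delta:+.0f}% "
                   f"(✅ within tolerance, was {lkg.improvement:.0f}%)")
```

With the two example skills above, `delta_gate(22.0, LkgResult(30.0, None))` does not block (-8%), while `delta_gate(22.0, LkgResult(40.0, None))` does (-18%).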
Why LKG rather than re-running the old skill
Re-running the previous version of the skill in the same eval pass would give the most accurate comparison, but:
- It doubles API quota usage, which is already a constraint
- LKG data from main already exists and requires zero additional agent runs
- The comparison is still meaningful since eval scenarios and judge prompts are stable
Edge cases to consider
- New skills (no LKG data): Skip delta, show "new skill — no baseline"
- Changed eval scenarios: If tests/eval.yaml changed in the PR, the delta may not be apples-to-apples. Flag this in the comment. In the future, we might re-run evaluation for the old skill in this case.
- LKG had timeouts: Use the most recent result set with no timeouts. If none exists, skip delta.
Context
Discussion in #366 identified this gap.