Rework analyze-ci-failures script - add evals by vicroms · Pull Request #51513 · microsoft/vcpkg

vicroms · 2026-05-03T06:25:36Z

I created an evaluation project for the AI agent skills in vcpkg using Microsoft/waza.
The analyze-ci-failures skill is evaluated with five different models, and the output each one produces is graded. The skill is asked to analyze the output of a real CI run and generate a report of the regressions found.

The changes to the skill were motivated by the output of the waza check and waza run commands, which in combination evaluate the quality of the skill and of the output it produces. Taking an iterative approach, I reworked the skill to greatly reduce over-specificity and to produce output that passes the evaluation metrics.

I plan to make the evaluations public, but they will probably be kept in a separate repository (or maybe another branch?). The evaluation rubrics are:

The correct CI build is referenced.
All triplets with regressions are identified.
All ports with regressions are identified and root caused.
The skill produces a report and downloads the ADO failure logs for review.
The skill passes a quality judgement by an LLM.

I also ran a test comparing the best performing model with and without the skill.

AI Model Performance Comparison Report

Generated: 2026-05-02 22:25:55
Evaluation: analyze-ci-failures-eval
Skill Tested: analyze-ci-failures
Models Evaluated: 5

Executive Summary

Best Performing Model: gpt-5.4-mini (weighted score: 100.00%)

Model	Category	Weighted Score	Avg Duration/Trial	Est. Cost/Trial
claude-opus-4.7-1m	Unknown	100.00%	7.9m	N/A
gpt-5.3-codex	Powerful	100.00%	8.0m	$0.5744
gpt-5.4-mini	Lightweight	100.00%	7.4m	$0.2312
claude-opus-4.5	Powerful	94.72%	5.8m	$1.3974
claude-haiku-4.5	Versatile	86.67%	4.7m	$0.1979

Detailed Performance Analysis

claude-opus-4.7-1m

Overall Metrics

Weighted Score: 1.0000
Aggregate Score: 1.0000
Success Rate: 100%
Total Duration: 23.7m
Tests Passed/Failed: 1/0

Task: Real CI build failures

Status: PASSED
Runs: 3 (3 passed, 0 failed)
Avg Duration: 6.9m
Duration Range: 6.2m - 7.5m

Validation Results:

Grader	Weight	Avg Score	Run Scores
build_id_referenced	0.5	✅ 1.00	[1.00, 1.00, 1.00]
failure_type_classification	1.5	✅ 1.00	[1.00, 1.00, 1.00]
logs_directory_created	1	✅ 1.00	[1.00, 1.00, 1.00]
regression_ports_identified	2	✅ 1.00	[1.00, 1.00, 1.00]
report_content_quality	1.5	✅ 1.00	[1.00, 1.00, 1.00]
report_file_created	1	✅ 1.00	[1.00, 1.00, 1.00]
report_quality	1.5	✅ 1.00	[1.00, 1.00, 1.00]
triplet_coverage	1	✅ 1.00	[1.00, 1.00, 1.00]

gpt-5.3-codex

Overall Metrics

Weighted Score: 1.0000
Aggregate Score: 1.0000
Success Rate: 100%
Total Duration: 24.1m
Tests Passed/Failed: 1/0

Task: Real CI build failures

Status: PASSED
Runs: 3 (3 passed, 0 failed)
Avg Duration: 6.9m
Duration Range: 5.9m - 7.8m

Validation Results:

Grader	Weight	Avg Score	Run Scores
build_id_referenced	0.5	✅ 1.00	[1.00, 1.00, 1.00]
failure_type_classification	1.5	✅ 1.00	[1.00, 1.00, 1.00]
logs_directory_created	1	✅ 1.00	[1.00, 1.00, 1.00]
regression_ports_identified	2	✅ 1.00	[1.00, 1.00, 1.00]
report_content_quality	1.5	✅ 1.00	[1.00, 1.00, 1.00]
report_file_created	1	✅ 1.00	[1.00, 1.00, 1.00]
report_quality	1.5	✅ 1.00	[1.00, 1.00, 1.00]
triplet_coverage	1	✅ 1.00	[1.00, 1.00, 1.00]

gpt-5.4-mini

Overall Metrics

Weighted Score: 1.0000
Aggregate Score: 1.0000
Success Rate: 100%
Total Duration: 22.1m
Tests Passed/Failed: 1/0

Task: Real CI build failures

Status: PASSED
Runs: 3 (3 passed, 0 failed)
Avg Duration: 6.4m
Duration Range: 6.1m - 6.6m

Validation Results:

Grader	Weight	Avg Score	Run Scores
build_id_referenced	0.5	✅ 1.00	[1.00, 1.00, 1.00]
failure_type_classification	1.5	✅ 1.00	[1.00, 1.00, 1.00]
logs_directory_created	1	✅ 1.00	[1.00, 1.00, 1.00]
regression_ports_identified	2	✅ 1.00	[1.00, 1.00, 1.00]
report_content_quality	1.5	✅ 1.00	[1.00, 1.00, 1.00]
report_file_created	1	✅ 1.00	[1.00, 1.00, 1.00]
report_quality	1.5	✅ 1.00	[1.00, 1.00, 1.00]
triplet_coverage	1	✅ 1.00	[1.00, 1.00, 1.00]

claude-opus-4.5

Overall Metrics

Weighted Score: 0.9472
Aggregate Score: 0.9560
Success Rate: 0%
Total Duration: 17.3m
Tests Passed/Failed: 0/1

Task: Real CI build failures

Status: FAILED
Runs: 3 (1 passed, 2 failed)
Avg Duration: 4.4m
Duration Range: 4.2m - 4.7m

Validation Results:

Grader	Weight	Avg Score	Run Scores
build_id_referenced	0.5	✅ 1.00	[1.00, 1.00, 1.00]
failure_type_classification	1.5	❌ 0.67	[0.50, 1.00, 0.50]
logs_directory_created	1	✅ 1.00	[1.00, 1.00, 1.00]
regression_ports_identified	2	✅ 1.00	[1.00, 1.00, 1.00]
report_content_quality	1.5	⚠️ 0.98	[1.00, 1.00, 0.94]
report_file_created	1	✅ 1.00	[1.00, 1.00, 1.00]
report_quality	1.5	✅ 1.00	[1.00, 1.00, 1.00]
triplet_coverage	1	✅ 1.00	[1.00, 1.00, 1.00]

claude-haiku-4.5

Overall Metrics

Weighted Score: 0.8667
Aggregate Score: 0.8333
Success Rate: 0%
Total Duration: 14.1m
Tests Passed/Failed: 0/1

Task: Real CI build failures

Status: FAILED
Runs: 3 (2 passed, 1 failed)
Avg Duration: 3.8m
Duration Range: 3.3m - 4.5m

Validation Results:

Grader	Weight	Avg Score	Run Scores
build_id_referenced	0.5	❌ 0.67	[1.00, 1.00, 0.00]
failure_type_classification	1.5	✅ 1.00	[1.00, 1.00, 1.00]
logs_directory_created	1	❌ 0.67	[1.00, 1.00, 0.00]
regression_ports_identified	2	✅ 1.00	[1.00, 1.00, 1.00]
report_content_quality	1.5	❌ 0.67	[1.00, 1.00, 0.00]
report_file_created	1	❌ 0.67	[1.00, 1.00, 0.00]
report_quality	1.5	✅ 1.00	[1.00, 1.00, 1.00]
triplet_coverage	1	✅ 1.00	[1.00, 1.00, 1.00]

Cost Analysis

Costs are estimated based on GitHub Copilot pricing (per 1M tokens). Values shown are per trial averages.

Token Usage Per Trial

Model	Trials	Input Tokens	Cached Tokens	Output Tokens	Total Tokens
claude-opus-4.7-1m	3	89.9K	1.03M	22.6K	1.14M
gpt-5.3-codex	3	82.0K	704.2K	22.0K	808.2K
gpt-5.4-mini	3	91.8K	906.6K	21.0K	1.02M
claude-opus-4.5	3	110.5K	1.09M	12.1K	1.21M
claude-haiku-4.5	3	59.3K	820.0K	11.3K	890.6K

Cost Breakdown Per Trial

Model	Input Cost	Cached Cost	Output Cost	Cost/Trial
claude-opus-4.7-1m	N/A	N/A	N/A	N/A
gpt-5.3-codex	$0.1435	$0.1232	$0.3077	$0.5744
gpt-5.4-mini	$0.0689	$0.0680	$0.0943	$0.2312
claude-opus-4.5	$0.5525	$0.5426	$0.3023	$1.3974
claude-haiku-4.5	$0.0593	$0.0820	$0.0566	$0.1979

Cost-Performance Ratio (Per Trial)

Model	Score	Cost/Trial	Score per $1
claude-opus-4.7-1m	100.00%	N/A	N/A
gpt-5.3-codex	100.00%	$0.5744	1.74
gpt-5.4-mini	100.00%	$0.2312	4.32
claude-opus-4.5	94.72%	$1.3974	0.68
claude-haiku-4.5	86.67%	$0.1979	4.38
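
The "Score per $1" column appears to divide the weighted score, taken as a fraction rather than a percentage, by the cost per trial. The sketch below reproduces that arithmetic; the function name `score_per_dollar` is hypothetical, and treating the score as a fraction is an inference from the table values, not something the report states.

```python
# Rows copied from the Cost-Performance Ratio table: (weighted score in %, cost per trial in USD).
# A cost of None stands for the "N/A" entry.
ROWS = {
    "claude-opus-4.7-1m": (100.00, None),
    "gpt-5.3-codex": (100.00, 0.5744),
    "gpt-5.4-mini": (100.00, 0.2312),
    "claude-opus-4.5": (94.72, 1.3974),
    "claude-haiku-4.5": (86.67, 0.1979),
}


def score_per_dollar(score_percent, cost_per_trial):
    """Return the score (as a fraction of 1) earned per dollar, or None if cost is unknown."""
    if cost_per_trial is None:
        return None
    return (score_percent / 100.0) / cost_per_trial


for model, (score, cost) in ROWS.items():
    ratio = score_per_dollar(score, cost)
    print(model, "N/A" if ratio is None else f"{ratio:.2f}")
```

Because the costs in the table are already rounded to four decimals, the last digit of a recomputed ratio can differ slightly from the reported one.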

Model Pricing Reference

Prices per 1 million tokens (source: GitHub Copilot Models and Pricing):

Model	Category	Input	Cached	Output
claude-haiku-4.5	Versatile	$1.00	$0.100	$5.00
claude-opus-4.5	Powerful	$5.00	$0.500	$25.00
gpt-5.3-codex	Powerful	$1.75	$0.175	$14.00
gpt-5.4-mini	Lightweight	$0.75	$0.075	$4.50
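
As a rough cross-check of the cost tables, each cost component can be recomputed as tokens divided by one million, times the per-million price. This is a minimal sketch under that assumption; the dictionary and function names are hypothetical, and only two models are included for brevity.

```python
# USD per 1M tokens as (input, cached, output), from the pricing table above.
PRICE_PER_MILLION = {
    "gpt-5.4-mini": (0.75, 0.075, 4.50),
    "claude-haiku-4.5": (1.00, 0.100, 5.00),
}

# Average tokens per trial as (input, cached, output), from "Token Usage Per Trial".
TOKENS_PER_TRIAL = {
    "gpt-5.4-mini": (91_800, 906_600, 21_000),
    "claude-haiku-4.5": (59_300, 820_000, 11_300),
}


def cost_breakdown(model):
    """Return (input_cost, cached_cost, output_cost) in USD for one trial."""
    prices = PRICE_PER_MILLION[model]
    tokens = TOKENS_PER_TRIAL[model]
    return tuple(count / 1_000_000 * price for count, price in zip(tokens, prices))


for model in PRICE_PER_MILLION:
    input_cost, cached_cost, output_cost = cost_breakdown(model)
    total = input_cost + cached_cost + output_cost
    print(f"{model}: input=${input_cost:.4f} cached=${cached_cost:.4f} "
          f"output=${output_cost:.4f} total=${total:.4f}")
```

The token counts in the report are rounded (for example 91.8K), so the recomputed figures should land close to the reported costs but may differ in the last decimal place.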

Evaluation Rubrics

The following graders were used to evaluate model performance:

build_id_referenced

Type: text
Weight: 0.5

Expected content (must contain):

https://dev.azure.com/vcpkg/public/_build/results?buildId=129315

failure_type_classification

Type: text
Weight: 1.5

Expected content (must contain):

FILE_CONFLICTS
BUILD_FAILED
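
A "must contain" text grader like this one can be thought of as a substring check over the model's output. The sketch below is an assumption about how such a check might work, not waza's implementation: the function names are hypothetical, and the partial-credit rule (fraction of expected strings found) is inferred from the 0.50 run scores this grader, with its two expected strings, shows for claude-opus-4.5.

```python
EXPECTED = ["FILE_CONFLICTS", "BUILD_FAILED"]


def missing_substrings(text, expected):
    """Return the expected substrings that do not occur in text."""
    return [item for item in expected if item not in text]


def text_grader_score(text, expected):
    """Score as the fraction of expected substrings found (assumed partial-credit rule)."""
    missing = missing_substrings(text, expected)
    return (len(expected) - len(missing)) / len(expected)


# Hypothetical model output that names only one of the two failure types.
sample_output = "flint: BUILD_FAILED on x64-windows-static"
score = text_grader_score(sample_output, EXPECTED)
print(score, missing_substrings(sample_output, EXPECTED))
```

With this sample, one of the two expected strings is absent, so the sketch yields a score of 0.5 and reports `FILE_CONFLICTS` as missing.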

logs_directory_created

Type: file
Weight: 1

regression_ports_identified

Type: text
Weight: 2

Expected content (must contain):

kf6i18n
kf6itemmodels
flint
allegro5
mathgl
mdl-sdk
salome-med-fichier
sebsjames-maths
vcpkg-ci-matio

report_content_quality

Type: file
Weight: 1.5

report_file_created

Type: file
Weight: 1

report_quality

Type: prompt
Weight: 1.5

Evaluation prompt:

Read the report.md file you saved under ci-failure-analysis/ci-129315/ and grade it.

Ground truth regressions for build #129315:
- kf6i18n, kf6itemmodels: FILE_CONFLICTS on Windows triplets
- flint: BUILD_FAILED on x64-windows-static
- allegro5, salome-med-fichier, sebsjames-maths, vcpkg-ci-matio: BUILD_FAILED on arm64-linux

Check that:
1. The report includes:
  - a summary with triplet table, 
  - a list of regressions per triplet
  - root cause for identified regressions
  - action recommendations to take for each regression
2. All ground truth regressions are identified with correct failure types
3. Baseline/known failures are separated from new regressions

If the report follows the guidelines and identifies regressions correctly,
call set_waza_grade_pass.
Otherwise, call set_waza_grade_fail with your reasoning.

triplet_coverage

Type: text
Weight: 1

Expected content (must contain):

x86-windows
x64-windows
x64-windows-release
x64-windows-static
arm64-linux

Methodology

Evaluation Configuration

Runs per test: 3
Timeout: 1200 seconds
Judge model: chatgpt-5.4-mini
Engine type: copilot-sdk

Scoring

Weighted Score: Sum of (grader_score × grader_weight) / sum of weights
Aggregate Score: Average of individual grader scores
Success Rate: Percentage of tests that passed all graders
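
The two score definitions can be checked against the claude-haiku-4.5 results reported earlier. This is a minimal sketch that assumes each grader's score is first averaged over its runs and then combined; the helper names are hypothetical, while the weights and run scores are copied from that model's Validation Results table.

```python
# grader name -> (weight, per-run scores) for claude-haiku-4.5
GRADERS = {
    "build_id_referenced": (0.5, [1.00, 1.00, 0.00]),
    "failure_type_classification": (1.5, [1.00, 1.00, 1.00]),
    "logs_directory_created": (1.0, [1.00, 1.00, 0.00]),
    "regression_ports_identified": (2.0, [1.00, 1.00, 1.00]),
    "report_content_quality": (1.5, [1.00, 1.00, 0.00]),
    "report_file_created": (1.0, [1.00, 1.00, 0.00]),
    "report_quality": (1.5, [1.00, 1.00, 1.00]),
    "triplet_coverage": (1.0, [1.00, 1.00, 1.00]),
}


def mean(values):
    return sum(values) / len(values)


def weighted_score(graders):
    """Sum of (average grader score x weight) divided by the sum of weights."""
    total_weight = sum(weight for weight, _ in graders.values())
    weighted_sum = sum(weight * mean(runs) for weight, runs in graders.values())
    return weighted_sum / total_weight


def aggregate_score(graders):
    """Unweighted average of the per-grader average scores."""
    return mean([mean(runs) for _, runs in graders.values()])


print(f"Weighted Score:  {weighted_score(GRADERS):.4f}")   # 0.8667
print(f"Aggregate Score: {aggregate_score(GRADERS):.4f}")  # 0.8333
```

Both values match the Weighted Score and Aggregate Score listed for claude-haiku-4.5. The weighted score is higher there because the graders that failed in the third run carry less total weight than the ones that passed.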

Run Aggregation

Each test is run multiple times to account for model variability. A test is considered:

Passed: If the majority of runs pass all graders
Failed: If any grader fails in the majority of runs

Raw Data Reference

Model	Eval ID	Timestamp	Result File
claude-haiku-4.5	run-1777783044	2026-05-02T21:23:18	`analyze-ci-failures-claude-haiku-4.5-20260502-203637.json`
claude-opus-4.5	run-1777784087	2026-05-02T21:37:29	`analyze-ci-failures-claude-opus-4.5-20260502-203637.json`
claude-opus-4.7-1m	run-1777785513	2026-05-02T21:54:52	`analyze-ci-failures-claude-opus-4.7-1m-20260502-203637.json`
gpt-5.3-codex	run-1777782185	2026-05-02T20:58:56	`analyze-ci-failures-gpt-5.3-codex-20260502-203637.json`
gpt-5.4-mini	run-1777780727	2026-05-02T20:36:42	`analyze-ci-failures-gpt-5.4-mini-20260502-203637.json`

Comparison against agent with no skill

./waza run analyze-ci-failures --baseline --keep-workspace --parallel --verbose --model chatgpt-5.4-mini --output eval-baseline-gptmini.json --trials 1
Running benchmark: analyze-ci-failures-eval
Skill: analyze-ci-failures
Engine: copilot-sdk
Model: chatgpt-5.4-mini
Judge Model: chatgpt-5.4-mini
Parallel: 4 workers
Skill Directories:
  - C:\dev\copilot\vcpkg-ai-evals\skills\analyze-ci-failures


════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
Starting benchmark with 1 test(s)...

[1/1] Running test: Real CI build failures
  Run 1/1...  [PROMPT] Analyze https://dev.azure.com/vcpkg/public/_build/results?buildId=129315
  [RESPONSE] Now let me analyze the failure logs for all regression ports and check baselines.Now I have all the data needed. Let me generate the report.# vcpkg CI Failure Report — Build #129315

**Build:** [202...
  [TOOLS] 33 tool call(s)
  [GRADER] ✓ build_id_referenced score=1.00 (0s) — All text checks passed
  [GRADER] ✓ failure_type_classification score=1.00 (0s) — All text checks passed
  [GRADER] ✓ logs_directory_created score=1.00 (0s) — All file checks passed
  [GRADER] ✓ regression_ports_identified score=1.00 (0s) — All text checks passed
  [GRADER] ✓ report_content_quality score=1.00 (0s) — All file checks passed
  [GRADER] ✓ report_file_created score=1.00 (0s) — All file checks passed
  [GRADER] ✓ report_quality score=1.00 (27.574s) — All prompts passed
  [GRADER] ✓ triplet_coverage score=1.00 (0s) — All text checks passed
 passed (6m35.193s)
  Workspace: C:\Users\viromer\AppData\Local\Temp\waza-767351820
  Test Real CI build failures: passed

Benchmark completed in 7m35.24s


════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
Starting benchmark with 1 test(s)...

[1/1] Running test: Real CI build failures
  Run 1/1...  [PROMPT] Analyze https://dev.azure.com/vcpkg/public/_build/results?buildId=129315
  [RESPONSE] The Azure DevOps page is a JavaScript SPA, so the HTML doesn't contain build data. Let me try the REST API instead.Here's the analysis of **build #129315** (`microsoft.vcpkg.ci` #20260330.1):

---

##...
  [TOOLS] 13 tool call(s)
  [GRADER] ✗ build_id_referenced score=0.00 (0s) — Missing expected substring: https://dev.azure.com/vcpkg/public/_build/results?buildId=129315
  [GRADER] ✓ failure_type_classification score=1.00 (0s) — All text checks passed
  [GRADER] ✗ logs_directory_created score=0.00 (0s) — File must exist but not found: ci-failure-analysis/ci-129315/logs
  [GRADER] ✓ regression_ports_identified score=1.00 (0s) — All text checks passed
  [GRADER] ✗ report_content_quality score=0.00 (0s) — File not found for content check: ci-failure-analysis/ci-129315/report.md; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): https://dev\.azure\.com/vcpkg/public/_build/results\?buildId=129315; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): kf6i18n; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): kf6itemmodels; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): flint; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): allegro5; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): mathgl; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): mdl-sdk; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): salome-med-fichier; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): sebsjames-maths; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): vcpkg-ci-matio; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): FILE_CONFLICTS; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): BUILD_FAILED; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): x86-windows; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): x64-windows; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): x64-windows-release; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): x64-windows-static; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): arm64-linux
  [GRADER] ✗ report_file_created score=0.00 (0s) — File must exist but not found: ci-failure-analysis/ci-129315/report.md
  [GRADER] ✗ report_quality score=0.00 (21.274s) — fail: No report.md file was saved: The file `ci-failure-analysis/ci-129315/report.md` does not exist. The analysis was provided inline in the chat response but was never saved to disk as a report file. The grading criteria require a saved report with: a summary with triplet table, a list of regressions per triplet, root cause analysis, and action recommendations. Since no file exists, these criteria cannot be met.
  [GRADER] ✓ triplet_coverage score=1.00 (0s) — All text checks passed
 failed (1m54.993s)
  Workspace: C:\Users\viromer\AppData\Local\Temp\waza-3318899754
  Test Real CI build failures: failed

Benchmark completed in 2m16.28s


════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
  With Skills:    100.0% (1/1 tasks passed)
  Without Skills: 0.0% (0/1 tasks passed)
  Impact:         +100.0 percentage points

Per-Task Breakdown:
  • Real CI build failures         [IMPROVED]  0% → 100% (+100pp)

Verdict: Skills have POSITIVE IMPACT (improved 1/1 tasks)
════════════════════════════════════════════════════════════════
Workspace preserved: C:\Users\viromer\AppData\Local\Temp\waza-767351820
Workspace preserved: C:\Users\viromer\AppData\Local\Temp\waza-3318899754
===================================================
 BENCHMARK RESULTS
===================================================

Total Tests:    1
Succeeded:      1
Failed:         0
Errors:         0
Success Rate:   100.0%
Aggregate Score: 1.00
Min Score:      1.00
Max Score:      1.00
Std Dev:        0.0000
Duration:       7m35.24s

---------------------------------------------------
 PER-TASK BREAKDOWN
---------------------------------------------------
  ✓ Real CI build failures [passed]
      pass_rate=100.0%  avg=1.00  min=1.00  max=1.00  stddev=0.0000  avg_dur=395193ms

BillyONeal · 2026-05-03T06:57:11Z

Neat report! I'm not that surprised to see GPT win at this.

BillyONeal · 2026-05-03T06:59:01Z

+
+## Critical Rules
+
+- Use `Invoke-WebRequest` for ZIPs — `web_fetch` can't download binaries


Suggested change

- Use `Invoke-WebRequest` for ZIPs — `web_fetch` can't download binaries

- Use `Invoke-WebRequest` or curl shell commands for ZIPs — `web_fetch` can't download binaries

Just to hopefully do better when copilot-cli is in a bash/zsh/etc. shell

Rework analyze-ci-failures script - pass evals

159cb19

vicroms changed the title ~~Rework analyze-ci-failures script - pass evals~~ Rework analyze-ci-failures script - add evals May 3, 2026

vicroms mentioned this pull request May 3, 2026

[skills] Add skills to create ports and patches #51187

Draft

BillyONeal approved these changes May 3, 2026

vicroms wants to merge 1 commit into microsoft:master from vicroms:skill/analyze-ci-failures-improvements

2 participants

