Rework analyze-ci-failures script - add evals #51513

Open
vicroms wants to merge 1 commit into microsoft:master from vicroms:skill/analyze-ci-failures-improvements

@vicroms vicroms commented May 3, 2026

I created an evaluation project for the AI agent skills in vcpkg using Microsoft/waza.
The analyze-ci-failures skill is evaluated using five different models, and the output it produces is graded. The skill is asked to analyze the output of a real CI run and generate a report of the regressions found.

The changes to the skill were motivated by the output of the waza check and waza run commands, which in combination evaluate the quality of the skill and the output it produces. Taking an iterative approach, I reworked the skill to greatly reduce over-specificity and produce output that passes the evaluation metrics.

I plan to make the evaluations public, but they will probably be kept in a separate repository (or maybe another branch?). The evaluation rubrics are:

  • The correct CI build is referenced.
  • All triplets with regressions are identified.
  • All ports with regressions are identified and root caused.
  • The skill produces a report and downloads the ADO failure logs for review.
  • The skill passes a quality judgement by an LLM.

I also ran a test comparing the best performing model with and without the skill.


AI Model Performance Comparison Report

Generated: 2026-05-02 22:25:55
Evaluation: analyze-ci-failures-eval
Skill Tested: analyze-ci-failures
Models Evaluated: 5

Executive Summary

Best Performing Model: gpt-5.4-mini (weighted score: 100.00%)

| Model | Category | Weighted Score | Avg Duration/Trial | Est. Cost/Trial |
|---|---|---|---|---|
| claude-opus-4.7-1m | Unknown | 100.00% | 7.9m | N/A |
| gpt-5.3-codex | Powerful | 100.00% | 8.0m | $0.5744 |
| gpt-5.4-mini | Lightweight | 100.00% | 7.4m | $0.2312 |
| claude-opus-4.5 | Powerful | 94.72% | 5.8m | $1.3974 |
| claude-haiku-4.5 | Versatile | 86.67% | 4.7m | $0.1979 |

Detailed Performance Analysis

claude-opus-4.7-1m

Overall Metrics

  • Weighted Score: 1.0000
  • Aggregate Score: 1.0000
  • Success Rate: 100%
  • Total Duration: 23.7m
  • Tests Passed/Failed: 1/0

Task: Real CI build failures

  • Status: PASSED
  • Runs: 3 (3 passed, 0 failed)
  • Avg Duration: 6.9m
  • Duration Range: 6.2m - 7.5m

Validation Results:

| Grader | Weight | Avg Score | Run Scores |
|---|---|---|---|
| build_id_referenced | 0.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| failure_type_classification | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| logs_directory_created | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| regression_ports_identified | 2 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_content_quality | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_file_created | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_quality | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| triplet_coverage | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |

gpt-5.3-codex

Overall Metrics

  • Weighted Score: 1.0000
  • Aggregate Score: 1.0000
  • Success Rate: 100%
  • Total Duration: 24.1m
  • Tests Passed/Failed: 1/0

Task: Real CI build failures

  • Status: PASSED
  • Runs: 3 (3 passed, 0 failed)
  • Avg Duration: 6.9m
  • Duration Range: 5.9m - 7.8m

Validation Results:

| Grader | Weight | Avg Score | Run Scores |
|---|---|---|---|
| build_id_referenced | 0.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| failure_type_classification | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| logs_directory_created | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| regression_ports_identified | 2 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_content_quality | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_file_created | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_quality | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| triplet_coverage | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |

gpt-5.4-mini

Overall Metrics

  • Weighted Score: 1.0000
  • Aggregate Score: 1.0000
  • Success Rate: 100%
  • Total Duration: 22.1m
  • Tests Passed/Failed: 1/0

Task: Real CI build failures

  • Status: PASSED
  • Runs: 3 (3 passed, 0 failed)
  • Avg Duration: 6.4m
  • Duration Range: 6.1m - 6.6m

Validation Results:

| Grader | Weight | Avg Score | Run Scores |
|---|---|---|---|
| build_id_referenced | 0.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| failure_type_classification | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| logs_directory_created | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| regression_ports_identified | 2 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_content_quality | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_file_created | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_quality | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| triplet_coverage | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |

claude-opus-4.5

Overall Metrics

  • Weighted Score: 0.9472
  • Aggregate Score: 0.9560
  • Success Rate: 0%
  • Total Duration: 17.3m
  • Tests Passed/Failed: 0/1

Task: Real CI build failures

  • Status: FAILED
  • Runs: 3 (1 passed, 2 failed)
  • Avg Duration: 4.4m
  • Duration Range: 4.2m - 4.7m

Validation Results:

| Grader | Weight | Avg Score | Run Scores |
|---|---|---|---|
| build_id_referenced | 0.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| failure_type_classification | 1.5 | ❌ 0.67 | [0.50, 1.00, 0.50] |
| logs_directory_created | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| regression_ports_identified | 2 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_content_quality | 1.5 | ⚠️ 0.98 | [1.00, 1.00, 0.94] |
| report_file_created | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_quality | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| triplet_coverage | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |

claude-haiku-4.5

Overall Metrics

  • Weighted Score: 0.8667
  • Aggregate Score: 0.8333
  • Success Rate: 0%
  • Total Duration: 14.1m
  • Tests Passed/Failed: 0/1

Task: Real CI build failures

  • Status: FAILED
  • Runs: 3 (2 passed, 1 failed)
  • Avg Duration: 3.8m
  • Duration Range: 3.3m - 4.5m

Validation Results:

| Grader | Weight | Avg Score | Run Scores |
|---|---|---|---|
| build_id_referenced | 0.5 | ❌ 0.67 | [1.00, 1.00, 0.00] |
| failure_type_classification | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| logs_directory_created | 1 | ❌ 0.67 | [1.00, 1.00, 0.00] |
| regression_ports_identified | 2 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| report_content_quality | 1.5 | ❌ 0.67 | [1.00, 1.00, 0.00] |
| report_file_created | 1 | ❌ 0.67 | [1.00, 1.00, 0.00] |
| report_quality | 1.5 | ✅ 1.00 | [1.00, 1.00, 1.00] |
| triplet_coverage | 1 | ✅ 1.00 | [1.00, 1.00, 1.00] |

Cost Analysis

Costs are estimated from GitHub Copilot pricing (per 1M tokens). Values shown are per-trial averages.

Token Usage Per Trial

| Model | Trials | Input Tokens | Cached Tokens | Output Tokens | Total Tokens |
|---|---|---|---|---|---|
| claude-opus-4.7-1m | 3 | 89.9K | 1.03M | 22.6K | 1.14M |
| gpt-5.3-codex | 3 | 82.0K | 704.2K | 22.0K | 808.2K |
| gpt-5.4-mini | 3 | 91.8K | 906.6K | 21.0K | 1.02M |
| claude-opus-4.5 | 3 | 110.5K | 1.09M | 12.1K | 1.21M |
| claude-haiku-4.5 | 3 | 59.3K | 820.0K | 11.3K | 890.6K |

Cost Breakdown Per Trial

| Model | Input Cost | Cached Cost | Output Cost | Cost/Trial |
|---|---|---|---|---|
| claude-opus-4.7-1m | N/A | N/A | N/A | N/A |
| gpt-5.3-codex | $0.1435 | $0.1232 | $0.3077 | $0.5744 |
| gpt-5.4-mini | $0.0689 | $0.0680 | $0.0943 | $0.2312 |
| claude-opus-4.5 | $0.5525 | $0.5426 | $0.3023 | $1.3974 |
| claude-haiku-4.5 | $0.0593 | $0.0820 | $0.0566 | $0.1979 |

Cost-Performance Ratio (Per Trial)

| Model | Score | Cost/Trial | Score per $1 |
|---|---|---|---|
| claude-opus-4.7-1m | 100.00% | N/A | N/A |
| gpt-5.3-codex | 100.00% | $0.5744 | 1.74 |
| gpt-5.4-mini | 100.00% | $0.2312 | 4.32 |
| claude-opus-4.5 | 94.72% | $1.3974 | 0.68 |
| claude-haiku-4.5 | 86.67% | $0.1979 | 4.38 |

Model Pricing Reference

Prices per 1 million tokens (source: GitHub Copilot Models and Pricing):

| Model | Category | Input | Cached | Output |
|---|---|---|---|---|
| claude-haiku-4.5 | Versatile | $1.00 | $0.100 | $5.00 |
| claude-opus-4.5 | Powerful | $5.00 | $0.500 | $25.00 |
| gpt-5.3-codex | Powerful | $1.75 | $0.175 | $14.00 |
| gpt-5.4-mini | Lightweight | $0.75 | $0.075 | $4.50 |
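
The per-trial costs in this section appear to come from multiplying each token count by its per-million-token price. The report does not spell out the formula, so the following Python sketch is an assumption; the function name is hypothetical and the inputs are the rounded gpt-5.4-mini figures from the tables in this report.

```python
def cost_per_trial(tokens, prices_per_million):
    """Estimate cost in USD from token counts and per-1M-token prices."""
    return sum(tokens[kind] * prices_per_million[kind] / 1_000_000 for kind in tokens)

# Rounded per-trial figures for gpt-5.4-mini, copied from the tables in this report.
tokens = {"input": 91_800, "cached": 906_600, "output": 21_000}
prices = {"input": 0.75, "cached": 0.075, "output": 4.50}

cost = cost_per_trial(tokens, prices)
# "Score per $1" seems to treat the weighted score as a fraction (100% = 1.0).
score_per_dollar = 1.0 / cost

print(f"cost/trial ~ ${cost:.4f}, score per $1 ~ {score_per_dollar:.2f}")
```

Because the token counts in the table are rounded, the result agrees with the reported $0.2312 only to within a fraction of a cent.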

Evaluation Rubrics

The following graders were used to evaluate model performance:

build_id_referenced

  • Type: text
  • Weight: 0.5

Expected content (must contain):

  • https://dev.azure.com/vcpkg/public/_build/results?buildId=129315

failure_type_classification

  • Type: text
  • Weight: 1.5

Expected content (must contain):

  • FILE_CONFLICTS
  • BUILD_FAILED
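
A text grader of this kind can be pictured as a substring check over the agent's response. The sketch below is an assumption about the behavior, not waza's implementation: it scores the fraction of expected substrings found, which would be consistent with the 0.50 run scores reported for claude-opus-4.5 on this grader, although the report does not confirm how partial credit is computed. The function name is hypothetical.

```python
def text_grader_score(response, must_contain):
    """Return the fraction of expected substrings present in the response."""
    if not must_contain:
        return 1.0
    found = sum(1 for needle in must_contain if needle in response)
    return found / len(must_contain)

expected = ["FILE_CONFLICTS", "BUILD_FAILED"]

# Only one of the two expected failure types is mentioned here.
print(text_grader_score("flint: BUILD_FAILED on x64-windows-static", expected))  # 0.5
```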

logs_directory_created

  • Type: file
  • Weight: 1

regression_ports_identified

  • Type: text
  • Weight: 2

Expected content (must contain):

  • kf6i18n
  • kf6itemmodels
  • flint
  • allegro5
  • mathgl
  • mdl-sdk
  • salome-med-fichier
  • sebsjames-maths
  • vcpkg-ci-matio

report_content_quality

  • Type: file
  • Weight: 1.5

report_file_created

  • Type: file
  • Weight: 1
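
The baseline log later in this PR reports file-grader failures such as "File must exist but not found" and "missing expected pattern", which suggests an existence check followed by regular-expression matches against the file contents. The following Python sketch illustrates that idea under those assumptions; `check_file` and its message strings are hypothetical and are not taken from waza.

```python
import re
from pathlib import Path

def check_file(workspace, relative_path, patterns=()):
    """Return a list of failure messages; an empty list means the check passed."""
    path = Path(workspace) / relative_path
    if not path.exists():
        return [f"File must exist but not found: {relative_path}"]
    # Directories (such as the logs folder) only need to exist.
    if path.is_dir() or not patterns:
        return []
    text = path.read_text(encoding="utf-8", errors="replace")
    return [
        f"File {relative_path} missing expected pattern: {pattern}"
        for pattern in patterns
        if re.search(pattern, text) is None
    ]
```

With no patterns, the function behaves like an existence-only check (as for the logs directory or the report file); with patterns, it behaves like a content check over the report.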

report_quality

  • Type: prompt
  • Weight: 1.5

Evaluation prompt:

Read the report.md file you saved under ci-failure-analysis/ci-129315/ and grade it.

Ground truth regressions for build #129315:
- kf6i18n, kf6itemmodels: FILE_CONFLICTS on Windows triplets
- flint: BUILD_FAILED on x64-windows-static
- allegro5, salome-med-fichier, sebsjames-maths, vcpkg-ci-matio: BUILD_FAILED on arm64-linux

Check that:
1. The report includes:
  - a summary with triplet table, 
  - a list of regressions per triplet
  - root cause for identified regressions
  - action recommendations to take for each regression
2. All ground truth regressions are identified with correct failure types
3. Baseline/known failures are separated from new regressions

If the report follows the guidelines and identifies regressions correctly,
call set_waza_grade_pass.
Otherwise, call set_waza_grade_fail with your reasoning.

triplet_coverage

  • Type: text
  • Weight: 1

Expected content (must contain):

  • x86-windows
  • x64-windows
  • x64-windows-release
  • x64-windows-static
  • arm64-linux

Methodology

Evaluation Configuration

  • Runs per test: 3
  • Timeout: 1200 seconds
  • Judge model: chatgpt-5.4-mini
  • Engine type: copilot-sdk

Scoring

  • Weighted Score: Sum of (grader_score × grader_weight) / sum of weights
  • Aggregate Score: Average of individual grader scores
  • Success Rate: Percentage of tests that passed all graders
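
As a worked example of the first two formulas, the sketch below recomputes the claude-haiku-4.5 scores from its grader table. The dictionary layout and function names are illustrative assumptions, not waza's API; the averages shown as 0.67 in the table are entered as 2/3, matching run scores of [1.00, 1.00, 0.00].

```python
# (weight, average score) per grader, copied from the claude-haiku-4.5 table.
graders = {
    "build_id_referenced": (0.5, 2 / 3),
    "failure_type_classification": (1.5, 1.0),
    "logs_directory_created": (1.0, 2 / 3),
    "regression_ports_identified": (2.0, 1.0),
    "report_content_quality": (1.5, 2 / 3),
    "report_file_created": (1.0, 2 / 3),
    "report_quality": (1.5, 1.0),
    "triplet_coverage": (1.0, 1.0),
}

def weighted_score(graders):
    """Sum of (score x weight) divided by the sum of weights."""
    total_weight = sum(weight for weight, _ in graders.values())
    return sum(weight * score for weight, score in graders.values()) / total_weight

def aggregate_score(graders):
    """Unweighted mean of the individual grader scores."""
    return sum(score for _, score in graders.values()) / len(graders)

print(f"{weighted_score(graders):.4f}")   # 0.8667
print(f"{aggregate_score(graders):.4f}")  # 0.8333
```

Both values match the Weighted Score and Aggregate Score reported for claude-haiku-4.5.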

Run Aggregation

Each test is run multiple times to account for model variability. A test is considered:

  • Passed: If the majority of runs pass all graders
  • Failed: If any grader fails in the majority of runs

Raw Data Reference

| Model | Eval ID | Timestamp | Result File |
|---|---|---|---|
| claude-haiku-4.5 | run-1777783044 | 2026-05-02T21:23:18 | analyze-ci-failures-claude-haiku-4.5-20260502-203637.json |
| claude-opus-4.5 | run-1777784087 | 2026-05-02T21:37:29 | analyze-ci-failures-claude-opus-4.5-20260502-203637.json |
| claude-opus-4.7-1m | run-1777785513 | 2026-05-02T21:54:52 | analyze-ci-failures-claude-opus-4.7-1m-20260502-203637.json |
| gpt-5.3-codex | run-1777782185 | 2026-05-02T20:58:56 | analyze-ci-failures-gpt-5.3-codex-20260502-203637.json |
| gpt-5.4-mini | run-1777780727 | 2026-05-02T20:36:42 | analyze-ci-failures-gpt-5.4-mini-20260502-203637.json |

Comparison against agent with no skill

./waza run analyze-ci-failures --baseline --keep-workspace --parallel --verbose --model chatgpt-5.4-mini --output eval-baseline-gptmini.json --trials 1
Running benchmark: analyze-ci-failures-eval
Skill: analyze-ci-failures
Engine: copilot-sdk
Model: chatgpt-5.4-mini
Judge Model: chatgpt-5.4-mini
Parallel: 4 workers
Skill Directories:
  - C:\dev\copilot\vcpkg-ai-evals\skills\analyze-ci-failures


════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
Starting benchmark with 1 test(s)...

[1/1] Running test: Real CI build failures
  Run 1/1...  [PROMPT] Analyze https://dev.azure.com/vcpkg/public/_build/results?buildId=129315
  [RESPONSE] Now let me analyze the failure logs for all regression ports and check baselines.Now I have all the data needed. Let me generate the report.# vcpkg CI Failure Report — Build #129315

**Build:** [202...
  [TOOLS] 33 tool call(s)
  [GRADER] ✓ build_id_referenced score=1.00 (0s) — All text checks passed
  [GRADER] ✓ failure_type_classification score=1.00 (0s) — All text checks passed
  [GRADER] ✓ logs_directory_created score=1.00 (0s) — All file checks passed
  [GRADER] ✓ regression_ports_identified score=1.00 (0s) — All text checks passed
  [GRADER] ✓ report_content_quality score=1.00 (0s) — All file checks passed
  [GRADER] ✓ report_file_created score=1.00 (0s) — All file checks passed
  [GRADER] ✓ report_quality score=1.00 (27.574s) — All prompts passed
  [GRADER] ✓ triplet_coverage score=1.00 (0s) — All text checks passed
 passed (6m35.193s)
  Workspace: C:\Users\viromer\AppData\Local\Temp\waza-767351820
  Test Real CI build failures: passed

Benchmark completed in 7m35.24s


════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
Starting benchmark with 1 test(s)...

[1/1] Running test: Real CI build failures
  Run 1/1...  [PROMPT] Analyze https://dev.azure.com/vcpkg/public/_build/results?buildId=129315
  [RESPONSE] The Azure DevOps page is a JavaScript SPA, so the HTML doesn't contain build data. Let me try the REST API instead.Here's the analysis of **build #129315** (`microsoft.vcpkg.ci` #20260330.1):

---

##...
  [TOOLS] 13 tool call(s)
  [GRADER] ✗ build_id_referenced score=0.00 (0s) — Missing expected substring: https://dev.azure.com/vcpkg/public/_build/results?buildId=129315
  [GRADER] ✓ failure_type_classification score=1.00 (0s) — All text checks passed
  [GRADER] ✗ logs_directory_created score=0.00 (0s) — File must exist but not found: ci-failure-analysis/ci-129315/logs
  [GRADER] ✓ regression_ports_identified score=1.00 (0s) — All text checks passed
  [GRADER] ✗ report_content_quality score=0.00 (0s) — File not found for content check: ci-failure-analysis/ci-129315/report.md; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): https://dev\.azure\.com/vcpkg/public/_build/results\?buildId=129315; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): kf6i18n; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): kf6itemmodels; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): flint; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): allegro5; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): mathgl; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): mdl-sdk; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): salome-med-fichier; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): sebsjames-maths; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): vcpkg-ci-matio; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): FILE_CONFLICTS; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): BUILD_FAILED; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): x86-windows; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): x64-windows; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): x64-windows-release; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): x64-windows-static; File ci-failure-analysis/ci-129315/report.md missing expected pattern (file not found): arm64-linux
  [GRADER] ✗ report_file_created score=0.00 (0s) — File must exist but not found: ci-failure-analysis/ci-129315/report.md
  [GRADER] ✗ report_quality score=0.00 (21.274s) — fail: No report.md file was saved: The file `ci-failure-analysis/ci-129315/report.md` does not exist. The analysis was provided inline in the chat response but was never saved to disk as a report file. The grading criteria require a saved report with: a summary with triplet table, a list of regressions per triplet, root cause analysis, and action recommendations. Since no file exists, these criteria cannot be met.
  [GRADER] ✓ triplet_coverage score=1.00 (0s) — All text checks passed
 failed (1m54.993s)
  Workspace: C:\Users\viromer\AppData\Local\Temp\waza-3318899754
  Test Real CI build failures: failed

Benchmark completed in 2m16.28s


════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
  With Skills:    100.0% (1/1 tasks passed)
  Without Skills: 0.0% (0/1 tasks passed)
  Impact:         +100.0 percentage points

Per-Task Breakdown:
  • Real CI build failures         [IMPROVED]  0% → 100% (+100pp)

Verdict: Skills have POSITIVE IMPACT (improved 1/1 tasks)
════════════════════════════════════════════════════════════════
Workspace preserved: C:\Users\viromer\AppData\Local\Temp\waza-767351820
Workspace preserved: C:\Users\viromer\AppData\Local\Temp\waza-3318899754
===================================================
 BENCHMARK RESULTS
===================================================

Total Tests:    1
Succeeded:      1
Failed:         0
Errors:         0
Success Rate:   100.0%
Aggregate Score: 1.00
Min Score:      1.00
Max Score:      1.00
Std Dev:        0.0000
Duration:       7m35.24s

---------------------------------------------------
 PER-TASK BREAKDOWN
---------------------------------------------------
  ✓ Real CI build failures [passed]
      pass_rate=100.0%  avg=1.00  min=1.00  max=1.00  stddev=0.0000  avg_dur=395193ms

@vicroms vicroms changed the title from "Rework analyze-ci-failures script - pass evals" to "Rework analyze-ci-failures script - add evals" on May 3, 2026
@BillyONeal

Neat report! I'm not that surprised to see GPT win at this.


## Critical Rules

- Use `Invoke-WebRequest` for ZIPs — `web_fetch` can't download binaries

Suggested change
- Use `Invoke-WebRequest` for ZIPs — `web_fetch` can't download binaries
- Use `Invoke-WebRequest` or curl shell commands for ZIPs — `web_fetch` can't download binaries

Just to hopefully do better when copilot-cli is in a bash/zsh/etc. shell
