Rework analyze-ci-failures script - add evals#51513
Open
vicroms wants to merge 1 commit into microsoft:master from
Conversation
BillyONeal (Member) approved these changes on May 3, 2026

> Neat report! I'm not that surprised to see GPT win at this.
> ## Critical Rules
>
> - Use `Invoke-WebRequest` for ZIPs — `web_fetch` can't download binaries
Member

Suggested change:

> - Use `Invoke-WebRequest` for ZIPs — `web_fetch` can't download binaries

becomes

> - Use `Invoke-WebRequest` or curl shell commands for ZIPs — `web_fetch` can't download binaries

Just to hopefully do better when copilot-cli is in a bash/zsh/etc. shell
I created an evaluation project for the AI agent skills in vcpkg using Microsoft/waza.

The `analyze-ci-failures` skill is evaluated using five different models and grading the produced output. The skill is asked to analyze the output of a real CI run and generate a report of the regressions found.

The changes to the skill were motivated by the output of the `waza check` and `waza run` commands; in combination these evaluate the quality of the skill and the output it produces. Taking an iterative approach, the skill was reworked to greatly reduce over-specificity and produce output that can pass the evaluation metrics.

I plan to make the evaluations public, but they will probably be kept in a separate repository (or maybe another branch?). The evaluation rubrics are:
I also ran a test comparing the best-performing model with and without the skill.
AI Model Performance Comparison Report
Generated: 2026-05-02 22:25:55
Evaluation: analyze-ci-failures-eval
Skill Tested: analyze-ci-failures
Models Evaluated: 5
Executive Summary
Best Performing Model: gpt-5.4-mini (weighted score: 100.00%)
Detailed Performance Analysis
claude-opus-4.7-1m
Overall Metrics
Task: Real CI build failures
Validation Results:
gpt-5.3-codex
Overall Metrics
Task: Real CI build failures
Validation Results:
gpt-5.4-mini
Overall Metrics
Task: Real CI build failures
Validation Results:
claude-opus-4.5
Overall Metrics
Task: Real CI build failures
Validation Results:
claude-haiku-4.5
Overall Metrics
Task: Real CI build failures
Validation Results:
Cost Analysis
Costs are estimated based on GitHub Copilot pricing (per 1M tokens). Values shown are per trial averages.
Token Usage Per Trial
Cost Breakdown Per Trial
Cost-Performance Ratio (Per Trial)
Model Pricing Reference
Prices per 1 million tokens (source: GitHub Copilot Models and Pricing):
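The per-trial cost figures above follow from the per-1M-token prices. A minimal sketch of that computation (the model name, prices, and token counts below are illustrative placeholders, not figures from this report):

```python
# Hypothetical per-trial cost estimate from per-1M-token prices.
# The model name, prices, and token counts are placeholders for
# illustration, not the actual values from this report.

PRICES_PER_1M = {  # USD per 1M tokens: (input, output)
    "example-model": (3.00, 15.00),
}

def trial_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one trial in USD."""
    in_price, out_price = PRICES_PER_1M[model]
    return (input_tokens / 1_000_000) * in_price + \
           (output_tokens / 1_000_000) * out_price

print(round(trial_cost("example-model", 200_000, 10_000), 4))  # 0.75
```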
Evaluation Rubrics
The following graders were used to evaluate model performance:
build_id_referenced
Expected content (must contain):
https://dev.azure.com/vcpkg/public/_build/results?buildId=129315

failure_type_classification
Expected content (must contain):
FILE_CONFLICTS
BUILD_FAILED

logs_directory_created
regression_ports_identified
Expected content (must contain):
kf6i18n
kf6itemmodels
flint
allegro5
mathgl
mdl-sdk
salome-med-fichier
sebsjames-maths
vcpkg-ci-matio

report_content_quality
report_file_created
report_quality
Evaluation prompt:
triplet_coverage
Expected content (must contain):
x86-windows
x64-windows
x64-windows-release
x64-windows-static
arm64-linux

Methodology
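Most of the rubrics above are "must contain" graders. A minimal sketch of how such a grader could work (the function name is illustrative; the actual waza grader implementation may differ), using the triplet_coverage expectations as sample data:

```python
# Sketch of a "must contain" grader: the generated report passes
# only if every expected string appears somewhere in it.
# The function name and report text are illustrative assumptions.

def must_contain(report: str, expected: list[str]) -> bool:
    """Return True when every expected substring occurs in the report."""
    return all(item in report for item in expected)

triplets = ["x86-windows", "x64-windows", "x64-windows-release",
            "x64-windows-static", "arm64-linux"]

report = ("Regressions found on x86-windows, x64-windows, "
          "x64-windows-release, x64-windows-static and arm64-linux.")
print(must_contain(report, triplets))  # True
```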
Evaluation Configuration
Scoring
Run Aggregation
Each test is run multiple times to account for model variability. A test is considered:
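The multi-run aggregation described above could be sketched as follows; the status names and thresholds here are assumptions for illustration, not the report's actual criteria:

```python
# Illustrative aggregation of repeated trials into a single status.
# The "pass"/"flaky"/"fail" labels and thresholds are assumptions
# for this sketch, not the report's actual criteria.

def aggregate(trial_results: list[bool]) -> str:
    """Collapse per-trial pass/fail booleans into one status."""
    passed = sum(trial_results)
    if passed == len(trial_results):
        return "pass"    # every trial succeeded
    if passed == 0:
        return "fail"    # every trial failed
    return "flaky"       # mixed results across trials

print(aggregate([True, True, True]))   # pass
print(aggregate([True, False, True]))  # flaky
```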
Raw Data Reference
analyze-ci-failures-claude-haiku-4.5-20260502-203637.json
analyze-ci-failures-claude-opus-4.5-20260502-203637.json
analyze-ci-failures-claude-opus-4.7-1m-20260502-203637.json
analyze-ci-failures-gpt-5.3-codex-20260502-203637.json
analyze-ci-failures-gpt-5.4-mini-20260502-203637.json

Comparison against agent with no skill