Add static assertions to dotnet-test eval files by Evangelink · Pull Request #507 · dotnet/skills

Evangelink · 2026-04-08T14:13:58Z

Add negative assertions (output_not_matches/output_not_contains) to prevent wrong-direction advice across migration skills (v3→v4 vs v1→v3, VSTest vs MTP, wrong runner for framework)
Tighten overly broad regex patterns in test-anti-patterns and migrate-mstest-v1v2-to-v3 to reduce false positive matches
Add exit_success baseline assertions to crap-score and coverage-analysis scenarios
Add negative assertions to mtp-hot-reload to reject dotnet test when dotnet run is required
Add negative assertions to writing-mstest-tests for async void and swapped Assert.AreEqual argument order
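
As an illustration of how the assertion kinds named above could behave, here is a minimal sketch, not the repository's actual evaluator. The `RunResult` class, the dictionary shape of each assertion, and the sample patterns are all assumptions made for this example; only the names `exit_success`, `output_not_contains`, and `output_not_matches` come from this PR description.

```python
import re
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical container for one evaluated run."""
    exit_code: int
    output: str


def check(assertion: dict, run: RunResult) -> bool:
    """Return True when the run satisfies a single static assertion."""
    kind = assertion["type"]
    if kind == "exit_success":
        # Baseline check: the run must have finished with exit code 0.
        return run.exit_code == 0
    if kind == "output_not_contains":
        # Negative check on a literal substring.
        return assertion["value"] not in run.output
    if kind == "output_not_matches":
        # Negative check on a regular expression.
        return re.search(assertion["pattern"], run.output) is None
    raise ValueError(f"unknown assertion type: {kind}")


def all_pass(assertions: list, run: RunResult) -> bool:
    return all(check(a, run) for a in assertions)


# Hypothetical assertions for a hot-reload scenario: reject `dotnet test`
# advice and reject `async void` test methods.
assertions = [
    {"type": "exit_success"},
    {"type": "output_not_matches", "pattern": r"\bdotnet\s+test\b"},
    {"type": "output_not_contains", "value": "async void"},
]

good = RunResult(0, "Start the tests with `dotnet run` so hot reload can attach.")
bad = RunResult(0, "Re-run `dotnet test` after each edit.")

print(all_pass(assertions, good))
print(all_pass(assertions, bad))
```

Anchoring the pattern with `\b` and `\s+` is one way to keep a regex from matching unrelated text such as `dotnet testing`, which is the kind of false positive that tightening a pattern aims to avoid.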

Evangelink · 2026-04-08T14:14:10Z

/evaluate

github-actions · 2026-04-08T14:31:41Z

Skill Validation Results

Skill	Scenario	Quality	Skills Loaded	Overfit	Verdict
test-anti-patterns	Detect mixed severity anti-patterns in repository service tests	5.0/5 → 5.0/5	✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED	✅ 0.05	✅ [1]
test-anti-patterns	Detect flakiness indicators and test coupling	3.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED	✅ 0.05	✅ [2]
test-anti-patterns	Detect duplicated tests and magic values	3.3/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: report_intent, skill / ✅ test-anti-patterns; writing-mstest-tests; tools: report_intent, skill	✅ 0.05	✅ [3]
test-anti-patterns	Recognize well-written tests without inventing false positives	2.0/5 → 5.0/5 🟢	✅ test-anti-patterns; tools: report_intent, skill	✅ 0.05	✅
migrate-mstest-v3-to-v4	Migrate custom TestMethodAttribute from Execute to ExecuteAsync	3.0/5 ⏰ → 3.0/5 ⏰	✅ migrate-mstest-v3-to-v4; tools: skill / ✅ migrate-mstest-v3-to-v4; tools: skill, edit, bash, report_intent, view	—	❌ [4]
migrate-mstest-v3-to-v4	Replace ExpectedExceptionAttribute with Assert.ThrowsExactly	3.0/5 ⏰ → 3.0/5 ⏰	✅ migrate-mstest-v3-to-v4; tools: skill / ✅ migrate-mstest-v3-to-v4; tools: skill, edit, bash	—	❌ [5]
migrate-mstest-v3-to-v4	Fix multiple v4 breaking changes: Assert, ClassCleanup, TestContext, Timeout	3.0/5 ⏰ → 3.0/5 ⏰	✅ migrate-mstest-v3-to-v4; tools: skill / ✅ migrate-mstest-v3-to-v4; tools: skill, edit, bash	—	✅
migrate-mstest-v3-to-v4	Handle net6.0 target framework dropped in MSTest v4	3.0/5 ⏰ → 3.0/5 ⏰	✅ migrate-mstest-v3-to-v4; tools: skill / ⚠️ NOT ACTIVATED	—	❌ [6]
migrate-mstest-v3-to-v4	Fix TestMethodAttribute CallerInfo constructor breaking change	3.0/5 ⏰ → 3.0/5 ⏰	✅ migrate-mstest-v3-to-v4; tools: skill / ✅ migrate-mstest-v3-to-v4; tools: skill, edit, bash	—	❌ [7]
migrate-mstest-v3-to-v4	Understand behavioral changes after MSTest v4 upgrade	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ migrate-mstest-v3-to-v4; tools: report_intent, skill	—	❌ [8]
migrate-mstest-v3-to-v4	Handle MSTest.Sdk and MTP changes in v4	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ migrate-mstest-v3-to-v4; tools: report_intent, skill	—	❌ [9]
migrate-mstest-v3-to-v4	Full MSTest v3 to v4 migration with multiple breaking changes	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ migrate-mstest-v3-to-v4; tools: report_intent, skill, view	—	❌ [10]
migrate-mstest-v3-to-v4	Migrate MSTest.Sdk v3 project using ManagedType and TestTimeout	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ migrate-mstest-v3-to-v4; tools: skill, report_intent, view, edit, bash	—	❌ [11]
migrate-mstest-v3-to-v4	Correctly identify MSTest v3 project and recommend v4 migration	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ migrate-mstest-v3-to-v4; tools: report_intent, skill, view	—	🟡 [12]
writing-mstest-tests	Write unit tests for a service class	4.3/5 → 4.3/5	✅ writing-mstest-tests; tools: skill / ✅ code-testing-agent; writing-mstest-tests; tools: skill, task	🟡 0.25	❌ [13]
writing-mstest-tests	Write data-driven tests for a calculator	5.0/5 → 5.0/5	✅ writing-mstest-tests; tools: skill	🟡 0.25	❌ [14]
writing-mstest-tests	Write async tests with cancellation	2.3/5 → 5.0/5 🟢	✅ writing-mstest-tests; tools: skill	🟡 0.25	✅
writing-mstest-tests	Fix swapped Assert.AreEqual arguments	5.0/5 → 5.0/5	✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED	🟡 0.25	✅ [15]
writing-mstest-tests	Modernize legacy test patterns	3.3/5 ⏰ → 4.3/5 ⏰ 🟢	✅ writing-mstest-tests; tools: skill, edit	🟡 0.25	✅ [16]
writing-mstest-tests	Replace ExpectedException with Assert.Throws	3.0/5 → 5.0/5 🟢	✅ writing-mstest-tests; tools: skill / ✅ writing-mstest-tests; tools: skill, report_intent	🟡 0.25	✅
writing-mstest-tests	Use proper collection assertions	3.0/5 → 2.3/5 🔴	✅ writing-mstest-tests; tools: skill	🟡 0.25	❌ [17]
writing-mstest-tests	Use proper type assertions instead of casts	2.7/5 → 3.7/5 🟢	✅ writing-mstest-tests; tools: skill	🟡 0.25	❌ [18]
writing-mstest-tests	Set up test lifecycle correctly	2.3/5 → 4.0/5 🟢	✅ writing-mstest-tests; tools: skill	🟡 0.25	✅ [19]
writing-mstest-tests	Use DynamicData with ValueTuples over object arrays	2.7/5 → 3.7/5 🟢	✅ writing-mstest-tests; tools: skill, report_intent / ✅ writing-mstest-tests; tools: skill	🟡 0.25	✅ [20]
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 9)	3.0/5 → 3.0/5 ⏰	✅ mtp-hot-reload; tools: skill	✅ 0.10	❌ [21]
mtp-hot-reload	Suggest hot reload for failing test in MTP project (SDK 10)	1.0/5 → 3.3/5 🟢	✅ mtp-hot-reload; tools: skill, bash, create / ✅ mtp-hot-reload; tools: skill, bash, read_bash, create	✅ 0.10	✅
mtp-hot-reload	Enable hot reload when package already installed	1.7/5 ⏰ → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill, glob / ✅ mtp-hot-reload; tools: skill	✅ 0.10	✅
mtp-hot-reload	Suggest launchSettings.json configuration for hot reload	1.0/5 → 4.0/5 🟢	✅ mtp-hot-reload; tools: skill, create, glob	✅ 0.10	✅
mtp-hot-reload	Use dotnet run not dotnet test for hot reload	2.3/5 → 3.7/5 🟢	✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: skill, report_intent	✅ 0.10	✅ [22]
mtp-hot-reload	Negative: VSTest project cannot use MTP hot reload	2.0/5 → 3.0/5 🟢	✅ mtp-hot-reload; tools: skill, create	✅ 0.10	✅
mtp-hot-reload	Run specific failing test with hot reload filter	1.0/5 → 5.0/5 🟢	✅ mtp-hot-reload; tools: skill, view	✅ 0.10	✅
run-tests	Run tests in a VSTest MSTest project	4.0/5 → 4.3/5 🟢	✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill	—	✅ [23]
run-tests	Run tests with trx reporting on MTP project (SDK 9)	3.0/5 → 3.3/5 ⏰ 🟢	✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob	—	❌ [24]
run-tests	Run tests with blame-hang on MTP project (SDK 10)	2.0/5 → 2.7/5 ⏰ 🟢	✅ run-tests; tools: skill, edit, bash / ✅ run-tests; tools: skill	—	✅
run-tests	Run tests in a multi-TFM project targeting a specific framework	2.0/5 → 4.3/5 🟢	✅ run-tests; tools: skill, glob, bash / ⚠️ NOT ACTIVATED	—	✅
run-tests	Filter MSTest tests by category on VSTest	5.0/5 → 5.0/5	✅ run-tests; tools: skill, glob / ⚠️ NOT ACTIVATED	—	❌ [25]
run-tests	Filter NUnit tests by class name on VSTest	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, glob / ⚠️ NOT ACTIVATED	—	❌ [26]
run-tests	Filter xUnit v3 tests by class on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED	—	✅
run-tests	Filter xUnit v3 tests by trait on MTP	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view	—	✅
run-tests	Filter TUnit tests by class using treenode-filter	1.3/5 → 4.7/5 🟢	✅ run-tests; tools: skill	—	✅
run-tests	Combine multiple filter criteria on VSTest MSTest	4.3/5 → 4.7/5 🟢	✅ run-tests; tools: skill, glob	—	❌ [27]
run-tests	MTP project on SDK 9 must use -- separator for args	1.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill	—	✅
run-tests	MTP project on SDK 10 passes args directly	2.3/5 ⏰ → 4.0/5 🟢	✅ run-tests; tools: skill	—	✅ [28]
run-tests	Detect test platform from Directory.Build.props	2.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill	—	✅ [29]
run-tests	Negative test: do not use MTP syntax for a VSTest project	4.0/5 → 5.0/5 🟢	✅ run-tests; tools: skill, view, glob / ⚠️ NOT ACTIVATED	—	❌ [30]
coverage-analysis	Project-wide coverage analysis with existing Cobertura data	2.3/5 → 4.7/5 🟢	✅ coverage-analysis; tools: skill, create, view	✅ 0.10	✅
coverage-analysis	Run coverage from scratch without existing data	3.7/5 → 5.0/5 🟢	✅ coverage-analysis; tools: skill, create / ✅ coverage-analysis; tools: skill, create, task, glob, read_agent	✅ 0.10	✅
coverage-analysis	Coverage plateau diagnosis	3.0/5 → 4.3/5 🟢	✅ coverage-analysis; tools: skill, create, bash / ✅ coverage-analysis; tools: skill, create, task, glob, bash, read_agent	✅ 0.10	✅
migrate-vstest-to-mtp	Migrate MSTest project from VSTest to Microsoft.Testing.Platform	4.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill, report_intent / ✅ migrate-vstest-to-mtp; tools: skill	—	✅ [31]
migrate-vstest-to-mtp	Migrate NUnit project from VSTest to Microsoft.Testing.Platform	1.7/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill	—	✅
migrate-vstest-to-mtp	Migrate xUnit.net v2 project from VSTest to Microsoft.Testing.Platform	1.7/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill, report_intent, view / ✅ migrate-vstest-to-mtp; tools: report_intent, skill	—	✅
migrate-vstest-to-mtp	Update Azure DevOps pipeline from VSTest task to MTP	3.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill	—	✅
migrate-vstest-to-mtp	Migrate MSTest.Sdk project that explicitly uses VSTest	3.0/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent	—	✅
migrate-vstest-to-mtp	Translate dotnet test VSTest arguments to MTP equivalents	4.3/5 → 5.0/5 🟢	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent	—	✅ [32]
migrate-vstest-to-mtp	Handle exit code 8 when migrating from VSTest to MTP	2.7/5 ⏰ → 3.7/5 🟢	✅ migrate-vstest-to-mtp; tools: skill	—	✅ [33]
migrate-vstest-to-mtp	Configure dotnet test MTP mode on .NET 10 SDK	3.0/5 → 4.3/5 🟢	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, bash	—	✅ [34]
migrate-vstest-to-mtp	Migrate xUnit.net VSTest filter syntax to MTP	3.0/5 ⏰ → 3.0/5 ⏰	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; filter-syntax; tools: skill	—	✅ [35]
migrate-vstest-to-mtp	Full VSTest to MTP migration plan for MSTest solution	3.0/5 ⏰ → 3.0/5 ⏰	✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill	—	❌ [36]
migrate-mstest-v1v2-to-v3	Migrate MSTest v1 project with assembly reference	3.0/5 ⏰ → 3.0/5 ⏰	✅ migrate-mstest-v1v2-to-v3; tools: skill / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view, edit, bash	—	❌ [37]
migrate-mstest-v1v2-to-v3	Migrate MSTest v2 NuGet project to v3	3.0/5 ⏰ → 3.0/5 ⏰	✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view	—	❌ [38]
migrate-mstest-v1v2-to-v3	Fix Assert.AreEqual object overload errors after v3 upgrade	3.0/5 ⏰ → 3.0/5 ⏰	✅ migrate-mstest-v1v2-to-v3; tools: skill / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, view, bash, edit, skill	—	❌ [39]
migrate-mstest-v1v2-to-v3	Migrate from .testsettings to .runsettings	3.0/5 ⏰ → 3.0/5 ⏰	✅ migrate-mstest-v1v2-to-v3; tools: skill / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view, create, bash	—	❌ [40]
migrate-mstest-v1v2-to-v3	Fix DataRow type mismatch errors after v3 upgrade	3.0/5 ⏰ → 3.0/5 ⏰	⚠️ NOT ACTIVATED / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view, bash, edit	—	✅ [41]
migrate-mstest-v1v2-to-v3	Migrate to MSTest.Sdk project style	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, view, skill, edit, bash	—	❌ [42]
migrate-mstest-v1v2-to-v3	Handle dropped target framework during v3 migration	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view	—	❌ [43]
migrate-mstest-v1v2-to-v3	Migrate complex MSTest v2 project with testsettings, DataRow issues, and dropped TFM	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view	—	❌ [44]
migrate-mstest-v1v2-to-v3	Correctly identify MSTest v1 vs v2 and recommend different migration paths	3.0/5 → 3.0/5	⚠️ NOT ACTIVATED / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view	—	❌ [45]
crap-score	Calculate CRAP score for a single method with partial coverage	3.3/5 → 3.7/5 🟢	✅ crap-score; tools: skill, bash, glob / ✅ crap-score; tools: skill	🟡 0.22	✅ [46]
crap-score	Identify riskiest methods across a file	4.0/5 → 5.0/5 🟢	✅ crap-score; tools: skill / ✅ crap-score; tools: skill, glob	🟡 0.22	✅
crap-score	Generate coverage then compute CRAP score	3.0/5 ⏰ → 3.0/5 ⏰	✅ crap-score; tools: skill	🟡 0.22	❌ [47]

[1] ⚠️ High run-to-run variance (CV=1.13) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=1.84) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=0.54) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=0.85) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -5.5% due to: tokens (12104 → 22754), time (124.7s → 180.2s)
[5] ⚠️ High run-to-run variance (CV=1.47) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -15.7% due to: completion (✓ → ✗), tokens (18039 → 24922)
[6] ⚠️ High run-to-run variance (CV=0.64) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -1.3% due to: tokens (18054 → 22833)
[7] ⚠️ High run-to-run variance (CV=13.17) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -1.2% due to: tokens (18079 → 25514)
[8] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[9] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 44513), tool calls (0 → 2), time (1ms → 17.9s)
[10] (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 2ms)
[11] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[12] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5
[13] ⚠️ High run-to-run variance (CV=0.56) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=1.23) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.8% due to: tokens (192447 → 408166), time (128.2s → 164.1s), tool calls (18 → 22)
[15] ⚠️ High run-to-run variance (CV=1.10) — consider re-running with --runs 5
[16] ⚠️ High run-to-run variance (CV=1.50) — consider re-running with --runs 5
[17] ⚠️ High run-to-run variance (CV=2.25) — consider re-running with --runs 5
[18] ⚠️ High run-to-run variance (CV=2.68) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -23.8% due to: judgment, quality, tool calls (0 → 1), tokens (18194 → 23738)
[19] ⚠️ High run-to-run variance (CV=0.58) — consider re-running with --runs 5
[20] ⚠️ High run-to-run variance (CV=0.93) — consider re-running with --runs 5
[21] (Isolated) Quality unchanged but weighted score is -15.0% due to: tokens (138810 → 1107341), errors (0 → 1), tool calls (14 → 51), time (128.5s → 360.3s)
[22] ⚠️ High run-to-run variance (CV=0.63) — consider re-running with --runs 5
[23] ⚠️ High run-to-run variance (CV=0.72) — consider re-running with --runs 5
[24] ⚠️ High run-to-run variance (CV=1233.74) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -21.9% due to: judgment
[25] (Isolated) Quality unchanged but weighted score is -5.8% due to: tokens (31269 → 49413), tool calls (3 → 5), time (22.8s → 33.8s)
[26] ⚠️ High run-to-run variance (CV=1.56) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -7.1% due to: tokens (31151 → 53330), tool calls (3 → 6), time (14.4s → 20.4s)
[27] (Plugin) Quality unchanged but weighted score is -9.8% due to: quality, tokens (31292 → 53746), tool calls (3 → 5), time (19.7s → 24.2s)
[28] ⚠️ High run-to-run variance (CV=0.69) — consider re-running with --runs 5
[29] ⚠️ High run-to-run variance (CV=1.41) — consider re-running with --runs 5
[30] ⚠️ High run-to-run variance (CV=99.26) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.1% due to: efficiency metrics
[31] ⚠️ High run-to-run variance (CV=2.30) — consider re-running with --runs 5
[32] ⚠️ High run-to-run variance (CV=5.75) — consider re-running with --runs 5
[33] ⚠️ High run-to-run variance (CV=1.17) — consider re-running with --runs 5
[34] ⚠️ High run-to-run variance (CV=2.68) — consider re-running with --runs 5
[35] ⚠️ High run-to-run variance (CV=1.48) — consider re-running with --runs 5
[36] ⚠️ High run-to-run variance (CV=2.48) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -5.0% due to: tokens (6108 → 12258)
[37] ⚠️ High run-to-run variance (CV=1.23) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -1.9% due to: tokens (10183 → 16685)
[38] ⚠️ High run-to-run variance (CV=1.45) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -7.5% due to: tokens (6084 → 12734), tool calls (1 → 2)
[39] ⚠️ High run-to-run variance (CV=6.90) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -11.0% due to: completion (✓ → ✗)
[40] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -7.5% due to: tokens (40 → 10368), tool calls (0 → 1)
[41] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5
[42] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[43] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[44] (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 2ms)
[45] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[46] ⚠️ High run-to-run variance (CV=1.84) — consider re-running with --runs 5
[47] ⚠️ High run-to-run variance (CV=2.07) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -0.7% due to: tokens (232740 → 260688)
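
The CV values in the notes above are not defined in this report. Assuming CV means the standard coefficient of variation (sample standard deviation divided by the absolute mean), the sketch below shows why a mean close to zero can produce very large values. The sample numbers are hypothetical, and the quantity the tool actually measures per run is not stated here.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean (assumed definition)."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / abs(mean)


# Hypothetical per-run measurements for two scenarios.
stable = [0.20, 0.22, 0.21]   # runs agree closely
noisy = [0.30, -0.25, 0.01]   # runs disagree and the mean is near zero

print(round(coefficient_of_variation(stable), 3))
print(round(coefficient_of_variation(noisy), 3))
```

Under this assumption, adding runs (for example with `--runs 5`, as the notes suggest) gives a steadier estimate of both the mean and the spread.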

⏰ timeout — run(s) hit the (180s, 240s, 300s, 360s, 480s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

github-actions bot added a commit that referenced this pull request Apr 8, 2026

Update PR token usage data (PR #507)

87f92c5

More constraints

4f3d6a3

Evangelink wants to merge 2 commits into main from dev/amauryleve/tighten-test-eval
