Add static assertions to dotnet-test eval files #507

Draft
Evangelink wants to merge 2 commits into main from dev/amauryleve/tighten-test-eval

@Evangelink
Member

  • Add negative assertions (output_not_matches/output_not_contains) to prevent wrong-direction advice across migration skills (v3→v4 vs v1→v3, VSTest vs MTP, wrong runner for framework)
  • Tighten overly broad regex patterns in test-anti-patterns and migrate-mstest-v1v2-to-v3 to reduce false positive matches
  • Add exit_success baseline assertions to crap-score and coverage-analysis scenarios
  • Add negative assertions to mtp-hot-reload to reject dotnet test when dotnet run is required
  • Add negative assertions to writing-mstest-tests for async void and swapped Assert.AreEqual argument order
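
To make the assertion types named above concrete, here is a minimal Python sketch of how negative assertions and a tightened regex might behave. The assertion names `output_not_contains`, `output_not_matches`, and `exit_success` come from this PR description; the dictionary layout, the `check_assertions` helper, and the sample patterns are hypothetical and do not reflect the actual eval file schema or validator code.

```python
import re


def check_assertions(output, exit_code, assertions):
    """Return the types of the assertions that failed (empty list = all passed).

    Hypothetical helper: the real validator's logic is not shown in this PR.
    """
    failures = []
    for assertion in assertions:
        kind = assertion["type"]
        if kind == "exit_success":
            passed = exit_code == 0
        elif kind == "output_not_contains":
            # Negative assertion: fail when the forbidden text appears.
            passed = assertion["value"] not in output
        elif kind == "output_not_matches":
            # Negative assertion: fail when the forbidden pattern matches.
            passed = re.search(assertion["pattern"], output) is None
        else:
            raise ValueError(f"unsupported assertion type: {kind}")
        if not passed:
            failures.append(kind)
    return failures


# Hypothetical assertion set; the dict layout is an assumption.
EXAMPLE_ASSERTIONS = [
    {"type": "exit_success"},
    {"type": "output_not_contains", "value": "dotnet test"},
    {"type": "output_not_matches", "pattern": r"\basync\s+void\b"},
]

good = "Start the test app with `dotnet run` so hot reload can attach."
bad = "Run `dotnet test` and declare the method as `public async void Load()`."

good_failures = check_assertions(good, 0, EXAMPLE_ASSERTIONS)
bad_failures = check_assertions(bad, 0, EXAMPLE_ASSERTIONS)

# An overly broad pattern matches unrelated text; the tightened one does not.
BROAD = r"async.*void"
TIGHT = r"\basync\s+void\b"
sample = "public async Task SaveAsync() avoids blocking the caller"
broad_hit = re.search(BROAD, sample) is not None
tight_hit = re.search(TIGHT, sample) is not None
```

In this sketch the broad pattern fires on `sample` because "avoids" contains "void", which is the kind of false-positive match that word boundaries and explicit whitespace are meant to rule out.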

@Evangelink
Member Author

/evaluate

github-actions bot added a commit that referenced this pull request Apr 8, 2026
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| test-anti-patterns | Detect mixed severity anti-patterns in repository service tests | 5.0/5 → 5.0/5 | ✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED | ✅ 0.05 | [1] |
| test-anti-patterns | Detect flakiness indicators and test coupling | 3.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: report_intent, skill / ⚠️ NOT ACTIVATED | ✅ 0.05 | [2] |
| test-anti-patterns | Detect duplicated tests and magic values | 3.3/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: report_intent, skill / ✅ test-anti-patterns; writing-mstest-tests; tools: report_intent, skill | ✅ 0.05 | [3] |
| test-anti-patterns | Recognize well-written tests without inventing false positives | 2.0/5 → 5.0/5 🟢 | ✅ test-anti-patterns; tools: report_intent, skill | ✅ 0.05 | |
| migrate-mstest-v3-to-v4 | Migrate custom TestMethodAttribute from Execute to ExecuteAsync | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ migrate-mstest-v3-to-v4; tools: skill / ✅ migrate-mstest-v3-to-v4; tools: skill, edit, bash, report_intent, view | | [4] |
| migrate-mstest-v3-to-v4 | Replace ExpectedExceptionAttribute with Assert.ThrowsExactly | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ migrate-mstest-v3-to-v4; tools: skill / ✅ migrate-mstest-v3-to-v4; tools: skill, edit, bash | | [5] |
| migrate-mstest-v3-to-v4 | Fix multiple v4 breaking changes: Assert, ClassCleanup, TestContext, Timeout | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ migrate-mstest-v3-to-v4; tools: skill / ✅ migrate-mstest-v3-to-v4; tools: skill, edit, bash | | |
| migrate-mstest-v3-to-v4 | Handle net6.0 target framework dropped in MSTest v4 | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ migrate-mstest-v3-to-v4; tools: skill / ⚠️ NOT ACTIVATED | | [6] |
| migrate-mstest-v3-to-v4 | Fix TestMethodAttribute CallerInfo constructor breaking change | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ migrate-mstest-v3-to-v4; tools: skill / ✅ migrate-mstest-v3-to-v4; tools: skill, edit, bash | | [7] |
| migrate-mstest-v3-to-v4 | Understand behavioral changes after MSTest v4 upgrade | 3.0/5 → 3.0/5 | ⚠️ NOT ACTIVATED / ✅ migrate-mstest-v3-to-v4; tools: report_intent, skill | | [8] |
| migrate-mstest-v3-to-v4 | Handle MSTest.Sdk and MTP changes in v4 | 3.0/5 → 3.0/5 | ⚠️ NOT ACTIVATED / ✅ migrate-mstest-v3-to-v4; tools: report_intent, skill | | [9] |
| migrate-mstest-v3-to-v4 | Full MSTest v3 to v4 migration with multiple breaking changes | 3.0/5 → 3.0/5 | ⚠️ NOT ACTIVATED / ✅ migrate-mstest-v3-to-v4; tools: report_intent, skill, view | | [10] |
| migrate-mstest-v3-to-v4 | Migrate MSTest.Sdk v3 project using ManagedType and TestTimeout | 3.0/5 → 3.0/5 | ⚠️ NOT ACTIVATED / ✅ migrate-mstest-v3-to-v4; tools: skill, report_intent, view, edit, bash | | [11] |
| migrate-mstest-v3-to-v4 | Correctly identify MSTest v3 project and recommend v4 migration | 3.0/5 → 3.0/5 | ⚠️ NOT ACTIVATED / ✅ migrate-mstest-v3-to-v4; tools: report_intent, skill, view | 🟡 | [12] |
| writing-mstest-tests | Write unit tests for a service class | 4.3/5 → 4.3/5 | ✅ writing-mstest-tests; tools: skill / ✅ code-testing-agent; writing-mstest-tests; tools: skill, task | 🟡 0.25 | [13] |
| writing-mstest-tests | Write data-driven tests for a calculator | 5.0/5 → 5.0/5 | ✅ writing-mstest-tests; tools: skill | 🟡 0.25 | [14] |
| writing-mstest-tests | Write async tests with cancellation | 2.3/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: skill | 🟡 0.25 | |
| writing-mstest-tests | Fix swapped Assert.AreEqual arguments | 5.0/5 → 5.0/5 | ✅ writing-mstest-tests; tools: skill / ⚠️ NOT ACTIVATED | 🟡 0.25 | [15] |
| writing-mstest-tests | Modernize legacy test patterns | 3.3/5 ⏰ → 4.3/5 ⏰ 🟢 | ✅ writing-mstest-tests; tools: skill, edit | 🟡 0.25 | [16] |
| writing-mstest-tests | Replace ExpectedException with Assert.Throws | 3.0/5 → 5.0/5 🟢 | ✅ writing-mstest-tests; tools: skill / ✅ writing-mstest-tests; tools: skill, report_intent | 🟡 0.25 | |
| writing-mstest-tests | Use proper collection assertions | 3.0/5 → 2.3/5 🔴 | ✅ writing-mstest-tests; tools: skill | 🟡 0.25 | [17] |
| writing-mstest-tests | Use proper type assertions instead of casts | 2.7/5 → 3.7/5 🟢 | ✅ writing-mstest-tests; tools: skill | 🟡 0.25 | [18] |
| writing-mstest-tests | Set up test lifecycle correctly | 2.3/5 → 4.0/5 🟢 | ✅ writing-mstest-tests; tools: skill | 🟡 0.25 | [19] |
| writing-mstest-tests | Use DynamicData with ValueTuples over object arrays | 2.7/5 → 3.7/5 🟢 | ✅ writing-mstest-tests; tools: skill, report_intent / ✅ writing-mstest-tests; tools: skill | 🟡 0.25 | [20] |
| mtp-hot-reload | Suggest hot reload for failing test in MTP project (SDK 9) | 3.0/5 → 3.0/5 ⏰ | ✅ mtp-hot-reload; tools: skill | ✅ 0.10 | [21] |
| mtp-hot-reload | Suggest hot reload for failing test in MTP project (SDK 10) | 1.0/5 → 3.3/5 🟢 | ✅ mtp-hot-reload; tools: skill, bash, create / ✅ mtp-hot-reload; tools: skill, bash, read_bash, create | ✅ 0.10 | |
| mtp-hot-reload | Enable hot reload when package already installed | 1.7/5 ⏰ → 5.0/5 🟢 | ✅ mtp-hot-reload; tools: skill, glob / ✅ mtp-hot-reload; tools: skill | ✅ 0.10 | |
| mtp-hot-reload | Suggest launchSettings.json configuration for hot reload | 1.0/5 → 4.0/5 🟢 | ✅ mtp-hot-reload; tools: skill, create, glob | ✅ 0.10 | |
| mtp-hot-reload | Use dotnet run not dotnet test for hot reload | 2.3/5 → 3.7/5 🟢 | ✅ mtp-hot-reload; tools: skill / ✅ mtp-hot-reload; tools: skill, report_intent | ✅ 0.10 | [22] |
| mtp-hot-reload | Negative: VSTest project cannot use MTP hot reload | 2.0/5 → 3.0/5 🟢 | ✅ mtp-hot-reload; tools: skill, create | ✅ 0.10 | |
| mtp-hot-reload | Run specific failing test with hot reload filter | 1.0/5 → 5.0/5 🟢 | ✅ mtp-hot-reload; tools: skill, view | ✅ 0.10 | |
| run-tests | Run tests in a VSTest MSTest project | 4.0/5 → 4.3/5 🟢 | ✅ run-tests; tools: skill, glob / ✅ run-tests; tools: skill | | [23] |
| run-tests | Run tests with trx reporting on MTP project (SDK 9) | 3.0/5 → 3.3/5 ⏰ 🟢 | ✅ run-tests; tools: skill / ✅ run-tests; tools: skill, glob | | [24] |
| run-tests | Run tests with blame-hang on MTP project (SDK 10) | 2.0/5 → 2.7/5 ⏰ 🟢 | ✅ run-tests; tools: skill, edit, bash / ✅ run-tests; tools: skill | | |
| run-tests | Run tests in a multi-TFM project targeting a specific framework | 2.0/5 → 4.3/5 🟢 | ✅ run-tests; tools: skill, glob, bash / ⚠️ NOT ACTIVATED | | |
| run-tests | Filter MSTest tests by category on VSTest | 5.0/5 → 5.0/5 | ✅ run-tests; tools: skill, glob / ⚠️ NOT ACTIVATED | | [25] |
| run-tests | Filter NUnit tests by class name on VSTest | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, glob / ⚠️ NOT ACTIVATED | | [26] |
| run-tests | Filter xUnit v3 tests by class on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view, bash / ⚠️ NOT ACTIVATED | | |
| run-tests | Filter xUnit v3 tests by trait on MTP | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view | | |
| run-tests | Filter TUnit tests by class using treenode-filter | 1.3/5 → 4.7/5 🟢 | ✅ run-tests; tools: skill | | |
| run-tests | Combine multiple filter criteria on VSTest MSTest | 4.3/5 → 4.7/5 🟢 | ✅ run-tests; tools: skill, glob | | [27] |
| run-tests | MTP project on SDK 9 must use -- separator for args | 1.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill | | |
| run-tests | MTP project on SDK 10 passes args directly | 2.3/5 ⏰ → 4.0/5 🟢 | ✅ run-tests; tools: skill | | [28] |
| run-tests | Detect test platform from Directory.Build.props | 2.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill | | [29] |
| run-tests | Negative test: do not use MTP syntax for a VSTest project | 4.0/5 → 5.0/5 🟢 | ✅ run-tests; tools: skill, view, glob / ⚠️ NOT ACTIVATED | | [30] |
| coverage-analysis | Project-wide coverage analysis with existing Cobertura data | 2.3/5 → 4.7/5 🟢 | ✅ coverage-analysis; tools: skill, create, view | ✅ 0.10 | |
| coverage-analysis | Run coverage from scratch without existing data | 3.7/5 → 5.0/5 🟢 | ✅ coverage-analysis; tools: skill, create / ✅ coverage-analysis; tools: skill, create, task, glob, read_agent | ✅ 0.10 | |
| coverage-analysis | Coverage plateau diagnosis | 3.0/5 → 4.3/5 🟢 | ✅ coverage-analysis; tools: skill, create, bash / ✅ coverage-analysis; tools: skill, create, task, glob, bash, read_agent | ✅ 0.10 | |
| migrate-vstest-to-mtp | Migrate MSTest project from VSTest to Microsoft.Testing.Platform | 4.0/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill, report_intent / ✅ migrate-vstest-to-mtp; tools: skill | | [31] |
| migrate-vstest-to-mtp | Migrate NUnit project from VSTest to Microsoft.Testing.Platform | 1.7/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill | | |
| migrate-vstest-to-mtp | Migrate xUnit.net v2 project from VSTest to Microsoft.Testing.Platform | 1.7/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill, report_intent, view / ✅ migrate-vstest-to-mtp; tools: report_intent, skill | | |
| migrate-vstest-to-mtp | Update Azure DevOps pipeline from VSTest task to MTP | 3.0/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill | | |
| migrate-vstest-to-mtp | Migrate MSTest.Sdk project that explicitly uses VSTest | 3.0/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent | | |
| migrate-vstest-to-mtp | Translate dotnet test VSTest arguments to MTP equivalents | 4.3/5 → 5.0/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, report_intent | | [32] |
| migrate-vstest-to-mtp | Handle exit code 8 when migrating from VSTest to MTP | 2.7/5 ⏰ → 3.7/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill | | [33] |
| migrate-vstest-to-mtp | Configure dotnet test MTP mode on .NET 10 SDK | 3.0/5 → 4.3/5 🟢 | ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: skill, bash | | [34] |
| migrate-vstest-to-mtp | Migrate xUnit.net VSTest filter syntax to MTP | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; filter-syntax; tools: skill | | [35] |
| migrate-vstest-to-mtp | Full VSTest to MTP migration plan for MSTest solution | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ migrate-vstest-to-mtp; tools: skill / ✅ migrate-vstest-to-mtp; tools: report_intent, skill | | [36] |
| migrate-mstest-v1v2-to-v3 | Migrate MSTest v1 project with assembly reference | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ migrate-mstest-v1v2-to-v3; tools: skill / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view, edit, bash | | [37] |
| migrate-mstest-v1v2-to-v3 | Migrate MSTest v2 NuGet project to v3 | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view | | [38] |
| migrate-mstest-v1v2-to-v3 | Fix Assert.AreEqual object overload errors after v3 upgrade | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ migrate-mstest-v1v2-to-v3; tools: skill / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, view, bash, edit, skill | | [39] |
| migrate-mstest-v1v2-to-v3 | Migrate from .testsettings to .runsettings | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ migrate-mstest-v1v2-to-v3; tools: skill / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view, create, bash | | [40] |
| migrate-mstest-v1v2-to-v3 | Fix DataRow type mismatch errors after v3 upgrade | 3.0/5 ⏰ → 3.0/5 ⏰ | ⚠️ NOT ACTIVATED / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view, bash, edit | | [41] |
| migrate-mstest-v1v2-to-v3 | Migrate to MSTest.Sdk project style | 3.0/5 → 3.0/5 | ⚠️ NOT ACTIVATED / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, view, skill, edit, bash | | [42] |
| migrate-mstest-v1v2-to-v3 | Handle dropped target framework during v3 migration | 3.0/5 → 3.0/5 | ⚠️ NOT ACTIVATED / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view | | [43] |
| migrate-mstest-v1v2-to-v3 | Migrate complex MSTest v2 project with testsettings, DataRow issues, and dropped TFM | 3.0/5 → 3.0/5 | ⚠️ NOT ACTIVATED / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view | | [44] |
| migrate-mstest-v1v2-to-v3 | Correctly identify MSTest v1 vs v2 and recommend different migration paths | 3.0/5 → 3.0/5 | ⚠️ NOT ACTIVATED / ✅ migrate-mstest-v1v2-to-v3; tools: report_intent, skill, view | | [45] |
| crap-score | Calculate CRAP score for a single method with partial coverage | 3.3/5 → 3.7/5 🟢 | ✅ crap-score; tools: skill, bash, glob / ✅ crap-score; tools: skill | 🟡 0.22 | [46] |
| crap-score | Identify riskiest methods across a file | 4.0/5 → 5.0/5 🟢 | ✅ crap-score; tools: skill / ✅ crap-score; tools: skill, glob | 🟡 0.22 | |
| crap-score | Generate coverage then compute CRAP score | 3.0/5 ⏰ → 3.0/5 ⏰ | ✅ crap-score; tools: skill | 🟡 0.22 | [47] |

[1] ⚠️ High run-to-run variance (CV=1.13) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=1.84) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=0.54) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=0.85) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -5.5% due to: tokens (12104 → 22754), time (124.7s → 180.2s)
[5] ⚠️ High run-to-run variance (CV=1.47) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -15.7% due to: completion (✓ → ✗), tokens (18039 → 24922)
[6] ⚠️ High run-to-run variance (CV=0.64) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -1.3% due to: tokens (18054 → 22833)
[7] ⚠️ High run-to-run variance (CV=13.17) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -1.2% due to: tokens (18079 → 25514)
[8] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[9] (Plugin) Quality unchanged but weighted score is -5.0% due to: tokens (0 → 44513), tool calls (0 → 2), time (1ms → 17.9s)
[10] (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 2ms)
[11] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[12] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5
[13] ⚠️ High run-to-run variance (CV=0.56) — consider re-running with --runs 5
[14] ⚠️ High run-to-run variance (CV=1.23) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -2.8% due to: tokens (192447 → 408166), time (128.2s → 164.1s), tool calls (18 → 22)
[15] ⚠️ High run-to-run variance (CV=1.10) — consider re-running with --runs 5
[16] ⚠️ High run-to-run variance (CV=1.50) — consider re-running with --runs 5
[17] ⚠️ High run-to-run variance (CV=2.25) — consider re-running with --runs 5
[18] ⚠️ High run-to-run variance (CV=2.68) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -23.8% due to: judgment, quality, tool calls (0 → 1), tokens (18194 → 23738)
[19] ⚠️ High run-to-run variance (CV=0.58) — consider re-running with --runs 5
[20] ⚠️ High run-to-run variance (CV=0.93) — consider re-running with --runs 5
[21] (Isolated) Quality unchanged but weighted score is -15.0% due to: tokens (138810 → 1107341), errors (0 → 1), tool calls (14 → 51), time (128.5s → 360.3s)
[22] ⚠️ High run-to-run variance (CV=0.63) — consider re-running with --runs 5
[23] ⚠️ High run-to-run variance (CV=0.72) — consider re-running with --runs 5
[24] ⚠️ High run-to-run variance (CV=1233.74) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -21.9% due to: judgment
[25] (Isolated) Quality unchanged but weighted score is -5.8% due to: tokens (31269 → 49413), tool calls (3 → 5), time (22.8s → 33.8s)
[26] ⚠️ High run-to-run variance (CV=1.56) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -7.1% due to: tokens (31151 → 53330), tool calls (3 → 6), time (14.4s → 20.4s)
[27] (Plugin) Quality unchanged but weighted score is -9.8% due to: quality, tokens (31292 → 53746), tool calls (3 → 5), time (19.7s → 24.2s)
[28] ⚠️ High run-to-run variance (CV=0.69) — consider re-running with --runs 5
[29] ⚠️ High run-to-run variance (CV=1.41) — consider re-running with --runs 5
[30] ⚠️ High run-to-run variance (CV=99.26) — consider re-running with --runs 5. (Plugin) Quality unchanged but weighted score is -0.1% due to: efficiency metrics
[31] ⚠️ High run-to-run variance (CV=2.30) — consider re-running with --runs 5
[32] ⚠️ High run-to-run variance (CV=5.75) — consider re-running with --runs 5
[33] ⚠️ High run-to-run variance (CV=1.17) — consider re-running with --runs 5
[34] ⚠️ High run-to-run variance (CV=2.68) — consider re-running with --runs 5
[35] ⚠️ High run-to-run variance (CV=1.48) — consider re-running with --runs 5
[36] ⚠️ High run-to-run variance (CV=2.48) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -5.0% due to: tokens (6108 → 12258)
[37] ⚠️ High run-to-run variance (CV=1.23) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -1.9% due to: tokens (10183 → 16685)
[38] ⚠️ High run-to-run variance (CV=1.45) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -7.5% due to: tokens (6084 → 12734), tool calls (1 → 2)
[39] ⚠️ High run-to-run variance (CV=6.90) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -11.0% due to: completion (✓ → ✗)
[40] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -7.5% due to: tokens (40 → 10368), tool calls (0 → 1)
[41] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5
[42] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[43] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[44] (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 2ms)
[45] ⚠️ High run-to-run variance (CV=1.73) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -2.5% due to: time (0ms → 1ms)
[46] ⚠️ High run-to-run variance (CV=1.84) — consider re-running with --runs 5
[47] ⚠️ High run-to-run variance (CV=2.07) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -0.7% due to: tokens (232740 → 260688)
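
The footnotes above report run-to-run variance as a CV value. As a rough illustration only, the sketch below computes the textbook coefficient of variation (sample standard deviation divided by the mean) in Python. The report does not state which per-run metric or formula the validator uses, so `coefficient_of_variation`, `needs_more_runs`, and the `threshold` default of 0.5 are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Textbook CV: sample standard deviation divided by the absolute mean.

    Assumption: the validator's own definition is not given in the report.
    """
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate variance")
    average = mean(values)
    if average == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / abs(average)


def needs_more_runs(values, threshold=0.5):
    """Flag a scenario whose CV exceeds a hypothetical threshold."""
    return coefficient_of_variation(values) > threshold


# Three hypothetical per-run values for one scenario.
stable = [4.0, 4.2, 3.9]
noisy = [0.0, 0.0, 1.0]

stable_cv = coefficient_of_variation(stable)
noisy_cv = coefficient_of_variation(noisy)
```

Under this definition, three runs in which only one value is nonzero give a CV of about 1.73, a figure that also appears in several footnotes; that coincidence does not confirm the tool's formula.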

⏰ timeout: run(s) hit the scenario timeout limit (180s, 240s, 300s, 360s, 480s); scoring may be affected because model execution was aborted before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
