Commit a38e0bb

PureWeen and Copilot authored
Add eval.yaml for verify-tests-fail-without-fix skill (dotnet#34815)
> [!NOTE]
> Are you waiting for the changes in this PR to be merged?
> It would be very helpful if you could <a href="https://github.com/dotnet/maui/wiki/Testing-PR-Builds">test the resulting artifacts</a> from this PR and let us know in a comment if this change resolves your issue. Thank you!

## Summary

Adds eval.yaml for the `verify-tests-fail-without-fix` skill, enabling empirical A/B validation via skill-validator.

### Context

- This is an internal orchestrator-invoked skill used by `pr-review` to verify tests catch bugs
- Follows eval best practices established during the try-fix evaluation cycle (PR dotnet#34807)
- Part of eval coverage expansion tracked in issue dotnet#34814

### Eval Design

- **6 scenarios** covering both verification modes, negative trigger, edge cases, regressions
- **0 `output_contains`** -- rubric-based behavioral assertions only (no vocabulary overfitting)
- **14 `output_not_contains`** -- anti-pattern guards for common mistakes
- **1 `expect_activation: false`** -- native spec field for negative trigger
- Realistic timeouts (60s-900s depending on scenario complexity)

### Scenarios

1. **Happy path: full verification** -- Tests two-phase workflow (fail without fix, pass with fix)
2. **Happy path: verify failure only** -- Tests test-creation mode (no fix needed)
3. **Negative trigger** -- Documentation question should not invoke verification
4. **Regression: semantic inversion** -- Tests passing without fix = FAILED verification (not success!)
5. **Edge case: no test files** -- PR without tests can't be verified
6. **Regression: no manual git commands** -- Script handles file revert/restore, not raw git

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent d15ca06 commit a38e0bb

2 files changed, with 226 additions and 0 deletions

.github/skills/verify-tests-fail-without-fix/SKILL.md

Lines changed: 45 additions & 0 deletions
@@ -11,6 +11,51 @@ compatibility: Requires git, PowerShell, and .NET SDK for building and running t
 
 Verifies UI tests actually catch the issue. Supports two workflow modes:
 
+## Activation Guard
+
+🛑 **This skill ONLY verifies that existing tests reproduce a bug.** Do NOT activate for:
+- Writing new tests → use write-tests-agent
+- Running tests without verification context → use run-device-tests
+- Code review → use code-review skill
+- General test advice
+
+Requires: a **platform** and either **test files in the PR** or an explicit **TestFilter**.
+
+## ⚠️ CRITICAL: Inverted Pass/Fail Semantics
+
+In this skill, test outcomes mean the OPPOSITE of normal:
+
+| Test Result (without fix) | Verification Result | Why |
+|--------------------------|--------------------|----|
+| Tests FAIL | ✅ GOOD | Tests detect the bug |
+| Tests PASS | ❌ BAD | Tests miss the bug |
+
+NEVER say "verification passed" when tests PASS without the fix.
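
The inverted mapping in the table above can be sketched as a small function. This is only an illustration in Python and is not part of the skill or its script; the names `Verdict` and `interpret_without_fix` are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    """Verification outcome (hypothetical type, for illustration only)."""
    PASSED = "VERIFICATION PASSED"
    FAILED = "VERIFICATION FAILED"


def interpret_without_fix(tests_passed: bool) -> Verdict:
    """Map a test run WITHOUT the fix to a verification verdict.

    The mapping is inverted on purpose:
    - tests FAIL without the fix -> they detect the bug -> PASSED
    - tests PASS without the fix -> they miss the bug   -> FAILED
    """
    return Verdict.FAILED if tests_passed else Verdict.PASSED
```

Keeping the inversion in one place like this avoids restating it (and possibly getting it backwards) wherever a result is reported.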
+
+## Workflow
+
+### Step 1: Determine Mode
+- Check if fix files exist in the PR (non-test code changes detected by the script from the git diff)
+- If **fix files present** → Full Verification mode (`-RequireFullVerification`)
+- If **no fix files** → Verify Failure Only mode (omit the flag)
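
As a rough sketch of the mode decision in Step 1, the Python below treats any changed file that is not a test file as a fix file. The path heuristic in `looks_like_test_file` is an assumption made purely for illustration; the commit does not show how the real script classifies files, and both function names are hypothetical.

```python
from typing import Callable, Iterable


def looks_like_test_file(path: str) -> bool:
    """ASSUMPTION: illustrative heuristic, not the script's actual rule."""
    lowered = path.lower()
    return "/tests/" in lowered or "testcases" in lowered


def needs_full_verification(
    changed_files: Iterable[str],
    is_test_file: Callable[[str], bool] = looks_like_test_file,
) -> bool:
    """Return True when the diff contains at least one non-test (fix) file.

    True  -> Full Verification mode (pass -RequireFullVerification)
    False -> Verify Failure Only mode (omit the flag)
    """
    return any(not is_test_file(path) for path in changed_files)
```

Passing the classifier in as a parameter keeps the decision logic separate from whatever detection rule is actually used.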
+
+### Step 2: Construct Command
+```powershell
+pwsh .github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1 `
+  -Platform <platform> `
+  -TestFilter "<filter>" `
+  [-RequireFullVerification]  # Only if fix files exist
+```
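
The command template above could be assembled programmatically along these lines. The script path and the `-Platform`, `-TestFilter`, and `-RequireFullVerification` parameters come from the skill text; the helper `build_verify_command` and its signature are hypothetical, and treating `-TestFilter` as optional is an assumption based on the requirement of test files in the PR or an explicit filter.

```python
from typing import List, Optional

# Script path as given in the skill's Step 2.
SCRIPT = ".github/skills/verify-tests-fail-without-fix/scripts/verify-tests-fail.ps1"


def build_verify_command(
    platform: str,
    test_filter: Optional[str] = None,
    fix_files_present: bool = False,
) -> List[str]:
    """Build the argument list for the verification script (illustrative)."""
    command = ["pwsh", SCRIPT, "-Platform", platform]
    if test_filter:
        # ASSUMPTION: the filter can be omitted when the PR contains test files.
        command += ["-TestFilter", test_filter]
    if fix_files_present:
        # Full Verification mode: only when the PR contains fix files.
        command.append("-RequireFullVerification")
    return command
```

Returning a list rather than a single string means the result can be handed to `subprocess.run` without shell quoting concerns.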
+
+### Step 3: Interpret Results
+⚠️ Remember: test outcomes are INVERTED from normal!
+- Script outputs `VERIFICATION PASSED` → Tests catch the bug ✅
+- Script outputs `VERIFICATION FAILED` → Tests don't catch the bug ❌
+- Script outputs error/timeout → Report as Blocked
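
The three outcomes in Step 3 can be sketched as a classifier over the script's captured output. The marker strings are the ones named in the skill; the function name, the `timed_out` parameter, and the choice to check for the failure marker first are assumptions for illustration.

```python
def classify_script_output(stdout: str, timed_out: bool = False) -> str:
    """Classify captured script output as 'Passed', 'Failed', or 'Blocked'."""
    if timed_out:
        return "Blocked"
    # ASSUMPTION: check the failure marker first so ambiguous output
    # is never reported as a success.
    if "VERIFICATION FAILED" in stdout:
        return "Failed"
    if "VERIFICATION PASSED" in stdout:
        return "Passed"
    # Neither marker present: treat as an error and report Blocked.
    return "Blocked"
```

Anything that is neither a clear pass nor a clear fail falls through to Blocked, matching the third bullet.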
+
+### Step 4: Report
+- Report the result to the invoking orchestrator
+
 ## Mode 1: Verify Failure Only (Test Creation)
 
 Use when **creating tests before writing a fix**:
Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
+# eval.yaml v4 — verify-tests-fail-without-fix skill
+# v1: Initial 6 scenarios from skill analysis + try-fix eval lessons learned
+# v2: Added positive assertions, narrowed git bans, added RequireFullVerification + inverted semantics scenarios
+# v3: Added PR label automation + script auto-detection scenarios (10 total)
+# v4: Address PR review: fix label refs, git assertion conflicts, scenario 8 ambiguity, timeouts
+scenarios:
+  - name: "Happy path: full verification mode with fix files present"
+    prompt: |
+      The PR for Issue #54321 adds a UI test and a fix for a CollectionView crash on Android.
+      We need to verify the test actually catches the bug — meaning it fails without the fix
+      and passes with the fix applied.
+
+      The PR has both test files and fix files. Please run full verification.
+
+      Platform: android
+      Test filter: Issue54321
+    assertions:
+      - type: output_not_contains
+        value: "skip verification"
+      - type: output_not_contains
+        value: "tests are fine as-is"
+      - type: output_contains
+        value: "verify"
+    rubric:
+      - "Agent runs tests WITHOUT the fix first and checks that they FAIL (proving they catch the bug)"
+      - "Agent then restores the fix and runs tests again, checking that they PASS"
+      - "Agent uses the verify-tests-fail script rather than manually running dotnet test"
+      - "Agent correctly interprets the two-phase result (fail-then-pass = verification success)"
+    timeout: 900
+
+  - name: "Happy path: verify failure only mode for test creation"
+    prompt: |
+      I just wrote a UI test for Issue #88200 but haven't implemented the fix yet.
+      Can you verify my test actually catches the bug? It should fail right now
+      since we haven't fixed anything.
+
+      Platform: android
+    assertions:
+      - type: output_contains
+        value: "fail"
+    rubric:
+      - "Agent runs the test in verify-failure-only mode since no fix exists yet"
+      - "Agent correctly interprets test failure as SUCCESS (test catches the bug)"
+      - "Agent does not require fix files to be present for this mode"
+      - "Agent does not use -RequireFullVerification flag since no fix files exist"
+    timeout: 900
+
+  - name: "Negative trigger: general test question should not invoke verification"
+    prompt: |
+      How do I write a good UI test for a CollectionView scrolling bug? What assertions
+      should I use, and should I use VerifyScreenshot or element-based checks?
+    expect_activation: false
+    assertions:
+      - type: output_not_contains
+        value: "verify-tests-fail"
+      - type: output_not_contains
+        value: "verification-report"
+      - type: output_not_contains
+        value: "s/ai-reproduction"
+    rubric:
+      - "Agent provides UI testing guidance without launching the verification workflow"
+      - "Agent does not attempt to run any verification scripts or check PR labels"
+    timeout: 60
+
+  - name: "Regression: tests passing without fix means verification FAILED"
+    prompt: |
+      We ran the verify-tests-fail-without-fix skill on PR #77123. The test was
+      run without the fix applied, and it PASSED.
+
+      What does this result mean? Is the verification successful?
+    assertions:
+      - type: output_not_contains
+        value: "verification passed"
+      - type: output_not_contains
+        value: "verification successful"
+      - type: output_not_contains
+        value: "tests are working correctly"
+    rubric:
+      - "Agent correctly identifies that tests PASSING without the fix is a FAILURE — it means the tests don't catch the bug"
+      - "Agent recommends reviewing and improving the test assertions so they actually detect the issue"
+      - "Agent does not confuse 'test passed' with 'verification passed' — these are opposite meanings in this context"
+    timeout: 120
+
+  - name: "Edge case: no test files detected in the PR"
+    prompt: |
+      Run verify-tests-fail-without-fix on this PR. The PR only contains a fix
+      in src/Controls/src/Core/Handlers/Entry/EntryHandler.Android.cs but no
+      test files were added.
+
+      Platform: android
+    assertions:
+      - type: output_not_contains
+        value: "VERIFICATION PASSED"
+      - type: output_contains
+        value: "test"
+    rubric:
+      - "Agent recognizes that without test files, verification cannot proceed"
+      - "Agent suggests that tests need to be written before verification can be run"
+      - "Agent does not attempt to fabricate or skip the test requirement"
+    timeout: 120
+
+  - name: "Regression: agent must not manually revert files with git commands"
+    prompt: |
+      Please verify the UI tests for PR #33134 actually catch the EmptyView display
+      bug on Android. The PR has both test files and fix files.
+
+      Platform: android
+      Test filter: Issue33134
+    assertions:
+      - type: output_not_contains
+        value: "I will run git checkout"
+      - type: output_not_contains
+        value: "I will run git restore"
+      - type: output_not_contains
+        value: "I will use git stash"
+    rubric:
+      - "Agent uses the verify-tests-fail.ps1 script which handles file revert/restore automatically"
+      - "Agent does not manually use git checkout, git restore, or git stash to revert fix files"
+      - "Agent interprets the script output correctly to determine if verification passed or failed"
+    timeout: 900
+
+  - name: "Edge case: agent uses RequireFullVerification when fix files exist"
+    prompt: |
+      This PR has both UI tests and a code fix for Issue #55555 on Android.
+      The fix modifies src/Controls/src/Core/Handlers/ScrollView/ScrollViewHandler.Android.cs.
+      Please verify the tests catch the bug using full verification since we have fix files.
+      Platform: android
+      TestFilter: "FullyQualifiedName~Issue55555"
+    assertions:
+      - type: output_contains
+        value: "RequireFullVerification"
+    rubric:
+      - "Agent uses -RequireFullVerification to ensure full two-phase verification"
+      - "Agent runs the complete workflow: fail without fix, then pass with fix"
+    timeout: 900
+
+  - name: "Regression: agent correctly reports test failure as verification success"
+    prompt: |
+      I just ran verify-tests-fail-without-fix on PR #44444. The test FAILED with an
+      assertion error: Assert.That(rect.Height, Is.GreaterThan(0)) failed — the element
+      rendered with zero height. This is failure-only verification (no fix files).
+      What should I report?
+      Platform: android
+    assertions:
+      - type: output_not_contains
+        value: "verification failed"
+      - type: output_not_contains
+        value: "test is broken"
+    rubric:
+      - "Agent correctly interprets a clear assertion failure as verification SUCCESS -- the test catches the bug"
+      - "Agent does not recommend fixing the test when the failure proves the test detects the issue"
+    timeout: 120
+
+  - name: "Feature: agent reports verification result clearly"
+    prompt: |
+      I need to verify that the UI tests for Issue #66666 catch the bug on iOS.
+      The PR has both test files and a fix. How will I know if verification passed or failed?
+      Platform: ios
+      TestFilter: "FullyQualifiedName~Issue66666"
+    assertions:
+      - type: output_not_contains
+        value: "skip"
+    rubric:
+      - "Agent explains the verification output format (VERIFICATION PASSED / VERIFICATION FAILED)"
+      - "Agent describes what each result means in the context of inverted semantics"
+    timeout: 120
+
+  - name: "Feature: agent trusts script auto-detection of test files from git diff"
+    prompt: |
+      Verify tests for PR #77777 on Android. I'm not sure exactly which test files
+      were added -- the PR has several changed files. Can the verification script
+      figure out which tests to run on its own?
+      Platform: android
+    assertions:
+      - type: output_not_contains
+        value: "I need you to specify"
+    rubric:
+      - "Agent explains that the script can auto-detect test files from the PR diff"
+      - "Agent does not require the user to manually specify every test file path"
+      - "Agent trusts the script's git diff analysis rather than manually searching for test files"
+    timeout: 120
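
To make the two string assertion types used in the scenarios above concrete, here is a minimal Python sketch of how `output_contains` and `output_not_contains` checks might be evaluated against an agent's output. It is not skill-validator's implementation: the function name `failed_assertions` is hypothetical, and the matching rule (plain substring, case-insensitive by default) is an assumption, since the commit does not state how matching is done.

```python
from typing import Dict, List


def failed_assertions(
    output: str,
    assertions: List[Dict[str, str]],
    case_sensitive: bool = False,  # ASSUMPTION: real matching rule is unknown
) -> List[Dict[str, str]]:
    """Return the assertions that the given output does not satisfy."""
    haystack = output if case_sensitive else output.lower()
    failures = []
    for assertion in assertions:
        value = assertion["value"]
        needle = value if case_sensitive else value.lower()
        found = needle in haystack
        if assertion["type"] == "output_contains":
            satisfied = found
        elif assertion["type"] == "output_not_contains":
            satisfied = not found
        else:
            raise ValueError(f"unsupported assertion type: {assertion['type']}")
        if not satisfied:
            failures.append(assertion)
    return failures
```

An empty result means every string assertion held. The `rubric` entries are behavioral judgments and are outside what a substring check like this can cover.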