Commit aeda6f5

Add more detailed steps for skill testing (#37959)
1 parent d08003c commit aeda6f5

7 files changed

Lines changed: 742 additions & 8 deletions

.agents/skills/ci-analysis/references/manual-investigation.md

Lines changed: 4 additions & 0 deletions
@@ -3,6 +3,7 @@
 If the script doesn't provide enough information, use these manual investigation steps.

 ## Table of Contents
+
 - [Get Build Timeline](#get-build-timeline)
 - [Find Helix Tasks](#find-helix-tasks)
 - [Get Build Logs](#get-build-logs)
@@ -62,6 +63,7 @@ $workItem.Files | ForEach-Object { Write-Host "$($_.FileName): $($_.Uri)" }
 ```

 Common artifacts:
+
 - `console.*.log` - Console output
 - `*.binlog` - MSBuild binary logs
 - `run-*.log` - XHarness/test runner logs
@@ -70,6 +72,7 @@ Common artifacts:
 ## Analyze Binlogs

 Binlogs contain detailed MSBuild execution traces for diagnosing:
+
 - AOT compilation failures
 - Static web asset issues
 - NuGet restore problems
@@ -89,6 +92,7 @@ curl -s "https://helix.dot.net/api/2019-06-17/jobs/JOB_ID/workitems/WORK_ITEM_NA
 ```

 Example output:
+
 ```
 DOTNET_JitStress=1
 DOTNET_TieredCompilation=0

.agents/skills/make-skill/SKILL.md

Lines changed: 25 additions & 1 deletion
@@ -82,7 +82,17 @@ Include these recommended sections, following this file's structure:
 └── assets/ # Optional: templates, resources and other data files that aren't executable or Markdown
 ```

-### Step 6: Validate the skill
+### Step 6: Write Scripts (Script-driven Only)
+
+- Prefer PowerShell, but can also use Python or JavaScript
+- Standard param block with defaults
+- Ensure scripts produce clear, structured, and parseable console output (for example, section headers and status lines)
+- Emoji status: ✅ green / ⚠️ yellow / 🔴 red
+- **Fail-closed error handling**: Unknown ≠ Healthy
+
+> **NEVER** count API failures as success. Return "Unknown" and exclude from positive counts.
+
+### Step 7: Validate the skill

 Ensure the name:
 - Does not start or end with a hyphen
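
The fail-closed rule added in Step 6 above can be illustrated with a minimal sketch. It uses Python, which the step allows as an alternative to PowerShell. The `Status` enum, the `check_status` and `summarize` helpers, and the ❓ marker are all hypothetical names chosen for illustration; the source only requires that an API failure be reported as "Unknown" and kept out of the positive count.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "✅"
    WARNING = "⚠️"
    FAILED = "🔴"
    UNKNOWN = "❓"  # hypothetical marker; the source only says to return "Unknown"


def check_status(fetch):
    """Run one probe; any exception becomes UNKNOWN, never HEALTHY."""
    try:
        return Status.HEALTHY if fetch() else Status.FAILED
    except Exception:
        return Status.UNKNOWN


def summarize(results):
    """Count only confirmed-healthy items as positive."""
    healthy = sum(1 for s in results.values() if s is Status.HEALTHY)
    unknown = sum(1 for s in results.values() if s is Status.UNKNOWN)
    return {"healthy": healthy, "unknown": unknown, "total": len(results)}


def failing_probe():
    # Stand-in for an API call that times out.
    raise TimeoutError("API did not respond")


results = {
    "build": check_status(lambda: True),
    "tests": check_status(lambda: False),
    "helix": check_status(failing_probe),
}
summary = summarize(results)
print(f"{summary['healthy']}/{summary['total']} healthy, {summary['unknown']} unknown")
# prints "1/3 healthy, 1 unknown"
```

The probe that raised is reported separately instead of being folded into the healthy count, which is the behavior the NEVER rule asks for.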
@@ -103,6 +113,20 @@ After creating a skill, verify:
 - [ ] Optional directories are used appropriately
 - [ ] Scripts handle edge cases gracefully and return structured outputs and helpful error messages when applicable

+### Step 8: Test with Multi-Model Subagents
+
+Follow [references/testing-patterns.md](references/testing-patterns.md):
+
+1. Select the top-tier model from each of 2-4 different model families
+2. Give each the same test prompt exercising the skill
+3. Launch in parallel via `task` tool with `model` parameter
+4. Synthesize: consensus findings = high confidence
+5. Fix errors first, then warnings, then consider suggestions
+6. **Retrospective**: When an agent misapplies guidance, ask the *same model* why it made that choice; its self-analysis reveals guidance gaps you can close with targeted anti-patterns (see references/anti-patterns.md)
+7. **A/B test**: After fixing issues, re-run the same task to verify improvement: same model, same prompt, compare correctness/speed/tool calls (see references/testing-patterns.md)
+
+**For new skills or major restructuring**, use the writer-critic convergence loop instead: one agent writes, a different-model agent critiques, writer applies fixes, repeat until convergence (2-3 rounds). See references/testing-patterns.md#writer-critic-convergence-loop.
+
 ## Common Pitfalls

 | Pitfall | Solution |

.agents/skills/make-skill/references/anti-patterns.md

Lines changed: 351 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 300 additions & 0 deletions
@@ -0,0 +1,300 @@
# Testing Patterns

Multi-model subagent testing methodology for Copilot CLI skills. This approach was independently validated by both our iterative skill development and stephentoub's code-review skill (which includes multi-model review as a first-class process).

## Why Multi-Model Testing

Different models have different blind spots:

- Some excel at code correctness but miss UX issues
- Some catch edge cases others overlook
- Some produce false positives that others correctly ignore
- **Consensus findings** (flagged by 2+ models) are almost always real issues

## The Process

### 1. Select Models

Choose the top-tier model from each available model family. Use at least 2, at most 4. Skip fast/cheap tiers: you want the best reasoning from each family.

Example selection:

```
claude-opus-4.6 (Anthropic)
gpt-5.3-codex (OpenAI)
gpt-5.4 (OpenAI, alternative perspective)
```

> ⚠️ `gemini-3-pro-preview` frequently fails with 400 errors on general-purpose task agents. Prefer OpenAI or Anthropic models until Gemini stability improves.

### 2. Construct the Test Prompt

Give each agent the **same prompt** containing:

- The skill's purpose and context
- A realistic task that exercises the skill
- Instructions to report findings with severity

**For script-driven skills**, ask agents to run the skill and evaluate output:

```
Use the skill at {path} to {task}. After running, evaluate:
1. Did the skill produce correct, useful output?
2. Are there edge cases it mishandled?
3. Is the output clear and actionable?
4. Any bugs, errors, or misleading information?
Report findings as: ❌ error / ⚠️ warning / 💡 suggestion
```

**For knowledge-driven skills**, ask agents to apply the skill's rules:

```
Read the skill at {path} and use it to {task}. After applying, evaluate:
1. Were the instructions clear enough to follow?
2. Did any rules conflict or create ambiguity?
3. Were there gaps (situations where the skill gave no guidance)?
4. Any rules that seem wrong or overly broad?
Report findings as: ❌ error / ⚠️ warning / 💡 suggestion
```

**For the SKILL.md itself**, ask for structural review:
```
Review the skill at {path} as if you were a developer evaluating whether
to adopt it. Consider: trigger description quality, section organization,
completeness, accuracy, actionability. Would you trust this skill's guidance?
```

### 3. Launch in Parallel

Use the `task` tool with different `model` parameters:

```
task agent_type="general-purpose" model="claude-opus-4.6" prompt="..."
task agent_type="general-purpose" model="gpt-5.4" prompt="..."
task agent_type="general-purpose" model="gemini-3.1-pro-preview" prompt="..."
```

Launch all in parallel (mode="background") when possible.

### 4. Synthesize Results

After all agents complete:

1. **Deduplicate**: Group findings that describe the same issue
2. **Elevate consensus**: Issues flagged by 2+ models → high confidence, fix first
3. **Include unique catches**: Single-model findings that meet the confidence bar
4. **Discard noise**: Vague suggestions without specific evidence

### 5. Prioritize Actions

| Priority | Criteria |
|----------|----------|
| Fix now | ❌ errors from any model, ⚠️ warnings from 2+ models |
| Fix soon | ⚠️ warnings from 1 model with clear evidence |
| Consider | 💡 suggestions with consensus or strong rationale |
| Skip | 💡 suggestions from 1 model without evidence, style-only feedback |
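
The synthesis and prioritization steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout, the `issue_key` labels, the model names, and the `prioritize` helper are hypothetical, and "strong rationale" is approximated by a simple `has_evidence` flag. A single-model warning without evidence is not covered by the table, so the sketch treats it as noise and skips it.

```python
from collections import defaultdict

# Each finding: (model, issue_key, severity, has_evidence).
# issue_key is a hypothetical label assigned while deduplicating.
findings = [
    ("model-a", "missing-timeout", "warning", True),
    ("model-b", "missing-timeout", "warning", True),
    ("model-a", "wrong-field-name", "error", True),
    ("model-b", "rename-variable", "suggestion", False),
    ("model-a", "unclear-heading", "warning", True),
]


def prioritize(findings):
    """Group findings by issue, then apply the priority table."""
    grouped = defaultdict(list)
    for model, issue, severity, evidence in findings:
        grouped[issue].append((model, severity, evidence))

    priorities = {}
    for issue, reports in grouped.items():
        models = {m for m, _, _ in reports}
        severities = {s for _, s, _ in reports}
        evidence = any(e for _, _, e in reports)
        consensus = len(models) >= 2  # flagged by 2+ models
        if "error" in severities or ("warning" in severities and consensus):
            priorities[issue] = "fix now"
        elif "warning" in severities and evidence:
            priorities[issue] = "fix soon"
        elif "suggestion" in severities and (consensus or evidence):
            priorities[issue] = "consider"
        else:
            priorities[issue] = "skip"
    return priorities


priorities = prioritize(findings)
for issue, priority in priorities.items():
    print(f"{priority:9} {issue}")
```

Deduplication itself (deciding that two differently worded findings describe the same issue) still needs human or model judgment; the sketch assumes that step has already produced the shared keys.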

## A/B Testing: Before/After Comparison

When iterating on a skill, run the **same task** before and after changes to measure improvement. This catches cases where a fix for one problem introduces a regression elsewhere.

### Setup

1. **Pick a reproducible task**: a real investigation with a known correct answer works best
2. **Record the "before" run**: launch a subagent with the current skill and note elapsed time, tool call count, whether it got the correct answer, and any wrong turns
3. **Apply your skill changes** (edit SKILL.md, references, scripts)
4. **Run the "after" test**: same prompt, same model, same task
5. **Compare results**

### What to Measure

| Metric | How | Good signal |
|--------|-----|-------------|
| **Correctness** | Did the agent reach the right conclusion? | Before: ❌ → After: ✅ |
| **Elapsed time** | Agent completion time (seconds) | >30% faster |
| **Tool calls** | Count of tool invocations | Fewer = more efficient |
| **Wrong turns** | Steps that didn't contribute to the answer | Fewer = better guidance |

### Example (from ci-analysis improvement)

```
Task: "Compare Csc args between passing and failing Helix binlogs"

Round 1 (before fixes): 623s, wrong root cause (Debug/Release noise)
Round 2 (after fixes): 272s, correct root cause (extra analyzerconfig arg)

Changes made: Added "focus on arg count, not value differences" to
binlog-comparison.md delegation prompt template.
```
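
To make the comparison concrete, here is a small Python sketch that scores the two rounds from the example above against the "Good signal" column. The `Run` record and `compare` helper are hypothetical names for illustration; only the timings and correctness outcomes come from the example. Tool call counts and wrong turns are omitted because the example does not report them.

```python
from dataclasses import dataclass


@dataclass
class Run:
    elapsed_s: float
    correct: bool


def compare(before: Run, after: Run) -> dict:
    """Summarize a before/after pair using the metrics table."""
    speedup = (before.elapsed_s - after.elapsed_s) / before.elapsed_s
    return {
        "correctness_fixed": (not before.correct) and after.correct,
        "correctness_regressed": before.correct and not after.correct,
        "percent_faster": round(speedup * 100),
        "meets_speed_signal": speedup > 0.30,  # ">30% faster" from the table
    }


before = Run(elapsed_s=623, correct=False)  # Round 1
after = Run(elapsed_s=272, correct=True)    # Round 2
result = compare(before, after)
print(result)
```

For the example rounds this reports a run that is about 56% faster and moves from a wrong to a correct root cause, so both the correctness and elapsed-time signals are met.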

### Tips

- **Use the same model** for before/after, because different models have different capabilities
- **Known-answer tasks** are best, because you can objectively score correctness
- **Don't optimize for speed alone**: a slower agent that gets the right answer beats a fast wrong one
- **Save the before prompt**: you'll need the exact same prompt for the after run

## Writer-Critic Convergence Loop

For skill creation or major restructuring, a single review pass often misses structural issues that only surface when someone tries to *apply* the feedback. The writer-critic pattern uses two agents iteratively until the skill converges.

### Process

1. **Writer agent** creates or modifies the skill (SKILL.md, scripts, references)
2. **Critic agent** reviews the result and produces a structured feedback document with ❌/⚠️/💡 findings
3. **Writer agent** reads the feedback and applies fixes
4. **Critic agent** reviews again and flags only *new or remaining* issues
5. **Repeat** until the critic has no meaningful findings (usually 2-3 rounds)

### Setup

Use two `task` calls per round, in sequence (not in parallel, because each depends on the previous output):

```
# Round 1: Writer creates the skill
task agent_type="general-purpose" prompt="Create a skill at {path} that {does X}..."

# Round 1: Critic reviews
task agent_type="general-purpose" model="{different-model}" prompt="Review the skill at {path}. Report ❌/⚠️/💡 findings. Save feedback to {path}/feedback.md"

# Round 2: Writer applies feedback
task agent_type="general-purpose" prompt="Read {path}/feedback.md and apply the feedback to the skill at {path}. Delete feedback.md when done."

# Round 2: Critic reviews again
task agent_type="general-purpose" model="{different-model}" prompt="Review the skill at {path}. Only flag NEW or REMAINING issues..."
```

### Key design choices

- **Use different models** for writer and critic, because same-model pairs are too agreeable
- **The human stays in the loop** between rounds to steer direction and override bad suggestions
- **Save feedback as a file** (e.g., `feedback.md` in the skill directory) so the writer agent has full context without you relaying it
- **Delete feedback files** after they're applied; they're transient, not part of the skill
- **Stop when the critic produces only 💡 suggestions**. That's convergence. Don't chase zero findings.
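
The stop rule in the last bullet can be sketched as a small loop. This is a minimal illustration, not the actual orchestration: `write` and `critique` are hypothetical callables standing in for the two `task` invocations, the severity strings are assumed labels for ❌/⚠️/💡, and the `max_rounds=3` cap is an assumption based on the "usually 2-3 rounds" guidance. The human checkpoint between rounds is left out for brevity.

```python
def converge(write, critique, max_rounds=3):
    """Alternate writer and critic until only suggestions remain.

    write(feedback) applies the previous round's feedback;
    critique() returns a list of (severity, message) tuples.
    """
    feedback = []
    for round_number in range(1, max_rounds + 1):
        write(feedback)
        feedback = critique()
        blocking = [f for f in feedback if f[0] in ("error", "warning")]
        if not blocking:
            return round_number, feedback  # converged: only suggestions left
    return None, feedback  # no convergence; hand the decision to a human


# Scripted critic output for two rounds (illustrative data only).
scripted = iter([
    [("error", "broken link"), ("suggestion", "rename section")],
    [("suggestion", "rename section")],
])
applied = []  # records the feedback handed to the writer each round
rounds, remaining = converge(applied.append, lambda: next(scripted))
print(rounds, remaining)
# prints "2 [('suggestion', 'rename section')]"
```

Returning the leftover suggestions instead of looping until the list is empty mirrors the advice not to chase zero findings.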

### When to use this vs. multi-model review

| Scenario | Approach |
|----------|----------|
| Testing an existing skill against a real task | Multi-model review (parallel, single-shot) |
| Creating a new skill from scratch | Writer-critic loop (2-3 rounds) |
| Major restructuring of a skill | Writer-critic loop |
| Small fixes or incremental improvements | Multi-model review |
| Validating after writer-critic converges | Multi-model review as final check |

The two approaches complement each other: writer-critic for creation/iteration, multi-model for validation.

## Waza Eval Testing

For repeatable, quantitative skill testing, use the **waza-eval** skill. It provides:

- **Structured eval suites**: define tasks with prompts, expected outputs, and graders
- **Progression testing**: compare tool efficiency across skill versions from git history
- **Session capture**: commit result transcripts as golden sessions for regression detection
- **CI integration**: gate PRs on eval pass rates

Use waza evals when you need to *measure* whether a skill change improved behavior. Use multi-model review (above) when you need *qualitative* structural feedback.

### Regression Heuristics

When comparing before/after eval results:

| Signal | Verdict | Action |
|--------|---------|--------|
| Tool call increase > 20% on any task | 🔴 Regression | Roll back the change |
| Tool call decrease > 10% | 🟢 Improvement | Record as evidence |
| Elapsed time increase > 30% | 🔴 Regression | Investigate bottleneck |
| Correct before, wrong after | 🔴 Regression | Roll back (correctness trumps efficiency) |
| Model misapplies new guidance | 🔴 Regression | Needs anti-pattern or rewording |
| One model improves, others unchanged | 🟡 Partial | Likely acceptable |
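
The numeric rows of the table above translate directly into threshold checks. The following Python sketch covers only those rows; the `classify` helper and the dictionary keys (`tool_calls`, `elapsed_s`, `correct`) are hypothetical and are not part of waza-eval. The judgment-based rows (misapplied guidance, partial improvement across models) are not modeled.

```python
def classify(before, after):
    """Return heuristic flags for one task's before/after results."""
    flags = []
    if before["correct"] and not after["correct"]:
        flags.append("🔴 correct before, wrong after: roll back")

    calls_change = (after["tool_calls"] - before["tool_calls"]) / before["tool_calls"]
    if calls_change > 0.20:
        flags.append("🔴 tool calls up >20%: roll back")
    elif calls_change < -0.10:
        flags.append("🟢 tool calls down >10%: record as evidence")

    time_change = (after["elapsed_s"] - before["elapsed_s"]) / before["elapsed_s"]
    if time_change > 0.30:
        flags.append("🔴 elapsed time up >30%: investigate bottleneck")
    return flags


baseline = {"tool_calls": 20, "elapsed_s": 100, "correct": True}
more_calls = {"tool_calls": 25, "elapsed_s": 110, "correct": True}
slower = {"tool_calls": 17, "elapsed_s": 140, "correct": True}

flags_more_calls = classify(baseline, more_calls)
flags_slower = classify(baseline, slower)
print(flags_more_calls)
print(flags_slower)
```

A single run can raise both a green and a red flag, as the second case does, which is why the sketch returns every matching flag instead of one overall verdict.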

### Trigger Test Structure

Evals should include trigger tests (does the skill activate correctly?):
- **Should trigger** (8-12 prompts): varied phrasings of the skill's use cases, with high/medium confidence ratings
- **Should not trigger** (6-8 prompts): neighboring skills, similar keywords that belong elsewhere
- **Edge cases** (3-5 prompts): ambiguous prompts with explicit expected behavior and rationale
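
A quick sanity check for these prompt counts might look like the sketch below. The dictionary layout and the `check_trigger_suite` helper are assumptions for illustration, not the waza-eval suite format; only the count ranges come from the list above.

```python
# Prompt-count ranges taken from the list above.
EXPECTED_COUNTS = {
    "should_trigger": (8, 12),
    "should_not_trigger": (6, 8),
    "edge_cases": (3, 5),
}


def check_trigger_suite(suite):
    """Return a list of problems; an empty list means the counts fit."""
    problems = []
    for category, (low, high) in EXPECTED_COUNTS.items():
        count = len(suite.get(category, []))
        if not low <= count <= high:
            problems.append(f"{category}: {count} prompts, expected {low}-{high}")
    return problems


suite = {
    "should_trigger": [f"prompt {i}" for i in range(9)],
    "should_not_trigger": [f"prompt {i}" for i in range(4)],
    "edge_cases": [f"prompt {i}" for i in range(3)],
}
problems = check_trigger_suite(suite)
print(problems)
# prints "['should_not_trigger: 4 prompts, expected 6-8']"
```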

### Pre-submission Checklist

Before shipping a skill change:

- [ ] Description matches trigger tests (USE FOR phrases appear in should-trigger prompts)
- [ ] Stop signals are explicit with numeric bounds
- [ ] Domain examples present (not just tool schemas)
- [ ] Token budget met (SKILL.md under 4K tokens for orchestrating skills, under 15K for knowledge-driven skills)
- [ ] Multi-model validation ≥ 4/5 across 2+ families

## Common False Positives

From real experience, automated reviewers frequently flag these incorrectly:

### PowerShell compatibility

- **Claim**: `-UseBasicParsing` is "not supported in pwsh"
- **Reality**: It's a no-op in pwsh (accepted, silently ignored). Required in Windows PowerShell 5.1.
- **Response**: "Keeping it: no-op in pwsh, required in WinPS 5.1 to avoid IE COM dependency."

### API field names

- **Claim**: `gh pr checks` should use `--json conclusion` instead of `--json state`
- **Reality**: `conclusion` is not a valid field. `state` contains `SUCCESS`/`FAILURE` directly.
- **Response**: Verify with `gh pr checks --json` error output: "Unknown JSON field: 'conclusion'"

### Training data staleness

- **Claim**: "This API/method doesn't exist" or "is deprecated"
- **Reality**: Models have knowledge cutoffs. The API may be current.
- **Response**: "Verified: this API exists and works. Model training data may be stale."

### MCP tool name prefixes

- **Claim**: Skill docs should use fully-qualified MCP tool names like `hlx-hlx_status` or `github-mcp-server-list_workflow_runs` instead of short names like `hlx_status` or `list_workflow_runs`
- **Reality**: Skills should prefer domain language ("search the console log", "get job pass/fail summary") over any tool name. This maps to whichever tool the agent has (MCP, CLI, or API fallback). When tool names are unavoidable (e.g., anti-pattern examples), use short names; the server prefix is an implementation detail.
- **Response**: "Domain language is preferred. It creates semantic connections to tool descriptions rather than literal coupling to names that change across MCP versions."

### Over-disposal

- **Claim**: Every HTTP response/client needs try/finally/dispose
- **Reality**: Sometimes correct! But reviewers often suggest disposal patterns that add complexity without value (e.g., disposing a client that's about to go out of scope at function return).
- **Response**: Apply disposal for long-running functions or loops. Skip for simple one-shot calls at function end.

## Review Thread Workflow

When addressing PR review comments programmatically:

### Reply to a thread

```powershell
# $threadId must already hold the review thread's GraphQL node ID.
# ConvertTo-Json quotes and escapes the reply so it can be embedded
# as a string literal inside the GraphQL mutation below.
$body = "Your evidence-based reply" | ConvertTo-Json
$query = @"
mutation {
  addPullRequestReviewThreadReply(input: {
    pullRequestReviewThreadId: "$threadId",
    body: $body
  }) { clientMutationId }
}
"@
gh api graphql -f query="$query"
```

### Resolve a thread

```powershell
# Marks the thread identified by $threadId as resolved.
$query = @"
mutation {
  resolveReviewThread(input: {
    threadId: "$threadId"
  }) { clientMutationId }
}
"@
gh api graphql -f query="$query"
```

### Best practices

- **Read all threads first** before responding, since some may be duplicates
- **Reply before resolving** so the conversation is preserved
- **Batch replies** for the same issue across multiple threads
- **Include evidence**: "verified by running X" or "tested against real API"
- **Be concise**: one paragraph per reply is usually enough
