| 1 | +# Testing Patterns |
| 2 | + |
| 3 | +Multi-model subagent testing methodology for Copilot CLI skills. This approach was independently validated by both our iterative skill development and stephentoub's code-review skill (which includes multi-model review as a first-class process). |
| 4 | + |
| 5 | +## Why Multi-Model Testing |
| 6 | + |
| 7 | +Different models have different blind spots: |
| 8 | + |
| 9 | +- Some excel at code correctness but miss UX issues |
| 10 | +- Some catch edge cases others overlook |
| 11 | +- Some produce false positives that others correctly ignore |
| 12 | +- **Consensus findings** (flagged by 2+ models) are almost always real issues |
| 13 | + |
| 14 | +## The Process |
| 15 | + |
| 16 | +### 1. Select Models |
| 17 | + |
| 18 | +Choose the top-tier model from each available model family. Use at least 2, at most 4. Skip fast/cheap tiers — you want the best reasoning from each family. |
| 19 | + |
| 20 | +Example selection: |
| 21 | + |
| 22 | +``` |
| 23 | +claude-opus-4.6 (Anthropic) |
| 24 | +gpt-5.3-codex (OpenAI) |
| 25 | +gpt-5.4 (OpenAI, alternative perspective) |
| 26 | +``` |
| 27 | + |
| 28 | +> ⚠️ `gemini-3-pro-preview` frequently fails with 400 errors on general-purpose task agents. Prefer OpenAI or Anthropic models until Gemini stability improves. |
| 29 | + |
| 30 | +### 2. Construct the Test Prompt |
| 31 | + |
| 32 | +Give each agent the **same prompt** containing: |
| 33 | + |
| 34 | +- The skill's purpose and context |
| 35 | +- A realistic task that exercises the skill |
| 36 | +- Instructions to report findings with severity |
| 37 | + |
| 38 | +**For script-driven skills** — ask agents to run the skill and evaluate output: |
| 39 | + |
| 40 | +``` |
| 41 | +Use the skill at {path} to {task}. After running, evaluate: |
| 42 | +1. Did the skill produce correct, useful output? |
| 43 | +2. Are there edge cases it mishandled? |
| 44 | +3. Is the output clear and actionable? |
| 45 | +4. Any bugs, errors, or misleading information? |
| 46 | +Report findings as: ❌ error / ⚠️ warning / 💡 suggestion |
| 47 | +``` |
| 48 | + |
| 49 | +**For knowledge-driven skills** — ask agents to apply the skill's rules: |
| 50 | + |
| 51 | +``` |
| 52 | +Read the skill at {path} and use it to {task}. After applying, evaluate: |
| 53 | +1. Were the instructions clear enough to follow? |
| 54 | +2. Did any rules conflict or create ambiguity? |
| 55 | +3. Were there gaps — situations where the skill gave no guidance? |
| 56 | +4. Any rules that seem wrong or overly broad? |
| 57 | +Report findings as: ❌ error / ⚠️ warning / 💡 suggestion |
| 58 | +``` |
| 59 | + |
| 60 | +**For the SKILL.md itself** — ask for structural review: |
| 61 | +``` |
| 62 | +Review the skill at {path} as if you were a developer evaluating whether |
| 63 | +to adopt it. Consider: trigger description quality, section organization, |
| 64 | +completeness, accuracy, actionability. Would you trust this skill's guidance? |
| 65 | +``` |
| 66 | + |
| 67 | +### 3. Launch in Parallel |
| 68 | + |
| 69 | +Use the `task` tool with different `model` parameters: |
| 70 | + |
| 71 | +``` |
| 72 | +task agent_type="general-purpose" model="claude-opus-4.6" prompt="..." |
| 73 | +task agent_type="general-purpose" model="gpt-5.4" prompt="..." |
| 74 | +task agent_type="general-purpose" model="gpt-5.3-codex" prompt="..." |
| 75 | +``` |
| 76 | + |
| 77 | +Launch all in parallel (`mode="background"`) when possible. |
| 78 | + |
| 79 | +### 4. Synthesize Results |
| 80 | + |
| 81 | +After all agents complete: |
| 82 | + |
| 83 | +1. **Deduplicate**: Group findings that describe the same issue |
| 84 | +2. **Elevate consensus**: Issues flagged by 2+ models → high confidence, fix first |
| 85 | +3. **Include unique catches**: Single-model findings that meet the confidence bar |
| 86 | +4. **Discard noise**: Vague suggestions without specific evidence |
| 87 | + |
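The grouping in steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not part of any tool: it assumes each finding has already been given a shared `issue` key during deduplication (that matching step still needs human or agent judgment), and the function name, model names, and issue keys are hypothetical.

```python
from collections import defaultdict


def synthesize(findings):
    """Split issue keys into consensus (2+ models) and single-model catches."""
    models_by_issue = defaultdict(set)
    for finding in findings:
        # A set means the same model repeating itself counts only once.
        models_by_issue[finding["issue"]].add(finding["model"])
    consensus = sorted(k for k, m in models_by_issue.items() if len(m) >= 2)
    unique = sorted(k for k, m in models_by_issue.items() if len(m) == 1)
    return consensus, unique


# Hypothetical findings, already tagged with a shared issue key.
findings = [
    {"model": "model-a", "issue": "missing-timeout"},
    {"model": "model-b", "issue": "missing-timeout"},
    {"model": "model-a", "issue": "unclear-heading"},
]
consensus, unique = synthesize(findings)
print(consensus)  # ['missing-timeout']
print(unique)     # ['unclear-heading']
```

Single-model catches still need the manual confidence check from step 3; the sketch only separates them from consensus findings.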
| 88 | +### 5. Prioritize Actions |
| 89 | + |
| 90 | +| Priority | Criteria | |
| 91 | +|----------|----------| |
| 92 | +| Fix now | ❌ errors from any model, ⚠️ warnings from 2+ models | |
| 93 | +| Fix soon | ⚠️ warnings from 1 model with clear evidence | |
| 94 | +| Consider | 💡 suggestions with consensus or strong rationale | |
| 95 | +| Skip | 💡 suggestions from 1 model without evidence, style-only feedback | |
| 96 | + |
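The priority table can be read as a small decision function. The sketch below is one possible encoding, with two assumptions that the table does not state: a single-model warning without evidence falls through to "Skip", and `has_evidence` stands in for both "clear evidence" and "strong rationale". Style-only feedback is not modeled.

```python
def priority(severity, model_count, has_evidence):
    """Map one deduplicated finding to a row of the priority table."""
    if severity == "error":
        return "Fix now"  # errors from any model
    if severity == "warning":
        if model_count >= 2:
            return "Fix now"  # warnings with consensus
        # Assumption: an unsupported single-model warning is treated as noise.
        return "Fix soon" if has_evidence else "Skip"
    # Remaining case: suggestions.
    if model_count >= 2 or has_evidence:
        return "Consider"
    return "Skip"


print(priority("warning", 1, True))      # Fix soon
print(priority("suggestion", 1, False))  # Skip
```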
| 97 | +## A/B Testing: Before/After Comparison |
| 98 | + |
| 99 | +When iterating on a skill, run the **same task** before and after changes to measure improvement. This catches cases where a fix for one problem introduces a regression elsewhere. |
| 100 | + |
| 101 | +### Setup |
| 102 | + |
| 103 | +1. **Pick a reproducible task** — a real investigation with a known correct answer works best |
| 104 | +2. **Record the "before" run** — launch a subagent with the current skill, note: elapsed time, tool call count, whether it got the correct answer, and any wrong turns |
| 105 | +3. **Apply your skill changes** (edit SKILL.md, references, scripts) |
| 106 | +4. **Run the "after" test** — same prompt, same model, same task |
| 107 | +5. **Compare results** |
| 108 | + |
| 109 | +### What to Measure |
| 110 | + |
| 111 | +| Metric | How | Good signal | |
| 112 | +|--------|-----|-------------| |
| 113 | +| **Correctness** | Did the agent reach the right conclusion? | Before: ❌ → After: ✅ | |
| 114 | +| **Elapsed time** | Agent completion time (seconds) | >30% faster | |
| 115 | +| **Tool calls** | Count of tool invocations | Fewer = more efficient | |
| 116 | +| **Wrong turns** | Steps that didn't contribute to the answer | Fewer = better guidance | |
| 117 | + |
| 118 | +### Example (from ci-analysis improvement) |
| 119 | + |
| 120 | +``` |
| 121 | +Task: "Compare Csc args between passing and failing Helix binlogs" |
| 122 | + |
| 123 | +Round 1 (before fixes): 623s, wrong root cause (Debug/Release noise) |
| 124 | +Round 2 (after fixes): 272s, correct root cause (extra analyzerconfig arg) |
| 125 | + |
| 126 | +Changes made: Added "focus on arg count, not value differences" to |
| 127 | +binlog-comparison.md delegation prompt template. |
| 128 | +``` |
| 129 | + |
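To make the comparison concrete, here is a small Python sketch that scores the two rounds above against the "What to Measure" signals. The timings and correctness come from the example; the dictionary layout and function names are illustrative assumptions, and tool-call counts are left out because the example does not report them.

```python
def percent_faster(before_s, after_s):
    """Relative speed-up of the after run, as a percentage of the before run."""
    return (before_s - after_s) / before_s * 100


def compare(before, after):
    """Summarize a before/after pair for one task, same prompt and model."""
    speedup = round(percent_faster(before["seconds"], after["seconds"]), 1)
    return {
        "correctness_fixed": (not before["correct"]) and after["correct"],
        "correctness_regressed": before["correct"] and not after["correct"],
        "percent_faster": speedup,
        "meets_speed_bar": speedup > 30,  # the ">30% faster" signal
    }


before = {"seconds": 623, "correct": False}  # Round 1
after = {"seconds": 272, "correct": True}    # Round 2
result = compare(before, after)
print(result["percent_faster"])     # 56.3
print(result["correctness_fixed"])  # True
```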
| 130 | +### Tips |
| 131 | + |
| 132 | +- **Use the same model** for before/after — different models have different capabilities |
| 133 | +- **Known-answer tasks** are best — you can objectively score correctness |
| 134 | +- **Don't optimize for speed alone** — a slower agent that gets the right answer beats a fast wrong one |
| 135 | +- **Save the before prompt** — you'll need the exact same prompt for the after run |
| 136 | + |
| 137 | +## Writer-Critic Convergence Loop |
| 138 | + |
| 139 | +For skill creation or major restructuring, a single review pass often misses structural issues that only surface when someone tries to *apply* the feedback. The writer-critic pattern uses two agents iteratively until the skill converges. |
| 140 | + |
| 141 | +### Process |
| 142 | + |
| 143 | +1. **Writer agent** creates or modifies the skill (SKILL.md, scripts, references) |
| 144 | +2. **Critic agent** reviews the result — produces a structured feedback document with ❌/⚠️/💡 findings |
| 145 | +3. **Writer agent** reads the feedback and applies fixes |
| 146 | +4. **Critic agent** reviews again — only flags *new or remaining* issues |
| 147 | +5. **Repeat** until the critic has no meaningful findings (usually 2-3 rounds) |
| 148 | + |
| 149 | +### Setup |
| 150 | + |
| 151 | +Use two `task` calls in sequence (not parallel — each depends on the previous output): |
| 152 | + |
| 153 | +``` |
| 154 | +# Round 1: Writer creates the skill |
| 155 | +task agent_type="general-purpose" prompt="Create a skill at {path} that {does X}..." |
| 156 | + |
| 157 | +# Round 1: Critic reviews |
| 158 | +task agent_type="general-purpose" model="{different-model}" prompt="Review the skill at {path}. Report ❌/⚠️/💡 findings. Save feedback to {path}/feedback.md" |
| 159 | + |
| 160 | +# Round 2: Writer applies feedback |
| 161 | +task agent_type="general-purpose" prompt="Read {path}/feedback.md and apply the feedback to the skill at {path}. Delete feedback.md when done." |
| 162 | + |
| 163 | +# Round 2: Critic reviews again |
| 164 | +task agent_type="general-purpose" model="{different-model}" prompt="Review the skill at {path}. Only flag NEW or REMAINING issues..." |
| 165 | +``` |
| 166 | + |
| 167 | +### Key design choices |
| 168 | + |
| 169 | +- **Use different models** for writer and critic — same-model pairs are too agreeable |
| 170 | +- **The human stays in the loop** between rounds to steer direction and override bad suggestions |
| 171 | +- **Save feedback as a file** (e.g., `feedback.md` in the skill directory) so the writer agent has full context without you relaying it |
| 172 | +- **Delete feedback files** after they're applied — they're transient, not part of the skill |
| 173 | +- **Stop when the critic produces only 💡 suggestions** — that's convergence. Don't chase zero findings. |
| 174 | + |
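The stopping rule can be sketched as a loop. This is a rough illustration only: `write` and `critique` are hypothetical callables standing in for the two `task` invocations, the finding format is assumed, and the human checkpoint between rounds is omitted for brevity.

```python
def converge(write, critique, max_rounds=3):
    """Alternate writer and critic until only suggestions remain.

    Returns the round number at which the loop converged, or None if it
    did not converge within max_rounds.
    """
    feedback = []
    for round_no in range(1, max_rounds + 1):
        write(feedback)        # writer applies the previous round's feedback
        feedback = critique()  # critic reviews the updated skill
        blocking = [f for f in feedback if f["severity"] in ("error", "warning")]
        if not blocking:
            return round_no    # only suggestions (or nothing) left
    return None


# Scripted stand-ins for the two agents, for demonstration only.
scripted = iter([
    [{"severity": "error", "text": "placeholder finding"}],
    [{"severity": "suggestion", "text": "placeholder finding"}],
])
applied = []
rounds = converge(lambda fb: applied.append(len(fb)), lambda: next(scripted))
print(rounds)  # 2
```

Capping the loop with `max_rounds` reflects the advice not to chase zero findings; the default of 3 is an assumption based on the "usually 2-3 rounds" guidance.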
| 175 | +### When to use this vs. multi-model review |
| 176 | + |
| 177 | +| Scenario | Approach | |
| 178 | +|----------|----------| |
| 179 | +| Testing an existing skill against a real task | Multi-model review (parallel, single-shot) | |
| 180 | +| Creating a new skill from scratch | Writer-critic loop (2-3 rounds) | |
| 181 | +| Major restructuring of a skill | Writer-critic loop | |
| 182 | +| Small fixes or incremental improvements | Multi-model review | |
| 183 | +| Validating after writer-critic converges | Multi-model review as final check | |
| 184 | + |
| 185 | +The two approaches complement each other: writer-critic for creation/iteration, multi-model for validation. |
| 186 | + |
| 187 | +## Waza Eval Testing |
| 188 | + |
| 189 | +For repeatable, quantitative skill testing, use the **waza-eval** skill. It provides: |
| 190 | + |
| 191 | +- **Structured eval suites** — define tasks with prompts, expected outputs, and graders |
| 192 | +- **Progression testing** — compare tool efficiency across skill versions from git history |
| 193 | +- **Session capture** — commit result transcripts as golden sessions for regression detection |
| 194 | +- **CI integration** — gate PRs on eval pass rates |
| 195 | + |
| 196 | +Use waza evals when you need to *measure* whether a skill change improved behavior. Use multi-model review (above) when you need *qualitative* structural feedback. |
| 197 | + |
| 198 | +### Regression Heuristics |
| 199 | + |
| 200 | +When comparing before/after eval results: |
| 201 | + |
| 202 | +| Signal | Verdict | Action | |
| 203 | +|--------|-----------|--------| |
| 204 | +| Tool call increase > 20% on any task | 🔴 Regression | Roll back the change | |
| 205 | +| Tool call decrease > 10% | 🟢 Improvement | Record as evidence | |
| 206 | +| Elapsed time increase > 30% | 🔴 Regression | Investigate bottleneck | |
| 207 | +| Correct before, wrong after | 🔴 Regression | Roll back — correctness trumps efficiency | |
| 208 | +| Model misapplies new guidance | 🔴 Regression | Needs anti-pattern or rewording | |
| 209 | +| One model improves, others unchanged | 🟡 Partial | Likely acceptable | |
| 210 | + |
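The numeric rows of the table can be checked mechanically. The Python sketch below is an illustration under stated assumptions: the field names and the order of the checks are not specified by the table (correctness is checked first here because it "trumps efficiency"), and the sample numbers are hypothetical. The rows about misapplied guidance and per-model differences need human judgment and are not modeled.

```python
def pct_change(before, after):
    """Signed percentage change from before to after."""
    return (after - before) / before * 100


def classify(before, after):
    """Apply the numeric regression heuristics to one task's results."""
    if before["correct"] and not after["correct"]:
        return "regression: roll back (correctness)"
    tool_delta = pct_change(before["tool_calls"], after["tool_calls"])
    if tool_delta > 20:
        return "regression: roll back (tool calls)"
    if pct_change(before["seconds"], after["seconds"]) > 30:
        return "regression: investigate bottleneck"
    if tool_delta < -10:
        return "improvement: record as evidence"
    return "no signal"


# Hypothetical results for a single task.
before = {"correct": True, "tool_calls": 20, "seconds": 100}
after = {"correct": True, "tool_calls": 17, "seconds": 95}
print(classify(before, after))  # improvement: record as evidence
```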
| 211 | +### Trigger Test Structure |
| 212 | + |
| 213 | +Evals should include trigger tests (does the skill activate correctly?): |
| 214 | +- **Should trigger** (8-12 prompts): varied phrasings of the skill's use cases, with high/medium confidence ratings |
| 215 | +- **Should not trigger** (6-8 prompts): neighboring skills, similar keywords that belong elsewhere |
| 216 | +- **Edge cases** (3-5 prompts): ambiguous prompts with explicit expected behavior and rationale |
| 217 | + |
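A quick sanity check on suite size can catch lopsided trigger tests early. The sketch below assumes a plain dictionary of prompt lists keyed by category. This layout is hypothetical and is not the waza-eval schema; only the count ranges come from the list above.

```python
# Prompt-count ranges from the trigger test structure.
BOUNDS = {
    "should_trigger": (8, 12),
    "should_not_trigger": (6, 8),
    "edge_cases": (3, 5),
}


def check_suite(suite):
    """Return a list of problems; an empty list means all counts are in range."""
    problems = []
    for name, (low, high) in BOUNDS.items():
        count = len(suite.get(name, []))
        if not low <= count <= high:
            problems.append(f"{name}: {count} prompts, expected {low}-{high}")
    return problems


# Hypothetical suite with placeholder prompts.
suite = {
    "should_trigger": ["prompt"] * 9,
    "should_not_trigger": ["prompt"] * 4,
    "edge_cases": ["prompt"] * 3,
}
print(check_suite(suite))  # ['should_not_trigger: 4 prompts, expected 6-8']
```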
| 218 | +### Pre-submission Checklist |
| 219 | + |
| 220 | +Before shipping a skill change: |
| 221 | + |
| 222 | +- [ ] Description matches trigger tests (USE FOR phrases appear in should-trigger prompts) |
| 223 | +- [ ] Stop signals are explicit with numeric bounds |
| 224 | +- [ ] Domain examples present (not just tool schemas) |
| 225 | +- [ ] Token budget met (SKILL.md under 4K tokens for orchestrating skills, under 15K for knowledge skills) |
| 226 | +- [ ] Multi-model validation ≥ 4/5 across 2+ families |
| 227 | + |
| 228 | +## Common False Positives |
| 229 | + |
| 230 | +From real experience — automated reviewers frequently flag these incorrectly: |
| 231 | + |
| 232 | +### PowerShell compatibility |
| 233 | + |
| 234 | +- **Claim**: `-UseBasicParsing` is "not supported in pwsh" |
| 235 | +- **Reality**: It's a no-op in pwsh (accepted, silently ignored). Required in Windows PowerShell 5.1. |
| 236 | +- **Response**: "Keeping it — no-op in pwsh, required in WinPS 5.1 to avoid IE COM dependency." |
| 237 | + |
| 238 | +### API field names |
| 239 | + |
| 240 | +- **Claim**: `gh pr checks` should use `--json conclusion` instead of `--json state` |
| 241 | +- **Reality**: `conclusion` is not a valid field. `state` contains `SUCCESS`/`FAILURE` directly. |
| 242 | +- **Response**: Verify with `gh pr checks --json` error output: "Unknown JSON field: 'conclusion'" |
| 243 | + |
| 244 | +### Training data staleness |
| 245 | + |
| 246 | +- **Claim**: "This API/method doesn't exist" or "is deprecated" |
| 247 | +- **Reality**: Models have knowledge cutoffs. The API may be current. |
| 248 | +- **Response**: "Verified — this API exists and works. Model training data may be stale." |
| 249 | + |
| 250 | +### MCP tool name prefixes |
| 251 | + |
| 252 | +- **Claim**: Skill docs should use fully-qualified MCP tool names like `hlx-hlx_status` or `github-mcp-server-list_workflow_runs` instead of short names like `hlx_status` or `list_workflow_runs` |
| 253 | +- **Reality**: Skills should prefer domain language ("search the console log", "get job pass/fail summary") over any tool name. This maps to whichever tool the agent has — MCP, CLI, or API fallback. When tool names are unavoidable (e.g., anti-pattern examples), use short names; the server prefix is an implementation detail. |
| 254 | +- **Response**: "Domain language is preferred. It creates semantic connections to tool descriptions rather than literal coupling to names that change across MCP versions." |
| 255 | + |
| 256 | +### Over-disposal |
| 257 | + |
| 258 | +- **Claim**: Every HTTP response/client needs try/finally/dispose |
| 259 | +- **Reality**: Sometimes correct! But reviewers often suggest disposal patterns that add complexity without value (e.g., disposing a client that's about to go out of scope at function return). |
| 260 | +- **Response**: Apply disposal for long-running functions or loops. Skip for simple one-shot calls at function end. |
| 261 | + |
| 262 | +## Review Thread Workflow |
| 263 | + |
| 264 | +When addressing PR review comments programmatically: |
| 265 | + |
| 266 | +### Reply to a thread |
| 267 | + |
| 268 | +```powershell |
| 269 | +$body = "Your evidence-based reply"  # passed as a GraphQL variable, so no manual escaping is needed |
| 270 | +$query = @' |
| 271 | +mutation($threadId: ID!, $body: String!) { |
| 272 | + addPullRequestReviewThreadReply(input: { |
| 273 | + pullRequestReviewThreadId: $threadId, |
| 274 | + body: $body |
| 275 | + }) { clientMutationId } |
| 276 | +} |
| 277 | +'@ |
| 278 | +gh api graphql -f query="$query" -f threadId="$threadId" -f body="$body" |
| 279 | +``` |
| 280 | + |
| 281 | +### Resolve a thread |
| 282 | + |
| 283 | +```powershell |
| 284 | +$query = @' |
| 285 | +mutation($threadId: ID!) { |
| 286 | + resolveReviewThread(input: { |
| 287 | + threadId: $threadId |
| 288 | + }) { clientMutationId } |
| 289 | +} |
| 290 | +'@ |
| 291 | +gh api graphql -f query="$query" -f threadId="$threadId" |
| 292 | +``` |
| 293 | + |
| 294 | +### Best practices |
| 295 | + |
| 296 | +- **Read all threads first** before responding — some may be duplicates |
| 297 | +- **Reply before resolving** — so the conversation is preserved |
| 298 | +- **Batch replies** for the same issue across multiple threads |
| 299 | +- **Include evidence** — "verified by running X" or "tested against real API" |
| 300 | +- **Be concise** — one paragraph per reply is usually enough |