update README + add latest run result

davidjahn · davidjahn · commit 239755995dbe · 2026-03-24T22:54:44.000-04:00
diff --git a/testing/mongodb-query-optimizer/README.md b/testing/mongodb-query-optimizer/README.md
@@ -33,24 +33,28 @@ Evals are split into two groups based on whether they need a live Atlas MCP conn
 | Evals | MCP server | What they test |
 |-------|-----------|----------------|
 | **1–5** | **Not configured** | Query optimization knowledge from skill references only. No MCP tools should be called. |
-| **6–7** | **Configured** (connection string + Atlas API credentials) | Skill's ability to use `atlas-get-performance-advisor` to diagnose real cluster performance issues. |
+| **6–8** | **Configured** (connection string + Atlas API credentials) | Skill's ability to use `atlas-get-performance-advisor` to diagnose real cluster performance issues. |
 
 **Both groups run with_skill and without_skill (baseline).** The skill-creator should spawn two subagents per eval — one with the skill, one without — so you can compare the skill's value-add against the model's base knowledge.
 
 Eval 5 is a **negative test case** — the optimizer skill should NOT trigger for a routine query-writing prompt.
 
 ### Instructions for /skill-creator
 
-When asked to run evals for this skill:
+**Important:** Evals 1–5 must run without the MCP server, and evals 6–8 must run with it. When invoking skill-creator, state this split explicitly, e.g.:
+
+> `/skill-creator` can you please run the evals for mongodb-query-optimizer. run evals 1-5 without MCP server configured, run evals 6-8 with the MCP server configured
+
+The steps are:
 
 1. Run **evals 1–5 without the MCP server configured** (tell subagents not to use any MongoDB MCP tools). Run both with_skill and without_skill (baseline) for each.
-2. Run **evals 6–7 with the MCP server configured** (subagents should use Atlas MCP tools). Run both with_skill and without_skill (baseline) for each.
+2. Run **evals 6–8 with the MCP server configured** (subagents should use Atlas MCP tools). Run both with_skill and without_skill (baseline) for each.
 3. Grade all runs against the assertions in `evals/evals.json`.
 4. Generate the eval viewer with benchmark comparison (with_skill vs without_skill).
 
-## Atlas Performance Test Setup (Evals 6–7)
+## Atlas Performance Test Setup (Evals 6–8)
 
-Evals 6 and 7 require a live Atlas cluster with slow query data. Follow these steps before running them.
+Evals 6–8 require a live Atlas cluster with slow query data. Follow these steps before running them.
 
 ### Prerequisites
 
@@ -110,6 +114,7 @@ The script produces two slow query patterns:
 |---|---|---|
 | `find({ status, region }).sort({ createdAt: -1 })` | COLLSCAN + in-memory SORT | `{ status: 1, region: 1, createdAt: -1 }` |
 | `find({ customerId })` | COLLSCAN | `{ customerId: 1 }` |
+| `aggregate([{ $facet: {...} }])` | Full collection funneled into every branch | Replace `$facet` with `$unionWith`; index on `{ total: 1, createdAt: -1 }` |
 
 ### 4. Wait for Performance Advisor
 
@@ -129,11 +134,12 @@ If running evals via subagents (e.g., with skill-creator), pre-approve MCP tool
 }
 ```
 
-### 6. Run evals 6–7
+### 6. Run evals 6–8
 
-The eval test cases (ids 6 and 7 in `evals/evals.json`) ask the skill to:
-- Summarize slow queries and performance suggestions for the connected cluster
-- Provide optimization recommendations based on Performance Advisor output
+The eval test cases (ids 6, 7, and 8 in `evals/evals.json`) ask the skill to:
+- Discover and summarize slow queries on the connected cluster (eval 6)
+- Provide a full performance summary including indexes to create and drop (eval 7)
+- Identify and optimize a slow `$facet` aggregation from slow query logs (eval 8)
 
 These evals require a live MCP server connection — they cannot be run in offline/mock mode.
 
diff --git a/testing/mongodb-query-optimizer/SUMMARY.md b/testing/mongodb-query-optimizer/SUMMARY.md
@@ -1,63 +1,36 @@
-# mongodb-query-optimizer — Eval Results
+# mongodb-query-optimizer — Eval Results (Iteration 4)
 
-**Date:** 2026-03-24
+**Date:** 2026-03-25
 **Model:** Claude Opus 4.6 (`us.anthropic.claude-opus-4-6-v1`)
-**Iteration:** 1 (baseline)
+**MCP config:** Evals 1–5 run **without** MCP server; evals 6–8 run **with** MCP server
 
 ## Results
 
-
-| Eval                             | with_skill | without_skill | Differentiates? |
-| -------------------------------- | ---------- | ------------- | --------------- |
-| 1. $in operator optimization     | 3/3 (100%) | 2/3 (67%)     | Yes             |
-| 2. $lookup aggregation           | 4/4 (100%) | 3/4 (75%)     | Yes             |
-| 3. replaceOne oplog              | 3/3 (100%) | 2/3 (67%)     | Yes             |
-| 4. Covered query                 | 3/3 (100%) | 3/3 (100%)    | No              |
-| 5. Negative test (query writing) | 2/2 (100%) | 2/2 (100%)    | No              |
-| 6. Atlas slow queries (MCP)      | 5/5 (100%) | 5/5 (100%)    | No              |
-| 7. Atlas perf summary (MCP)      | 5/5 (100%) | 5/5 (100%)    | No              |
-| 8. $facet aggregation (MCP)      | 5/5 (100%) | 4/5 (80%)     | Yes             |
-
-
-**Overall: with_skill 100% vs without_skill 88% (+12%)**
-
-## Key findings
-
-- **Evals 1–3 differentiate well.** The skill's reference files provide specialized knowledge the model lacks: the ~200-element `$in` threshold (eval 1), top-N sort optimization (eval 2), and `$replaceWith` + `$literal` for oplog-efficient updates (eval 3).
-- **Eval 4–5 don't differentiate.** The covered query `_id` issue and the negative test case are handled equally well with or without the skill.
-- **MCP evals 6–7 don't differentiate.** When Atlas Performance Advisor provides index suggestions via API, both runs surface them equally. Consider adding assertions for skill-specific reasoning (e.g., ESR analysis) to better measure value-add.
-
----
-
-## Iteration 2
-
-**Date:** 2026-03-24
-**Model:** Claude Sonnet 4.6 (`us.anthropic.claude-sonnet-4-6`)
-**MCP config:** Evals 1–5 run **without** MCP server; evals 6–7 run **with** MCP server
-
-## Results
-
-| Eval                             | with_skill | without_skill | Differentiates? |
-| -------------------------------- | ---------- | ------------- | --------------- |
-| 1. $in operator optimization     | 2/3 (67%)  | 3/3 (100%)    | Inverted        |
-| 2. $lookup aggregation           | 4/4 (100%) | 2/4 (50%)     | Yes             |
-| 3. replaceOne oplog              | 3/3 (100%) | 2/3 (67%)     | Yes             |
-| 4. Covered query                 | 3/3 (100%) | 3/3 (100%)    | No              |
-| 5. Negative test (query writing) | 2/2 (100%) | 2/2 (100%)    | No              |
-| 6. Atlas slow queries (MCP)      | 5/5 (100%) | 5/5 (100%)    | No              |
-| 7. Atlas perf summary (MCP)      | 5/5 (100%) | 5/5 (100%)    | No              |
-
-**Overall: with_skill 95% vs without_skill 88% (+7%)**
-
-| Metric        | with_skill       | without_skill    | Delta   |
-| ------------- | ---------------- | ---------------- | ------- |
-| Pass Rate     | 95% ± 12%        | 88% ± 21%        | +7%     |
-| Time          | 52.2s ± 26.0s    | 48.1s ± 26.4s    | +4.1s   |
-| Tokens        | 18,252 ± 6,460   | 10,728 ± 5,929   | +7,524  |
+| Eval                             | with_skill  | without_skill | Differentiates? |
+| -------------------------------- | ----------- | ------------- | --------------- |
+| 1. $in operator optimization     | 2/3 (67%)   | 2/3 (67%)     | No              |
+| 2. $lookup aggregation           | 4/4 (100%)  | 2/4 (50%)     | Yes             |
+| 3. replaceOne oplog              | 3/3 (100%)  | 2/3 (67%)     | Yes             |
+| 4. Covered query                 | 3/3 (100%)  | 3/3 (100%)    | No              |
+| 5. Negative test (query writing) | 2/2 (100%)  | 2/2 (100%)    | No              |
+| 6. Atlas slow queries (MCP)      | 5/5 (100%)  | 5/5 (100%)    | No              |
+| 7. Atlas perf summary (MCP)      | 5/5 (100%)  | 4/5 (80%)     | Yes             |
+| 8. $facet aggregation (MCP)      | 4/5 (80%)   | 2/5 (40%)     | Yes             |
+
+**Overall: with_skill 93% vs without_skill 76% (+17 points; mean of per-eval pass rates)**
+
+| Metric        | with_skill       | without_skill    | Delta    |
+| ------------- | ---------------- | ---------------- | -------- |
+| Pass Rate     | 93%              | 76%              | +17%     |
+| Avg Time      | 76.0s            | 70.6s            | +5.4s    |
+| Avg Tokens    | 19,687           | 14,988           | +4,699   |
 
 ## Key findings
 
-- **Eval 1 inverted (with_skill 67% < without_skill 100%).** The skill agent converged on the single "safe" ESR index `{ status: 1, createdAt: -1, tags: 1 }` and didn't present `{ status: 1, tags: 1, createdAt: -1 }` as the better option for small `$in` lists. The baseline correctly surfaced both options with a size-based recommendation. The skill's ESR guidance may be too prescriptive for this nuanced case.
-- **Evals 2–3 still differentiate well.** Skill prevents the `$project`-before-`$group` anti-pattern (eval 2) and surfaces `$replaceWith` + `$literal` for oplog-efficient syncs (eval 3) — both require reference file knowledge.
-- **Evals 4–5 and MCP evals 6–7 remain non-differentiating**, same as iteration 1.
-- **Token cost of skill:** ~7,500 extra tokens per run on average, primarily from loading reference files.
+- **Biggest skill wins: evals 2, 3, and 8.** The skill's reference files provide specialized MongoDB knowledge the base model lacks: top-N sort optimization and avoiding the `$project`-before-`$group` anti-pattern (eval 2), `$replaceWith` + `$literal` for oplog-efficient updates (eval 3), and the `$facet` → `$unionWith` rewrite pattern (eval 8).
+- **Eval 1 tied at 67%.** Both versions missed the assertion about `{ status: 1, tags: 1, createdAt: -1 }` being suitable for small `$in` lists — both converged on the ESR-based `{ status: 1, createdAt: -1, tags: 1 }` index without presenting the alternative.
+- **Eval 7 skill edge.** The skill correctly called all three Performance Advisor operations (`suggestedIndexes`, `slowQueryLogs`, `dropIndexSuggestions`), while the baseline missed `slowQueryLogs`.
+- **Eval 8 is the strongest MCP differentiator (+40 points).** The skill recommended replacing `$facet` with `$unionWith` so that each pipeline can be optimized independently, knowledge that comes from `aggregation-optimization.md`. The baseline suggested only generic improvements (pre-`$match`, `$project`, `$limit`) without the structural rewrite.
+- **Eval 8 caveat.** Neither version could find the actual `$facet` query in the slow query logs (it had aged out of the retention window), which cost the skill one assertion.
+- **Evals 4–6 remain non-differentiating.** The covered query `_id` issue and the negative test are handled equally well, and MCP-based slow query discovery (eval 6) works equally well with or without the skill.
+- **Token cost of skill:** ~4,700 extra tokens per run on average, primarily from loading reference files. Time overhead is modest (+5.4s).