|
1 | | -# mongodb-query-optimizer — Eval Results |
| 1 | +# mongodb-query-optimizer — Eval Results (Iteration 4) |
2 | 2 |
|
3 | | -**Date:** 2026-03-24 |
| 3 | +**Date:** 2026-03-25 |
4 | 4 | **Model:** Claude Opus 4.6 (`us.anthropic.claude-opus-4-6-v1`) |
5 | | -**Iteration:** 1 (baseline) |
| 5 | +**MCP config:** Evals 1–5 run **without** MCP server; evals 6–8 run **with** MCP server |
6 | 6 |
|
7 | 7 | ## Results |
8 | 8 |
|
9 | | - |
10 | | -| Eval | with_skill | without_skill | Differentiates? | |
11 | | -| -------------------------------- | ---------- | ------------- | --------------- | |
12 | | -| 1. $in operator optimization | 3/3 (100%) | 2/3 (67%) | Yes | |
13 | | -| 2. $lookup aggregation | 4/4 (100%) | 3/4 (75%) | Yes | |
14 | | -| 3. replaceOne oplog | 3/3 (100%) | 2/3 (67%) | Yes | |
15 | | -| 4. Covered query | 3/3 (100%) | 3/3 (100%) | No | |
16 | | -| 5. Negative test (query writing) | 2/2 (100%) | 2/2 (100%) | No | |
17 | | -| 6. Atlas slow queries (MCP) | 5/5 (100%) | 5/5 (100%) | No | |
18 | | -| 7. Atlas perf summary (MCP) | 5/5 (100%) | 5/5 (100%) | No | |
19 | | -| 8. $facet aggregation (MCP) | 5/5 (100%) | 4/5 (80%) | Yes | |
20 | | - |
21 | | - |
22 | | -**Overall: with_skill 100% vs without_skill 88% (+12%)** |
23 | | - |
24 | | -## Key findings |
25 | | - |
26 | | -- **Evals 1–3 differentiate well.** The skill's reference files provide specialized knowledge the model lacks: the ~200-element `$in` threshold (eval 1), top-N sort optimization (eval 2), and `$replaceWith` + `$literal` for oplog-efficient updates (eval 3). |
27 | | -- **Eval 4–5 don't differentiate.** The covered query `_id` issue and the negative test case are handled equally well with or without the skill. |
28 | | -- **MCP evals 6–7 don't differentiate.** When Atlas Performance Advisor provides index suggestions via API, both runs surface them equally. Consider adding assertions for skill-specific reasoning (e.g., ESR analysis) to better measure value-add. |
29 | | - |
30 | | ---- |
31 | | - |
32 | | -## Iteration 2 |
33 | | - |
34 | | -**Date:** 2026-03-24 |
35 | | -**Model:** Claude Sonnet 4.6 (`us.anthropic.claude-sonnet-4-6`) |
36 | | -**MCP config:** Evals 1–5 run **without** MCP server; evals 6–7 run **with** MCP server |
37 | | - |
38 | | -## Results |
39 | | - |
40 | | -| Eval | with_skill | without_skill | Differentiates? | |
41 | | -| -------------------------------- | ---------- | ------------- | --------------- | |
42 | | -| 1. $in operator optimization | 2/3 (67%) | 3/3 (100%) | Inverted | |
43 | | -| 2. $lookup aggregation | 4/4 (100%) | 2/4 (50%) | Yes | |
44 | | -| 3. replaceOne oplog | 3/3 (100%) | 2/3 (67%) | Yes | |
45 | | -| 4. Covered query | 3/3 (100%) | 3/3 (100%) | No | |
46 | | -| 5. Negative test (query writing) | 2/2 (100%) | 2/2 (100%) | No | |
47 | | -| 6. Atlas slow queries (MCP) | 5/5 (100%) | 5/5 (100%) | No | |
48 | | -| 7. Atlas perf summary (MCP) | 5/5 (100%) | 5/5 (100%) | No | |
49 | | - |
50 | | -**Overall: with_skill 95% vs without_skill 88% (+7%)** |
51 | | - |
52 | | -| Metric | with_skill | without_skill | Delta | |
53 | | -| ------------- | ---------------- | ---------------- | ------- | |
54 | | -| Pass Rate | 95% ± 12% | 88% ± 21% | +7% | |
55 | | -| Time | 52.2s ± 26.0s | 48.1s ± 26.4s | +4.1s | |
56 | | -| Tokens | 18,252 ± 6,460 | 10,728 ± 5,929 | +7,524 | |
| 9 | +| Eval | with_skill | without_skill | Differentiates? | |
| 10 | +| -------------------------------- | ----------- | ------------- | --------------- | |
| 11 | +| 1. $in operator optimization | 2/3 (67%) | 2/3 (67%) | No | |
| 12 | +| 2. $lookup aggregation | 4/4 (100%) | 2/4 (50%) | Yes | |
| 13 | +| 3. replaceOne oplog | 3/3 (100%) | 2/3 (67%) | Yes | |
| 14 | +| 4. Covered query | 3/3 (100%) | 3/3 (100%) | No | |
| 15 | +| 5. Negative test (query writing) | 2/2 (100%) | 2/2 (100%) | No | |
| 16 | +| 6. Atlas slow queries (MCP) | 5/5 (100%) | 5/5 (100%) | No | |
| 17 | +| 7. Atlas perf summary (MCP) | 5/5 (100%) | 4/5 (80%) | Yes | |
| 18 | +| 8. $facet aggregation (MCP) | 4/5 (80%) | 2/5 (40%) | Yes | |
| 19 | + |
| 20 | +**Overall: with_skill 93% vs without_skill 76% (+17%)** |
| 21 | + |
| 22 | +| Metric | with_skill | without_skill | Delta | |
| 23 | +| ------------- | ---------------- | ---------------- | -------- | |
| 24 | +| Pass Rate | 93% | 76% | +17% | |
| 25 | +| Avg Time | 76.0s | 70.6s | +5.4s | |
| 26 | +| Avg Tokens | 19,687 | 14,988 | +4,699 | |
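
The source does not say how the overall pass rate is aggregated. As a minimal sketch, assuming it is an unweighted mean of the per-eval pass rates, the snippet below recomputes it from the `(passed, total)` pairs in the results table and compares it with a pooled rate (all passed assertions over all assertions). The names `WITH_SKILL`, `WITHOUT_SKILL`, `macro_pass_rate`, and `pooled_pass_rate` are illustrative and not part of the eval harness.

```python
# (passed, total) assertion counts per eval, copied from the results table.
WITH_SKILL = [(2, 3), (4, 4), (3, 3), (3, 3), (2, 2), (5, 5), (5, 5), (4, 5)]
WITHOUT_SKILL = [(2, 3), (2, 4), (2, 3), (3, 3), (2, 2), (5, 5), (4, 5), (2, 5)]


def macro_pass_rate(results):
    """Unweighted mean of per-eval pass rates (every eval counts equally)."""
    return sum(passed / total for passed, total in results) / len(results)


def pooled_pass_rate(results):
    """All passed assertions divided by all assertions (large evals weigh more)."""
    return sum(p for p, _ in results) / sum(t for _, t in results)


if __name__ == "__main__":
    for label, results in (("with_skill", WITH_SKILL), ("without_skill", WITHOUT_SKILL)):
        print(
            f"{label}: macro={macro_pass_rate(results):.1%} "
            f"pooled={pooled_pass_rate(results):.1%}"
        )
```

Under this assumption the unweighted mean is about 93.3% with the skill and about 75.4% without it, while the pooled rate is 28/30 and 22/30. The reported 76% matches averaging the already-rounded per-eval percentages (604 / 8 = 75.5), so small differences from these figures are probably rounding rather than a different data set.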
57 | 27 |
|
58 | 28 | ## Key findings |
59 | 29 |
|
60 | | -- **Eval 1 inverted (with_skill 67% < without_skill 100%).** The skill agent converged on the single "safe" ESR index `{ status: 1, createdAt: -1, tags: 1 }` and didn't present `{ status: 1, tags: 1, createdAt: -1 }` as the better option for small `$in` lists. The baseline correctly surfaced both options with a size-based recommendation. The skill's ESR guidance may be too prescriptive for this nuanced case. |
61 | | -- **Evals 2–3 still differentiate well.** Skill prevents the `$project`-before-`$group` anti-pattern (eval 2) and surfaces `$replaceWith` + `$literal` for oplog-efficient syncs (eval 3) — both require reference file knowledge. |
62 | | -- **Evals 4–5 and MCP evals 6–7 remain non-differentiating**, same as iteration 1. |
63 | | -- **Token cost of skill:** ~7,500 extra tokens per run on average, primarily from loading reference files. |
| 30 | +- **Biggest skill wins: evals 2, 3, and 8.** The skill's reference files provide specialized MongoDB knowledge the base model lacks: top-N sort optimization and avoiding the `$project`-before-`$group` anti-pattern (eval 2), `$replaceWith` + `$literal` for oplog-efficient updates (eval 3), and the `$facet` → `$unionWith` rewrite pattern (eval 8). |
| 31 | +- **Eval 1 tied at 67%.** Both versions missed the assertion about `{ status: 1, tags: 1, createdAt: -1 }` being suitable for small `$in` lists — both converged on the ESR-based `{ status: 1, createdAt: -1, tags: 1 }` index without presenting the alternative. |
| 32 | +- **Eval 7 skill edge.** The skill correctly called all three Performance Advisor operations (suggestedIndexes, slowQueryLogs, dropIndexSuggestions) while the baseline missed slowQueryLogs. |
| 33 | +- **Eval 8 strongest differentiator (+40 percentage points).** The skill recommended replacing `$facet` with `$unionWith` so that each independent pipeline can be optimized separately, knowledge that comes from `aggregation-optimization.md`. The baseline only suggested generic improvements (pre-`$match`, `$project`, `$limit`) without the structural rewrite. |
| 34 | +- **Eval 8 caveat.** Neither version could find the actual `$facet` query in the slow query logs (it had aged out of the retention window), which cost the skill one assertion. |
| 35 | +- **Evals 4–6 remain non-differentiating.** The covered query `_id` issue (eval 4) and the negative test (eval 5) are handled equally well with or without the skill, as is MCP-based slow query discovery (eval 6). |
| 36 | +- **Token cost of skill:** ~4,700 extra tokens per run on average, primarily from loading reference files. Time overhead is modest (+5.4s). |
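
To make the eval 8 finding concrete, here is a hedged sketch of what a `$facet` to `$unionWith` rewrite can look like. It is not taken from `aggregation-optimization.md`: the helper `facet_to_union_with`, the `orders` collection, and the field names are all hypothetical. The pipelines are plain Python dicts in the shape a driver such as PyMongo would accept, so the snippet runs without a database. The motivation is that `$facet` sub-pipelines cannot use indexes, whereas each `$unionWith` branch is planned as its own pipeline on the collection.

```python
from typing import Any

Stage = dict[str, Any]


def facet_to_union_with(coll: str, facets: dict[str, list[Stage]]) -> list[Stage]:
    """Rewrite independent $facet sub-pipelines as one $unionWith chain.

    Hypothetical helper. Each branch tags its output documents with the
    facet name so the caller can tell the result sets apart.
    """
    if not facets:
        raise ValueError("at least one facet is required")

    def tagged(name: str, stages: list[Stage]) -> list[Stage]:
        # $literal keeps the name from being parsed as a field path.
        return [*stages, {"$addFields": {"facet": {"$literal": name}}}]

    (first_name, first_stages), *rest = facets.items()
    # The first branch runs directly on the target collection.
    pipeline = tagged(first_name, first_stages)
    # Every other branch is appended as a separate pipeline on the same collection.
    for name, stages in rest:
        pipeline.append({"$unionWith": {"coll": coll, "pipeline": tagged(name, stages)}})
    return pipeline


# Hypothetical facets: a count of shipped orders and the five newest orders.
facets = {
    "shipped_count": [{"$match": {"status": "shipped"}}, {"$count": "n"}],
    "newest": [{"$sort": {"createdAt": -1}}, {"$limit": 5}],
}
pipeline = facet_to_union_with("orders", facets)
```

Two caveats apply to this sketch. The output shape changes: `$facet` returns a single document with one array per facet, while the rewritten pipeline returns a flat stream of documents that the caller groups by the `facet` field. Also, any stage that preceded the original `$facet` (for example a shared `$match`) has to be repeated at the start of every branch, because each `$unionWith` branch reads the collection from scratch.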