Commit 2397559

update README + add latest run result
1 parent 288c84a commit 2397559

2 files changed: +43 -64 lines changed

testing/mongodb-query-optimizer/README.md

Lines changed: 15 additions & 9 deletions
@@ -33,24 +33,28 @@ Evals are split into two groups based on whether they need a live Atlas MCP conn
3333
| Evals | MCP server | What they test |
3434
|-------|-----------|----------------|
3535
| **1–5** | **Not configured** | Query optimization knowledge from skill references only. No MCP tools should be called. |
36-
| **6–7** | **Configured** (connection string + Atlas API credentials) | Skill's ability to use `atlas-get-performance-advisor` to diagnose real cluster performance issues. |
36+
| **6–8** | **Configured** (connection string + Atlas API credentials) | Skill's ability to use `atlas-get-performance-advisor` to diagnose real cluster performance issues. |
3737

3838
**Both groups run with_skill and without_skill (baseline).** The skill-creator should spawn two subagents per eval — one with the skill, one without — so you can compare the skill's value-add against the model's base knowledge.
3939

4040
Eval 5 is a **negative test case** — the optimizer skill should NOT trigger for a routine query-writing prompt.
4141

4242
### Instructions for /skill-creator
4343

44-
When asked to run evals for this skill:
44+
**Important:** Evals 1–5 must run without the MCP server, and evals 6–8 must run with it. When invoking skill-creator, call it out explicitly, e.g.:
45+
46+
> `/skill-creator` can you please run the evals for mongodb-query-optimizer. run evals 1-5 without MCP server configured, run evals 6-8 with the MCP server configured
47+
48+
The steps are:
4549

4650
1. Run **evals 1–5 without the MCP server configured** (tell subagents not to use any MongoDB MCP tools). Run both with_skill and without_skill (baseline) for each.
47-
2. Run **evals 6–7 with the MCP server configured** (subagents should use Atlas MCP tools). Run both with_skill and without_skill (baseline) for each.
51+
2. Run **evals 6–8 with the MCP server configured** (subagents should use Atlas MCP tools). Run both with_skill and without_skill (baseline) for each.
4852
3. Grade all runs against the assertions in `evals/evals.json`.
4953
4. Generate the eval viewer with benchmark comparison (with_skill vs without_skill).
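
The split described in steps 1 and 2 can be sketched as a small run plan. This is only an illustration of the grouping, not how skill-creator schedules subagents; `EvalRun`, `build_run_plan`, and `MCP_EVAL_IDS` are hypothetical names that do not appear in the repository.

```python
from dataclasses import dataclass

# Assumption: eval ids 1-8, with 6-8 needing the MCP server (per the README table).
MCP_EVAL_IDS = frozenset({6, 7, 8})
ALL_EVAL_IDS = range(1, 9)


@dataclass(frozen=True)
class EvalRun:
    """One subagent run: which eval, skill on/off, MCP server on/off."""
    eval_id: int
    with_skill: bool
    mcp_configured: bool


def build_run_plan(eval_ids=ALL_EVAL_IDS):
    """Return two runs per eval (with_skill and baseline), tagged with MCP config."""
    plan = []
    for eval_id in eval_ids:
        mcp_configured = eval_id in MCP_EVAL_IDS
        for with_skill in (True, False):
            plan.append(EvalRun(eval_id, with_skill, mcp_configured))
    return plan


if __name__ == "__main__":
    for run in build_run_plan():
        print(run)
```

Each eval yields exactly two runs, so the with_skill and without_skill results stay paired for the benchmark comparison in step 4.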
5054

51-
## Atlas Performance Test Setup (Evals 6–7)
55+
## Atlas Performance Test Setup (Evals 6–8)
5256

53-
Evals 6 and 7 require a live Atlas cluster with slow query data. Follow these steps before running them.
57+
Evals 6–8 require a live Atlas cluster with slow query data. Follow these steps before running them.
5458

5559
### Prerequisites
5660

@@ -110,6 +114,7 @@ The script produces two slow query patterns:
110114
|---|---|---|
111115
| `find({ status, region }).sort({ createdAt: -1 })` | COLLSCAN + in-memory SORT | `{ status: 1, region: 1, createdAt: -1 }` |
112116
| `find({ customerId })` | COLLSCAN | `{ customerId: 1 }` |
117+
| `aggregate([$facet: {...}])` | Full collection funneled into every branch | Replace `$facet` with `$unionWith`; index on `{ total: 1, createdAt: -1 }` |
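
As a rough illustration of the `$facet` to `$unionWith` rewrite named in the last row, the sketch below turns a set of independent `$facet` branches into a single pipeline in which the first branch runs as the main pipeline and every further branch is appended via `$unionWith` on the same collection. The helper `facet_to_union_with`, the collection name `orders`, the branch names, and the thresholds are all assumptions made for the example; only the `total` and `createdAt` fields come from the table. The pipelines are plain Python dicts, so no driver is required to inspect them.

```python
def facet_to_union_with(collection, branches):
    """Rewrite independent $facet branches as one $unionWith pipeline.

    branches maps a branch name to its list of stages. Each output document
    is tagged with a "branch" field, because $unionWith yields a flat stream
    of documents rather than one document with an array per branch.
    """
    names = list(branches)
    if not names:
        raise ValueError("at least one branch is required")

    def tagged(name):
        # Copy the branch's stages and label its output documents.
        return list(branches[name]) + [{"$set": {"branch": name}}]

    # The first branch becomes the main pipeline on `collection`.
    pipeline = tagged(names[0])
    # Every other branch runs as its own sub-pipeline on the same collection.
    for name in names[1:]:
        pipeline.append(
            {"$unionWith": {"coll": collection, "pipeline": tagged(name)}}
        )
    return pipeline


# Hypothetical branches; each starts with its own $match.
branches = {
    "highValue": [
        {"$match": {"total": {"$gte": 1000}}},
        {"$sort": {"createdAt": -1}},
        {"$limit": 10},
    ],
    "lowValue": [
        {"$match": {"total": {"$lt": 50}}},
        {"$sort": {"createdAt": -1}},
        {"$limit": 10},
    ],
}

pipeline = facet_to_union_with("orders", branches)
```

Because each branch now begins with its own `$match` on the collection, it can be planned independently instead of receiving the full collection the way `$facet` sub-pipelines do. Consumers need to group on the `branch` field to recover the per-branch result sets.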
113118

114119
### 4. Wait for Performance Advisor
115120

@@ -129,11 +134,12 @@ If running evals via subagents (e.g., with skill-creator), pre-approve MCP tool
129134
}
130135
```
131136

132-
### 6. Run evals 6–7
137+
### 6. Run evals 6–8
133138

134-
The eval test cases (ids 6 and 7 in `evals/evals.json`) ask the skill to:
135-
- Summarize slow queries and performance suggestions for the connected cluster
136-
- Provide optimization recommendations based on Performance Advisor output
139+
The eval test cases (ids 6, 7, and 8 in `evals/evals.json`) ask the skill to:
140+
- Discover and summarize slow queries on the connected cluster (eval 6)
141+
- Provide a full performance summary including indexes to create and drop (eval 7)
142+
- Identify and optimize a slow `$facet` aggregation from slow query logs (eval 8)
137143

138144
These evals require a live MCP server connection — they cannot be run in offline/mock mode.
139145

Lines changed: 28 additions & 55 deletions
@@ -1,63 +1,36 @@
1-
# mongodb-query-optimizer — Eval Results
1+
# mongodb-query-optimizer — Eval Results (Iteration 4)
22

3-
**Date:** 2026-03-24
3+
**Date:** 2026-03-25
44
**Model:** Claude Opus 4.6 (`us.anthropic.claude-opus-4-6-v1`)
5-
**Iteration:** 1 (baseline)
5+
**MCP config:** Evals 1–5 run **without** MCP server; evals 6–8 run **with** MCP server
66

77
## Results
88

9-
10-
| Eval | with_skill | without_skill | Differentiates? |
11-
| -------------------------------- | ---------- | ------------- | --------------- |
12-
| 1. $in operator optimization | 3/3 (100%) | 2/3 (67%) | Yes |
13-
| 2. $lookup aggregation | 4/4 (100%) | 3/4 (75%) | Yes |
14-
| 3. replaceOne oplog | 3/3 (100%) | 2/3 (67%) | Yes |
15-
| 4. Covered query | 3/3 (100%) | 3/3 (100%) | No |
16-
| 5. Negative test (query writing) | 2/2 (100%) | 2/2 (100%) | No |
17-
| 6. Atlas slow queries (MCP) | 5/5 (100%) | 5/5 (100%) | No |
18-
| 7. Atlas perf summary (MCP) | 5/5 (100%) | 5/5 (100%) | No |
19-
| 8. $facet aggregation (MCP) | 5/5 (100%) | 4/5 (80%) | Yes |
20-
21-
22-
**Overall: with_skill 100% vs without_skill 88% (+12%)**
23-
24-
## Key findings
25-
26-
- **Evals 1–3 differentiate well.** The skill's reference files provide specialized knowledge the model lacks: the ~200-element `$in` threshold (eval 1), top-N sort optimization (eval 2), and `$replaceWith` + `$literal` for oplog-efficient updates (eval 3).
27-
- **Eval 4–5 don't differentiate.** The covered query `_id` issue and the negative test case are handled equally well with or without the skill.
28-
- **MCP evals 6–7 don't differentiate.** When Atlas Performance Advisor provides index suggestions via API, both runs surface them equally. Consider adding assertions for skill-specific reasoning (e.g., ESR analysis) to better measure value-add.
29-
30-
---
31-
32-
## Iteration 2
33-
34-
**Date:** 2026-03-24
35-
**Model:** Claude Sonnet 4.6 (`us.anthropic.claude-sonnet-4-6`)
36-
**MCP config:** Evals 1–5 run **without** MCP server; evals 6–7 run **with** MCP server
37-
38-
## Results
39-
40-
| Eval | with_skill | without_skill | Differentiates? |
41-
| -------------------------------- | ---------- | ------------- | --------------- |
42-
| 1. $in operator optimization | 2/3 (67%) | 3/3 (100%) | Inverted |
43-
| 2. $lookup aggregation | 4/4 (100%) | 2/4 (50%) | Yes |
44-
| 3. replaceOne oplog | 3/3 (100%) | 2/3 (67%) | Yes |
45-
| 4. Covered query | 3/3 (100%) | 3/3 (100%) | No |
46-
| 5. Negative test (query writing) | 2/2 (100%) | 2/2 (100%) | No |
47-
| 6. Atlas slow queries (MCP) | 5/5 (100%) | 5/5 (100%) | No |
48-
| 7. Atlas perf summary (MCP) | 5/5 (100%) | 5/5 (100%) | No |
49-
50-
**Overall: with_skill 95% vs without_skill 88% (+7%)**
51-
52-
| Metric | with_skill | without_skill | Delta |
53-
| ------------- | ---------------- | ---------------- | ------- |
54-
| Pass Rate | 95% ± 12% | 88% ± 21% | +7% |
55-
| Time | 52.2s ± 26.0s | 48.1s ± 26.4s | +4.1s |
56-
| Tokens | 18,252 ± 6,460 | 10,728 ± 5,929 | +7,524 |
9+
| Eval | with_skill | without_skill | Differentiates? |
10+
| -------------------------------- | ----------- | ------------- | --------------- |
11+
| 1. $in operator optimization | 2/3 (67%) | 2/3 (67%) | No |
12+
| 2. $lookup aggregation | 4/4 (100%) | 2/4 (50%) | Yes |
13+
| 3. replaceOne oplog | 3/3 (100%) | 2/3 (67%) | Yes |
14+
| 4. Covered query | 3/3 (100%) | 3/3 (100%) | No |
15+
| 5. Negative test (query writing) | 2/2 (100%) | 2/2 (100%) | No |
16+
| 6. Atlas slow queries (MCP) | 5/5 (100%) | 5/5 (100%) | No |
17+
| 7. Atlas perf summary (MCP) | 5/5 (100%) | 4/5 (80%) | Yes |
18+
| 8. $facet aggregation (MCP) | 4/5 (80%) | 2/5 (40%) | Yes |
19+
20+
**Overall: with_skill 93% vs without_skill 76% (+17%)**
21+
22+
| Metric | with_skill | without_skill | Delta |
23+
| ------------- | ---------------- | ---------------- | -------- |
24+
| Pass Rate | 93% | 76% | +17% |
25+
| Avg Time | 76.0s | 70.6s | +5.4s |
26+
| Avg Tokens | 19,687 | 14,988 | +4,699 |
5727

5828
## Key findings
5929

60-
- **Eval 1 inverted (with_skill 67% < without_skill 100%).** The skill agent converged on the single "safe" ESR index `{ status: 1, createdAt: -1, tags: 1 }` and didn't present `{ status: 1, tags: 1, createdAt: -1 }` as the better option for small `$in` lists. The baseline correctly surfaced both options with a size-based recommendation. The skill's ESR guidance may be too prescriptive for this nuanced case.
61-
- **Evals 2–3 still differentiate well.** Skill prevents the `$project`-before-`$group` anti-pattern (eval 2) and surfaces `$replaceWith` + `$literal` for oplog-efficient syncs (eval 3) — both require reference file knowledge.
62-
- **Evals 4–5 and MCP evals 6–7 remain non-differentiating**, same as iteration 1.
63-
- **Token cost of skill:** ~7,500 extra tokens per run on average, primarily from loading reference files.
30+
- **Biggest skill wins: evals 2, 3, and 8.** The skill's reference files provide specialized MongoDB knowledge the base model lacks: top-N sort optimization and avoiding the `$project`-before-`$group` anti-pattern (eval 2), `$replaceWith` + `$literal` for oplog-efficient updates (eval 3), and the `$facet` → `$unionWith` rewrite pattern (eval 8).
31+
- **Eval 1 tied at 67%.** Both versions missed the assertion about `{ status: 1, tags: 1, createdAt: -1 }` being suitable for small `$in` lists — both converged on the ESR-based `{ status: 1, createdAt: -1, tags: 1 }` index without presenting the alternative.
32+
- **Eval 7 skill edge.** The skill correctly called all three Performance Advisor operations (suggestedIndexes, slowQueryLogs, dropIndexSuggestions) while the baseline missed slowQueryLogs.
33+
- **Eval 8 strongest differentiator (+40%).** The skill recommended replacing `$facet` with `$unionWith` for independent pipeline optimization — knowledge from `aggregation-optimization.md`. The baseline only suggested generic improvements (pre-$match, $project, $limit) without the structural rewrite.
34+
- **Eval 8 caveat.** Both versions couldn't find the actual `$facet` query in slow query logs (aged out of retention window), costing the skill one assertion.
35+
- **Evals 4–5 and 6 remain non-differentiating.** Covered query `_id` issue and negative test are handled equally well. MCP-based slow query discovery (eval 6) works equally well with or without the skill.
36+
- **Token cost of skill:** ~4,700 extra tokens per run on average, primarily from loading reference files. Time overhead is modest (+5.4s).
