Commit 2149609

test: add mcp quality evaluation
1 parent 786771f commit 2149609

4 files changed

Lines changed: 505 additions & 1 deletion

docs/mcp-quality.md

Lines changed: 38 additions & 1 deletion
@@ -22,6 +22,31 @@ This runs:

 Exit code is non-zero if tests fail or the security scan reports critical issues. Safe to use in CI.

+## Quality evaluation
+
+Run the golden MCP quality evaluation from the repo root:
+
+```bash
+pnpm mcp:evaluate
+```
+
+This runs `packages/mcp/tests/quality/mcp-quality.test.ts` against the real rule corpus and prints a compact quality report. It currently measures:
+
+1. **Retrieval quality** with golden discovery queries using `Recall@5` and mean reciprocal rank.
+2. **`review_code` accuracy** with labeled true-positive and true-negative fixtures, reported as precision, recall, and false-positive rate.
+3. **Tool contract quality** across the full 11-tool surface, including naming, schemas, read-only annotations, and agent-facing descriptions.
+
+The command fails when quality drops below the current thresholds:
+
+- Retrieval `Recall@5 >= 80%`
+- Retrieval `MRR >= 0.50`
+- `review_code` precision `>= 90%`
+- `review_code` recall `>= 85%`
+- `review_code` false-positive rate `<= 10%`
+- All 11 expected tools remain exposed when checklist data exists
+
+Use this when changing search scoring, rule metadata, detector heuristics, tool definitions, or checklist-backed MCP behavior.
+
 **Security-only (no tests):**

 ```bash
@@ -76,7 +101,19 @@ pnpm test --filter=@repo/mcp

 Coverage: tools/list, get_rule, search_rules, check_rule, fix_rule, explain_rule, list_categories, review_code, get_workflow, get_quick_reference, telemetry, error handling.

-### 5. Tool performance benchmarks
+### 5. Golden quality evals
+
+**What**: Labeled quality checks in `packages/mcp/tests/quality/` that answer “is the MCP useful to agents?” rather than only “does it execute?”
+
+**How**:
+
+```bash
+pnpm mcp:evaluate
+```
+
+Add a retrieval case whenever a real agent query should reliably find a rule. Add a review fixture whenever `review_code` gains a new heuristic or previously noisy behavior is fixed.
+
+### 6. Tool performance benchmarks

 **What**: In-process latency benchmarks for the main tools; asserts p95 stays within budget and that `review_code` scales sub-linearly as the rule set grows.

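Reviewer note: the docs change above names `Recall@5` and mean reciprocal rank (MRR) without defining them. The sketch below shows one common way to compute both. It is not taken from `mcp-quality.test.ts`; the `RetrievalCase` shape, the function names, and the sample rule ids are hypothetical, and it assumes each golden query has exactly one expected rule.

```typescript
// Hypothetical shape of one golden retrieval case: the query, the single rule
// id it should find, and the ranked rule ids a search call returned.
interface RetrievalCase {
  query: string;
  expectedRuleId: string;
  rankedRuleIds: string[];
}

// Fraction of cases whose expected rule appears in the top k results.
function recallAtK(cases: RetrievalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.rankedRuleIds.slice(0, k).includes(c.expectedRuleId)
  ).length;
  return hits / cases.length;
}

// Mean of 1 / rank of the expected rule (rank is 1-based); a miss counts as 0.
function meanReciprocalRank(cases: RetrievalCase[]): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce((sum, c) => {
    const index = c.rankedRuleIds.indexOf(c.expectedRuleId);
    return sum + (index === -1 ? 0 : 1 / (index + 1));
  }, 0);
  return total / cases.length;
}

// Illustrative data only: expected rules found at ranks 1, 2, never, and 4.
const sampleCases: RetrievalCase[] = [
  { query: "q1", expectedRuleId: "rule-a", rankedRuleIds: ["rule-a", "rule-b"] },
  { query: "q2", expectedRuleId: "rule-c", rankedRuleIds: ["rule-b", "rule-c"] },
  { query: "q3", expectedRuleId: "rule-d", rankedRuleIds: ["rule-a", "rule-b"] },
  { query: "q4", expectedRuleId: "rule-e", rankedRuleIds: ["rule-a", "rule-b", "rule-c", "rule-e"] },
];

const recall = recallAtK(sampleCases, 5);      // 3 of 4 cases hit: 0.75
const mrr = meanReciprocalRank(sampleCases);   // (1 + 0.5 + 0 + 0.25) / 4 = 0.4375

// Both values fall below the documented thresholds (Recall@5 >= 80%, MRR >= 0.50).
console.log({ recall, mrr, passes: recall >= 0.8 && mrr >= 0.5 });
```

With one expected rule per query, `Recall@5` reduces to a hit rate, so the two metrics answer different questions: `Recall@5` asks whether the rule was found at all, while MRR rewards ranking it near the top.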
package.json

Lines changed: 1 addition & 0 deletions
@@ -51,6 +51,7 @@
 "validate:packages": "tsx scripts/validate/validate-packages.ts",
 "mcp:audit": "tsx scripts/audit/mcp-audit.ts",
 "mcp:audit:security": "pnpm dlx mcp-security-auditor@latest scan packages/mcp/src --fail-on critical",
+"mcp:evaluate": "pnpm --filter @repo/mcp evaluate",
 "audit:url": "tsx packages/cli/src/index.ts",
 "crawl:site": "tsx packages/crawler/src/run.ts",
 "ci:check": "pnpm run lint && pnpm run typecheck && pnpm run validate:rule-structure && pnpm run validate:guide-structure && pnpm run validate:guides && pnpm run validate:evidence && pnpm run test:ci",

packages/mcp/package.json

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@
 "main": "./src/index.ts",
 "types": "./src/index.ts",
 "scripts": {
+"evaluate": "jest tests/quality --runInBand",
 "lint": "biome check .",
 "test": "jest",
 "test:ci": "jest --ci --runInBand --passWithNoTests",
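Reviewer note: the new `evaluate` script runs the jest tests under `tests/quality`, which per the docs change gate on `review_code` precision, recall, and false-positive rate. A minimal sketch of such a gate is below. The `ReviewFixture` shape and function names are hypothetical, and it assumes each fixture yields a single flagged or not-flagged outcome, which may be simpler than what the real test does. The threshold values are the ones listed in `docs/mcp-quality.md`.

```typescript
// Hypothetical labeled fixture: whether review_code should flag it, and
// whether it actually did.
interface ReviewFixture {
  name: string;
  shouldFlag: boolean;
  flagged: boolean;
}

interface ReviewMetrics {
  precision: number;
  recall: number;
  falsePositiveRate: number;
}

// Build a confusion matrix from the fixtures and derive the three metrics.
function scoreReview(fixtures: ReviewFixture[]): ReviewMetrics {
  let tp = 0, fp = 0, tn = 0, fn = 0;
  for (const f of fixtures) {
    if (f.shouldFlag && f.flagged) tp++;
    else if (!f.shouldFlag && f.flagged) fp++;
    else if (!f.shouldFlag && !f.flagged) tn++;
    else fn++;
  }
  // Treat an empty denominator as 0 rather than NaN.
  const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);
  return {
    precision: ratio(tp, tp + fp),
    recall: ratio(tp, tp + fn),
    falsePositiveRate: ratio(fp, fp + tn),
  };
}

// Names of the metrics that miss the documented thresholds
// (precision >= 90%, recall >= 85%, false-positive rate <= 10%).
function failedThresholds(m: ReviewMetrics): string[] {
  const failures: string[] = [];
  if (m.precision < 0.9) failures.push("precision");
  if (m.recall < 0.85) failures.push("recall");
  if (m.falsePositiveRate > 0.1) failures.push("falsePositiveRate");
  return failures;
}

// Illustrative data only: 3 true positives, 1 miss, 1 false alarm, 3 true negatives.
const fixtures: ReviewFixture[] = [
  { name: "tp-1", shouldFlag: true, flagged: true },
  { name: "tp-2", shouldFlag: true, flagged: true },
  { name: "tp-3", shouldFlag: true, flagged: true },
  { name: "fn-1", shouldFlag: true, flagged: false },
  { name: "fp-1", shouldFlag: false, flagged: true },
  { name: "tn-1", shouldFlag: false, flagged: false },
  { name: "tn-2", shouldFlag: false, flagged: false },
  { name: "tn-3", shouldFlag: false, flagged: false },
];

const metrics = scoreReview(fixtures);
console.log(metrics, failedThresholds(metrics));
```

In a jest test, the gate would typically be an assertion that `failedThresholds(metrics)` is empty, so a regression in any one metric fails `pnpm mcp:evaluate` with the metric name in the output.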
