test: add mcp quality evaluation

thedaviddias · thedaviddias · commit 21496094b1d9 · 2026-05-31T10:55:17.000-04:00
diff --git a/docs/mcp-quality.md b/docs/mcp-quality.md
@@ -22,6 +22,31 @@ This runs:
 
 Exit code is non-zero if tests fail or the security scan reports critical issues. Safe to use in CI.
 
+## Quality evaluation
+
+Run the golden MCP quality evaluation from the repo root:
+
+```bash
+pnpm mcp:evaluate
+```
+
+This runs `packages/mcp/tests/quality/mcp-quality.test.ts` against the real rule corpus and prints a compact quality report. It currently measures:
+
+1. **Retrieval quality** with golden discovery queries, reported as `Recall@5` and mean reciprocal rank (MRR).
+2. **`review_code` accuracy** with labeled true-positive and true-negative fixtures, reported as precision, recall, and false-positive rate.
+3. **Tool contract quality** across the full 11-tool surface, including naming, schemas, read-only annotations, and agent-facing descriptions.
+
+The command fails when any metric misses its current threshold:
+
+- Retrieval `Recall@5 >= 80%`
+- Retrieval `MRR >= 0.50`
+- `review_code` precision `>= 90%`
+- `review_code` recall `>= 85%`
+- `review_code` false-positive rate `<= 10%`
+- All 11 expected tools remain exposed when checklist data exists
+
+Use this when changing search scoring, rule metadata, detector heuristics, tool definitions, or checklist-backed MCP behavior.
+
 **Security-only (no tests):**
 
 ```bash
@@ -76,7 +101,19 @@ pnpm test --filter=@repo/mcp
 
 Coverage: tools/list, get_rule, search_rules, check_rule, fix_rule, explain_rule, list_categories, review_code, get_workflow, get_quick_reference, telemetry, error handling.
 
-### 5. Tool performance benchmarks
+### 5. Golden quality evals
+
+**What**: Labeled quality checks in `packages/mcp/tests/quality/` that answer “is the MCP useful to agents?” rather than only “does it execute?”
+
+**How**:
+
+```bash
+pnpm mcp:evaluate
+```
+
+Add a retrieval case whenever a real agent query should reliably find a rule. Add a review fixture whenever `review_code` gains a new heuristic or previously noisy behavior is fixed.
+
+### 6. Tool performance benchmarks
 
 **What**: In-process latency benchmarks for the main tools; asserts p95 stays within budget and that `review_code` scales sub-linearly as the rule set grows.
 
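
As a rough illustration of the metrics named in the quality evaluation docs above (`Recall@5`, MRR, precision, recall, false-positive rate), the sketch below shows one conventional way to compute them and compare them with the documented thresholds. It is not the code in `packages/mcp/tests/quality/`. Every type and function name here is hypothetical, and it assumes each golden query has exactly one expected rule.

```typescript
// Hypothetical sketch: none of these names come from the repository.
interface RetrievalCase {
  query: string;
  expectedRuleId: string; // assumption: one expected rule per golden query
  rankedRuleIds: string[]; // search results, best match first
}

interface ReviewCounts {
  truePositives: number;
  falsePositives: number;
  trueNegatives: number;
  falseNegatives: number;
}

// Fraction of queries whose expected rule appears in the top k results.
function recallAtK(cases: RetrievalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) => {
    const index = c.rankedRuleIds.indexOf(c.expectedRuleId);
    return index !== -1 && index < k;
  }).length;
  return hits / cases.length;
}

// Mean of 1 / rank of the expected rule; a missing rule contributes 0.
function meanReciprocalRank(cases: RetrievalCase[]): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce((sum, c) => {
    const index = c.rankedRuleIds.indexOf(c.expectedRuleId);
    return sum + (index === -1 ? 0 : 1 / (index + 1));
  }, 0);
  return total / cases.length;
}

// Precision, recall, and false-positive rate from labeled fixture counts.
function reviewMetrics(counts: ReviewCounts) {
  const ratio = (n: number, d: number): number => (d === 0 ? 0 : n / d);
  const { truePositives: tp, falsePositives: fp, trueNegatives: tn, falseNegatives: fn } = counts;
  return {
    precision: ratio(tp, tp + fp),
    recall: ratio(tp, tp + fn),
    falsePositiveRate: ratio(fp, fp + tn),
  };
}

// Threshold values copied from the docs in this diff.
function meetsThresholds(cases: RetrievalCase[], counts: ReviewCounts): boolean {
  const review = reviewMetrics(counts);
  return (
    recallAtK(cases, 5) >= 0.8 &&
    meanReciprocalRank(cases) >= 0.5 &&
    review.precision >= 0.9 &&
    review.recall >= 0.85 &&
    review.falsePositiveRate <= 0.1
  );
}
```

In a Jest test, a check along the lines of `expect(meetsThresholds(cases, counts)).toBe(true)` would make the run fail whenever any single metric regresses, which matches the behavior the docs describe for `pnpm mcp:evaluate`.
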
diff --git a/package.json b/package.json
@@ -51,6 +51,7 @@
     "validate:packages": "tsx scripts/validate/validate-packages.ts",
     "mcp:audit": "tsx scripts/audit/mcp-audit.ts",
     "mcp:audit:security": "pnpm dlx mcp-security-auditor@latest scan packages/mcp/src --fail-on critical",
+    "mcp:evaluate": "pnpm --filter @repo/mcp evaluate",
     "audit:url": "tsx packages/cli/src/index.ts",
     "crawl:site": "tsx packages/crawler/src/run.ts",
     "ci:check": "pnpm run lint && pnpm run typecheck && pnpm run validate:rule-structure && pnpm run validate:guide-structure && pnpm run validate:guides && pnpm run validate:evidence && pnpm run test:ci",
diff --git a/packages/mcp/package.json b/packages/mcp/package.json
@@ -10,6 +10,7 @@
   "main": "./src/index.ts",
   "types": "./src/index.ts",
   "scripts": {
+    "evaluate": "jest tests/quality --runInBand",
     "lint": "biome check .",
     "test": "jest",
     "test:ci": "jest --ci --runInBand --passWithNoTests",
diff --git a/packages/mcp/tests/quality/mcp-quality.test.ts b/packages/mcp/tests/quality/mcp-quality.test.ts