| Evaluate / score / review / audit an Agents.md | Load `agents-md-evaluation.md`, score it, then **automatically continue to the Efficacy Test if the result is PASS** unless the user explicitly asks for rubric-only output |
| Empirically test / measure usefulness / run efficacy trials | Prerequisite gate must be run first; if PASS, follow the Efficacy Test procedure below |
For rubric evaluation, load `agents-md-evaluation.md` first and follow it exactly.
For efficacy testing, the rubric must PASS before proceeding — see Prerequisite below.
> **Default behavior**: When a user asks to "evaluate" an Agents.md without
> qualifying the scope, run the rubric gate AND the efficacy test in sequence.
> Stop between stages only if the rubric result is FAIL. Do not ask for
> confirmation to proceed from gate to efficacy test on a PASS result.
---
# Agents.md Efficacy Test
All trials must be self-contained: agents answer from provided context only.

### Prerequisite
Before running any efficacy trials, evaluate the target Agents.md using `agents-md-evaluation.md` (in this same folder).
- If the result is **PASS** (total_score ≥ 16 and no dimension score is 0): **immediately continue to Step 1 without waiting for user confirmation**.
- If the result is **FAIL**: stop. Report the rubric scores and required fixes. Do not run efficacy trials on a failing Agents.md — low-quality input will produce misleading efficacy results. Fix the document first, re-run the rubric, then return here.
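
Taken together, the rubric gate and the continue-on-PASS default reduce to a small piece of control flow. A minimal sketch, assuming hypothetical `score_rubric` and `run_efficacy_trials` helpers that stand in for the rubric scoring and trial procedures described here (neither name comes from these documents):

```python
# Hypothetical helpers: `score_rubric` applies agents-md-evaluation.md and
# returns per-dimension scores; `run_efficacy_trials` runs Step 1 onward.

def evaluate_agents_md(path: str, rubric_only: bool = False) -> dict:
    scores = score_rubric(path)
    total = sum(scores.values())
    passed = total >= 16 and all(s > 0 for s in scores.values())

    if not passed:
        # FAIL: report scores and required fixes; never run trials on a failing doc.
        return {"gate": "FAIL", "scores": scores, "trials": None}
    if rubric_only:
        # The user explicitly asked for rubric-only output.
        return {"gate": "PASS", "scores": scores, "trials": None}
    # PASS: continue straight to the efficacy test, with no confirmation prompt.
    return {"gate": "PASS", "scores": scores, "trials": run_efficacy_trials(path)}
```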
### Step 1 — Read the target Agents.md and understand the service

---

Use this rubric to evaluate any service-level Agents.md file for agent usefulness and quality.
## When To Use
Use this rubric when asked to:
- Evaluate, score, review, or audit an Agents.md file.
- Compare quality across multiple service Agents.md files.

Score each dimension:

9. Change-risk coverage
10. Audience fit (coding-agent focused)
### Dimension 3 Scoring Notes (Actionability)
Use these anchors when scoring `actionability`:
- **0**: No conditional guidance; no "When Editing" or equivalent triggers.
- **1**: When-Editing conditions exist but are a generic checklist (for example, "if models change, review migration") with no cross-references to specific KPI thresholds or constraint names stated elsewhere in the document.
- **2**: When-Editing conditions are concrete AND at least one trigger explicitly cites a specific KPI value (for example, "p95 ≤ 200 ms") or a named constraint (for example, "auth regression target: 0") from the document's own KPI or Non-Obvious Constraints section, creating a navigational link that agents can follow without re-reading the full document (see the example below).
Rationale: efficacy testing shows that agents follow When-Editing trigger paths but do not spontaneously back-reference KPI tables or constraint sections unless the trigger text itself contains the specific value or name.
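
As a quick illustration of the 1-versus-2 boundary, compare these two invented triggers (the phrasing echoes the examples in the anchors above):

```markdown
<!-- Scores 1: a real trigger, but generic; no KPI value or constraint name -->
- If models change, review migration.

<!-- Scores 2: the trigger itself carries the KPI value and constraint name -->
- If API serializers/views change, verify permission boundaries
  (auth regression target: 0) and confirm p95 API latency stays ≤ 200 ms.
```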

---

## Non-Obvious Constraints

- Treat uploaded payloads as untrusted: enforce size, format, and field validation at the service boundary.
- If changing calibration math or thresholds, include measurable before/after quality data.
- Keep long-running operations out of request thread paths.
- New request fields threaded into calibration logic must not introduce per-request mutable state that races with concurrent calibration; verify the concurrent calibration conflict rate KPI remains 0.
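
A minimal sketch of the safe pattern, assuming a hypothetical `CalibrationEngine` (the class, lock, and field names are illustrative, not this service's actual code): new request fields travel as immutable per-call data, and cross-request bookkeeping stays behind a lock.

```python
import threading
from dataclasses import dataclass

@dataclass(frozen=True)
class CalibrationRequest:
    scene_id: str
    camera_id: str
    exposure_hint: float | None = None  # new request fields live here, immutably

class CalibrationEngine:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active: set[tuple[str, str]] = set()  # in-flight (scene, camera) pairs

    def calibrate(self, req: CalibrationRequest) -> None:
        key = (req.scene_id, req.camera_id)
        with self._lock:
            if key in self._active:
                # A concurrent calibration conflict; the KPI target for these is 0.
                raise RuntimeError(f"calibration already in flight for {key}")
            self._active.add(key)
        try:
            self._run(req)  # reads only `req`; stores no per-request state on self
        finally:
            with self._lock:
                self._active.discard(key)

    def _run(self, req: CalibrationRequest) -> None:
        ...  # calibration math: a function of `req` plus engine configuration
```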
## When Editing This Service
- If you touch calibration strategy selection or execution flow, verify both AprilTag and markerless paths.
- If you touch API contracts, update service docs and client expectations in the same change; also confirm the off-thread constraint holds (long-running work must stay off request threads) and that new fields are validated at the boundary (size, format, field).
- If you touch concurrency code, explicitly test overlapping scene/camera requests and verify the concurrent calibration conflict rate remains 0.
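
One way the overlapping-request check can look as a test, sketched with a thread pool (`calibrate` and `conflict_rate` are hypothetical stand-ins for the service's real entry point and KPI counter):

```python
from concurrent.futures import ThreadPoolExecutor

def test_overlapping_scene_camera_requests():
    # `calibrate` and `conflict_rate` are hypothetical stand-ins; see lead-in.
    pairs = [("scene-a", "cam-1"), ("scene-a", "cam-2"), ("scene-b", "cam-1")]
    with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
        results = list(pool.map(lambda p: calibrate(*p), pairs))
    assert all(r.ok for r in results)
    assert conflict_rate() == 0  # concurrent calibration conflict rate KPI
```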

---

## Non-Obvious Constraints

- Manager is authoritative for configuration metadata; runtime tracking state is external.
- Database migration quality is a production safety issue, not a housekeeping task. Phase-1 migrations must be additive-only (add columns/tables with defaults; never DROP or RENAME in the same migration) to allow safe rollback without downtime; a compliant example is sketched after this list.
- Cross-service workflows depend on contract consistency more than UI presentation details.
- Operational failures often surface as partial workflow completion across services; preserve transactional intent where possible.
- Authorization checks must remain server-side near the protected resource; client-side or middleware-only enforcement is not sufficient.
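
A sketch of what a compliant additive-only phase-1 migration looks like in Django (the app, model, and field names are invented for illustration):

```python
from django.db import migrations, models

class Migration(migrations.Migration):
    # Illustrative dependency; the real app label and predecessor will differ.
    dependencies = [("configurations", "0007_previous")]

    operations = [
        # Additive only: a new column with a default. No DropField or
        # RenameField rides along, so rollback is a clean column drop and
        # running code keeps working against either schema version.
        migrations.AddField(
            model_name="configuration",
            name="calibration_profile",
            field=models.CharField(max_length=64, default="standard"),
        ),
    ]
```

Destructive cleanup (drops, renames) belongs in a later phase, once no running code references the old columns.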

## When Editing This Service

- If models change, include migration review and compatibility notes; confirm the migration is additive-only (no DROP/RENAME) and that `make -C tests django-integration-unit` passes with migration apply/check evidence in the PR notes.
- If API serializers/views change, verify permission boundaries and negative cases; confirm that server-side authorization checks are preserved (auth regression target: 0) and that any new writable field cannot be written by roles without write permission. A negative-case test is sketched after this list.
- If workflow orchestration changes, validate end-to-end behavior across dependent services.
- Any model or serializer change is also a performance touchpoint: confirm p95 API latency stays ≤ 200 ms by running `make -C tests openapi-validation`.
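
A sketch of one negative case for the serializer/view item above, using DRF's test utilities (the endpoint, field name, and `make_read_only_user` fixture are hypothetical):

```python
from rest_framework import status
from rest_framework.test import APITestCase

class WritePermissionNegativeCase(APITestCase):
    def test_read_only_role_cannot_write_new_field(self):
        # `make_read_only_user` is a hypothetical fixture; the endpoint and
        # field name are likewise illustrative.
        self.client.force_authenticate(user=make_read_only_user())
        resp = self.client.patch(
            "/api/configurations/1/", {"calibration_profile": "custom"}
        )
        # Server-side authorization must reject the write (auth regression
        # target: 0); 404 is also acceptable when the view hides objects
        # the role cannot access.
        self.assertIn(
            resp.status_code,
            (status.HTTP_403_FORBIDDEN, status.HTTP_404_NOT_FOUND),
        )
```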