Commit 817a1f1: Refine Agents.md based on evaluation framework results
1 parent c4f1b0d

4 files changed: 31 additions & 13 deletions

.github/skills/agent_evaluation/SKILL.md (10 additions & 5 deletions)

@@ -27,14 +27,19 @@ triggers:
 
 This skill covers two capabilities:
 
-| Ask | Action |
-| ----------------------------------------------------------- | -------------------------------------------- |
-| Evaluate / score / review / audit an Agents.md | Load `agents-md-evaluation.md` and follow it |
-| Empirically test / measure usefulness / run efficacy trials | Follow the Efficacy Test procedure below |
+| Ask | Action |
+| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Evaluate / score / review / audit an Agents.md | Load `agents-md-evaluation.md`, score it, then **automatically continue to the Efficacy Test if the result is PASS** unless the user explicitly asks for rubric-only output |
+| Empirically test / measure usefulness / run efficacy trials | Prerequisite gate must be run first; if PASS, follow the Efficacy Test procedure below |
 
 For rubric evaluation, load `agents-md-evaluation.md` first and follow it exactly.
 For efficacy testing, the rubric must PASS before proceeding — see Prerequisite below.
 
+> **Default behavior**: When a user asks to "evaluate" an Agents.md without
+> qualifying the scope, run the rubric gate AND the efficacy test in sequence.
+> Stop between stages only if the rubric result is FAIL. Do not ask for
+> confirmation to proceed from gate to efficacy test on a PASS result.
+
 ---
 
 # Agents.md Efficacy Test

@@ -65,7 +70,7 @@ All trials must be self-contained: agents answer from provided context only —
 
 Before running any efficacy trials, evaluate the target Agents.md using `agents-md-evaluation.md` (in this same folder).
 
-- If the result is **PASS** (total_score ≥ 16 and no dimension score is 0): proceed to Step 1.
+- If the result is **PASS** (total_score ≥ 16 and no dimension score is 0): **immediately continue to Step 1 without waiting for user confirmation**.
 - If the result is **FAIL**: stop. Report the rubric scores and required fixes. Do not run efficacy trials on a failing Agents.md — low-quality input will produce misleading efficacy results. Fix the document first, re-run the rubric, then return here.
 
 ### Step 1 — Read the target Agents.md and understand the service
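The prerequisite gate above is purely mechanical, so it can be sketched in a few lines of Python. This is a hypothetical helper, not part of the skill itself; `dimension_scores` is assumed to be the list of ten rubric dimension scores:

```python
def rubric_gate(dimension_scores):
    """Apply the prerequisite gate: PASS iff total_score >= 16
    and no individual dimension scored 0."""
    total = sum(dimension_scores)
    if total >= 16 and all(score > 0 for score in dimension_scores):
        return "PASS"
    return "FAIL"

# A document scoring 2 on eight dimensions and 1 on two passes (total 18);
# a single 0 fails regardless of total.
```

Note the two independent failure modes: a high total cannot compensate for a zeroed dimension, and all-nonzero scores cannot compensate for a total below 16.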

.github/skills/agent_evaluation/agents-md-evaluation.md (13 additions & 3 deletions)

@@ -1,13 +1,13 @@
 <!-- SPDX-FileCopyrightText: (C) 2026 Intel Corporation -->
 <!-- SPDX-License-Identifier: Apache-2.0 -->
 
-# Agents.md Evaluation Skill
+# Agents.md Evaluation Rubric
 
-Use this skill to evaluate any service-level Agents.md file for agent usefulness and quality.
+Use this rubric to evaluate any service-level Agents.md file for agent usefulness and quality.
 
 ## When To Use
 
-Use this skill when asked to:
+Use this rubric when asked to:
 
 - Evaluate, score, review, or audit an Agents.md file.
 - Compare quality across multiple service Agents.md files.

@@ -49,6 +49,16 @@ Score each dimension:
 9. Change-risk coverage
 10. Audience fit (coding-agent focused)
 
+### Dimension 3 Scoring Notes (Actionability)
+
+Use these anchors when scoring `actionability`:
+
+- **0**: No conditional guidance; no "When Editing" or equivalent triggers.
+- **1**: When-Editing conditions exist but are a generic checklist (for example, "if models change, review migration") with no cross-references to specific KPI thresholds or constraint names stated elsewhere in the document.
+- **2**: When-Editing conditions are concrete AND at least one trigger explicitly cites a specific KPI value (for example, "p95 ≤ 200 ms") or a named constraint (for example, "auth regression target: 0") from the document's own KPI or Non-Obvious Constraints section, creating a navigational link that agents can follow without re-reading the full document.
+
+Rationale: efficacy testing shows that agents follow When-Editing trigger paths but do not spontaneously back-reference KPI tables or constraint sections unless the trigger text itself contains the specific value or name.
+
 ### Dimension 7 Scoring Notes (Verification Expectations)
 
 Use these anchors when scoring `verification_expectations`:
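The new Dimension 3 anchors are concrete enough to automate as a rough lint. A minimal, hypothetical sketch, assuming triggers arrive as a list of bullet strings and using a deliberately narrow regex that covers only the two example forms the anchors cite:

```python
import re

# Matches only the two cited citation forms: a percentile KPI
# ("p95 ≤ 200 ms") or a named numeric target ("auth regression target: 0").
KPI_CITATION = re.compile(r"p\d+\s*[≤<=≥>]+\s*\d+|target:\s*\d+")

def actionability_score(triggers):
    """Score per the Dimension 3 anchors: 0 = no triggers,
    1 = generic triggers only, 2 = a trigger cites a KPI value or target."""
    if not triggers:
        return 0
    if any(KPI_CITATION.search(t) for t in triggers):
        return 2
    return 1
```

A real checker would also verify the cited value actually appears in the document's KPI or Non-Obvious Constraints section; this sketch checks trigger text only.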

autocalibration/Agents.md (3 additions & 2 deletions)

@@ -39,12 +39,13 @@ SPDX-License-Identifier: Apache-2.0
 - Treat uploaded payloads as untrusted: enforce size, format, and field validation at service boundary.
 - If changing calibration math or thresholds, include measurable before/after quality data.
 - Keep long-running operations out of request thread paths.
+- New request fields threaded into calibration logic must not introduce per-request mutable state that races with concurrent calibration; verify the concurrent calibration conflict rate KPI remains 0.
 
 ## When Editing This Service
 
 - If you touch calibration strategy selection or execution flow, verify both AprilTag and markerless paths.
-- If you touch API contracts, update service docs and client expectations in the same change.
-- If you touch concurrency code, explicitly test overlapping scene/camera requests.
+- If you touch API contracts, update service docs and client expectations in the same change; also confirm the off-thread constraint holds (long-running work must stay off request threads) and that new fields are validated at the boundary (size, format, field).
+- If you touch concurrency code, explicitly test overlapping scene/camera requests and verify the concurrent calibration conflict rate remains 0.
 
 ## Verification Gate (Standardized)
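The overlapping-request requirement in the concurrency bullet can be exercised with a small thread-pool harness. A hedged sketch: `calibrate` here is a hypothetical stand-in that keeps all working state request-local; the service's real calibration entry point and KPI plumbing are assumed, not shown:

```python
from concurrent.futures import ThreadPoolExecutor

def calibrate(scene_id, samples):
    # All working state is local to the request; no module-level
    # accumulators, so overlapping requests cannot race on shared state.
    total = 0.0
    for s in samples:
        total += s
    return scene_id, total / len(samples)

def overlap_conflicts(n_requests=32):
    """Fire overlapping scene requests and count cross-request bleed:
    a conflict is any result that does not match its own inputs."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(calibrate, i, [i, i + 2])
                   for i in range(n_requests)]
        results = dict(f.result() for f in futures)
    # Scene i submitted samples [i, i+2], so its mean must be exactly i + 1.
    return sum(1 for i, mean in results.items() if mean != i + 1)
```

In this framing, the "concurrent calibration conflict rate remains 0" check corresponds to `overlap_conflicts()` returning 0.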

manager/Agents.md (5 additions & 3 deletions)

@@ -19,9 +19,10 @@ SPDX-License-Identifier: Apache-2.0
 ## Non-Obvious Constraints
 
 - Manager is authoritative for configuration metadata; runtime tracking state is external.
-- Database migration quality is a production safety issue, not a housekeeping task.
+- Database migration quality is a production safety issue, not a housekeeping task. Phase-1 migrations must be additive-only (add columns/tables with defaults; never DROP or RENAME in the same migration) to allow safe rollback without downtime.
 - Cross-service workflows depend on contract consistency more than UI presentation details.
 - Operational failures often surface as partial workflow completion across services; preserve transactional intent where possible.
+- Authorization checks must remain server-side near the protected resource; client-side or middleware-only enforcement is not sufficient.
 
 ## KPI Targets

@@ -42,9 +43,10 @@ SPDX-License-Identifier: Apache-2.0
 
 ## When Editing This Service
 
-- If models change, include migration review and compatibility notes.
-- If API serializers/views change, verify permission boundaries and negative cases.
+- If models change, include migration review and compatibility notes; confirm the migration is additive-only (no DROP/RENAME) and that `make -C tests django-integration-unit` passes with migration apply/check evidence in the PR notes.
+- If API serializers/views change, verify permission boundaries and negative cases; confirm that server-side authorization checks are preserved (auth regression target: 0) and that any new writable field cannot be written by roles without write permission.
 - If workflow orchestration changes, validate end-to-end behavior across dependent services.
+- Any model or serializer change is also a performance touchpoint: confirm p95 API latency stays ≤ 200 ms by running `make -C tests openapi-validation`.
 
 ## Verification Gate (Standardized)
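The additive-only migration rule added above is easy to lint mechanically. A minimal sketch (a hypothetical helper, assuming Django-style operation class names such as `AddField` and `RemoveField`; it inspects names only, not operation arguments):

```python
# Phase-1 forbids dropping or renaming in the same migration; additive
# operations (AddField, CreateModel, AddIndex, ...) are allowed.
FORBIDDEN_OPS = {"RemoveField", "DeleteModel", "RenameField", "RenameModel"}

def check_additive_only(operation_names):
    """Return (ok, offenders) for a migration's operation class names."""
    offenders = [op for op in operation_names if op in FORBIDDEN_OPS]
    return (not offenders, offenders)
```

For example, `check_additive_only(["AddField", "CreateModel"])` is additive, while a migration containing `"RenameField"` would be flagged for review.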
