Commit 817a1f1: Refine Agents.md based on evaluation framework results
1 parent c4f1b0d

4 files changed: 31 additions & 13 deletions

.github/skills/agent_evaluation/SKILL.md (10 additions & 5 deletions)

@@ -27,14 +27,19 @@ triggers:
 
 This skill covers two capabilities:
 
-| Ask | Action |
-| ----------------------------------------------------------- | -------------------------------------------- |
-| Evaluate / score / review / audit an Agents.md | Load `agents-md-evaluation.md` and follow it |
-| Empirically test / measure usefulness / run efficacy trials | Follow the Efficacy Test procedure below |
+| Ask | Action |
+| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| Evaluate / score / review / audit an Agents.md | Load `agents-md-evaluation.md`, score it, then **automatically continue to the Efficacy Test if the result is PASS** unless the user explicitly asks for rubric-only output |
+| Empirically test / measure usefulness / run efficacy trials | Prerequisite gate must be run first; if PASS, follow the Efficacy Test procedure below |
 
 For rubric evaluation, load `agents-md-evaluation.md` first and follow it exactly.
 For efficacy testing, the rubric must PASS before proceeding — see Prerequisite below.
 
+> **Default behavior**: When a user asks to "evaluate" an Agents.md without
+> qualifying the scope, run the rubric gate AND the efficacy test in sequence.
+> Stop between stages only if the rubric result is FAIL. Do not ask for
+> confirmation to proceed from gate to efficacy test on a PASS result.
+
 ---
 
 # Agents.md Efficacy Test

@@ -65,7 +70,7 @@ All trials must be self-contained: agents answer from provided context only —
 
 Before running any efficacy trials, evaluate the target Agents.md using `agents-md-evaluation.md` (in this same folder).
 
-- If the result is **PASS** (total_score ≥ 16 and no dimension score is 0): proceed to Step 1.
+- If the result is **PASS** (total_score ≥ 16 and no dimension score is 0): **immediately continue to Step 1 without waiting for user confirmation**.
 - If the result is **FAIL**: stop. Report the rubric scores and required fixes. Do not run efficacy trials on a failing Agents.md — low-quality input will produce misleading efficacy results. Fix the document first, re-run the rubric, then return here.
 
 ### Step 1 — Read the target Agents.md and understand the service
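The prerequisite gate above is purely mechanical, so it can be sketched in a few lines of Python. This is a hypothetical helper, not part of the skill itself; `dimension_scores` is assumed to be the list of ten rubric dimension scores:

```python
def rubric_gate(dimension_scores):
    """Apply the prerequisite gate: PASS iff total_score >= 16
    and no individual dimension scored 0."""
    total = sum(dimension_scores)
    if total >= 16 and all(score > 0 for score in dimension_scores):
        return "PASS"
    return "FAIL"

# A document scoring 2 on eight dimensions and 1 on two passes (total 18);
# a single 0 fails regardless of total.
```

Note the two independent failure modes: a high total cannot compensate for a zeroed dimension, and all-nonzero scores cannot compensate for a total below 16.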

.github/skills/agent_evaluation/agents-md-evaluation.md (13 additions & 3 deletions)

@@ -1,13 +1,13 @@
 <!-- SPDX-FileCopyrightText: (C) 2026 Intel Corporation -->
 <!-- SPDX-License-Identifier: Apache-2.0 -->
 
-# Agents.md Evaluation Skill
+# Agents.md Evaluation Rubric
 
-Use this skill to evaluate any service-level Agents.md file for agent usefulness and quality.
+Use this rubric to evaluate any service-level Agents.md file for agent usefulness and quality.
 
 ## When To Use
 
-Use this skill when asked to:
+Use this rubric when asked to:
 
 - Evaluate, score, review, or audit an Agents.md file.
 - Compare quality across multiple service Agents.md files.

@@ -49,6 +49,16 @@ Score each dimension:
 9. Change-risk coverage
 10. Audience fit (coding-agent focused)
 
+### Dimension 3 Scoring Notes (Actionability)
+
+Use these anchors when scoring `actionability`:
+
+- **0**: No conditional guidance; no "When Editing" or equivalent triggers.
+- **1**: When-Editing conditions exist but are a generic checklist (for example, "if models change, review migration") with no cross-references to specific KPI thresholds or constraint names stated elsewhere in the document.
+- **2**: When-Editing conditions are concrete AND at least one trigger explicitly cites a specific KPI value (for example, "p95 ≤ 200 ms") or a named constraint (for example, "auth regression target: 0") from the document's own KPI or Non-Obvious Constraints section, creating a navigational link that agents can follow without re-reading the full document.
+
+Rationale: efficacy testing shows that agents follow When-Editing trigger paths but do not spontaneously back-reference KPI tables or constraint sections unless the trigger text itself contains the specific value or name.
+
 ### Dimension 7 Scoring Notes (Verification Expectations)
 
 Use these anchors when scoring `verification_expectations`:
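The new Dimension 3 anchors are concrete enough to automate as a rough lint. A minimal, hypothetical sketch, assuming triggers arrive as a list of bullet strings and using a deliberately narrow regex that covers only the two example forms the anchors cite:

```python
import re

# Matches only the two cited citation forms: a percentile KPI
# ("p95 ≤ 200 ms") or a named numeric target ("auth regression target: 0").
KPI_CITATION = re.compile(r"p\d+\s*[≤<=≥>]+\s*\d+|target:\s*\d+")

def actionability_score(triggers):
    """Score per the Dimension 3 anchors: 0 = no triggers,
    1 = generic triggers only, 2 = a trigger cites a KPI value or target."""
    if not triggers:
        return 0
    if any(KPI_CITATION.search(t) for t in triggers):
        return 2
    return 1
```

A real checker would also verify the cited value actually appears in the document's KPI or Non-Obvious Constraints section; this sketch checks trigger text only.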

autocalibration/Agents.md (3 additions & 2 deletions)

@@ -39,12 +39,13 @@ SPDX-License-Identifier: Apache-2.0
 - Treat uploaded payloads as untrusted: enforce size, format, and field validation at service boundary.
 - If changing calibration math or thresholds, include measurable before/after quality data.
 - Keep long-running operations out of request thread paths.
+- New request fields threaded into calibration logic must not introduce per-request mutable state that races with concurrent calibration; verify the concurrent calibration conflict rate KPI remains 0.
 
 ## When Editing This Service
 
 - If you touch calibration strategy selection or execution flow, verify both AprilTag and markerless paths.
-- If you touch API contracts, update service docs and client expectations in the same change.
-- If you touch concurrency code, explicitly test overlapping scene/camera requests.
+- If you touch API contracts, update service docs and client expectations in the same change; also confirm the off-thread constraint holds (long-running work must stay off request threads) and that new fields are validated at the boundary (size, format, field).
+- If you touch concurrency code, explicitly test overlapping scene/camera requests and verify the concurrent calibration conflict rate remains 0.
 
 ## Verification Gate (Standardized)
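The overlapping-request requirement in the concurrency bullet can be exercised with a small thread-pool harness. A hedged sketch: `calibrate` here is a hypothetical stand-in that keeps all working state request-local; the service's real calibration entry point and KPI plumbing are assumed, not shown:

```python
from concurrent.futures import ThreadPoolExecutor

def calibrate(scene_id, samples):
    # All working state is local to the request; no module-level
    # accumulators, so overlapping requests cannot race on shared state.
    total = 0.0
    for s in samples:
        total += s
    return scene_id, total / len(samples)

def overlap_conflicts(n_requests=32):
    """Fire overlapping scene requests and count cross-request bleed:
    a conflict is any result that does not match its own inputs."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(calibrate, i, [i, i + 2])
                   for i in range(n_requests)]
        results = dict(f.result() for f in futures)
    # Scene i submitted samples [i, i+2], so its mean must be exactly i + 1.
    return sum(1 for i, mean in results.items() if mean != i + 1)
```

In this framing, the "concurrent calibration conflict rate remains 0" check corresponds to `overlap_conflicts()` returning 0.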

manager/Agents.md (5 additions & 3 deletions)

@@ -19,9 +19,10 @@ SPDX-License-Identifier: Apache-2.0
 ## Non-Obvious Constraints
 
 - Manager is authoritative for configuration metadata; runtime tracking state is external.
-- Database migration quality is a production safety issue, not a housekeeping task.
+- Database migration quality is a production safety issue, not a housekeeping task. Phase-1 migrations must be additive-only (add columns/tables with defaults; never DROP or RENAME in the same migration) to allow safe rollback without downtime.
 - Cross-service workflows depend on contract consistency more than UI presentation details.
 - Operational failures often surface as partial workflow completion across services; preserve transactional intent where possible.
+- Authorization checks must remain server-side near the protected resource; client-side or middleware-only enforcement is not sufficient.
 
 ## KPI Targets

@@ -42,9 +43,10 @@ SPDX-License-Identifier: Apache-2.0
 
 ## When Editing This Service
 
-- If models change, include migration review and compatibility notes.
-- If API serializers/views change, verify permission boundaries and negative cases.
+- If models change, include migration review and compatibility notes; confirm the migration is additive-only (no DROP/RENAME) and that `make -C tests django-integration-unit` passes with migration apply/check evidence in the PR notes.
+- If API serializers/views change, verify permission boundaries and negative cases; confirm that server-side authorization checks are preserved (auth regression target: 0) and that any new writable field cannot be written by roles without write permission.
 - If workflow orchestration changes, validate end-to-end behavior across dependent services.
+- Any model or serializer change is also a performance touchpoint: confirm p95 API latency stays ≤ 200 ms by running `make -C tests openapi-validation`.
 
 ## Verification Gate (Standardized)
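The additive-only migration rule added above is easy to lint mechanically. A minimal sketch (a hypothetical helper, assuming Django-style operation class names such as `AddField` and `RemoveField`; it inspects names only, not operation arguments):

```python
# Phase-1 forbids dropping or renaming in the same migration; additive
# operations (AddField, CreateModel, AddIndex, ...) are allowed.
FORBIDDEN_OPS = {"RemoveField", "DeleteModel", "RenameField", "RenameModel"}

def check_additive_only(operation_names):
    """Return (ok, offenders) for a migration's operation class names."""
    offenders = [op for op in operation_names if op in FORBIDDEN_OPS]
    return (not offenders, offenders)
```

For example, `check_additive_only(["AddField", "CreateModel"])` is additive, while a migration containing `"RenameField"` would be flagged for review.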
