Commit 2c21a08

docs: define harness evaluator checklist (#129)
Co-authored-by: devkade <devkade@users.noreply.github.com>
1 parent 87f20d4 commit 2c21a08

2 files changed

Lines changed: 90 additions & 0 deletions

README.md

Lines changed: 3 additions & 0 deletions
@@ -223,6 +223,8 @@ Candidate vocabulary is deliberately small and additive: `ContractPreset`, `Evid

 Implementation rhythm for the RunContract track is behavior-preserving: document the boundary first, add the generic projection second, add evidence/completion primitives third, add advisory quality hints fourth, render compact supervisor status fifth, and only then map optional external workflow adapter semantics. Each slice should keep existing workflow APIs, `WorkflowState`, `WorkflowDefinition`, artifacts, validation gates, and CLI output backward-compatible except for intentional additive fields or sections.

+RunContract scoring, preset, and governance changes should use the [`docs/runcontract-harness-evaluator.md`](docs/runcontract-harness-evaluator.md) checklist to separate real harness quality from visible metric optimization. The checklist is advisory and does not add completion authority, runtime gates, kapi-agent policy, or score hard-blocking behavior.
+
 ## Thin Harness Standard

 Kapi is evaluated as a thin harness, not just a feature surface. When no workflow is active, Kapi should stay transparent: no hidden workflow activation, no workflow artifacts, no workers, no tool blocking, and no heavy UI ownership.
@@ -259,6 +261,7 @@ Kapi uses Pi extension surfaces as thin safety rails rather than a separate orch
 - `README.md` — human-facing overview and operating model.
 - `GOAL.md` — completeness objective and P0-P5 gates.
 - `docs/chedex-completeness.md` — Chedex comparison boundary and intentional Pi-native differences.
+- `docs/runcontract-harness-evaluator.md` — evaluator and anti-Goodhart checklist for RunContract scoring, presets, and harness-governance changes.
 - `docs/ralph-live-qa.md` — operator live QA checklist for proving `/kapi-ralph` start, planning, approval, build, evidence, closeout, and resume behavior in a real Pi/Kapi runtime.
 - `skills/kapi-workflow/SKILL.md` — active-workflow behavior reminders for agents.
 - `prompts/` — Kapi prompt resources exposed to Pi.

docs/runcontract-harness-evaluator.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# RunContract Harness Evaluator and Anti-Goodhart Checklist

This checklist is the evaluation standard for RunContract, preset, and scoring changes. Use it to decide whether a change improves Kapi/Ilchul as a quality harness, not just whether it produces cleaner-looking status output or easier-to-pass scores.

The checklist is advisory governance. It does not add runtime gates, kapi-agent policy, score hard-blocking, or a module/plugin framework.

## Operating Principle

A harness improvement should make true workflow quality easier to see and weak workflow quality harder to hide.

RunContract quality signals may summarize, warn, or steer, but completion authority stays with the underlying workflow contracts, evidence records, verifier reviews, explicit human decisions, and adapter-level review/merge gates.

## Evaluation Checklist

For each RunContract or preset change, review these dimensions before treating a score, hint, or rendered status as useful.

| Dimension | Pass question | Warning signs |
| --- | --- | --- |
| Objective clarity | Is the workflow objective concrete enough that success and failure can be distinguished without reading the implementer's intent? | Generic goals, vague "quality" claims, hidden target changes, or multiple objectives collapsed into one score. |
| Evidence integrity | Does the signal point to durable evidence that can be inspected independently? | Evidence counted by filename alone, stale artifact refs, unchecked command output, or ledger fields trusted without source artifact validation. |
| Verifier independence | Is the verifier meaningfully separate from the producer of the work or metric? | Self-approval, same prompt producing and approving, reviewer text without changed-file/evidence references, or no current-head freshness check. |
| Artifact usefulness | Would the artifact help the next operator understand, resume, or audit the run? | Artifacts that restate status only, omit decisions/trade-offs, hide rejected paths, or are too verbose to scan. |
| Regression protection | Does the change protect previously working behavior and known boundaries? | No targeted regression check, backward-incompatible schema/output changes without a migration issue, or changed defaults with only snapshot updates. |
| Benchmark robustness | Is the benchmark hard to game while still matching the real objective? | Narrow benchmark fixtures, command-string-only checks, magic constants, optimizing one metric while degrading guardrails, or no failure-mode examples. |
| Context hygiene | Is the signal derived from the right layer and a bounded context window? | GitHub/PR/Ragna/Discord semantics leaking into RunContract core, stale worker registry fields treated as truth, or product-brand names used as reusable code identifiers. |
| Human override | Is there an explicit path for the human/project owner to accept, reject, or defer advisory output? | Score presented as authority, no rationale for overrides, hidden hard-blocks, or tool behavior that blocks unrelated ordinary Pi work. |
| Anti-Goodhart resilience | Does the design check whether optimizing the metric can harm the harness objective? | Incentives to write evidence-shaped prose, make benchmarks narrower, split artifacts for score only, or suppress uncertainty to earn a green status. |
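
The dimensions above can be tracked per change with a small review record. The following is a minimal sketch, not Kapi's implementation: `DimensionReview`, `unreviewed_dimensions`, `needs_attention`, and the snake_case dimension keys are hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Dimension keys mirror the checklist table; the identifiers are illustrative.
DIMENSIONS = (
    "objective_clarity",
    "evidence_integrity",
    "verifier_independence",
    "artifact_usefulness",
    "regression_protection",
    "benchmark_robustness",
    "context_hygiene",
    "human_override",
    "anti_goodhart_resilience",
)


@dataclass
class DimensionReview:
    """One reviewer's answer to a dimension's pass question."""
    dimension: str
    passes: bool
    warning_signs: list[str] = field(default_factory=list)


def unreviewed_dimensions(reviews: list[DimensionReview]) -> list[str]:
    """Return the dimensions that have no recorded review yet."""
    seen = {r.dimension for r in reviews}
    return [d for d in DIMENSIONS if d not in seen]


def needs_attention(reviews: list[DimensionReview]) -> list[str]:
    """Return dimensions that failed or carry warning signs (advisory only)."""
    return [r.dimension for r in reviews if not r.passes or r.warning_signs]
```

Both helpers only produce a list of things for a reviewer to look at next; neither decides whether a workflow is complete.
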

## Advisory Signals vs Completion Authority

RunContract may expose advisory quality signals such as `pass`, `warn`, `fail`, reasons, or compact dimensions. Those signals answer: "What should a supervisor look at next?"

They must not answer by themselves: "Is this workflow complete?"

Completion authority remains with:

- workflow lifecycle and validation rules;
- required artifacts and freshness checks;
- concrete command/evidence records;
- independent verifier or reviewer decisions where the workflow requires them;
- explicit human approval for design-sensitive or externally visible decisions;
- adapter-level gates for external systems, such as issue/PR/review/merge state.

A green quality signal without valid completion evidence is only a green advisory signal. A valid completion record with warning quality signals may still be complete, but the warnings should be visible to the supervisor.
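
The separation described above can be sketched by keeping the advisory signal out of the completion function's inputs entirely. This is an illustrative sketch under assumed names: `Advisory`, `CompletionRecord`, `is_complete`, and `supervisor_status` are hypothetical and do not come from the Kapi codebase.

```python
from dataclasses import dataclass
from enum import Enum


class Advisory(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass(frozen=True)
class CompletionRecord:
    """Hypothetical stand-in for the completion authorities listed above."""
    lifecycle_valid: bool
    artifacts_fresh: bool
    evidence_recorded: bool
    verifier_approved: bool      # True when the workflow needs no verifier
    human_approved: bool         # True when no human approval is required
    adapter_gates_passed: bool   # True when no external adapter is involved


def is_complete(record: CompletionRecord) -> bool:
    """Completion depends only on the authorities, never on the advisory signal."""
    return all((
        record.lifecycle_valid,
        record.artifacts_fresh,
        record.evidence_recorded,
        record.verifier_approved,
        record.human_approved,
        record.adapter_gates_passed,
    ))


def supervisor_status(record: CompletionRecord, advisory: Advisory) -> str:
    """Render completion and the advisory signal side by side."""
    state = "complete" if is_complete(record) else "incomplete"
    return f"{state} (advisory: {advisory.value})"
```

Because `is_complete` never receives the advisory value, a green signal cannot complete a run, and a warning stays visible next to a valid completion record.
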

## Anti-Goodhart Checks

Before accepting a new score, hint, preset default, or rendered status as useful, ask:

1. **Metric substitution:** What real harness property could this visible metric replace or obscure?
2. **Easy gaming path:** What is the cheapest way to make this signal green without improving the run?
3. **Counterexample:** Can a bad run pass this signal? Can a good run fail it for an understandable reason?
4. **Evidence anchoring:** Which artifact, command, ledger entry, or reviewer decision proves the signal's claim?
5. **Freshness:** Is the signal attached to the current run/head/artifact version rather than stale state?
6. **Layer boundary:** Does the signal belong in generic RunContract core, or only in an adapter/reporting layer?
7. **Override trail:** If a human overrides the advisory signal, is the reason visible for future audits?
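
The "easy gaming path" and "counterexample" questions can be exercised by feeding a deliberately bad run to a candidate signal. The sketch below assumes a hypothetical checklist-item shape (`checked`, `evidence_refs`) and two hypothetical signal functions; it shows the technique, not an existing RunContract API.

```python
from __future__ import annotations


def naive_completeness(items: list[dict]) -> str:
    """Weak signal: counts checked items and ignores evidence."""
    return "pass" if all(item["checked"] for item in items) else "warn"


def anchored_completeness(items: list[dict]) -> str:
    """Stronger signal: a checked item also needs at least one evidence ref."""
    ok = all(item["checked"] and item.get("evidence_refs") for item in items)
    return "pass" if ok else "warn"


# A deliberately bad run: every box is ticked, but nothing backs the claims.
bad_run = [
    {"criterion": "tests pass", "checked": True, "evidence_refs": []},
    {"criterion": "docs updated", "checked": True, "evidence_refs": []},
]

# The cheapest gaming path (ticking boxes) turns the naive signal green.
print(naive_completeness(bad_run))
# The anchored signal still asks the supervisor to look.
print(anchored_completeness(bad_run))
```

A signal that the bad run can turn green is a candidate for rejection or for stronger evidence anchoring.
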

### RunContract and preset examples

- A `Completeness: pass` hint is useful only if it points to required criteria and evidence refs; it is weak if it only counts completed checklist items.
- An `Evidence: pass` hint should validate evidence shape and freshness; it should not accept prose that merely says "tests passed".
- A preset that improves rendered status labels but makes the underlying artifact less useful is not a harness improvement.
- A benchmark that rewards fewer warnings can be harmful if it encourages hiding uncertainty or dropping edge-case checks.
- A GitHub adapter can report review freshness, but RunContract core must not learn GitHub PR or kapi-agent authority semantics.
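
As an illustration of the `Evidence: pass` example, a hint can check the shape and freshness of a structured record instead of trusting prose. This is a sketch under assumptions: `EvidenceRecord`, its field names, and `evidence_hint` are hypothetical, and the head values and command strings used with it are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceRecord:
    """Hypothetical evidence entry; field names are illustrative."""
    command: Optional[str]       # exact command that was run, if any
    exit_code: Optional[int]     # recorded exit status of that command
    artifact_ref: Optional[str]  # path or ID of the captured output
    head: Optional[str]          # commit the evidence was produced against


def evidence_hint(record: EvidenceRecord, current_head: str) -> tuple[str, list[str]]:
    """Return an advisory level plus reasons; never a completion decision."""
    reasons: list[str] = []
    if not record.command or record.exit_code is None:
        reasons.append("no recorded command and exit code")
    elif record.exit_code != 0:
        reasons.append("recorded command failed")
    if not record.artifact_ref:
        reasons.append("no inspectable artifact ref")
    if record.head != current_head:
        reasons.append("evidence is not attached to the current head")
    return ("pass" if not reasons else "warn", reasons)
```

A record that only carries the sentence "tests passed" has no command, artifact, or head, so it yields `warn` with three reasons. The reasons are returned alongside the level so a supervisor can see why the hint is not green.
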

## Level 0/1/2/3 Design-Decision Sensitivity

Use this scale to decide how much evidence and human review a design change needs.

| Level | Change type | Required sensitivity |
| --- | --- | --- |
| Level 0 | Wording, rendering, links, or documentation that does not change behavior or authority. | Keep concise, preserve boundaries, run docs/relevant checks. |
| Level 1 | Additive advisory fields, quality reasons, or preset metadata that do not block tools or completion. | Prove old behavior is preserved, document advisory status, add targeted regression checks when code changes. |
| Level 2 | Changes that affect workflow defaults, completion interpretation, benchmark policy, artifact requirements, or adapter decisions. | Require explicit acceptance criteria, migration/backward-compatibility review, anti-Goodhart examples, and independent verification evidence. |
| Level 3 | Storage/API/default flips, runtime enforcement, hard gates, bot policy changes, or cross-layer authority changes. | Open a separate design issue before implementation; require human approval, rollback/migration plan, and proof that core/adapters remain correctly separated. |

For #114 Track D, this checklist is a Level 0 governance artifact. Later scoring/module/runtime changes should reference it and classify themselves before implementation.
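
The scale can be read as "take the highest level whose trigger applies, then look up what that level requires". The sketch below is one possible encoding under assumptions: `Level`, `REQUIRED`, `classify`, and its three boolean flags are hypothetical names, and the requirement strings paraphrase the table rather than quote a real configuration.

```python
from enum import IntEnum


class Level(IntEnum):
    L0 = 0  # wording, rendering, links, docs
    L1 = 1  # additive advisory fields or preset metadata
    L2 = 2  # defaults, completion interpretation, benchmark policy
    L3 = 3  # storage/API/default flips, enforcement, authority changes


# Requirements paraphrased from the sensitivity table.
REQUIRED = {
    Level.L0: ["keep concise", "preserve boundaries", "run docs/relevant checks"],
    Level.L1: ["prove old behavior is preserved", "document advisory status",
               "targeted regression checks when code changes"],
    Level.L2: ["explicit acceptance criteria",
               "migration/backward-compatibility review",
               "anti-Goodhart examples", "independent verification evidence"],
    Level.L3: ["separate design issue before implementation", "human approval",
               "rollback/migration plan", "proof of core/adapter separation"],
}


def classify(*, adds_advisory_fields: bool = False,
             affects_defaults_or_completion: bool = False,
             changes_enforcement_or_authority: bool = False) -> Level:
    """Return the highest level whose trigger applies (flags are hypothetical)."""
    if changes_enforcement_or_authority:
        return Level.L3
    if affects_defaults_or_completion:
        return Level.L2
    if adds_advisory_fields:
        return Level.L1
    return Level.L0
```

Checking the strongest trigger first means a change that both adds an advisory field and flips a default is classified by the default flip, not by the milder change.
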

## PR Review Use

A PR that changes RunContract scoring, preset behavior, or harness governance should include:

- the level classification;
- which checklist dimensions were affected;
- at least one anti-Goodhart example or counterexample for Level 1+ scoring/preset changes;
- exact verification commands or evidence refs;
- a statement that advisory signals remain separate from completion authority, or a linked design issue authorizing stronger authority.
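
A reviewer could turn the list above into a quick completeness check on a PR description. The following is a hedged sketch only: `PrSummary`, its fields, and `missing_items` are hypothetical, and how the fields would be extracted from a real PR is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PrSummary:
    """Hypothetical summary of what a PR description declares."""
    level: Optional[int] = None
    dimensions: list[str] = field(default_factory=list)
    scoring_or_preset_change: bool = False
    anti_goodhart_examples: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    keeps_advisory_separate: bool = False
    design_issue: Optional[str] = None


def missing_items(pr: PrSummary) -> list[str]:
    """List the review items the PR description still lacks (advisory only)."""
    missing: list[str] = []
    if pr.level is None:
        missing.append("level classification")
    if not pr.dimensions:
        missing.append("affected checklist dimensions")
    # Examples are only expected for Level 1+ scoring/preset changes.
    needs_example = (pr.scoring_or_preset_change
                     and pr.level is not None and pr.level >= 1)
    if needs_example and not pr.anti_goodhart_examples:
        missing.append("anti-Goodhart example or counterexample")
    if not pr.verification:
        missing.append("verification commands or evidence refs")
    if not (pr.keeps_advisory_separate or pr.design_issue):
        missing.append("advisory/authority statement or linked design issue")
    return missing
```

In keeping with the checklist's advisory status, the result is a reminder for the reviewer and is not meant to block a merge by itself.
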
