docs: define harness evaluator checklist (#129)

devkade · web-flow · commit 2c21a08afb02 · 2026-05-15T23:12:40.000+09:00
Co-authored-by: devkade <devkade@users.noreply.github.com>
diff --git a/README.md b/README.md
@@ -223,6 +223,8 @@ Candidate vocabulary is deliberately small and additive: `ContractPreset`, `Evid
 
 Implementation rhythm for the RunContract track is behavior-preserving: document the boundary first, add the generic projection second, add evidence/completion primitives third, add advisory quality hints fourth, render compact supervisor status fifth, and only then map optional external workflow adapter semantics. Each slice should keep existing workflow APIs, `WorkflowState`, `WorkflowDefinition`, artifacts, validation gates, and CLI output backward-compatible except for intentional additive fields or sections.
 
+RunContract scoring, preset, and governance changes should use the [`docs/runcontract-harness-evaluator.md`](docs/runcontract-harness-evaluator.md) checklist to separate real harness quality from visible metric optimization. The checklist is advisory and does not add completion authority, runtime gates, kapi-agent policy, or score hard-blocking behavior.
+
 ## Thin Harness Standard
 
 Kapi is evaluated as a thin harness, not just a feature surface. When no workflow is active, Kapi should stay transparent: no hidden workflow activation, no workflow artifacts, no workers, no tool blocking, and no heavy UI ownership.
@@ -259,6 +261,7 @@ Kapi uses Pi extension surfaces as thin safety rails rather than a separate orch
 - `README.md` — human-facing overview and operating model.
 - `GOAL.md` — completeness objective and P0-P5 gates.
 - `docs/chedex-completeness.md` — Chedex comparison boundary and intentional Pi-native differences.
+- `docs/runcontract-harness-evaluator.md` — evaluator and anti-Goodhart checklist for RunContract scoring, presets, and harness-governance changes.
 - `docs/ralph-live-qa.md` — operator live QA checklist for proving `/kapi-ralph` start, planning, approval, build, evidence, closeout, and resume behavior in a real Pi/Kapi runtime.
 - `skills/kapi-workflow/SKILL.md` — active-workflow behavior reminders for agents.
 - `prompts/` — Kapi prompt resources exposed to Pi.
diff --git a/docs/runcontract-harness-evaluator.md b/docs/runcontract-harness-evaluator.md
@@ -0,0 +1,87 @@
+# RunContract Harness Evaluator and Anti-Goodhart Checklist
+
+This checklist is the evaluation standard for RunContract, preset, and scoring changes. Use it to decide whether a change improves Kapi/Ilchul as a quality harness, not just whether it produces cleaner-looking status output or easier-to-pass scores.
+
+The checklist is advisory governance. It does not add runtime gates, kapi-agent policy, score hard-blocking, or a module/plugin framework.
+
+## Operating Principle
+
+A harness improvement should make true workflow quality easier to see and weak workflow quality harder to hide.
+
+RunContract quality signals may summarize, warn, or steer, but completion authority stays with the underlying workflow contracts, evidence records, verifier reviews, explicit human decisions, and adapter-level review/merge gates.
+
+## Evaluation Checklist
+
+For each RunContract or preset change, review these dimensions before treating a score, hint, or rendered status as useful.
+
+| Dimension | Pass question | Warning signs |
+| --- | --- | --- |
+| Objective clarity | Is the workflow objective concrete enough that success and failure can be distinguished without reading the implementer's intent? | Generic goals, vague "quality" claims, hidden target changes, or multiple objectives collapsed into one score. |
+| Evidence integrity | Does the signal point to durable evidence that can be inspected independently? | Evidence counted by filename alone, stale artifact refs, unchecked command output, or ledger fields trusted without source artifact validation. |
+| Verifier independence | Is the verifier meaningfully separate from the producer of the work or metric? | Self-approval, same prompt producing and approving, reviewer text without changed-file/evidence references, or no current-head freshness check. |
+| Artifact usefulness | Would the artifact help the next operator understand, resume, or audit the run? | Artifacts that restate status only, omit decisions/trade-offs, hide rejected paths, or are too verbose to scan. |
+| Regression protection | Does the change protect previously working behavior and known boundaries? | No targeted regression check, backward-incompatible schema/output changes without a migration issue, or changed defaults with only snapshot updates. |
+| Benchmark robustness | Is the benchmark hard to game while still matching the real objective? | Narrow benchmark fixtures, command-string-only checks, magic constants, optimizing one metric while degrading guardrails, or no failure-mode examples. |
+| Context hygiene | Is the signal derived from the right layer and a bounded context window? | GitHub/PR/Ragna/Discord semantics leaking into RunContract core, stale worker registry fields treated as truth, or product-brand names used as reusable code identifiers. |
+| Human override | Is there an explicit path for the human/project owner to accept, reject, or defer advisory output? | Score presented as authority, no rationale for overrides, hidden hard-blocks, or tool behavior that blocks unrelated ordinary Pi work. |
+| Anti-Goodhart resilience | Does the design check whether optimizing the metric can harm the harness objective? | Incentives to write evidence-shaped prose, make benchmarks narrower, split artifacts for score only, or suppress uncertainty to earn a green status. |
+
+## Advisory Signals vs Completion Authority
+
+RunContract may expose advisory quality signals such as `pass`, `warn`, `fail`, reasons, or compact dimensions. Those signals answer: "What should a supervisor look at next?"
+
+On their own, they must not answer: "Is this workflow complete?"
+
+Completion authority remains with:
+
+- workflow lifecycle and validation rules;
+- required artifacts and freshness checks;
+- concrete command/evidence records;
+- independent verifier or reviewer decisions where the workflow requires them;
+- explicit human approval for design-sensitive or externally visible decisions;
+- adapter-level gates for external systems, such as issue/PR/review/merge state.
+
+A green quality signal without valid completion evidence is only a green advisory signal. A valid completion record with warning quality signals may still be complete, but the warnings should be visible to the supervisor.
+
+## Anti-Goodhart Checks
+
+Before accepting a new score, hint, preset default, or rendered status as useful, ask:
+
+1. **Metric substitution:** What real harness property could this visible metric replace or obscure?
+2. **Easy gaming path:** What is the cheapest way to make this signal green without improving the run?
+3. **Counterexample:** Can a bad run pass this signal? Can a good run fail it for an understandable reason?
+4. **Evidence anchoring:** Which artifact, command, ledger entry, or reviewer decision proves the signal's claim?
+5. **Freshness:** Is the signal attached to the current run/head/artifact version rather than stale state?
+6. **Layer boundary:** Does the signal belong in generic RunContract core, or only in an adapter/reporting layer?
+7. **Override trail:** If a human overrides the advisory signal, is the reason visible for future audits?
+
+### RunContract and preset examples
+
+- A `Completeness: pass` hint is useful only if it points to required criteria and evidence refs; it is weak if it only counts completed checklist items.
+- An `Evidence: pass` hint should validate evidence shape and freshness; it should not accept prose that merely says "tests passed".
+- A preset that improves rendered status labels but makes the underlying artifact less useful is not a harness improvement.
+- A benchmark that rewards fewer warnings can be harmful if it encourages hiding uncertainty or dropping edge-case checks.
+- A GitHub adapter can report review freshness, but RunContract core must not learn GitHub PR or kapi-agent authority semantics.
+
+## Level 0/1/2/3 Design-Decision Sensitivity
+
+Use this scale to decide how much evidence and human review a design change needs.
+
+| Level | Change type | Required sensitivity |
+| --- | --- | --- |
+| Level 0 | Wording, rendering, links, or documentation that does not change behavior or authority. | Keep concise, preserve boundaries, and run the docs checks plus any other relevant checks. |
+| Level 1 | Additive advisory fields, quality reasons, or preset metadata that do not block tools or completion. | Prove old behavior is preserved, document advisory status, add targeted regression checks when code changes. |
+| Level 2 | Changes that affect workflow defaults, completion interpretation, benchmark policy, artifact requirements, or adapter decisions. | Require explicit acceptance criteria, migration/backward-compatibility review, anti-Goodhart examples, and independent verification evidence. |
+| Level 3 | Storage/API/default flips, runtime enforcement, hard gates, bot policy changes, or cross-layer authority changes. | Open a separate design issue before implementation; require human approval, rollback/migration plan, and proof that core/adapters remain correctly separated. |
+
+For #114 Track D, this checklist is a Level 0 governance artifact. Later scoring/module/runtime changes should reference it and classify themselves before implementation.
+
+## PR Review Use
+
+A PR that changes RunContract scoring, preset behavior, or harness governance should include:
+
+- the level classification;
+- which checklist dimensions were affected;
+- at least one anti-Goodhart example or counterexample for Level 1+ scoring/preset changes;
+- exact verification commands or evidence refs;
+- a statement that advisory signals remain separate from completion authority, or a linked design issue authorizing stronger authority.
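
Reviewer sketch (not part of the commit): the separation between advisory signals and completion authority described in the checklist could look roughly like the TypeScript below. Every name here (`AdvisorySignal`, `CompletionRecord`, `isComplete`, `supervisorAttention`) is hypothetical and does not come from the Kapi codebase. The freshness rule, evidence matching the current head, is a simplifying assumption standing in for whatever the real workflow contracts require.

```typescript
// Hypothetical types for illustration only; not Kapi's actual API.
type SignalStatus = "pass" | "warn" | "fail";

interface AdvisorySignal {
  dimension: string; // e.g. "Evidence integrity"
  status: SignalStatus;
  reasons: string[];
}

interface EvidenceRecord {
  ref: string; // pointer to a durable artifact or command record
  head: string; // run/head the evidence was produced against
}

interface CompletionRecord {
  currentHead: string;
  requiredEvidence: string[]; // refs that must exist and be fresh
  evidence: EvidenceRecord[];
  verifierRequired: boolean;
  verifierApproved: boolean;
}

// Completion is decided from evidence and approvals only.
// Advisory signals are deliberately not a parameter of this function.
function isComplete(record: CompletionRecord): boolean {
  const allEvidenceFresh = record.requiredEvidence.every((ref) =>
    record.evidence.some((e) => e.ref === ref && e.head === record.currentHead),
  );
  const verifierOk = !record.verifierRequired || record.verifierApproved;
  return allEvidenceFresh && verifierOk;
}

// Advisory output: what a supervisor should look at next.
// It returns messages, never a completion verdict.
function supervisorAttention(signals: AdvisorySignal[]): string[] {
  return signals
    .filter((s) => s.status !== "pass")
    .map((s) => `${s.dimension}: ${s.status} (${s.reasons.join("; ")})`);
}

// A green signal with stale evidence: still not complete.
const greenSignals: AdvisorySignal[] = [
  { dimension: "Completeness", status: "pass", reasons: [] },
];
const staleRecord: CompletionRecord = {
  currentHead: "head-b",
  requiredEvidence: ["tests"],
  evidence: [{ ref: "tests", head: "head-a" }],
  verifierRequired: true,
  verifierApproved: true,
};

// A valid completion record with a warning: complete, but the warning stays visible.
const warnSignals: AdvisorySignal[] = [
  { dimension: "Artifact usefulness", status: "warn", reasons: ["restates status only"] },
];
const freshRecord: CompletionRecord = {
  ...staleRecord,
  evidence: [{ ref: "tests", head: "head-b" }],
};

console.log(isComplete(staleRecord), supervisorAttention(greenSignals));
console.log(isComplete(freshRecord), supervisorAttention(warnSignals));
```

Keeping the two functions free of each other's inputs is one way to make the boundary reviewable: a change that passes signals into `isComplete` would be a visible authority change, which the sensitivity scale above would classify as Level 2 or higher.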