| 1 | +# RunContract Harness Evaluator and Anti-Goodhart Checklist |
| 2 | + |
| 3 | +This checklist is the evaluation standard for RunContract, preset, and scoring changes. Use it to decide whether a change improves Kapi/Ilchul as a quality harness, not just whether it produces cleaner-looking status output or easier-to-pass scores. |
| 4 | + |
| 5 | +The checklist is advisory governance. It does not add runtime gates, kapi-agent policy, score hard-blocking, or a module/plugin framework. |
| 6 | + |
| 7 | +## Operating Principle |
| 8 | + |
| 9 | +A harness improvement should make true workflow quality easier to see and weak workflow quality harder to hide. |
| 10 | + |
| 11 | +RunContract quality signals may summarize, warn, or steer, but completion authority stays with the underlying workflow contracts, evidence records, verifier reviews, explicit human decisions, and adapter-level review/merge gates. |
| 12 | + |
| 13 | +## Evaluation Checklist |
| 14 | + |
| 15 | +For each RunContract or preset change, review these dimensions before treating a score, hint, or rendered status as useful. |
| 16 | + |
| 17 | +| Dimension | Pass question | Warning signs | |
| 18 | +| --- | --- | --- | |
| 19 | +| Objective clarity | Is the workflow objective concrete enough that success and failure can be distinguished without inferring the implementer's intent? | Generic goals, vague "quality" claims, hidden target changes, or multiple objectives collapsed into one score. | |
| 20 | +| Evidence integrity | Does the signal point to durable evidence that can be inspected independently? | Evidence counted by filename alone, stale artifact refs, unchecked command output, or ledger fields trusted without source artifact validation. | |
| 21 | +| Verifier independence | Is the verifier meaningfully separate from the producer of the work or metric? | Self-approval, same prompt producing and approving, reviewer text without changed-file/evidence references, or no current-head freshness check. | |
| 22 | +| Artifact usefulness | Would the artifact help the next operator understand, resume, or audit the run? | Artifacts that restate status only, omit decisions/trade-offs, hide rejected paths, or are too verbose to scan. | |
| 23 | +| Regression protection | Does the change protect previously working behavior and known boundaries? | No targeted regression check, backward-incompatible schema/output changes without a migration issue, or changed defaults with only snapshot updates. | |
| 24 | +| Benchmark robustness | Is the benchmark hard to game while still matching the real objective? | Narrow benchmark fixtures, command-string-only checks, magic constants, optimizing one metric while degrading guardrails, or no failure-mode examples. | |
| 25 | +| Context hygiene | Is the signal derived from the right layer and a bounded context window? | GitHub/PR/Ragna/Discord semantics leaking into RunContract core, stale worker registry fields treated as truth, or product-brand names used as reusable code identifiers. | |
| 26 | +| Human override | Is there an explicit path for the human/project owner to accept, reject, or defer advisory output? | Score presented as authority, no rationale for overrides, hidden hard-blocks, or tool behavior that blocks unrelated ordinary Pi work. | |
| 27 | +| Anti-Goodhart resilience | Does the design check whether optimizing the metric can harm the harness objective? | Incentives to write evidence-shaped prose, make benchmarks narrower, split artifacts for score only, or suppress uncertainty to earn a green status. | |
| 28 | + |
| 29 | +## Advisory Signals vs Completion Authority |
| 30 | + |
| 31 | +RunContract may expose advisory quality signals such as `pass`, `warn`, `fail`, reasons, or compact dimensions. Those signals answer: "What should a supervisor look at next?" |
| 32 | + |
| 33 | +On their own, they must not answer: "Is this workflow complete?" |
| 34 | + |
| 35 | +Completion authority remains with: |
| 36 | + |
| 37 | +- workflow lifecycle and validation rules; |
| 38 | +- required artifacts and freshness checks; |
| 39 | +- concrete command/evidence records; |
| 40 | +- independent verifier or reviewer decisions where the workflow requires them; |
| 41 | +- explicit human approval for design-sensitive or externally visible decisions; |
| 42 | +- adapter-level gates for external systems, such as issue/PR/review/merge state. |
| 43 | + |
| 44 | +A green quality signal without valid completion evidence is only a green advisory signal. A valid completion record with warning quality signals may still be complete, but the warnings should be visible to the supervisor. |
| 45 | + |
| 46 | +## Anti-Goodhart Checks |
| 47 | + |
| 48 | +Before accepting a new score, hint, preset default, or rendered status as useful, ask: |
| 49 | + |
| 50 | +1. **Metric substitution:** What real harness property could this visible metric replace or obscure? |
| 51 | +2. **Easy gaming path:** What is the cheapest way to make this signal green without improving the run? |
| 52 | +3. **Counterexample:** Can a bad run pass this signal? Can a good run fail it for an understandable reason? |
| 53 | +4. **Evidence anchoring:** Which artifact, command, ledger entry, or reviewer decision proves the signal's claim? |
| 54 | +5. **Freshness:** Is the signal attached to the current run/head/artifact version rather than stale state? |
| 55 | +6. **Layer boundary:** Does the signal belong in generic RunContract core, or only in an adapter/reporting layer? |
| 56 | +7. **Override trail:** If a human overrides the advisory signal, is the reason visible for future audits? |
| 57 | + |
| 58 | +### RunContract and preset examples |
| 59 | + |
| 60 | +- A `Completeness: pass` hint is useful only if it points to required criteria and evidence refs; it is weak if it only counts completed checklist items. |
| 61 | +- An `Evidence: pass` hint should validate evidence shape and freshness; it should not accept prose that merely says "tests passed". |
| 62 | +- A preset that improves rendered status labels but makes the underlying artifact less useful is not a harness improvement. |
| 63 | +- A benchmark that rewards fewer warnings can be harmful if it encourages hiding uncertainty or dropping edge-case checks. |
| 64 | +- A GitHub adapter can report review freshness, but RunContract core must not learn GitHub PR or kapi-agent authority semantics. |
| 65 | + |
| 66 | +## Level 0/1/2/3 Design-Decision Sensitivity |
| 67 | + |
| 68 | +Use this scale to decide how much evidence and human review a design change needs. |
| 69 | + |
| 70 | +| Level | Change type | Required sensitivity | |
| 71 | +| --- | --- | --- | |
| 72 | +| Level 0 | Wording, rendering, links, or documentation that does not change behavior or authority. | Keep concise, preserve boundaries, run docs/relevant checks. | |
| 73 | +| Level 1 | Additive advisory fields, quality reasons, or preset metadata that do not block tools or completion. | Prove old behavior is preserved, document advisory status, add targeted regression checks when code changes. | |
| 74 | +| Level 2 | Changes that affect workflow defaults, completion interpretation, benchmark policy, artifact requirements, or adapter decisions. | Require explicit acceptance criteria, migration/backward-compatibility review, anti-Goodhart examples, and independent verification evidence. | |
| 75 | +| Level 3 | Storage/API/default flips, runtime enforcement, hard gates, bot policy changes, or cross-layer authority changes. | Open a separate design issue before implementation; require human approval, rollback/migration plan, and proof that core/adapters remain correctly separated. | |
| 76 | + |
| 77 | +For #114 Track D, this checklist is a Level 0 governance artifact. Later scoring/module/runtime changes should reference it and classify themselves before implementation. |
| 78 | + |
| 79 | +## PR Review Use |
| 80 | + |
| 81 | +A PR that changes RunContract scoring, preset behavior, or harness governance should include: |
| 82 | + |
| 83 | +- the level classification; |
| 84 | +- which checklist dimensions were affected; |
| 85 | +- at least one anti-Goodhart example or counterexample for Level 1+ scoring/preset changes; |
| 86 | +- exact verification commands or evidence refs; |
| 87 | +- a statement that advisory signals remain separate from completion authority, or a linked design issue authorizing stronger authority. |
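
The separation between advisory signals and completion authority described in the checklist can be sketched in code. This is a minimal illustration under stated assumptions, not RunContract's actual API: `QualitySignal`, `CompletionRecord`, `is_complete`, and `supervisor_view` are hypothetical names, and the boolean fields simply mirror the list of completion authorities in the document. The point it tries to show is structural: the completion check never receives the advisory signals, so a green signal cannot make a run complete and a warning cannot make it incomplete.

```python
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass(frozen=True)
class QualitySignal:
    """Advisory only: tells a supervisor what to look at next (hypothetical type)."""
    dimension: str
    status: Signal
    reason: str = ""


@dataclass(frozen=True)
class CompletionRecord:
    """Hypothetical stand-in for the completion authorities listed in the checklist."""
    lifecycle_valid: bool       # workflow lifecycle and validation rules
    artifacts_fresh: bool       # required artifacts and freshness checks
    evidence_recorded: bool     # concrete command/evidence records
    verifier_approved: bool     # set True when the workflow requires no verifier
    human_approved: bool        # set True when no human approval is required
    adapter_gates_passed: bool  # issue/PR/review/merge state from an adapter


def is_complete(record: CompletionRecord) -> bool:
    # Deliberately takes no QualitySignal argument: advisory signals have no
    # way to influence the completion decision.
    return all((
        record.lifecycle_valid,
        record.artifacts_fresh,
        record.evidence_recorded,
        record.verifier_approved,
        record.human_approved,
        record.adapter_gates_passed,
    ))


def supervisor_view(record: CompletionRecord, signals: list[QualitySignal]) -> dict:
    # Non-pass signals stay visible to the supervisor even when the run is complete.
    attention = [
        f"{s.dimension}: {s.status.value} ({s.reason})"
        for s in signals
        if s.status is not Signal.PASS
    ]
    return {"complete": is_complete(record), "attention": attention}
```

In this sketch, a record with missing evidence stays incomplete even when every signal is `pass`, while a fully valid record with a `warn` signal is reported as complete with the warning listed under `attention`.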