RunContract Harness Evaluator and Anti-Goodhart Checklist

This checklist is the evaluation standard for RunContract, preset, and scoring changes. Use it to decide whether a change improves Kapi/Ilchul as a quality harness, not just whether it produces cleaner-looking status output or easier-to-pass scores.

The checklist is advisory governance. It does not add runtime gates, kapi-agent policy, score hard-blocking, or a module/plugin framework.

Operating Principle

A harness improvement should make true workflow quality easier to see and weak workflow quality harder to hide.

RunContract quality signals may summarize, warn, or steer, but completion authority stays with the underlying workflow contracts, evidence records, verifier reviews, explicit human decisions, and adapter-level review/merge gates.

Evaluation Checklist

For each RunContract or preset change, review these dimensions before treating a score, hint, or rendered status as useful.

Dimension	Pass question	Warning signs
Objective clarity	Is the workflow objective concrete enough that success and failure can be distinguished without reading the implementer's intent?	Generic goals, vague "quality" claims, hidden target changes, or multiple objectives collapsed into one score.
Evidence integrity	Does the signal point to durable evidence that can be inspected independently?	Evidence counted by filename alone, stale artifact refs, unchecked command output, or ledger fields trusted without source artifact validation.
Verifier independence	Is the verifier meaningfully separate from the producer of the work or metric?	Self-approval, same prompt producing and approving, reviewer text without changed-file/evidence references, or no current-head freshness check.
Artifact usefulness	Would the artifact help the next operator understand, resume, or audit the run?	Artifacts that restate status only, omit decisions/trade-offs, hide rejected paths, or are too verbose to scan.
Regression protection	Does the change protect previously working behavior and known boundaries?	No targeted regression check, backward-incompatible schema/output changes without a migration issue, or changed defaults with only snapshot updates.
Benchmark robustness	Is the benchmark hard to game while still matching the real objective?	Narrow benchmark fixtures, command-string-only checks, magic constants, optimizing one metric while degrading guardrails, or no failure-mode examples.
Context hygiene	Is the signal derived from the right layer and a bounded context window?	GitHub/PR/Ragna/Discord semantics leaking into RunContract core, stale worker registry fields treated as truth, or product-brand names used as reusable code identifiers.
Human override	Is there an explicit path for the human/project owner to accept, reject, or defer advisory output?	Score presented as authority, no rationale for overrides, hidden hard-blocks, or tool behavior that blocks unrelated ordinary Pi work.
Anti-Goodhart resilience	Does the design check whether optimizing the metric can harm the harness objective?	Incentives to write evidence-shaped prose, make benchmarks narrower, split artifacts for score only, or suppress uncertainty to earn a green status.

Advisory Signals vs Completion Authority

RunContract may expose advisory quality signals such as pass, warn, fail, reasons, or compact dimensions. Those signals answer: "What should a supervisor look at next?"

By themselves, they must not answer: "Is this workflow complete?"

Completion authority remains with:

workflow lifecycle and validation rules;
required artifacts and freshness checks;
concrete command/evidence records;
independent verifier or reviewer decisions where the workflow requires them;
explicit human approval for design-sensitive or externally visible decisions;
adapter-level gates for external systems, such as issue/PR/review/merge state.

A green quality signal without valid completion evidence is only a green advisory signal. A valid completion record with warning quality signals may still be complete, but the warnings should be visible to the supervisor.
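The separation described above can be sketched in code. This is a minimal illustration, not RunContract's actual data model: `AdvisoryQuality`, `CompletionRecord`, `summarize`, and every field name are hypothetical. The point it tries to show is that the completion decision is computed without ever reading the advisory signal, while the signal and its reasons stay visible in the rendered summary.

```python
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass(frozen=True)
class AdvisoryQuality:
    """Advisory output only: what should a supervisor look at next?"""
    signal: Signal
    reasons: tuple = ()


@dataclass(frozen=True)
class CompletionRecord:
    """Hypothetical summary of the authoritative checks.

    Each flag means "satisfied, or not required by this workflow".
    """
    lifecycle_valid: bool
    artifacts_fresh: bool
    evidence_recorded: bool
    verifier_ok: bool
    human_ok: bool
    adapter_gates_ok: bool

    def is_complete(self) -> bool:
        return all((
            self.lifecycle_valid,
            self.artifacts_fresh,
            self.evidence_recorded,
            self.verifier_ok,
            self.human_ok,
            self.adapter_gates_ok,
        ))


def summarize(record: CompletionRecord, quality: AdvisoryQuality):
    """Return (complete, text); completion never depends on quality."""
    complete = record.is_complete()
    text = "complete" if complete else "incomplete"
    # The advisory signal is rendered next to the decision, not folded into it.
    text += f"; advisory={quality.signal.value}"
    if quality.reasons:
        text += " (" + "; ".join(quality.reasons) + ")"
    return complete, text
```

Under these assumptions, a green signal paired with missing evidence still reports incomplete, and a complete record paired with a warning stays complete while the warning text remains in the summary.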

Anti-Goodhart Checks

Before accepting a new score, hint, preset default, or rendered status as useful, ask:

Metric substitution: What real harness property could this visible metric replace or obscure?
Easy gaming path: What is the cheapest way to make this signal green without improving the run?
Counterexample: Can a bad run pass this signal? Can a good run fail it for an understandable reason?
Evidence anchoring: Which artifact, command, ledger entry, or reviewer decision proves the signal's claim?
Freshness: Is the signal attached to the current run/head/artifact version rather than stale state?
Layer boundary: Does the signal belong in generic RunContract core, or only in an adapter/reporting layer?
Override trail: If a human overrides the advisory signal, is the reason visible for future audits?
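The easy-gaming and counterexample questions can be exercised against a handful of labelled runs. The sketch below is one possible way to do that and assumes a hypothetical run shape (a dict with a `checklist` list); `find_counterexamples` and both example signals are illustrative names, not part of RunContract.

```python
from typing import Callable, Iterable, List, Tuple

Run = dict  # hypothetical run shape: {"checklist": [{"checked": bool, ...}]}


def find_counterexamples(
    signal: Callable[[Run], bool],
    labelled_runs: Iterable[Tuple[str, Run, bool]],
) -> Tuple[List[str], List[str]]:
    """Return (bad runs that pass, good runs that fail) for a signal."""
    bad_passes, good_fails = [], []
    for name, run, is_good in labelled_runs:
        passed = signal(run)
        if passed and not is_good:
            bad_passes.append(name)
        elif not passed and is_good:
            good_fails.append(name)
    return bad_passes, good_fails


def counts_checked_items(run: Run) -> bool:
    # Weak signal: green as soon as every checklist item is ticked.
    return all(item["checked"] for item in run["checklist"])


def requires_evidence_refs(run: Run) -> bool:
    # Stronger signal: every ticked item must also carry an evidence ref.
    return all(
        item["checked"] and bool(item.get("evidence_ref"))
        for item in run["checklist"]
    )


LABELLED_RUNS = [
    ("ticked-without-evidence", {"checklist": [{"checked": True}]}, False),
    (
        "ticked-with-evidence",
        {"checklist": [{"checked": True, "evidence_ref": "logs/test-run.txt"}]},
        True,
    ),
]
```

With these two fixtures, the counting signal lets the bad run through, while the evidence-anchored signal produces no counterexamples. A real review would need more fixtures, including good runs that fail for an understandable reason.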

RunContract and Preset Examples

A "Completeness: pass" hint is useful only if it points to required criteria and evidence refs; it is weak if it only counts completed checklist items.
An "Evidence: pass" hint should validate evidence shape and freshness; it should not accept prose that merely says "tests passed".
A preset that improves rendered status labels but makes the underlying artifact less useful is not a harness improvement.
A benchmark that rewards fewer warnings can be harmful if it encourages hiding uncertainty or dropping edge-case checks.
A GitHub adapter can report review freshness, but RunContract core must not learn GitHub PR or kapi-agent authority semantics.
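As a rough illustration of the evidence example above, the following sketch checks shape and freshness instead of trusting prose. The `EvidenceRecord` fields and the `evidence_hint` function are assumptions made for this example; the real evidence schema may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EvidenceRecord:
    """Hypothetical evidence entry; field names are illustrative."""
    command: Optional[str] = None
    exit_code: Optional[int] = None
    head_sha: Optional[str] = None
    note: str = ""  # free-form prose, deliberately ignored by the hint


def evidence_hint(record: EvidenceRecord, current_head: str) -> Tuple[str, List[str]]:
    """Return an advisory ("pass" or "warn") plus the reasons behind it."""
    reasons: List[str] = []
    # Shape: a concrete command and its exit code must be recorded.
    if not record.command:
        reasons.append("no command recorded")
    if record.exit_code is None:
        reasons.append("no exit code recorded")
    elif record.exit_code != 0:
        reasons.append(f"command exited with {record.exit_code}")
    # Freshness: the evidence must belong to the head under review.
    if record.head_sha != current_head:
        reasons.append("evidence is not for the current head")
    return ("pass" if not reasons else "warn", reasons)
```

A record whose only content is the note "tests passed" earns a warning with three reasons, because the note is never read. The result is still advisory; it does not decide completion.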

Level 0/1/2/3 Design-Decision Sensitivity

Use this scale to decide how much evidence and human review a design change needs.

Level	Change type	Required sensitivity
Level 0	Wording, rendering, links, or documentation that does not change behavior or authority.	Keep concise, preserve boundaries, run docs/relevant checks.
Level 1	Additive advisory fields, quality reasons, or preset metadata that do not block tools or completion.	Prove old behavior is preserved, document advisory status, add targeted regression checks when code changes.
Level 2	Changes that affect workflow defaults, completion interpretation, benchmark policy, artifact requirements, or adapter decisions.	Require explicit acceptance criteria, migration/backward-compatibility review, anti-Goodhart examples, and independent verification evidence.
Level 3	Storage/API/default flips, runtime enforcement, hard gates, bot policy changes, or cross-layer authority changes.	Open a separate design issue before implementation; require human approval, rollback/migration plan, and proof that core/adapters remain correctly separated.

For #114 Track D, this checklist is a Level 0 governance artifact. Later scoring/module/runtime changes should reference it and classify themselves before implementation.
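One way to make self-classification repeatable is a small lookup from change traits to levels. The trait names below paraphrase the table and are hypothetical identifiers. Taking the highest level among the traits is an assumption of this sketch, since the table does not say how mixed changes combine.

```python
# Hypothetical trait names paraphrasing the sensitivity table.
LEVEL_BY_TRAIT = {
    "wording": 0,
    "rendering": 0,
    "links": 0,
    "documentation": 0,
    "advisory_field": 1,
    "quality_reason": 1,
    "preset_metadata": 1,
    "workflow_default": 2,
    "completion_interpretation": 2,
    "benchmark_policy": 2,
    "artifact_requirement": 2,
    "adapter_decision": 2,
    "storage_api_or_default_flip": 3,
    "runtime_enforcement": 3,
    "hard_gate": 3,
    "bot_policy": 3,
    "cross_layer_authority": 3,
}


def classify(traits):
    """Return the sensitivity level for a change described by its traits.

    Assumption: the most sensitive trait decides the level.
    """
    traits = list(traits)
    if not traits:
        raise ValueError("describe the change with at least one trait")
    unknown = sorted(set(traits) - set(LEVEL_BY_TRAIT))
    if unknown:
        # Refuse to guess: an unrecognized trait needs a human decision.
        raise ValueError(f"unclassified traits: {unknown}")
    return max(LEVEL_BY_TRAIT[t] for t in traits)
```

For example, `classify(["documentation"])` gives 0, and a change that mixes `rendering` with `hard_gate` is treated as Level 3 under the highest-trait assumption.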

PR Review Use

A PR that changes RunContract scoring, preset behavior, or harness governance should include:

the level classification;
which checklist dimensions were affected;
at least one anti-Goodhart example or counterexample for Level 1+ scoring/preset changes;
exact verification commands or evidence refs;
a statement that advisory signals remain separate from completion authority, or a linked design issue authorizing stronger authority.
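The list above could be checked mechanically before a human review. The sketch below assumes a hypothetical `PrGovernanceNote` structure filled in by the PR author; it reports what is missing and blocks nothing. For simplicity it treats every Level 1+ PR as a scoring/preset change and expects at least one entry under affected dimensions, both of which are assumptions of this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PrGovernanceNote:
    """Hypothetical summary of what a PR description declares."""
    level: Optional[int] = None
    dimensions_affected: List[str] = field(default_factory=list)
    anti_goodhart_examples: List[str] = field(default_factory=list)
    verification_refs: List[str] = field(default_factory=list)
    states_advisory_separation: bool = False
    design_issue: Optional[str] = None  # link authorizing stronger authority


def missing_items(note: PrGovernanceNote) -> List[str]:
    """List the review items the note does not yet cover (advisory only)."""
    missing = []
    level_ok = note.level in (0, 1, 2, 3)
    if not level_ok:
        missing.append("level classification")
    if not note.dimensions_affected:
        missing.append("affected checklist dimensions")
    # Assumption: any Level 1+ PR is treated as a scoring/preset change.
    if level_ok and note.level >= 1 and not note.anti_goodhart_examples:
        missing.append("anti-Goodhart example or counterexample")
    if not note.verification_refs:
        missing.append("verification commands or evidence refs")
    if not (note.states_advisory_separation or note.design_issue):
        missing.append("advisory-separation statement or linked design issue")
    return missing
```

An empty result means the description covers the listed items; it says nothing about whether the content of each item is convincing, which remains a reviewer judgment.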
