docs: define harness evaluator checklist #129
@kapi-agent review
Revision explanation for current head
What changed
Why this closes the prior feedback / issue
Verification
kapi-agent review
Verdict: APPROVE
Final approval summary
Review journey
The original goal is to add a Track D evaluator and anti-Goodhart checklist for RunContract scoring, presets, and governance changes. The semantic scope is documentation-only governance: a new checklist document plus README links. There was no prior kapi-agent review on this PR, so no review/fix ping-pong was required.
What changed
- Added `docs/runcontract-harness-evaluator.md`.
- Linked it from the RunContract Harness Boundary section in `README.md`.
- Added it to the README repository layout list.
Why this is correct
The new document preserves the existing RunContract boundary: advisory signals can guide supervisors, but completion authority remains with workflow contracts, evidence, verifier/reviewer decisions, human approval, and adapter-level gates. It explicitly avoids runtime gates, kapi-agent policy changes, score hard-blocking, and plugin/module framework expansion.
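The boundary described above can be illustrated with a minimal sketch. All type and function names here are hypothetical and are not taken from the RunContract codebase; the point is only that advisory signals are passed along for supervisors to read while the completion outcome is computed from authority decisions alone.

```typescript
// Hypothetical types for illustration; none of these names come from the repository.
type AdvisorySignal = { name: string; score: number; note?: string };

type AuthorityDecision = {
  source:
    | "workflow-contract"
    | "evidence"
    | "verifier"
    | "reviewer"
    | "human-approval"
    | "adapter-gate";
  approved: boolean;
};

type CompletionResult = { complete: boolean; advisories: AdvisorySignal[] };

// Completion depends only on authority decisions. Advisory signals are
// returned unchanged so a supervisor can inspect them, but they never
// influence the `complete` flag.
function decideCompletion(
  decisions: AuthorityDecision[],
  advisories: AdvisorySignal[],
): CompletionResult {
  const complete = decisions.length > 0 && decisions.every((d) => d.approved);
  return { complete, advisories };
}

// A low advisory score does not block a run that every authority approved.
const withLowScore = decideCompletion(
  [
    { source: "verifier", approved: true },
    { source: "human-approval", approved: true },
  ],
  [{ name: "artifact-usefulness", score: 0.1 }],
);

// A perfect advisory score does not rescue a run the verifier rejected.
const withRejection = decideCompletion(
  [{ source: "verifier", approved: false }],
  [{ name: "artifact-usefulness", score: 1.0 }],
);

console.log(withLowScore.complete, withRejection.complete);
```

Keeping the score out of the function that computes `complete` is one way to make "no score hard-blocking" a structural property rather than a convention.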
Evidence
- Verifier gate: PASS; `npm ci && npm run verify` exited 0.
- Size gate: PASS; 90 changed lines, below the 200-line limit.
- Revision explanation status: not required; found anyway in the PR body.
- Inspected files: `README.md`, `docs/runcontract-harness-evaluator.md`.
- Local diff check: `git diff --check origin/dev...HEAD` produced no errors.
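For readers unfamiliar with the size gate, the comparison it reports can be sketched as follows. The helper name and result shape are assumptions made for illustration; the actual gate implementation is not part of this PR.

```typescript
// Hypothetical helper; the real size gate is not shown in this PR.
type SizeGateResult = {
  status: "PASS" | "FAIL";
  changedLines: number;
  limit: number;
};

function checkSizeGate(changedLines: number, limit = 200): SizeGateResult {
  // "Below the 200-line limit" is read here as a strict inequality.
  const status = changedLines < limit ? "PASS" : "FAIL";
  return { status, changedLines, limit };
}

const gate = checkSizeGate(90);
console.log(
  `Size gate: ${gate.status} (${gate.changedLines} changed lines, limit ${gate.limit})`,
);
```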
Remaining risks and approval rationale
Remaining risk is limited to documentation drift: future RunContract/scoring PRs must actually cite and follow this checklist. That is acceptable because this PR intentionally does not add enforcement. Approval is justified because the change is small, docs-only, consistent with the stated boundary, verified, and has no blocking issues.
Blocking issues
None.
Warnings / risks
- `docs/runcontract-harness-evaluator.md`: The checklist is advisory only, so it depends on future reviewers and authors applying it consistently.
Suggestions
- For future Level 1+ RunContract scoring or preset PRs, include a short “checklist application” section in the PR body so reviewers can quickly see which evaluator dimensions changed and what anti-Goodhart counterexample was considered.
Looks good
- The document cleanly separates advisory quality signals from completion authority.
- The anti-Goodhart prompts are concrete and reviewable.
- The Level 0/1/2/3 sensitivity scale gives future PRs a useful escalation model without changing runtime behavior.
- README placement is appropriate and does not bloat the main overview.
Verification notes
- Verifier gate status: PASS; `npm ci && npm run verify` exited 0.
- Size gate status: PASS; 90 changed lines < 200.
- Revision-explanation status: not required for this first kapi-agent review; PR body includes a clear explanation.
- Additional local evidence: inspected the changed files and ran `git diff --check origin/dev...HEAD` successfully.
Engine: pi
Summary
- `docs/runcontract-harness-evaluator.md` as the Track D evaluator and anti-Goodhart checklist for RunContract scoring, presets, and governance changes.
Linked issue
Closes #117
Parent roadmap: #114
Problem
RunContract is gaining richer advisory quality surfaces. Without an evaluator, future scoring and preset work can drift toward visible metric optimization instead of improving Kapi/Ilchul as a reliable harness. Track D needs a concise standard that reviewers can apply before scores or module signals become more influential.
Options considered
Selected approach
Selected option: dedicated docs artifact linked from README.
Why: #117 is a documentation/governance slice. A dedicated artifact can cover objective clarity, evidence integrity, verifier independence, artifact usefulness, regression protection, benchmark robustness, context hygiene, human override, anti-Goodhart checks, and Level 0/1/2/3 design sensitivity without expanding runtime authority.
Risks/trade-offs: The checklist only guides future reviews; it does not enforce behavior by itself.
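As a rough illustration of how a PR author might record a checklist application, the sketch below encodes the evaluator dimensions named above as a typed record. The dimension names come from this description; the record shape, the `reviewPrompts` helper, and the rule that Level 1+ changes should name a counterexample are assumptions for illustration, not the contents of the checklist document. The helper only returns prompts for a human reviewer, in keeping with the advisory-only scope.

```typescript
// Dimension names are taken from the PR description; everything else is hypothetical.
const EVALUATOR_DIMENSIONS = [
  "objective clarity",
  "evidence integrity",
  "verifier independence",
  "artifact usefulness",
  "regression protection",
  "benchmark robustness",
  "context hygiene",
  "human override",
] as const;

type Dimension = (typeof EVALUATOR_DIMENSIONS)[number];
type SensitivityLevel = 0 | 1 | 2 | 3;

interface ChecklistApplication {
  level: SensitivityLevel;
  // Dimensions the author says the change touches, each with a short note.
  dimensions: Partial<Record<Dimension, string>>;
  // The anti-Goodhart counterexample the author considered, if any.
  counterexample?: string;
}

// Returns prompts for a human reviewer. Nothing here blocks a merge.
function reviewPrompts(app: ChecklistApplication): string[] {
  const prompts: string[] = [];
  if (Object.keys(app.dimensions).length === 0) {
    prompts.push("No evaluator dimensions listed.");
  }
  if (app.level >= 1 && !app.counterexample) {
    prompts.push("Level 1+ change without an anti-Goodhart counterexample.");
  }
  return prompts;
}

const docsOnly: ChecklistApplication = {
  level: 0,
  dimensions: { "context hygiene": "Adds a checklist document; no runtime change." },
};

console.log(reviewPrompts(docsOnly));
```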
Implementation by file/surface
- `docs/runcontract-harness-evaluator.md`
- `README.md`
Why this fixes it
The new docs artifact gives #114 Track D a concrete standard for judging whether future RunContract changes improve the harness itself. It also preserves the Track A/D boundary by stating that advisory quality signals do not become completion authority and that runtime gates, kapi-agent policy, and score hard-blocking remain out of scope.
QA / Verification
- `git diff --check`: pass
- `npm ci`: pass; installed fresh worktree dependencies after initial `npm run verify` found no local `tsx` binary.
- `npm run verify`: pass; 396 tests, 385 pass, 11 skipped; typecheck, unused check, and quality budgets completed. Existing `code_smells` budget warning remains non-failing under `quality:budgets`.
Anomalies observed
- `node_modules`; first `npm run verify` failed at `sh: tsx: command not found`. Ran `npm ci`, then `npm run verify` passed.
- `[DEP0205] module.register()` deprecation warnings during tests.
Risks / Follow-up
- `docs/runcontract-harness-evaluator.md` as the Track D evaluation standard.
kapi-agent review expectations and current-head merge gate