docs: define harness evaluator checklist by devkade · Pull Request #129 · devkade/ilchul

devkade · 2026-05-15T10:22:50Z

Summary

Adds docs/runcontract-harness-evaluator.md as the Track D evaluator and anti-Goodhart checklist for RunContract scoring, presets, and governance changes.
Links the checklist from the README RunContract Harness Boundary and layout sections.
Keeps the checklist explicitly advisory: no runtime gate, no kapi-agent policy change, no score hard-blocking, and no module/plugin framework.

Linked issue

Closes #117

Parent roadmap: #114

Problem

RunContract is gaining richer advisory quality surfaces. Without an evaluator, future scoring and preset work can drift toward visible metric optimization instead of improving Kapi/Ilchul as a reliable harness. Track D needs a concise standard that reviewers can apply before scores or module signals become more influential.

Options considered

README-only checklist
- Pros: highly visible near the RunContract boundary.
- Cons: would bloat the entry document and make the checklist harder to reference from PRs/issues.
Dedicated docs artifact linked from README
- Pros: keeps the standard concise but complete, reachable from the entry doc, and easy for the roadmap (#114) and future PRs to reference.
- Cons: adds one more docs file to maintain.
Runtime validation or score-gate implementation
- Pros: stronger enforcement.
- Cons: explicitly out of scope for #117, and premature before the governance standard is settled.

Selected approach

Selected option: dedicated docs artifact linked from README.

Why: #117 is a documentation/governance slice. A dedicated artifact can cover objective clarity, evidence integrity, verifier independence, artifact usefulness, regression protection, benchmark robustness, context hygiene, human override, anti-Goodhart checks, and Level 0/1/2/3 design sensitivity without expanding runtime authority.

Risks/trade-offs: The checklist only guides future reviews; it does not enforce behavior by itself.

Implementation by file/surface

docs/runcontract-harness-evaluator.md
- Defines the evaluator checklist and anti-Goodhart prompts.
- Separates advisory signals from completion authority.
- Adds Level 0/1/2/3 design-decision sensitivity for later scoring/module/runtime changes.
- Adds PR review expectations for future RunContract/preset changes.
README.md
- Links the evaluator from the RunContract Harness Boundary section.
- Adds the doc to the repository layout list.

Why this fixes it

The new docs artifact gives #114 Track D a concrete standard for judging whether future RunContract changes improve the harness itself. It also preserves the Track A/D boundary by stating that advisory quality signals do not become completion authority and that runtime gates, kapi-agent policy, and score hard-blocking remain out of scope.

QA / Verification

git diff --check — pass
npm ci — pass; installed fresh worktree dependencies after initial npm run verify found no local tsx binary.
npm run verify — pass; 396 tests, 385 pass, 11 skipped; typecheck, unused check, and quality budgets completed. Existing code_smells budget warning remains non-failing under quality:budgets.
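
As a quick sanity check on the reported tally, the counts can be reconciled arithmetically. This is a minimal sketch; `TestTally` and `impliedFailures` are hypothetical names used only for illustration, not part of the repository.

```typescript
// Hypothetical shape for the tally reported by `npm run verify` in this PR.
interface TestTally {
  total: number;
  passed: number;
  skipped: number;
}

// Tests that neither passed nor were skipped. A passing run implies 0.
function impliedFailures(t: TestTally): number {
  return t.total - t.passed - t.skipped;
}

// Numbers taken from the verification line above.
const reported: TestTally = { total: 396, passed: 385, skipped: 11 };
const failures = impliedFailures(reported); // 396 - 385 - 11 = 0
```

A non-zero result would mean the summary line and the pass verdict disagree.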

Anomalies observed

Fresh worktree initially had no node_modules; first npm run verify failed at sh: tsx: command not found. Ran npm ci, then npm run verify passed.
Node emitted existing [DEP0205] module.register() deprecation warnings during tests.
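
The first anomaly comes down to an ordering rule: on a fresh worktree, dependencies must be installed before verification. A minimal sketch of that rule follows; `planVerification` is a hypothetical helper, and the caller is assumed to determine whether a local `tsx` binary exists (for example by checking `node_modules/.bin/tsx`).

```typescript
// Hypothetical helper: returns the verification commands in the order to run them.
// `hasLocalTsx` is supplied by the caller; this function performs no I/O.
function planVerification(hasLocalTsx: boolean): string[] {
  const steps: string[] = ["git diff --check"];
  if (!hasLocalTsx) {
    // Fresh worktree: install dependencies first so `tsx` is available.
    steps.push("npm ci");
  }
  steps.push("npm run verify");
  return steps;
}

const freshWorktree = planVerification(false);
const warmWorktree = planVerification(true);
```

Here `freshWorktree` lists three commands and `warmWorktree` lists two, matching the sequence that eventually passed in this PR.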

Risks / Follow-up

The checklist is advisory only; future Level 1+ scoring/preset PRs still need to cite it and add targeted regression evidence.
Kapi → Ilchul RunContract Harness roadmap #114 can now reference docs/runcontract-harness-evaluator.md as the Track D evaluation standard.
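
The follow-up expectation for Level 1+ PRs could be expressed as a reviewer reminder. The sketch below is illustrative only: `PrSummary` and `missingReviewItems` are hypothetical names, the treatment of Level 0 as having no extra expectations is an assumption, and the function returns advisory reminders rather than blocking anything.

```typescript
type SensitivityLevel = 0 | 1 | 2 | 3;

// Hypothetical summary a reviewer might fill in for a RunContract/preset PR.
interface PrSummary {
  level: SensitivityLevel;
  citesChecklist: boolean;
  hasRegressionEvidence: boolean;
}

// Returns advisory reminders only; it is not a gate and blocks nothing.
function missingReviewItems(pr: PrSummary): string[] {
  // Assumption: Level 0 changes carry no extra expectations.
  if (pr.level === 0) return [];
  const missing: string[] = [];
  if (!pr.citesChecklist) {
    missing.push("cite docs/runcontract-harness-evaluator.md");
  }
  if (!pr.hasRegressionEvidence) {
    missing.push("add targeted regression evidence");
  }
  return missing;
}

const reminders = missingReviewItems({
  level: 2,
  citesChecklist: true,
  hasRegressionEvidence: false,
});
```

In this usage, `reminders` contains a single entry asking for targeted regression evidence.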

kapi-agent review expectations and current-head merge gate

This PR is docs-only and, at 90 changed lines, should stay below the kapi-agent size gate.
Merge readiness still requires current-head kapi-agent approval/check success if the repository rules require it.

devkade · 2026-05-15T10:23:30Z

@kapi-agent review

Revision explanation for current head 998e92a8ef7290fc4a0efe07691c16341fd596d9:

What changed

Added docs/runcontract-harness-evaluator.md as the RunContract harness evaluator and anti-Goodhart checklist for #117 (docs: define Kapi harness evaluator and anti-Goodhart checklist).
Linked the checklist from README.md RunContract boundary and layout sections.

Why this closes the prior feedback / issue

#117 asks for a concise docs artifact that distinguishes advisory quality signals from completion authority and includes anti-Goodhart checks for RunContract/preset work.
This PR is documentation-only and explicitly avoids runtime gates, kapi-agent policy changes, score hard-blocking, and module/plugin framework changes.

Verification

git diff --check — pass
npm ci — pass
npm run verify — pass; 396 tests, 385 pass, 11 skipped; typecheck, unused check, and quality budgets completed.

kapi-agent

kapi-agent review

Verdict: APPROVE

Final approval summary

Review journey

The original goal is to add a Track D evaluator and anti-Goodhart checklist for RunContract scoring, presets, and governance changes. The semantic scope is documentation-only governance: a new checklist document plus README links. There was no prior kapi-agent review on this PR, so no review/fix ping-pong was required.

What changed

Added docs/runcontract-harness-evaluator.md.
Linked it from the RunContract Harness Boundary section in README.md.
Added it to the README repository layout list.

Why this is correct

The new document preserves the existing RunContract boundary: advisory signals can guide supervisors, but completion authority remains with workflow contracts, evidence, verifier/reviewer decisions, human approval, and adapter-level gates. It explicitly avoids runtime gates, kapi-agent policy changes, score hard-blocking, and plugin/module framework expansion.
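
The boundary described above can be illustrated with a small sketch in which the completion decision never reads advisory scores. All type and function names here are hypothetical and do not come from the Ilchul codebase; adapter-level gates are omitted for brevity.

```typescript
// Hypothetical advisory signal: visible to supervisors, never decisive.
interface AdvisorySignal {
  name: string;
  score: number;
}

// Hypothetical inputs that hold completion authority.
interface CompletionEvidence {
  contractSatisfied: boolean; // workflow contract met, backed by evidence
  verifierApproved: boolean; // verifier/reviewer decision
  humanApproved: boolean; // human approval, where required
}

interface RunReview {
  complete: boolean;
  notes: string[];
}

// Completion is derived from evidence alone; signals only become notes.
function reviewRun(
  evidence: CompletionEvidence,
  signals: AdvisorySignal[],
): RunReview {
  const complete =
    evidence.contractSatisfied &&
    evidence.verifierApproved &&
    evidence.humanApproved;
  const notes = signals.map((s) => `advisory: ${s.name} = ${s.score}`);
  return { complete, notes };
}

// A perfect advisory score does not complete a run the verifier rejected.
const rejected = reviewRun(
  { contractSatisfied: true, verifierApproved: false, humanApproved: true },
  [{ name: "artifact usefulness", score: 1.0 }],
);
```

The design point is that `score` is structurally unreachable from the `complete` computation, so raising a visible metric cannot change the outcome.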

Evidence

Verifier gate: PASS — npm ci && npm run verify exited 0.
Size gate: PASS — 90 changed lines, below the 200-line limit.
Revision explanation status: not required; found anyway in the PR body.
Inspected files: README.md, docs/runcontract-harness-evaluator.md.
Local diff check: git diff --check origin/dev...HEAD produced no errors.
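
The size gate evidence reduces to a single comparison. The sketch below assumes a strict less-than check, which is consistent with "90 changed lines < 200" but leaves the behavior at exactly 200 lines as an assumption; `sizeGate` is a hypothetical name, not kapi-agent's actual implementation.

```typescript
type GateStatus = "PASS" | "FAIL";

// Hypothetical size gate: passes when the change is strictly below the limit.
// Assumption: a change of exactly `limit` lines fails.
function sizeGate(changedLines: number, limit: number): GateStatus {
  return changedLines < limit ? "PASS" : "FAIL";
}

// Values reported in this review: 90 changed lines against a 200-line limit.
const status = sizeGate(90, 200); // "PASS"
```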

Remaining risks and approval rationale

Remaining risk is limited to documentation drift: future RunContract/scoring PRs must actually cite and follow this checklist. That is acceptable because this PR intentionally does not add enforcement. Approval is justified because the change is small, docs-only, consistent with the stated boundary, verified, and has no blocking issues.

Blocking issues

None.

Warnings / risks

docs/runcontract-harness-evaluator.md: The checklist is advisory only, so it depends on future reviewers and authors applying it consistently.

Suggestions

For future Level 1+ RunContract scoring or preset PRs, include a short “checklist application” section in the PR body so reviewers can quickly see which evaluator dimensions changed and what anti-Goodhart counterexample was considered.
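
One way such a section might be produced is sketched below. The dimension names are the ones listed in the PR description; the `ChecklistApplication` shape, the `renderChecklistApplication` function, and the output layout are hypothetical.

```typescript
// Evaluator dimensions as listed in the PR description.
const DIMENSIONS = [
  "objective clarity",
  "evidence integrity",
  "verifier independence",
  "artifact usefulness",
  "regression protection",
  "benchmark robustness",
  "context hygiene",
  "human override",
] as const;

type Dimension = (typeof DIMENSIONS)[number];

// Hypothetical input: what changed per dimension, plus the counterexample considered.
interface ChecklistApplication {
  changed: Partial<Record<Dimension, string>>;
  counterexample: string;
}

// Renders a plain-text section suitable for pasting into a PR body.
function renderChecklistApplication(app: ChecklistApplication): string {
  const lines: string[] = ["Checklist application", ""];
  for (const d of DIMENSIONS) {
    lines.push(`- ${d}: ${app.changed[d] ?? "unchanged"}`);
  }
  lines.push(`- anti-Goodhart counterexample: ${app.counterexample}`);
  return lines.join("\n");
}

const section = renderChecklistApplication({
  changed: { "regression protection": "adds a targeted regression test" },
  counterexample: "a run that raises the score without improving the artifact",
});
```

Listing unchanged dimensions explicitly lets a reviewer see at a glance that they were considered rather than overlooked.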

Looks good

The document cleanly separates advisory quality signals from completion authority.
The anti-Goodhart prompts are concrete and reviewable.
The Level 0/1/2/3 sensitivity scale gives future PRs a useful escalation model without changing runtime behavior.
README placement is appropriate and does not bloat the main overview.

Verification notes

Verifier gate status: PASS — npm ci && npm run verify exited 0.
Size gate status: PASS — 90 changed lines < 200.
Revision-explanation status: not required for this first kapi-agent review; PR body includes a clear explanation.
Additional local evidence: inspected the changed files and ran git diff --check origin/dev...HEAD successfully.

Engine: pi

docs: define harness evaluator checklist

998e92a

kapi-agent Bot approved these changes May 15, 2026

devkade merged commit 2c21a08 into dev May 15, 2026
3 of 4 checks passed

This was referenced May 15, 2026

feat: add RunContract quality dimensions #130

Merged

refactor: unify RunContract prompt rendering #131

Merged

devkade merged 1 commit into dev from docs/harness-evaluator-anti-goodhart
