docs: define harness evaluator checklist #129

Merged: devkade merged 1 commit into dev from docs/harness-evaluator-anti-goodhart on May 15, 2026

devkade (owner) commented May 15, 2026

Summary

  • Adds docs/runcontract-harness-evaluator.md as the Track D evaluator and anti-Goodhart checklist for RunContract scoring, presets, and governance changes.
  • Links the checklist from the README RunContract Harness Boundary and layout sections.
  • Keeps the checklist explicitly advisory: no runtime gate, no kapi-agent policy change, no score hard-blocking, and no module/plugin framework.

Linked issue

Closes #117

Parent roadmap: #114

Problem

RunContract is gaining richer advisory quality surfaces. Without an evaluator, future scoring and preset work can drift toward visible metric optimization instead of improving Kapi/Ilchul as a reliable harness. Track D needs a concise standard that reviewers can apply before scores or module signals become more influential.

Options considered

  1. README-only checklist
    • Pros: highly visible near the RunContract boundary.
    • Cons: would bloat the entry document and make the checklist harder to reference from PRs/issues.
  2. Dedicated docs artifact linked from README
  3. Runtime validation or score-gate implementation

Selected approach

Selected option: dedicated docs artifact linked from README.

Why: #117 is a documentation/governance slice. A dedicated artifact can cover objective clarity, evidence integrity, verifier independence, artifact usefulness, regression protection, benchmark robustness, context hygiene, human override, anti-Goodhart checks, and Level 0/1/2/3 design sensitivity without expanding runtime authority.

Risks/trade-offs: The checklist only guides future reviews; it does not enforce behavior by itself.
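
The dimensions listed above could be tracked per PR as a small typed record. This is a minimal sketch under stated assumptions: the names `EVALUATOR_DIMENSIONS`, `ChecklistApplication`, and `unaddressedDimensions` are hypothetical and are not taken from docs/runcontract-harness-evaluator.md, and the meaning of each Level 0/1/2/3 value is left to that document rather than modelled here.

```typescript
// Hypothetical sketch: none of these names come from the repository.
// The dimension labels mirror the list in this PR description.
const EVALUATOR_DIMENSIONS = [
  "objective clarity",
  "evidence integrity",
  "verifier independence",
  "artifact usefulness",
  "regression protection",
  "benchmark robustness",
  "context hygiene",
  "human override",
  "anti-Goodhart checks",
] as const;

type EvaluatorDimension = (typeof EVALUATOR_DIMENSIONS)[number];

// Level 0/1/2/3 design sensitivity; the semantics live in the checklist doc.
type SensitivityLevel = 0 | 1 | 2 | 3;

interface ChecklistEntry {
  dimension: EvaluatorDimension;
  note: string; // reviewer's free-text finding
}

interface ChecklistApplication {
  level: SensitivityLevel;
  entries: ChecklistEntry[];
}

// Returns the dimensions the reviewer has not commented on yet.
// Advisory only: the result is a reminder, never a merge gate.
function unaddressedDimensions(app: ChecklistApplication): EvaluatorDimension[] {
  const seen = new Set<EvaluatorDimension>(app.entries.map((e) => e.dimension));
  return EVALUATOR_DIMENSIONS.filter((d) => !seen.has(d));
}
```

A reviewer could call `unaddressedDimensions` on a draft review to see what is still open. Keeping the function free of side effects matches the advisory-only intent of the checklist.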

Implementation by file/surface

  • docs/runcontract-harness-evaluator.md
    • Defines the evaluator checklist and anti-Goodhart prompts.
    • Separates advisory signals from completion authority.
    • Adds Level 0/1/2/3 design-decision sensitivity for later scoring/module/runtime changes.
    • Adds PR review expectations for future RunContract/preset changes.
  • README.md
    • Links the evaluator from the RunContract Harness Boundary section.
    • Adds the doc to the repository layout list.

Why this fixes it

The new docs artifact gives #114 Track D a concrete standard for judging whether future RunContract changes improve the harness itself. It also preserves the Track A/D boundary by stating that advisory quality signals do not become completion authority and that runtime gates, kapi-agent policy, and score hard-blocking remain out of scope.

QA / Verification

  • git diff --check — pass
  • npm ci — pass; installed fresh worktree dependencies after initial npm run verify found no local tsx binary.
  • npm run verify — pass; 396 tests, 385 pass, 11 skipped; typecheck, unused check, and quality budgets completed. Existing code_smells budget warning remains non-failing under quality:budgets.

Anomalies observed

  • Fresh worktree initially had no node_modules; first npm run verify failed at sh: tsx: command not found. Ran npm ci, then npm run verify passed.
  • Node emitted existing [DEP0205] module.register() deprecation warnings during tests.

Risks / Follow-up

  • The checklist is advisory only; future Level 1+ scoring/preset PRs still need to cite it and add targeted regression evidence.
  • Kapi → Ilchul RunContract Harness roadmap #114 can now reference docs/runcontract-harness-evaluator.md as the Track D evaluation standard.

kapi-agent review expectations and current-head merge gate

  • This PR is docs-only at 90 changed lines, which should keep it below the kapi-agent size gate.
  • Merge readiness still requires current-head kapi-agent approval/check success if the repository rules require it.

devkade commented May 15, 2026

@kapi-agent review

Revision explanation for current head 998e92a8ef7290fc4a0efe07691c16341fd596d9:

What changed

Why this closes the prior feedback / issue

  • Issue #117 (docs: define Kapi harness evaluator and anti-Goodhart checklist) asks for a concise docs artifact that distinguishes advisory quality signals from completion authority and includes anti-Goodhart checks for RunContract/preset work.
  • This PR is documentation-only and explicitly avoids runtime gates, kapi-agent policy changes, score hard-blocking, and module/plugin framework changes.

Verification

  • git diff --check — pass
  • npm ci — pass
  • npm run verify — pass; 396 tests, 385 pass, 11 skipped; typecheck, unused check, and quality budgets completed.

kapi-agent (bot) left a comment

kapi-agent review

Verdict: APPROVE

Final approval summary

Review journey

The original goal is to add a Track D evaluator and anti-Goodhart checklist for RunContract scoring, presets, and governance changes. The semantic scope is documentation-only governance: a new checklist document plus README links. There was no prior kapi-agent review on this PR, so no review/fix ping-pong was required.

What changed

  • Added docs/runcontract-harness-evaluator.md.
  • Linked it from the RunContract Harness Boundary section in README.md.
  • Added it to the README repository layout list.

Why this is correct

The new document preserves the existing RunContract boundary: advisory signals can guide supervisors, but completion authority remains with workflow contracts, evidence, verifier/reviewer decisions, human approval, and adapter-level gates. It explicitly avoids runtime gates, kapi-agent policy changes, score hard-blocking, and plugin/module framework expansion.
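
The boundary described above can be illustrated with types that keep advisory signals out of the completion decision. This is a sketch only: `AuthorityDecision`, `AdvisorySignal`, and `decideCompletion` are hypothetical names, and the rule that every recorded authority decision must approve is an assumption made for illustration, not something the checklist or the runtime is known to specify.

```typescript
// Hypothetical sketch: type and function names are illustrative only.
// The authority sources mirror the list in the review text above.
type AuthoritySource =
  | "workflow contract"
  | "evidence"
  | "verifier/reviewer decision"
  | "human approval"
  | "adapter-level gate";

interface AuthorityDecision {
  source: AuthoritySource;
  approved: boolean;
}

interface AdvisorySignal {
  name: string;
  score: number; // shown to supervisors, never consulted for completion
}

interface CompletionResult {
  complete: boolean;
  advisories: AdvisorySignal[]; // passed through for display only
}

// Completion depends solely on authority decisions (assumed rule: at least
// one decision exists and all of them approve). Advisory signals are carried
// along untouched, so a low score cannot hard-block a run.
function decideCompletion(
  decisions: AuthorityDecision[],
  advisories: AdvisorySignal[],
): CompletionResult {
  const complete = decisions.length > 0 && decisions.every((d) => d.approved);
  return { complete, advisories };
}
```

Because `decideCompletion` never reads `score`, a run with approved authority decisions completes even when an advisory score is zero, and a high score cannot rescue a rejected one.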

Evidence

  • Verifier gate: PASS — npm ci && npm run verify exited 0.
  • Size gate: PASS — 90 changed lines, below the 200-line limit.
  • Revision explanation status: not required for a first review, but one is present in the PR body.
  • Inspected files: README.md, docs/runcontract-harness-evaluator.md.
  • Local diff check: git diff --check origin/dev...HEAD produced no errors.
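
The two gate results in the evidence list reduce to simple comparisons. The sketch below is illustrative and not the kapi-agent implementation: `GateResult`, `sizeGate`, and `verifierGate` are hypothetical names, and treating the size limit as strictly exclusive is an assumption based on the "90 changed lines < 200" wording in this review.

```typescript
// Hypothetical sketch of the two gates reported in the evidence list.
interface GateResult {
  gate: string;
  pass: boolean;
  detail: string;
}

// Passes when the change is strictly below the limit (assumed semantics).
function sizeGate(changedLines: number, limit: number): GateResult {
  return {
    gate: "size",
    pass: changedLines < limit,
    detail: `${changedLines} changed lines, limit ${limit}`,
  };
}

// Passes when the verification command exited with status 0.
function verifierGate(exitCode: number): GateResult {
  return {
    gate: "verifier",
    pass: exitCode === 0,
    detail: `exit code ${exitCode}`,
  };
}
```

With the figures from this review, `sizeGate(90, 200)` and `verifierGate(0)` both pass.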

Remaining risks and approval rationale

Remaining risk is limited to documentation drift: future RunContract/scoring PRs must actually cite and follow this checklist. That is acceptable because this PR intentionally does not add enforcement. Approval is justified because the change is small, docs-only, consistent with the stated boundary, verified, and has no blocking issues.

Blocking issues

None.

Warnings / risks

  • docs/runcontract-harness-evaluator.md: The checklist is advisory only, so it depends on future reviewers and authors applying it consistently.

Suggestions

  • For future Level 1+ RunContract scoring or preset PRs, include a short “checklist application” section in the PR body so reviewers can quickly see which evaluator dimensions changed and what anti-Goodhart counterexample was considered.
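
One way to make the suggested “checklist application” section cheap to write is to render it from structured input. This is a rough sketch: `DimensionChange` and `renderChecklistApplication` are hypothetical names, and the output layout is an assumption rather than a format the checklist prescribes.

```typescript
// Hypothetical sketch: renders a plain-text PR body section.
interface DimensionChange {
  dimension: string; // evaluator dimension touched by the PR
  change: string; // short description of what changed
}

function renderChecklistApplication(
  changes: DimensionChange[],
  counterexample: string,
): string {
  const lines: string[] = ["Checklist application", ""];
  for (const c of changes) {
    lines.push(`  • ${c.dimension}: ${c.change}`);
  }
  // The counterexample is the anti-Goodhart case the author considered.
  lines.push("", `Anti-Goodhart counterexample considered: ${counterexample}`);
  return lines.join("\n");
}
```

The returned string can be pasted into a PR description, and reviewers then see the affected dimensions and the counterexample in one place.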

Looks good

  • The document cleanly separates advisory quality signals from completion authority.
  • The anti-Goodhart prompts are concrete and reviewable.
  • The Level 0/1/2/3 sensitivity scale gives future PRs a useful escalation model without changing runtime behavior.
  • README placement is appropriate and does not bloat the main overview.

Verification notes

  • Verifier gate status: PASS — npm ci && npm run verify exited 0.
  • Size gate status: PASS — 90 changed lines < 200.
  • Revision-explanation status: not required for this first kapi-agent review; PR body includes a clear explanation.
  • Additional local evidence: inspected the changed files and ran git diff --check origin/dev...HEAD successfully.

Engine: pi

devkade merged commit 2c21a08 into dev on May 15, 2026
3 of 4 checks passed