Use this rubric to score any agent response produced with this repository.
- Scope is restated precisely.
- Relevant files and instructions were inspected.
- Assumptions are labelled.
- Evidence exists for every material claim.
- Commands are reported exactly.
- UI, accessibility, security, docs, and release limits are checked when relevant.
- Final status is honest.
- Known limitations are disclosed.
The work is mostly supported, but one low-risk evidence gap, wording issue, or documentation correction remains.
Useful progress exists, but one important behaviour, edge case, command, or review lane is only partially verified.
The answer may be plausible, but it does not provide enough file, command, UI, or manual-review evidence to accept.
The answer claims completion, release quality, accessibility, security, or correctness without evidence.
The answer invents facts, hides failures, removes safeguards, or changes scope without permission.
- Accepted with evidence: only when proof exists.
- Limited acceptance: when progress exists but evidence is incomplete.
- Rejected: when a contract clause fails.
- Blocked: when continued work is unsafe or impossible without missing inputs.
Use only one final status: verified, partially verified, not verified, blocked.
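The single-status rule can be checked mechanically. The following is a minimal sketch, not part of this repository: the names `FinalStatus` and `parse_final_status` are hypothetical, and it assumes the statuses declared in a response have already been extracted into a list of strings.

```python
from enum import Enum


class FinalStatus(Enum):
    """The four final statuses permitted by the rubric."""
    VERIFIED = "verified"
    PARTIALLY_VERIFIED = "partially verified"
    NOT_VERIFIED = "not verified"
    BLOCKED = "blocked"


def parse_final_status(declared: list[str]) -> FinalStatus:
    """Return the single declared status (hypothetical helper).

    Raises ValueError if zero or several statuses were declared,
    or if the declared value is not one of the four allowed ones.
    """
    if len(declared) != 1:
        raise ValueError(
            f"expected exactly one final status, got {len(declared)}"
        )
    # Enum lookup by value raises ValueError for unknown statuses.
    return FinalStatus(declared[0].strip().lower())
```

Normalising case and whitespace is an assumption made here for convenience; a stricter reviewer may prefer to require the exact wording.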
Use these criteria to judge whether an agent output deserves acceptance.
- Observation — Did the agent inspect the real artefacts before acting?
- Decomposition — Did it break complex work into ordered subproblems?
- Branching — Did it compare alternatives when the decision was material?
- Action containment — Did it avoid unnecessary broad rewrites?
- Independent verification — Did it cross-check source, command, behaviour, docs, and limitations?
- Correction loop — Did failed checks become explicit correction requirements?
- Traceability — Can every claim be mapped to requirement, artefact, evidence, and status?
- Failure disclosure — Are missing proof and residual risk visible?
Reject outputs that are fluent but fail observation, verification, or traceability.
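One way to apply this rejection rule is as a gate over per-criterion results. This is a sketch under stated assumptions: `GATING_CRITERIA` and `should_reject` are hypothetical names, the criterion keys are lower-cased labels from the list above, and a criterion that was never assessed is treated as failed.

```python
# Hypothetical keys, taken from the criterion labels in the list above.
GATING_CRITERIA = ("observation", "independent verification", "traceability")


def should_reject(results: dict[str, bool]) -> bool:
    """Return True when any gating criterion failed.

    Assumption: a missing entry counts as a failure, so an output
    cannot pass simply because a criterion was never assessed.
    """
    return any(not results.get(name, False) for name in GATING_CRITERIA)


# Usage: a fluent answer that cannot be traced to evidence is rejected.
example = {
    "observation": True,
    "independent verification": True,
    "traceability": False,
}
print(should_reject(example))
```

The other criteria (decomposition, branching, and so on) are deliberately left out of the gate, since the rule names only these three as grounds for rejection.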