# Phase 4 — Mechanical Verification: Finalized Scope
- Target repo: `repository-harness` (feature branch off `main`)
- Validation: `harness-benchmark` re-run after implementation
- Current harness maturity: H3 partial (Phase 3 active observability complete)
- Target maturity: H3 (full) → H4 (partial: story verification, auto-scoring, pre-close gate)
The first Phase 4 benchmark showed one isolated compliance miss and several command-shape friction loops:
- T4 authentication included decision text in the trace but did not create a
  durable decision record. High-risk work that changes auth, authorization,
  data ownership, API behavior, architecture, or validation must add a
  `docs/decisions/NNNN-*.md` record and a durable `decision` row with
  `scripts/bin/harness-cli decision add`. Trace `--decisions` is evidence, not
  the decision log.
- Rust CLI proof flags require numeric booleans. Use
  `--unit 1 --integration 1 --e2e 0 --platform 0`; do not use `yes` or `no`.
- `story verify <id>` runs the story's configured `verify_command` and records
  pass/fail. It accepts only the story id. Proof flags belong to `story update`.
- Agents should prefer the command examples in `docs/HARNESS.md` and
  `scripts/README.md` before repeated help probing. Re-run help only when the
  command shape is still unknown.
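
The numeric-boolean rule for proof flags can be illustrated with a small parser. This is a minimal sketch, not the harness's actual argument handling: the function name `parse_proof_flag` and the error wording are assumptions made for illustration.

```rust
/// Parse a proof flag value as described above: only "1" and "0" are
/// accepted, so "yes" and "no" are rejected.
/// Hypothetical helper; not taken from the harness source.
fn parse_proof_flag(name: &str, raw: &str) -> Result<bool, String> {
    match raw.trim() {
        "1" => Ok(true),
        "0" => Ok(false),
        other => Err(format!("--{name} expects 1 or 0, got {other:?}")),
    }
}

fn main() {
    // Two valid values followed by one of the rejected spellings.
    for (name, raw) in [("unit", "1"), ("e2e", "0"), ("platform", "yes")] {
        match parse_proof_flag(name, raw) {
            Ok(value) => println!("--{name} {raw} -> {value}"),
            Err(message) => println!("error: {message}"),
        }
    }
}
```

Rejecting anything other than `1` or `0` up front gives the agent one clear error instead of a retry loop through help output.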
Phase 4 turns the harness from a system that observes agent work into one that verifies it. Phase 3 gave agents a way to check whether their trace was good enough. Phase 4 gives them a way to check whether their implementation meets the story contract — and warns them before they close a task without running the check.
The `decision` table already has a `verify_command` column and `decision verify`
already runs it (Phase 1 infrastructure). The `story` table does not. Phase 4
extends the same pattern to stories, adds automatic trace scoring on write, and
introduces a pre-close verification gate.
Phase 4 is Rust CLI code + schema migration + documentation.
Five of the nine arXiv papers surveyed in Phase 0 converge on verification as the next capability:
| Paper | Recommendation |
|---|---|
| Runtime Substrate (2605.13357) | H3→H4 = "the harness can verify, not just observe" |
| AHE (2604.25850) | The verify_command column exists but has no story-level execution path |
| NLAHs (2603.25723) | NL policies need enforcement — validation gates before state transitions |
| "The Last Harness" (2604.21003) | The Evaluator role in the Worker→Evaluator→Evolution loop must be mechanical |
| Continual Harness (2605.09998) | Self-improvement requires knowing whether traces are accurate, which requires verification |
US-012 Story verify_command Field
↓ schema migration adds the column; CLI accepts the flag
↓ stories can now carry a mechanical proof command
US-015 Story Verify Command
↓ agents can run the proof command and record the result
↓ the Evaluator role becomes mechanical
US-016 Auto Trace Scoring on Write
↓ agents get immediate trace quality feedback when recording
↓ removes the need to remember to run score-trace separately
US-017 Pre-Close Verification Gate
↓ combines trace scoring + verification into a single checkpoint
↓ agents are warned before closing a task without proof
US-012 must be first because it creates the schema column. US-015 depends on it to have something to execute. US-016 is independent of US-012/US-015 but ordered here because it's simpler. US-017 depends on both US-015 (verification) and US-016 (auto-scoring) to compose them into a single gate.
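
The ordering argument above is a topological sort over the story dependencies. The sketch below encodes only the dependencies stated in this section and applies Kahn's algorithm, breaking ties by story id. The function name `implementation_order` is an assumption for illustration, not part of the harness.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Kahn's algorithm over (story, prerequisites) pairs.
/// Ties are broken by story id so the result is deterministic.
/// Returns None if the dependencies contain a cycle or an unknown story.
fn implementation_order(deps: &BTreeMap<&str, Vec<&str>>) -> Option<Vec<String>> {
    // Number of prerequisites each story is still waiting on.
    let mut remaining: BTreeMap<&str, usize> =
        deps.iter().map(|(story, prereqs)| (*story, prereqs.len())).collect();
    // Stories with no unmet prerequisites, kept sorted by id.
    let mut ready: BTreeSet<&str> = remaining
        .iter()
        .filter(|(_, count)| **count == 0)
        .map(|(story, _)| *story)
        .collect();
    let mut order = Vec::new();

    while let Some(next) = ready.pop_first() {
        order.push(next.to_string());
        // Release every story that listed `next` as a prerequisite.
        for (story, prereqs) in deps {
            if prereqs.contains(&next) {
                let count = remaining.get_mut(story).unwrap();
                *count -= 1;
                if *count == 0 {
                    ready.insert(*story);
                }
            }
        }
    }

    if order.len() == deps.len() { Some(order) } else { None }
}

fn main() {
    let mut deps = BTreeMap::new();
    deps.insert("US-012", vec![]);
    deps.insert("US-015", vec!["US-012"]);
    deps.insert("US-016", vec![]);
    deps.insert("US-017", vec!["US-015", "US-016"]);
    println!("{:?}", implementation_order(&deps));
}
```

With id-based tie-breaking, the sketch yields US-012, US-015, US-016, US-017, which matches the order chosen here. US-016 could legally come earlier, since it has no prerequisites.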
Background:
The `decision` table already has `verify_command`, `last_verified_at`, and
`last_verified_result` columns (Phase 1 schema, `001-init.sql` lines 75-79).
The `decision verify <id>` CLI command already runs the command via `sh -c`,
records pass/fail, and updates the timestamp (`infrastructure.rs` line 508+).
The `story` table has proof columns (`unit_proof`, `integration_proof`,
`e2e_proof`, `platform_proof`) and a free-text `evidence` field. But it has no
`verify_command` column. Stories cannot carry a mechanical check command that
proves the story's acceptance criteria are met.
Reason:
AHE (arXiv:2604.25850) says "every edit is a falsifiable contract." NLAHs
(arXiv:2603.25723) says NL policies need enforceable validation gates. The
decision table already implements this pattern — stories should too.
In the Phase 3 benchmark, T4 (authentication, high_risk) was the only task
that failed its trace tier requirement. There was no mechanism for the agent
to mechanically verify the story was complete beyond checking trace quality.
A `verify_command` on the story would let the agent (or benchmark) run
`npm test -- --run auth` to confirm the implementation works.
Solution:
- New migration file `scripts/schema/002-story-verify.sql` containing:
  - `ALTER TABLE story ADD COLUMN verify_command TEXT;`
  - `ALTER TABLE story ADD COLUMN last_verified_at TEXT;`
  - `ALTER TABLE story ADD COLUMN last_verified_result TEXT CHECK(last_verified_result IN ('pass','fail') OR last_verified_result IS NULL);`
- Update `harness-cli story add` to accept `--verify <command>`.
- Update `harness-cli story update` to accept `--verify <command>`.
- Update `StoryAddInput` and `StoryUpdateInput` in `application.rs`.
- Update `StoryAddArgs` and `StoryUpdateArgs` in `interface.rs`.
- Update SQL `INSERT` and `UPDATE` in `infrastructure.rs`.
Acceptance Criteria:
| # | Criterion | How to verify |
|---|---|---|
| 1 | `scripts/schema/002-story-verify.sql` exists and adds `verify_command`, `last_verified_at`, and `last_verified_result` columns to the `story` table. | Read the file. Confirm the three `ALTER TABLE` statements and the `CHECK` constraint on `last_verified_result`. |
| 2 | `harness-cli migrate` applies migration 002 on an existing database. | Run `harness-cli init` (creates v1 DB), then `harness-cli migrate`. Verify `schema_version` contains version 2 and the `story` table has the three new columns via `harness-cli query sql "PRAGMA table_info(story)"`. |
| 3 | `harness-cli story add --id US-099 --title "Test" --lane normal --verify "echo ok"` stores the `verify_command`. | Run the command, then `harness-cli query sql "SELECT verify_command FROM story WHERE id='US-099'"`. Expect `echo ok`. |
| 4 | `harness-cli story update --id US-099 --verify "npm test"` updates the `verify_command` on an existing story. | Run the command, then query again. Expect `npm test`. |
| 5 | `harness-cli init` on a fresh database creates tables with the v2 columns present. | Delete the DB, run `init`. Confirm the `story` table has `verify_command`, `last_verified_at`, `last_verified_result` via `PRAGMA`. |
| 6 | `cargo test` passes with tests covering the migration and the new fields. | Run `cargo test` in the workspace root. |
Lane: Normal (schema migration + CLI changes across all four layers).
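
Criteria 2 and 5 both depend on the CLI deciding which migration files still need to run. The sketch below shows one way that selection could work, assuming migrations are identified by the numeric prefix of the file name and that `schema_version` stores those numbers. The helper names are hypothetical and the real `migrate` logic in `infrastructure.rs` may differ.

```rust
/// Extract the numeric prefix of a migration file name,
/// e.g. "002-story-verify.sql" -> Some(2).
fn migration_version(file_name: &str) -> Option<u32> {
    file_name.split('-').next()?.parse().ok()
}

/// Return the migration files whose version is not yet recorded as applied,
/// sorted so they run in ascending version order.
/// Files without a numeric prefix are ignored.
fn pending_migrations<'a>(applied: &[u32], files: &[&'a str]) -> Vec<&'a str> {
    let mut pending: Vec<(u32, &'a str)> = files
        .iter()
        .filter_map(|file| migration_version(file).map(|version| (version, *file)))
        .filter(|(version, _)| !applied.contains(version))
        .collect();
    pending.sort();
    pending.into_iter().map(|(_, file)| file).collect()
}

fn main() {
    let files = ["002-story-verify.sql", "001-init.sql"];
    // Existing v1 database: only migration 002 is pending.
    println!("{:?}", pending_migrations(&[1], &files));
    // Fresh database: both files run, in version order.
    println!("{:?}", pending_migrations(&[], &files));
}
```

Under this scheme a fresh `init` and an `init` followed by `migrate` converge on the same schema, which is what criteria 2 and 5 check from two directions.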
Background:
`decision verify <id>` already exists. It reads `verify_command` from the
`decision` table, runs it via `sh -c` from the repo root, stores `pass` or
`fail` in `last_verified_result`, and updates `last_verified_at`
(`infrastructure.rs` lines 508-540).
After US-012, stories will have the same three columns. But there is no
`story verify` CLI command to execute the check.
Reason:
Runtime Substrate (arXiv:2605.13357) defines H4 as "the harness can run or
orchestrate proof checks consistently." "The Last Harness" (arXiv:2604.21003)
describes the Evaluator role — a mechanical agent that checks whether work
meets its contract. `story verify` is this Evaluator for story-level work.
Solution:
- Add `Verify { id: String }` variant to `StoryAction` enum in `interface.rs`.
- Add `verify_story(&self, id: &str)` to `HarnessService` in `application.rs`.
- Add `verify_story(&self, id: &str)` to `HarnessRepository` trait and `SqliteHarnessRepository` in `infrastructure.rs`.
- Implementation mirrors `verify_decision`: read `verify_command` from story, run `sh -c <command>` from repo root, store result and timestamp.
- Add `StoryVerifyResult` to `application.rs` (mirrors `DecisionVerifyResult`).
- Add `MissingStoryVerifyCommand(String)` variant to `HarnessInfraError`.
- Print output: `Running: <command>` then `Story <id> verification: pass/fail`.
- Exit code 0 for pass, 1 for fail.
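
The execution step in the list above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the planned implementation: the free function `verify_story`, the `VerifyOutcome` enum, and the exit code used for a missing command are hypothetical, the database read and write are left out, and a POSIX `sh` is assumed to be on the path.

```rust
use std::path::Path;
use std::process::Command;

#[derive(Debug, PartialEq)]
enum VerifyOutcome {
    Pass,
    Fail,
}

impl VerifyOutcome {
    fn as_str(&self) -> &'static str {
        match self {
            VerifyOutcome::Pass => "pass",
            VerifyOutcome::Fail => "fail",
        }
    }

    /// Exit code 0 for pass, 1 for fail, as listed above.
    fn exit_code(&self) -> i32 {
        match self {
            VerifyOutcome::Pass => 0,
            VerifyOutcome::Fail => 1,
        }
    }
}

/// Run a story's verify command through `sh -c` from `repo_root`.
/// `verify_command` stands in for the value read from the story row;
/// persisting the result and timestamp is omitted from this sketch.
fn verify_story(
    repo_root: &Path,
    id: &str,
    verify_command: Option<&str>,
) -> Result<VerifyOutcome, String> {
    let command =
        verify_command.ok_or_else(|| format!("story {id} has no verify_command"))?;
    println!("Running: {command}");
    let status = Command::new("sh")
        .arg("-c")
        .arg(command)
        .current_dir(repo_root)
        .status()
        .map_err(|error| error.to_string())?;
    let outcome = if status.success() {
        VerifyOutcome::Pass
    } else {
        VerifyOutcome::Fail
    };
    println!("Story {id} verification: {}", outcome.as_str());
    Ok(outcome)
}

fn main() {
    match verify_story(Path::new("."), "US-099", Some("echo ok")) {
        Ok(outcome) => std::process::exit(outcome.exit_code()),
        Err(message) => {
            eprintln!("Error: {message}");
            // Assumption: a missing command is reported with a nonzero exit.
            std::process::exit(1);
        }
    }
}
```

Setting the working directory with `current_dir` is what lets a command such as `pwd` report the repo root regardless of where the CLI was invoked.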
Acceptance Criteria:
| # | Criterion | How to verify |
|---|---|---|
| 1 | `harness-cli story verify US-099` runs the story's `verify_command` and prints the result. | Add a story with `--verify "echo ok"`, run `story verify US-099`. Output: `Running: echo ok` then `Story US-099 verification: pass`. |
| 2 | The command updates `last_verified_at` and `last_verified_result` in the database. | After verify, `query sql "SELECT last_verified_at, last_verified_result FROM story WHERE id='US-099'"` shows a timestamp and `pass`. |
| 3 | A failing `verify_command` records `fail`. | Add story with `--verify "exit 1"`, run `story verify`. Output shows `fail`. DB shows `fail`. |
| 4 | A story with no `verify_command` produces an error. | Add story without `--verify`, run `story verify`. Error: `story US-100 has no verify_command`. |
| 5 | `story verify` exits with code 0 on pass and code 1 on fail. | `harness-cli story verify US-099 && echo OK` prints `OK` for a passing command. `harness-cli story verify US-fail …` |
| 6 | The command runs from the repo root directory. | Add story with `--verify "pwd"`, verify output shows the repo root path. |
| 7 | `cargo test` passes with tests covering pass, fail, and missing `verify_command` cases. | Run `cargo test`. |
Lane: Normal (new CLI subcommand, touches all four code layers).
Background:
`harness-cli score-trace` exists as a separate command (Phase 3). Agents must
remember to run it after recording a trace. In the Phase 3 benchmark, trace
quality was 2.5/3.0 — agents sometimes forgot to self-check. The `score-trace`
command is available but not integrated into the trace recording workflow.
Reason:
AHE (arXiv:2604.25850) emphasizes immediate feedback over post-hoc evaluation. The Context Engineering paper (arXiv:2603.05344) notes that agents follow guidance best when it's presented at the point of action, not as a separate step. Auto-scoring removes the "remember to run score-trace" failure mode.
Solution:
- After `record_trace` succeeds in `HarnessService`, call `score_trace` with the newly created trace ID.
- Print the score summary after the `Trace #N recorded.` confirmation.
- If the trace is below its lane requirement, print a warning and the missing fields, but do NOT exit with code 1 (trace recording should always succeed; the warning is advisory).
- Update the `Trace` command handler in `interface.rs` to call `score_trace` after recording and print the result using the existing `print_trace_score` function.
Example Output:
Trace #8 recorded.
Tier achieved: standard (2/3)
Lane: high_risk -> required tier: detailed (3/3)
BELOW REQUIREMENT
Missing for detailed:
- decisions_made: empty
- duration_seconds: null (no explanation in notes)
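
The example output above can be reproduced by a small formatting routine. The sketch below only covers the printing step, under several assumptions: the `Tier` enum and `score_summary` function are hypothetical stand-ins for the existing `score_trace` / `print_trace_score` code, the minimal tier is assumed to print as `1/3`, and the caller supplies the lane's required tier (the only mapping this document states is `high_risk` requiring `detailed`).

```rust
/// Trace quality tiers; the numeric value is the "N/3" shown in the output.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Tier {
    Minimal = 1,
    Standard = 2,
    Detailed = 3,
}

impl Tier {
    fn name(self) -> &'static str {
        match self {
            Tier::Minimal => "minimal",
            Tier::Standard => "standard",
            Tier::Detailed => "detailed",
        }
    }
}

/// Build the advisory summary printed after a trace is recorded.
/// `lane` is Some((lane name, required tier)) when the trace is linked to an
/// intake. The function only formats text; it never signals failure.
fn score_summary(
    trace_id: u32,
    achieved: Tier,
    lane: Option<(&str, Tier)>,
    missing: &[&str],
) -> String {
    let mut out = format!("Trace #{trace_id} recorded.\n");
    out += &format!("Tier achieved: {} ({}/3)\n", achieved.name(), achieved as u8);
    if let Some((lane_name, required)) = lane {
        out += &format!(
            "Lane: {lane_name} -> required tier: {} ({}/3)\n",
            required.name(),
            required as u8
        );
        if achieved < required {
            out += "BELOW REQUIREMENT\n";
            out += &format!("Missing for {}:\n", required.name());
            for field in missing {
                out += &format!("- {field}\n");
            }
        }
    }
    out
}

fn main() {
    let missing = [
        "decisions_made: empty",
        "duration_seconds: null (no explanation in notes)",
    ];
    print!(
        "{}",
        score_summary(8, Tier::Standard, Some(("high_risk", Tier::Detailed)), &missing)
    );
}
```

Because the summary is plain text and the function returns no error, the `trace` command can keep exiting with code 0 whatever the score is.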
Acceptance Criteria:
| # | Criterion | How to verify |
|---|---|---|
| 1 | `harness-cli trace --summary "test" --outcome completed` prints the trace ID and the trace quality score. | Run the command. Output includes both `Trace #N recorded.` and `Tier achieved:`. |
| 2 | When the trace is linked to an intake via `--intake`, the output shows the lane requirement and whether it is met. | Record an intake with `--lane high_risk`, then `trace --summary "test" --outcome completed --intake 1`. Output includes `Lane: high_risk -> required tier: detailed`. |
| 3 | When the trace is below its lane requirement, the output shows `BELOW REQUIREMENT` and lists missing fields. | Record a minimal trace linked to a `high_risk` intake. Output includes `BELOW REQUIREMENT` and the missing field list. |
| 4 | The trace is always recorded successfully regardless of the score. | Even when below requirement, the trace row exists in the database. |
| 5 | The `trace` command always exits with code 0 (scoring is advisory, not blocking). | `harness-cli trace --summary "test" --outcome completed; echo $?` outputs `0`. |
| 6 | `cargo test` passes with tests covering auto-scoring output for minimal, standard, and detailed traces. | Run `cargo test`. |
Lane: Tiny (extends existing trace command output, no schema change, no new command).
Background:
When an agent records a trace with `--story US-012`, the trace is linked to
that story. But there is no check for whether the story's `verify_command` has
been run. An agent can close a task (record a trace with `--outcome completed`)
without ever verifying the story's acceptance criteria.
Reason:
NLAHs (arXiv:2603.25723) describes validation gates — checkpoints before state
transitions that enforce NL policy compliance. "The Last Harness"
(arXiv:2604.21003) says the Evaluator should catch incomplete work before the
final response. The pre-close gate combines US-015 (verification) and US-016
(auto-scoring) into a single checkpoint: when recording a trace, the agent is
warned if the linked story has an unverified verify_command.
Solution:
- In the `Trace` command handler (after recording and auto-scoring), check if the trace has a `--story` argument.
- If a story is linked, query the story's `verify_command` and `last_verified_result`.
- If `verify_command` is not null and `last_verified_result` is either null (never verified) or `fail` (last run failed), print a warning: `Warning: Story US-012 has verify_command but verification has not passed. Run: harness-cli story verify US-012`
- The warning is advisory: the trace is still recorded and the exit code remains 0.
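
The warning condition in the list above reduces to a small pure function. This is a sketch under assumptions: `pre_close_warning` is a hypothetical name, its two `Option` arguments stand in for the `verify_command` and `last_verified_result` columns read from the linked story, and the message is emitted on a single line.

```rust
/// Decide whether recording a trace linked to `story_id` should print the
/// advisory pre-close warning. Returns the warning text, or None.
fn pre_close_warning(
    story_id: &str,
    verify_command: Option<&str>,
    last_verified_result: Option<&str>,
) -> Option<String> {
    // No command configured: there is nothing to verify, so stay silent.
    if verify_command.is_none() {
        return None;
    }
    match last_verified_result {
        // The last verification passed: no warning.
        Some("pass") => None,
        // NULL (never verified) or "fail" (last run failed): warn.
        _ => Some(format!(
            "Warning: Story {story_id} has verify_command but verification has not passed. \
             Run: harness-cli story verify {story_id}"
        )),
    }
}

fn main() {
    // Never verified: the warning is printed, but the trace is still recorded.
    if let Some(warning) = pre_close_warning("US-012", Some("echo ok"), None) {
        println!("{warning}");
    }
}
```

Returning `Option<String>` instead of an error keeps the check advisory: the caller prints the text if present and carries on with exit code 0.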
Acceptance Criteria:
| # | Criterion | How to verify |
|---|---|---|
| 1 | When recording a trace with `--story US-099` where the story has a `verify_command` that has never been run, a warning is printed. | Add story with `--verify "echo ok"`. Record trace with `--story US-099 --summary "test" --outcome completed`. Output includes `Warning: Story US-099 has verify_command but verification has not passed.` |
| 2 | When the story's `verify_command` was already run and passed, no warning is printed. | Run `story verify US-099` (passes), then record trace with `--story US-099`. No warning in output. |
| 3 | When the story's last verification result is `fail`, the warning is printed. | Run `story verify US-fail` (fails), then record trace with `--story US-fail`. Warning is printed. |
| 4 | When the story has no `verify_command`, no warning is printed. | Add story without `--verify`. Record trace with that story. No warning. |
| 5 | When the trace has no `--story` flag, no verification check occurs. | Record trace without `--story`. No warning. |
| 6 | The trace is always recorded regardless of the warning. | After a warning, the trace row exists in the database. Exit code is 0. |
| 7 | `cargo test` passes with tests covering all four cases: no story, no `verify_command`, unverified, and previously passed. | Run `cargo test`. |
Lane: Tiny (extends existing trace command output, no schema change, no new command).
| Item | Why deferred | Phase |
|---|---|---|
| Benchmark comparison attribution (US-014) | Lives in `harness-benchmark`, not `repository-harness`. | Benchmark work |
| Machine-readable tool registry | NexAU gap, lower priority than verification | Phase 5 |
| Executable agent skills | Platform-dependent, moving target | Phase 5 |
| Sub-agents | No use case yet | Phase 5+ |
| Automated improvement proposals | Requires verification data first | Phase 5 (H5) |
| Config parameter search (Harbor) | Need more benchmark runs | Phase 6+ |
| Context rule enforcement / measurement | Secondary to verification | Phase 5 |
| Drift detection / entropy score | Interesting but not blocking | Phase 5 |
| Batch verification across all stories | Useful but not core; can be composed via `query sql` + shell | Phase 5 |
| Installer propagation of Phase 3/4 docs (US-007) | Separate PR | Separate |
Step 1: US-012 — Story verify_command field
- Create scripts/schema/002-story-verify.sql
- Update domain.rs (no new types needed, verify columns are strings)
- Update application.rs (StoryAddInput, StoryUpdateInput)
- Update interface.rs (StoryAddArgs, StoryUpdateArgs)
- Update infrastructure.rs (INSERT, UPDATE SQL, migrate logic)
- Write unit tests for migration and new fields
Estimated effort: ~3-4 hours
Step 2: US-015 — Story verify command
- Add StoryAction::Verify to interface.rs
- Add StoryVerifyResult to application.rs
- Add verify_story to HarnessService and HarnessRepository
- Add MissingStoryVerifyCommand to HarnessInfraError
- Implementation mirrors verify_decision
- Write unit tests for pass, fail, missing cases
Estimated effort: ~2-3 hours
Step 3: US-016 — Auto trace scoring on write
- Update Trace command handler in interface.rs
- After record_trace, call score_trace with the returned ID
- Print score using existing print_trace_score function
- Do NOT change exit code (advisory only)
- Write unit tests
Estimated effort: ~1-2 hours
Step 4: US-017 — Pre-close verification gate
- After auto-scoring in Trace handler, check --story link
- Query story verify_command and last_verified_result
- Print advisory warning if unverified or failed
- Add query_story_verify_status helper to infrastructure
- Write unit tests for all four cases
Estimated effort: ~1-2 hours
Step 5: Cross-references and documentation
- Update docs/HARNESS.md with story verification workflow
- Update docs/HARNESS_COMPONENTS.md (Verification: Partial → Covered)
- Update docs/HARNESS_MATURITY.md (H4 current status)
- Update docs/GLOSSARY.md with "verification gate" term
- Update AGENTS.md if new commands need to be in the reading list
- Record Phase 4 trace
Estimated effort: ~1-2 hours
Total estimated effort: ~8-13 hours
- Branch: `git checkout -b feature/phase-4-mechanical-verification main`
- Implement US-012 → US-015 → US-016 → US-017 (in order)
- Update cross-references (`HARNESS.md`, `HARNESS_COMPONENTS.md`, `HARNESS_MATURITY.md`, `GLOSSARY.md`)
- Run `cargo test`: all tests must pass
- Run `cargo clippy`: no warnings
- Run benchmark in `harness-benchmark`:
  - Install harness from feature branch
  - Run `./benchmark/run.sh --agent codex --harness feature/phase-4-mechanical-verification`
- Compare: `./benchmark/compare.sh phase-3-active-observability phase-4`
- Merge: Only when benchmark shows stable or improved results
| Metric | Phase 3 (current) | Phase 4 Target | Reasoning |
|---|---|---|---|
| Functional score | 37/37 (100%) | 37/37 (100%) | Phase 4 doesn't change app code |
| Harness compliance | 31/31 (100%) | 31/31 (100%) | Already perfect |
| Trace quality | 2.5/3.0 | 2.8-3.0/3.0 | Auto-scoring on write gives immediate feedback; agents fix traces before closing |
| Lane accuracy | 6/6 | 6/6 | Already perfect |
| Wall time | 1749s | ~1800-1900s | Slight increase from running verify commands |
| Token cost | $21.97 | ~$22-23 | Slight increase from reading verify output |
Success criteria:
- `cargo test` passes with coverage for all four stories.
- `story verify` correctly runs `verify_command` and records pass/fail.
- `trace` command auto-scores and prints the tier summary.
- `trace --story` warns when the linked story is unverified.
- Benchmark trace quality rises to ≥2.8/3.0 (agents use auto-score feedback).
- No regression in functional score, harness compliance, or lane accuracy.

Failure signals:
- `story verify` disagrees with `decision verify` behavior (inconsistent verification patterns).
- Auto-scoring breaks existing `trace` command behavior or exit codes.
- Benchmark trace quality stays at 2.5 (auto-scoring feedback ignored by agent).
- `cargo test` or `cargo clippy` failures.
- Functional score drops (verification overhead confused the agent).
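
The benchmark-related criteria above can be expressed as a mechanical merge check. The sketch below is an assumption-laden illustration: the `BenchmarkRun` struct and `merge_blockers` function are hypothetical and are not part of `harness-benchmark`, and only the numeric criteria are modeled (`cargo test` and `cargo clippy` results are out of scope).

```rust
/// Benchmark results for one run; score pairs are (passed, total).
struct BenchmarkRun {
    functional: (u32, u32),
    compliance: (u32, u32),
    lane_accuracy: (u32, u32),
    trace_quality: f64,
}

/// Trace quality target taken from the criteria above.
const TRACE_QUALITY_TARGET: f64 = 2.8;

/// True when the pass ratio dropped, compared by cross-multiplication so no
/// floating-point division is needed.
fn regressed(before: (u32, u32), after: (u32, u32)) -> bool {
    (after.0 as u64) * (before.1 as u64) < (before.0 as u64) * (after.1 as u64)
}

/// List the reasons the candidate run should not be merged.
/// An empty list means the numeric criteria are met.
fn merge_blockers(baseline: &BenchmarkRun, candidate: &BenchmarkRun) -> Vec<&'static str> {
    let mut blockers = Vec::new();
    if regressed(baseline.functional, candidate.functional) {
        blockers.push("functional score regressed");
    }
    if regressed(baseline.compliance, candidate.compliance) {
        blockers.push("harness compliance regressed");
    }
    if regressed(baseline.lane_accuracy, candidate.lane_accuracy) {
        blockers.push("lane accuracy regressed");
    }
    if candidate.trace_quality < TRACE_QUALITY_TARGET {
        blockers.push("trace quality below 2.8/3.0");
    }
    blockers
}

fn main() {
    // Phase 3 figures from the metrics table; the candidate is illustrative.
    let phase3 = BenchmarkRun {
        functional: (37, 37),
        compliance: (31, 31),
        lane_accuracy: (6, 6),
        trace_quality: 2.5,
    };
    let candidate = BenchmarkRun {
        functional: (37, 37),
        compliance: (31, 31),
        lane_accuracy: (6, 6),
        trace_quality: 2.5,
    };
    println!("{:?}", merge_blockers(&phase3, &candidate));
}
```

A candidate that merely repeats the Phase 3 numbers is reported as blocked on trace quality, which mirrors the "stays at 2.5" failure signal.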