
# Phase 4 — Mechanical Verification: Finalized Scope

Target repo: repository-harness (feature branch off main)
Validation: harness-benchmark re-run after implementation
Current harness maturity: H3 partial (Phase 3 active observability complete)
Target maturity: H3 (full) → H4 (partial: story verification, auto-scoring, pre-close gate)

Benchmark Triage After First Re-Run

The first Phase 4 benchmark showed one isolated compliance miss and several command-shape friction loops:

T4 authentication included decision text in the trace but did not create a durable decision record. High-risk work that changes auth, authorization, data ownership, API behavior, architecture, or validation must add a docs/decisions/NNNN-*.md record and a durable decision row with scripts/bin/harness-cli decision add. Trace --decisions is evidence, not the decision log.
Rust CLI proof flags require numeric booleans. Use --unit 1 --integration 1 --e2e 0 --platform 0; do not use yes or no.
story verify <id> runs the story's configured verify_command and records pass/fail. It accepts only the story id. Proof flags belong to story update.
Agents should prefer the command examples in docs/HARNESS.md and scripts/README.md before repeated help probing. Re-run help only when the command shape is still unknown.
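
The numeric-boolean rule for proof flags can be illustrated with a small parser. This is a minimal sketch, not the harness's actual argument handling: the function name `parse_proof_flag` and the error text are hypothetical, and the real CLI may validate these flags differently.

```rust
/// Hypothetical parser for proof flags such as `--unit 1` or `--e2e 0`.
/// Only the numeric booleans "0" and "1" are accepted; words like "yes"
/// or "no" are rejected, which is the friction the triage describes.
pub fn parse_proof_flag(name: &str, value: &str) -> Result<bool, String> {
    match value {
        "1" => Ok(true),
        "0" => Ok(false),
        // Error wording is illustrative, not the CLI's real message.
        other => Err(format!("--{} expects 0 or 1, got {:?}", name, other)),
    }
}
```

Under this sketch, `parse_proof_flag("unit", "1")` yields `Ok(true)`, while `parse_proof_flag("unit", "yes")` yields an error.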

What Phase 4 Is

Phase 4 turns the harness from a system that observes agent work into one that verifies it. Phase 3 gave agents a way to check whether their trace was good enough. Phase 4 gives them a way to check whether their implementation meets the story contract — and warns them before they close a task without running the check.

The decision table already has a verify_command column and decision verify already runs it (Phase 1 infrastructure). The story table has neither a verify_command column nor a verify subcommand. Phase 4 extends the same pattern to stories, adds automatic trace scoring on write, and introduces a pre-close verification gate.

Phase 4 is Rust CLI code + schema migration + documentation.

Research Grounding

Five of the nine arXiv papers surveyed in Phase 0 converge on verification as the next capability:

| Paper | Recommendation |
| --- | --- |
| Runtime Substrate (2605.13357) | H3→H4 = "the harness can verify, not just observe" |
| AHE (2604.25850) | The `verify_command` column exists but has no story-level execution path |
| NLAHs (2603.25723) | NL policies need enforcement — validation gates before state transitions |
| "The Last Harness" (2604.21003) | The Evaluator role in the Worker→Evaluator→Evolution loop must be mechanical |
| Continual Harness (2605.09998) | Self-improvement requires knowing whether traces are accurate, which requires verification |

Why This Order Matters

US-012 Story verify_command Field
  ↓ schema migration adds the column; CLI accepts the flag
  ↓ stories can now carry a mechanical proof command
US-015 Story Verify Command
  ↓ agents can run the proof command and record the result
  ↓ the Evaluator role becomes mechanical
US-016 Auto Trace Scoring on Write
  ↓ agents get immediate trace quality feedback when recording
  ↓ removes the need to remember to run score-trace separately
US-017 Pre-Close Verification Gate
  ↓ combines trace scoring + verification into a single checkpoint
  ↓ agents are warned before closing a task without proof

US-012 must be first because it creates the schema column. US-015 depends on it to have something to execute. US-016 is independent of US-012/US-015 but ordered here because it's simpler. US-017 depends on both US-015 (verification) and US-016 (auto-scoring) to compose them into a single gate.
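
The ordering constraints above can be checked mechanically. The sketch below is illustrative only: `order_respects_dependencies` and `phase4_dependencies` are hypothetical names that are not part of the harness, and the dependency map encodes just the relationships stated in the preceding paragraph.

```rust
use std::collections::HashMap;

/// Returns true when every story in `order` appears after all of its prerequisites.
pub fn order_respects_dependencies(order: &[&str], deps: &HashMap<&str, Vec<&str>>) -> bool {
    order.iter().enumerate().all(|(pos, story)| {
        // Stories without an entry in `deps` have no prerequisites.
        deps.get(story).map_or(true, |prereqs| {
            prereqs.iter().all(|p| order[..pos].contains(p))
        })
    })
}

/// Dependencies as described in the text: US-015 needs the schema column
/// from US-012, and US-017 composes US-015 and US-016.
pub fn phase4_dependencies() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("US-015", vec!["US-012"]),
        ("US-017", vec!["US-015", "US-016"]),
    ])
}
```

Because US-016 has no prerequisites, an order that places it first also satisfies the check; the document's sequence is one valid ordering rather than the only one.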

Stories

US-012: Story `verify_command` Field

Background:

The decision table already has verify_command, last_verified_at, and last_verified_result columns (Phase 1 schema, 001-init.sql lines 75-79). The decision verify <id> CLI command already runs the command via sh -c, records pass/fail, and updates the timestamp (infrastructure.rs line 508+).

The story table has proof columns (unit_proof, integration_proof, e2e_proof, platform_proof) and a free-text evidence field. But it has no verify_command column. Stories cannot carry a mechanical check command that proves the story's acceptance criteria are met.

Reason:

AHE (arXiv:2604.25850) says "every edit is a falsifiable contract." NLAHs (arXiv:2603.25723) says NL policies need enforceable validation gates. The decision table already implements this pattern — stories should too.

In the Phase 3 benchmark, T4 (authentication, high_risk) was the only task that failed its trace tier requirement. There was no mechanism for the agent to mechanically verify the story was complete beyond checking trace quality. A verify_command on the story would let the agent (or benchmark) run npm test -- --run auth to confirm the implementation works.

Solution:

New migration file scripts/schema/002-story-verify.sql:

ALTER TABLE story ADD COLUMN verify_command TEXT;
ALTER TABLE story ADD COLUMN last_verified_at TEXT;
ALTER TABLE story ADD COLUMN last_verified_result TEXT
  CHECK(last_verified_result IN ('pass','fail') OR last_verified_result IS NULL);

Update harness-cli story add to accept --verify <command>.
Update harness-cli story update to accept --verify <command>.
Update StoryAddInput and StoryUpdateInput in application.rs.
Update StoryAddArgs and StoryUpdateArgs in interface.rs.
Update SQL INSERT and UPDATE in infrastructure.rs.
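
One way to picture how `migrate` could decide that migration 002 still needs to run is to compare the versions recorded in `schema_version` against the migration files on disk. This is a sketch under assumptions: the `Migration` struct and `pending_migrations` function are hypothetical, and the actual migrate logic in infrastructure.rs may be organized differently.

```rust
/// A migration file identified by its numeric prefix,
/// e.g. version 2 for `002-story-verify.sql`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Migration {
    pub version: u32,
    pub file: &'static str,
}

/// Hypothetical planner: returns the migrations whose version is not yet
/// recorded as applied, sorted in ascending version order.
pub fn pending_migrations(applied: &[u32], available: &[Migration]) -> Vec<Migration> {
    let mut pending: Vec<Migration> = available
        .iter()
        .copied()
        .filter(|m| !applied.contains(&m.version))
        .collect();
    pending.sort_by_key(|m| m.version);
    pending
}
```

With only version 1 applied, the planner returns the 002 migration; once version 2 is recorded, it returns nothing, so re-running `migrate` would be a no-op in this model.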

Acceptance Criteria:

| # | Criterion | How to verify |
| --- | --- | --- |
| 1 | `scripts/schema/002-story-verify.sql` exists and adds `verify_command`, `last_verified_at`, and `last_verified_result` columns to the `story` table. | Read the file. Confirm the three ALTER TABLE statements and the CHECK constraint on `last_verified_result`. |
| 2 | `harness-cli migrate` applies migration 002 on an existing database. | Run `harness-cli init` (creates v1 DB), then `harness-cli migrate`. Verify `schema_version` contains version 2 and `story` table has the three new columns via `harness-cli query sql "PRAGMA table_info(story)"`. |
| 3 | `harness-cli story add --id US-099 --title "Test" --lane normal --verify "echo ok"` stores the verify_command. | Run the command, then `harness-cli query sql "SELECT verify_command FROM story WHERE id='US-099'"`. Expect `echo ok`. |
| 4 | `harness-cli story update --id US-099 --verify "npm test"` updates the verify_command on an existing story. | Run the command, then query again. Expect `npm test`. |
| 5 | `harness-cli init` on a fresh database creates tables with the v2 columns present. | Delete the DB, run `init`. Confirm `story` table has `verify_command`, `last_verified_at`, `last_verified_result` via PRAGMA. |
| 6 | `cargo test` passes with tests covering the migration and the new fields. | Run `cargo test` in the workspace root. |

Lane: Normal (schema migration + CLI changes across all four layers).

US-015: Story Verify Command

Background:

decision verify <id> already exists. It reads verify_command from the decision table, runs it via sh -c from the repo root, stores pass or fail in last_verified_result, and updates last_verified_at (infrastructure.rs lines 508-540).

After US-012, stories will have the same three columns. But there is no story verify CLI command to execute the check.

Reason:

Runtime Substrate (arXiv:2605.13357) defines H4 as "the harness can run or orchestrate proof checks consistently." "The Last Harness" (arXiv:2604.21003) describes the Evaluator role — a mechanical agent that checks whether work meets its contract. story verify is this Evaluator for story-level work.

Solution:

Add Verify { id: String } variant to StoryAction enum in interface.rs.
Add verify_story(&self, id: &str) to HarnessService in application.rs.
Add verify_story(&self, id: &str) to HarnessRepository trait and SqliteHarnessRepository in infrastructure.rs.
Implementation mirrors verify_decision: read verify_command from story, run sh -c <command> from repo root, store result and timestamp.
Add StoryVerifyResult to application.rs (mirrors DecisionVerifyResult).
Add MissingStoryVerifyCommand(String) variant to HarnessInfraError.
Print output: `Running: <command>` then `Story <id> verification: pass/fail`.
Exit code 0 for pass, 1 for fail.
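
The core of the verify step, running a command through `sh -c` from the repo root and mapping its exit status to pass/fail, might look like the following. It is a minimal sketch that uses only the standard library; `VerifyOutcome` and `run_verify_command` are hypothetical names, and the real `verify_story` would also read the command from the database and store the result and timestamp.

```rust
use std::path::Path;
use std::process::Command;

/// Outcome stored in `last_verified_result`; the schema allows only these two values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VerifyOutcome {
    Pass,
    Fail,
}

impl VerifyOutcome {
    /// Text form matching the CHECK constraint on `last_verified_result`.
    pub fn as_str(self) -> &'static str {
        match self {
            VerifyOutcome::Pass => "pass",
            VerifyOutcome::Fail => "fail",
        }
    }

    /// Process exit code for the CLI: 0 for pass, 1 for fail.
    pub fn exit_code(self) -> i32 {
        match self {
            VerifyOutcome::Pass => 0,
            VerifyOutcome::Fail => 1,
        }
    }
}

/// Hypothetical helper: runs `command` via `sh -c` with `repo_root` as the
/// working directory. A zero exit status maps to Pass, anything else to Fail.
/// An `Err` means the shell itself could not be started.
pub fn run_verify_command(command: &str, repo_root: &Path) -> std::io::Result<VerifyOutcome> {
    let status = Command::new("sh")
        .arg("-c")
        .arg(command)
        .current_dir(repo_root)
        .status()?;
    Ok(if status.success() {
        VerifyOutcome::Pass
    } else {
        VerifyOutcome::Fail
    })
}
```

Keeping the outcome as an enum rather than a raw string means the database value and the exit code are derived from one place, which may help keep `story verify` consistent with `decision verify`.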

Acceptance Criteria:

| # | Criterion | How to verify |
| --- | --- | --- |
| 1 | `harness-cli story verify US-099` runs the story's `verify_command` and prints the result. | Add a story with `--verify "echo ok"`, run `story verify US-099`. Output: `Running: echo ok` then `Story US-099 verification: pass`. |
| 2 | The command updates `last_verified_at` and `last_verified_result` in the database. | After verify, `query sql "SELECT last_verified_at, last_verified_result FROM story WHERE id='US-099'"` shows a timestamp and `pass`. |
| 3 | A failing verify_command records `fail`. | Add story with `--verify "exit 1"`, run `story verify`. Output shows `fail`. DB shows `fail`. |
| 4 | A story with no verify_command produces an error. | Add story without `--verify`, run `story verify`. Error: `story US-100 has no verify_command`. |
| 5 | `story verify` exits with code 0 on pass and code 1 on fail. | `harness-cli story verify US-099 && echo OK` prints `OK` for passing command. `harness-cli story verify US-fail |
| 6 | The command runs from the repo root directory. | Add story with `--verify "pwd"`, verify output shows the repo root path. |
| 7 | `cargo test` passes with tests covering pass, fail, and missing verify_command cases. | Run `cargo test`. |

Lane: Normal (new CLI subcommand, touches all four code layers).

US-016: Auto Trace Scoring on Write

Background:

harness-cli score-trace exists as a separate command (Phase 3). Agents must remember to run it after recording a trace. In the Phase 3 benchmark, trace quality was 2.5/3.0 — agents sometimes forgot to self-check. The score-trace command is available but not integrated into the trace recording workflow.

Reason:

AHE (arXiv:2604.25850) emphasizes immediate feedback over post-hoc evaluation. The Context Engineering paper (arXiv:2603.05344) notes that agents follow guidance best when it's presented at the point of action, not as a separate step. Auto-scoring removes the "remember to run score-trace" failure mode.

Solution:

After record_trace succeeds in HarnessService, call score_trace with the newly created trace ID.
Print the score summary after the `Trace #N recorded.` confirmation.
If the trace is below its lane requirement, print a warning and the missing fields — but do NOT exit with code 1 (trace recording should always succeed; the warning is advisory).
Update the Trace command handler in interface.rs to call score_trace after recording and print the result using the existing print_trace_score function.

Example Output:

Trace #8 recorded.
  Tier achieved: standard (2/3)
  Lane: high_risk -> required tier: detailed (3/3)
  BELOW REQUIREMENT

  Missing for detailed:
    - decisions_made: empty
    - duration_seconds: null (no explanation in notes)
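
The advisory output shown above could be assembled roughly as follows. This sketch is a hypothetical stand-in for the existing `print_trace_score` function, whose real signature is not shown here; the `Tier` enum, the `format_auto_score` name, and the assumption that `minimal` corresponds to 1/3 are illustrative.

```rust
/// Trace quality tiers, declared in ascending order so that `<` means "lower quality".
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Minimal = 1, // assumed to be 1/3; the text only shows 2/3 and 3/3
    Standard = 2,
    Detailed = 3,
}

impl Tier {
    fn label(self) -> &'static str {
        match self {
            Tier::Minimal => "minimal",
            Tier::Standard => "standard",
            Tier::Detailed => "detailed",
        }
    }
}

/// Hypothetical formatter for the text printed after a trace is recorded.
/// `requirement` is the linked intake's lane and required tier, if any.
/// The result is advisory only; it never affects the exit code.
pub fn format_auto_score(
    trace_id: u64,
    achieved: Tier,
    requirement: Option<(&str, Tier)>,
    missing: &[&str],
) -> String {
    let mut out = format!(
        "Trace #{} recorded.\n  Tier achieved: {} ({}/3)\n",
        trace_id,
        achieved.label(),
        achieved as u8
    );
    if let Some((lane, required)) = requirement {
        out.push_str(&format!(
            "  Lane: {} -> required tier: {} ({}/3)\n",
            lane,
            required.label(),
            required as u8
        ));
        if achieved < required {
            out.push_str("  BELOW REQUIREMENT\n");
            if !missing.is_empty() {
                out.push_str(&format!("\n  Missing for {}:\n", required.label()));
                for field in missing {
                    out.push_str(&format!("    - {}\n", field));
                }
            }
        }
    }
    out
}
```

Returning a `String` instead of printing directly keeps the scoring text testable without capturing stdout, which may simplify the `cargo test` coverage the acceptance criteria ask for.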

Acceptance Criteria:

| # | Criterion | How to verify |
| --- | --- | --- |
| 1 | `harness-cli trace --summary "test" --outcome completed` prints the trace ID and the trace quality score. | Run the command. Output includes both `Trace #N recorded.` and `Tier achieved:`. |
| 2 | When the trace is linked to an intake via `--intake`, the output shows the lane requirement and whether it is met. | Record an intake with `--lane high_risk`, then `trace --summary "test" --outcome completed --intake 1`. Output includes `Lane: high_risk -> required tier: detailed`. |
| 3 | When the trace is below its lane requirement, the output shows `BELOW REQUIREMENT` and lists missing fields. | Record a minimal trace linked to a high_risk intake. Output includes `BELOW REQUIREMENT` and missing field list. |
| 4 | The trace is always recorded successfully regardless of the score. | Even when below requirement, the trace row exists in the database. |
| 5 | The `trace` command always exits with code 0 (scoring is advisory, not blocking). | `harness-cli trace --summary "test" --outcome completed; echo $?` outputs `0`. |
| 6 | `cargo test` passes with tests covering auto-scoring output for minimal, standard, and detailed traces. | Run `cargo test`. |

Lane: Tiny (extends existing trace command output, no schema change, no new command).

US-017: Pre-Close Verification Gate

Background:

When an agent records a trace with --story US-012, the trace is linked to that story. But there is no check for whether the story's verify_command has been run. An agent can close a task (record a trace with --outcome completed) without ever verifying the story's acceptance criteria.

Reason:

NLAHs (arXiv:2603.25723) describes validation gates — checkpoints before state transitions that enforce NL policy compliance. "The Last Harness" (arXiv:2604.21003) says the Evaluator should catch incomplete work before the final response. The pre-close gate combines US-015 (verification) and US-016 (auto-scoring) into a single checkpoint: when recording a trace, the agent is warned if the linked story has an unverified verify_command.

Solution:

In the Trace command handler (after recording and auto-scoring), check if the trace has a --story argument.
If a story is linked, query the story's verify_command and last_verified_result.
If verify_command is not null and last_verified_result is either null (never verified) or fail (last run failed), print a warning:
```
Warning: Story US-012 has verify_command but verification has not passed.
Run: harness-cli story verify US-012
```
The warning is advisory — the trace is still recorded. Exit code remains 0.
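
The gate's decision reduces to a small predicate over the two story columns. A minimal sketch, assuming a hypothetical `StoryVerifyStatus` struct and `pre_close_warning` function (neither is part of the existing codebase); the caller is assumed to skip the check entirely when the trace has no `--story` argument.

```rust
/// Hypothetical snapshot of the two story columns the gate reads.
/// `None` stands for SQL NULL.
pub struct StoryVerifyStatus<'a> {
    pub verify_command: Option<&'a str>,
    pub last_verified_result: Option<&'a str>,
}

/// Returns the advisory warning text, or `None` when no warning applies.
/// The result never blocks trace recording or changes the exit code.
pub fn pre_close_warning(story_id: &str, status: &StoryVerifyStatus<'_>) -> Option<String> {
    // A story without a verify_command has nothing to verify.
    if status.verify_command.is_none() {
        return None;
    }
    match status.last_verified_result {
        // The last verification passed: stay silent.
        Some("pass") => None,
        // NULL (never verified) and "fail" both produce the warning.
        _ => Some(format!(
            "Warning: Story {id} has verify_command but verification has not passed.\nRun: harness-cli story verify {id}",
            id = story_id
        )),
    }
}
```

Treating every value other than `pass` as unverified is a conservative choice in this sketch: if the column ever held an unexpected value, the agent would be warned rather than silently passed.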

Acceptance Criteria:

| # | Criterion | How to verify |
| --- | --- | --- |
| 1 | When recording a trace with `--story US-099` where the story has a `verify_command` that has never been run, a warning is printed. | Add story with `--verify "echo ok"`. Record trace with `--story US-099 --summary "test" --outcome completed`. Output includes `Warning: Story US-099 has verify_command but verification has not passed.` |
| 2 | When the story's verify_command was already run and passed, no warning is printed. | Run `story verify US-099` (passes), then record trace with `--story US-099`. No warning in output. |
| 3 | When the story's last verification result is `fail`, the warning is printed. | Run `story verify US-fail` (fails), then record trace with `--story US-fail`. Warning is printed. |
| 4 | When the story has no verify_command, no warning is printed. | Add story without `--verify`. Record trace with that story. No warning. |
| 5 | When the trace has no `--story` flag, no verification check occurs. | Record trace without `--story`. No warning. |
| 6 | The trace is always recorded regardless of the warning. | After a warning, the trace row exists in the database. Exit code is 0. |
| 7 | `cargo test` passes with tests covering all four cases: no story, no verify_command, unverified, and previously passed. | Run `cargo test`. |

Lane: Tiny (extends existing trace command output, no schema change, no new command).

Out of Scope for Phase 4

| Item | Why deferred | Phase |
| --- | --- | --- |
| Benchmark comparison attribution (US-014) | Lives in `harness-benchmark`, not `repository-harness`. | Benchmark work |
| Machine-readable tool registry | NexAU gap, lower priority than verification | Phase 5 |
| Executable agent skills | Platform-dependent, moving target | Phase 5 |
| Sub-agents | No use case yet | Phase 5+ |
| Automated improvement proposals | Requires verification data first | Phase 5 (H5) |
| Config parameter search (Harbor) | Need more benchmark runs | Phase 6+ |
| Context rule enforcement / measurement | Secondary to verification | Phase 5 |
| Drift detection / entropy score | Interesting but not blocking | Phase 5 |
| Batch verification across all stories | Useful but not core — can be composed via `query sql` + shell | Phase 5 |
| Installer propagation of Phase 3/4 docs (US-007) | Separate PR | Separate |

Implementation Sequence

Step 1: US-012 — Story verify_command field
  - Create scripts/schema/002-story-verify.sql
  - Update domain.rs (no new types needed, verify columns are strings)
  - Update application.rs (StoryAddInput, StoryUpdateInput)
  - Update interface.rs (StoryAddArgs, StoryUpdateArgs)
  - Update infrastructure.rs (INSERT, UPDATE SQL, migrate logic)
  - Write unit tests for migration and new fields
  Estimated effort: ~3-4 hours

Step 2: US-015 — Story verify command
  - Add StoryAction::Verify to interface.rs
  - Add StoryVerifyResult to application.rs
  - Add verify_story to HarnessService and HarnessRepository
  - Add MissingStoryVerifyCommand to HarnessInfraError
  - Implementation mirrors verify_decision
  - Write unit tests for pass, fail, missing cases
  Estimated effort: ~2-3 hours

Step 3: US-016 — Auto trace scoring on write
  - Update Trace command handler in interface.rs
  - After record_trace, call score_trace with the returned ID
  - Print score using existing print_trace_score function
  - Do NOT change exit code (advisory only)
  - Write unit tests
  Estimated effort: ~1-2 hours

Step 4: US-017 — Pre-close verification gate
  - After auto-scoring in Trace handler, check --story link
  - Query story verify_command and last_verified_result
  - Print advisory warning if unverified or failed
  - Add query_story_verify_status helper to infrastructure
  - Write unit tests for all four cases
  Estimated effort: ~1-2 hours

Step 5: Cross-references and documentation
  - Update docs/HARNESS.md with story verification workflow
  - Update docs/HARNESS_COMPONENTS.md (Verification: Partial → Covered)
  - Update docs/HARNESS_MATURITY.md (H4 current status)
  - Update docs/GLOSSARY.md with "verification gate" term
  - Update AGENTS.md if new commands need to be in the reading list
  - Record Phase 4 trace
  Estimated effort: ~1-2 hours

Total estimated effort: ~8-13 hours

Execution Workflow

Branch: git checkout -b feature/phase-4-mechanical-verification main
Implement US-012 → US-015 → US-016 → US-017 (in order)
Update cross-references (HARNESS.md, HARNESS_COMPONENTS.md, HARNESS_MATURITY.md, GLOSSARY.md)
Run cargo test — all tests must pass
Run cargo clippy — no warnings
Run benchmark in harness-benchmark:
- Install harness from feature branch
- Run ./benchmark/run.sh --agent codex --harness feature/phase-4-mechanical-verification
Compare: ./benchmark/compare.sh phase-3-active-observability phase-4
Merge: Only when benchmark shows stable or improved results

Expected Benchmark Deltas

| Metric | Phase 3 (current) | Phase 4 Target | Reasoning |
| --- | --- | --- | --- |
| Functional score | 37/37 (100%) | 37/37 (100%) | Phase 4 doesn't change app code |
| Harness compliance | 31/31 (100%) | 31/31 (100%) | Already perfect |
| Trace quality | 2.5/3.0 | 2.8-3.0/3.0 | Auto-scoring on write gives immediate feedback; agents fix traces before closing |
| Lane accuracy | 6/6 | 6/6 | Already perfect |
| Wall time | 1749s | ~1800-1900s | Slight increase from running verify commands |
| Token cost | $21.97 | ~$22-23 | Slight increase from reading verify output |

What Would Signal Success

cargo test passes with coverage for all four stories.
story verify correctly runs verify_command and records pass/fail.
trace command auto-scores and prints the tier summary.
trace --story warns when the linked story is unverified.
Benchmark trace quality rises to ≥2.8/3.0 (agents use auto-score feedback).
No regression in functional score, harness compliance, or lane accuracy.

What Would Signal Failure

story verify disagrees with decision verify behavior (inconsistent verification patterns).
Auto-scoring breaks existing trace command behavior or exit codes.
Benchmark trace quality stays at 2.5 (auto-scoring feedback ignored by agent).
cargo test or cargo clippy failures.
Functional score drops (verification overhead confused the agent).
