Target repo: repository-harness (feature branch off main)
Validation: harness-benchmark re-run after implementation
Current harness maturity: H4 partial (Phase 4 story verification + pre-close gate complete)
Target maturity: H4 (full) → H5 (partial: self-improvement loop, drift detection, batch verification)
Phase 4 is implemented and merged to main. Evidence:
| Story | Status | Evidence |
|---|---|---|
| US-012: Story `verify_command` field | Implemented | `002-story-verify.sql` migration; `story add --verify` and `story update --verify` work |
| US-015: Story verify command | Implemented | `story verify <id>` runs `verify_command`, records pass/fail |
| US-016: Auto trace scoring on write | Implemented | `trace` command auto-scores and prints tier summary |
| US-017: Pre-close verification gate | Implemented | `trace --story` warns when linked story verification hasn't passed |
| US-018: CLI UX hardening | Implemented | `--version`, accepted-value help, numeric proof hints, recovery messages |
The maturity ladder shows H4 as partial — batch verification and proof-column automation remain open.
Phase 5 turns the harness from a system that verifies individual stories into one that can evolve itself. Phase 4 gave agents mechanical proof for each story. Phase 5 gives the harness the ability to:
- Complete H4 — batch verification across all stories.
- Start H5 — friction-to-proposal pipeline, drift detection, self-improvement protocol.
- Close remaining component gaps — tool registry, intervention recording schema, context rule measurement.
The theme is Validate → Check → Improve:
Validate: score-context (US-022) + audit (US-023) → "here's what's wrong"
Check: verify-all (US-020) → "here's what's broken"
Improve: propose (US-024) → "here's what to fix, based on patterns"
| Paper | Phase 5 Relevance |
|---|---|
| Runtime Substrate (2605.13357) | H5 = "the harness can propose safe improvements to itself" |
| Continual Harness (2605.09998) | Self-improvement requires verified traces + outcome comparison — Phase 4 provides the verification; Phase 5 adds the evolution loop |
| "The Last Harness" (2604.21003) | The Evolution role in Worker→Evaluator→Evolution must be mechanical and auditable |
| AHE (2604.25850) | Tool registry and capability manifest close the NexAU "Tool descriptions" gap |
| NLAHs (2603.25723) | Context rule enforcement needs measurement before it can be enforced |
US-019 Tool Registry
↓ machine-readable command manifest enables measurement and discovery
↓ closes the NexAU "Tool descriptions" gap
US-020 Batch Story Verification
↓ agents can verify all stories at once, completes H4
↓ the Evaluator role scales beyond single stories
US-021 Intervention Recording Schema
↓ human review events are separated from agent traces
↓ enables analysis of where humans override agent work
US-022 Context Rule Measurement
↓ measures whether agents actually follow context rules
↓ enables future enforcement and identifies over/under-reading
US-023 Drift Detection / Entropy Score
↓ stale docs, orphaned stories, and unlinked decisions are flagged
↓ enables the self-improvement proposal pipeline
US-024 Improvement Proposal Pipeline
↓ friction patterns become structured proposals with predicted impact
↓ the harness starts proposing its own backlog items — H5 begins
US-019 is first because the tool registry is needed for context measurement (US-022) and provides the manifest that batch verification (US-020) can reference. US-020 completes H4. US-021 is independent but unlocks better analysis. US-022 depends on having a manifest of what exists. US-023 depends on having structured records to audit. US-024 depends on friction data (existing), drift data (US-023), and intervention data (US-021) to compose proposals.
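The ordering argument above is effectively a topological sort over story dependencies. The following is a minimal sketch, assuming only the dependency edges stated in that paragraph and breaking ties by story id; the function name `implementation_order` is hypothetical and not part of harness-cli.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns stories in an order where every prerequisite precedes its dependents.
/// `deps` holds (prerequisite, dependent) pairs. Ties are broken by story id,
/// so the result is deterministic. Returns None on a cycle or an unknown id.
fn implementation_order(stories: &[&str], deps: &[(&str, &str)]) -> Option<Vec<String>> {
    // Count how many unmet prerequisites each story has.
    let mut indegree: BTreeMap<&str, usize> = stories.iter().map(|s| (*s, 0)).collect();
    for &(_, dependent) in deps {
        *indegree.get_mut(dependent)? += 1;
    }
    // Stories with no unmet prerequisites are ready to implement.
    let mut ready: BTreeSet<&str> = indegree
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(s, _)| *s)
        .collect();
    let mut order = Vec::new();
    while let Some(next) = ready.pop_first() {
        order.push(next.to_string());
        for &(pre, dependent) in deps {
            if pre == next {
                let n = indegree.get_mut(dependent)?;
                *n -= 1;
                if *n == 0 {
                    ready.insert(dependent);
                }
            }
        }
    }
    // If some story never became ready, the edges contain a cycle.
    if order.len() == stories.len() { Some(order) } else { None }
}
```

With the edges US-019 → US-020, US-019 → US-022, US-021 → US-024, and US-023 → US-024, this yields the US-019 through US-024 order used in the plan.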
Background:
The NexAU cross-reference in HARNESS_COMPONENTS.md notes: "no standalone tool
schema or generated command reference." The CLI has --help output, but agents
cannot programmatically discover available commands, their arguments, or which
harness responsibilities they serve. Additionally, no mechanism exists for
registering non-harness-cli tools (project scripts, custom CLIs, etc.).
Reason:
AHE (arXiv:2604.25850) calls for explicit capability manifests. Context measurement (US-022) needs to know what tools exist to measure whether agents use the right ones.
Solution:
- New file `docs/TOOL_REGISTRY.md`: human-readable command reference generated from CLI help, grouped by responsibility.
- New command `harness-cli query tools --json`: outputs a JSON array of tool entries from two sources:
  - Compiled-in (harness-cli commands, `source: "compiled"`)
  - User-registered (from `tool` table, `source: "registered"`)
- New command `harness-cli query tools --summary`: compact one-liner list for agent context (avoids dumping full JSON into the model).
- New command `harness-cli query tools --responsibility <name>`: filtered view.
- New command `harness-cli tool register`: registers external tools with validation.
- New command `harness-cli tool remove`: deregisters a tool.
- New schema: `tool` table for user-registered tools.
- Update `HARNESS_COMPONENTS.md`: Tool access: Partial → Covered; Tool descriptions: Partial → Covered.
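The query commands in this list could be backed by one in-memory list that merges both sources. This is a rough sketch only: the `ToolEntry` struct keeps a subset of the planned fields, and the names `all_tools`, `filter_by_responsibility`, and `summary_line` are hypothetical, as is the case-insensitive match and the summary line layout.

```rust
/// Where a registry entry comes from, mirroring the planned `source` field.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    Compiled,
    Registered,
}

/// A reduced tool entry; the real schema carries more fields (args, since, provider).
#[derive(Debug, Clone)]
struct ToolEntry {
    command: String,
    description: String,
    responsibility: String,
    source: Source,
}

/// Merges compiled-in and user-registered entries, compiled-in first.
fn all_tools(compiled: &[ToolEntry], registered: &[ToolEntry]) -> Vec<ToolEntry> {
    compiled.iter().chain(registered.iter()).cloned().collect()
}

/// `--responsibility <name>`: keep matching entries.
/// Ignoring ASCII case is an assumption, not something the plan specifies.
fn filter_by_responsibility<'a>(tools: &'a [ToolEntry], name: &str) -> Vec<&'a ToolEntry> {
    tools
        .iter()
        .filter(|t| t.responsibility.eq_ignore_ascii_case(name))
        .collect()
}

/// `--summary`: one compact line per tool (layout is illustrative).
fn summary_line(t: &ToolEntry) -> String {
    format!("{} [{}] {}", t.command, t.responsibility, t.description)
}
```

Keeping both sources in one list means the filtered and summary views need no source-specific code paths.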
Tool entry schema (JSON output):
{
  "provider": "harness-cli | custom",
  "command": "story verify",
  "description": "Run a story's verify_command and record pass/fail",
  "args": [
    { "name": "id", "type": "string", "required": true, "help": "Story id" }
  ],
  "responsibility": "Verification",
  "source": "compiled | registered",
  "since": "0.1.6"
}
Registration validation rules:
| Check | Enforcement |
|---|---|
| `--name` is unique | Reject duplicate names; show existing entry on error |
| `--command` path exists or is on `PATH` | Warn if not found; require `--force` to register anyway |
| `--description` is 10-200 chars | Reject too short or too long |
| `--responsibility` is a valid Runtime Substrate responsibility | Reject unknown values; print accepted list on error |
| `--args` follow `name:type:required` schema | Parse and reject malformed arg specs |
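Two of these checks are simple enough to sketch directly. The code below is an illustration under assumptions, not the planned implementation: `ArgSpec`, `validate_description`, and `parse_arg_spec` are hypothetical names, the `optional` keyword is assumed as the counterpart of `required`, and an optional fourth field of comma-separated allowed values is assumed for enum-style arguments.

```rust
#[derive(Debug, PartialEq)]
struct ArgSpec {
    name: String,
    ty: String,
    required: bool,
    /// Allowed values for enum-like args; empty when none are given.
    values: Vec<String>,
}

/// Rejects descriptions outside the 10-200 character range.
fn validate_description(desc: &str) -> Result<(), String> {
    let len = desc.chars().count();
    if (10..=200).contains(&len) {
        Ok(())
    } else {
        Err(format!("description must be 10-200 chars, got {len}"))
    }
}

/// Parses `name:type:required`, plus an assumed optional fourth field
/// holding comma-separated allowed values.
fn parse_arg_spec(spec: &str) -> Result<ArgSpec, String> {
    let parts: Vec<&str> = spec.split(':').collect();
    if parts.len() < 3 || parts.len() > 4 || parts[..3].iter().any(|p| p.is_empty()) {
        return Err(format!("malformed arg spec '{spec}': expected name:type:required"));
    }
    let required = match parts[2] {
        "required" => true,
        "optional" => false, // assumed keyword, not confirmed by the plan
        other => return Err(format!("unknown requirement '{other}'")),
    };
    let values: Vec<String> = parts
        .get(3)
        .map(|v| v.split(',').map(str::to_string).collect())
        .unwrap_or_default();
    Ok(ArgSpec {
        name: parts[0].to_string(),
        ty: parts[1].to_string(),
        required,
        values,
    })
}
```

Returning `Result<_, String>` keeps the error text close to the "actionable errors" the acceptance criteria ask for; a real implementation would likely use a dedicated error type.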
Registration example:
harness-cli tool register \
--name "deploy-check" \
--command "./scripts/deploy-check.sh" \
--description "Verify staging env is healthy before production deploy" \
--responsibility Verification \
--args "env:enum:required:staging,production"
Acceptance Criteria:
| # | Criterion | How to verify |
|---|---|---|
| 1 | `docs/TOOL_REGISTRY.md` lists every CLI command with arguments and purpose. | Read the file. Every command visible in `harness-cli --help` is documented. |
| 2 | `harness-cli query tools --json` returns valid JSON listing all compiled-in and registered commands. | Run command, pipe to `jq .`; verify each entry has `command`, `description`, `args`, `responsibility`. |
| 3 | `harness-cli query tools --summary` prints a compact one-liner list. | Run command. Output is a table with command, responsibility, and short description. |
| 4 | `harness-cli query tools --responsibility Verification` filters correctly. | Run command. Only verification-related tools appear. |
| 5 | `harness-cli tool register` validates inputs and stores in `tool` table. | Register a tool with valid inputs. Query `tool` table. Row exists. |
| 6 | `harness-cli tool register` rejects invalid inputs (missing description, bad responsibility, duplicate name). | Attempt invalid registrations. All rejected with actionable errors. |
| 7 | `harness-cli tool remove --name <name>` deregisters a tool. | Register then remove. Query `tool` table. Row gone. |
| 8 | A test asserts every harness-cli subcommand has a compiled-in registry entry. | Run `cargo test`. |
| 9 | `cargo test` passes. | Run `cargo test`. |
Lane: Normal (new query subcommand + schema + documentation, touches all four layers).
Background:
story verify <id> verifies one story at a time. There is no way to verify all
stories with verify_command set in a single pass. PHASE4.md deferred "batch
verification across all stories" to Phase 5.
Reason:
Runtime Substrate (arXiv:2605.13357) says H4 requires consistent proof checks. A single-story command requires agents to enumerate stories manually. Batch verification lets an agent or CI pipeline confirm all story contracts in one command.
Use cases:
- Pre-merge confidence gate: Run `story verify-all` before merging to confirm nothing regressed. Like `cargo test` but for story-level contracts.
- Benchmark validation: `harness-benchmark` calls `story verify-all` on the target repo to check that stories still pass after agent work.
- After refactoring: Unit tests pass, but did you break any story-level behavior? `verify-all` confirms product contracts still hold.
Solution:
- New command `harness-cli story verify-all`: queries all stories with non-null `verify_command`, runs each, prints per-story results, and exits with code 1 if any fail.
- Summary output: `N stories verified: X passed, Y failed, Z skipped (no verify_command)`.
- Update `docs/HARNESS.md` with batch verification workflow.
- Update `docs/HARNESS_MATURITY.md`: H4 current status: Partial → Achieved.
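The batch loop described here can be sketched as a tally over (story id, optional command) pairs. This is an assumption-laden illustration: `Summary`, `verify_all`, and `run_shell` are hypothetical names, the command runner is passed in so the tallying logic can be tested without a shell, and running through `sh -c` is only one possible way to execute a `verify_command`.

```rust
use std::process::Command;

#[derive(Debug, Default, PartialEq)]
struct Summary {
    passed: usize,
    failed: usize,
    skipped: usize,
}

impl Summary {
    /// 0 when nothing failed, 1 otherwise.
    fn exit_code(&self) -> i32 {
        if self.failed == 0 { 0 } else { 1 }
    }

    /// Renders the summary line in the format given in the plan.
    fn line(&self) -> String {
        format!(
            "{} stories verified: {} passed, {} failed, {} skipped (no verify_command)",
            self.passed + self.failed,
            self.passed,
            self.failed,
            self.skipped
        )
    }
}

/// Runs each story's verify command through `run`, printing one line per
/// verified story and counting stories without a command as skipped.
fn verify_all(stories: &[(&str, Option<&str>)], run: impl Fn(&str) -> bool) -> Summary {
    let mut summary = Summary::default();
    for &(id, cmd) in stories {
        match cmd {
            None => summary.skipped += 1,
            Some(cmd) => {
                let ok = run(cmd);
                println!("Story {id}: {}", if ok { "pass" } else { "fail" });
                if ok { summary.passed += 1 } else { summary.failed += 1 }
            }
        }
    }
    summary
}

/// One possible runner: execute via `sh -c` and treat exit status 0 as pass.
/// A command that cannot be spawned counts as a failure.
fn run_shell(cmd: &str) -> bool {
    Command::new("sh")
        .arg("-c")
        .arg(cmd)
        .status()
        .map(|s| s.success())
        .unwrap_or(false)
}
```

A caller would pass `run_shell` as the runner and hand `summary.exit_code()` to `std::process::exit`.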
Acceptance Criteria:
| # | Criterion | How to verify |
|---|---|---|
| 1 | `harness-cli story verify-all` runs all stories with `verify_command` and prints summary. | Add 3 stories (2 with verify commands, 1 without). Run `verify-all`. Output shows 2 verified, 1 skipped. |
| 2 | Exit code is 0 when all pass, 1 when any fail. | Add a story with `--verify "exit 1"`. Run `verify-all`. Exit code is 1. |
| 3 | Each story result is printed individually before the summary. | Output includes per-story lines: `Story US-XXX: pass/fail`. |
| 4 | `docs/HARNESS_MATURITY.md` H4 current status is updated to "Achieved." | Read the file. |
| 5 | `cargo test` passes with tests covering pass-all, fail-some, and no-verify-command cases. | Run `cargo test`. |
Lane: Normal (new CLI subcommand, touches all four code layers + documentation).
Background:
HARNESS_COMPONENTS.md notes: "Human interventions are not separated from normal
agent actions, and there is no review-event schema." Traces record what happened,
but don't distinguish agent-autonomous actions from human-directed corrections.
Reason:
"The Last Harness" (arXiv:2604.21003) describes the Evolution role as learning from where humans intervene. Without structured intervention records, the harness can't identify which agent behaviors need improvement. The improvement proposal pipeline (US-024) needs this data to propose targeted fixes.
Solution:
- New migration `scripts/schema/004-intervention.sql` (numbered after US-019's `003-tool-registry.sql`, matching the implementation steps):

      CREATE TABLE intervention (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        created_at TEXT NOT NULL DEFAULT (datetime('now')),
        trace_id INTEGER REFERENCES trace(id),
        story_id TEXT,
        type TEXT NOT NULL CHECK(type IN ('correction','override','escalation','approval')),
        description TEXT NOT NULL,
        source TEXT NOT NULL CHECK(source IN ('human','reviewer','ci','agent')),
        impact TEXT
      );

- New command `harness-cli intervention add --trace <id> --type <type> --description <text> --source <source>`.
- New query `harness-cli query interventions`: lists interventions with optional `--trace`, `--story`, and `--type` filters.
- Update trace recording prompt: after recording a trace, print a reminder: `Reminder: Record any human corrections with: harness-cli intervention add`.
- Update `HARNESS_COMPONENTS.md`: Intervention recording: Partial → Covered.
- Update `docs/HARNESS.md`: add intervention recording to the Harness Delta section.
Intervention types:
| Type | Meaning | Example |
|---|---|---|
| `correction` | Human fixed something the agent got wrong | "Use error handling, not unwrap()" |
| `override` | Human rejected agent's approach entirely | "Don't add a new table, use existing backlog table" |
| `escalation` | Risk or scope was raised by human | "This is high-risk, add a decision record" |
| `approval` | Human explicitly approved a risky action | "Yes, proceed with schema migration" |
Sources:
| Source | Who triggered it |
|---|---|
| `human` | Direct human feedback (PR review, chat, verbal) |
| `reviewer` | Code review tool or automated review bot |
| `ci` | CI/CD pipeline caught an issue |
| `agent` | Another agent flagged the issue (future: sub-agents) |
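On the CLI side, the type and source values could be validated before they reach the database, so that users see the accepted list rather than a raw constraint failure. A minimal sketch, assuming hypothetical enum names `InterventionType` and `InterventionSource` that mirror the `CHECK` constraints:

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InterventionType {
    Correction,
    Override,
    Escalation,
    Approval,
}

impl FromStr for InterventionType {
    type Err = String;

    /// Accepts exactly the values allowed by the `type` CHECK constraint.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "correction" => Ok(Self::Correction),
            "override" => Ok(Self::Override),
            "escalation" => Ok(Self::Escalation),
            "approval" => Ok(Self::Approval),
            other => Err(format!(
                "unknown type '{other}'; accepted: correction, override, escalation, approval"
            )),
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InterventionSource {
    Human,
    Reviewer,
    Ci,
    Agent,
}

impl FromStr for InterventionSource {
    type Err = String;

    /// Accepts exactly the values allowed by the `source` CHECK constraint.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "human" => Ok(Self::Human),
            "reviewer" => Ok(Self::Reviewer),
            "ci" => Ok(Self::Ci),
            "agent" => Ok(Self::Agent),
            other => Err(format!(
                "unknown source '{other}'; accepted: human, reviewer, ci, agent"
            )),
        }
    }
}
```

Printing the accepted values in the error follows the accepted-value help style that US-018 introduced.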
Recording triggers (how agents know to record):
- Pre-close gate prompts it — trace recording prints a reminder.
- Agent instructions in `HARNESS.md` document the policy.
- Humans can record directly after giving feedback.
- Imperfect capture is acceptable for Phase 5 — partial data still feeds US-024.
Acceptance Criteria:
| # | Criterion | How to verify |
|---|---|---|
| 1 | `scripts/schema/004-intervention.sql` creates the `intervention` table with the specified columns and constraints. | Read the file. |
| 2 | `harness-cli migrate` applies migration 004 on an existing v3 database. | Run `init`, `migrate`. Verify `schema_version` contains version 4. |
| 3 | `harness-cli intervention add` stores a record and prints confirmation. | Run command, query `intervention` table. |
| 4 | `harness-cli query interventions` returns stored records. | Add 2 interventions, query. Both appear. |
| 5 | `--trace`, `--story`, and `--type` filters work. | Add interventions with different traces/stories/types. Filters return correct subsets. |
| 6 | Trace recording prints intervention reminder after confirmation. | Record a trace. Output includes the reminder line. |
| 7 | `cargo test` passes. | Run `cargo test`. |
Lane: Normal (schema migration + new CLI commands across all four layers).
Background:
CONTEXT_RULES.md defines what agents should read per phase and lane, but there's
no measurement of whether agents actually follow these rules. PHASE4.md deferred
"context rule enforcement / measurement" to Phase 5.
Reason:
NLAHs (arXiv:2603.25723) says NL policies need measurement before enforcement. You
can't enforce what you can't measure. The files_read field in traces already
captures what agents read — US-022 adds a command that compares that against what
CONTEXT_RULES.md says they should have read.
The feedback loop:
score-context → "you read 3/5 Must docs, missed: ARCHITECTURE.md"
→ agent reads the missing docs next time → score goes up
→ if agents consistently miss the same doc → propose command (US-024) suggests
updating context rules
Solution:
- New command `harness-cli score-context <trace-id>`: reads the trace's `files_read`, determines lane from linked intake, infers phase from trace content, and compares against `CONTEXT_RULES.md` rules compiled into the binary.
- Output: Must-read compliance (X/Y), Should-read compliance, over-reading advisory.
- Advisory output: no exit code change, similar to `score-trace`.
- Phase inference heuristic:
  - Has `story_id` + `files_changed` → implementation/validation phase.
  - Has only `intake_id`, no `files_changed` → intake/planning phase.
  - Has `outcome = completed` → trace phase.
- "Relevant" doc matching uses Retrieval Triggers from `CONTEXT_RULES.md`:
  - If `files_changed` includes `scripts/schema/*` → flag decision 0004 as Must.
  - If `files_changed` includes CLI code → flag decision 0005 as Must.
  - Pattern paths (`docs/stories/*`) match if any file under that path was read.
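The phase heuristic and the retrieval triggers can be written as two small pure functions. This sketch makes several assumptions that the plan does not settle: the rules are applied in the listed order with the first match winning (a completed trace with a story and changed files therefore scores as implementation), the `src/` prefix is a placeholder for "CLI code", and `TraceRecord`, `infer_phase`, and `triggered_decisions` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum Phase {
    ImplementationValidation,
    IntakePlanning,
    Trace,
    Unknown,
}

/// The subset of trace fields the heuristic looks at.
struct TraceRecord<'a> {
    story_id: Option<&'a str>,
    intake_id: Option<i64>,
    files_changed: Vec<&'a str>,
    outcome: Option<&'a str>,
}

/// Applies the three rules in listed order; the first match wins (assumed precedence).
fn infer_phase(t: &TraceRecord) -> Phase {
    if t.story_id.is_some() && !t.files_changed.is_empty() {
        Phase::ImplementationValidation
    } else if t.intake_id.is_some() && t.story_id.is_none() && t.files_changed.is_empty() {
        Phase::IntakePlanning
    } else if t.outcome == Some("completed") {
        Phase::Trace
    } else {
        Phase::Unknown
    }
}

/// Returns the decision ids that become Must-read because of what the trace changed.
fn triggered_decisions(files_changed: &[&str]) -> Vec<&'static str> {
    let mut must = Vec::new();
    if files_changed.iter().any(|f| f.starts_with("scripts/schema/")) {
        must.push("0004");
    }
    // "CLI code" is not defined by a path in the plan; `src/` is a placeholder.
    if files_changed.iter().any(|f| f.starts_with("src/")) {
        must.push("0005");
    }
    must
}
```

Because the listed rules can overlap, whichever precedence is chosen should be written down in `CONTEXT_RULES.md` or the tests so that scores stay reproducible.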
Example output:
harness-cli score-context 43
→ Trace #43 | Lane: normal | Phase: implementation
→
→ Must-read compliance: 4/5
→ ✓ Files being changed
→ ✓ Adjacent files with same pattern
→ ✓ Relevant product docs
→ ✓ Relevant story packet
→ ✗ MISSING: docs/ARCHITECTURE.md (Must for structural changes)
→
→ Should-read: 2/3
→ Over-reading: 1 doc (docs/HARNESS_MATURITY.md — Skip for normal implementation)
→
→ Context score: 4/5 must, 2/3 should
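The Must-read and over-reading figures in this output reduce to set comparisons between the rule lists and the trace's `files_read`. A minimal sketch, assuming the rules have already been resolved to concrete paths for the trace's lane and phase; `ContextScore` and `score_context` are hypothetical names.

```rust
use std::collections::BTreeSet;

/// Result of comparing what a trace read against the context rules.
struct ContextScore<'a> {
    must_read: usize,
    must_total: usize,
    /// Must docs that were not read.
    missing: Vec<&'a str>,
    /// Docs marked Skip that were read anyway (advisory only).
    over_read: Vec<&'a str>,
}

/// Compares `files_read` against the Must and Skip lists for one trace.
fn score_context<'a>(
    must: &[&'a str],
    skip: &[&'a str],
    files_read: &[&'a str],
) -> ContextScore<'a> {
    let read: BTreeSet<&str> = files_read.iter().copied().collect();
    let missing: Vec<&str> = must.iter().copied().filter(|d| !read.contains(d)).collect();
    let over_read: Vec<&str> = skip.iter().copied().filter(|d| read.contains(d)).collect();
    ContextScore {
        must_read: must.len() - missing.len(),
        must_total: must.len(),
        missing,
        over_read,
    }
}
```

Should-read compliance would follow the same pattern as the Must list, and pattern paths such as `docs/stories/*` would need prefix matching rather than the exact matching shown here.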
Acceptance Criteria:
| # | Criterion | How to verify |
|---|---|---|
| 1 | `harness-cli score-context <trace-id>` prints required vs actual docs read. | Record a trace with `--files-read` including some required docs. Run `score-context`. Output shows which required docs were read and which were missed. |
| 2 | Lane is inferred from the trace's linked intake. | Link trace to a normal lane intake. Score shows normal-lane requirements. |
| 3 | Missing required docs are listed with their Must/Should priority. | Output distinguishes Must-read misses from Should-read misses. |
| 4 | Over-reading (docs marked Skip that were read) is flagged as advisory. | Read a doc marked Skip for the lane. Score notes it. |
| 5 | Retrieval Triggers are applied (e.g., schema changes flag decision 0004). | Create trace with `files_changed` including a schema file. Score flags decision 0004 as Must. |
| 6 | `cargo test` passes with tests covering all lanes. | Run `cargo test`. |
Lane: Normal (new CLI command, requires compiled-in context rule lookup table).
Background:
HARNESS_COMPONENTS.md notes: "No drift detector, stale-doc audit, or entropy
score exists." PHASE4.md deferred this to Phase 5.
Reason:
Continual Harness (arXiv:2605.09998) says self-improvement requires knowing what's stale. Before the harness can propose improvements (US-024), it needs to identify what has drifted: stories with no recent traces, decisions never verified, docs not referenced in any trace, and backlog items that predicted impact but have no outcome.
Solution:
- New command `harness-cli audit`: scans durable records and reports six drift categories:
  - Orphaned stories: `status = 'accepted'` but no linked trace.
  - Unverified stories: `verify_command` set but `last_verified_result` is null.
  - Unverified decisions: `verify_command` set but `last_verified_result` is null.
  - Open backlog without outcomes: `predicted_impact` set but `actual_outcome` is null and `status = 'implemented'`.
  - Stale stories: most recent linked trace >30 days old, status not `implemented`.
  - Broken tools: registered tools whose `command` path does not exist on disk.
- Output: per-category counts, specific record IDs, and an overall entropy score (0-100, lower is better).
- New `docs/HARNESS_AUDIT.md` documenting what the audit checks and how to interpret scores.
Entropy score calculation:
score = (orphaned_stories × 10)
+ (unverified_stories × 5)
+ (unverified_decisions × 5)
+ (open_backlog_no_outcome × 2)
+ (stale_stories × 3)
+ (broken_tools × 8)
capped at 100
| Range | Interpretation |
|---|---|
| 0 | Perfect — everything verified, traced, outcomes recorded, tools healthy |
| 1-25 | Healthy — minor housekeeping needed |
| 26-50 | Attention needed — drift is accumulating |
| 51-100 | Action required — significant staleness undermining harness value |
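The weighted sum and the interpretation bands translate directly into code. A short sketch with hypothetical names (`DriftCounts`, `entropy_score`, `interpret`), using the weights and ranges given above:

```rust
/// Per-category drift counts produced by the audit.
struct DriftCounts {
    orphaned_stories: u32,
    unverified_stories: u32,
    unverified_decisions: u32,
    open_backlog_no_outcome: u32,
    stale_stories: u32,
    broken_tools: u32,
}

/// Weighted sum of drift counts, capped at 100 (lower is better).
fn entropy_score(c: &DriftCounts) -> u32 {
    let raw = c.orphaned_stories * 10
        + c.unverified_stories * 5
        + c.unverified_decisions * 5
        + c.open_backlog_no_outcome * 2
        + c.stale_stories * 3
        + c.broken_tools * 8;
    raw.min(100)
}

/// Maps a score to its interpretation band.
fn interpret(score: u32) -> &'static str {
    match score {
        0 => "Perfect",
        1..=25 => "Healthy",
        26..=50 => "Attention needed",
        _ => "Action required",
    }
}
```

For example, 2 orphaned stories, 1 unverified story, 3 backlog items without outcomes, and 1 stale story give 20 + 5 + 6 + 3 = 34, which falls in the "Attention needed" band.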
Example output:
harness-cli audit
→ === Harness Drift Audit ===
→
→ Orphaned stories (accepted, no traces): 2
→ - US-042: "Add retry logic" (accepted 45 days ago)
→ - US-043: "Improve error messages" (accepted 30 days ago)
→
→ Unverified stories: 1
→ - US-019: has verify_command, never verified
→
→ Unverified decisions: 0
→
→ Open backlog without outcomes: 3
→ - #4: "Reduce trace verbosity"
→ - #7: "Add lint step"
→ - #9: "Update context rules"
→
→ Stale stories: 1
→ - US-015: last trace 40 days ago, status still 'in_progress'
→
→ Broken tools: 0
→
→ Entropy score: 34/100 (lower is better)
→ Breakdown: orphaned=20, unverified=5, no-outcomes=6, stale=3, broken=0
Acceptance Criteria:
| # | Criterion | How to verify |
|---|---|---|
| 1 | `harness-cli audit` prints category counts and specific record IDs for each drift category. | Create stories, decisions, and backlog items in various states. Run `audit`. Verify correct categorization. |
| 2 | Entropy score is 0 when all records are clean. | Create a clean dataset (all verified, traced, outcomes recorded). Score is 0. |
| 3 | Entropy score increases proportionally with drift. | Add unverified stories and open backlog items. Score increases. |
| 4 | Broken tools check finds registered tools with missing command paths. | Register a tool with a nonexistent path. Run `audit`. Broken tools category lists it. |
| 5 | `docs/HARNESS_AUDIT.md` documents each check and score interpretation. | Read the file. |
| 6 | `cargo test` passes with tests covering clean, partial, and high-drift states. | Run `cargo test`. |
Lane: Normal (new CLI command + documentation, queries existing tables plus
tool table from US-019, no schema change beyond US-019).
Background:
H5 requires: "Repeated friction patterns are summarized into proposed harness changes." The backlog outcome loop (US-011) captures predicted vs actual impact. The audit command (US-023) identifies drift. The friction query (US-009) shows friction with context. The intervention table (US-021) records human corrections. But no command composes these into actionable improvement proposals.
Reason:
"The Last Harness" (arXiv:2604.21003) says the Evolution role must be mechanical. Runtime Substrate (arXiv:2605.13357) defines H5 as "the harness can propose safe improvements." This is the capstone story for Phase 5.
Solution:
- New command `harness-cli propose`: analyzes friction entries, audit results, intervention patterns, and backlog outcomes to generate structured improvement proposals.
- Each proposal includes: title, harness component affected, evidence, predicted impact, risk level, suggested action, and validation plan.
- Proposals are printed as structured text and optionally stored as `proposed` backlog items via `--commit` flag.
- Update `docs/HARNESS_MATURITY.md`: H5 current status: Not achieved → Partial.
- New `docs/IMPROVEMENT_PROTOCOL.md` documenting the proposal review workflow.
Pipeline architecture:
┌─────────────┐   ┌──────────────────┐    ┌─────────────────┐
│  Friction   │   │  Interventions   │    │   Audit/Drift   │
│ (existing)  │   │     (US-021)     │    │    (US-023)     │
└──────┬──────┘   └────────┬─────────┘    └────────┬────────┘
       │                   │                       │
       └───────────────────┼───────────────────────┘
                           │
                           ▼
               ┌───────────────────────┐
               │   Pattern Detection   │
               │ (group, count, rank)  │
               └───────────┬───────────┘
                           │
                           ▼
               ┌───────────────────────┐
               │  Proposal Generation  │
               │  (structured output)  │
               └───────────┬───────────┘
                           │
                    ┌──────┴──────┐
                    │             │
                    ▼             ▼
               Print only     --commit
               (advisory)  (creates backlog items)
Pattern detection (rule-based, not LLM):
- Friction: group by text similarity, flag patterns with ≥2 occurrences.
- Interventions: group by type + description, flag patterns with ≥2 occurrences.
- Audit: each non-zero drift category generates a housekeeping proposal.
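The group, count, and rank steps for friction text can be sketched without any model in the loop. Note the substitution: the plan says "text similarity" but does not define it, so this sketch uses exact equality after normalization (lowercasing and stripping punctuation) as a stand-in. `normalize` and `detect_patterns` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Lowercases and strips punctuation so trivially different entries group together.
/// Exact match after normalization stands in for "text similarity" here.
fn normalize(text: &str) -> String {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Groups entries by normalized text and returns (pattern, count) for groups
/// with at least `min_occurrences` members, most frequent first.
fn detect_patterns(entries: &[&str], min_occurrences: usize) -> Vec<(String, usize)> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for e in entries {
        *counts.entry(normalize(e)).or_insert(0) += 1;
    }
    let mut patterns: Vec<(String, usize)> = counts
        .into_iter()
        .filter(|(_, n)| *n >= min_occurrences)
        .collect();
    // Rank by count descending, then alphabetically for a stable order.
    patterns.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    patterns
}
```

Calling `detect_patterns(entries, 2)` applies the ≥2 threshold from the list above. A looser similarity measure (token overlap, for instance) could replace `normalize` without changing the counting and ranking steps.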
Confidence scoring:
| Confidence | Criteria |
|---|---|
| High | Pattern seen ≥3 times + affects a documented component + clear action |
| Medium | Pattern seen 2 times, or ≥3 times but action is ambiguous |
| Low | Single occurrence from audit drift, or pattern too vague for specific action |
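The confidence table maps onto a small decision function. This is a sketch under assumptions: `confidence` is a hypothetical name, vagueness is folded into the `clear_action` flag, and the one case the table leaves open (three or more occurrences with a clear action but no documented component) is treated as Medium.

```rust
#[derive(Debug, PartialEq)]
enum Confidence {
    High,
    Medium,
    Low,
}

/// Assigns a confidence level from pattern frequency and two yes/no judgments.
fn confidence(occurrences: usize, documented_component: bool, clear_action: bool) -> Confidence {
    match occurrences {
        // Single occurrences (for example, audit drift) are always Low.
        0 | 1 => Confidence::Low,
        // Exactly two occurrences is Medium regardless of the other inputs.
        2 => Confidence::Medium,
        // Three or more: High only when the component is documented and the action is clear.
        _ if documented_component && clear_action => Confidence::High,
        // Otherwise Medium (ambiguous action, or the assumed undocumented-component case).
        _ => Confidence::Medium,
    }
}
```

Keeping the rule as a pure function makes it straightforward to record in the decision record and to cover with table-driven tests.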
Example output:
harness-cli propose
→ === Improvement Proposals ===
→
→ Proposal 1 (high confidence):
→ Title: Add error handling guidance to CONTEXT_RULES.md
→ Component: Context selection
→ Evidence: 3 interventions (type=correction) about error handling
→ Predicted impact: Fewer correction interventions for error handling
→ Risk: Tiny
→ Suggested action: Add retrieval trigger for Rust error paths
→ Validation: Next 5 tasks show fewer error handling corrections
→
→ Proposal 2 (medium confidence):
→ Title: Reclassify schema changes as normal-lane
→ Component: Task specification
→ Evidence: 3 escalations (tiny→normal) when tasks touched schema
→ Predicted impact: Fewer mid-task escalations
→ Risk: Normal
→ Suggested action: Update FEATURE_INTAKE.md default lane for schema
→ Validation: Next schema task starts as normal without escalation
→
→ 2 proposals generated. Use --commit to create backlog items.
--commit behavior:
harness-cli propose --commit
→ Created backlog item #12: "Add error handling guidance"
→ predicted_impact: "Fewer correction interventions"
→ status: proposed
Items appear in `query backlog --open` for human review.
Acceptance Criteria:
| # | Criterion | How to verify |
|---|---|---|
| 1 | `harness-cli propose` analyzes friction, audit, and intervention data and prints at least one proposal when patterns exist. | Create friction entries and interventions with repeated patterns. Run `propose`. Output includes structured proposals. |
| 2 | Each proposal has title, component, evidence, predicted impact, risk, suggested action, and validation plan. | Read output. All fields present. |
| 3 | Confidence levels are assigned based on pattern frequency. | Create patterns with 2 and ≥3 occurrences. Medium and high confidence respectively. |
| 4 | `harness-cli propose --commit` creates `proposed` backlog items. | Run with `--commit`. Query backlog. New items exist with status `proposed` and `predicted_impact` filled. |
| 5 | `docs/IMPROVEMENT_PROTOCOL.md` documents the proposal-to-implementation workflow. | Read the file. |
| 6 | `docs/HARNESS_MATURITY.md` H5 updated to "Partial." | Read the file. |
| 7 | `cargo test` passes. | Run `cargo test`. |
Lane: High-risk (new analysis logic, changes harness evolution model, affects maturity claims, requires decision record).
| Item | Why deferred | Phase |
|---|---|---|
| Tool usage analytics | Unreliable without proper instrumentation — string-matching traces is fragile | Phase 6 (with explicit tool run wrapper or --tools-used flag) |
| Executable agent skills (NexAU "Skills") | Platform-dependent, requires skill registry standard | Phase 6 |
| Story dependency graph | Schema change + dependency resolution is complex | Phase 5.5 |
| Trace replay / comparison | Needs more trace data to be useful | Phase 5.5 |
| Sub-agents | No use case yet in this repo | Phase 6+ |
| Config parameter search (Harbor) | Needs more benchmark runs | Phase 6+ |
| Enforced permission layer | Needs measurement data from US-022 first | Phase 6 |
| Runtime middleware | Framework-dependent, no application code exists | Phase 6+ |
| Export/import of harness state | Nice-to-have, not on the H4→H5 critical path | Phase 6 |
| Benchmark comparison attribution (US-014) | Lives in `harness-benchmark` | Benchmark work |
| Installer propagation of Phase 3/4/5 docs | Separate PR, not core harness work | Separate |
Step 1: US-019 — Tool registry
- Create scripts/schema/003-tool-registry.sql (tool table)
- Add ToolAction to interface.rs (register, remove)
- Add QueryAction::Tools to interface.rs (--json, --summary, --responsibility)
- Add compiled-in registry Vec<ToolEntry> to domain.rs
- Add tool CRUD and query to infrastructure.rs
- Write docs/TOOL_REGISTRY.md
- Update HARNESS_COMPONENTS.md
- Test: registration validation, query output, compiled-in completeness
Estimated effort: ~4-5 hours
Step 2: US-020 — Batch story verification
- Add StoryAction::VerifyAll to interface.rs
- Add verify_all_stories to application.rs / infrastructure.rs
- Update HARNESS.md and HARNESS_MATURITY.md
- Test: pass-all, fail-some, no-verify-command
Estimated effort: ~2-3 hours
Step 3: US-021 — Intervention recording
- Create scripts/schema/004-intervention.sql
- Add InterventionAction to interface.rs
- Add intervention commands to all four layers
- Update trace recording to print reminder
- Update HARNESS_COMPONENTS.md and HARNESS.md
- Test: add, query, filters
Estimated effort: ~4-5 hours
Step 4: US-022 — Context rule measurement
- Compile context rules into a lookup table in domain.rs
- Add score-context command to interface.rs
- Add context scoring logic to application.rs
- Add retrieval trigger matching
- Test: all lanes, phase inference, retrieval triggers
Estimated effort: ~4-5 hours
Step 5: US-023 — Drift detection / entropy score
- Add `audit` command querying existing tables + tool table
- Write docs/HARNESS_AUDIT.md
- Test: clean, partial, and high-drift states; broken tools
Estimated effort: ~3-4 hours
Step 6: US-024 — Improvement proposal pipeline
- Add `propose` command composing friction + audit + interventions
- Add pattern detection logic (grouping, counting, ranking)
- Add --commit flag to create backlog items
- Write docs/IMPROVEMENT_PROTOCOL.md
- Write docs/decisions/0007-improvement-proposal-rules.md
- Update HARNESS_MATURITY.md
- Test: pattern detection, confidence levels, commit behavior
Estimated effort: ~5-6 hours
Step 7: Cross-references and documentation
- Update GLOSSARY.md with new terms (tool registry, intervention,
entropy score, improvement proposal, context score)
- Update HARNESS_COMPONENTS.md coverage assessment
- Update HARNESS_MATURITY.md current assessment table
- Record Phase 5 trace
Estimated effort: ~1-2 hours
Total estimated effort: ~23-30 hours
- Branch: `git checkout -b feature/phase-5-evolution-infrastructure main`
- Implement US-019 → US-020 → US-021 → US-022 → US-023 → US-024 (in order)
- Update cross-references (HARNESS.md, HARNESS_COMPONENTS.md, HARNESS_MATURITY.md, GLOSSARY.md)
- Run `cargo test`: all tests must pass
- Run `cargo clippy`: no warnings
- Run benchmark in `harness-benchmark`
- Compare: benchmark results against Phase 4 baseline
- Merge: Only when benchmark shows stable or improved results
| Metric | Phase 4 (current) | Phase 5 Target | Reasoning |
|---|---|---|---|
| Functional score | 37/37 (100%) | 37/37 (100%) | Phase 5 doesn't change app code |
| Harness compliance | 31/31 (100%) | 31/31 (100%) | Already perfect |
| Trace quality | 2.5-2.8/3.0 | 2.8-3.0/3.0 | Context scoring + audit give agents better feedback |
| Lane accuracy | 6/6 | 6/6 | Already perfect |
| Wall time | ~1800-1900s | ~2000-2100s | Slight increase from audit and context scoring |
| Token cost | ~$22-23 | ~$23-25 | Slight increase from reading audit output |
- `cargo test` passes with coverage for all six stories.
- `story verify-all` confirms all stories in one pass: H4 fully achieved.
- `audit` identifies real drift in benchmark runs.
- `propose` generates at least one actionable proposal from friction data.
- Context scoring shows whether agents follow `CONTEXT_RULES.md`.
- Benchmark trace quality rises to ≥2.8/3.0.
- No regression in functional score, harness compliance, or lane accuracy.
- Batch verification disagrees with individual `story verify` results.
- Audit command produces false positives that confuse agents.
- Proposal pipeline generates low-quality or circular proposals.
- Context rule measurement penalizes useful over-reading.
- Tool registration validation is too strict, discouraging adoption.
- `cargo test` or `cargo clippy` failures.
- Functional score drops (overhead confused the agent).