Refactor Kapi from the current workflow-heavy implementation into the durable-mode architecture described in GOAL.md, without weakening verification, hiding legacy behavior, or compromising ordinary Pi thinness.
Target:
kapi_architecture_score >= 90 / 100
The score must reflect real implementation progress, not cosmetic text changes. A kept experiment must improve the implementation, tests, or artifact contracts while preserving the safety boundaries in GOAL.md.
- Primary: `kapi_architecture_score` (points, higher is better): uncapped readiness-plus-quality score for the GOAL.md architecture. The first 100 points represent architecture readiness; verified maintainability improvements can add beyond-readiness points without weakening safety gates.
- Secondary:
  - `verify_pass`: 1 when `npm run verify` passes, 0 otherwise.
  - `obsolete_command_refs`: references to removed user-facing commands in implementation source/tests.
  - `obsolete_contract_refs`: stale references to removed user-facing commands in README/docs/skills/prompts.
  - `legacy_fallback_refs`: references to alias/fallback/legacy/compat/shim/redirect behavior.
  - `legacy_surface_refs`: explicit retained legacy command-surface evidence, including compatibility help text and old `/kapi-status` subcommand aliases.
  - `legacy_removal_score`: additive uncapped score awarded only when the source/tests/docs have no detected legacy command surface.
  - `required_mode_refs`: evidence that durable modes are represented.
  - `state_model_refs`: evidence for command/event/snapshot/active inventory semantics.
  - `worktree_boundary_refs`: evidence for Kapi-owned worktree/branch boundaries.
  - `anti_gaming_flags`: red flags such as test weakening, metric hardcoding, or no-op verification.
  - `test_conversion_score`: diagnostic test-suite conversion score from -100 to 0. It starts at -100 when stale legacy workflow assumptions dominate tests, and reaches 0 only when tests are converted to the durable-mode contract without weakening verification.
  - `stale_test_conversion_flags`: remaining stale-test indicators such as removed workflow commands, TDD/Review/Ultrawork assumptions, `progress.json`, Ralph-as-plan folder expectations, or Ralph `context.md`/`plan.md` artifact expectations.
  - `semantic_consistency_score`: diagnostic score for Kapi Autoresearch semantic ownership: bridge-term misuse, non-isolated root `autoresearch.*` dependencies, durable artifact mismatches, pi-autoresearch role coverage, and source-of-truth conflicts.
  - `pi_autoresearch_reference_score`: diagnostic score for mapping the pi-autoresearch reference loop into Kapi durable artifacts and behavior.
  - `root_autoresearch_dependency_count`, `non_isolated_root_autoresearch_refs`, `autoresearch_artifact_mismatch_count`, `source_of_truth_conflict_count`: detailed semantic debt counters for root `autoresearch.md`, `autoresearch.sh`, `autoresearch.checks.sh`, `autoresearch.jsonl`, `autoresearch.ideas.md`, and `autoresearch.config.json` references.
  - `pi_autoresearch_metric_parsing_role`, `pi_autoresearch_resume_reconstruction_role`: coarse role indicators for metric parsing and resume/reconstruction semantics.
  - `runtime_autoresearch_probe_executed`, `runtime_autoresearch_start_pass`, `runtime_autoresearch_start_contract_pass`: runtime probes that start `/kapi-autoresearch` in a temporary workspace and validate that Kapi-owned durable artifacts are created on disk.
  - `runtime_deep_interview_start_contract_pass`, `runtime_ralph_start_contract_pass`, `runtime_integrate_start_contract_pass`, `mode_runtime_probe_coverage`: runtime probes for the other durable modes.
  - `event_log_jsonl_parse_pass`, `snapshot_json_parse_pass`, `state_json_parse_pass`: semantic artifact parse checks for state/event/snapshot files.
  - `command_surface_probe_executed`, `exact_command_surface_pass`, `extra_human_command_count`, `missing_mode_subcommand_count`, `mode_subcommand_behavior_pass`: human command-surface contract diagnostics.
  - `kapi_readiness_score`, `ship_blocker_count`, `runtime_blocker_count`, `semantic_blocker_count`: separate readiness/blocker rollups that keep runtime and semantic blockers visible even when the primary architecture score is high.
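To make the artifact parse checks concrete, here is a minimal sketch of what a check like `event_log_jsonl_parse_pass` could verify. The function name `jsonlParsePass`, the rule that an empty log fails, and the sample event shapes are all assumptions for illustration; the real benchmark may apply stricter or different rules.

```typescript
// Hypothetical helper, not taken from the Kapi source.
// Returns true only when every non-empty line parses as a JSON object.
function jsonlParsePass(text: string): boolean {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) {
    return false; // assumption: an empty event log does not count as a pass
  }
  for (const line of lines) {
    try {
      const value: unknown = JSON.parse(line);
      // Each event is expected to be an object, not a bare scalar or array.
      if (typeof value !== "object" || value === null || Array.isArray(value)) {
        return false;
      }
    } catch {
      return false; // malformed JSON on any line fails the whole log
    }
  }
  return true;
}

// Sample event shapes are made up for the example.
const sampleLog = '{"type":"started"}\n{"type":"kept"}\n';
const sampleLogPasses = jsonlParsePass(sampleLog);
```

A check of this shape only proves the file is parseable; transition semantics still need behavioral tests.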
Launch only from a dedicated autoresearch branch or worktree, preferably named `autoresearch/<goal>` or `<type>/autoresearch-<goal>`. Do not launch the autonomous pi-autoresearch loop directly from the shared `dev`, `main`, or ordinary feature checkout, because pi-autoresearch may auto-commit kept runs and revert discarded runs. The benchmark and checks refuse non-dedicated branch names.
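The branch-name rule above can be sketched as a small predicate. This is an illustration only: `isDedicatedAutoresearchBranch` and the allowed character set are assumptions, and the actual check in `autoresearch.sh` or `autoresearch.checks.sh` may be stricter or looser.

```typescript
// Hypothetical patterns for the two preferred naming schemes:
//   autoresearch/<goal>   and   <type>/autoresearch-<goal>
// The allowed characters in <goal> and <type> are an assumption.
const DEDICATED_BRANCH_PATTERNS: RegExp[] = [
  /^autoresearch\/[A-Za-z0-9._-]+$/,
  /^[A-Za-z0-9._-]+\/autoresearch-[A-Za-z0-9._-]+$/,
];

function isDedicatedAutoresearchBranch(branch: string): boolean {
  const name = branch.trim();
  return DEDICATED_BRANCH_PATTERNS.some((pattern) => pattern.test(name));
}

// Shared branches such as "dev" and "main" match neither pattern.
const sharedBranchAllowed = isDedicatedAutoresearchBranch("main");
```

Matching on an allowlist of name shapes, rather than a denylist of `dev` and `main`, also rejects ordinary feature branches by default.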
The benchmark command must be exactly one of:
- `./autoresearch.sh`
- `bash autoresearch.sh`

Do not wrap it with shell chaining, fallback `echo METRIC ...`, `|| true`, or any alternate command that could hide benchmark failure.
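One way to enforce the exact-command rule is a strict string allowlist, sketched below. The name `isAllowedBenchmarkCommand` is hypothetical, and comparing without trimming or normalizing is a deliberate assumption so that any wrapper text causes a refusal.

```typescript
// The only two command strings the text above permits.
const ALLOWED_BENCHMARK_COMMANDS: string[] = [
  "./autoresearch.sh",
  "bash autoresearch.sh",
];

// Hypothetical guard: exact match only, so chaining such as
// "./autoresearch.sh || true" can never slip through.
function isAllowedBenchmarkCommand(command: string): boolean {
  return ALLOWED_BENCHMARK_COMMANDS.some((allowed) => allowed === command);
}

const wrappedCommandAllowed = isAllowedBenchmarkCommand("./autoresearch.sh || true");
```

An exact-match allowlist is simpler to audit than trying to detect every form of shell chaining.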
`./autoresearch.sh`

`autoresearch.sh` runs the project verification gate and emits parseable metric lines:
METRIC kapi_architecture_score=<number>
METRIC verify_pass=<0|1>
...
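A consumer of this output might parse the `METRIC name=value` lines as sketched here. The function name `parseMetricLines` and the choice to silently skip non-matching lines are assumptions; pi-autoresearch's own parser is not shown in this document and may behave differently.

```typescript
// Hypothetical parser for lines of the form: METRIC <name>=<number>
// Lines that do not match (log noise, blank lines) are ignored.
function parseMetricLines(output: string): Record<string, number> {
  const metrics: Record<string, number> = {};
  const pattern = /^METRIC\s+([A-Za-z_][A-Za-z0-9_]*)=(-?\d+(?:\.\d+)?)$/;
  for (const rawLine of output.split(/\r?\n/)) {
    const match = pattern.exec(rawLine.trim());
    if (match) {
      metrics[match[1]] = Number(match[2]);
    }
  }
  return metrics;
}

// Sample values are made up for the example.
const sampleOutput = [
  "running npm run verify",
  "METRIC kapi_architecture_score=87.5",
  "METRIC verify_pass=1",
  "METRIC test_conversion_score=-40",
].join("\n");
const parsed = parseMetricLines(sampleOutput);
```

Anchoring the pattern at both ends means a line such as `echo METRIC verify_pass=1` inside other log text is not mistaken for a metric.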
Implementation and tests:
- `src/**`: Kapi domain, application, adapters, presentation, state, worker, command, and tool implementation.
- `test/**`: behavioral, state-machine, command-surface, validation, artifact, and worker tests.
- `scripts/**`: local quality/scoring scripts when they strengthen measurement rather than bypass it.
Docs and prompt/skill surfaces:
- `GOAL.md`: target architecture and scoring source.
- `README.md`: user-facing command and architecture documentation.
- `docs/**`: supporting design and completeness docs.
- `skills/**`: Kapi phase/mode guidance, especially deep interview, Ralph, Autoresearch, Integrate, and review guidance.
- `prompts/**`: prompt contracts aligned with deterministic state-based skill injection.
Autoresearch session files:
- `autoresearch.md`
- `autoresearch.sh`
- `autoresearch.checks.sh`
- `autoresearch.ideas.md`
Do not modify or rely on these to fake progress:
- Do not weaken `npm run verify`, `npm test`, `npm run check`, or `npm run quality:budgets`.
- Do not delete, skip, or loosen tests just to improve the score.
- Do not hardcode metric output or remove scoring checks.
- Do not keep removed commands as aliases, redirects, fallback handlers, compatibility shims, or hidden command paths.
- Do not rename legacy behavior to avoid detection.
- Do not add direct `main` merge behavior.
- Do not allow commit/revert/reset outside Kapi-owned worktrees.
- Do not make ordinary Pi turns create Kapi state, artifacts, workers, or blocking hooks without explicit mode activation.
- Do not copy heavy runtime machinery from references when a thin Pi-native implementation is enough.
Hard gates for keep:
- `npm run verify` must pass.
- Verification surface must not be weakened.
- `kapi_architecture_score` must improve, or reach `>= 90` with no regressions.
- The improvement must come from real architecture, behavior, tests, or artifact-contract implementation.
- No anti-gaming flags may be introduced.
- The loop must run from a dedicated autoresearch branch/worktree, never from shared `dev`, `main`, or an ordinary feature checkout.
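The automatable part of these gates can be sketched as a single decision function. Every field name in `RunSummary` is hypothetical, and `regressions` is an assumed count of metrics that got worse. The gate requiring that improvement come from real implementation work cannot be checked mechanically and is left to review.

```typescript
// All field names are hypothetical; the real loop may track these differently.
interface RunSummary {
  verifyPass: boolean;           // METRIC verify_pass was 1
  verificationWeakened: boolean; // e.g. tests deleted or checks loosened
  onDedicatedBranch: boolean;    // launched from an autoresearch branch/worktree
  previousScore: number;         // kapi_architecture_score before the run
  newScore: number;              // kapi_architecture_score after the run
  newAntiGamingFlags: number;    // anti-gaming flags introduced by the run
  regressions: number;           // assumption: count of metrics that got worse
}

interface KeepDecision {
  keep: boolean;
  reason: string;
}

function decideKeep(run: RunSummary): KeepDecision {
  if (!run.onDedicatedBranch) return { keep: false, reason: "not on a dedicated autoresearch branch" };
  if (!run.verifyPass) return { keep: false, reason: "npm run verify failed" };
  if (run.verificationWeakened) return { keep: false, reason: "verification surface weakened" };
  if (run.newAntiGamingFlags > 0) return { keep: false, reason: "anti-gaming flags introduced" };

  // Score gate: strictly improve, or sit at the target with no regressions.
  const improved = run.newScore > run.previousScore;
  const atTarget = run.newScore >= 90 && run.regressions === 0;
  if (!improved && !atTarget) return { keep: false, reason: "score did not improve" };

  return { keep: true, reason: "all automatable gates passed" };
}
```

Checking the safety gates before the score gate means a higher score can never compensate for a failed verification or a weakened test surface.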
The primary score is an uncapped architecture readiness-plus-quality score. The first 100 points remain the architecture readiness baseline. At least 90 points should require real behavior, source inventory, and artifact-contract progress; documentation alignment alone can contribute at most 10 points. Beyond-readiness points must come from verified maintainability signals, not cosmetic text changes.
| Category | Points | Evidence expected |
|---|---|---|
| Behavior tests | 40 | Tests cover the durable modes, status/resume/approve, pending decisions, command/event/snapshot behavior, cross-mode links, artifact layout, worktree boundaries, and integration dev-merge rules. |
| Structural inventory | 35 | Source definitions and command registry match required modes/support commands, remove obsolete commands, include active inventory, pendingDecision, skill injection, worktree boundaries, and dev-only integrate semantics. |
| Artifact/contract checks | 15 | Source/tests enforce .ilchul/workflows/<mode>/<001-slug>/, state.json, events.jsonl, snapshot.json, Kapi Autoresearch artifacts, Integrate artifacts, decision-report.md, and verify.md. |
| Documentation/reference alignment | 10 | GOAL/README/docs align with references (oh-my-codex, ouroboros, pi-autoresearch), no legacy shadow architecture, ordinary Pi thinness, and maintainability metrics. |
| Legacy surface removal | 10 | No retained compatibility help, old status subcommand aliases, hidden redirects, shims, or fallback command paths. This rewards completing the durable-mode conversion rather than merely documenting it. |
| Beyond-readiness maintainability | uncapped | Verified quality-budget and maintainability signals such as zero warnings, low complexity, low duplication, low code smell count, and bounded coupling. Additive headroom points reward real reductions below the current quality thresholds so cleanup that improves tracked secondary metrics is visible in the primary score. These points are additive only after the same anti-gaming penalties and verification gates are applied. |
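As a worked illustration of the first four rows of the table, the sketch below clamps each category to its maximum and sums the result, so documentation alone can never exceed 10 points. Treating this sum as `base_architecture_score` is an assumption; the function name is hypothetical, and how the benchmark folds in legacy removal, maintainability bonuses, and anti-gaming penalties is not specified here.

```typescript
// Category maxima copied from the table: 40 + 35 + 15 + 10 = 100.
const CATEGORY_CAPS = {
  behavior: 40,
  inventory: 35,
  artifact: 15,
  docs: 10,
} as const;

type Category = keyof typeof CATEGORY_CAPS;

// Hypothetical helper: clamp each raw category score into [0, cap] and sum.
function baseArchitectureScore(raw: Record<Category, number>): number {
  let total = 0;
  for (const key of Object.keys(CATEGORY_CAPS) as Category[]) {
    total += Math.min(Math.max(raw[key], 0), CATEGORY_CAPS[key]);
  }
  return total;
}

// A docs-only effort tops out at 10 points no matter how much text changes.
const docsOnlyScore = baseArchitectureScore({ behavior: 0, inventory: 0, artifact: 0, docs: 25 });
```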
The benchmark emits component metrics:
METRIC behavior_score=<0..40>
METRIC inventory_score=<0..35>
METRIC artifact_score=<0..15>
METRIC docs_score=<0..10>
METRIC base_architecture_score=<number>
METRIC maintainability_bonus=<number>
METRIC quality_headroom_bonus=<number>
METRIC legacy_removal_score=<number>
METRIC legacy_surface_refs=<number>
METRIC anti_gaming_penalty=<number>
METRIC test_conversion_score=<-100..0>
METRIC stale_test_conversion_flags=<number>
METRIC semantic_consistency_score=<0..20>
METRIC pi_autoresearch_reference_score=<0..20>
METRIC root_autoresearch_dependency_count=<number>
METRIC non_isolated_root_autoresearch_refs=<number>
METRIC autoresearch_artifact_mismatch_count=<number>
METRIC source_of_truth_conflict_count=<number>
METRIC pi_autoresearch_metric_parsing_role=<0|1>
METRIC pi_autoresearch_resume_reconstruction_role=<0|1>
METRIC runtime_autoresearch_probe_executed=<0|1>
METRIC runtime_autoresearch_start_pass=<0|1>
METRIC runtime_autoresearch_start_contract_pass=<0|1>
METRIC runtime_deep_interview_start_contract_pass=<0|1>
METRIC runtime_ralph_start_contract_pass=<0|1>
METRIC runtime_integrate_start_contract_pass=<0|1>
METRIC mode_runtime_probe_coverage=<0..4>
METRIC event_log_jsonl_parse_pass=<0|1>
METRIC snapshot_json_parse_pass=<0|1>
METRIC state_json_parse_pass=<0|1>
METRIC command_surface_probe_executed=<0|1>
METRIC exact_command_surface_pass=<0|1>
METRIC extra_human_command_count=<number>
METRIC missing_mode_subcommand_count=<number>
METRIC mode_subcommand_behavior_pass=<0|1>
METRIC kapi_readiness_score=<0..100>
METRIC ship_blocker_count=<number>
METRIC runtime_blocker_count=<number>
METRIC semantic_blocker_count=<number>
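The declared ranges above lend themselves to a sanity check on parsed output. The sketch below covers only a subset of the metrics, uses a hypothetical function name, and assumes that a missing metric is simply not checked here rather than treated as an error.

```typescript
interface MetricRange {
  min: number;
  max: number;
}

// Subset of the ranges declared in the metric list above.
const DECLARED_RANGES: Record<string, MetricRange> = {
  behavior_score: { min: 0, max: 40 },
  inventory_score: { min: 0, max: 35 },
  artifact_score: { min: 0, max: 15 },
  docs_score: { min: 0, max: 10 },
  test_conversion_score: { min: -100, max: 0 },
  semantic_consistency_score: { min: 0, max: 20 },
  pi_autoresearch_reference_score: { min: 0, max: 20 },
  mode_runtime_probe_coverage: { min: 0, max: 4 },
  kapi_readiness_score: { min: 0, max: 100 },
};

// Hypothetical check: returns the sorted names of metrics outside their range.
// NaN fails both comparisons, so it is reported as out of range too.
function outOfRangeMetrics(metrics: Record<string, number>): string[] {
  const offenders: string[] = [];
  for (const name of Object.keys(DECLARED_RANGES)) {
    const value = metrics[name];
    if (value === undefined) continue; // assumption: absence is checked elsewhere
    const range = DECLARED_RANGES[name];
    if (!(value >= range.min && value <= range.max)) {
      offenders.push(name);
    }
  }
  return offenders.sort();
}
```

An out-of-range component is a useful hint that a metric was hardcoded or that the scoring script itself was modified.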
Discard any run that improves the numeric score by:
- weakening verification;
- deleting tests or assertions;
- hardcoding metrics;
- removing strings without changing command registry/behavior/tests;
- hiding legacy paths under new names;
- stubbing state/event/snapshot files without transition semantics;
- adding unsafe git/worktree behavior;
- changing docs only while claiming implementation progress.
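The last rule, docs-only changes claiming implementation progress, is one of the few that can be approximated mechanically. The sketch below classifies a list of changed paths using the documentation surfaces named in this file; the function name is hypothetical, and feeding it paths from something like `git diff --name-only` is an assumption about how a caller would use it.

```typescript
// Documentation and prompt/skill surfaces listed earlier in this document.
const DOC_SURFACE_PATTERNS: RegExp[] = [
  /^GOAL\.md$/,
  /^README\.md$/,
  /^docs\//,
  /^skills\//,
  /^prompts\//,
];

// Hypothetical check: true when every changed path is a documentation surface.
// An empty change list is not treated as docs-only.
function isDocsOnlyChange(changedPaths: string[]): boolean {
  if (changedPaths.length === 0) return false;
  return changedPaths.every((path) =>
    DOC_SURFACE_PATTERNS.some((pattern) => pattern.test(path)),
  );
}

const docsOnly = isDocsOnlyChange(["README.md", "docs/architecture.md"]);
```

A run flagged this way is not automatically invalid, but any score gain beyond the documentation category would deserve a closer look.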
Read these before implementing large changes:
- `GOAL.md`: canonical target.
- `references/oh-my-codex`: command/event/snapshot, active inventory, cross-mode linkage, transition denial, recovery.
- `references/ouroboros`: deep interview behavior and human values/intent/context extraction.
- `references/*`: repo structure, command organization, artifact layout, tests.
- `pi-autoresearch`: loop behavior, checks, keep/discard/crash/checks_failed, ledger, confidence, finalize.
- 2026-05-07: GOAL.md redesigned around four durable modes: Deep Interview, Ralph, Autoresearch, and Integrate.
- 2026-05-07: Root pi-autoresearch session files prepared. Autoresearch has not been started yet.