Skip to content

Latest commit

 

History

History
197 lines (156 loc) · 12.1 KB

File metadata and controls

197 lines (156 loc) · 12.1 KB

Autoresearch: Kapi Durable-Mode Redesign

Objective

Refactor Kapi from the current workflow-heavy implementation into the durable-mode architecture described in GOAL.md, without weakening verification, hiding legacy behavior, or compromising ordinary Pi thinness.

Target:

kapi_architecture_score >= 90 / 100

The score must reflect real implementation progress, not cosmetic text changes. A kept experiment must improve the implementation, tests, or artifact contracts while preserving the safety boundaries in GOAL.md.

Metrics

  • Primary: kapi_architecture_score (points, higher is better) — uncapped readiness-plus-quality score for the GOAL.md architecture. The first 100 points represent architecture readiness; verified maintainability improvements can add beyond-readiness points without weakening safety gates.
  • Secondary:
    • verify_pass — 1 when npm run verify passes, 0 otherwise.
    • obsolete_command_refs — references to removed user-facing commands in implementation source/tests.
    • obsolete_contract_refs — stale references to removed user-facing commands in README/docs/skills/prompts.
    • legacy_fallback_refs — references to alias/fallback/legacy/compat/shim/redirect behavior.
    • legacy_surface_refs — explicit retained legacy command-surface evidence, including compatibility help text and old /kapi-status subcommand aliases.
    • legacy_removal_score — additive uncapped score awarded only when the source/tests/docs have no detected legacy command surface.
    • required_mode_refs — evidence that durable modes are represented.
    • state_model_refs — evidence for command/event/snapshot/active inventory semantics.
    • worktree_boundary_refs — evidence for Kapi-owned worktree/branch boundaries.
    • anti_gaming_flags — red flags such as test weakening, metric hardcoding, or no-op verification.
    • test_conversion_score — diagnostic test-suite conversion score from -100 to 0. It starts at -100 when stale legacy workflow assumptions dominate tests, and reaches 0 only when tests are converted to the durable-mode contract without weakening verification.
    • stale_test_conversion_flags — remaining stale-test indicators such as removed workflow commands, TDD/Review/Ultrawork assumptions, progress.json, Ralph-as-plan folder expectations, or Ralph context.md/plan.md artifact expectations.
    • semantic_consistency_score — diagnostic score for Kapi Autoresearch semantic ownership: bridge-term misuse, non-isolated root autoresearch.* dependencies, durable artifact mismatches, pi-autoresearch role coverage, and source-of-truth conflicts.
    • pi_autoresearch_reference_score — diagnostic score for mapping the pi-autoresearch reference loop into Kapi durable artifacts and behavior.
    • root_autoresearch_dependency_count, non_isolated_root_autoresearch_refs, autoresearch_artifact_mismatch_count, source_of_truth_conflict_count — detailed semantic debt counters for root autoresearch.md, autoresearch.sh, autoresearch.checks.sh, autoresearch.jsonl, autoresearch.ideas.md, and autoresearch.config.json references.
    • pi_autoresearch_metric_parsing_role, pi_autoresearch_resume_reconstruction_role — coarse role indicators for metric parsing and resume/reconstruction semantics.
    • runtime_autoresearch_probe_executed, runtime_autoresearch_start_pass, runtime_autoresearch_start_contract_pass — runtime probes that start /kapi-autoresearch in a temporary workspace and validate that Kapi-owned durable artifacts are created on disk.
    • runtime_deep_interview_start_contract_pass, runtime_ralph_start_contract_pass, runtime_integrate_start_contract_pass, mode_runtime_probe_coverage — runtime probes for the other durable modes.
    • event_log_jsonl_parse_pass, snapshot_json_parse_pass, state_json_parse_pass — semantic artifact parse checks for state/event/snapshot files.
    • command_surface_probe_executed, exact_command_surface_pass, extra_human_command_count, missing_mode_subcommand_count, mode_subcommand_behavior_pass — human command-surface contract diagnostics.
    • kapi_readiness_score, ship_blocker_count, runtime_blocker_count, semantic_blocker_count — separate readiness/blocker rollups that keep runtime and semantic blockers visible even when the primary architecture score is high.

How to Run

Launch only from a dedicated autoresearch branch or worktree, preferably named autoresearch/<goal> or <type>/autoresearch-<goal>. Do not launch the autonomous pi-autoresearch loop directly from the shared dev, main, or ordinary feature checkout, because pi-autoresearch may auto-commit kept runs and revert discarded runs. The benchmark and checks refuse non-dedicated branch names.

The benchmark command must be exactly one of:

./autoresearch.sh
bash autoresearch.sh

Do not wrap it with shell chaining, fallback echo METRIC ..., || true, or any alternate command that could hide benchmark failure.

./autoresearch.sh

autoresearch.sh runs the project verification gate and emits parseable metric lines:

METRIC kapi_architecture_score=<number>
METRIC verify_pass=<0|1>
...

Files in Scope

Implementation and tests:

  • src/** — Kapi domain, application, adapters, presentation, state, worker, command, and tool implementation.
  • test/** — behavioral, state-machine, command-surface, validation, artifact, and worker tests.
  • scripts/** — local quality/scoring scripts when they strengthen measurement rather than bypass it.

Docs and prompt/skill surfaces:

  • GOAL.md — target architecture and scoring source.
  • README.md — user-facing command and architecture documentation.
  • docs/** — supporting design and completeness docs.
  • skills/** — Kapi phase/mode guidance, especially deep interview, Ralph, Autoresearch, Integrate, and review guidance.
  • prompts/** — prompt contracts aligned with deterministic state-based skill injection.

Autoresearch session files:

  • autoresearch.md
  • autoresearch.sh
  • autoresearch.checks.sh
  • autoresearch.ideas.md

Off Limits

Do not modify or rely on these to fake progress:

  • Do not weaken npm run verify, npm test, npm run check, or npm run quality:budgets.
  • Do not delete, skip, or loosen tests just to improve the score.
  • Do not hardcode metric output or remove scoring checks.
  • Do not keep removed commands as aliases, redirects, fallback handlers, compatibility shims, or hidden command paths.
  • Do not rename legacy behavior to avoid detection.
  • Do not add direct main merge behavior.
  • Do not allow commit/revert/reset outside Kapi-owned worktrees.
  • Do not make ordinary Pi turns create Kapi state, artifacts, workers, or blocking hooks without explicit mode activation.
  • Do not copy heavy runtime machinery from references when a thin Pi-native implementation is enough.

Constraints

Hard gates for keep:

  1. npm run verify must pass.
  2. Verification surface must not be weakened.
  3. kapi_architecture_score must improve, or reach >= 90 with no regressions.
  4. The improvement must come from real architecture, behavior, tests, or artifact-contract implementation.
  5. No anti-gaming flags may be introduced.
  6. The loop must run from a dedicated autoresearch branch/worktree, never from shared dev, main, or an ordinary feature checkout.

Scoring Rubric

The primary score is an uncapped architecture readiness-plus-quality score. The first 100 points remain the architecture readiness baseline. At least 90 points should require real behavior, source inventory, and artifact-contract progress; documentation alignment alone can contribute at most 10 points. Beyond-readiness points must come from verified maintainability signals, not cosmetic text changes.

Category Points Evidence expected
Behavior tests 40 Tests cover the durable modes, status/resume/approve, pending decisions, command/event/snapshot behavior, cross-mode links, artifact layout, worktree boundaries, and integration dev-merge rules.
Structural inventory 35 Source definitions and command registry match required modes/support commands, remove obsolete commands, include active inventory, pendingDecision, skill injection, worktree boundaries, and dev-only integrate semantics.
Artifact/contract checks 15 Source/tests enforce .ilchul/workflows/<mode>/<001-slug>/, state.json, events.jsonl, snapshot.json, Kapi Autoresearch artifacts, Integrate artifacts, decision-report.md, and verify.md.
Documentation/reference alignment 10 GOAL/README/docs align with references (oh-my-codex, ouroboros, pi-autoresearch), no legacy shadow architecture, ordinary Pi thinness, and maintainability metrics.
Legacy surface removal 10 No retained compatibility help, old status subcommand aliases, hidden redirects, shims, or fallback command paths. This rewards completing the durable-mode conversion rather than merely documenting it.
Beyond-readiness maintainability uncapped Verified quality-budget and maintainability signals such as zero warnings, low complexity, low duplication, low code smell count, and bounded coupling. Additive headroom points reward real reductions below the current quality thresholds so cleanup that improves tracked secondary metrics is visible in the primary score. These points are additive only after the same anti-gaming penalties and verification gates are applied.

The benchmark emits component metrics:

METRIC behavior_score=<0..40>
METRIC inventory_score=<0..35>
METRIC artifact_score=<0..15>
METRIC docs_score=<0..10>
METRIC base_architecture_score=<number>
METRIC maintainability_bonus=<number>
METRIC quality_headroom_bonus=<number>
METRIC legacy_removal_score=<number>
METRIC legacy_surface_refs=<number>
METRIC anti_gaming_penalty=<number>
METRIC test_conversion_score=<-100..0>
METRIC stale_test_conversion_flags=<number>
METRIC semantic_consistency_score=<0..20>
METRIC pi_autoresearch_reference_score=<0..20>
METRIC root_autoresearch_dependency_count=<number>
METRIC non_isolated_root_autoresearch_refs=<number>
METRIC autoresearch_artifact_mismatch_count=<number>
METRIC source_of_truth_conflict_count=<number>
METRIC pi_autoresearch_metric_parsing_role=<0|1>
METRIC pi_autoresearch_resume_reconstruction_role=<0|1>
METRIC runtime_autoresearch_probe_executed=<0|1>
METRIC runtime_autoresearch_start_pass=<0|1>
METRIC runtime_autoresearch_start_contract_pass=<0|1>
METRIC runtime_deep_interview_start_contract_pass=<0|1>
METRIC runtime_ralph_start_contract_pass=<0|1>
METRIC runtime_integrate_start_contract_pass=<0|1>
METRIC mode_runtime_probe_coverage=<0..4>
METRIC event_log_jsonl_parse_pass=<0|1>
METRIC snapshot_json_parse_pass=<0|1>
METRIC state_json_parse_pass=<0|1>
METRIC command_surface_probe_executed=<0|1>
METRIC exact_command_surface_pass=<0|1>
METRIC extra_human_command_count=<number>
METRIC missing_mode_subcommand_count=<number>
METRIC mode_subcommand_behavior_pass=<0|1>
METRIC kapi_readiness_score=<0..100>
METRIC ship_blocker_count=<number>
METRIC runtime_blocker_count=<number>
METRIC semantic_blocker_count=<number>

Anti-Gaming Rules

Discard any run that improves the numeric score by:

  • weakening verification;
  • deleting tests or assertions;
  • hardcoding metrics;
  • removing strings without changing command registry/behavior/tests;
  • hiding legacy paths under new names;
  • stubbing state/event/snapshot files without transition semantics;
  • adding unsafe git/worktree behavior;
  • changing docs only while claiming implementation progress.

Reference Material

Read these before implementing large changes:

  • GOAL.md — canonical target.
  • references/oh-my-codex — command/event/snapshot, active inventory, cross-mode linkage, transition denial, recovery.
  • references/ouroboros — deep interview behavior and human values/intent/context extraction.
  • references/* — repo structure, command organization, artifact layout, tests.
  • pi-autoresearch — loop behavior, checks, keep/discard/crash/checks_failed, ledger, confidence, finalize.

What's Been Tried

  • 2026-05-07: GOAL.md redesigned around four durable modes: Deep Interview, Ralph, Autoresearch, and Integrate.
  • 2026-05-07: Root pi-autoresearch session files prepared. Autoresearch has not been started yet.