Ilchul runtime storage, adapter config, and worker retention design

This document defines the Ilchul runtime storage/configuration surface for issue #169. The implementation now routes active workflow and worker storage to .ilchul / ~/.ilchul; legacy .kapi folders are preserved only as historical local state and are not active fallback roots.

Goals

  • Define the canonical .ilchul runtime shape without deleting or renaming existing .kapi state.
  • Define a portable adapter configuration surface for Codex, Pi, Claude Code, and future worker substrates.
  • Define worker retention states so supervisors can distinguish audit retention from safe cleanup and stale leaks.
  • Identify additive implementation slices that can be reviewed independently.

Non-goals

  • No broad kapi -> ilchul rename.
  • No filesystem mutation in this design issue.
  • No destructive .kapi migration or local-folder cleanup.
  • No automatic tmux/worktree deletion without an explicit safe cleanup command.
  • No runtime plugin system or dynamic adapter loading authority.

Storage root policy

The canonical active root is .ilchul/ inside the supervised workspace, with user-level defaults under ~/.ilchul/. The naming policy in docs/ilchul-naming-policy.md remains authoritative for storage safety:

  • new runtime state is written under .ilchul/;
  • default worker worktrees are created under ~/.ilchul/worktrees;
  • existing .kapi state must be preserved but is not treated as an active fallback root;
  • when both .kapi and .ilchul are present, normal routing uses .ilchul and must not trigger implicit cleanup.
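
The routing policy above can be sketched as a small read-only selector. This is a minimal illustration, not the project's implementation: `RootSelection` and `select_storage_root` are hypothetical names, and the sketch assumes that detecting `.kapi` is only ever used for diagnostics.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass(frozen=True)
class RootSelection:
    active_root: Path      # always <workspace>/.ilchul
    legacy_present: bool   # True if a .kapi directory exists (diagnostic only)
    cleanup_allowed: bool  # routing alone never authorizes cleanup


def select_storage_root(workspace: Path) -> RootSelection:
    """Pick the active runtime root without mutating the filesystem."""
    legacy = workspace / ".kapi"
    # .kapi is reported but never used as a fallback and never scheduled for cleanup.
    return RootSelection(
        active_root=workspace / ".ilchul",
        legacy_present=legacy.is_dir(),
        cleanup_allowed=False,
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        (ws / ".kapi").mkdir()  # simulate a workspace that still has legacy state
        print(select_storage_root(ws))
```

The selector returns the same active root whether or not `.kapi` exists, which is the property the policy asks for: legacy state changes what is reported, not where new state is written.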

Proposed .ilchul/ layout

.ilchul/
  active.json
  config.json
  runs/
    <run-id>/
      run-contract.json
      objective.json
      policy-selection.json
      task-graph.json
      workers.json
      claims.json
      events.jsonl
      evidence/
      evaluations/
      integration/
      learning-summary.md
  learning/
    reward-ledger.jsonl
    policy-hints.json
    strategy-stats.json
    simulator-calibration.json
  migrations/
    storage-root.json
    recovery.log

File responsibilities

| File or directory | Responsibility | Authority |
| --- | --- | --- |
| active.json | Points at the active non-terminal run, if any. | Runtime pointer only; not a deletion authority. |
| config.json | Workspace-local adapter/substrate/defaults config. | Configuration input after validation. |
| runs/<run-id>/run-contract.json | RunContract projection/contract snapshot for the run. | Run meaning and evidence expectation source. |
| runs/<run-id>/objective.json | Objective/evaluation intent. | Advisory until a later issue grants stronger authority. |
| runs/<run-id>/policy-selection.json | Records why an execution strategy was chosen. | Audit trail for strategy choice. |
| runs/<run-id>/task-graph.json | DAG task ids, dependencies, attempts, and statuses. | Task readiness source after scheduler implementation. |
| runs/<run-id>/workers.json | Worker lifecycle/retention state. | Supervisor inspection and safe-cleanup input. |
| runs/<run-id>/claims.json | Claim tokens, leases, and claim owner records. | Duplicate-execution guard after scheduler implementation. |
| runs/<run-id>/events.jsonl | Append-only runtime event stream for replay/recovery. | Recovery/audit evidence; malformed events fail closed. |
| evidence/ | Evidence refs and command/artifact proof. | Completion support; exact authority depends on workflow contract. |
| evaluations/ | Objective/evaluator outputs. | Advisory unless a future design grants hard gates. |
| integration/ | Merge/repair/integration records. | Explicit integration state, not hidden mutation. |
| learning/ | Cross-run reward and policy-learning data. | Learning input only; policy changes must be recorded in policy-selection.json. |
| migrations/ | Storage-root selection and recovery records. | Migration audit evidence; cleanup still requires explicit authorization. |
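
As an illustration of the "malformed events fail closed" authority for events.jsonl, the sketch below aborts replay on the first line it cannot trust instead of skipping it. The names `replay_events` and `MalformedEventError`, and the requirement that every event carries a `type` field, are assumptions made for the example; the design does not define an event schema.

```python
import json
from typing import Iterable


class MalformedEventError(ValueError):
    """Raised when an event stream cannot be trusted for replay."""


def replay_events(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL events; any malformed line aborts the whole replay."""
    events: list[dict] = []
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # assumption: blank lines are tolerated, not treated as events
        try:
            event = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise MalformedEventError(f"line {number}: invalid JSON") from exc
        # Hypothetical minimal schema: an object with a "type" field.
        if not isinstance(event, dict) or "type" not in event:
            raise MalformedEventError(f"line {number}: missing 'type' field")
        events.append(event)
    return events
```

Failing closed means a caller never receives a partial event list from a damaged stream, so recovery logic cannot act on a truncated history by accident.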

Adapter config shape

The future config should be JSON-serializable, validated before use, and conservative by default:

{
  "schemaVersion": 1,
  "storage": {
    "root": ".ilchul",
    "legacyRootPolicy": "preserve-without-active-fallback"
  },
  "adapters": {
    "codex": { "enabled": true, "defaultSubstrate": "tmux" },
    "pi": { "enabled": true, "defaultSubstrate": "tmux" },
    "claudeCode": { "enabled": false, "defaultSubstrate": "process" }
  },
  "runtime": {
    "worktreeRoot": ".ilchul/worktrees",
    "maxWorkers": 3,
    "readinessTimeoutSeconds": 60,
    "leaseDurationSeconds": 900,
    "heartbeatTimeoutSeconds": 120
  },
  "retention": {
    "completedDefault": "completed-retained",
    "safeCleanupRequiresCommand": true,
    "retainLogsDays": 7
  },
  "verification": {
    "defaultDepth": "standard",
    "requireEvidenceForTaskCompletion": true
  }
}

Config rules:

  1. Unknown adapters are rejected or ignored with a warning until a scoped adapter issue accepts them.
  2. worktreeRoot must stay inside the trusted workspace or approved user-level Ilchul root.
  3. maxWorkers, lease, heartbeat, and readiness values must have bounded minimums/maximums.
  4. safeCleanupRequiresCommand defaults to true; normal status/report/start commands must not delete sessions, worktrees, branches, or storage roots.
  5. Learning and objective fields may advise strategy, but selected runtime behavior must be recorded in policy-selection.json.
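
Rules 1 to 3 can be sketched as a validator that collects errors before any config value is used. This is one possible shape under stated assumptions: `validate_config` is a hypothetical function, the numeric bounds in `BOUNDS` are placeholders (the design only requires that bounds exist), and unknown adapters are rejected here although the rule also permits ignoring them with a warning.

```python
from pathlib import Path

KNOWN_ADAPTERS = {"codex", "pi", "claudeCode"}

# Hypothetical bounds; the design does not specify actual minimums/maximums.
BOUNDS = {
    "maxWorkers": (1, 16),
    "readinessTimeoutSeconds": (5, 600),
    "leaseDurationSeconds": (60, 86_400),
    "heartbeatTimeoutSeconds": (10, 3_600),
}


def validate_config(config: dict, workspace: Path, user_root: Path) -> list[str]:
    """Return validation errors; an empty list means the config may be used."""
    errors: list[str] = []

    # Rule 1: adapters outside the accepted set are rejected in this sketch.
    for name in config.get("adapters", {}):
        if name not in KNOWN_ADAPTERS:
            errors.append(f"unknown adapter: {name}")

    # Rule 3: numeric runtime values must be integers inside their bounds.
    runtime = config.get("runtime", {})
    for key, (low, high) in BOUNDS.items():
        value = runtime.get(key)
        if not isinstance(value, int) or isinstance(value, bool) or not low <= value <= high:
            errors.append(f"runtime.{key} must be an integer in [{low}, {high}]")

    # Rule 2: resolve the worktree root and require it to sit under an approved root.
    raw_root = Path(runtime.get("worktreeRoot", ".ilchul/worktrees")).expanduser()
    root = (raw_root if raw_root.is_absolute() else workspace / raw_root).resolve()
    approved = [workspace.resolve(), user_root.resolve()]
    if not any(root.is_relative_to(base) for base in approved):
        errors.append("runtime.worktreeRoot escapes the approved roots")

    return errors
```

Resolving the path before the containment check matters: a value such as `../elsewhere` looks relative to the workspace but resolves outside it, and a plain string-prefix comparison would miss that.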

Worker retention lifecycle

Worker state is separate from task status. A run can be terminal while one or more worker sessions remain retained for audit.

active
  -> completed-retained
  -> safe-to-close
  -> cleanup-released
  -> closed

active
  -> stale-registry
  -> safe-to-close

| State | Meaning | Supervisor action |
| --- | --- | --- |
| active | Worker is assigned to non-terminal work or still expected to produce output. | Do not close. Inspect heartbeat/readiness first. |
| completed-retained | Work is terminal but the session/log remains intentionally available for audit. | Safe to leave alone; report as expected retention. |
| safe-to-close | Worker is terminal, unretained, and no active claim depends on it. | Eligible for explicit safe cleanup command. |
| stale-registry | Registry says the worker exists, but runtime evidence is missing, contradictory, or expired. | Report recovery/cleanup options; do not guess ownership. |
| cleanup-released | A safe cleanup command has released the retention hold or requested shutdown. | Await closure confirmation; preserve audit event. |
| closed | Runtime handle is gone and closure was observed. | Keep metadata for history; no runtime action needed. |

Transition rules:

  • Active workers cannot become safe-to-close while they hold an unexpired claim or pending evidence obligation.
  • Completed workers default to completed-retained when audit inspection is useful.
  • Only an explicit safe cleanup command may move completed-retained or safe-to-close toward cleanup-released.
  • stale-registry is a diagnostic state, not proof that deletion is safe.
  • Cleanup must target Kapi/Ilchul-owned handles only and must not delete user-owned worktrees or branches.
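
The lifecycle and transition rules can be read as a guarded state machine. The sketch below is one interpretation, not a specification: `can_transition` and its keyword arguments are hypothetical, the direct `active -> safe-to-close` edge is an assumption drawn from the first transition rule rather than from the diagram, and requiring verified ownership before leaving `stale-registry` is an assumption based on the "do not guess ownership" guidance.

```python
from enum import Enum


class WorkerState(str, Enum):
    ACTIVE = "active"
    COMPLETED_RETAINED = "completed-retained"
    SAFE_TO_CLOSE = "safe-to-close"
    STALE_REGISTRY = "stale-registry"
    CLEANUP_RELEASED = "cleanup-released"
    CLOSED = "closed"


S = WorkerState

ALLOWED = {
    (S.ACTIVE, S.COMPLETED_RETAINED),
    (S.ACTIVE, S.STALE_REGISTRY),
    (S.ACTIVE, S.SAFE_TO_CLOSE),  # assumption: implied by the rules, not drawn in the diagram
    (S.COMPLETED_RETAINED, S.SAFE_TO_CLOSE),
    (S.STALE_REGISTRY, S.SAFE_TO_CLOSE),
    (S.SAFE_TO_CLOSE, S.CLEANUP_RELEASED),
    (S.CLEANUP_RELEASED, S.CLOSED),
}

# Edges that move a worker toward cleanup-released need an explicit cleanup command.
NEEDS_CLEANUP_COMMAND = {
    (S.COMPLETED_RETAINED, S.SAFE_TO_CLOSE),
    (S.SAFE_TO_CLOSE, S.CLEANUP_RELEASED),
}


def can_transition(
    current: WorkerState,
    target: WorkerState,
    *,
    explicit_cleanup: bool = False,
    holds_unexpired_claim: bool = False,
    pending_evidence: bool = False,
    ownership_verified: bool = False,
) -> bool:
    """Return True only if the edge exists and its guard conditions hold."""
    edge = (current, target)
    if edge not in ALLOWED:
        return False
    if edge in NEEDS_CLEANUP_COMMAND and not explicit_cleanup:
        return False
    if edge == (S.ACTIVE, S.SAFE_TO_CLOSE) and (holds_unexpired_claim or pending_evidence):
        return False
    # Assumption: a stale registry entry is diagnostic, so ownership must be verified first.
    if edge == (S.STALE_REGISTRY, S.SAFE_TO_CLOSE) and not ownership_verified:
        return False
    return True
```

Keeping the edge set and the guards as data makes the policy easy to review: a new transition has to be added to `ALLOWED` explicitly, so nothing becomes closable by omission.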

Safe cleanup boundary

Safe cleanup may close an unretained terminal tmux/process handle after ownership is verified. It must not:

  • delete .kapi, .ilchul, worktrees, branches, evidence, learning ledgers, or run directories;
  • kill active or uncertain workers;
  • collapse both-present .kapi/.ilchul migration decisions into cleanup behavior;
  • run as a side effect of status, report, doctor, verify, or workflow start.

Destructive cleanup requires a separate issue, explicit scope, rollback/recovery notes, and Kade authorization for that slice.
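
The boundary above can be illustrated with a selector that decides which runtime handles a cleanup command may close. It is a sketch under assumptions: `WorkerHandle`, its fields, and `handles_to_close` are hypothetical, and the function only returns worker ids; it does not close anything and never refers to storage paths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerHandle:
    worker_id: str
    kind: str              # "tmux" or "process"
    state: str             # a retention state such as "safe-to-close"
    owned_by_ilchul: bool  # ownership verified against the registry
    retained: bool         # an audit retention hold is still in place


def handles_to_close(command: str, safe_flag: bool, handles: list[WorkerHandle]) -> list[str]:
    """Return worker ids whose runtime handle may be closed by this command."""
    # Only an explicit `cleanup --safe` is eligible; status, report, doctor,
    # verify, and workflow start fall through to the empty result.
    if command != "cleanup" or not safe_flag:
        return []
    return [
        h.worker_id
        for h in handles
        if h.kind in {"tmux", "process"}
        and h.state == "safe-to-close"  # active or uncertain workers are excluded
        and h.owned_by_ilchul
        and not h.retained
    ]
```

Because the selector yields only handle identifiers, directories such as .kapi, .ilchul, worktrees, and run folders are structurally out of its reach.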

Follow-up implementation slices

  1. Add read-only legacy .kapi diagnostics without using it as a routing root.
  2. Add config schema validation and a config.json reader with bounded defaults.
  3. Persist worker retention states and expose them through read-only report/doctor/status surfaces.
  4. Add explicit cleanup --safe behavior for verified terminal unretained runtime handles only.
  5. Add migration recovery records only if a future issue authorizes importing or archiving legacy .kapi state.
  6. Design any destructive legacy cleanup as a separately authorized migration slice.

Design verification checklist

  • The .ilchul layout documents .kapi as legacy local state rather than an active fallback root.
  • Adapter config covers enabled adapters, default substrates, worktree root, worker caps, readiness timeout, lease duration, cleanup/retention defaults, and verification depth.
  • Worker retention distinguishes active, completed-retained, safe-to-close, stale-registry, cleanup-released, and closed.
  • Safe cleanup is explicit and non-destructive.
  • Follow-up implementation slices are additive and independently reviewable.