RunContract preset contracts

This document defines the current design contract for RunContract workflow presets. It covers Ralph, Integrate, and Autoresearch as generic harness presets; it does not add a durable contract.json, scheduler, merge bot, score gate, command rename, or storage migration.

RunContract presets are contract shapes layered over the existing WorkflowState and WorkflowDefinition, which remain the source of truth. They make goal shape, required inputs, artifacts, evidence, done criteria, repair criteria, and closeout expectations explicit so each workflow can be compared without importing adapter-specific policy.

Shared preset rules

All presets follow these boundaries:

  1. WorkflowState and workflow validation remain the source of truth for lifecycle, artifacts, and completion.
  2. RunContract projection may summarize or steer, but it must not become a second durable authority.
  3. External meanings such as GitHub PR state, Discord lane ownership, Ragna supervision, or kapi-agent review freshness stay in adapters and supervisor reports.
  4. Objective/evaluation signals are advisory unless a separate approved issue grants a stronger gate.
  5. Handoffs consume upstream artifacts by reference and freshness checks; they must not copy an upstream draft into a competing source of truth.
  6. Completion requires inspectable evidence: artifact refs, command evidence, reviewer/evaluator records where required, and explicit human decisions for design-sensitive steps.
  7. Repair is explicit: record the defect, affected artifact/evidence refs, selected repair path, and new verification evidence rather than silently rewriting history.

Ralph governed implementation preset

Purpose

Ralph is the governed implementation preset: it turns an accepted goal or upstream design handoff into a bounded implementation plan, executes one or more reviewed build iterations, and closes only after concrete verification and independent closeout evidence.

Goal shape

A Ralph goal should include:

  • requested behavior or change target;
  • acceptance criteria or enough context to derive them;
  • constraints and non-goals;
  • affected repository/workspace boundary;
  • expected verification standard;
  • optional upstream Deep Interview source refs.

Vague goals should remain in planning or blocked state until the missing context is recorded. Ralph should not treat task intent inferred from worker prose as a replacement for accepted criteria.

Required inputs

  • A human or supervisor-provided implementation goal.
  • Current project context from the target workspace.
  • Optional Deep Interview run-contract-draft.md, decision-report.md, and context artifacts, referenced as handoff evidence rather than copied into a new source of truth.
  • Explicit human approval before moving from planning consensus into build when the workflow contract requires it.

Required artifacts

Ralph-owned artifacts stay under the Ralph workflow root and should include:

  • AGENTS.md — scoped implementation instructions for the worker.
  • IMPLEMENTATION_PLAN.md — prioritized task list, acceptance criteria, and verification plan.
  • specs/ — confined specification directory for concrete specs or guidance.
  • handoff.json — structured handoff state for downstream integration or supervisor inspection.
  • decision-report.md — selected plan, rejected alternatives, approvals, and open decisions.
  • verify.md — evidence ledger with command output refs, reviewer closeout, and residual risks.
  • state.json, events.jsonl, and snapshot.json — durable workflow state/projection artifacts.

Evidence standard

Ralph evidence must prove that implementation claims match the plan:

  • planning consensus evidence from the required reviewers when applicable;
  • explicit human approval for build entry when required;
  • command evidence with command, exit code, verdict, and artifact/ref linkage;
  • changed-file or acceptance-criterion references in verifier/reviewer closeout;
  • task disposition for deferred, superseded, or blocked plan items;
  • stale artifact and stale approval checks before terminal completion.

Narrative statements such as "tests passed" are insufficient without the corresponding command evidence.
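
As an illustration of the command evidence fields listed above, here is a minimal sketch. The `CommandEvidence` shape, its field names, and the `is_acceptable` helper are assumptions made for this example and are not taken from any RunContract API; the log path is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CommandEvidence:
    """Hypothetical record shape; field names are illustrative assumptions."""
    command: str                 # exact command that was run
    exit_code: Optional[int]     # None means the command was never executed
    verdict: str                 # "pass" or "fail", as recorded by the verifier
    artifact_ref: Optional[str]  # link to captured output, e.g. a log path


def is_acceptable(evidence: CommandEvidence) -> bool:
    """Accept only evidence that was executed, passed, and links to an artifact."""
    return (
        bool(evidence.command.strip())
        and evidence.exit_code == 0
        and evidence.verdict == "pass"
        and bool(evidence.artifact_ref)
    )


# A bare "tests passed" claim carries no command, exit code, or artifact ref.
narrative_only = CommandEvidence("", None, "pass", None)
backed = CommandEvidence("pytest -q", 0, "pass", "logs/pytest-run-1.txt")
```

Under these assumptions `is_acceptable(backed)` holds while `is_acceptable(narrative_only)` does not, which mirrors the rule that prose claims alone cannot close a run.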

Done criteria

Ralph may be complete only when:

  1. required artifacts exist and validate;
  2. build work is tied to accepted plan items;
  3. required command evidence passes or any failure is explicitly accepted/deferred with rationale;
  4. required reviewer/verifier closeout is present and independent from the producer;
  5. unresolved risks, blockers, or pending decisions are either closed or explicitly deferred by the authorized human/supervisor path;
  6. downstream handoff state is present when the run expects Integrate or another consumer.
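
A partial sketch of how criteria 1, 3, and 4 above could be checked together. `CheckResult` and `unmet_done_criteria` are hypothetical names, and the remaining criteria (plan linkage, risk deferral, downstream handoff) are deliberately left out to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical summary of one required command check."""
    name: str
    passed: bool
    deferral_rationale: Optional[str] = None  # set when a failure is accepted/deferred


def unmet_done_criteria(
    artifacts_valid: bool,
    checks: List[CheckResult],
    producer: str,
    closeout_reviewer: Optional[str],
) -> List[str]:
    """Return reasons the run cannot close; an empty list means no blocker was found."""
    unmet: List[str] = []
    if not artifacts_valid:
        unmet.append("required artifacts missing or invalid")
    for check in checks:
        # A failure is tolerated only when it carries an explicit rationale.
        if not check.passed and not check.deferral_rationale:
            unmet.append(f"failed check without rationale: {check.name}")
    if closeout_reviewer is None:
        unmet.append("reviewer closeout missing")
    elif closeout_reviewer == producer:
        unmet.append("reviewer closeout is not independent from the producer")
    return unmet
```

Returning a list of reasons rather than a single boolean keeps the blocking criteria inspectable, in line with the evidence standard.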

Repair criteria

Ralph repair should be chosen when implementation output diverges from the accepted plan, verification fails, artifacts drift, reviewer closeout is stale, or upstream handoff assumptions become invalid. The repair record should name the failed criterion, affected files/artifacts, selected repair task, and new evidence refs.
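
The repair record described above could be sketched as an append-only log entry. The `RepairRecord` fields follow the four items named in the paragraph; the `"type": "repair"` key, the JSON line layout, and the sample refs are assumptions for illustration and do not describe the actual events.jsonl schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class RepairRecord:
    failed_criterion: str
    affected_artifacts: List[str]
    repair_task: str
    new_evidence_refs: List[str] = field(default_factory=list)


def append_repair(log_lines: List[str], record: RepairRecord) -> List[str]:
    """Return a new log with one repair line appended; earlier lines are untouched."""
    entry = {"type": "repair", **asdict(record)}
    return log_lines + [json.dumps(entry, sort_keys=True)]


history = ['{"type": "verify", "verdict": "fail"}']
record = RepairRecord(
    failed_criterion="required command evidence passes",
    affected_artifacts=["verify.md"],
    repair_task="fix failing check and re-run verification",
    new_evidence_refs=["logs/rerun-1.txt"],
)
updated = append_repair(history, record)
```

Appending a new entry instead of editing the failed one keeps the original failure visible, which is the point of recording repair explicitly.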

Deep Interview handoff rule

A Deep Interview RunContractDraft is input evidence, not durable authority for Ralph. Ralph should reference the source workflow, verify source artifact freshness, and translate accepted requirements into its own plan and evidence ledger. If the Deep Interview decision report is missing, stale, or contradicted by current project context, Ralph should block or request clarification rather than silently treating the draft as authoritative.
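
One way to picture consumption by reference with a freshness check is a path plus a content digest captured at handoff time. This is a sketch under that assumption; `ArtifactRef`, `make_ref`, and `is_fresh` are hypothetical names, and a real freshness check might rely on timestamps or workflow events instead of hashes.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ArtifactRef:
    """Hypothetical by-reference handle: a path plus the digest seen at handoff."""
    path: str
    sha256: str


def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def make_ref(path: Path) -> ArtifactRef:
    return ArtifactRef(path=str(path), sha256=digest(path))


def is_fresh(ref: ArtifactRef) -> bool:
    """Fresh means the file still exists and its content digest is unchanged."""
    target = Path(ref.path)
    return target.is_file() and digest(target) == ref.sha256


with tempfile.TemporaryDirectory() as workdir:
    draft = Path(workdir) / "run-contract-draft.md"
    draft.write_text("accepted requirement: A\n", encoding="utf-8")
    ref = make_ref(draft)
    fresh_at_handoff = is_fresh(ref)   # True: nothing has changed yet
    draft.write_text("accepted requirement: B\n", encoding="utf-8")
    fresh_after_edit = is_fresh(ref)   # False: the upstream draft drifted
fresh_after_delete = is_fresh(ref)     # False: the source no longer exists
```

In the stale and missing cases the consumer would block or request clarification rather than proceed from a copied draft.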

Integrate governance preset

Purpose

Integrate is the integration-governance preset: it accepts candidate implementation output, plans how it should be integrated, records conflicts and source-of-truth decisions, verifies the integrated result, and defines rollback or repair criteria without becoming a merge bot or PR adapter.

Goal shape

An Integrate goal should include:

  • source artifact or implementation branch/worktree refs;
  • target workspace or baseline context;
  • intended integration outcome;
  • known conflicts or compatibility constraints;
  • required verification and rollback expectations.

Required inputs

  • Upstream implementation artifacts, handoff state, or candidate diff refs.
  • Target baseline and current project context.
  • Integration constraints, such as compatibility boundaries, source-of-truth ownership, and non-goals.
  • Human/supervisor approval for externally visible or destructive integration steps when required.

Required artifacts

Integrate-owned artifacts should include:

  • merge-plan.md — integration order, selected strategy, rejected options, and verification plan.
  • conflict-matrix.md — conflicting files/symbols/contracts, owner, resolution, and residual risk.
  • integration-report.md — final integrated state, accepted candidate refs, deviations, and rollback notes.
  • decision-report.md — decisions, approvals, deferrals, and unresolved questions.
  • verify.md — command evidence, artifact checks, and integration smoke evidence.
  • state.json, events.jsonl, and snapshot.json — durable workflow state/projection artifacts.

Evidence standard

Integrate evidence must prove both source consumption and target health:

  • source refs for accepted implementation artifacts or candidate output;
  • conflict-matrix rows for every material conflict or explicit evidence that no conflicts were found;
  • command evidence covering integration checks relevant to the target surface;
  • verification that required upstream artifacts were consumed by reference rather than duplicated as new truth;
  • rollback or repair evidence when a candidate is rejected or partially accepted.

Done criteria

Integrate may be complete only when:

  1. accepted source refs are named and inspectable;
  2. conflicts are resolved, deferred, or rejected with rationale;
  3. target verification passes or failures are explicitly accepted under the workflow’s authority rules;
  4. integration report and verify artifacts are current with the final state;
  5. rollback or repair criteria are documented for any residual risk;
  6. no adapter-specific PR/merge/tracker semantics have been promoted into generic RunContract core.
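
Criterion 2 above can be illustrated with a small check over conflict-matrix rows. The `ConflictRow` fields mirror the columns listed for conflict-matrix.md; the resolution vocabulary, the helper name, and the sample file names are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import List

# Assumed vocabulary: a conflict is closed once it is resolved, deferred, or rejected.
CLOSED_RESOLUTIONS = {"resolved", "deferred", "rejected"}


@dataclass(frozen=True)
class ConflictRow:
    subject: str      # conflicting file, symbol, or contract
    owner: str
    resolution: str   # expected to be one of CLOSED_RESOLUTIONS
    rationale: str


def open_conflicts(rows: List[ConflictRow]) -> List[str]:
    """Return subjects that still lack a closed resolution or a rationale."""
    return [
        row.subject
        for row in rows
        if row.resolution not in CLOSED_RESOLUTIONS or not row.rationale.strip()
    ]


rows = [
    ConflictRow("src/config.py", "team-a", "resolved", "kept target defaults"),
    ConflictRow("docs/api.md", "team-b", "pending", ""),
    ConflictRow("schema.json", "team-a", "deferred", "   "),
]
```

Here `open_conflicts(rows)` would flag `docs/api.md` (unclosed) and `schema.json` (deferred without rationale), so the run could not yet complete.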

Repair and rollback criteria

Repair is required when integrated output fails verification, conflict resolution invalidates an upstream assumption, target context changes, or source artifacts are stale. Rollback criteria should name the failing signal, safe restoration point, affected artifacts, and evidence required before retrying.

Upstream consumption rule

Integrate consumes upstream Ralph or Autoresearch artifacts by reference and validation. It should not duplicate upstream workflow state, become a generic scheduler, or infer external merge authority from generic RunContract completion.

Autoresearch metric-experiment preset

Purpose

Autoresearch is the metric-experiment preset: it runs bounded experiments against an explicit metric/benchmark contract, records each attempt in an experiment ledger, and keeps objective/evaluation signals advisory unless a specific workflow contract requires them as evidence.

Goal shape

An Autoresearch goal should include:

  • optimization target and primary metric;
  • metric direction (higher-is-better or lower-is-better);
  • benchmark command or measurement source;
  • guardrail checks and non-goals;
  • stop conditions, attempt bounds, and acceptable trade-offs;
  • anti-Goodhart concerns or known ways the metric could be gamed.

Required inputs

  • An experiment contract with goal, metric, direction, benchmark/check commands, constraints, and stop conditions.
  • Baseline measurement before candidate optimization.
  • Project context and artifact boundaries.
  • Optional objective/evaluation hints, kept advisory unless separately authorized.

Required artifacts

Autoresearch-owned artifacts should include:

  • contract.md — normalized experiment contract and approval state.
  • benchmark.sh and checks.sh — executable measurement and guardrail commands.
  • ledger.jsonl — baseline plus every candidate attempt, metric, decision, artifact refs, and anti-gaming scan result.
  • ideas.md — generated or human-provided candidate ideas and rejected paths.
  • decision-report.md — selected keep/discard decisions, stop rationale, and trade-offs.
  • verify.md — final command evidence, ledger validation, and residual risk.
  • state.json, events.jsonl, and snapshot.json — durable workflow state/projection artifacts.

Evidence standard

Autoresearch evidence must make the metric result reproducible and auditable:

  • approved contract fields and baseline metric;
  • executable benchmark/check outputs for each candidate;
  • ledger rows with metric direction, previous best, decision (keep, discard, stop, or failure class), artifact refs, and anti-gaming flags;
  • guardrail results for non-primary objectives;
  • explicit rationale when a better metric is discarded because it violates constraints or anti-Goodhart safeguards;
  • final verification evidence that replays or validates the ledger and selected result.

Done criteria

Autoresearch may be complete only when:

  1. the experiment contract is approved and current;
  2. baseline and candidate ledger entries are parseable and gap-free;
  3. keep/discard decisions follow the declared metric direction and guardrails;
  4. anti-Goodhart checks are recorded and visible;
  5. stop conditions or attempt bounds are satisfied;
  6. final selected output and residual trade-offs are documented in decision-report.md and verify.md.
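
Criteria 2 and 3 above lend themselves to a mechanical check. The sketch below assumes each ledger line is a JSON object with hypothetical `attempt`, `metric`, `decision`, and optional `rationale` keys, and that row 0 is the baseline; the real ledger.jsonl schema may differ.

```python
import json
from typing import List


def is_improvement(metric: float, best: float, direction: str) -> bool:
    if direction == "higher-is-better":
        return metric > best
    if direction == "lower-is-better":
        return metric < best
    raise ValueError(f"ambiguous metric direction: {direction!r}")


def validate_ledger(lines: List[str], direction: str) -> List[str]:
    """Return problems found; the first row with a metric is treated as the baseline."""
    problems: List[str] = []
    best = None
    for index, line in enumerate(lines):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"row {index}: not parseable")
            continue
        if not isinstance(row, dict):
            problems.append(f"row {index}: not a JSON object")
            continue
        if row.get("attempt") != index:
            problems.append(f"row {index}: attempt numbering has a gap")
        metric = row.get("metric")
        if not isinstance(metric, (int, float)):
            problems.append(f"row {index}: missing metric")
            continue
        if best is None:
            best = metric  # baseline measurement
            continue
        improved = is_improvement(metric, best, direction)
        decision = row.get("decision")
        if decision == "keep" and not improved:
            problems.append(f"row {index}: kept without improvement")
        if decision == "discard" and improved and not row.get("rationale"):
            problems.append(f"row {index}: better metric discarded without rationale")
        if decision == "keep" and improved:
            best = metric
    return problems
```

A better metric may still be discarded, for example for violating a guardrail, but this sketch then expects a recorded rationale, matching the evidence standard above.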

Repair criteria

Repair is required when the contract is incomplete, benchmark output is non-reproducible, metric direction is ambiguous, ledger entries have gaps or missing artifact refs, guardrails fail, anti-gaming flags are ignored, or objective/evaluation outputs are presented as hidden completion authority.

Objective/evaluation boundary

Objective and evaluation signals can suggest candidate policies, retries, repairs, or human inspection. They must not silently hard-block completion, auto-select hidden policies, or replace ledger/benchmark evidence. Stronger authority requires a separate approved issue.
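
The advisory boundary can be sketched as a decision function in which signals only become blocking when a stronger gate has been explicitly authorized. `CompletionDecision`, `decide_completion`, and the `gate_authorized` flag are hypothetical names standing in for whatever a separately approved issue would introduce.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class CompletionDecision:
    complete: bool
    suggestions: List[str] = field(default_factory=list)


def decide_completion(
    validation_passed: bool,
    advisory_signals: List[str],
    gate_authorized: bool = False,
) -> CompletionDecision:
    """Signals are surfaced as suggestions; they block only under an authorized gate."""
    blocked_by_signal = gate_authorized and bool(advisory_signals)
    return CompletionDecision(
        complete=validation_passed and not blocked_by_signal,
        suggestions=list(advisory_signals),
    )


default_case = decide_completion(True, ["guardrail metric regressed slightly"])
gated_case = decide_completion(True, ["guardrail metric regressed slightly"], gate_authorized=True)
```

In the default case the signal is reported but completion still follows workflow validation; only the explicitly gated case lets the signal block.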

Preset comparison matrix

| Preset | Primary question | Required evidence focus | Completion authority | Adapter-neutral boundary |
| --- | --- | --- | --- | --- |
| Ralph | Did the implementation satisfy an accepted plan? | Plan, command evidence, changed-file/acceptance refs, independent closeout | Workflow validation, evidence records, reviewer/human gates | No GitHub/PR/Ragna/kapi-agent assumptions in core |
| Integrate | Was candidate output safely integrated or rejected? | Source refs, conflict matrix, target verification, rollback/repair evidence | Workflow validation plus explicit integration decisions | No merge bot, tracker, or PR policy in core |
| Autoresearch | Did bounded experiments improve the declared metric without gaming it? | Contract, benchmark/check outputs, ledger rows, anti-Goodhart scan, stop rationale | Workflow validation plus approved experiment contract and ledger policy | Objective/evaluation signals stay advisory by default |

Design verification checklist

  • Ralph required artifacts, evidence expectations, done criteria, repair criteria, and Deep Interview handoff behavior are explicit.
  • Integrate plan/conflict/evidence/verification/rollback expectations are explicit.
  • Autoresearch metric, benchmark, guardrail, ledger, stop-condition, and anti-Goodhart expectations are explicit.
  • Presets remain generic and do not absorb GitHub/PR/Discord/Ragna/kapi-agent semantics.
  • Objective/evaluation signals remain advisory unless separately authorized.

Graph-execution component boundary

The execute preset uses TaskGraph as the runtime primitive for single-agent, sequential, DAG-parallel, and team-parallel runs. Parallelism is a policy choice, not a separate graph representation.

Runtime authority is split deliberately: Decomposer creates the concrete graph from the approved objective plus the selected policy, Scheduler computes ready tasks, WorkerRuntime dispatches workers and tracks their heartbeats, Verifier validates evidence refs, and GateEngine decides whether transitions are allowed. Policy selection may pass advisory PolicyGraphSketch values for simulation context, but those sketches are not the concrete runtime graph. This split keeps evidence validity separate from transition permission: evidence can validate while a phase invariant or RunObjective gate still blocks progress.
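
To make the split concrete, here is a minimal sketch of two of these responsibilities: computing ready tasks over one graph under different parallelism policies, and keeping evidence validity separate from gate permission. The function names, the dependency-map representation, and the `max_parallel` parameter are assumptions for illustration and do not reflect the actual TaskGraph, Scheduler, or GateEngine interfaces.

```python
from typing import Dict, List, Set


def ready_tasks(
    deps: Dict[str, Set[str]],
    done: Set[str],
    running: Set[str],
    max_parallel: int,
) -> List[str]:
    """Tasks whose dependencies are all done, limited by the free parallel slots."""
    candidates = sorted(
        task
        for task, needs in deps.items()
        if task not in done and task not in running and needs <= done
    )
    free_slots = max(0, max_parallel - len(running))
    return candidates[:free_slots]


def transition_allowed(evidence_valid: bool, gates_open: bool) -> bool:
    """Valid evidence is necessary but not sufficient; gates decide separately."""
    return evidence_valid and gates_open


# One graph shape serves every policy; only max_parallel changes.
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
sequential = ready_tasks(deps, {"a"}, set(), max_parallel=1)    # ["b"]
dag_parallel = ready_tasks(deps, {"a"}, set(), max_parallel=4)  # ["b", "c"]
```

The same `deps` map yields sequential or parallel dispatch depending only on the policy value, and `transition_allowed(True, False)` stays false even though the evidence validated.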

Follow-up implementation slices

  1. Add preset metadata types or docs-derived fixtures only if a runtime surface needs machine-readable preset summaries.
  2. Add focused validation tests for any gap discovered between these contracts and current workflow validation.
  3. Add adapter-specific GitHub/PR/report interpretations only outside RunContract core.
  4. Add stronger objective/evaluation gates only through a separate design issue that changes authority explicitly.