Document AI-assisted engineering guardrails (#208)

bglusman · web-flow · commit 78f2e12e44f0 · 2026-05-20T08:40:25.000-04:00
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
@@ -18,10 +18,19 @@ Brief description of the changes in this PR.
 - [ ] `cargo fmt` is clean
 - [ ] New tests added for new functionality
 
+## Contract / Agent-Authored Work
+
+- [ ] PR names the user-visible story or boundary contract it changes
+- [ ] Bug fixes include a failing regression test, or explain why one is not practical
+- [ ] New/changed boundary tests assert operator-visible behavior, not private implementation details
+- [ ] Performance changes include baseline and post-change measurements
+- [ ] AI-generated tests/docs were reviewed for whether they can fail for the intended reason
+
 ## Checklist
 
 - [ ] Code follows the project style guidelines
 - [ ] Self-review completed
+- [ ] Adversarial review completed after commit
 - [ ] Comments added for complex logic
 - [ ] Documentation updated (if applicable)
 - [ ] No new warnings introduced
diff --git a/AGENTS.md b/AGENTS.md
@@ -53,6 +53,9 @@ User-facing tour: `README.md` → [calciforge.org](https://calciforge.org/).
 13. **Large files are debt with budgets, not precedent.** `scripts/check-architecture-ratchets.rb` pins current oversized Rust modules to explicit line budgets and fails CI if they grow. New Rust modules should stay under the default budget unless the PR explains the boundary being created and adds a budget consciously.
 14. **Stringly data stays at the boundary.** It is acceptable for config, JSON, CLI args, and protocol payloads to enter as `String`, `Vec<String>`, or `HashMap<String, String>`, but core logic should convert them into typed structs/enums before making security, routing, lifecycle, or persistence decisions.
 15. **Detached work needs an owner.** New `tokio::spawn` or thread-spawned work must have an explicit lifecycle owner, cancellation/error path, and state handoff. Do not update shared mutable state from background tasks unless the owning module documents the ordering and failure behavior.
+16. **Work in reviewable story slices.** Before starting a broad change, write down the smallest user-visible story or contract being improved. Keep the first patch inside that slice unless the code proves the boundary is wrong. If the task expands, split it into follow-up PRs instead of letting one branch become a second architecture.
+17. **Contracts beat generated volume.** AI-generated tests, fixtures, and docs are not evidence by themselves. Every generated artifact must tie back to a precondition, postcondition, invariant, scenario, or operator-visible promise. Remove or rewrite tests that cannot fail for the intended reason.
+18. **Measure before performance fixes.** For latency, throughput, model cold starts, lock contention, retry storms, and async task behavior, capture the measurement first. A performance PR should name the baseline, the bottleneck hypothesis, the change, and the post-change measurement.
 
 ## Build / test
 
diff --git a/README.md b/README.md
@@ -245,10 +245,16 @@ Install hooks once:
 bash scripts/install-git-hooks.sh
 ```
 
+For AI-assisted changes, use the
+[engineering discipline checklist](docs/roadmap/agent-engineering-discipline.md):
+one user-visible story, explicit boundary contracts, tests that can fail for
+the intended reason, and adversarial review after commit.
+
 ## Docs
 
 - [Feature tour and install notes](https://calciforge.org/)
 - [Agent runtime contract](docs/agent-runtime-contract.md)
+- [AI-assisted engineering discipline](docs/roadmap/agent-engineering-discipline.md)
 - [Model gateway reference](docs/model-gateway.md)
 - [Codex/OpenClaw integration](docs/codex-openclaw-integration.md)
 - [Model gateway RFC](docs/rfcs/model-gateway-primitives.md)
diff --git a/docs/README.md b/docs/README.md
@@ -32,6 +32,9 @@ that should be reasonably stable:
   - `roadmap/architecture-laws-action-plan.md` — refactor plan for channel
     pipelines, command handling, security proxy policy, installer structure,
     and adapter lifecycle cleanup
+  - `roadmap/agent-engineering-discipline.md` — maintainer rules for
+    AI-assisted Calciforge changes: contracts first, story-sized slices,
+    adversarial review, measurable performance work, and test-quality gates
 
 ## Status labels
 
diff --git a/docs/index.md b/docs/index.md
@@ -944,7 +944,10 @@ gateway providers, and synthetic routing selectors pass smoke tests.
 The status summary above is the site-facing snapshot of what works today and
 what is still in flight. Public roadmap ideas live in
 the [roadmap notes](roadmap/v3-ideas.html), with product-interface
-direction in the [UX roadmap](roadmap/product-ux.html).
+direction in the [UX roadmap](roadmap/product-ux.html). Maintainer-facing
+agent-work rules live in the
+[AI-assisted engineering discipline](roadmap/agent-engineering-discipline.html)
+note.
 
 <footer>
 <div class="name-origin">
diff --git a/docs/roadmap/agent-engineering-discipline.md b/docs/roadmap/agent-engineering-discipline.md
@@ -0,0 +1,171 @@
+---
+layout: default
+title: AI-Assisted Engineering Discipline
+---
+
+# AI-Assisted Engineering Discipline
+
+**Status:** Design sketch
+
+Calciforge is itself a safety tool for AI agents, so the project has to be
+honest about how agent-written code fails. The lesson from recent Calciforge
+bugs, and from broader Rust-with-agents reports such as
+[Cheng Huang's write-up](https://zfhuang99.github.io/rust/claude%20code/codex/contracts/spec-driven%20development/2025/12/01/rust-with-ai.html)
+and the [Hacker News discussion](https://news.ycombinator.com/item?id=48205415),
+is not "generate more code." It is: keep contracts explicit, make feedback
+loops mechanical, and treat every generated test as suspicious until it proves
+which promise it protects.
+
+This page is maintainer-facing. It records how Calciforge agents and humans
+should shape future work.
+
+## Useful Lessons to Import
+
+### Contracts before code
+
+Before changing a boundary, write the contract in plain language:
+
+- the preconditions Calciforge expects,
+- the postconditions it promises,
+- the invariants that must survive failure,
+- the operator-visible behavior that proves the contract held.
+
+This is especially important for model/provider adapters, security-proxy
+rewrites, channel routing, secret handling, installer paths, and doctor checks.
+Those are not "just implementation" areas. They are the castle doors.
+
+Contracts do not need a formal language before they help. A short table in a
+test, ADR, or roadmap note is enough if it names the failure mode clearly.
+
+### One story per branch
+
+A single user story is the right default unit for agent implementation. A story
+can still cross several files, but it should have one visible outcome:
+
+- "A first-class agent using `stream=true` gets a valid response through the
+  configured provider."
+- "`doctor` catches an ACP binary path that would fail at runtime."
+- "A secret entered through the paste UI appears in `!secret list` with its
+  destination policy."
+
+When a branch starts solving three stories, split it. Calciforge already has
+enough moving parts; one PR should not become a second moving castle.
+
+### Feedback loops must be executable
+
+Rust helps because the compiler, formatter, linter, and tests can push back on
+agent mistakes. Use that. Every meaningful PR should name the narrowest checks
+that exercise the contract it changes.
+
+For Calciforge, the usual progression is:
+
+1. reproduce the failure or write the contract test,
+2. make the smallest behavioral change,
+3. run focused tests for the touched boundary,
+4. run the relevant docs/ratchet/doctor checks,
+5. commit,
+6. perform adversarial review on the diff.
+
+Generated code is allowed. Unreviewed generated behavior is not.
+
+### Test quality over test count
+
+HN commenters pushed on the right weak spot: a large test count does not prove
+much if nobody can say what those tests protect. Calciforge should judge tests
+by contract value:
+
+- Would this test fail on the bug we are fixing?
+- Does it exercise behavior an operator or agent can observe?
+- Does it cover legal variation from a real upstream, not only our ideal mock?
+- If the implementation changed, would the test still describe the same
+  promise?
+
+The [failure discovery action plan](failure-discovery-action-plan.html) calls
+these aggression tests when they deliberately search for likely future
+failures. That should become normal for security, gateway, channel, installer,
+and agent-adapter boundaries.
+
+### Human-readable abstractions matter
+
+One HN theme was that agents can make code grow faster than it becomes
+understandable. Calciforge should resist that. A useful abstraction should make
+the product contract easier to see:
+
+- one place for model/provider selector resolution,
+- one place for first-class adapter lifecycle metadata,
+- one place for security proxy policy decisions,
+- one place for channel command state,
+- one place for installer ownership of each generated file or service.
+
+If a change adds another "almost the same" path, treat that as architecture
+debt even when tests pass.
+
+### Measure before tuning
+
+The article's performance loop is worth copying: instrument, run, analyze,
+change one thing, and measure again. For Calciforge this applies to:
+
+- model cold starts and local model swapping,
+- provider retry behavior,
+- gateway streaming latency,
+- lock contention around channel/session state,
+- doctor and install runtime,
+- security-proxy scan overhead.
+
+Do not merge speculative latency fixes that lack a baseline and a
+post-change measurement. We have enough guesses; keep the ones with numbers.
+
+## Required PR Checklist for Agent-Authored Changes
+
+Use this checklist for branches substantially authored by an AI coding agent:
+
+- **Story:** the PR names one user-visible story or boundary contract.
+- **Contract:** changed boundaries name preconditions, postconditions, or
+  invariants in tests, docs, or comments.
+- **Regression:** bug fixes start with a failing test, or the PR explains why a
+  reproducer is not practical and adds the closest executable guardrail.
+- **Aggression:** new adapters, providers, channel paths, or installer surfaces
+  add or update a high-risk scenario or boundary registry entry when the change
+  creates a new failure shape.
+- **Review:** after commit, the author performs adversarial review focused on
+  security, architecture drift, code quality, and whether tests can really fail.
+- **Measurement:** performance changes include before/after numbers and the
+  command or script used to collect them.
+- **No line-count trophy:** code volume is not presented as evidence of
+  progress. Smaller, clearer patches win.
+
+## Where This Should Become Automation
+
+Near-term automation should make the good path easier:
+
+- Extend PR templates to ask for story, contract, regression, and measurement
+  sections.
+- Teach `scripts/check-scenarios.py` to flag new adapter/provider/channel
+  files that lack a high-risk scenario or boundary registry entry.
+- Add a lightweight `scripts/check-agent-pr-discipline.py` if PR drift
+  continues: changed boundary files should either link a scenario, add a test,
+  or mark the exception explicitly.
+- Add `doctor --live` checks for first-class agents so the same path users test
+  manually is exercised by automation.
+- Keep ratchets pointed at drift that matters: duplicated decision points,
+  oversized responsibility modules, unowned background tasks, and stringly
+  security decisions.
+
+## Anti-Patterns
+
+Avoid these failure modes:
+
+- **Mock-shaped confidence:** tests that only prove Calciforge can parse the
+  response shape it wished the upstream returned.
+- **Generated-test fog:** many tests with unclear oracles, snapshots, or
+  implementation-detail assertions.
+- **Architecture by accumulation:** adding a second source of truth because it
+  is faster than routing through the existing one.
+- **Tool-result obedience:** accepting compiler or linter suggestions without
+  checking whether they address the root cause.
+- **Performance folklore:** changing async, locks, cloning, retries, or local
+  model settings because they "seem slow" without measurements.
+
+The bar is simple: every agent-aided change should leave the contract clearer
+than it found it. If the change only leaves more code, Calcifer is allowed to
+complain.
diff --git a/docs/roadmap/failure-discovery-action-plan.md b/docs/roadmap/failure-discovery-action-plan.md
@@ -383,6 +383,10 @@ facts from a user's failed test message.
 7. Add a release-candidate checklist item: one manually observed failure must
    become either an automated regression test or a documented impossible-to-test
    gap before the PR merges.
+8. Tie new aggression tests back to an explicit contract, scenario, or
+   invariant from the
+   [AI-assisted engineering discipline](agent-engineering-discipline.html)
+   page, so test growth stays tied to the promises Calciforge actually makes.
 
 ## Test Quality Standard
 
@@ -395,3 +399,8 @@ A useful regression test should answer three questions:
 
 If the answer to the first question is "no," the test may still be useful, but
 it is not a regression test. Label it honestly.
+
+For AI-assisted changes, the same rule applies one level higher: a generated
+test only earns trust when a human can name the contract it protects and the
+failure it would catch. Count fewer, sharper tests before counting files or
+lines.