Commit 78f2e12

Document AI-assisted engineering guardrails (#208)
1 parent 28a133b commit 78f2e12

7 files changed

Lines changed: 205 additions & 1 deletion

.github/pull_request_template.md

Lines changed: 9 additions & 0 deletions
@@ -18,10 +18,19 @@ Brief description of the changes in this PR.
 - [ ] `cargo fmt` is clean
 - [ ] New tests added for new functionality

+## Contract / Agent-Authored Work
+
+- [ ] PR names the user-visible story or boundary contract it changes
+- [ ] Bug fixes include a failing regression test, or explain why one is not practical
+- [ ] New/changed boundary tests assert operator-visible behavior, not private implementation details
+- [ ] Performance changes include baseline and post-change measurements
+- [ ] AI-generated tests/docs were reviewed for whether they can fail for the intended reason
+
 ## Checklist

 - [ ] Code follows the project style guidelines
 - [ ] Self-review completed
+- [ ] Adversarial review completed after commit
 - [ ] Comments added for complex logic
 - [ ] Documentation updated (if applicable)
 - [ ] No new warnings introduced

AGENTS.md

Lines changed: 3 additions & 0 deletions
@@ -53,6 +53,9 @@ User-facing tour: `README.md` → [calciforge.org](https://calciforge.org/).
 13. **Large files are debt with budgets, not precedent.** `scripts/check-architecture-ratchets.rb` pins current oversized Rust modules to explicit line budgets and fails CI if they grow. New Rust modules should stay under the default budget unless the PR explains the boundary being created and adds a budget consciously.
 14. **Stringly data stays at the boundary.** It is acceptable for config, JSON, CLI args, and protocol payloads to enter as `String`, `Vec<String>`, or `HashMap<String, String>`, but core logic should convert them into typed structs/enums before making security, routing, lifecycle, or persistence decisions.
 15. **Detached work needs an owner.** New `tokio::spawn` or thread-spawned work must have an explicit lifecycle owner, cancellation/error path, and state handoff. Do not update shared mutable state from background tasks unless the owning module documents the ordering and failure behavior.
+16. **Work in reviewable story slices.** Before starting a broad change, write down the smallest user-visible story or contract being improved. Keep the first patch inside that slice unless the code proves the boundary is wrong. If the task expands, split it into follow-up PRs instead of letting one branch become a second architecture.
+17. **Contracts beat generated volume.** AI-generated tests, fixtures, and docs are not evidence by themselves. Every generated artifact must tie back to a precondition, postcondition, invariant, scenario, or operator-visible promise. Remove or rewrite tests that cannot fail for the intended reason.
+18. **Measure before performance fixes.** For latency, throughput, model cold starts, lock contention, retry storms, and async task behavior, capture the measurement first. A performance PR should name the baseline, the bottleneck hypothesis, the change, and the post-change measurement.

 ## Build / test

README.md

Lines changed: 6 additions & 0 deletions
@@ -245,10 +245,16 @@ Install hooks once:
 bash scripts/install-git-hooks.sh
 ```

+For AI-assisted changes, use the
+[engineering discipline checklist](docs/roadmap/agent-engineering-discipline.md):
+one user-visible story, explicit boundary contracts, tests that can fail for
+the intended reason, and adversarial review after commit.
+
 ## Docs

 - [Feature tour and install notes](https://calciforge.org/)
 - [Agent runtime contract](docs/agent-runtime-contract.md)
+- [AI-assisted engineering discipline](docs/roadmap/agent-engineering-discipline.md)
 - [Model gateway reference](docs/model-gateway.md)
 - [Codex/OpenClaw integration](docs/codex-openclaw-integration.md)
 - [Model gateway RFC](docs/rfcs/model-gateway-primitives.md)

docs/README.md

Lines changed: 3 additions & 0 deletions
@@ -32,6 +32,9 @@ that should be reasonably stable:
 - `roadmap/architecture-laws-action-plan.md` — refactor plan for channel
 pipelines, command handling, security proxy policy, installer structure,
 and adapter lifecycle cleanup
+- `roadmap/agent-engineering-discipline.md` — maintainer rules for
+AI-assisted Calciforge changes: contracts first, story-sized slices,
+adversarial review, measurable performance work, and test-quality gates

 ## Status labels

docs/index.md

Lines changed: 4 additions & 1 deletion
@@ -944,7 +944,10 @@ gateway providers, and synthetic routing selectors pass smoke tests.
 The status summary above is the site-facing snapshot of what works today and
 what is still in flight. Public roadmap ideas live in
 the [roadmap notes](roadmap/v3-ideas.html), with product-interface
-direction in the [UX roadmap](roadmap/product-ux.html).
+direction in the [UX roadmap](roadmap/product-ux.html). Maintainer-facing
+agent-work rules live in the
+[AI-assisted engineering discipline](roadmap/agent-engineering-discipline.html)
+note.

 <footer>
 <div class="name-origin">

docs/roadmap/agent-engineering-discipline.md

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
+---
+layout: default
+title: AI-Assisted Engineering Discipline
+---
+
+# AI-Assisted Engineering Discipline
+
+**Status:** Design sketch
+
+Calciforge is itself a safety tool for AI agents, so the project has to be
+honest about how agent-written code fails. The lesson from recent Calciforge
+bugs, and from broader Rust-with-agents reports such as
+[Cheng Huang's write-up](https://zfhuang99.github.io/rust/claude%20code/codex/contracts/spec-driven%20development/2025/12/01/rust-with-ai.html)
+and the [Hacker News discussion](https://news.ycombinator.com/item?id=48205415),
+is not "generate more code." It is: keep contracts explicit, make feedback
+loops mechanical, and treat every generated test as suspicious until it proves
+which promise it protects.
+
+This page is maintainer-facing. It records how Calciforge agents and humans
+should shape future work.
+
+## Useful Lessons to Import
+
+### Contracts before code
+
+Before changing a boundary, write the contract in plain language:
+
+- the preconditions Calciforge expects,
+- the postconditions it promises,
+- the invariants that must survive failure,
+- the operator-visible behavior that proves the contract held.
+
+This is especially important for model/provider adapters, security-proxy
+rewrites, channel routing, secret handling, installer paths, and doctor checks.
+Those are not "just implementation" areas. They are the castle doors.
+
+Contracts do not need a formal language before they help. A short table in a
+test, ADR, or roadmap note is enough if it names the failure mode clearly.
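
As an illustration of the "short table in a test" idea (this sketch is not part of the commit), the block below writes a contract as doc comments on a boundary function and checks it with a table-driven test. The `ModelSelector` type, the `parse_selector` function, and the `provider/model` syntax are all hypothetical; nothing here is Calciforge's actual selector code. The raw string is converted into a typed struct at the boundary, in line with the "stringly data stays at the boundary" rule in `AGENTS.md`.

```rust
use std::fmt;

/// Typed form of a selector such as "local/small-model".
/// Hypothetical: the type and the "provider/model" syntax are illustrative.
#[derive(Debug, PartialEq, Eq)]
pub struct ModelSelector {
    pub provider: String,
    pub model: String,
}

/// Failure modes the operator can see; naming them is part of the contract.
#[derive(Debug, PartialEq, Eq)]
pub enum SelectorError {
    Empty,
    MissingSeparator(String),
    EmptyPart(String),
}

impl fmt::Display for SelectorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SelectorError::Empty => write!(f, "model selector is empty"),
            SelectorError::MissingSeparator(raw) => {
                write!(f, "model selector `{}` must look like provider/model", raw)
            }
            SelectorError::EmptyPart(raw) => {
                write!(f, "model selector `{}` has an empty provider or model", raw)
            }
        }
    }
}

/// Contract:
/// - precondition: `raw` is untrusted text from config, CLI, or a payload
/// - postcondition: on `Ok`, both parts are trimmed and non-empty
/// - invariant: a failure never yields a partially filled selector
pub fn parse_selector(raw: &str) -> Result<ModelSelector, SelectorError> {
    let raw = raw.trim();
    if raw.is_empty() {
        return Err(SelectorError::Empty);
    }
    // Split on the first '/', so the model part may itself contain '/'.
    let (provider, model) = raw
        .split_once('/')
        .ok_or_else(|| SelectorError::MissingSeparator(raw.to_string()))?;
    let (provider, model) = (provider.trim(), model.trim());
    if provider.is_empty() || model.is_empty() {
        return Err(SelectorError::EmptyPart(raw.to_string()));
    }
    Ok(ModelSelector {
        provider: provider.to_string(),
        model: model.to_string(),
    })
}

#[cfg(test)]
mod tests {
    use super::*;

    fn ok(p: &str, m: &str) -> Result<ModelSelector, SelectorError> {
        Ok(ModelSelector { provider: p.into(), model: m.into() })
    }

    #[test]
    fn selector_contract_table() {
        // Each row names one promise or one failure mode.
        let table = [
            (" local/small-model ", ok("local", "small-model")),
            ("hub/org/model", ok("hub", "org/model")),
            ("", Err(SelectorError::Empty)),
            ("small-model", Err(SelectorError::MissingSeparator("small-model".into()))),
            ("local/", Err(SelectorError::EmptyPart("local/".into()))),
        ];
        for (input, expected) in table {
            assert_eq!(parse_selector(input), expected, "input: {:?}", input);
        }
    }
}
```

Each table row reads as one line of the plain-language contract, so a reviewer can ask of any row which promise it protects. The error messages are asserted through the enum rather than by string, which keeps the test tied to the failure mode and not to wording.
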
+
+### One story per branch
+
+A single user story is the right default unit for agent implementation. A story
+can still cross several files, but it should have one visible outcome:
+
+- "A first-class agent using `stream=true` gets a valid response through the
+configured provider."
+- "`doctor` catches an ACP binary path that would fail at runtime."
+- "A secret entered through the paste UI appears in `!secret list` with its
+destination policy."
+
+When a branch starts solving three stories, split it. Calciforge already has
+enough moving parts; one PR should not become a second moving castle.
+
+### Feedback loops must be executable
+
+Rust helps because the compiler, formatter, linter, and tests can push back on
+agent mistakes. Use that. Every meaningful PR should name the narrowest checks
+that exercise the contract it changes.
+
+For Calciforge, the usual progression is:
+
+1. reproduce the failure or write the contract test,
+2. make the smallest behavioral change,
+3. run focused tests for the touched boundary,
+4. run the relevant docs/ratchet/doctor checks,
+5. commit,
+6. perform adversarial review on the diff.
+
+Generated code is allowed. Unreviewed generated behavior is not.
+
+### Test quality over test count
+
+HN commenters pushed on the right weak spot: a large test count does not prove
+much if nobody can say what those tests protect. Calciforge should judge tests
+by contract value:
+
+- Would this test fail on the bug we are fixing?
+- Does it exercise behavior an operator or agent can observe?
+- Does it cover legal variation from a real upstream, not only our ideal mock?
+- If the implementation changed, would the test still describe the same
+promise?
+
+The [failure discovery action plan](failure-discovery-action-plan.html) calls
+these aggression tests when they deliberately search for likely future
+failures. That should become normal for security, gateway, channel, installer,
+and agent-adapter boundaries.
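
To make "legal variation from a real upstream" concrete (again a sketch, not part of the commit), the block below assumes a streaming upstream that speaks Server-Sent Events. That is an assumption about the transport behind `stream=true`, and `sse_data_payload` is a hypothetical helper. The test rows cover shapes the SSE format permits but an idealized mock tends to omit: no space after the colon, CRLF line endings, comment lines, and a bare field name.

```rust
/// Returns the payload of one Server-Sent Events `data` line, or `None`
/// for comments, blank lines, and other fields.
/// Assumption: the upstream speaks SSE; the function name is hypothetical.
pub fn sse_data_payload(line: &str) -> Option<&str> {
    // Some upstreams end lines with CRLF; drop the stray '\r' if present.
    let line = line.strip_suffix('\r').unwrap_or(line);
    // A line without ':' is a field name with an empty value.
    let (field, value) = line.split_once(':').unwrap_or((line, ""));
    if field != "data" {
        return None;
    }
    // SSE strips at most one leading space from the value.
    Some(value.strip_prefix(' ').unwrap_or(value))
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn accepts_legal_upstream_variation() {
        // Each row is a line shape the format allows, not only the ideal one.
        let table = [
            ("data: {\"ok\":true}", Some("{\"ok\":true}")), // the ideal mock shape
            ("data:{\"ok\":true}", Some("{\"ok\":true}")),  // no space after ':'
            ("data: chunk\r", Some("chunk")),               // CRLF line ending
            ("data:  two", Some(" two")),                   // only one space stripped
            ("data", Some("")),                             // bare field name
            (": keepalive", None),                          // comment line
            ("event: message", None),                       // a different field
            ("", None),                                     // blank separator line
        ];
        for (line, expected) in table {
            assert_eq!(sse_data_payload(line), expected, "line: {:?}", line);
        }
    }
}
```

A parser that split on the literal prefix `"data: "` would pass the first row and fail the second and third, which is the kind of mock-shaped confidence the questions above are meant to catch.
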
+
+### Human-readable abstractions matter
+
+One HN theme was that agents can make code grow faster than it becomes
+understandable. Calciforge should resist that. A useful abstraction should make
+the product contract easier to see:
+
+- one place for model/provider selector resolution,
+- one place for first-class adapter lifecycle metadata,
+- one place for security proxy policy decisions,
+- one place for channel command state,
+- one place for installer ownership of each generated file or service.
+
+If a change adds another "almost the same" path, treat that as architecture
+debt even when tests pass.
+
+### Measure before tuning
+
+The article's performance loop is worth copying: instrument, run, analyze,
+change one thing, and measure again. For Calciforge this applies to:
+
+- model cold starts and local model swapping,
+- provider retry behavior,
+- gateway streaming latency,
+- lock contention around channel/session state,
+- doctor and install runtime,
+- security-proxy scan overhead.
+
+Do not merge speculative latency fixes that lack a baseline and a
+post-change measurement. We have enough guesses; keep the ones with numbers.
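
A minimal way to capture the baseline and post-change numbers described above (a sketch, not part of the commit) is to time the same operation repeatedly and report a median for each variant. The names `sample`, `median`, and `Comparison` are hypothetical, and wall-clock timing with `std::time::Instant` is only one possible instrument; a real measurement may need warm-up runs or a dedicated benchmarking harness.

```rust
use std::time::{Duration, Instant};

/// Runs `op` `runs` times and returns each wall-clock duration.
pub fn sample<F: FnMut()>(runs: usize, mut op: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            op();
            start.elapsed()
        })
        .collect()
}

/// Median of the samples (upper median for even counts); `None` when empty.
pub fn median(samples: &[Duration]) -> Option<Duration> {
    let mut sorted = samples.to_vec();
    sorted.sort();
    sorted.get(sorted.len() / 2).copied()
}

/// The before/after pair a performance PR would report.
#[derive(Debug)]
pub struct Comparison {
    pub baseline: Duration,
    pub candidate: Duration,
}

impl Comparison {
    /// Candidate time as a fraction of baseline; below 1.0 means faster.
    pub fn ratio(&self) -> f64 {
        self.candidate.as_secs_f64() / self.baseline.as_secs_f64()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn median_and_ratio_behave_as_documented() {
        let ms = Duration::from_millis;
        assert_eq!(median(&[ms(3), ms(1), ms(2)]), Some(ms(2)));
        assert_eq!(median(&[]), None);
        let cmp = Comparison { baseline: ms(200), candidate: ms(50) };
        assert!((cmp.ratio() - 0.25).abs() < 1e-9);
    }
}
```

In use, the author would call `sample` on the code path before the change, again after it, and put both medians plus the command that produced them in the PR description. The median is used here because a single slow run, such as a cold cache, skews a mean more than a median.
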
+
+## Required PR Checklist for Agent-Authored Changes
+
+Use this checklist for branches substantially authored by an AI coding agent:
+
+- **Story:** the PR names one user-visible story or boundary contract.
+- **Contract:** changed boundaries name preconditions, postconditions, or
+invariants in tests, docs, or comments.
+- **Regression:** bug fixes start with a failing test, or the PR explains why a
+reproducer is not practical and adds the closest executable guardrail.
+- **Aggression:** new adapters, providers, channel paths, or installer surfaces
+add or update a high-risk scenario or boundary registry entry when the change
+creates a new failure shape.
+- **Review:** after commit, the author performs adversarial review focused on
+security, architecture drift, code quality, and whether tests can really fail.
+- **Measurement:** performance changes include before/after numbers and the
+command or script used to collect them.
+- **No line-count trophy:** code volume is not presented as evidence of
+progress. Smaller, clearer patches win.
+
+## Where This Should Become Automation
+
+Near-term automation should make the good path easier:
+
+- Extend PR templates to ask for story, contract, regression, and measurement
+sections.
+- Teach `scripts/check-scenarios.py` to flag new adapter/provider/channel
+files that lack a high-risk scenario or boundary registry entry.
+- Add a lightweight `scripts/check-agent-pr-discipline.py` if PR drift
+continues: changed boundary files should either link a scenario, add a test,
+or mark the exception explicitly.
+- Add `doctor --live` checks for first-class agents so the same path users test
+manually is exercised by automation.
+- Keep ratchets pointed at drift that matters: duplicated decision points,
+oversized responsibility modules, unowned background tasks, and stringly
+security decisions.
+
+## Anti-Patterns
+
+Avoid these failure modes:
+
+- **Mock-shaped confidence:** tests that only prove Calciforge can parse the
+response shape it wished the upstream returned.
+- **Generated-test fog:** many tests with unclear oracles, snapshots, or
+implementation-detail assertions.
+- **Architecture by accumulation:** adding a second source of truth because it
+is faster than routing through the existing one.
+- **Tool-result obedience:** accepting compiler or linter suggestions without
+checking whether they address the root cause.
+- **Performance folklore:** changing async, locks, cloning, retries, or local
+model settings because they "seem slow" without measurements.
+
+The bar is simple: every agent-aided change should leave the contract clearer
+than it found it. If the change only leaves more code, Calcifer is allowed to
+complain.

docs/roadmap/failure-discovery-action-plan.md

Lines changed: 9 additions & 0 deletions
@@ -383,6 +383,10 @@ facts from a user's failed test message.
 7. Add a release-candidate checklist item: one manually observed failure must
 become either an automated regression test or a documented impossible-to-test
 gap before the PR merges.
+8. Tie new aggression tests back to an explicit contract, scenario, or
+invariant from the
+[AI-assisted engineering discipline](agent-engineering-discipline.html)
+page, so test growth stays tied to the promises Calciforge actually makes.

 ## Test Quality Standard

@@ -395,3 +399,8 @@ A useful regression test should answer three questions:

 If the answer to the first question is "no," the test may still be useful, but
 it is not a regression test. Label it honestly.
+
+For AI-assisted changes, the same rule applies one level higher: a generated
+test only earns trust when a human can name the contract it protects and the
+failure it would catch. Count fewer, sharper tests before counting files or
+lines.
