| 1 | +--- |
| 2 | +layout: default |
| 3 | +title: AI-Assisted Engineering Discipline |
| 4 | +--- |
| 5 | + |
| 6 | +# AI-Assisted Engineering Discipline |
| 7 | + |
| 8 | +**Status:** Design sketch |
| 9 | + |
| 10 | +Calciforge is itself a safety tool for AI agents, so the project has to be |
| 11 | +honest about how agent-written code fails. The lesson from recent Calciforge |
| 12 | +bugs, and from broader Rust-with-agents reports such as |
| 13 | +[Cheng Huang's write-up](https://zfhuang99.github.io/rust/claude%20code/codex/contracts/spec-driven%20development/2025/12/01/rust-with-ai.html) |
| 14 | +and the [Hacker News discussion](https://news.ycombinator.com/item?id=48205415), |
| 15 | +is not "generate more code." It is: keep contracts explicit, make feedback |
| 16 | +loops mechanical, and treat every generated test as suspicious until it proves |
| 17 | +which promise it protects. |
| 18 | + |
| 19 | +This page is maintainer-facing. It records how Calciforge agents and humans |
| 20 | +should shape future work. |
| 21 | + |
| 22 | +## Useful Lessons to Import |
| 23 | + |
| 24 | +### Contracts before code |
| 25 | + |
| 26 | +Before changing a boundary, write the contract in plain language: |
| 27 | + |
| 28 | +- the preconditions Calciforge expects, |
| 29 | +- the postconditions it promises, |
| 30 | +- the invariants that must survive failure, |
| 31 | +- the operator-visible behavior that proves the contract held. |
| 32 | + |
| 33 | +This is especially important for model/provider adapters, security-proxy |
| 34 | +rewrites, channel routing, secret handling, installer paths, and doctor checks. |
| 35 | +Those are not "just implementation" areas. They are the castle doors. |
| 36 | + |
| 37 | +Contracts do not need a formal language before they help. A short table in a |
| 38 | +test, ADR, or roadmap note is enough if it names the failure mode clearly. |
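
As a minimal sketch of what "a short table in a test" could look like, the block below records one contract as plain data and checks that no field was left blank. Everything in it is an assumption made for illustration: the `Contract` type, the field names, and the example entry are hypothetical and not taken from the Calciforge codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """One row of a plain-language contract table (hypothetical shape)."""
    boundary: str
    precondition: str
    postcondition: str
    invariant: str
    operator_signal: str


# Hypothetical example entry; the wording is illustrative only.
CONTRACTS = [
    Contract(
        boundary="provider adapter (streaming)",
        precondition="request sets stream=true and names a configured provider",
        postcondition="caller receives a complete, well-formed response",
        invariant="no secret value appears in any emitted chunk, even on error",
        operator_signal="doctor reports the streaming path as healthy",
    ),
]


def incomplete_contracts(contracts):
    """Return the boundaries whose contract leaves any field blank."""
    return [
        c.boundary
        for c in contracts
        if not all(value.strip() for value in vars(c).values())
    ]
```

A test can then assert `incomplete_contracts(CONTRACTS) == []`, which keeps the table honest without requiring a formal specification language.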
| 39 | + |
| 40 | +### One story per branch |
| 41 | + |
| 42 | +A single user story is the right default unit for agent implementation. A story |
| 43 | +can still cross several files, but it should have one visible outcome: |
| 44 | + |
| 45 | +- "A first-class agent using `stream=true` gets a valid response through the |
| 46 | + configured provider." |
| 47 | +- "`doctor` catches an ACP binary path that would fail at runtime." |
| 48 | +- "A secret entered through the paste UI appears in `!secret list` with its |
| 49 | + destination policy." |
| 50 | + |
| 51 | +When a branch starts solving three stories, split it. Calciforge already has |
| 52 | +enough moving parts; one PR should not become a second moving castle. |
| 53 | + |
| 54 | +### Feedback loops must be executable |
| 55 | + |
| 56 | +Rust helps because the compiler, formatter, linter, and tests can push back on |
| 57 | +agent mistakes. Use that. Every meaningful PR should name the narrowest checks |
| 58 | +that exercise the contract it changes. |
| 59 | + |
| 60 | +For Calciforge, the usual progression is: |
| 61 | + |
| 62 | +1. reproduce the failure or write the contract test, |
| 63 | +2. make the smallest behavioral change, |
| 64 | +3. run focused tests for the touched boundary, |
| 65 | +4. run the relevant docs/ratchet/doctor checks, |
| 66 | +5. commit, |
| 67 | +6. perform adversarial review on the diff. |
| 68 | + |
| 69 | +Generated code is allowed. Unreviewed generated behavior is not. |
| 70 | + |
| 71 | +### Test quality over test count |
| 72 | + |
| 73 | +HN commenters pushed on the right weak spot: a large test count does not prove |
| 74 | +much if nobody can say what those tests protect. Calciforge should judge tests |
| 75 | +by contract value: |
| 76 | + |
| 77 | +- Would this test fail on the bug we are fixing? |
| 78 | +- Does it exercise behavior an operator or agent can observe? |
| 79 | +- Does it cover legal variation from a real upstream, not only our ideal mock? |
| 80 | +- If the implementation changed, would the test still describe the same |
| 81 | + promise? |
| 82 | + |
| 83 | +The [failure discovery action plan](failure-discovery-action-plan.html) calls |
| 84 | +these aggression tests when they deliberately search for likely future |
| 85 | +failures. That should become normal for security, gateway, channel, installer, |
| 86 | +and agent-adapter boundaries. |
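
One way to test "legal variation from a real upstream" is a table of response shapes that each map to one observable result. The sketch below assumes a hypothetical helper, `extract_text`, and hypothetical message shapes; neither is taken from Calciforge or from any specific provider API. The point is the structure: each row names a variation, and the test reports by name which ones break.

```python
def extract_text(message: dict) -> str:
    """Hypothetical adapter helper: flatten a message's content to text."""
    content = message.get("content")
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    # Otherwise treat content as a list of parts and keep only text parts.
    return "".join(
        part.get("text", "") for part in content if isinstance(part, dict)
    )


# Each row: (variation name, upstream message, text the caller should see).
VARIANTS = [
    ("plain string", {"content": "hi"}, "hi"),
    ("missing content", {}, ""),
    ("null content", {"content": None}, ""),
    ("list of parts", {"content": [{"text": "h"}, {"text": "i"}]}, "hi"),
    ("non-text part", {"content": [{"type": "image"}, {"text": "hi"}]}, "hi"),
]


def failing_variants():
    """Return the names of variations whose output differs from the promise."""
    return [
        name for name, message, want in VARIANTS
        if extract_text(message) != want
    ]
```

Because the rows describe the promise rather than the implementation, the table should survive a rewrite of `extract_text` unchanged.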
| 87 | + |
| 88 | +### Human-readable abstractions matter |
| 89 | + |
| 90 | +One HN theme was that agents can make code grow faster than it becomes |
| 91 | +understandable. Calciforge should resist that. A useful abstraction should make |
| 92 | +the product contract easier to see: |
| 93 | + |
| 94 | +- one place for model/provider selector resolution, |
| 95 | +- one place for first-class adapter lifecycle metadata, |
| 96 | +- one place for security proxy policy decisions, |
| 97 | +- one place for channel command state, |
| 98 | +- one place for installer ownership of each generated file or service. |
| 99 | + |
| 100 | +If a change adds another "almost the same" path, treat that as architecture |
| 101 | +debt even when tests pass. |
| 102 | + |
| 103 | +### Measure before tuning |
| 104 | + |
| 105 | +The article's performance loop is worth copying: instrument, run, analyze, |
| 106 | +change one thing, and measure again. For Calciforge this applies to: |
| 107 | + |
| 108 | +- model cold starts and local model swapping, |
| 109 | +- provider retry behavior, |
| 110 | +- gateway streaming latency, |
| 111 | +- lock contention around channel/session state, |
| 112 | +- doctor and install runtime, |
| 113 | +- security-proxy scan overhead. |
| 114 | + |
| 115 | +Do not merge speculative latency fixes that lack a baseline and a |
| 116 | +post-change measurement. We have enough guesses; keep the ones with numbers. |
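
A baseline and a post-change measurement can be as small as the sketch below. It is an illustration under stated assumptions, not a Calciforge tool: `measure` and `compare` are hypothetical names, and the callables passed in stand for whatever operation is being tuned.

```python
import statistics
import time


def measure(fn, runs=20):
    """Return the median wall-clock time of fn() in seconds over `runs` calls."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(baseline, candidate, runs=20):
    """Measure both callables the same way and report before/after numbers."""
    before = measure(baseline, runs)
    after = measure(candidate, runs)
    speedup = before / after if after > 0 else float("inf")
    return {"before_s": before, "after_s": after, "speedup": speedup}
```

The median is used here because it is less sensitive to a single slow run than the mean. A PR would paste the resulting numbers together with the command used to collect them.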
| 117 | + |
| 118 | +## Required PR Checklist for Agent-Authored Changes |
| 119 | + |
| 120 | +Use this checklist for branches substantially authored by an AI coding agent: |
| 121 | + |
| 122 | +- **Story:** the PR names one user-visible story or boundary contract. |
| 123 | +- **Contract:** changed boundaries name preconditions, postconditions, or |
| 124 | + invariants in tests, docs, or comments. |
| 125 | +- **Regression:** bug fixes start with a failing test, or the PR explains why a |
| 126 | + reproducer is not practical and adds the closest executable guardrail. |
| 127 | +- **Aggression:** new adapters, providers, channel paths, or installer surfaces |
| 128 | + add or update a high-risk scenario or boundary registry entry when the change |
| 129 | + creates a new failure shape. |
| 130 | +- **Review:** after commit, the author performs adversarial review focused on |
| 131 | + security, architecture drift, code quality, and whether tests can really fail. |
| 132 | +- **Measurement:** performance changes include before/after numbers and the |
| 133 | + command or script used to collect them. |
| 134 | +- **No line-count trophy:** code volume is not presented as evidence of |
| 135 | + progress. Smaller, clearer patches win. |
| 136 | + |
| 137 | +## Where This Should Become Automation |
| 138 | + |
| 139 | +Near-term automation should make the good path easier: |
| 140 | + |
| 141 | +- Extend PR templates to ask for story, contract, regression, and measurement |
| 142 | + sections. |
| 143 | +- Teach `scripts/check-scenarios.py` to flag new adapter/provider/channel |
| 144 | + files that lack a high-risk scenario or boundary registry entry. |
| 145 | +- Add a lightweight `scripts/check-agent-pr-discipline.py` if PR drift |
| 146 | + continues: changed boundary files should either link a scenario, add a test, |
| 147 | + or mark the exception explicitly. |
| 148 | +- Add `doctor --live` checks for first-class agents so the same path users test |
| 149 | + manually is exercised by automation. |
| 150 | +- Keep ratchets pointed at drift that matters: duplicated decision points, |
| 151 | + oversized responsibility modules, unowned background tasks, and |
| 152 | + stringly typed security decisions. |
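
The scenario check described above could start from a pure function like the one below. This is a sketch only: it does not describe what `scripts/check-scenarios.py` does today, and the directory names and the idea of a flat list of registered files are assumptions made for illustration.

```python
# Hypothetical boundary directories; the real layout may differ.
BOUNDARY_DIRS = ("src/adapters", "src/providers", "src/channels")


def is_boundary_file(path: str) -> bool:
    """True if the path sits under one of the assumed boundary directories."""
    return any(path.startswith(prefix + "/") for prefix in BOUNDARY_DIRS)


def missing_scenarios(changed_files, registered_files):
    """Return changed boundary files with no scenario or registry entry."""
    registered = set(registered_files)
    return sorted(
        path for path in changed_files
        if is_boundary_file(path) and path not in registered
    )
```

Keeping the decision in a pure function means the check can be unit-tested without a git checkout; a thin wrapper would supply the changed-file list and exit non-zero when the result is non-empty.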
| 153 | + |
| 154 | +## Anti-Patterns |
| 155 | + |
| 156 | +Avoid these failure modes: |
| 157 | + |
| 158 | +- **Mock-shaped confidence:** tests that only prove Calciforge can parse the |
| 159 | + response shape it wished the upstream returned. |
| 160 | +- **Generated-test fog:** many tests with unclear oracles, snapshots, or |
| 161 | + implementation-detail assertions. |
| 162 | +- **Architecture by accumulation:** adding a second source of truth because it |
| 163 | + is faster than routing through the existing one. |
| 164 | +- **Tool-result obedience:** accepting compiler or linter suggestions without |
| 165 | + checking whether they address the root cause. |
| 166 | +- **Performance folklore:** changing async, locks, cloning, retries, or local |
| 167 | + model settings because they "seem slow" without measurements. |
| 168 | + |
| 169 | +The bar is simple: every agent-aided change should leave the contract clearer |
| 170 | +than it found it. If the change only leaves more code, Calcifer is allowed to |
| 171 | +complain. |