Make workload fault-tolerance mindset explicit (#150)

SeanTAllen · web-flow · commit a0f67a6cd6c2 · 2026-05-15T13:03:38.000-04:00
The skill listed fault types to handle but didn't frame the underlying
mindset — that workloads run under fault injection where transient
errors are expected, and design should retry toward goals rather than
bail. LLMs using the skill have been inconsistent at producing this
behavior; a colleague confirmed the pattern.
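
For illustration only, here is a minimal sketch of the intended shape,
assuming a hypothetical `FlakyCounter` stand-in for the SUT and plain
`assert` in place of the SDK's rich numeric assertions. None of these
names come from the skill. The loop drives toward a goal, tolerates
transient errors, and records attempted and acknowledged operations
separately so that a bound can be checked afterwards.

```python
import random


class FlakyCounter:
    """Hypothetical stand-in for a SUT counter under fault injection."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.value = 0

    def increment(self):
        roll = self._rng.random()
        if roll < 0.2:
            # Request lost in flight: the counter never sees it.
            raise ConnectionError("request dropped before reaching the counter")
        self.value += 1
        if roll < 0.4:
            # Request applied, but the acknowledgment is lost.
            raise TimeoutError("increment applied, acknowledgment lost")


def drive_increments(counter, goal, max_attempts=1000):
    """Send fresh increments until `goal` of them are acknowledged."""
    attempted = 0
    acknowledged = 0
    while acknowledged < goal and attempted < max_attempts:
        attempted += 1
        try:
            counter.increment()
        except (ConnectionError, TimeoutError):
            continue  # transient fault: keep working toward the goal
        acknowledged += 1
    return attempted, acknowledged


counter = FlakyCounter(seed=7)
attempted, acknowledged = drive_increments(counter, goal=80)
observed_delta = counter.value

# Unacknowledged requests may still have been applied, so the observed
# change must land in [acknowledged, attempted], not on an exact value.
assert acknowledged <= observed_delta <= attempted
```

Each loop iteration sends a new increment rather than resending an old
one, so every attempt is counted and the upper bound stays valid even
though increment is not idempotent.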

Source: Bug Bash podcast discussion (Marco Preey + Sean Allen).
diff --git a/antithesis-workload/SKILL.md b/antithesis-workload/SKILL.md
@@ -131,6 +131,7 @@ Use the `antithesis-documentation` skill to access these pages. Prefer `snouty d
 
 - Keep Antithesis-only code out of production paths. If you must touch shared code, make the change surgical and easy to wall off.
 - Prefer simple workload code over highly configurable abstractions.
+- Design workloads to be fault-tolerant. Unlike happy-path tests where errors signal bugs, workloads run under fault injection and concurrent activity — transient errors are expected, not exceptional. Make progress toward a goal rather than bail on the first failure.
 - Assume `antithesis-setup` has already made the system runnable in a mostly idle state; this skill owns what the workload does once the system is up.
 - Assume `antithesis-setup` has already installed the relevant SDK and added one minimal bootstrap assertion in the SUT. This skill owns the broader property catalog beyond that initial integration check.
 - Write test commands in the project's language, not Bash, so they can reuse the project's clients, helpers, and libraries.
@@ -152,6 +153,9 @@ Review criteria:
 - `Sometimes(true, ...)` assertions should be rewritten as `Reachable(...)`.
 - Assertion messages are unique across the touched code; no broad property is implemented by reusing one message at multiple unrelated callsites
 - Workload-only instrumentation was not used where surgical SUT-side assertions would provide materially better search guidance for rare, dangerous, or timing-sensitive internal states
+- Workload code makes progress toward stated goals under transient errors rather than bailing on first failure
+- The workload records both attempted and acknowledged operations so later assertions can check bounds (e.g., "counter changed by some value in `[acknowledged, attempted]`")
+- Where retries are used, they're consistent with the SUT's idempotency contract
 - `Reachable(...)` markers are attached to distinct outcomes or branch results, not redundant early path-entry locations on the same straight-line flow
 - For bounded inputs in test commands, draws come from property-specific value menus (boundary values for the input type plus configured-limit families from the property's code paths) rather than arbitrary ranges, or the test command documents that the menu axis is not applicable. See `interesting-values.md`.
 - Test commands exist under `antithesis/test/` and use valid prefixes (`parallel_driver_`, `singleton_driver_`, `serial_driver_`, `first_`, `eventually_`, `finally_`, `anytime_`)
diff --git a/antithesis-workload/references/assertions.md b/antithesis-workload/references/assertions.md
@@ -62,12 +62,26 @@ These fit properties where the individual boolean inputs matter on their own, fo
 
 When available, the SDK adds the named boolean inputs to assertion details under their names, which makes triage more informative than an anonymous combined expression.
 
+## Assert Bounds, Not Exact Values
+
+When a workload runs under fault injection, it can't directly observe everything that happened — requests fail in flight, acknowledgments are lost. The workload constructs bounds (see `component-implementation.md`, "Construct bounds, don't claim exact knowledge"), and the assertion's job is to check that observed state falls within those bounds.
+
+Use rich numeric assertions to encode the bound. If a workload sent 100 increment requests to a counter and 80 were acknowledged:
+
+- `AlwaysGreaterThanOrEqualTo(observed_delta, 80, "counter reflects all acknowledged increments")`
+- `AlwaysLessThanOrEqualTo(observed_delta, 100, "counter reflects no more than attempted increments")`
+
+Two assertions, not one, so triage shows which bound failed.
+
+Don't write `Always(observed_delta == 100, ...)` for the same property. That assertion fails whenever the environment drops a request, even when the SUT behaved correctly, drowning real bugs in false positives.
+
 ## Anti-Rules
 
 - Do not use `Sometimes(true, ...)` in normal workload or SUT code. If the condition is constant true, use `Reachable(...)` instead.
 - Do not use `Sometimes(cond, ...)` when the only thing you care about is that execution hit a path. Use `Reachable(...)`.
 - Do not reuse one assertion message across multiple unrelated callsites. Every assertion message should be unique in the codebase.
 - Do not stack broad early `Reachable(...)` markers on a straight-line flow when a later, more specific outcome marker already proves the path was exercised.
+- Do not assert exact equality on values affected by transient errors. Use bounded assertions (see "Assert Bounds, Not Exact Values").
 
 ## Good and Bad Uses
 
diff --git a/antithesis-workload/references/component-implementation.md b/antithesis-workload/references/component-implementation.md
@@ -21,7 +21,27 @@ This is the most open-ended part of the skill. Start with the simplest component
 
 ## Fault Tolerance in Workload Code
 
-If you are writing code that connects to a service over the network (e.g., test commands), ensure it handles all forms of temporary network faults. Read the Fault injection documentation to learn which faults your code needs to handle:
+Workloads run under fault injection. Transient errors are expected, not exceptional — the workload's job is to keep making progress toward its goal, not to bail on the first failure.
+
+### Design for goals, not procedures
+
+Write workload operations as loops that drive toward an objective, not fixed sequences of attempts. A workload that tracks "I tried 100 times" has nothing useful to assert against in a faulty environment; one that tracks progress against a goal can operate while the environment drops requests, partitions the network, or restarts nodes.
+
+### Construct bounds, don't claim exact knowledge
+
+Under fault injection a workload doesn't have perfect visibility: requests fail in flight, acknowledgments get dropped, clocks drift. The workload's job is to record enough of what happened to construct bounds the SUT must satisfy. Any observation outside those bounds cannot be correct behavior, and that is how bugs surface.
+
+The most common way to construct a bound is to track attempts and acknowledgments separately. If a workload sends 100 increment requests to a counter and 80 are acknowledged, the counter must have changed by some number between 80 and 100 inclusive, because unacknowledged requests may have succeeded anyway. A later read showing 75 or 120 is a bug, in the SUT or in the workload itself. A workload that records only "sent 100 increment requests" has no lower bound to assert against.
+
+See `assertions.md`, "Assert Bounds, Not Exact Values" for how to express these bounds as assertions.
+
+### Retries are inputs to the system
+
+Retries are inputs the workload feeds into the SUT, not free additions; they shape what bugs you find. Retrying after an acknowledgment is lost can surface real idempotency bugs (good: that's a production scenario). Retrying in ways the SUT's idempotency contract does not cover is faulty client behavior that produces false-positive assertion failures (bad). Decide what the SUT promises about idempotency before designing retry behavior.
+
+### Faults to handle
+
+Read the Fault injection documentation to learn what your code needs to tolerate:
 
 - Process kill/restart
 - Network partition (full and partial)