Skip to content

Commit a0f67a6

Browse files
authored
Make workload fault-tolerance mindset explicit (#150)
The skill listed fault types to handle but didn't frame the underlying mindset — that workloads run under fault injection where transient errors are expected, and design should retry toward goals rather than bail. LLMs using the skill have been inconsistent at producing this behavior; a colleague confirmed the pattern. Source: Bug Bash podcast discussion (Marco Preey + Sean Allen).
1 parent ca53cf3 commit a0f67a6

3 files changed

Lines changed: 39 additions & 1 deletion

File tree

antithesis-workload/SKILL.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -131,6 +131,7 @@ Use the `antithesis-documentation` skill to access these pages. Prefer `snouty d
131131

132132
- Keep Antithesis-only code out of production paths. If you must touch shared code, make the change surgical and easy to wall off.
133133
- Prefer simple workload code over highly configurable abstractions.
134+
- Design workloads to be fault-tolerant. Unlike happy-path tests where errors signal bugs, workloads run under fault injection and concurrent activity — transient errors are expected, not exceptional. Make progress toward a goal rather than bail on the first failure.
134135
- Assume `antithesis-setup` has already made the system runnable in a mostly idle state; this skill owns what the workload does once the system is up.
135136
- Assume `antithesis-setup` has already installed the relevant SDK and added one minimal bootstrap assertion in the SUT. This skill owns the broader property catalog beyond that initial integration check.
136137
- Write test commands in the project's language, not Bash, so they can reuse the project's clients, helpers, and libraries.
@@ -152,6 +153,9 @@ Review criteria:
152153
- `Sometimes(true, ...)` assertions should be rewritten as `Reachable(...)`.
153154
- Assertion messages are unique across the touched code; no broad property is implemented by reusing one message at multiple unrelated callsites
154155
- Workload-only instrumentation was not used where surgical SUT-side assertions would provide materially better search guidance for rare, dangerous, or timing-sensitive internal states
156+
- Workload code makes progress toward stated goals under transient errors rather than bailing on first failure
157+
- The workload records both attempted and acknowledged operations so later assertions can check bounds (e.g., "counter changed by some value in `[acknowledged, attempted]`")
158+
- Where retries are used, they're consistent with the SUT's idempotency contract
155159
- `Reachable(...)` markers are attached to distinct outcomes or branch results, not redundant early path-entry locations on the same straight-line flow
156160
- For bounded inputs in test commands, draws come from property-specific value menus (boundary values for the input type plus configured-limit families from the property's code paths) rather than arbitrary ranges, or the test command documents that the menu axis is not applicable. See `interesting-values.md`.
157161
- Test commands exist under `antithesis/test/` and use valid prefixes (`parallel_driver_`, `singleton_driver_`, `serial_driver_`, `first_`, `eventually_`, `finally_`, `anytime_`)

antithesis-workload/references/assertions.md

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -62,12 +62,26 @@ These fit properties where the individual boolean inputs matter on their own, fo
6262

6363
When available, the SDK adds the named boolean inputs to assertion details under their names, which makes triage more informative than an anonymous combined expression.
6464

65+
## Assert Bounds, Not Exact Values
66+
67+
When a workload runs under fault injection, it can't directly observe everything that happened — requests fail in flight, acknowledgments are lost. The workload constructs bounds (see `component-implementation.md`, "Construct bounds, don't claim exact knowledge"), and the assertion's job is to check that observed state falls within those bounds.
68+
69+
Use rich numeric assertions to encode the bound. If a workload sent 100 increment requests to a counter and 80 were acknowledged:
70+
71+
- `AlwaysGreaterThanOrEqualTo(observed_delta, 80, "counter reflects all acknowledged increments")`
72+
- `AlwaysLessThanOrEqualTo(observed_delta, 100, "counter reflects no more than attempted increments")`
73+
74+
Two assertions, not one, so triage shows which bound failed.
75+
76+
Don't write `Always(observed_delta == 100, ...)` for the same property — that assertion will fire legitimately whenever the environment dropped requests, drowning real bugs in false positives.
77+
6578
## Anti-Rules
6679

6780
- Do not use `Sometimes(true, ...)` in normal workload or SUT code. If the condition is constant true, use `Reachable(...)` instead.
6881
- Do not use `Sometimes(cond, ...)` when the only thing you care about is that execution hit a path. Use `Reachable(...)`.
6982
- Do not reuse one assertion message across multiple unrelated callsites. Every assertion message should be unique in the codebase.
7083
- Do not stack broad early `Reachable(...)` markers on a straight-line flow when a later, more specific outcome marker already proves the path was exercised.
84+
- Do not assert exact equality on values affected by transient errors. Use bounded assertions (see "Assert Bounds, Not Exact Values").
7185

7286
## Good and Bad Uses
7387

antithesis-workload/references/component-implementation.md

Lines changed: 21 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,27 @@ This is the most open-ended part of the skill. Start with the simplest component
2121

2222
## Fault Tolerance in Workload Code
2323

24-
If you are writing code that connects to a service over the network (e.g., test commands), ensure it handles all forms of temporary network faults. Read the Fault injection documentation to learn which faults your code needs to handle:
24+
Workloads run under fault injection. Transient errors are expected, not exceptional — the workload's job is to keep making progress toward its goal, not to bail on the first failure.
25+
26+
### Design for goals, not procedures
27+
28+
Write workload operations as loops that drive toward an objective, not fixed sequences of attempts. A workload that tracks "I tried 100 times" has nothing useful to assert against in a faulty environment; one that tracks progress against a goal can operate while the environment drops requests, partitions the network, or restarts nodes.
29+
30+
### Construct bounds, don't claim exact knowledge
31+
32+
Under fault injection a workload doesn't have perfect visibility — requests fail in flight, acknowledgments get dropped, clocks drift. The workload's job is to record enough of what happened to construct bounds the SUT must satisfy. Anything outside those bounds has probability zero of being correct, and that's how bugs surface.
33+
34+
The most common way to construct a bound is to track attempts and acknowledgments separately. If a workload sends 100 increment requests to a counter and 80 are acknowledged, the counter must have changed by some number between 80 and 100 — unacknowledged requests may have succeeded anyway. A later read showing 75 or 120 is a bug, in the SUT or in the workload itself. A workload that records only "publish() called 100 times" has no bound to assert against.
35+
36+
See `assertions.md`, "Assert Bounds, Not Exact Values" for how to express these bounds as assertions.
37+
38+
### Retries are inputs to the system
39+
40+
Retries are inputs the workload feeds into the SUT, not free additions — they shape what bugs you find. Retrying after an acknowledgment is lost can surface real idempotency bugs (good — that's a production scenario). Retrying past the SUT's idempotency contract creates faulty client behavior that produces false-positive assertion failures (bad). Decide what the SUT promises about idempotency before designing retry behavior.
41+
42+
### Faults to handle
43+
44+
Read the Fault injection documentation to learn what your code needs to tolerate:
2545

2646
- Process kill/restart
2747
- Network partition (full and partial)

0 commit comments

Comments
 (0)