You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Make workload fault-tolerance mindset explicit (#150)
The skill listed fault types to handle but didn't frame the underlying
mindset — that workloads run under fault injection where transient
errors are expected, and design should retry toward goals rather than
bail. LLMs using the skill have been inconsistent at producing this
behavior; a colleague confirmed the pattern.
Source: Bug Bash podcast discussion (Marco Preey + Sean Allen).
Copy file name to clipboardExpand all lines: antithesis-workload/SKILL.md
+4Lines changed: 4 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -131,6 +131,7 @@ Use the `antithesis-documentation` skill to access these pages. Prefer `snouty d
131
131
132
132
- Keep Antithesis-only code out of production paths. If you must touch shared code, make the change surgical and easy to wall off.
133
133
- Prefer simple workload code over highly configurable abstractions.
134
+
- Design workloads to be fault-tolerant. Unlike happy-path tests where errors signal bugs, workloads run under fault injection and concurrent activity — transient errors are expected, not exceptional. Make progress toward a goal rather than bail on the first failure.
134
135
- Assume `antithesis-setup` has already made the system runnable in a mostly idle state; this skill owns what the workload does once the system is up.
135
136
- Assume `antithesis-setup` has already installed the relevant SDK and added one minimal bootstrap assertion in the SUT. This skill owns the broader property catalog beyond that initial integration check.
136
137
- Write test commands in the project's language, not Bash, so they can reuse the project's clients, helpers, and libraries.
@@ -152,6 +153,9 @@ Review criteria:
152
153
-`Sometimes(true, ...)` assertions should be rewritten as `Reachable(...)`.
153
154
- Assertion messages are unique across the touched code; no broad property is implemented by reusing one message at multiple unrelated callsites
154
155
- Workload-only instrumentation was not used where surgical SUT-side assertions would provide materially better search guidance for rare, dangerous, or timing-sensitive internal states
156
+
- Workload code makes progress toward stated goals under transient errors rather than bailing on first failure
157
+
- The workload records both attempted and acknowledged operations so later assertions can check bounds (e.g., "counter changed by some value in `[acknowledged, attempted]`")
158
+
- Where retries are used, they're consistent with the SUT's idempotency contract
155
159
-`Reachable(...)` markers are attached to distinct outcomes or branch results, not redundant early path-entry locations on the same straight-line flow
156
160
- For bounded inputs in test commands, draws come from property-specific value menus (boundary values for the input type plus configured-limit families from the property's code paths) rather than arbitrary ranges, or the test command documents that the menu axis is not applicable. See `interesting-values.md`.
157
161
- Test commands exist under `antithesis/test/` and use valid prefixes (`parallel_driver_`, `singleton_driver_`, `serial_driver_`, `first_`, `eventually_`, `finally_`, `anytime_`)
Copy file name to clipboardExpand all lines: antithesis-workload/references/assertions.md
+14Lines changed: 14 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -62,12 +62,26 @@ These fit properties where the individual boolean inputs matter on their own, fo
62
62
63
63
When available, the SDK adds the named boolean inputs to assertion details under their names, which makes triage more informative than an anonymous combined expression.
64
64
65
+
## Assert Bounds, Not Exact Values
66
+
67
+
When a workload runs under fault injection, it can't directly observe everything that happened — requests fail in flight, acknowledgments are lost. The workload constructs bounds (see `component-implementation.md`, "Construct bounds, don't claim exact knowledge"), and the assertion's job is to check that observed state falls within those bounds.
68
+
69
+
Use rich numeric assertions to encode the bound. If a workload sent 100 increment requests to a counter and 80 were acknowledged:
70
+
71
+
-`AlwaysGreaterThanOrEqualTo(observed_delta, 80, "counter reflects all acknowledged increments")`
72
+
-`AlwaysLessThanOrEqualTo(observed_delta, 100, "counter reflects no more than attempted increments")`
73
+
74
+
Two assertions, not one, so triage shows which bound failed.
75
+
76
+
Don't write `Always(observed_delta == 100, ...)` for the same property — that assertion will fire legitimately whenever the environment dropped requests, drowning real bugs in false positives.
77
+
65
78
## Anti-Rules
66
79
67
80
- Do not use `Sometimes(true, ...)` in normal workload or SUT code. If the condition is constant true, use `Reachable(...)` instead.
68
81
- Do not use `Sometimes(cond, ...)` when the only thing you care about is that execution hit a path. Use `Reachable(...)`.
69
82
- Do not reuse one assertion message across multiple unrelated callsites. Every assertion message should be unique in the codebase.
70
83
- Do not stack broad early `Reachable(...)` markers on a straight-line flow when a later, more specific outcome marker already proves the path was exercised.
84
+
- Do not assert exact equality on values affected by transient errors. Use bounded assertions (see "Assert Bounds, Not Exact Values").
Copy file name to clipboardExpand all lines: antithesis-workload/references/component-implementation.md
+21-1Lines changed: 21 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -21,7 +21,27 @@ This is the most open-ended part of the skill. Start with the simplest component
21
21
22
22
## Fault Tolerance in Workload Code
23
23
24
-
If you are writing code that connects to a service over the network (e.g., test commands), ensure it handles all forms of temporary network faults. Read the Fault injection documentation to learn which faults your code needs to handle:
24
+
Workloads run under fault injection. Transient errors are expected, not exceptional — the workload's job is to keep making progress toward its goal, not to bail on the first failure.
25
+
26
+
### Design for goals, not procedures
27
+
28
+
Write workload operations as loops that drive toward an objective, not fixed sequences of attempts. A workload that tracks "I tried 100 times" has nothing useful to assert against in a faulty environment; one that tracks progress against a goal can operate while the environment drops requests, partitions the network, or restarts nodes.
29
+
30
+
### Construct bounds, don't claim exact knowledge
31
+
32
+
Under fault injection a workload doesn't have perfect visibility — requests fail in flight, acknowledgments get dropped, clocks drift. The workload's job is to record enough of what happened to construct bounds the SUT must satisfy. Anything outside those bounds has probability zero of being correct, and that's how bugs surface.
33
+
34
+
The most common way to construct a bound is to track attempts and acknowledgments separately. If a workload sends 100 increment requests to a counter and 80 are acknowledged, the counter must have changed by some number between 80 and 100 — unacknowledged requests may have succeeded anyway. A later read showing 75 or 120 is a bug, in the SUT or in the workload itself. A workload that records only "publish() called 100 times" has no bound to assert against.
35
+
36
+
See `assertions.md`, "Assert Bounds, Not Exact Values" for how to express these bounds as assertions.
37
+
38
+
### Retries are inputs to the system
39
+
40
+
Retries are inputs the workload feeds into the SUT, not free additions — they shape what bugs you find. Retrying after an acknowledgment is lost can surface real idempotency bugs (good — that's a production scenario). Retrying past the SUT's idempotency contract creates faulty client behavior that produces false-positive assertion failures (bad). Decide what the SUT promises about idempotency before designing retry behavior.
41
+
42
+
### Faults to handle
43
+
44
+
Read the Fault injection documentation to learn what your code needs to tolerate:
0 commit comments