
Commit dcd2e86

fix: fix wallclock recovery path

Refine TLA+ model. Add more tests.

1 parent aeef6c2

19 files changed: 915 additions & 236 deletions

docs/recovery/README.md

Lines changed: 21 additions & 16 deletions
@@ -117,20 +117,22 @@ This gives us three regimes:
 ```
 
 - **Before the danger zone**: batches are young. Nothing to do.
-- **In the danger zone**: batches might land stale, or might still make it. This is the window of uncertainty. The flush resolves it by forcing every `w_nonce` slot to finalize (batch wins or no-op wins). After the flush, the sequencer reads the scheduler's finalized state and cascades if needed.
-- **Past MAX_WAIT**: all unresolved batches are guaranteed stale by L1 monotonicity (`inclusion_block >= current_safe_block >= safe_block + MAX_WAIT`). Staleness self-resolves -- the L1 outcome doesn't matter because every possible inclusion is stale. This means the flush could in principle be skipped: just wait for all slots to be consumed (which happens naturally as L1 progresses), then read the scheduler's state. In the implementation, the flush is still recommended for all cases (it's cheap when past MAX_WAIT since all competing batches are stale anyway), but the self-resolution property is what makes the design robust to long outages.
+- **In the danger zone**: batches might land stale, or might still make it. This is the window of uncertainty. For **closed unresolved batches**, the flush resolves it by forcing every `w_nonce` slot to finalize (batch wins or no-op wins). After the flush, the sequencer reads the scheduler's finalized state and cascades if needed. An **open Tip** has no `w_nonce` slot yet, so it is not part of this uncertainty set.
+- **Past MAX_WAIT**: all unresolved batches are guaranteed stale by L1 monotonicity (`inclusion_block >= current_safe_block >= safe_block + MAX_WAIT`). For closed unresolved batches, the L1 outcome no longer matters because every eventual inclusion is stale, but wallet-nonce slots may still need to be flushed (or naturally consumed) before recovery can reconstruct the scheduler frontier. For an aging open Tip, there is no L1-slot uncertainty at all, so startup recovery can invalidate it directly.
 
 **What TLA+ proves vs external reasoning**: the TLA+ model ([`preemptive.tla`](preemptive.tla)) proves that after all `w_nonce` slots are resolved (however that happens), ZombieSafety holds. It does not model the danger threshold or the passage of time. The claim that "past MAX_WAIT, staleness self-resolves" is an external argument from L1 monotonicity (`inclusion_block >= current_safe_block`), not something TLA+ checks.
 
 Any recovery design must wait out this uncertainty. The question is how. The preemptive design (implemented here) forces resolution by going offline and flushing. An alternative optimistic design lets the uncertainty resolve naturally but keeps serving soft confirmations -- see [`history/`](history/) for that approach and why we preferred preemptive.
 
-## Silver-Only Detection
+## Silver-Only for Submitted Batches
 
-Recovery must only cascade-invalidate when the frontier batch is **Silver** (safe on L1). This constraint is shared by all recovery designs and is critical for correctness.
+The Silver-only constraint applies to **submitted batches whose L1 slot outcome is still relevant**. This is the zombie path, and it is where the optimistic-design counterexample from [`history/`](history/) still matters.
 
 A Silver batch's L1 entry is permanent -- no mempool competition can kill it. The scheduler **will** see it, at a `w_nonce` lower than any recovery batch, and be poisoned. This ordering guarantee is what makes nonce poisoning reliable.
 
-Detecting staleness on Pending or Bronze batches is unsafe: a recovery batch can take the frontier's L1 slot via wallet-nonce mutual exclusion, preventing the scheduler from ever seeing the stale frontier, and allowing non-frontier dead batches to pass the nonce check. TLA+ model checking found this bug; see [`history/`](history/) for the counterexample.
+Detecting staleness on Pending or Bronze submitted batches *before wallet-nonce uncertainty is resolved* is unsafe: a recovery batch can take the frontier's L1 slot via wallet-nonce mutual exclusion, preventing the scheduler from ever seeing the stale frontier, and allowing non-frontier dead batches to pass the nonce check. TLA+ model checking found this bug; see [`history/`](history/) for the counterexample.
+
+The open Tip is different. It has no L1 transaction yet, so there is no `w_nonce` competition and no zombie risk. Once `current_safe_block - first_frame_safe_block >= MAX_WAIT_BLOCKS`, startup recovery can invalidate the stale Tip directly and open a fresh one. Likewise, after a preemptive flush has resolved all competing `w_nonce` slots for closed batches, the atomic recovery transaction can safely use **current staleness** on the oldest unresolved batch (closed or open).
 
 ## Preemptive Recovery Design
 
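The three regimes above hinge on two staleness predicates, mirroring `IsStaleByInclusion` and `IsStaleByCurrent` in `preemptive.tla`. A minimal Python sketch (the function names and the `MAX_WAIT_BLOCKS` value are illustrative assumptions, not the real implementation):

```python
# Illustrative sketch only; MAX_WAIT_BLOCKS and these names are assumptions.
MAX_WAIT_BLOCKS = 100

def is_stale_by_inclusion(safe_block: int, inclusion_block: int) -> bool:
    """A submitted batch landed MAX_WAIT_BLOCKS or more blocks past the
    safe block its first frame was built on (mirrors IsStaleByInclusion)."""
    return inclusion_block - safe_block >= MAX_WAIT_BLOCKS

def is_stale_by_current(safe_block: int, current_safe_block: int) -> bool:
    """Every *possible* future inclusion is stale, since L1 monotonicity
    gives inclusion_block >= current_safe_block (mirrors IsStaleByCurrent)."""
    return current_safe_block - safe_block >= MAX_WAIT_BLOCKS
```

Past MAX_WAIT, `is_stale_by_current` holding implies `is_stale_by_inclusion` holds for any inclusion block L1 could still produce, which is the self-resolution argument in the text.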
@@ -166,7 +168,7 @@ There are no more mempool entries. All uncertainty is resolved.
 This is an atomic SQLite transaction operating on fully-finalized L1 state:
 
 1. **Populate gold frontier** (`populate_safe_accepted_batches`): scan L1 safe inputs, simulate scheduler acceptance logic. Learn `schedulerExpected` -- the next batch nonce the scheduler needs.
-2. **Detect staleness**: if the first unaccepted batch is stale by inclusion, cascade-invalidate it and all successors (set `invalidated_at_ms` on each). If nothing is stale, skip to step 6 (Resume).
+2. **Detect staleness**: find the oldest unresolved batch (first closed batch past the accepted frontier, otherwise the open Tip). If its **current staleness** (`current_safe_block - first_frame_safe_block`) has reached `MAX_WAIT_BLOCKS`, cascade-invalidate it and all successors (set `invalidated_at_ms` on each). Closed-batch cascades rely on the preceding flush/safe-head sync to remove wallet-nonce uncertainty; Tip cascades need no flush because the Tip has no L1 slot yet. If nothing is stale, skip to step 6 (Resume).
 3. **Open recovery batch**: fresh batch whose `parent_batch_index` is the last valid ancestor. Its `nonce` is structurally `parent.nonce + 1`, which equals `schedulerExpected`. Re-drain direct inputs from invalidated batches.
 
 ### Step 6: Resume
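Step 2's cascade in the hunk above can be sketched as a small function over the spine. This is a hedged illustration: the `Batch` record, its field names, and `detect_and_cascade` are assumptions for exposition, not the real SQLite schema or code.

```python
from dataclasses import dataclass
from typing import Optional

MAX_WAIT_BLOCKS = 100  # illustrative threshold

@dataclass
class Batch:
    index: int
    closed: bool                       # False for the open Tip
    first_frame_safe_block: int
    accepted: bool                     # already on the gold frontier
    invalidated_at_ms: Optional[int] = None

def detect_and_cascade(spine, current_safe_block, now_ms):
    """Step 2 sketch: take the oldest unresolved batch (first batch past
    the accepted frontier; the open Tip if no closed batch remains). If
    its current staleness has reached MAX_WAIT_BLOCKS, invalidate it and
    every successor. Returns True iff a cascade happened."""
    unresolved = [b for b in spine
                  if not b.accepted and b.invalidated_at_ms is None]
    if not unresolved:
        return False
    oldest = unresolved[0]
    if current_safe_block - oldest.first_frame_safe_block < MAX_WAIT_BLOCKS:
        return False  # nothing stale: skip to step 6 (Resume)
    for b in unresolved:
        b.invalidated_at_ms = now_ms  # cascade: oldest and all successors
    return True
```

Note that the function treats closed batches and the open Tip uniformly, which is exactly what the flush buys: once `w_nonce` slots are resolved, current staleness is a safe test for both.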
@@ -175,23 +177,26 @@ Restart the batch submitter and user-op acceptance. The sequencer is back online
 
 ### Startup behavior
 
-On startup, the sequencer doesn't know whether it was a preemptive shutdown, a spurious restart, or coming online after a long outage. It runs the same detection logic:
+On startup, the sequencer doesn't know whether it was a preemptive shutdown, a spurious restart, or coming online after a long outage. It therefore splits the check in two:
+
+1. **Closed unresolved frontier batch in danger**: run the zombie-path check (`check_danger_zone`). If the first closed batch past the accepted frontier has entered the danger zone, flush (step 3), wait for finality (step 4), then run recovery (step 5).
+2. **No closed batch in danger**: skip the flush and run the atomic recovery transaction directly. This is the normal path on a clean restart, and it is also how startup handles an open Tip that has already crossed `MAX_WAIT_BLOCKS`.
 
-1. **Before the danger zone**: no action needed. Continue normally.
-2. **In the danger zone**: flush (step 3), wait for finality (step 4), then run recovery (step 5).
-3. **Past MAX_WAIT**: staleness has self-resolved, but `w_nonce` slots may still be unresolved (batches pending in the mempool). Flush (step 3) to resolve slots, then run recovery (step 5). The flush is cheap here -- all competing batches are stale anyway.
+This means "danger at startup" is not one unified flow:
 
-Cases 2 and 3 differ in *why* batches are stale (danger zone: they might land stale; past MAX_WAIT: they're guaranteed stale) but follow the same procedure. The flush in case 3 is an optimization concern, not a safety concern: even without flushing, any batch that eventually lands will be stale, so ZombieSafety holds. But `populate_safe_accepted_batches` needs to see all safe L1 entries to compute `schedulerExpected` accurately, so waiting for slot resolution (via flush or naturally) is needed for correct recovery.
+- **Closed unresolved batches** still need the flush because their `w_nonce` slots may contain zombie uncertainty.
+- **An aging open Tip** can be recovered directly because there is no L1 slot to resolve.
+- **Closed unresolved batches already past `MAX_WAIT_BLOCKS`** are guaranteed stale by monotonicity, but the sequencer still flushes before recovery so `populate_safe_accepted_batches` can reconstruct the scheduler frontier from fully resolved safe inputs.
 
-**What TLA+ proves here**: the model does not distinguish these three cases. It proves ZombieSafety assuming all `w_nonce` slots are eventually resolved. The claim that past MAX_WAIT the flush can be replaced by waiting for natural slot resolution is external reasoning from L1 monotonicity.
+**What TLA+ proves here**: the model still abstracts away the full startup cutover/flush decision. It proves ZombieSafety once wallet-nonce slots resolve, and separately models direct recovery of an aging open Tip. The claim that past `MAX_WAIT`, closed-batch staleness self-resolves is external reasoning from L1 monotonicity.
 
 ### L1 unreachability
 
 The danger zone check and the flush both require L1. If L1 is unreachable, the sequencer must decide whether to proceed (before danger zone) or block (in danger zone).
 
-**At startup**: the sequencer attempts to sync the safe head from L1. If this fails, it falls back to a **wall-clock danger estimate**: read the oldest valid batch's `created_at_ms` from the DB, compute `wall_clock_age = (now - created_at) / seconds_per_block`, and compare against the danger threshold. If the estimate is before the danger zone, the sequencer proceeds with stale DB data -- the input reader and batch submitter will catch up when L1 returns. If the estimate is in or past the danger zone, the sequencer refuses to start (it can't safely issue soft confirmations without knowing L1 state).
+**At startup**: the sequencer attempts to sync the safe head from L1. If this fails, it falls back to a **wall-clock danger estimate** based on the persisted last-L1-sync marker: compute `estimated_missed_blocks = (now - last_l1_sync_ms) / seconds_per_block`, adjust the danger threshold downward by that estimate, and run the unresolved-batch danger check against the stale DB view. If the estimate is before the danger zone, the sequencer proceeds with stale DB data -- the input reader and batch submitter will catch up when L1 returns. If the estimate is in or past the danger zone, the sequencer refuses to start (it can't safely issue soft confirmations without knowing L1 state).
 
-**At runtime**: the batch submitter retries on L1 errors (provider failures). On each retry, it runs the same wall-clock estimate: `estimated_missed_blocks = (now - last_l1_success) / seconds_per_block`. It adjusts the danger threshold downward by this estimate. If the adjusted check triggers, the batch submitter crashes for recovery. This ensures the sequencer doesn't keep issuing soft confirmations while disconnected from L1 long enough to cross the danger zone.
+**At runtime**: the batch submitter retries on L1 errors (provider failures). On each retry, it runs the same wall-clock estimate: `estimated_missed_blocks = (now - last_l1_sync_ms) / seconds_per_block`. It adjusts the danger threshold downward by this estimate. If the adjusted check triggers, the batch submitter crashes for recovery. This ensures the sequencer doesn't keep issuing soft confirmations while disconnected from L1 long enough to cross the danger zone.
 
 **Other workers during L1 outages**: the inclusion lane and API are purely local (SQLite) and continue operating. The input reader retries L1 polling with error logging. All L1-dependent workers log errors at the `error` level to alert operators.
 
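The wall-clock fallback described in the hunk above amounts to tightening the danger threshold by the estimated number of missed L1 blocks. A hedged sketch (parameter names are assumptions, and the real check runs against DB state, not scalars):

```python
def danger_check_offline(now_ms, last_l1_sync_ms, seconds_per_block,
                         danger_threshold_blocks, oldest_unresolved_age_blocks):
    """Wall-clock danger estimate: blocks L1 has likely produced since the
    last successful sync shrink the remaining threshold; the check then
    runs against the stale DB view of the oldest unresolved batch."""
    estimated_missed_blocks = (now_ms - last_l1_sync_ms) // (1000 * seconds_per_block)
    adjusted_threshold = danger_threshold_blocks - estimated_missed_blocks
    return oldest_unresolved_age_blocks >= adjusted_threshold
```

If this returns True at startup, the sequencer refuses to start; at runtime, the batch submitter crashes into recovery. Either way, soft confirmations stop before the uncertainty window can be crossed blind.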
@@ -230,9 +235,9 @@ The recovery design is verified with bounded TLA+ model checking. The canonical
 
 ### `preemptive.tla` -- Slot-level safety under adversarial flush
 
-Models the core slot-level mechanics of preemptive recovery. At every `w_nonce` slot, L1 non-deterministically includes the spine batch OR a flush no-op (killing the batch). This covers the case where the frontier batch itself is killed during flush.
+Models the core slot-level mechanics of preemptive recovery. At every `w_nonce` slot, L1 non-deterministically includes the spine batch OR a flush no-op (killing the batch). This covers the case where the frontier batch itself is killed during flush. The model also treats the open Tip's `safe_block` as meaningful, so it can explicitly recover an aging Tip that has no L1 footprint yet.
 
-The model is a **safety over-approximation**: it allows `AdvanceTip` and `SubmitBatch` to interleave freely with recovery, which the real protocol prevents (the sequencer goes offline). This makes the proof stronger -- if `ZombieSafety` holds under more interleavings, it holds under fewer. However, the model does not verify the sequential protocol phases (cutover, flush, wait, recover, resume) described above.
+The model is a **safety over-approximation**: it allows `AdvanceTip` and `SubmitBatch` to interleave freely with recovery, which the real protocol prevents (the sequencer goes offline). This makes the proof stronger -- if `ZombieSafety` holds under more interleavings, it holds under fewer. However, the model does not verify the full sequential protocol phases (cutover, flush, wait, recover, resume) described above; in particular, the startup decision of whether a closed unresolved batch must flush before recovery remains an external argument layered on top of the slot-level proof.
 
 **Verified**: 157M states, 0 violations.
 
docs/recovery/preemptive.tla

Lines changed: 40 additions & 30 deletions
@@ -11,7 +11,9 @@
  * so schedulerExpected stays stuck at its batch_nonce. All subsequent
  * batches — whether alive on L1 or dead — have wrong nonces.
  * Recovery resubmits the killed batch; if stale by inclusion, Resolve
- * cascades; if fresh, the scheduler accepts it.
+ * cascades; if fresh, the scheduler accepts it. Resolve can also
+ * discard an aging open Tip whose current-safe-block age has reached
+ * MAX_WAIT_BLOCKS.
  *
  * Colors on the spine: Gold* Silver* Bronze* Pending* Tip
  * During flush, SpineOrdering can be temporarily violated (a killed
@@ -31,7 +33,7 @@
  * AdvanceSafeBlock -- L1 safe block advances, Bronze -> Silver
  * SchedulerStep -- scheduler processes next safe entry -> Gold
  * SchedulerSkip -- scheduler skips gap (no-op slot)
- * Resolve -- Silver frontier stale -> cascade, recover
+ * Resolve -- stale unresolved frontier -> cascade, recover
  *)
 
 EXTENDS Integers, Sequences, FiniteSets
@@ -107,6 +109,8 @@ SilverAtBN(s, bn) ==
 
 IsStaleByInclusion(b) == b.inclusion_block - b.safe_block >= MAX_WAIT_BLOCKS
 
+IsStaleByCurrent(b) == currentSafeBlock - b.safe_block >= MAX_WAIT_BLOCKS
+
 ---------------------------------------------------------------------------
 (* Invariants *)
 
@@ -163,6 +167,10 @@ Inv ==
  * This is a modeling technique that eliminates the nonce-0 edge
  * case, allowing Resolve to use uniform logic. The implementation
  * can handle nonce-0 however is simplest (see README.md).
+ *
+ * Tip.safe_block models the first frame's safe_block of the open batch.
+ * Keeping it meaningful lets the spec represent a Tip that ages past
+ * MAX_WAIT_BLOCKS before ever getting an L1 transaction.
  *)
 Init ==
     /\ spine = <<[index |-> 0, color |-> Gold, safe_block |-> 0,
@@ -188,26 +196,26 @@ AdvanceTip ==
     /\ nextIndex <= MaxBatchIndex
     /\ LET tipPos == Len(spine) IN
         /\ spine[tipPos].color = Tip
-        /\ \E sb \in 0..currentSafeBlock :
-            /\ (tipPos > 1 => sb >= spine[tipPos - 1].safe_block)
-            /\ spine' = [i \in 1..Len(spine) + 1 |->
-                IF i < tipPos THEN spine[i]
-                ELSE IF i = tipPos
-                     THEN [index |-> spine[tipPos].index,
-                           color |-> Pending,
-                           safe_block |-> sb,
-                           inclusion_block |-> 0,
-                           w_nonce |-> NONE,
-                           batch_nonce |-> tipPos - 1]
-                     ELSE [index |-> nextIndex,
-                           color |-> Tip,
-                           safe_block |-> 0,
-                           inclusion_block |-> 0,
-                           w_nonce |-> NONE,
-                           batch_nonce |-> 0]]
-            /\ invalid' = [i \in 1..Len(spine) + 1 |->
-                IF i <= Len(spine) THEN invalid[i] ELSE 0]
-            /\ nextIndex' = nextIndex + 1
+        /\ spine[tipPos].safe_block <= currentSafeBlock
+        /\ (tipPos > 1 => spine[tipPos].safe_block >= spine[tipPos - 1].safe_block)
+        /\ spine' = [i \in 1..Len(spine) + 1 |->
+            IF i < tipPos THEN spine[i]
+            ELSE IF i = tipPos
+                 THEN [index |-> spine[tipPos].index,
+                       color |-> Pending,
+                       safe_block |-> spine[tipPos].safe_block,
+                       inclusion_block |-> 0,
+                       w_nonce |-> NONE,
+                       batch_nonce |-> tipPos - 1]
+                 ELSE [index |-> nextIndex,
+                       color |-> Tip,
+                       safe_block |-> currentSafeBlock,
+                       inclusion_block |-> 0,
+                       w_nonce |-> NONE,
+                       batch_nonce |-> 0]]
+        /\ invalid' = [i \in 1..Len(spine) + 1 |->
+            IF i <= Len(spine) THEN invalid[i] ELSE 0]
+        /\ nextIndex' = nextIndex + 1
     /\ UNCHANGED <<currentSafeBlock, walletNonce, nextL1Slot,
                    l1Included, schedulerCursor, schedulerExpected,
                    deadBatches>>
@@ -369,12 +377,14 @@ SchedulerSkip ==
 
 ---------------------------------------------------------------------------
 (*
- * Resolve: the frontier Silver is stale -> cascade-invalidate.
+ * Resolve: the oldest unresolved batch is definitely stale ->
+ * cascade-invalidate.
  *
- * The frontier must be Silver (safe on L1). After the flush, this
- * is either the first unaccepted batch (it survived the flush but
- * is stale by inclusion), or a resubmitted batch that was killed
- * during flush and resubmitted.
+ * Two cases are modeled:
+ *  1. the frontier unresolved batch is Silver and stale by inclusion
+ *     (the submitted-batch zombie path), or
+ *  2. the frontier unresolved batch is Tip and stale by currentSafeBlock
+ *     (the aging open-batch path).
  *
  * Cascade-invalidated batches already on L1 (Silver/Bronze) remain
  * in l1Included. Submitted Pendings become dead batches.
@@ -390,8 +400,8 @@ Resolve ==
     /\ nextIndex <= MaxBatchIndex
     /\ LET fng == FirstNonGold(spine) IN
         /\ fng > 1
-        /\ spine[fng].color = Silver
-        /\ IsStaleByInclusion(spine[fng])
+        /\ ((spine[fng].color = Silver /\ IsStaleByInclusion(spine[fng]))
+            \/ (spine[fng].color = Tip /\ IsStaleByCurrent(spine[fng])))
         /\ LET newLen == fng
                newDead ==
                    {[batch_nonce |-> spine[i].batch_nonce,
@@ -404,7 +414,7 @@ Resolve ==
                 IF i < fng THEN spine[i]
                 ELSE [index |-> nextIndex,
                       color |-> Tip,
-                      safe_block |-> 0,
+                      safe_block |-> currentSafeBlock,
                       inclusion_block |-> 0,
                       w_nonce |-> NONE,
                       batch_nonce |-> 0]]
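The two-case Resolve guard in the hunks above can be paraphrased outside TLA+. A sketch for readers skimming the diff (the spec remains the source of truth; names here are illustrative):

```python
def resolve_enabled(color, safe_block, inclusion_block,
                    current_safe_block, max_wait):
    """Paraphrase of the Resolve guard's disjunction: a Silver frontier
    stale by inclusion (the submitted-batch zombie path), or an open Tip
    stale by the current safe block (the aging open-batch path)."""
    stale_by_inclusion = inclusion_block - safe_block >= max_wait
    stale_by_current = current_safe_block - safe_block >= max_wait
    return ((color == "Silver" and stale_by_inclusion)
            or (color == "Tip" and stale_by_current))
```

Note the asymmetry: only Silver frontiers use inclusion staleness (their L1 slot outcome is final), while the Tip, which has no L1 slot, uses current staleness directly.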
