Commit d879b3b

execution/commitment: remove inert parallel-commitment warmup; refresh design doc (erigontech#22042)
- Remove the inert `EnvWarmupParallelProcess` warmup from `ParallelPatriciaHashed.Process` (never fed — `WarmKey` runs only on the sequential `HashSort` path), plus the dead flag.
- Refresh `docs/design/parallel-patricia-hashed.md` for the erigontech#21945 deep storage fold.
1 parent bf63430 commit d879b3b

3 files changed

Lines changed: 109 additions & 63 deletions

common/dbg/experiments.go

Lines changed: 0 additions & 4 deletions
@@ -46,10 +46,6 @@ var (
 
 	StagesOnlyBlocks = EnvBool("STAGES_ONLY_BLOCKS", false)
 
-	// EnvWarmupParallelProcess gates branch-cache warmup inside the parallel
-	// commitment Process. Off by default; opt in to prefetch under measurement.
-	EnvWarmupParallelProcess = EnvBool("ERIGON_WARMUP_PARALLEL_PROCESS", false)
-
 	MdbxLockInRam = EnvBool("MDBX_LOCK_IN_RAM", false)
 	MdbxNoSync = EnvBool("MDBX_NO_FSYNC", false)
 	MdbxNoSyncUnsafe = EnvBool("MDBX_NO_FSYNC_UNSAFE", false)

docs/design/parallel-patricia-hashed.md

Lines changed: 109 additions & 44 deletions
@@ -27,20 +27,29 @@ nibble that folds that child's subtree into a single cell concurrently, and
 re-folds the merged root row on the main goroutine — a *mount/fold* model driven
 from the touched-key prefix trie.
 
-The fold is **single-level**: one worker per touched root nibble (≤ 16). A whole
-child subtree — for example one account's entire storage — folds on a single
-worker; nested mounting that would parallelise within a subtree is future work
-(§11).
+The top level mounts one worker per touched root nibble (≤ 16). A second level
+handles the case where the work concentrates inside one subtree: when
+a worker reaches a *big-storage account* (> `deepStorageThreshold` touched storage
+keys across ≥ 2 first-storage nibbles) it folds that account's storage subtree
+concurrently — one worker per touched first-storage nibble — instead of streaming
+it serially. This *deep storage fold* (§4.1.1) is the same mount/fold primitive
+applied one level down. Splitting deeper than the first storage nibble is future
+work (§11).
 
 ## 2. Preliminaries *(informative)*
 
 `HexPatriciaHashed` keeps a `grid[128][16]cell` (one row per nibble depth) and a
-`currentKey`. Per sorted batch it unfolds down to the next key (loading branches
+`currentKey`. Account keys occupy depths 0–64 (the leaf carries the storage root);
+storage keys continue to depths 64–128. The same unfold/apply/fold codepath drives
+both — there is no separate storage trie. Per sorted batch it unfolds down to the
+next key (loading branches
 from the `PatriciaContext`), applies the update, and folds completed rows upward,
 hashing each branch and writing it via `PatriciaContext.PutBranch`; the final fold
 to row 0 yields the root. A branch hash mixes **every present nibble** of the
 branch, not only the touched ones — the property §4 must preserve under
-partitioning, here by unfolding the shared root row from the DB before mounting.
+partitioning, by unfolding any row a fold collapses from the DB first: the shared
+root row before mounting, and a whale's storage-root row before the deep fold
+(§4.1.1).
 
 ## 3. Data structures
 
@@ -83,22 +92,24 @@ is guarded by `deferredMu` (`appendDeferred`).
 ### 3.3 `ParallelPatriciaHashed` (`parallel_patricia_hashed.go`)
 
 Holds a configuration/base `template *HexPatriciaHashed`, a `sync.Pool` of worker
-tries, a `TrieContextFactory`, `numWorkers`, the published `rootHash`, and — for
-the deferred path — a `leaveDeferredForCaller` flag with a `deferredForCaller`
-hand-off slice. The `template` doubles as the **mount base** during `Process`: it
-is unfolded to the root branch, the workers' folded cells are dropped into its row
-0, and it folds the merged root. (Outside `Process` it exposes
-ctx/cache/metrics/trace configuration only.)
+tries, a `TrieContextFactory`, the `cfg TrieConfig` and `accountKeyLen` used to mint
+pooled workers, `numWorkers`, the published `rootHash`, and — for the deferred path
+— a `leaveDeferredForCaller` flag with a `deferredForCaller` hand-off slice. An
+optional `streaming *StreamingCommitter`: when set, `Process` delegates to it
+(`processStreaming`, §10) and the mount path below is not used. The `template`
+doubles as the **mount base** during `Process`: it is unfolded to the root branch,
+the workers' folded cells are dropped into its row 0, and it folds the merged root.
+(Outside `Process` it exposes ctx/cache/metrics/trace configuration only.)
 
 ## 4. Pipeline
 
 | phase | site | action |
 | --- | --- | --- |
 | 1. Touch | `Updates.TouchPlainKey` (ModeParallel) | insert each hashed key into the prefix trie, carrying its `plainKey`/`update` on the terminating node; no ETL collectors are used |
-| 2. Mount + fold | `processMounted`, concurrent | unfold the base to the root branch; mount a worker per touched root nibble; each folds its child subtree into a cell; drop the cells back into the base row and fold the merged root |
+| 2. Mount + fold | `processMounted`, concurrent | unfold the base to the root branch; mount a worker per touched root nibble; each folds its child subtree into a cell (a big-storage account's storage folds concurrently, §4.1.1); drop the cells back into the base row and fold the merged root |
 | 3. Commit | `Process` end | apply (or hand off) the merged deferred branch updates; publish the root |
 
-### 4.1 Phase 2 — Mount and fold (`processMounted`, `dfsSubtree`)
+### 4.1 Phase 2 — Mount and fold (`processMounted`, `dfsSubtreeDeep`)
 
 1. **Unfold the base.** `processMounted` unfolds `template` down to the root
 branch (`needUnfolding`/`unfold` on the zero prefix), loading the on-disk root
@@ -110,17 +121,24 @@ ctx/cache/metrics/trace configuration only.)
 `*HexPatriciaHashed`, calls `mountTo(base, nibble)` — inheriting a copy of the
 base's unfolded grid and sharing the base root cell read-only — and binds it to
 a fresh factory `PatriciaContext` with deferred branch writes enabled.
-3. **Build.** `dfsSubtree(child, [nibble]+child.ext)` walks the nibble's subtree in
-nibble-ascending order. At each terminating node it reconstructs the full hashed
-key, reads the `plainKey`/`update` off the node, and calls
+3. **Build.** `dfsSubtreeDeep(child, [nibble]+child.ext)` walks the nibble's subtree
+in nibble-ascending order. At each terminating node it reconstructs the full
+hashed key, reads the `plainKey`/`update` off the node, and calls
 `followAndUpdate`. A node MUST emit its own key **before** descending to its
 children, so an account at depth 64 precedes its storage keys — the sorted order
 the fold state machine requires (I4). A terminating node with a nil `plainKey`
 and no children is unsupported and MUST raise an error rather than be skipped.
-4. **Fold the mount.** `foldMounted(nibble)` folds the worker's subtree into a
-single cell at the nibble's child depth, stopping before it would absorb the
-shared base root row. The worker's deferred branch updates are appended to the
-shared accumulator and the worker is returned to the pool.
+When a node is a *big-storage account* (`isDeepStorageAccount`: depth 64, plain
+key set, ≥ 2 first-storage nibbles, `subtreeCount > deepStorageThreshold`) the
+walk does **not** descend into its storage children: it computes the storage root
+via the deep storage fold (§4.1.1) and injects it into the account leaf
+(`setAccountStorageRoot`).
+4. **Fold the mount.** `foldMounted(nibble)` folds the worker's subtree upward,
+stopping when it reaches `mountWall` — the depth `mountTo` records as
+`split-depth + 1`, so the top-level mount stops at depth 1, before it would absorb
+the shared base root row — and returns `grid[0][nibble]`. The worker's deferred
+branch updates are appended to the shared accumulator and the worker is returned
+to the pool.
 5. **Merge and fold root.** On the main goroutine, after `errgroup.Wait`, each
 folded cell is dropped into `base.grid[0][nibble]` (stripping the leading nibble
 a hash-only sub-branch carries in its extension) and the touch/after maps are
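Editor's note on the hunk above: the big-storage test named in step 3 reduces to four conditions. The following is a minimal sketch of that predicate, not the code in the repository. The `trieNode` struct, its field names, and the `isDeepStorageAccountSketch` name are hypothetical; only the four conditions and the default threshold of 1000 come from the design doc.

```go
package main

import "fmt"

// deepStorageThreshold uses the default of 1000 listed in the doc's
// configuration table.
const deepStorageThreshold = 1000

// trieNode is a hypothetical stand-in for a prefix-trie node. The field
// names are assumptions made for this sketch, not the real struct layout.
type trieNode struct {
	depth               int    // nibble depth of the node
	plainKey            []byte // empty when no plain key is attached
	firstStorageNibbles int    // distinct touched first-storage nibbles
	subtreeCount        int    // touched storage keys below the node
}

// isDeepStorageAccountSketch mirrors the four conditions the doc lists:
// depth 64, plain key set, at least 2 first-storage nibbles, and more
// touched storage keys than the threshold.
func isDeepStorageAccountSketch(n trieNode) bool {
	return n.depth == 64 &&
		len(n.plainKey) > 0 &&
		n.firstStorageNibbles >= 2 &&
		n.subtreeCount > deepStorageThreshold
}

func main() {
	whale := trieNode{depth: 64, plainKey: []byte{0x01}, firstStorageNibbles: 5, subtreeCount: 250000}
	atThreshold := trieNode{depth: 64, plainKey: []byte{0x01}, firstStorageNibbles: 5, subtreeCount: 1000}
	oneNibble := trieNode{depth: 64, plainKey: []byte{0x01}, firstStorageNibbles: 1, subtreeCount: 250000}

	// Prints: true false false
	fmt.Println(
		isDeepStorageAccountSketch(whale),
		isDeepStorageAccountSketch(atThreshold),
		isDeepStorageAccountSketch(oneNibble),
	)
}
```

Note that the comparison is strict, so an account with exactly `deepStorageThreshold` touched keys still streams inline.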
@@ -132,6 +150,39 @@ the unfolded grid; only the main goroutine mutates the base after `Wait`. There
 no fold-time barrier and no cross-worker synchronisation beyond the deferred-update
 mutex.
 
+### 4.1.1 Deep storage fold (`foldStorageRoot`, `streaming_deep_fold.go`)
+
+A big-storage account's storage subtree (a "whale") would otherwise fold serially on
+its top-nibble worker. `foldStorageRoot` folds it concurrently, applying the §4.1
+mount/fold model one level down at depth 64. It runs the same primitives
+(`mountTo`/`foldMounted`/`followAndUpdate`) and is shared verbatim by the streaming
+variant (§10).
+
+1. **Unfold the storage-root branch.** `unfoldStorageBase(base, accHash[:64])` seeds a
+base worker by reading the account's on-disk storage-root branch
+(`branchFromCacheOrDB` + `decodeBranchIntoRow` — the same decode the account unfold
+`unfoldBranchNode` uses, entered manually at depth 64 instead of by recursive
+descent). This is I2 applied at depth 64:
+untouched on-disk first-storage-nibble siblings MUST be present before the storage
+root is folded, or they are dropped and the storage root — hence the state root —
+diverges (see I2).
+2. **Fold per first-storage nibble.** One errgroup worker per touched first-storage
+nibble: `foldStorageLeaf` mounts the shared base at that nibble (`mountWall = 65`),
+streams the nibble's sorted slots, and `foldMounted` returns the depth-65 child
+cell. Workers defer their branch writes into the shared accumulator. They own
+disjoint storage prefixes, so concurrent reads of the shared base are race-free.
+3. **Aggregate.** `aggregateMountedStorageRoot` overlays the folded child cells onto
+the unfolded base row (setting/clearing each touched present bit, leaving untouched
+on-disk siblings intact) and folds once to the account's storage-root cell.
+4. **Inject.** `setAccountStorageRoot` writes that hash into the account leaf
+(`cell.hash`, `hashLen = 32`); `computeCellHash` uses it as the storageRoot for an
+account whose storage cell was not streamed, so the leaf hashes identically to the
+serial path. The DFS then skips the account's storage children.
+
+Below `deepStorageThreshold`, or with storage in a single first nibble, the account
+streams inline as in §4.1 — the per-account setup cost (a pooled worker, a fresh
+context, the storage-root unfold) only pays off for genuinely large storage.
+
 ### 4.2 Phase 3 — Commit and root publication
 
 Workers accumulate `DeferredBranchUpdate`s rather than writing branches. After the
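Editor's note on §4.1.1 in the hunk above: the shape of the per-nibble fold and the aggregate step can be pictured with a toy model. This is a sketch under loud assumptions, not the repository's `foldStorageRoot`. The "cell" is a plain SHA-256 over a nibble's sorted slot names, the worker pool is a `sync.WaitGroup` rather than an errgroup, and every function name here (`foldNibble`, `aggregate`, `foldConcurrent`, `foldSerial`) is hypothetical. It shows only that workers owning disjoint nibbles can fold independently and still reproduce the serial result once the cells are combined in nibble order.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"sync"
)

// foldNibble is a toy stand-in for folding one first-storage-nibble subtree
// into a single cell: it hashes the nibble's slots in sorted order.
func foldNibble(slots []string) [32]byte {
	sorted := append([]string(nil), slots...)
	sort.Strings(sorted)
	h := sha256.New()
	for _, s := range sorted {
		h.Write([]byte(s))
		h.Write([]byte{0}) // separator so slot boundaries are unambiguous
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// aggregate combines the present child cells, in nibble order, into one
// toy storage root.
func aggregate(cells [16][32]byte, present [16]bool) [32]byte {
	h := sha256.New()
	for n := 0; n < 16; n++ {
		if present[n] {
			h.Write([]byte{byte(n)})
			h.Write(cells[n][:])
		}
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// foldConcurrent runs one goroutine per touched nibble. Each goroutine
// writes only its own array element, so no lock is needed.
func foldConcurrent(byNibble map[int][]string) [32]byte {
	var cells [16][32]byte
	var present [16]bool
	var wg sync.WaitGroup
	for n, slots := range byNibble {
		present[n] = true
		wg.Add(1)
		go func(n int, slots []string) {
			defer wg.Done()
			cells[n] = foldNibble(slots)
		}(n, slots)
	}
	wg.Wait()
	return aggregate(cells, present)
}

// foldSerial computes the same toy root on a single goroutine.
func foldSerial(byNibble map[int][]string) [32]byte {
	var cells [16][32]byte
	var present [16]bool
	for n, slots := range byNibble {
		present[n] = true
		cells[n] = foldNibble(slots)
	}
	return aggregate(cells, present)
}

func main() {
	touched := map[int][]string{
		0x3: {"slot-b", "slot-a"},
		0xa: {"slot-c"},
	}
	// Prints: true
	fmt.Println(foldConcurrent(touched) == foldSerial(touched))
}
```

The toy omits what makes the real fold hard: unfolding the on-disk storage-root row first, and deferring branch writes into a shared accumulator.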
@@ -155,9 +206,11 @@ primary enforcement of I1.
 - **I1 — Equal root.** The published root equals the sequential root for every
 input (R1).
 - **I2 — Untouched-nibble preservation.** Because a branch hash mixes all present
-nibbles, the shared root row MUST be unfolded from `ctx.Branch` before mounting,
-so untouched on-disk siblings are present in `grid[0]` when the merged root is
-folded; workers write only their own touched subtree.
+nibbles, every branch row a fold collapses MUST first be unfolded from `ctx.Branch`
+so untouched on-disk siblings are present. This holds at two depths: the shared
+root row before mounting (`processMounted`), and each big-storage account's
+storage-root branch before the deep fold (`unfoldStorageBase`, §4.1.1). Dropping
+either unfold drops untouched siblings and diverges the root.
 - **I3 — `plainKey` follows the split.** `prefixTrie.Insert` MUST route a
 terminator's `plainKey` to the correct node across path-compression splits (§3.1).
 A misroute is a wrong DB read and a diverged root.
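Editor's note on I2 in the hunk above: the reason a row must be unfolded before it is folded can be shown with a toy branch hash. This is an illustrative sketch only. The `row` type and the `branchHash` and `overlay` helpers are hypothetical, and SHA-256 stands in for the real cell hashing. The point it demonstrates is the one the doc states: the hash mixes every present nibble, so starting from an empty row silently drops untouched on-disk siblings.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// row is a toy 16-way branch row; a nil entry means the nibble is absent.
type row [16][]byte

// branchHash mixes every present nibble of the row, touched or not.
func branchHash(r row) [32]byte {
	h := sha256.New()
	for n, child := range r {
		if child != nil {
			h.Write([]byte{byte(n)})
			h.Write(child)
		}
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// overlay returns a copy of base with one nibble replaced.
func overlay(base row, nibble int, child []byte) row {
	base[nibble] = child // base is an array, so this mutates the copy only
	return base
}

func main() {
	var onDisk row
	onDisk[2] = []byte("old-2")
	onDisk[7] = []byte("old-7") // untouched sibling

	// Correct: unfold the on-disk row first, then overlay the touched nibble.
	withUnfold := overlay(onDisk, 2, []byte("new-2"))

	// Incorrect: skip the unfold and start from an empty row.
	withoutUnfold := overlay(row{}, 2, []byte("new-2"))

	// Prints: false
	fmt.Println(branchHash(withUnfold) == branchHash(withoutUnfold))
}
```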
@@ -172,6 +225,11 @@ primary enforcement of I1.
 base root cell read-only; the merged base row is folded only on the main
 goroutine after `errgroup.Wait`, so concurrent structure/`plainKey` reads and the
 final fold are race-free.
+- **I7 — Deep fold equals inline stream.** For a big-storage account, the storage
+root from `foldStorageRoot` injected via `setAccountStorageRoot` MUST equal the
+root the serial inline stream would produce. Its per-first-nibble workers own
+disjoint storage prefixes, share the unfolded storage base read-only, and each
+defers its own branch writes (I5).
 
 ## 6. Integration contract
 
@@ -212,16 +270,15 @@ substitution of the as-of reader is validated at runtime by the block-root check
 | --- | --- | --- |
 | `--experimental.parallel-commitment` | off | selects `VariantParallelHexPatricia` (`execctx.PickTrieVariant`) |
 | `--experimental.streaming-commitment` | off | selects `VariantStreamingHexPatricia` (`StreamingCommitter`); takes precedence over `--experimental.parallel-commitment` |
-| `ERIGON_WARMUP_PARALLEL_PROCESS` | off (env) | opt-in branch-cache prefetch inside the parallel/streaming `Process`; intended for measurement |
-| `deepStorageThreshold` | 1000 | touched-slot count above which an account's storage subtree folds concurrently (split at the first storage nibble); mitigates the whale bottleneck of §11 |
+| `deepStorageThreshold` | 1000 | compile-time const (not a runtime flag): per-account touched-storage-key count above which the storage subtree folds concurrently (§4.1.1); mitigates the whale bottleneck of §11 |
 | `numWorkers` | `runtime.NumCPU()` | worker-pool size and errgroup limit; override via `SetNumWorkers` |
 
 ## 8. Failure modes
 
 | condition | behaviour |
 | --- | --- |
 | empty update set | return the template's existing root (matches the sequential no-op) |
-| base root carries an extension (`root.ext != 0`) | return an error — not yet supported by the single-level mount |
+| base root carries an extension (`root.ext != 0`) | return an error — not yet supported by the mount path |
 | terminating node with nil `plainKey` and no children | return an error (only reachable via a hashed-only `TouchHashedKey`; that path is not wired for the parallel trie) |
 | deferred apply failure (inline path) | discard the staged root; never surface an unpersisted root |
 | worker error mid-fold | cancel the group; return pooled deferred entries |
@@ -245,8 +302,8 @@ in scheduling.
 | --- | --- | --- |
 | flag | (default) | `--experimental.parallel-commitment` |
 | `Updates` mode | `ModeDirect` / `ModeUpdate` | `ModeParallel` |
-| parallel unit | none | one worker per **touched** top nibble (≤16) |
-| split granularity | none | touched top nibbles at depth 1 (single-level) |
+| parallel unit | none | one worker per **touched** top nibble (≤16), plus one per first-storage nibble inside a big-storage account |
+| split granularity | none | touched top nibbles at depth 1, and first-storage nibbles at depth 64 for big-storage accounts (§4.1.1) |
 | merge | single bottom-up fold | per-mount cells dropped into the base row, single root fold |
 | branch writes | inline | deferred, applied once or handed to the caller |
 | key delivery | one sorted stream | prefix trie carrying `plainKey`/`update` |
@@ -262,29 +319,37 @@ set, never a persistent per-split hph mutated by touches — that would break th
 monotonic `followAndUpdate` contract). It uses the `streaming` flag on `Updates`
 (not a new `Mode`).
 
-Big-storage accounts run a streaming-local deep walk (`dfsDeepLocal`) instead of
-calling the parallel path's deep fold — `parallel_mount.go` is left untouched for
-isolation. See `execution/commitment/streaming_commitment.go`.
+Big-storage accounts take the **same** deep storage fold (§4.1.1): each split's
+`foldSplit` runs `dfsSubtreeDeep` with `foldStorageRoot`, shared verbatim from
+`streaming_deep_fold.go` — the deep-fold logic is not duplicated. See
+`execution/commitment/streaming_commitment.go`.
 
 ## 11. Performance characteristics *(informative)*
 
-The fold is single-level: parallelism is bounded by the number of **touched** root
-nibbles (≤ 16) and, within each, by the cost of folding that child's whole subtree
-on one worker. Batches whose work concentrates inside one subtree — for example a
-single "whale" account with hundreds of thousands of storage slots — used to fold
-serially on one worker; accounts above `deepStorageThreshold` touched slots now
-fold their storage subtree concurrently (one worker per first storage nibble).
-Splitting deeper than the first storage nibble is future work. The benchmark `MockState`
-serializes reads on a shared lock and therefore under-reports the parallel speedup
-relative to production's independent per-worker MDBX readers. These figures are for
-inspection, not a CI gate.
+Top-level parallelism is bounded by the number of **touched** root nibbles (≤ 16);
+within a big-storage account the deep fold (§4.1.1) adds a second level bounded by
+its touched first-storage nibbles (≤ 16), so a single "whale" account with hundreds
+of thousands of storage slots folds across up to 16 workers instead of one.
+
+At `numWorkers = NumCPU` the parallel commitment is effectively core-bound: worker
+budget beyond NumCPU buys little, and lowering `deepStorageThreshold` to detach
+medium accounts costs more in per-account setup (a pooled worker, a fresh context,
+the storage-root unfold) than the extra split saves. Splitting deeper than the first
+storage nibble, and detaching storage below the whale threshold, are not currently
+worthwhile.
+
+The benchmark `MockState` serializes reads on a shared lock and therefore
+under-reports the parallel speedup relative to production's independent per-worker
+MDBX readers; figures are for inspection, not a CI gate.
 
 ## 12. Source map
 
 | file | contents |
 | --- | --- |
-| `execution/commitment/parallel_patricia_hashed.go` | `ParallelPatriciaHashed`, `Process`, `dfsSubtree`, deferred apply and hand-off |
-| `execution/commitment/parallel_mount.go` | `processMounted` — unfold, per-nibble mount/fold, merged root fold; `mountTo` mount primitive |
+| `execution/commitment/parallel_patricia_hashed.go` | `ParallelPatriciaHashed`, `Process` (routes to `processStreaming` when a committer is set), `dfsSubtree`, deferred apply and hand-off |
+| `execution/commitment/parallel_mount.go` | `processMounted` — unfold, per-nibble mount/fold via `dfsSubtreeDeep`, merged root fold; `mountTo`; `setAccountStorageRoot`; `deepStorageThreshold` |
+| `execution/commitment/streaming_deep_fold.go` | the deep storage fold shared by the parallel and streaming paths: `dfsSubtreeDeep`, `isDeepStorageAccount`, `foldStorageRoot`, `unfoldStorageBase`, `foldStorageLeaf`, `aggregateMountedStorageRoot` |
+| `execution/commitment/hex_patricia_hashed.go` | sequential engine; `foldMounted` and the `mountWall` stop used by both fold levels |
 | `execution/commitment/parallel_update.go` | `parallelUpdate`, `plainKeyArena`, `Insert`/deferred accumulation |
 | `execution/commitment/prefix_trie.go` | path-compressed prefix trie + slab arena; `Insert` `plainKey` placement |
 | `execution/commitment/commitment.go` | `Updates` (ModeParallel carries keys in the prefix trie), `InitializeTrieAndUpdates` |

execution/commitment/parallel_patricia_hashed.go

Lines changed: 0 additions & 15 deletions
@@ -24,8 +24,6 @@ import (
 	"runtime"
 	"sync"
 	"sync/atomic"
-
-	"github.com/erigontech/erigon/common/dbg"
 )
 
 // ParallelPatriciaHashed is the trie-side of the parallel commitment pipeline.
@@ -259,16 +257,6 @@ func (p *ParallelPatriciaHashed) Process(
 		return rh, nil
 	}
 
-	var warmuper *Warmuper
-	if warmup.Enabled && dbg.EnvWarmupParallelProcess {
-		if warmup.CtxFactory == nil {
-			warmup.CtxFactory = p.trieCtxFactory
-		}
-		warmuper = NewWarmuper(ctx, warmup)
-		warmuper.Start()
-		defer warmuper.CloseAndWait()
-	}
-
 	rh, mErr := p.processMounted(ctx, updates)
 	if mErr != nil {
 		pu.deferredMu.Lock()
@@ -292,9 +280,6 @@ func (p *ParallelPatriciaHashed) Process(
 	out := make([]byte, len(rh))
 	copy(out, rh)
 	p.rootHash.Store(&out)
-	if warmuper != nil {
-		warmuper.DrainPending()
-	}
 	flushTrieStateRates()
 	return out, nil
 }
