Commit d32025c

Merge branch 'main' of github.com:erigontech/erigon into worktree-spec-test-ci
2 parents 6ca4231 + 047644d commit d32025c

31 files changed

Lines changed: 2026 additions & 509 deletions
Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
---
name: bal-devnet-ab-test
description: A/B test erigon's BAL parallel-execution scheduling on any BAL devnet (bal-devnet-N). Uses launch-devnet to bring up the primary instance, then spins up a second instance with IGNORE_BAL=true on a different port offset so throughput can be compared head-to-head. Use this when the user wants to compare BAL vs non-BAL parallel execution.
allowed-tools: Bash, Read, Write, Edit, Glob
allowed-prompts:
  - tool: Bash
    prompt: start, stop, and manage two parallel erigon+CL instances for BAL A/B testing
---

# BAL A/B Testing Wrapper

Compare erigon's parallel-execution throughput with vs without BAL (Block Access List) scheduling on a `bal-devnet-N` network. The flow is:

- **Instance A (BAL)**: default behaviour. Uses BAL hints from blocks to pre-populate version maps and schedule transactions optimistically.
- **Instance B (No-BAL)**: same code, but with `IGNORE_BAL=true` set in the environment, forcing the dependency-tracking scheduling path.

Both instances sync the same chain so `gas/s`, `repeat%`, `abort`, and `invalid` counters can be compared directly at chain tip.

## Prerequisite: bring up Instance A

Use the [`launch-devnet`](../launch-devnet/SKILL.md) skill with the BAL devnet URL the user provides (e.g. `https://bal-devnet-3.ethpandaops.io`). That skill discovers chain id, forks, bootnodes, and CL image from the network's config service, generates start/stop/clean scripts, and monitors progress. **Do not duplicate that logic here.**

When `launch-devnet` finishes, you'll have:
- `$WORKDIR/`: Instance A working directory (default: `~/<devnet-name>/`)
- `$WORKDIR/devnet-info.txt`: chain id, fork schedule, port offset chosen for Instance A
- `$WORKDIR/inventory.json`: bootnodes & peer endpoints (reused by Instance B)
- `$WORKDIR/genesis.json`, `$WORKDIR/testnet-config/`: config artefacts (reused by Instance B)
- A running erigon (BAL) + CL pair

Confirm Instance A is past genesis and importing payloads before starting Instance B.
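
One way to make the "past genesis and importing" check concrete is to sample `eth_blockNumber` twice and require that the height is non-zero and growing. The sketch below is only illustrative: the helper names (`hex_to_dec`, `block_number`, `is_importing`) are hypothetical, and it assumes Instance A exposes the standard JSON-RPC HTTP endpoint at a URL you supply.

```bash
# Convert a 0x-prefixed hex quantity (as returned by eth_blockNumber) to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Query eth_blockNumber on the JSON-RPC endpoint $1 and print the height in decimal.
# Prints nothing and returns non-zero if no result could be parsed.
block_number() {
  local url=$1 hex
  hex=$(curl -s -X POST -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' "$url" \
    | grep -oE '"result"[[:space:]]*:[[:space:]]*"0x[0-9a-fA-F]+"' | cut -d'"' -f4)
  [ -n "$hex" ] && hex_to_dec "$hex"
}

# Succeed only if the first sample is past genesis and the second one is higher.
is_importing() {
  [ "$1" -gt 0 ] && [ "$2" -gt "$1" ]
}
```

For example, with `RPC_A` set to Instance A's HTTP RPC URL (an assumed variable, not something `launch-devnet` is known to export): `h1=$(block_number "$RPC_A"); sleep 30; h2=$(block_number "$RPC_A"); is_importing "$h1" "$h2" && echo "Instance A is importing"`.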
## Set up Instance B (No-BAL)

`launch-devnet` writes Instance A's chosen offset to `$WORKDIR/devnet-info.txt` as `port_offset: <N>` (key/value format). Read it instead of guessing, then derive Instance B's offset by adding 300 (leaves room for `+200`/`+300` ephemeral nodes the user might also be running). Before committing to the offset, verify the candidate ports with `lsof` (cross-platform; `ss` is Linux-only).

```bash
NOBAL_DIR="${WORKDIR}-nobal"

# Read Instance A's offset from the key/value metadata file and derive B's.
OFFSET_A=$(awk -F': *' '$1=="port_offset"{print $2}' "$WORKDIR/devnet-info.txt")
[ -n "$OFFSET_A" ] || { echo "port_offset not found in $WORKDIR/devnet-info.txt"; exit 1; }
OFFSET_B=$(( OFFSET_A + 300 ))

# Fill these in with each Instance B port at +OFFSET_B.
B_TCP_PORTS=()
B_UDP_PORTS=()

# Re-check the +OFFSET_B family (TCP + UDP). Bump by +100 and retry on conflict.
for p in "${B_TCP_PORTS[@]}"; do
  lsof -nP -iTCP:"$p" -sTCP:LISTEN >/dev/null 2>&1 && echo "TCP $p in use"
done
for p in "${B_UDP_PORTS[@]}"; do
  lsof -nP -iUDP:"$p" >/dev/null 2>&1 && echo "UDP $p in use"
done

mkdir -p "$NOBAL_DIR/erigon-data" "$NOBAL_DIR/cl-data"
cp "$WORKDIR/genesis.json" "$NOBAL_DIR/"
cp -r "$WORKDIR/testnet-config" "$NOBAL_DIR/"
cp "$WORKDIR/inventory.json" "$NOBAL_DIR/"
cp "$WORKDIR/devnet-info.txt" "$NOBAL_DIR/devnet-info.txt"
# Update Instance B's metadata to record the new offset and IGNORE_BAL flag
sed -i.bak "s/^port_offset:.*/port_offset: $OFFSET_B/" "$NOBAL_DIR/devnet-info.txt" && rm "$NOBAL_DIR/devnet-info.txt.bak"
echo "ignore_bal: true" >> "$NOBAL_DIR/devnet-info.txt"

./build/bin/erigon init --datadir "$NOBAL_DIR/erigon-data" "$NOBAL_DIR/genesis.json"
```
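
The setup script only reports conflicts; the "bump by +100 and retry" step is left to the operator. A minimal sketch of that retry loop follows, assuming you pass in the base (offset-free) ports yourself. `port_in_use` and `find_free_offset` are hypothetical helper names, not part of `launch-devnet`.

```bash
# True if something is listening on TCP port $1 or has UDP port $1 open.
port_in_use() {
  lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1 \
    || lsof -nP -iUDP:"$1" >/dev/null 2>&1
}

# Usage: find_free_offset START_OFFSET BASE_PORT...
# Prints the first offset (START_OFFSET, +100, +200, ... up to 10 tries) at which
# none of BASE_PORT+offset is in use; returns non-zero if every try conflicts.
find_free_offset() {
  local offset=$1 attempt base busy
  shift
  for attempt in 1 2 3 4 5 6 7 8 9 10; do
    busy=0
    for base in "$@"; do
      if port_in_use $(( base + offset )); then
        busy=1
        break
      fi
    done
    if [ "$busy" -eq 0 ]; then
      echo "$offset"
      return 0
    fi
    offset=$(( offset + 100 ))
  done
  return 1
}
```

A call might look like `OFFSET_B=$(find_free_offset "$OFFSET_B" 8545 8551 30303)`, where the three base ports are only illustrative; substitute the base ports your generated start scripts actually use. The helper checks each port for both TCP and UDP, which is slightly stricter than the per-protocol loops above but keeps the sketch short.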

Generate Instance B's start scripts the same way `launch-devnet` does, with these changes:

1. **`start-erigon.sh`**: add `export IGNORE_BAL=true` at the top of the script, before the `erigon` invocation. All other env vars stay the same as Instance A. Keep the `exec ./build/bin/erigon …` ending so the captured PID is the erigon PID (used by Cleanup).
2. **All ports**: use the `+OFFSET_B` family computed above (or whichever offset survived port-conflict checks). Update both the erigon flags and the CL flags.
3. **CL container name**: use `${DEVNET}-nobal-cl` to avoid colliding with Instance A's container.
4. **CL `--execution-endpoint`**: point at Instance B's authrpc port, not A's.
5. **CL `--disable-enr-auto-update`**: required when running two CLs on the same host.

Start in the same order: erigon first, save its PID, **poll** for `jwt.hex` and authrpc bind (same loop as `launch-devnet` Step 8; do not substitute a single fixed `sleep`), then CL.

```bash
# Run in a subshell and exec, so that $! is the PID erigon ends up running under
# (subshell -> nohup -> bash -> erigon all replace each other via exec).
( cd "$NOBAL_DIR" && exec nohup bash start-erigon.sh > erigon-console.log 2>&1 ) &
echo $! > "$NOBAL_DIR/erigon.pid"

# AUTHRPC_B must be set to the authrpc port at +OFFSET_B before this point.
: "${AUTHRPC_B:?set AUTHRPC_B to the authrpc port at +OFFSET_B}"
for i in $(seq 1 60); do
  if [ -f "$NOBAL_DIR/erigon-data/jwt.hex" ] \
     && lsof -nP -iTCP:"$AUTHRPC_B" -sTCP:LISTEN >/dev/null 2>&1; then
    break
  fi
  sleep 1
done
# Fail if either condition is still unmet after the polling window.
[ -f "$NOBAL_DIR/erigon-data/jwt.hex" ] || { echo "Instance B jwt.hex not produced after 60s"; exit 1; }
lsof -nP -iTCP:"$AUTHRPC_B" -sTCP:LISTEN >/dev/null 2>&1 || { echo "Instance B authrpc port $AUTHRPC_B not listening after 60s"; exit 1; }

cd "$NOBAL_DIR" && nohup bash start-cl.sh > cl-console.log 2>&1 &
```

## Compare at chain tip

Wait until both instances are caught up (compare `eth_blockNumber` against the public RPC). Once both are at head, the steady-state metrics in the execution log lines are what matter:

| Metric | Source | What it measures |
|--------|--------|------------------|
| `gas/s` | erigon execution log lines | Raw execution throughput |
| `repeat%` | erigon execution log lines | Speculative re-execution rate (lower = better dependency prediction) |
| `abort` | erigon execution log lines | Transactions aborted per batch |
| `invalid` | erigon execution log lines | Transactions invalidated by conflict detection |
| `blk/s` | erigon execution log lines | Block processing rate |

**Expected**: BAL instance should have lower `repeat%` and `abort` because BAL pre-populates the version map, reducing false conflicts. The `gas/s` delta is the net throughput impact.

Side-by-side log snapshot:

```bash
echo "=== BAL (A) ===" && \
  grep -E "parallel (executed|done)" "$WORKDIR/erigon-data/logs/erigon.log" | tail -3
echo
echo "=== No-BAL (B) ===" && \
  grep -E "parallel (executed|done)" "${WORKDIR}-nobal/erigon-data/logs/erigon.log" | tail -3
```

For longer comparisons, sample both logs at fixed intervals and aggregate (mean ± stdev) rather than eyeballing the tail.
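
The aggregation step can be sketched with a small `awk` helper that reads one numeric sample per line and prints the count, mean, and sample standard deviation. `mean_stdev` is a hypothetical name, and the helper deliberately makes no assumption about the erigon log format: extracting the numeric `gas/s` (or `repeat%`) values from the matched log lines is left to whatever pattern fits your logs.

```bash
# Read one numeric sample per line on stdin; print count, mean and sample stdev.
mean_stdev() {
  awk '
    NF { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n == 0) { print "n=0"; exit }
      mean = sum / n
      # Sample variance (n - 1 denominator); zero when there is a single sample.
      var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
      if (var < 0) var = 0   # guard against floating-point round-off
      printf "n=%d mean=%.3f stdev=%.3f\n", n, mean, sqrt(var)
    }'
}
```

Run it once per instance over the same wall-clock window, for example `printf '1\n2\n3\n' | mean_stdev`, which prints `n=3 mean=2.000 stdev=1.000`. Comparing the two means against their standard deviations gives a rough sense of whether a `gas/s` difference exceeds run-to-run noise.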

## Cleanup of Instance B only

```bash
NOBAL_DIR="${WORKDIR}-nobal"

docker stop "${DEVNET}-nobal-cl" 2>/dev/null || true
docker rm "${DEVNET}-nobal-cl" 2>/dev/null || true

# Stop Instance B's erigon by PID (saved at start time); avoids pkill -f regex risks.
if [ -f "$NOBAL_DIR/erigon.pid" ]; then
  PID=$(cat "$NOBAL_DIR/erigon.pid")
  kill "$PID" 2>/dev/null || true
  rm -f "$NOBAL_DIR/erigon.pid"
fi

# Optional: wipe Instance B's working directory.
# Refuse if WORKDIR is empty, NOBAL_DIR doesn't end with `-nobal`, or
# the directory doesn't carry the marker file we wrote at setup time.
# These guards prevent `rm -rf` from acting on an unrelated path if a
# variable is unset or has been overwritten.
if [ -n "$WORKDIR" ] \
   && [[ "$NOBAL_DIR" == *-nobal ]] \
   && [ -f "$NOBAL_DIR/devnet-info.txt" ] \
   && grep -q '^ignore_bal: true' "$NOBAL_DIR/devnet-info.txt"; then
  rm -rf "$NOBAL_DIR"
else
  echo "refusing to wipe '$NOBAL_DIR': sanity check failed; clean it up by hand"
fi
```

Instance A is unaffected; use its own `stop.sh` / `clean.sh` to manage it.

## Investigating discrepancies

If A and B disagree on block contents or state root (rather than just on throughput), this is **not** a BAL-vs-no-BAL throughput question; it is a correctness divergence. Stop the comparison and follow the failure-investigation flow in [`launch-devnet`](../launch-devnet/SKILL.md) ("Investigating failures — finding the absolute truth"). The two instances run the same erigon binary with a single env-var difference, so a state-root diff between them points squarely at the BAL scheduling path.
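
A quick way to tell a correctness divergence from a throughput difference is to fetch the same block height from both instances and compare the `stateRoot` field. The sketch below assumes both instances serve the standard `eth_getBlockByNumber` JSON-RPC method over HTTP; the helper names and the `RPC_A` / `RPC_B` variables in the usage note are hypothetical.

```bash
# Pull the first "stateRoot" value out of an eth_getBlockByNumber response on stdin.
state_root() {
  grep -oE '"stateRoot"[[:space:]]*:[[:space:]]*"0x[0-9a-fA-F]+"' \
    | head -n 1 | cut -d'"' -f4
}

# Fetch block $2 (a 0x-prefixed hex height) from endpoint $1 and print its state root.
state_root_at() {
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "{\"jsonrpc\":\"2.0\",\"method\":\"eth_getBlockByNumber\",\"params\":[\"$2\",false],\"id\":1}" \
    "$1" | state_root
}

# Compare two state roots; print a verdict and return non-zero on mismatch
# (an empty first root also counts as a mismatch).
same_root() {
  if [ -n "$1" ] && [ "$1" = "$2" ]; then
    echo "match: $1"
  else
    echo "DIVERGENCE: A=$1 B=$2"
    return 1
  fi
}
```

For example: `same_root "$(state_root_at "$RPC_A" 0x100)" "$(state_root_at "$RPC_B" 0x100)"`. Pick a height both instances have already imported, since a block one side has not reached yet returns no state root and is reported as a divergence.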
