|
| 1 | +--- |
| 2 | +name: bal-devnet-ab-test |
| 3 | +description: A/B test erigon's BAL parallel-execution scheduling on any BAL devnet (bal-devnet-N). Uses launch-devnet to bring up the primary instance, then spins up a second instance with IGNORE_BAL=true on a different port offset so throughput can be compared head-to-head. Use this when the user wants to compare BAL vs non-BAL parallel execution. |
| 4 | +allowed-tools: Bash, Read, Write, Edit, Glob |
| 5 | +allowed-prompts: |
| 6 | +  - tool: Bash |
| 7 | +    prompt: start, stop, and manage two parallel erigon+CL instances for BAL A/B testing |
| 8 | +--- |
| 9 | + |
| 10 | +# BAL A/B Testing Wrapper |
| 11 | + |
| 12 | +Compare erigon's parallel-execution throughput with vs without BAL (Block Access List) scheduling on a `bal-devnet-N` network. The flow is: |
| 13 | + |
| 14 | +- **Instance A (BAL)** — default behaviour. Uses BAL hints from blocks to pre-populate version maps and schedule transactions optimistically. |
| 15 | +- **Instance B (No-BAL)** — same code, but with `IGNORE_BAL=true` set in the environment, forcing the dependency-tracking scheduling path. |
| 16 | + |
| 17 | +Both instances sync the same chain so `gas/s`, `repeat%`, `abort`, and `invalid` counters can be compared directly at chain tip. |
| 18 | + |
| 19 | +## Prerequisite: bring up Instance A |
| 20 | + |
| 21 | +Use the [`launch-devnet`](../launch-devnet/SKILL.md) skill with the BAL devnet URL the user provides (e.g. `https://bal-devnet-3.ethpandaops.io`). That skill discovers chain id, forks, bootnodes, and CL image from the network's config service, generates start/stop/clean scripts, and monitors progress. **Do not duplicate that logic here.** |
| 22 | + |
| 23 | +When `launch-devnet` finishes, you'll have: |
| 24 | +- `$WORKDIR/` — Instance A working directory (default: `~/<devnet-name>/`) |
| 25 | +- `$WORKDIR/devnet-info.txt` — chain id, fork schedule, port offset chosen for Instance A |
| 26 | +- `$WORKDIR/inventory.json` — bootnodes & peer endpoints (reused by Instance B) |
| 27 | +- `$WORKDIR/genesis.json`, `$WORKDIR/testnet-config/` — config artefacts (reused by Instance B) |
| 28 | +- A running erigon (BAL) + CL pair |
| 29 | + |
| 30 | +Confirm Instance A is past genesis and importing payloads before starting Instance B. |
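One way to check the "past genesis" condition above is to ask Instance A for its head block over JSON-RPC. The sketch below is a minimal example, not part of `launch-devnet`: the helper names (`rpc_result_to_dec`, `block_number`, `past_genesis`) and the RPC URL passed in are assumptions made for illustration. It only shows that the head is above block 0, so confirming that payloads are still being imported still needs a look at the erigon log.

```shell
# Extract the hex "result" field from a JSON-RPC response and print it in decimal.
# Returns non-zero if the response carries no hex result (e.g. an error object).
rpc_result_to_dec() {
  local hex
  hex=$(printf '%s' "$1" | sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F]\{1,\}\)".*/\1/p')
  [ -n "$hex" ] || return 1
  printf '%d\n' "$hex"
}

# Query an execution client's HTTP RPC endpoint ($1) for its head block number.
block_number() {
  local resp
  resp=$(curl -s -X POST -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' "$1") || return 1
  rpc_result_to_dec "$resp"
}

# Succeeds only when the endpoint ($1) reports a head block above genesis.
past_genesis() {
  local n
  n=$(block_number "$1") && [ "$n" -gt 0 ]
}
```

Usage, assuming a hypothetical `RPC_PORT_A` holding Instance A's HTTP RPC port: `past_genesis "http://localhost:$RPC_PORT_A" && echo "Instance A is past genesis"`.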
| 31 | + |
| 32 | +## Set up Instance B (No-BAL) |
| 33 | + |
| 34 | +`launch-devnet` records Instance A's chosen offset in `$WORKDIR/devnet-info.txt` as a `port_offset: <N>` key/value line. Read it rather than guessing, then derive Instance B's offset by adding 300 to it, which leaves room for any `+200`/`+300` ephemeral nodes the user might also be running. Before committing to the offset, verify that the candidate ports are free with `lsof`, which works on both Linux and macOS (`ss` is Linux-only). |
| 35 | + |
| 36 | +```bash |
| 37 | +NOBAL_DIR="${WORKDIR}-nobal" |
| 38 | + |
| 39 | +OFFSET_A=$(awk -F': *' '$1=="port_offset"{print $2}' "$WORKDIR/devnet-info.txt") |
| 40 | +OFFSET_B=$(( OFFSET_A + 300 )) |
| 41 | + |
| 42 | +# Re-check the +OFFSET_B family (TCP + UDP). Bump by +100 and retry on conflict. |
| 43 | +for p in <each B port at +OFFSET_B>; do |
| 44 | +  lsof -nP -iTCP:$p -sTCP:LISTEN >/dev/null 2>&1 && echo "TCP $p in use" |
| 45 | +done |
| 46 | +for p in <each B UDP port at +OFFSET_B>; do |
| 47 | +  lsof -nP -iUDP:$p >/dev/null 2>&1 && echo "UDP $p in use" |
| 48 | +done |
| 49 | + |
| 50 | +mkdir -p "$NOBAL_DIR/erigon-data" "$NOBAL_DIR/cl-data" |
| 51 | +cp "$WORKDIR/genesis.json" "$NOBAL_DIR/" |
| 52 | +cp -r "$WORKDIR/testnet-config" "$NOBAL_DIR/" |
| 53 | +cp "$WORKDIR/inventory.json" "$NOBAL_DIR/" |
| 54 | +cp "$WORKDIR/devnet-info.txt" "$NOBAL_DIR/devnet-info.txt" |
| 55 | +# Update Instance B's metadata to record the new offset and IGNORE_BAL flag |
| 56 | +sed -i.bak "s/^port_offset:.*/port_offset: $OFFSET_B/" "$NOBAL_DIR/devnet-info.txt" && rm "$NOBAL_DIR/devnet-info.txt.bak" |
| 57 | +echo "ignore_bal: true" >> "$NOBAL_DIR/devnet-info.txt" |
| 58 | + |
| 59 | +./build/bin/erigon init --datadir "$NOBAL_DIR/erigon-data" "$NOBAL_DIR/genesis.json" |
| 60 | +``` |
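The setup block mentions bumping the offset by +100 and retrying on conflict but does not spell out the loop. A minimal sketch of that retry follows. `BASE_TCP_PORTS`, `port_in_use`, and `pick_offset` are illustrative names, and the default port list is an assumption: substitute the base ports that `launch-devnet` actually assigned. Only TCP is probed here; UDP ports would need the analogous `lsof -iUDP` check.

```shell
# Hypothetical base ports (before any offset); replace with the real ones.
BASE_TCP_PORTS="${BASE_TCP_PORTS:-8545 8551 30303}"

# Returns 0 if something is listening on TCP port $1.
port_in_use() {
  lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# pick_offset START [MAX_TRIES]
# Prints the first offset (START, START+100, ...) at which every base port is free.
# Returns 1 if no free offset is found within MAX_TRIES attempts (default 5).
pick_offset() {
  local offset=$1 tries=${2:-5} base busy
  while [ "$tries" -gt 0 ]; do
    busy=0
    for base in $BASE_TCP_PORTS; do
      if port_in_use $(( base + offset )); then
        busy=1
        break
      fi
    done
    if [ "$busy" -eq 0 ]; then
      echo "$offset"
      return 0
    fi
    offset=$(( offset + 100 ))
    tries=$(( tries - 1 ))
  done
  return 1
}
```

Usage: `OFFSET_B=$(pick_offset $(( OFFSET_A + 300 ))) || { echo "no free port offset found"; exit 1; }`, then write the surviving value into Instance B's `devnet-info.txt`.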
| 61 | + |
| 62 | +Generate Instance B's start scripts the same way `launch-devnet` does, with these changes: |
| 63 | + |
| 64 | +1. **`start-erigon.sh`** — add `export IGNORE_BAL=true` at the top of the script, before the `erigon` invocation. All other env vars stay the same as Instance A. Keep the `exec ./build/bin/erigon …` ending so the captured PID is the erigon PID (used by Cleanup). |
| 65 | +2. **All ports** — use the `+OFFSET_B` family computed above (or whichever offset survived port-conflict checks). Update both the erigon flags and the CL flags. |
| 66 | +3. **CL container name** — use `${DEVNET}-nobal-cl` to avoid colliding with Instance A's container. |
| 67 | +4. **CL `--execution-endpoint`** — point at Instance B's authrpc port, not A's. |
| 68 | +5. **CL `--disable-enr-auto-update`** — required when running two CLs on the same host. |
| 69 | + |
| 70 | +Start in the same order: erigon first, save its PID, **poll** for `jwt.hex` and the authrpc bind (same loop as `launch-devnet` Step 8; do not substitute a fixed `sleep`), then start the CL. |
| 71 | + |
| 72 | +```bash |
| 73 | +( cd "$NOBAL_DIR" && exec nohup bash start-erigon.sh > erigon-console.log 2>&1 ) & |
| 74 | +echo $! > "$NOBAL_DIR/erigon.pid"  # exec inside the subshell makes $! the PID that becomes erigon, not a wrapper shell |
| 75 | + |
| 76 | +AUTHRPC_B=<authrpc port at +OFFSET_B> |
| 77 | +for i in $(seq 1 60); do |
| 78 | +  if [ -f "$NOBAL_DIR/erigon-data/jwt.hex" ] \ |
| 79 | +     && lsof -nP -iTCP:"$AUTHRPC_B" -sTCP:LISTEN >/dev/null 2>&1; then |
| 80 | +    break |
| 81 | +  fi |
| 82 | +  sleep 1 |
| 83 | +done |
| 84 | +{ [ -f "$NOBAL_DIR/erigon-data/jwt.hex" ] && lsof -nP -iTCP:"$AUTHRPC_B" -sTCP:LISTEN >/dev/null 2>&1; } || { echo "Instance B not ready after 60s (jwt.hex missing or authrpc not listening)"; exit 1; } |
| 85 | + |
| 86 | +cd "$NOBAL_DIR" && nohup bash start-cl.sh > cl-console.log 2>&1 & |
| 87 | +``` |
| 88 | + |
| 89 | +## Compare at chain tip |
| 90 | + |
| 91 | +Wait until both instances have caught up (compare each instance's `eth_blockNumber` against the public RPC). Once both are at head, the steady-state metrics in erigon's execution log lines are what matter: |
| 92 | + |
| 93 | +| Metric | Source | What it measures | |
| 94 | +|--------|--------|------------------| |
| 95 | +| `gas/s` | erigon execution log lines | Raw execution throughput | |
| 96 | +| `repeat%` | erigon execution log lines | Speculative re-execution rate (lower = better dependency prediction) | |
| 97 | +| `abort` | erigon execution log lines | Transactions aborted per batch | |
| 98 | +| `invalid` | erigon execution log lines | Transactions invalidated by conflict detection | |
| 99 | +| `blk/s` | erigon execution log lines | Block processing rate | |
| 100 | + |
| 101 | +**Expected**: the BAL instance should show lower `repeat%` and `abort` because BAL pre-populates the version map, reducing false conflicts. The `gas/s` delta is the net throughput impact. |
| 102 | + |
| 103 | +Side-by-side log snapshot: |
| 104 | + |
| 105 | +```bash |
| 106 | +echo "=== BAL (A) ===" && \ |
| 107 | + grep -E "parallel (executed|done)" "$WORKDIR/erigon-data/logs/erigon.log" | tail -3 |
| 108 | +echo |
| 109 | +echo "=== No-BAL (B) ===" && \ |
| 110 | + grep -E "parallel (executed|done)" "${WORKDIR}-nobal/erigon-data/logs/erigon.log" | tail -3 |
| 111 | +``` |
| 112 | + |
| 113 | +For longer comparisons, sample both logs at fixed intervals and aggregate (mean ± stdev) rather than eyeballing the tail. |
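The aggregation step suggested above can be sketched as a small filter that reads one numeric sample per line and prints the count, mean, and sample standard deviation. `mean_stdev` is a hypothetical helper, not something `launch-devnet` or erigon provides, and how samples are extracted from the log is left to the caller because the exact log line format is not specified here.

```shell
# Read one numeric sample per line on stdin; print count, mean and sample
# standard deviation (n-1 denominator). Blank lines are skipped.
mean_stdev() {
  awk '
    NF { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n == 0) { print "n=0"; exit }
      mean = sum / n
      var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
      if (var < 0) var = 0   # guard against floating-point rounding
      printf "n=%d mean=%.2f stdev=%.2f\n", n, mean, sqrt(var)
    }'
}
```

Usage, assuming (this is an assumption about the log format) that the metric appears as a plain `gas/s=<number>` token: `grep -oE 'gas/s=[0-9.]+' "$WORKDIR/erigon-data/logs/erigon.log" | cut -d= -f2 | mean_stdev`. If the log prints unit suffixes, normalize them before feeding values to the filter. Run the same pipeline against Instance B's log over the same block range so the two summaries are comparable.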
| 114 | + |
| 115 | +## Cleanup of Instance B only |
| 116 | + |
| 117 | +```bash |
| 118 | +NOBAL_DIR="${WORKDIR}-nobal" |
| 119 | + |
| 120 | +docker stop "${DEVNET}-nobal-cl" 2>/dev/null || true |
| 121 | +docker rm "${DEVNET}-nobal-cl" 2>/dev/null || true |
| 122 | + |
| 123 | +# Stop Instance B's erigon by PID (saved at start time) — avoids pkill -f regex risks. |
| 124 | +if [ -f "$NOBAL_DIR/erigon.pid" ]; then |
| 125 | +  PID=$(cat "$NOBAL_DIR/erigon.pid") |
| 126 | +  kill "$PID" 2>/dev/null || true |
| 127 | +  rm -f "$NOBAL_DIR/erigon.pid" |
| 128 | +fi |
| 129 | + |
| 130 | +# Optional — wipe Instance B's working directory. |
| 131 | +# Refuse if WORKDIR is empty, NOBAL_DIR doesn't end with `-nobal`, or |
| 132 | +# the directory doesn't carry the marker file we wrote at setup time. |
| 133 | +# These guards prevent `rm -rf` from acting on an unrelated path if a |
| 134 | +# variable is unset or has been overwritten. |
| 135 | +if [ -n "$WORKDIR" ] \ |
| 136 | +   && [[ "$NOBAL_DIR" == *-nobal ]] \ |
| 137 | +   && [ -f "$NOBAL_DIR/devnet-info.txt" ] \ |
| 138 | +   && grep -q '^ignore_bal: true' "$NOBAL_DIR/devnet-info.txt"; then |
| 139 | +  rm -rf "$NOBAL_DIR" |
| 140 | +else |
| 141 | +  echo "refusing to wipe '$NOBAL_DIR' — sanity check failed; clean it up by hand" |
| 142 | +fi |
| 143 | +``` |
| 144 | + |
| 145 | +Instance A is unaffected; use its own `stop.sh` / `clean.sh` to manage it. |
| 146 | + |
| 147 | +## Investigating discrepancies |
| 148 | + |
| 149 | +If A and B disagree on block contents or state root (rather than just on throughput), this is **not** a BAL-vs-no-BAL throughput question — it's a correctness divergence. Stop the comparison and follow the failure-investigation flow in [`launch-devnet`](../launch-devnet/SKILL.md) ("Investigating failures — finding the absolute truth"). The two instances are running the same erigon binary with one env-var difference, so a state-root diff between them points squarely at the BAL scheduling path. |