Commit 074f0cd

taratorio and claude authored
claude: add generic launch-devnet skill (#21024)
## Summary

Replaces the devnet-specific `launch-bal-devnet-3` Claude skill with two reusable skills:

- **`launch-devnet`**: generic launcher for any ethpandaops devnet. Takes only a landing-page URL (e.g. `https://bal-devnet-3.ethpandaops.io`) and discovers everything else at runtime: chain id and fork schedule from `el/genesis.json`, CL fork epochs from `cl/config.yaml`, EL/CL bootnodes and client image tags from the inventory API, public RPC/beacon/checkpoint URLs from the host convention. Auto-detects port conflicts and bumps the offset, generates `start-erigon.sh` / `start-cl.sh` / `stop.sh` / `clean.sh`, starts erigon → waits for `jwt.hex` → starts the CL, then monitors EL head vs network head, peer counts, and CL sync status.
- **`bal-devnet-ab-test`**: slim wrapper that reuses `launch-devnet` for the primary instance and adds a second instance with `IGNORE_BAL=true` on a `+400` port offset for head-to-head throughput comparison (`gas/s`, `repeat%`, `abort`, `invalid`).

The old `launch-bal-devnet-3` skill is deleted; nothing else in the repo referenced it.

### Key design point: failure investigation

`launch-devnet` includes an explicit "finding the absolute truth" section. The default assumption is **not** that erigon is wrong: on a multi-client devnet, erigon may be spec-correct while another client is buggy, the spec itself may be ambiguous (clients split into factions), or the network/genesis may be broken. The skill instructs the agent to:

1. Cross-check any divergence against ≥2 non-erigon ELs from the inventory.
2. Drill down to a specific block/slot/account/storage diff.
3. Treat the EIP text as authoritative over what other clients do.
4. Surface findings to the user with concrete data (specific block, account, EIP quote) rather than "please advise", and only after the diff is reproducible across a restart and at least one independent client supports the alternative result.

A "common false-positive signals" list (optimistic head, first-newPayload timeouts, transient `peers: 0`) keeps the agent from escalating noise.

## Files

```
+ .claude/skills/launch-devnet/SKILL.md          # new: generic launcher
+ .claude/skills/bal-devnet-ab-test/SKILL.md     # new: BAL A/B testing wrapper
- .claude/skills/launch-bal-devnet-3/SKILL.md    # removed: superseded
```

## Test plan

- [ ] Invoke `/launch-devnet https://bal-devnet-3.ethpandaops.io` and confirm it discovers chain id `7098917910`, the Amsterdam fork timestamp, and ≥10 EL/CL bootnodes from the inventory.
- [ ] Confirm port-conflict detection bumps the offset when `+100` ports are already bound.
- [ ] Confirm erigon syncs past genesis on bal-devnet-3 with the generated scripts (no hardcoded values).
- [ ] Invoke `/launch-devnet` against a different devnet (e.g. `fusaka-devnet-N`) and confirm the same flow works without code changes.
- [ ] Invoke `/bal-devnet-ab-test` after a successful `/launch-devnet` run and confirm Instance B starts on `+400` ports with `IGNORE_BAL=true` exported.
- [ ] Trigger a synthetic state-root divergence and confirm the skill cross-checks against ≥2 non-erigon ELs before reporting a root cause.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent af78f19 commit 074f0cd

3 files changed

Lines changed: 503 additions & 283 deletions

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
---
name: bal-devnet-ab-test
description: A/B test erigon's BAL parallel-execution scheduling on any BAL devnet (bal-devnet-N). Uses launch-devnet to bring up the primary instance, then spins up a second instance with IGNORE_BAL=true on a different port offset so throughput can be compared head-to-head. Use this when the user wants to compare BAL vs non-BAL parallel execution.
allowed-tools: Bash, Read, Write, Edit, Glob
allowed-prompts:
  - tool: Bash
    prompt: start, stop, and manage two parallel erigon+CL instances for BAL A/B testing
---

# BAL A/B Testing Wrapper

Compare erigon's parallel-execution throughput with vs without BAL (Block Access List) scheduling on a `bal-devnet-N` network. The flow is:

- **Instance A (BAL)**: default behaviour. Uses BAL hints from blocks to pre-populate version maps and schedule transactions optimistically.
- **Instance B (No-BAL)**: same code, but with `IGNORE_BAL=true` set in the environment, forcing the dependency-tracking scheduling path.

Both instances sync the same chain so `gas/s`, `repeat%`, `abort`, and `invalid` counters can be compared directly at chain tip.

## Prerequisite: bring up Instance A

Use the [`launch-devnet`](../launch-devnet/SKILL.md) skill with the BAL devnet URL the user provides (e.g. `https://bal-devnet-3.ethpandaops.io`). That skill discovers chain id, forks, bootnodes, and CL image from the network's config service, generates start/stop/clean scripts, and monitors progress. **Do not duplicate that logic here.**

When `launch-devnet` finishes, you'll have:
- `$WORKDIR/`: Instance A working directory (default: `~/<devnet-name>/`)
- `$WORKDIR/devnet-info.txt`: chain id, fork schedule, port offset chosen for Instance A
- `$WORKDIR/inventory.json`: bootnodes & peer endpoints (reused by Instance B)
- `$WORKDIR/genesis.json`, `$WORKDIR/testnet-config/`: config artefacts (reused by Instance B)
- A running erigon (BAL) + CL pair

Confirm Instance A is past genesis and importing payloads before starting Instance B.

## Set up Instance B (No-BAL)

`launch-devnet` writes Instance A's chosen offset to `$WORKDIR/devnet-info.txt` as `port_offset: <N>` (key/value format). Read it instead of guessing, then derive Instance B's offset by adding 300. With Instance A at `+100`, for example, Instance B lands at `+400`, which leaves the `+200`/`+300` families free for ephemeral nodes the user might also be running. Verify the candidate ports with `lsof` before committing to the offset (`lsof` works on both Linux and macOS, whereas `ss` is Linux-only).

```bash
NOBAL_DIR="${WORKDIR}-nobal"

OFFSET_A=$(awk -F': *' '$1=="port_offset"{print $2}' "$WORKDIR/devnet-info.txt")
# Fail early if the key is missing; otherwise OFFSET_B would silently become 300.
[ -n "$OFFSET_A" ] || { echo "port_offset not found in $WORKDIR/devnet-info.txt"; exit 1; }
OFFSET_B=$(( OFFSET_A + 300 ))

# Re-check the +OFFSET_B family (TCP + UDP). Bump by +100 and retry on conflict.
# Fill in both lists from the ports the generated start scripts use.
B_TCP_PORTS=()   # each Instance B TCP port at +OFFSET_B
B_UDP_PORTS=()   # each Instance B UDP port at +OFFSET_B
for p in "${B_TCP_PORTS[@]}"; do
  lsof -nP -iTCP:"$p" -sTCP:LISTEN >/dev/null 2>&1 && echo "TCP $p in use"
done
for p in "${B_UDP_PORTS[@]}"; do
  lsof -nP -iUDP:"$p" >/dev/null 2>&1 && echo "UDP $p in use"
done

mkdir -p "$NOBAL_DIR/erigon-data" "$NOBAL_DIR/cl-data"
cp "$WORKDIR/genesis.json" "$NOBAL_DIR/"
cp -r "$WORKDIR/testnet-config" "$NOBAL_DIR/"
cp "$WORKDIR/inventory.json" "$NOBAL_DIR/"
cp "$WORKDIR/devnet-info.txt" "$NOBAL_DIR/devnet-info.txt"
# Update Instance B's metadata to record the new offset and IGNORE_BAL flag
sed -i.bak "s/^port_offset:.*/port_offset: $OFFSET_B/" "$NOBAL_DIR/devnet-info.txt" && rm "$NOBAL_DIR/devnet-info.txt.bak"
echo "ignore_bal: true" >> "$NOBAL_DIR/devnet-info.txt"

./build/bin/erigon init --datadir "$NOBAL_DIR/erigon-data" "$NOBAL_DIR/genesis.json"
```
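
The comment in the block above says to bump by `+100` and retry on conflict, but the loops only report busy ports. A minimal sketch of that retry follows. The base port list is a placeholder assumption (it is not taken from `launch-devnet`), and `port_in_use` and `find_free_offset` are hypothetical helper names. Only TCP listeners are checked here, so UDP ports would need the same treatment.

```bash
# Placeholder base TCP ports (an assumption); take the real list from the
# start scripts that launch-devnet generated for Instance A.
BASE_TCP_PORTS="8545 8551 30303"

# Succeeds if something is listening on TCP port $1.
port_in_use() {
  lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# Prints the first offset, starting at $1 and stepping by 100, at which none
# of the base ports is bound. Gives up after $2 attempts (default 10).
# The body runs in a subshell, so its variables do not leak to the caller.
find_free_offset() (
  offset=$1
  tries=${2:-10}
  while [ "$tries" -gt 0 ]; do
    busy=0
    for base in $BASE_TCP_PORTS; do
      if port_in_use $(( base + offset )); then
        busy=1
        break
      fi
    done
    if [ "$busy" -eq 0 ]; then
      echo "$offset"
      exit 0
    fi
    offset=$(( offset + 100 ))
    tries=$(( tries - 1 ))
  done
  exit 1
)
```

With this in place, `OFFSET_B=$(find_free_offset "$OFFSET_B") || exit 1` would replace the fixed candidate with the first free port family.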

Generate Instance B's start scripts the same way `launch-devnet` does, with these changes:

1. **`start-erigon.sh`**: add `export IGNORE_BAL=true` at the top of the script, before the `erigon` invocation. All other env vars stay the same as Instance A. Keep the `exec ./build/bin/erigon …` ending so the captured PID is the erigon PID (used by Cleanup).
2. **All ports**: use the `+OFFSET_B` family computed above (or whichever offset survived port-conflict checks). Update both the erigon flags and the CL flags.
3. **CL container name**: use `${DEVNET}-nobal-cl` to avoid colliding with Instance A's container.
4. **CL `--execution-endpoint`**: point at Instance B's authrpc port, not A's.
5. **CL `--disable-enr-auto-update`**: required when running two CLs on the same host.

Start in the same order: erigon first, save its PID, **poll** for `jwt.hex` and authrpc bind (same loop as `launch-devnet` Step 8; do not rely on a single fixed `sleep`), then CL.

```bash
# Run in a subshell that exec's into nohup, so $! is the PID that
# start-erigon.sh (and then erigon, via its final exec) keeps.
( cd "$NOBAL_DIR" && exec nohup bash start-erigon.sh > erigon-console.log 2>&1 ) &
echo $! > "$NOBAL_DIR/erigon.pid"

# Must be set beforehand to the authrpc port at +OFFSET_B; aborts if it is not.
AUTHRPC_B="${AUTHRPC_B:?set AUTHRPC_B to the authrpc port at +OFFSET_B}"
for i in $(seq 1 60); do
  if [ -f "$NOBAL_DIR/erigon-data/jwt.hex" ] \
     && lsof -nP -iTCP:"$AUTHRPC_B" -sTCP:LISTEN >/dev/null 2>&1; then
    break
  fi
  sleep 1
done
[ -f "$NOBAL_DIR/erigon-data/jwt.hex" ] || { echo "Instance B jwt.hex not produced after 60s"; exit 1; }

cd "$NOBAL_DIR" && nohup bash start-cl.sh > cl-console.log 2>&1 &
```

## Compare at chain tip

Wait until both instances are caught up (compare `eth_blockNumber` against the public RPC). Once both are at-head, the steady-state metrics in the execution log lines are what matter:

| Metric | Source | What it measures |
|--------|--------|------------------|
| `gas/s` | erigon execution log lines | Raw execution throughput |
| `repeat%` | erigon execution log lines | Speculative re-execution rate (lower = better dependency prediction) |
| `abort` | erigon execution log lines | Transactions aborted per batch |
| `invalid` | erigon execution log lines | Transactions invalidated by conflict detection |
| `blk/s` | erigon execution log lines | Block processing rate |

**Expected**: BAL instance should have lower `repeat%` and `abort` because BAL pre-populates the version map, reducing false conflicts. The `gas/s` delta is the net throughput impact.
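
Before reading these counters, the caught-up check can be scripted. The sketch below uses the standard `eth_blockNumber` JSON-RPC method. The function names are hypothetical, and the endpoint URLs are assumptions that must come from the generated start scripts and the devnet's public RPC, since neither is fixed in this file.

```bash
# Pull the hex "result" value out of a JSON-RPC response read from stdin.
parse_hex_result() {
  sed -n 's/.*"result" *: *"\(0x[0-9a-fA-F]*\)".*/\1/p'
}

# Print the head block number (hex) reported by the JSON-RPC endpoint $1.
rpc_block_number() {
  curl -s -X POST -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' "$1" \
    | parse_hex_result
}

# Print how many blocks the local head ($1, hex) trails the network head ($2, hex).
head_lag() {
  echo $(( $2 - $1 ))
}

# Usage (both URLs are placeholders to be filled in):
#   lag=$(head_lag "$(rpc_block_number "$LOCAL_RPC_URL")" "$(rpc_block_number "$PUBLIC_RPC_URL")")
```

A lag that stays at or near zero on both instances over several samples is a reasonable signal that the steady-state numbers are comparable.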

Side-by-side log snapshot:

```bash
echo "=== BAL (A) ===" && \
  grep -E "parallel (executed|done)" "$WORKDIR/erigon-data/logs/erigon.log" | tail -3
echo
echo "=== No-BAL (B) ===" && \
  grep -E "parallel (executed|done)" "${WORKDIR}-nobal/erigon-data/logs/erigon.log" | tail -3
```

For longer comparisons, sample both logs at fixed intervals and aggregate (mean ± stdev) rather than eyeballing the tail.
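
The aggregation step can be sketched with `awk`. The helper below (a hypothetical name, `mean_stdev`) reads one numeric sample per line on stdin and prints the count, mean, and population standard deviation. How to extract a metric such as `gas/s` from the log lines is left out on purpose, because the exact log format is not specified here.

```bash
# Read one number per line on stdin; print count, mean, and population stdev.
mean_stdev() {
  awk '
    NF { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n == 0) { print "n=0"; exit }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0   # guard against tiny negative rounding error
      printf "n=%d mean=%.2f stdev=%.2f\n", n, mean, sqrt(var)
    }'
}
```

Feeding the same number of samples from each instance, taken over the same block range, keeps the two summaries comparable.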

## Cleanup of Instance B only

```bash
NOBAL_DIR="${WORKDIR}-nobal"

docker stop "${DEVNET}-nobal-cl" 2>/dev/null || true
docker rm "${DEVNET}-nobal-cl" 2>/dev/null || true

# Stop Instance B's erigon by PID (saved at start time); this avoids pkill -f regex risks.
if [ -f "$NOBAL_DIR/erigon.pid" ]; then
  PID=$(cat "$NOBAL_DIR/erigon.pid")
  kill "$PID" 2>/dev/null || true
  rm -f "$NOBAL_DIR/erigon.pid"
fi

# Optional: wipe Instance B's working directory.
# Refuse if WORKDIR is empty, NOBAL_DIR doesn't end with `-nobal`, or
# the directory doesn't carry the marker file we wrote at setup time.
# These guards prevent `rm -rf` from acting on an unrelated path if a
# variable is unset or has been overwritten.
if [ -n "$WORKDIR" ] \
   && [[ "$NOBAL_DIR" == *-nobal ]] \
   && [ -f "$NOBAL_DIR/devnet-info.txt" ] \
   && grep -q '^ignore_bal: true' "$NOBAL_DIR/devnet-info.txt"; then
  rm -rf "$NOBAL_DIR"
else
  echo "refusing to wipe '$NOBAL_DIR': sanity check failed; clean it up by hand"
fi
```

Instance A is unaffected; use its own `stop.sh` / `clean.sh` to manage it.

## Investigating discrepancies

If A and B disagree on block contents or state root (rather than just on throughput), this is **not** a BAL-vs-no-BAL throughput question; it is a correctness divergence. Stop the comparison and follow the failure-investigation flow in [`launch-devnet`](../launch-devnet/SKILL.md) (the "finding the absolute truth" section). The two instances are running the same erigon binary with one env-var difference, so a state-root diff between them points squarely at the BAL scheduling path.
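
To confirm a divergence before switching to the investigation flow, the state roots for one block can be compared across the two instances. This is only a sketch: `eth_getBlockByNumber` is the standard JSON-RPC method, while the helper names and the RPC URLs are assumptions.

```bash
# Extract the stateRoot field from an eth_getBlockByNumber response on stdin.
extract_state_root() {
  sed -n 's/.*"stateRoot" *: *"\(0x[0-9a-fA-F]*\)".*/\1/p'
}

# Print the state root of block $2 (hex, e.g. 0x10) from the RPC endpoint $1.
rpc_state_root() {
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "{\"jsonrpc\":\"2.0\",\"method\":\"eth_getBlockByNumber\",\"params\":[\"$2\",false],\"id\":1}" "$1" \
    | extract_state_root
}

# Succeeds only when both roots are non-empty and identical.
same_state_root() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage (RPC_A and RPC_B are placeholder URLs for the two instances):
#   same_state_root "$(rpc_state_root "$RPC_A" 0x10)" "$(rpc_state_root "$RPC_B" 0x10)" \
#     || echo "state root mismatch at block 0x10"
```

An empty root counts as a mismatch here, so an unreachable endpoint is not mistaken for agreement.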
