Commit 0561270

echonet: Add echonet resync investigation skill (#14401)
1 parent 87b5bdd commit 0561270

1 file changed

Lines changed: 205 additions & 0 deletions

  • .claude/skills/investigate-echonet-resync
@@ -0,0 +1,205 @@
---
name: investigate-echonet-resync
description: Investigate why an Echonet resync triggered. Use when someone mentions an Echonet resync, a failing block in an `echonet-*` namespace, or a `block_hash_mismatch` / `transaction_commitment_mismatch` / `echonet_only_revert` / `gateway_error`. The skill sets up port-forwards, prefers GCP Logs Explorer for history, cross-checks the produced block against mainnet, and recognizes recurring failure patterns before reporting back.
---

# Investigate an Echonet Resync

Echonet is a Starknet replay layer. It pulls mainnet blocks from the public feeder gateway, forwards their txs into a local Apollo sequencer pod, and compares the locally-produced block against mainnet. When something diverges — a gateway reject, an echonet-only revert, a block-hash mismatch, a transaction-commitment mismatch — Echonet scales the sequencer to zero, rewinds state, and restarts. **Your job is to figure out what triggered this specific resync.**

You'll be given a namespace and a failure block number. Everything else you discover yourself.

## 1. Discover the environment

Echonet namespaces live in the cluster whose kubeconfig context contains `sequencer-dev` (or whatever the current dev cluster is called). The K8s objects use canonical names:

- **Service:** `echonet` (port `80`)
- **Deployment:** `echonet`
- **Apollo sequencer:** a separate deployment in the same namespace; find it with `kubectl -n <ns> get pods | grep -iE 'apollo|sequencer'`.

```bash
kubectl config current-context        # confirm you're on the dev cluster
kubectl get ns | grep -i echonet      # list all echonet-* namespaces
kubectl -n <ns> get deploy,svc,pods   # what's actually running
```

There can be several echonet namespaces in flight at once (e.g. `echonet-committer3`, `echonet-14-2`); always confirm which one the user named.

## 2. Set up a port-forward (if one isn't already running)

Pick any free local port — the canonical local ports are just convention.

```bash
LOCAL=18083   # or any free port
kubectl -n <ns> port-forward svc/echonet ${LOCAL}:80 \
  >/tmp/pf-${LOCAL}.log 2>&1 &
sleep 2       # give the backgrounded port-forward a moment to bind before probing
curl -fsS "http://127.0.0.1:${LOCAL}/echonet/report/ui" >/dev/null && echo OK
```

If the port is occupied by a stale process, find it (`ss -lntp | grep ${LOCAL}` or `lsof -iTCP:${LOCAL}`) and kill it before retrying. Don't reuse a port without confirming what's on it.

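When the check is scripted rather than typed, a small TCP probe can answer both questions above: whether something is already listening on the chosen port, and whether a freshly started port-forward is ready. The following is a minimal sketch using only the Python standard library; `wait_for_port` is a hypothetical helper name, not part of Echonet.

```python
import socket
import time


def wait_for_port(port: int, host: str = "127.0.0.1",
                  timeout: float = 10.0, interval: float = 0.2) -> bool:
    """Return True once something accepts TCP connections on host:port.

    Returns False if nothing answers before `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(interval)
            # connect_ex returns 0 on success and an errno value otherwise.
            if sock.connect_ex((host, port)) == 0:
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling it with a short timeout before starting the port-forward shows whether the port is already taken; calling it afterwards replaces a fixed sleep. It only proves that something is listening, not that the listener is Echonet, so the `curl` probe against `/echonet/report/ui` is still the real confirmation.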
## 3. Echonet's HTTP surface

Useful endpoints (all under `http://127.0.0.1:${LOCAL}`):

**Reports / triage**
- `/echonet/report/ui` — interactive UI; lists past resyncs, lets you drill into pre-resync snapshots.
- `/echonet/report/ui/download` — same data as a static archive.
- `/echonet/report` — JSON form of the current live report (post-last-resync state).
- `/echonet/report/text` — plain-text version.
- `/echonet/block_dump?block_number=<N>&kind=<blob|block|state_update>` — the raw payload Echonet stored for block N. Use `blob` for the cende blob, `block` for the feeder-gateway-shaped block, `state_update` for the matching state update.
- `/echonet/get_block_metadata?block_number=<N>` — Echonet's view of block metadata.
- `/echonet/get_tx_block_metadata?tx_hash=<H>` — which mainnet block a given tx came from.
- `/echonet/get_starknet_version` — the version Echonet thinks it's on.

Echonet sometimes evicts old blocks from in-memory storage. If a `/feeder_gateway/get_block` query returns no result, the block may already be archived on the PVC (see §5) or, post-resync, replaced with the now-successful re-run. In that case, the snapshot taken **before** the resync (§5) is the source of truth.

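For scripted triage, the parameterized endpoints listed in this section can be built with a couple of helpers so that a mistyped `kind` fails early instead of producing an empty download. This is a sketch only: the function names are hypothetical, and the paths and parameter names are copied from the list above rather than from the Echonet source.

```python
from urllib.parse import urlencode

# The three payload kinds the block_dump endpoint is documented to accept.
BLOCK_DUMP_KINDS = ("blob", "block", "state_update")


def block_dump_url(local_port: int, block_number: int, kind: str) -> str:
    """Build the /echonet/block_dump URL for one block and payload kind."""
    if kind not in BLOCK_DUMP_KINDS:
        raise ValueError(f"kind must be one of {BLOCK_DUMP_KINDS}, got {kind!r}")
    query = urlencode({"block_number": block_number, "kind": kind})
    return f"http://127.0.0.1:{local_port}/echonet/block_dump?{query}"


def tx_block_metadata_url(local_port: int, tx_hash: str) -> str:
    """Build the /echonet/get_tx_block_metadata URL for one transaction hash."""
    query = urlencode({"tx_hash": tx_hash})
    return f"http://127.0.0.1:{local_port}/echonet/get_tx_block_metadata?{query}"
```

The resulting strings can be passed straight to `curl -fsS` or to any HTTP client.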
## 4. Logs — prefer GCP, not kubectl

`kubectl logs` only retains a short window (often hours) and is wiped when a pod restarts. Since a resync intentionally restarts the sequencer pod, the logs you actually need are usually already gone from kubectl. **Default to GCP Logs Explorer.**

### GCP Logs Explorer

The cluster's GCP project is whatever project hosts the current kubectl context. Look it up rather than hardcoding:

```bash
# Project of the current cluster:
gcloud container clusters list \
  --filter="name=$(kubectl config current-context | awk -F_ '{print $NF}')" \
  --format='value(name,location,resourceLabels)'
# Or, if you know the context format `gke_<project>_<region>_<cluster>`:
kubectl config current-context | awk -F_ '{print "project="$2, "region="$3, "cluster="$4}'
```

Then either open the Logs Explorer UI in the browser, or query from the CLI:

```bash
PROJECT=<gcp-project>
NS=<namespace>                 # e.g. echonet-committer3
START="2026-06-07T08:00:00Z"   # window around the resync
END="2026-06-07T10:00:00Z"

# Echonet (Flask) logs
gcloud logging read \
  "resource.type=\"k8s_container\"
   resource.labels.namespace_name=\"${NS}\"
   resource.labels.container_name=\"echonet\"
   timestamp>=\"${START}\" timestamp<=\"${END}\"" \
  --project="${PROJECT}" --limit=2000 --format='value(timestamp,jsonPayload.message,textPayload)' \
  > /tmp/echonet-${NS}.log

# Apollo sequencer logs — container name varies; check `kubectl -n <ns> get pod <p> -o yaml | grep -A1 'containers:'`
gcloud logging read \
  "resource.type=\"k8s_container\"
   resource.labels.namespace_name=\"${NS}\"
   resource.labels.container_name=~\"apollo|sequencer\"
   timestamp>=\"${START}\" timestamp<=\"${END}\"" \
  --project="${PROJECT}" --limit=5000 --format='value(timestamp,jsonPayload.message,textPayload)' \
  > /tmp/sequencer-${NS}.log
```

Useful filters to layer on:
- Resync triggers: `textPayload=~"Resync triggered|record_resync_cause|gateway_error|Forward failed|mismatch|429"`
- Block builder: `textPayload=~"block_builder|propose_block|consensus|batcher|cende_recorder"`
- Specific tx: `textPayload=~"<tx_hash>"`
- Specific block: `textPayload=~"block.{0,8}<block_number>"`

A useful UI link template (substitute the two bracketed values, `<NS>` and `<PROJECT>`):
```
https://console.cloud.google.com/logs/query;query=resource.type%3D%22k8s_container%22%0Aresource.labels.namespace_name%3D%22<NS>%22%0Aresource.labels.container_name%3D%22echonet%22?project=<PROJECT>
```

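Hand-editing percent-encoded URLs is error-prone, so the link template can also be generated. The sketch below reproduces the same URL shape with `urllib.parse.quote`; `logs_explorer_url` and its `extra_filter` parameter are hypothetical names introduced here, and the URL layout is taken from the template above rather than from Google's documentation.

```python
from urllib.parse import quote


def logs_explorer_url(project: str, namespace: str,
                      container: str = "echonet", extra_filter: str = "") -> str:
    """Build a Logs Explorer link with one filter clause per line."""
    clauses = [
        'resource.type="k8s_container"',
        f'resource.labels.namespace_name="{namespace}"',
        f'resource.labels.container_name="{container}"',
    ]
    if extra_filter:
        # e.g. 'textPayload=~"Resync triggered|429"'
        clauses.append(extra_filter)
    # safe="" forces '=', '"' and newlines to be percent-encoded.
    query = quote("\n".join(clauses), safe="")
    return (
        "https://console.cloud.google.com/logs/query;"
        f"query={query}?project={quote(project, safe='')}"
    )
```

Passing one of the filters from the list above as `extra_filter` appends it as a fourth clause, which saves retyping it in the UI.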
### kubectl logs (fallback only)

Only useful if the resync is very recent and the pod hasn't restarted:

```bash
kubectl -n <ns> logs deploy/echonet --since=2h | \
  grep -E 'Resync triggered|gateway_error|Forward failed|mismatch|429'
kubectl -n <ns> logs <sequencer-pod> --since=2h
```

## 5. Pre-resync report files (PVC)

Before every resync, Echonet writes a snapshot of its live state to disk. These are the most trustworthy record of what the system observed right before it gave up — they survive the resync itself, while in-memory state does not.

```bash
POD=$(kubectl -n <ns> get pod -l app.kubernetes.io/name=echonet -o jsonpath='{.items[0].metadata.name}')
kubectl -n <ns> exec "${POD}" -- ls -la /data/echonet/reports/   # find the snapshot near your timestamp
kubectl -n <ns> cp "${POD}:/data/echonet/reports/<file>" /tmp/   # pull it locally
```

Archived blocks evicted from memory live under the same PVC; if `/feeder_gateway/get_block` returns nothing, check there:

```bash
kubectl -n <ns> exec "${POD}" -- ls /data/echonet/   # discover the archive dir name
```

## 6. Compare Echonet's block to real mainnet

```bash
N=<failure_block_number>
PORT=<your_local_port>

# -f makes curl fail on HTTP errors instead of saving an error page as if it were the block JSON
curl -fsS "https://feeder.alpha-mainnet.starknet.io/feeder_gateway/get_block?blockNumber=${N}" > /tmp/mainnet_block_${N}.json
curl -fsS "https://feeder.alpha-mainnet.starknet.io/feeder_gateway/get_state_update?blockNumber=${N}" > /tmp/mainnet_su_${N}.json
curl -fsS "http://127.0.0.1:${PORT}/feeder_gateway/get_block?blockNumber=${N}" > /tmp/echo_block_${N}.json
curl -fsS "http://127.0.0.1:${PORT}/echonet/block_dump?block_number=${N}&kind=block" > /tmp/echo_dump_block_${N}.json
curl -fsS "http://127.0.0.1:${PORT}/echonet/block_dump?block_number=${N}&kind=state_update" > /tmp/echo_dump_su_${N}.json
```

Diff `transaction_commitment`, `event_commitment`, `receipt_commitment`, `state_diff_commitment`, `block_hash`, then drill into receipts (`revert_error`, `actual_fee`, `events`, `messages_sent`) and the state diff itself. The differing commitment narrows down which subsystem produced the divergence.

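The first pass of that diff is mechanical and can be scripted. The sketch below compares the five fields named above between two downloaded block files. It assumes, without confirmation from the source, that each field is a top-level key of the block JSON and that values are hex strings; the function names are hypothetical.

```python
import json
from pathlib import Path

# Fields named in the comparison step, in the order they are listed there.
COMMITMENT_FIELDS = (
    "transaction_commitment",
    "event_commitment",
    "receipt_commitment",
    "state_diff_commitment",
    "block_hash",
)


def _canonical(value):
    """Compare hex strings by numeric value so `0x01` and `0x1` count as equal."""
    if isinstance(value, str) and value.lower().startswith("0x"):
        try:
            return int(value, 16)
        except ValueError:
            return value
    return value


def diff_commitments(mainnet_block: dict, echonet_block: dict) -> dict:
    """Return {field: (mainnet_value, echonet_value)} for every field that differs."""
    differing = {}
    for field in COMMITMENT_FIELDS:
        mainnet_value = mainnet_block.get(field)
        echonet_value = echonet_block.get(field)
        if _canonical(mainnet_value) != _canonical(echonet_value):
            differing[field] = (mainnet_value, echonet_value)
    return differing


def diff_block_files(mainnet_path: str, echonet_path: str) -> dict:
    """Load two saved block JSON files and diff their commitments."""
    mainnet_block = json.loads(Path(mainnet_path).read_text())
    echonet_block = json.loads(Path(echonet_path).read_text())
    return diff_commitments(mainnet_block, echonet_block)
```

For example, `diff_block_files("/tmp/mainnet_block_<N>.json", "/tmp/echo_block_<N>.json")` returns an empty dict when all five fields agree. A field that is absent on both sides is treated as equal, so an empty result should be read together with a quick look at the raw files.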
## 7. Code map (anchors in this repo)

- `echonet/transaction_sender.py` — pulls feeder blocks, forwards txs to the local sequencer, evaluates resync policy.
- `echonet/resync.py` — `ResyncPolicy.evaluate` (trigger logic), `ResyncExecutor.execute` (scale-down → wipe → scale-up).
- `echonet/shared_context.py` — all shared mutable state: resync causes, mismatch tracking, block storage.
- `echonet/echo_center.py` — Flask handlers: cende write_blob, `/l1` RPC mock, `/feeder_gateway/*`, report UI.
- `echonet/l1_logic/l1_manager.py`, `l1_blocks.py`, `l1_client.py` — L1_HANDLER lookup against Alchemy.
- Rust sequencer side: `crates/apollo_mempool/`, `crates/apollo_batcher/`, `crates/apollo_consensus_*/`, `crates/apollo_gateway/`.

Look at recent commits on the active echonet branch for context:

```bash
git log --oneline -30 --all -- echonet/
```

## 8. Known failure patterns — check these first

If the symptom matches one of these, name the pattern explicitly in the report and confirm the deployed image actually contains the fix:

```bash
kubectl -n <ns> get pods -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u
git log --all --oneline -- echonet/ crates/apollo_mempool/ crates/apollo_batcher/   # cross-reference SHAs
```

Each namespace can run a different image; "fix already merged" doesn't mean "fix already deployed here."

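When several namespaces need to be compared, the same image listing can be done on the JSON form of the pod list, which avoids whitespace splitting. A minimal sketch, assuming the input is the parsed output of `kubectl -n <ns> get pods -o json` (the standard Kubernetes `items[].spec.containers[].image` layout); `unique_images` is a hypothetical helper name.

```python
from typing import List


def unique_images(pod_list: dict) -> List[str]:
    """Return the sorted, de-duplicated container images across all pods."""
    images = set()
    for pod in pod_list.get("items", []):
        for container in pod.get("spec", {}).get("containers", []):
            image = container.get("image")
            if image:
                images.add(image)
    return sorted(images)
```

Running it once per namespace and comparing the resulting lists makes it obvious when two namespaces are on different image tags.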
1. **Cairo-native vs CASM revert traces.** `echonet_only_revert` whose `revert_error` differs from mainnet's only by the topmost VM frame (typically missing or extra `pc=…`). The resync clears the cairo-native cache so the second pass falls back to CASM and matches. Not actually a bug in Echonet — a Blockifier divergence.
2. **Alchemy 429 → L1_HANDLER drop.** `transaction_commitment` mismatch on a block that contains an `L1_HANDLER` tx. Look for HTTP `429` in Echonet logs near the failure window. Root cause: `l1_manager.set_new_tx` silently returns when `find_l1_block_for_tx` fails, so the L1_HANDLER never reaches the local block.

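For the first pattern, a line-level diff of the two `revert_error` strings gives a quick hint. The sketch below is a rough heuristic, not a parser: it assumes revert traces are multi-line strings in which VM frames carry a `pc=` token, which is an assumption about the trace format that the source does not spell out. The helper names are hypothetical.

```python
import difflib
from typing import List


def changed_lines(mainnet_error: str, echonet_error: str) -> List[str]:
    """Lines present on one side only: '- ' is mainnet-only, '+ ' is Echonet-only."""
    diff = difflib.ndiff(mainnet_error.splitlines(), echonet_error.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


def only_pc_frames_differ(mainnet_error: str, echonet_error: str) -> bool:
    """True when the traces differ and every differing line mentions `pc=`."""
    changed = changed_lines(mainnet_error, echonet_error)
    return bool(changed) and all("pc=" in line for line in changed)
```

A `True` result only suggests the cairo-native pattern. It does not check that the differing frame is the topmost one, so the output of `changed_lines` should still be read before the pattern is named in the report.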
## 9. Investigation workflow

1. Open the UI report, find the entry for the failing block. Note the **trigger tx_hash** and **reason** (`gateway_error`, `echonet_only_revert`, `block_hash_mismatch`, `transaction_commitment_mismatch`).
2. Branch on the reason:
   - **`block_hash_mismatch` / `transaction_commitment_mismatch`** — diff Echonet's block vs mainnet's (§6). Isolate the differing field; drill into the txs / events / state diff that produced it.
   - **`gateway_error`** — pull the tx, grep Echonet + sequencer GCP logs (§4) for the tx_hash, identify the gateway response code and message.
   - **`echonet_only_revert`** — fetch `revert_error` from both Echonet (block dump) and mainnet (feeder); diff them. Often the cairo-native pattern (§8.1).
3. Verify that the deployed image actually contains any fix you intend to cite as the explanation (§8).
4. Cross-check against known patterns before assuming a new bug.
5. If the resync already replayed cleanly, that's still worth noting — but a successful retry doesn't make the original divergence "harmless"; it just means it's non-deterministic.

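The branching in step 2 amounts to a lookup from the reported reason to the next section to consult. A minimal sketch of that lookup follows; the table text paraphrases the workflow above, and the fallback message for unrecognized reasons is an addition of this sketch rather than documented behavior.

```python
# Maps each documented resync reason to the first thing to do about it.
NEXT_STEP = {
    "block_hash_mismatch": "Diff Echonet's block against mainnet's (§6) and isolate the differing field.",
    "transaction_commitment_mismatch": "Diff Echonet's block against mainnet's (§6) and isolate the differing field.",
    "gateway_error": "Grep Echonet and sequencer GCP logs (§4) for the tx_hash; find the gateway response.",
    "echonet_only_revert": "Diff revert_error from the block dump and the mainnet feeder; check §8.1.",
}


def next_step(reason: str) -> str:
    """Return the suggested first action for a resync reason."""
    return NEXT_STEP.get(
        reason,
        f"Unrecognized reason {reason!r}: not covered by this workflow, treat it as novel.",
    )
```

Keeping the mapping in one table makes it easy to extend when a new trigger reason appears in the report UI.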
## 10. Safety constraints (hard rules)

- **Never read** secret files: `*secret*`, `*keys.json`, `.env*`, `echonet/k8s/echonet/secret.yaml`.
- **Don't restart pods, scale deployments, redeploy, or trigger resyncs** without explicit approval. The whole point of investigating a resync is preserving the evidence of what caused it.
- Read-only `kubectl logs`, `kubectl exec -- cat|ls|stat`, `kubectl cp` out of the pod, port-forwards, `gcloud logging read`, and curling mainnet's public feeder are all fine without asking.

## 11. Reporting back

Give the user:
- **Trigger tx hash and reason.**
- **Which field/commitment differed**, with both values (if applicable).
- **Whether it matches a known pattern** (cite the section above) or appears novel.
- **Where the bug lives** (`file:line`) and a concrete fix suggestion if you have one, including whether the namespace in question is running an image that already contains a known fix.
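
One way to keep reports consistent across investigations is to collect the items above in a small record and render them in a fixed order. The sketch below is illustrative only: the class name, field names, and output wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResyncFinding:
    trigger_tx_hash: str
    reason: str
    differing_field: Optional[str] = None   # e.g. "transaction_commitment"
    mainnet_value: Optional[str] = None
    echonet_value: Optional[str] = None
    known_pattern: Optional[str] = None     # e.g. "§8.2"; None means novel
    bug_location: Optional[str] = None      # file:line
    fix_deployed: Optional[bool] = None     # None means not checked

    def render(self) -> str:
        """Render the finding as a short bullet list."""
        lines = [f"- Trigger: tx {self.trigger_tx_hash}, reason {self.reason}"]
        if self.differing_field:
            lines.append(
                f"- Differing field: {self.differing_field} "
                f"(mainnet={self.mainnet_value}, echonet={self.echonet_value})"
            )
        lines.append(f"- Pattern: {self.known_pattern or 'appears novel'}")
        if self.bug_location:
            lines.append(f"- Bug location: {self.bug_location}")
        if self.fix_deployed is not None:
            answer = "yes" if self.fix_deployed else "no"
            lines.append(f"- Known fix in deployed image: {answer}")
        return "\n".join(lines)
```

Fields left as `None` are omitted from the output, so a `gateway_error` report does not show an empty commitment line.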
