| 1 | +--- |
| 2 | +name: investigate-echonet-resync |
| 3 | +description: Investigate why an Echonet resync triggered. Use when someone mentions an Echonet resync, a failing block in an `echonet-*` namespace, or a `block_hash_mismatch` / `transaction_commitment_mismatch` / `echonet_only_revert` / `gateway_error`. The skill sets up port-forwards, prefers GCP Logs Explorer for history, cross-checks the produced block against mainnet, and recognizes recurring failure patterns before reporting back. |
| 4 | +--- |
| 5 | + |
| 6 | +# Investigate an Echonet Resync |
| 7 | + |
| 8 | +Echonet is a Starknet replay layer. It pulls mainnet blocks from the public feeder gateway, forwards their txs into a local Apollo sequencer pod, and compares the locally-produced block against mainnet. When something diverges — a gateway reject, an echonet-only revert, a block-hash mismatch, a transaction-commitment mismatch — Echonet scales the sequencer to zero, rewinds state, and restarts. **Your job is to figure out what triggered this specific resync.** |
| 9 | + |
| 10 | +You'll be given a namespace and a failure block number. Everything else you discover yourself. |
| 11 | + |
| 12 | +## 1. Discover the environment |
| 13 | + |
| 14 | +Echonet namespaces live in the cluster whose kubeconfig context contains `sequencer-dev` (or whatever the current dev cluster is called). The K8s objects use canonical names: |
| 15 | + |
| 16 | +- **Service:** `echonet` (port `80`) |
| 17 | +- **Deployment:** `echonet` |
| 18 | +- **Apollo sequencer:** a separate deployment in the same namespace; find it with `kubectl -n <ns> get pods | grep -iE 'apollo|sequencer'`. |
| 19 | + |
| 20 | +```bash |
| 21 | +kubectl config current-context # confirm you're on the dev cluster |
| 22 | +kubectl get ns | grep -i echonet # list all echonet-* namespaces |
| 23 | +kubectl -n <ns> get deploy,svc,pods # what's actually running |
| 24 | +``` |
| 25 | + |
| 26 | +There can be several echonet namespaces in flight at once (e.g. `echonet-committer3`, `echonet-14-2`); always confirm which one the user named. |
| 27 | + |
| 28 | +## 2. Set up a port-forward (if one isn't already running) |
| 29 | + |
| 30 | +Pick any free local port — the canonical local ports are just convention. |
| 31 | + |
| 32 | +```bash |
| 33 | +LOCAL=18083 # or any free port |
| 34 | +kubectl -n <ns> port-forward svc/echonet ${LOCAL}:80 \ |
| 35 | + >/tmp/pf-${LOCAL}.log 2>&1 & |
| 36 | +curl -fsS --retry 5 --retry-delay 1 --retry-connrefused "http://127.0.0.1:${LOCAL}/echonet/report/ui" >/dev/null && echo OK # retries while the forward comes up |
| 37 | +``` |
| 38 | + |
| 39 | +If the port is occupied by a stale process, find it (`ss -lntp | grep ${LOCAL}` or `lsof -iTCP:${LOCAL}`) and kill it before retrying. Don't reuse a port without confirming what's on it. |
| 40 | + |
| 41 | +## 3. Echonet's HTTP surface |
| 42 | + |
| 43 | +Useful endpoints (all under `http://127.0.0.1:${LOCAL}`): |
| 44 | + |
| 45 | +**Reports / triage** |
| 46 | +- `/echonet/report/ui` — interactive UI; lists past resyncs, lets you drill into pre-resync snapshots. |
| 47 | +- `/echonet/report/ui/download` — same data as a static archive. |
| 48 | +- `/echonet/report` — JSON form of the current live report (post-last-resync state). |
| 49 | +- `/echonet/report/text` — plain-text version. |
| 50 | +- `/echonet/block_dump?block_number=<N>&kind=<blob|block|state_update>` — the raw payload Echonet stored for block N. Use `blob` for the cende blob, `block` for the feeder-gateway-shaped block, `state_update` for the matching state update. |
| 51 | +- `/echonet/get_block_metadata?block_number=<N>` — Echonet's view of block metadata. |
| 52 | +- `/echonet/get_tx_block_metadata?tx_hash=<H>` — which mainnet block a given tx came from. |
| 53 | +- `/echonet/get_starknet_version` — the version Echonet thinks it's on. |
| 54 | + |
| 55 | +Echonet sometimes evicts old blocks from in-memory storage. If a `/feeder_gateway/get_block` query returns no result, the block may already be archived on the PVC (see §5). After a resync, a result that does come back may be the now-successful re-run rather than the block that failed. In both cases, the snapshot taken **before** the resync (§5) is the source of truth. |
| 56 | + |
| 57 | +## 4. Logs — prefer GCP, not kubectl |
| 58 | + |
| 59 | +`kubectl logs` only retains a short window (often hours) and is wiped when a pod restarts. Since a resync intentionally restarts the sequencer pod, the logs you actually need are usually already gone from kubectl. **Default to GCP Logs Explorer.** |
| 60 | + |
| 61 | +### GCP Logs Explorer |
| 62 | + |
| 63 | +The cluster's GCP project is whatever project hosts the current kubectl context. Look it up rather than hardcoding: |
| 64 | + |
| 65 | +```bash |
| 66 | +# Confirm the cluster is visible in gcloud's active project (prints name, location, labels): |
| 67 | +gcloud container clusters list \ |
| 68 | + --filter="name=$(kubectl config current-context | awk -F_ '{print $NF}')" \ |
| 69 | + --format='value(name,location,resourceLabels)' |
| 70 | +# To read the project itself, split a GKE context name of the form `gke_<project>_<region>_<cluster>`: |
| 71 | +kubectl config current-context | awk -F_ '{print "project="$2, "region="$3, "cluster="$4}' |
| 72 | +``` |
| 73 | + |
| 74 | +Then either open the Logs Explorer UI in the browser, or query from the CLI: |
| 75 | + |
| 76 | +```bash |
| 77 | +PROJECT=<gcp-project> |
| 78 | +NS=<namespace> # e.g. echonet-committer3 |
| 79 | +START="2026-06-07T08:00:00Z" # window around the resync |
| 80 | +END="2026-06-07T10:00:00Z" |
| 81 | + |
| 82 | +# Echonet (Flask) logs |
| 83 | +gcloud logging read \ |
| 84 | + "resource.type=\"k8s_container\" |
| 85 | + resource.labels.namespace_name=\"${NS}\" |
| 86 | + resource.labels.container_name=\"echonet\" |
| 87 | + timestamp>=\"${START}\" timestamp<=\"${END}\"" \ |
| 88 | + --project="${PROJECT}" --limit=2000 --format='value(timestamp,jsonPayload.message,textPayload)' \ |
| 89 | + > /tmp/echonet-${NS}.log |
| 90 | + |
| 91 | +# Apollo sequencer logs — container name varies; check `kubectl -n <ns> get pod <p> -o yaml | grep -A1 'containers:'` |
| 92 | +gcloud logging read \ |
| 93 | + "resource.type=\"k8s_container\" |
| 94 | + resource.labels.namespace_name=\"${NS}\" |
| 95 | + resource.labels.container_name=~\"apollo|sequencer\" |
| 96 | + timestamp>=\"${START}\" timestamp<=\"${END}\"" \ |
| 97 | + --project="${PROJECT}" --limit=5000 --format='value(timestamp,jsonPayload.message,textPayload)' \ |
| 98 | + > /tmp/sequencer-${NS}.log |
| 99 | +``` |
| 100 | + |
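| 101 | +Useful filters to layer on (if the container emits structured JSON logs, match on `jsonPayload.message` instead of `textPayload`): |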
| 102 | +- Resync triggers: `textPayload=~"Resync triggered|record_resync_cause|gateway_error|Forward failed|mismatch|429"` |
| 103 | +- Block builder: `textPayload=~"block_builder|propose_block|consensus|batcher|cende_recorder"` |
| 104 | +- Specific tx: `textPayload=~"<tx_hash>"` |
| 105 | +- Specific block: `textPayload=~"block.{0,8}<block_number>"` |
| 106 | + |
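The base query and the layered filters above can be assembled by a small helper instead of hand-escaped quotes. This is a minimal sketch: `build_log_filter` is a hypothetical name, not something in the Echonet repo, and it assumes the same `k8s_container` resource labels used in the `gcloud logging read` commands above.

```shell
# build_log_filter NS CONTAINER START END [EXTRA]
# Prints a Cloud Logging filter, one clause per line (newlines act as AND).
# EXTRA is an optional extra clause, e.g. 'textPayload=~"Resync triggered|429"'.
# Hypothetical helper for illustration; not part of Echonet.
build_log_filter() {
  printf 'resource.type="k8s_container"\n'
  printf 'resource.labels.namespace_name="%s"\n' "$1"
  printf 'resource.labels.container_name="%s"\n' "$2"
  printf 'timestamp>="%s"\n' "$3"
  printf 'timestamp<="%s"\n' "$4"
  # Append the optional extra clause only when one was supplied.
  if [ -n "${5:-}" ]; then
    printf '%s\n' "$5"
  fi
}

# Example (not executed here):
#   gcloud logging read \
#     "$(build_log_filter "$NS" echonet "$START" "$END" 'textPayload=~"Resync triggered|429"')" \
#     --project="$PROJECT" --limit=2000
```

Because the filter is built from positional arguments, the same function serves both the Echonet and the sequencer container; only the second argument and the extra clause change.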
| 107 | +A useful UI link template (substitute the two bracketed values, `<NS>` and `<PROJECT>`): |
| 108 | +``` |
| 109 | +https://console.cloud.google.com/logs/query;query=resource.type%3D%22k8s_container%22%0Aresource.labels.namespace_name%3D%22<NS>%22%0Aresource.labels.container_name%3D%22echonet%22?project=<PROJECT> |
| 110 | +``` |
| 111 | + |
| 112 | +### kubectl logs (fallback only) |
| 113 | + |
| 114 | +Only useful if the resync is very recent and the pod hasn't restarted: |
| 115 | + |
| 116 | +```bash |
| 117 | +kubectl -n <ns> logs deploy/echonet --since=2h | \ |
| 118 | + grep -E 'Resync triggered|gateway_error|Forward failed|mismatch|429' |
| 119 | +kubectl -n <ns> logs <sequencer-pod> --since=2h |
| 120 | +``` |
| 121 | + |
| 122 | +## 5. Pre-resync report files (PVC) |
| 123 | + |
| 124 | +Before every resync, Echonet writes a snapshot of its live state to disk. These are the most trustworthy record of what the system observed right before it gave up — they survive the resync itself, while in-memory state does not. |
| 125 | + |
| 126 | +```bash |
| 127 | +POD=$(kubectl -n <ns> get pod -l app.kubernetes.io/name=echonet -o jsonpath='{.items[0].metadata.name}') |
| 128 | +kubectl -n <ns> exec "${POD}" -- ls -la /data/echonet/reports/ # find the snapshot near your timestamp |
| 129 | +kubectl -n <ns> cp "${POD}:/data/echonet/reports/<file>" /tmp/ # pull it locally |
| 130 | +``` |
| 131 | + |
| 132 | +Archived blocks evicted from memory live under the same PVC; if `/feeder_gateway/get_block` returns nothing, check there: |
| 133 | + |
| 134 | +```bash |
| 135 | +kubectl -n <ns> exec "${POD}" -- ls /data/echonet/ # discover the archive dir name |
| 136 | +``` |
| 137 | + |
| 138 | +## 6. Compare Echonet's block to real mainnet |
| 139 | + |
| 140 | +```bash |
| 141 | +N=<failure_block_number> |
| 142 | +PORT=<your_local_port> # the ${LOCAL} port forwarded in §2 |
| 143 | + |
| 144 | +curl -fsS "https://feeder.alpha-mainnet.starknet.io/feeder_gateway/get_block?blockNumber=${N}" > /tmp/mainnet_block_${N}.json |
| 145 | +curl -fsS "https://feeder.alpha-mainnet.starknet.io/feeder_gateway/get_state_update?blockNumber=${N}" > /tmp/mainnet_su_${N}.json |
| 146 | +curl -fsS "http://127.0.0.1:${PORT}/feeder_gateway/get_block?blockNumber=${N}" > /tmp/echo_block_${N}.json |
| 147 | +curl -fsS "http://127.0.0.1:${PORT}/echonet/block_dump?block_number=${N}&kind=block" > /tmp/echo_dump_block_${N}.json |
| 148 | +curl -fsS "http://127.0.0.1:${PORT}/echonet/block_dump?block_number=${N}&kind=state_update" > /tmp/echo_dump_su_${N}.json |
| 149 | +``` |
| 150 | + |
| 151 | +Diff `transaction_commitment`, `event_commitment`, `receipt_commitment`, `state_diff_commitment`, `block_hash`, then drill into receipts (`revert_error`, `actual_fee`, `events`, `messages_sent`) and the state diff itself. The differing commitment narrows down which subsystem produced the divergence. |
| 152 | + |
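The header comparison described above can be sketched as a shell helper that reports which commitments differ. This is a rough sketch, not part of Echonet: `field_of` and `compare_headers` are hypothetical names, and the grep-based extraction assumes each field appears as a plain `"key": "value"` string in the feeder-gateway-shaped JSON. A real JSON tool such as `jq` would be more robust where it is installed.

```shell
# field_of FILE KEY: print the first string value stored under "KEY" in FILE.
# Assumption: the value is a simple quoted string (true for hashes/commitments).
field_of() {
  grep -o "\"$2\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" "$1" \
    | head -n 1 \
    | sed 's/.*:[[:space:]]*"\(.*\)"/\1/'
}

# compare_headers MAINNET_JSON ECHONET_JSON
# Prints one line per header field and returns 1 if any field differs.
compare_headers() {
  ch_rc=0
  for ch_key in block_hash transaction_commitment event_commitment \
                receipt_commitment state_diff_commitment; do
    ch_a=$(field_of "$1" "$ch_key")
    ch_b=$(field_of "$2" "$ch_key")
    if [ "$ch_a" = "$ch_b" ]; then
      printf 'same  %s\n' "$ch_key"
    else
      printf 'DIFF  %s mainnet=%s echonet=%s\n' "$ch_key" "$ch_a" "$ch_b"
      ch_rc=1
    fi
  done
  return "$ch_rc"
}
```

With the files fetched above, a call would look like `compare_headers /tmp/mainnet_block_${N}.json /tmp/echo_block_${N}.json`. The quoted key in the pattern keeps `block_hash` from also matching `parent_block_hash`.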
| 153 | +## 7. Code map (anchors in this repo) |
| 154 | + |
| 155 | +- `echonet/transaction_sender.py` — pulls feeder blocks, forwards txs to the local sequencer, evaluates resync policy. |
| 156 | +- `echonet/resync.py` — `ResyncPolicy.evaluate` (trigger logic), `ResyncExecutor.execute` (scale-down → wipe → scale-up). |
| 157 | +- `echonet/shared_context.py` — all shared mutable state: resync causes, mismatch tracking, block storage. |
| 158 | +- `echonet/echo_center.py` — Flask handlers: cende write_blob, `/l1` RPC mock, `/feeder_gateway/*`, report UI. |
| 159 | +- `echonet/l1_logic/l1_manager.py`, `l1_blocks.py`, `l1_client.py` — L1_HANDLER lookup against Alchemy. |
| 160 | +- Rust sequencer side: `crates/apollo_mempool/`, `crates/apollo_batcher/`, `crates/apollo_consensus_*/`, `crates/apollo_gateway/`. |
| 161 | + |
| 162 | +Look at recent commits on the active echonet branch for context: |
| 163 | + |
| 164 | +```bash |
| 165 | +git log --oneline -30 --all -- echonet/ |
| 166 | +``` |
| 167 | + |
| 168 | +## 8. Known failure patterns — check these first |
| 169 | + |
| 170 | +If the symptom matches one of these, name the pattern explicitly in the report and confirm the deployed image actually contains the fix: |
| 171 | + |
| 172 | +```bash |
| 173 | +kubectl -n <ns> get pods -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u |
| 174 | +git log --all --oneline -- echonet/ crates/apollo_mempool/ crates/apollo_batcher/ # cross-reference SHAs |
| 175 | +``` |
| 176 | + |
| 177 | +Each namespace can run a different image; "fix already merged" doesn't mean "fix already deployed here." |
| 178 | + |
| 179 | +1. **Cairo-native vs CASM revert traces.** `echonet_only_revert` whose `revert_error` differs from mainnet's only by the topmost VM frame (typically missing or extra `pc=…`). The resync clears the cairo-native cache so the second pass falls back to CASM and matches. Not actually a bug in Echonet — a Blockifier divergence. |
| 180 | +2. **Alchemy 429 → L1_HANDLER drop.** `transaction_commitment` mismatch on a block that contains an `L1_HANDLER` tx. Look for HTTP `429` in Echonet logs near the failure window. Root cause: `l1_manager.set_new_tx` silently returns when `find_l1_block_for_tx` fails, so the L1_HANDLER never reaches the local block. |
| 181 | + |
| 182 | +## 9. Investigation workflow |
| 183 | + |
| 184 | +1. Open the UI report, find the entry for the failing block. Note the **trigger tx_hash** and **reason** (`gateway_error`, `echonet_only_revert`, `block_hash_mismatch`, `transaction_commitment_mismatch`). |
| 185 | +2. Branch on the reason: |
| 186 | + - **`block_hash_mismatch` / `transaction_commitment_mismatch`** — diff Echonet's block vs mainnet's (§6). Isolate the differing field; drill into the txs / events / state diff that produced it. |
| 187 | + - **`gateway_error`** — pull the tx, grep Echonet + sequencer GCP logs (§4) for the tx_hash, identify the gateway response code and message. |
| 188 | + - **`echonet_only_revert`** — fetch `revert_error` from both Echonet (block dump) and mainnet (feeder); diff them. Often the cairo-native pattern (§8.1). |
| 189 | +3. Before calling a divergence "explained" by a known fix, verify that the deployed image actually contains that fix (§8). |
| 190 | +4. Cross-check against known patterns before assuming a new bug. |
| 191 | +5. If the resync already replayed cleanly, that's still worth noting — but a successful retry doesn't make the original divergence "harmless"; it just means it's non-deterministic. |
| 192 | + |
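The branching in step 2 can be written down as a small dispatcher, which is handy when scripting triage across several namespaces. This is only a sketch of the workflow as described above: `next_step` is a hypothetical name, and the reason strings are the four listed in step 1.

```shell
# next_step REASON: print which part of this guide to follow for a resync reason.
# Hypothetical helper for illustration; it only encodes the branches in step 2.
next_step() {
  case "$1" in
    block_hash_mismatch|transaction_commitment_mismatch)
      echo "diff Echonet's block against mainnet (section 6)" ;;
    gateway_error)
      echo "grep Echonet and sequencer GCP logs for the tx_hash (section 4)" ;;
    echonet_only_revert)
      echo "diff revert_error between block dump and feeder (section 8.1)" ;;
    *)
      # Unknown reasons are reported on stderr so callers can detect them.
      echo "unknown reason: $1" >&2
      return 1 ;;
  esac
}
```

An unrecognized reason returns a non-zero status rather than guessing, so a novel trigger is surfaced instead of being filed under an existing pattern.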
| 193 | +## 10. Safety constraints (hard rules) |
| 194 | + |
| 195 | +- **Never read** secret files: `*secret*`, `*keys.json`, `.env*`, `echonet/k8s/echonet/secret.yaml`. |
| 196 | +- **Don't restart pods, scale deployments, redeploy, or trigger resyncs** without explicit approval. The whole point of investigating a resync is preserving the evidence of what caused it. |
| 197 | +- Read-only `kubectl logs`, `kubectl exec -- cat|ls|stat`, `kubectl cp` out of the pod, port-forwards, `gcloud logging read`, and curling mainnet's public feeder are all fine without asking. |
| 198 | + |
| 199 | +## 11. Reporting back |
| 200 | + |
| 201 | +Give the user: |
| 202 | +- **Trigger tx hash and reason.** |
| 203 | +- **Which field/commitment differed**, with both values (if applicable). |
| 204 | +- **Whether it matches a known pattern** (cite the matching §8 entry) or appears novel. |
| 205 | +- **Where the bug lives** (`file:line`) and a concrete fix suggestion if you have one, including whether the namespace in question is running an image that already contains a known fix. |