## Problem
`TestSyncToTipMocha` has a second, distinct failure mode in which `mochaChain.Start()` at `test/docker-e2e/e2e_sync_to_tip_test.go:80` fails before Phase 1 (state sync) even begins. The node logs a storm of peer-dial errors and the test terminates at ~145s with `Error: 130` (SIGINT).

This is distinct from #7138, which tracks a KPI-timeout failure mode inside Phase 2 (block sync).
## Symptoms
Example from 2026-04-13 (run 24323756159):
```text
e2e_sync_to_tip_test.go:76: Starting mocha sync-to-tip node
e2e_sync_to_tip_test.go:80: <error>
3:13AM ERR Error dialing peer err="dial tcp 65.109.83.40:28656: connect: connection refused" module=p2p
3:14AM ERR Error dialing peer err="dial tcp 152.53.33.96:12056: connect: connection refused" module=p2p
3:14AM ERR Error dialing peer err="dial tcp 177.54.156.69:26656: i/o timeout" module=p2p
3:14AM ERR Error dialing peer err="dial tcp 65.109.116.42:11656: i/o timeout" module=p2p
...
Error: 130
--- FAIL: TestCelestiaTestSuite/TestSyncToTipMocha (149.06s)
```
All listed peer errors are either `connection refused` or `i/o timeout` — the node cannot establish a connection to any mocha peer.
## Affected runs (in the last 20 nightly runs)
Short-failure (~145s) runs that look like this mode:
8 failures out of 20 runs. The 04-01..04-09 set may overlap with #7017 (closed 2026-04-10), but failures continued after that fix, suggesting #7017 addressed a related but distinct issue (`BlockWaitTimeout`) rather than the root peer-dial problem.
## Hypotheses
- Seed/peer list is stale or intermittent: the mocha peer addresses in `NewMochaConfig()` may include nodes that are offline or behind firewalls that block GitHub Actions runners' egress IPs.
- Mocha network instability at ~03:00 UTC: all nightly runs start around 03:00 UTC, which may coincide with routine activity on mocha (e.g., a chain-upgrade window or validator maintenance).
- GitHub runner network egress restrictions: outbound TCP to arbitrary ports (11656, 12056, 26656, 26676, 28656, 43656) may be rate-limited or shaped.
- Connection-count limit: the node may give up after exhausting the configured peer list without establishing enough connections.
## Next steps
- Check what timeout is being hit at ~145s — it could be inside `Chain.Start()` or at the container level.
- Review `networks.NewMochaConfig()` peer/seed freshness; consider using a curated reliable-peers list.
- Add logging of peer-connection state during `Chain.Start()` to diagnose future failures.
- Consider a retry-with-backoff for `Chain.Start()` that re-resolves peers.
## Related
- #7017 (closed; `WithBlockWaitTimeout`, addressed a related 120s timeout)
- #7138 (`TestAllUpgrades`)