
fix: TestSyncToTipMocha — Chain.Start() fails with peer dial errors before state sync begins #7147

@rootulp

Description


Problem

TestSyncToTipMocha has a second, distinct failure mode in which mochaChain.Start() at test/docker-e2e/e2e_sync_to_tip_test.go:80 fails before Phase 1 (state sync) even begins. The node logs a storm of peer-dial errors, and the test terminates at ~145s with Error: 130 (exit code 128 + SIGINT).

This is distinct from #7138, which tracks a KPI-timeout failure mode inside Phase 2 (block sync).

Symptoms

Example from 2026-04-13 (run 24323756159):

e2e_sync_to_tip_test.go:76: Starting mocha sync-to-tip node
e2e_sync_to_tip_test.go:80: <error>
3:13AM ERR Error dialing peer err="dial tcp 65.109.83.40:28656: connect: connection refused" module=p2p
3:14AM ERR Error dialing peer err="dial tcp 152.53.33.96:12056: connect: connection refused" module=p2p
3:14AM ERR Error dialing peer err="dial tcp 177.54.156.69:26656: i/o timeout" module=p2p
3:14AM ERR Error dialing peer err="dial tcp 65.109.116.42:11656: i/o timeout" module=p2p
...
Error: 130
--- FAIL: TestCelestiaTestSuite/TestSyncToTipMocha (149.06s)

All listed peer errors are either connection refused or i/o timeout — the node cannot establish a connection to any mocha peer.
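To confirm this pattern mechanically across future runs, the dial-error kinds can be tallied from the node log. A minimal Go sketch (tallyDialErrors is a hypothetical helper, not part of the repo; the log lines in main are taken from the failure above):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// tallyDialErrors counts "Error dialing peer" log lines by failure kind.
func tallyDialErrors(logs string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "Error dialing peer") {
			continue
		}
		switch {
		case strings.Contains(line, "connection refused"):
			counts["connection refused"]++
		case strings.Contains(line, "i/o timeout"):
			counts["i/o timeout"]++
		default:
			counts["other"]++
		}
	}
	return counts
}

func main() {
	sample := `3:13AM ERR Error dialing peer err="dial tcp 65.109.83.40:28656: connect: connection refused" module=p2p
3:14AM ERR Error dialing peer err="dial tcp 177.54.156.69:26656: i/o timeout" module=p2p`
	fmt.Println(tallyDialErrors(sample))
}
```

If every line in a failing run falls into one of the first two buckets, the run matches this mode rather than the #7138 KPI-timeout mode.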

Affected runs (in the last 20 nightly runs)

Short-failure (~145s) runs that look like this mode:

| Date       | Run         | Duration |
|------------|-------------|----------|
| 2026-04-01 | 23829926363 | 144s     |
| 2026-04-02 | 23881709964 | 145s     |
| 2026-04-03 | 23931900878 | 145s     |
| 2026-04-04 | 23969952391 | 145s     |
| 2026-04-09 | 24170163213 | 145s     |
| 2026-04-11 | 24273208413 | 147s     |
| 2026-04-12 | 24297392661 | 147s     |
| 2026-04-13 | 24323756159 | 149s     |

8 failures out of 20 runs. The 04-01..04-09 set may overlap with #7017 (closed 2026-04-10), but failures continued after that fix, suggesting #7017 addressed a related but distinct issue (BlockWaitTimeout) rather than the root peer-dial problem.

Hypotheses

  1. Seed/peer list is stale or intermittent: mocha peer addresses in NewMochaConfig() may contain nodes that are offline or behind firewalls that block GitHub Actions runners' egress IPs.
  2. Mocha network instability at ~03:00 UTC: all nightly runs start around 03:00 UTC, which may coincide with routine activity on mocha (e.g., a chain upgrade window or validator maintenance).
  3. GitHub runner network egress restrictions: outbound TCP to arbitrary ports (11656, 12056, 26656, 26676, 28656, 43656) may be rate-limited or shaped.
  4. Connection-count limit: the node may give up after exhausting the configured peer list without establishing enough connections.

Next steps

  1. Check what timeout is being hit at ~145s — could be inside Chain.Start() or container-level.
  2. Review networks.NewMochaConfig() peer/seed freshness; consider using a curated reliable-peers list.
  3. Add logging of peer-connection state during Chain.Start() to diagnose future failures.
  4. Consider a retry-with-backoff for Chain.Start() that re-resolves peers.
